Hi, what is a good way of handling a language that is newline-insensitive in general, but newline-sensitive in certain rules?
Disclaimer
I have looked at #884, and I know that my question is basically a repeat of #1421. I also know that #931 has been talked about.
I am posting because I am still hoping to get some feedback for my particular language if possible, and to continue the extras conversation in general.
Background/Language Spec
I am working on writing a tree-sitter grammar for a language that:
Usually allows newlines to be interspersed within a statement.
EXCEPT when a "single-line" if statement is encountered: the language then requires the consequence statement to be on the same line as the if statement (no part of the consequence can be interrupted by a newline).
In a single-line if statement, all statements that are on the same line as the if construct are associated with the if construct. So
if 1 let x = 1 let x = 2
gets parsed as (if-stmt (let-stmt) (let-stmt)) as opposed to (if-stmt (let-stmt)) (let-stmt), and
if 1 x = 1
x = 2
gets parsed as (if-stmt (let-stmt)) (let-stmt).
Examples
Here are some examples of strings my language accepts and rejects
ACCEPTS as (let-stmt) (let-stmt):
let x =
1
let x = 1
ACCEPTS as (if-stmt (let-stmt) (let-stmt)) (let-stmt):
if 1 let x = 1 let x = 2
let x = 3
ACCEPTS as (if-stmt (let-stmt)) (this is what I call a "block-type" if statement, as opposed to single-line):
if 1 {
let x =
1
}
REJECTS
if 1 let x
= 1
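To pin down the accept/reject cases above, here is a minimal sketch of a hand-written recursive-descent parser for a toy subset of the language. It is not tree-sitter code. The function name parse, the token regex, and the idea of modelling newline sensitivity as a strict flag passed down the call chain are all assumptions made for illustration, and it only knows let statements with numeric values and if statements with numeric conditions.

```javascript
// Toy reference parser (hypothetical, not tree-sitter): returns an
// S-expression string, or throws SyntaxError on rejected input.
function parse(src) {
  // Newlines are kept as tokens; spaces and tabs are dropped.
  const toks = src.match(/\n|[A-Za-z_]\w*|\d+|[={}]/g) || [];
  let pos = 0;

  // Outside a single-line `if` (strict === false) newlines are skipped freely.
  const skip = (strict) => {
    if (!strict) while (toks[pos] === "\n") pos++;
  };
  const expect = (want, strict) => {
    skip(strict);
    const tok = toks[pos];
    const ok = want instanceof RegExp
      ? tok !== undefined && want.test(tok)
      : tok === want;
    if (!ok) throw new SyntaxError(`expected ${want}, got ${JSON.stringify(tok)}`);
    pos++;
  };

  const letStmt = (strict) => {
    expect("let", strict);
    expect(/^[A-Za-z_]/, strict);
    expect("=", strict);
    expect(/^\d+$/, strict);
    return "(let-stmt)";
  };

  const ifStmt = (strict) => {
    expect("if", strict);
    expect(/^\d+$/, strict);
    const body = [];
    if (toks[pos] === "{") {
      // Block-type if: newline-insensitive until the closing brace.
      pos++;
      for (skip(false); toks[pos] !== "}"; skip(false)) body.push(statement(false));
      pos++;
    } else {
      // Single-line if: every statement up to the next newline belongs to it,
      // and no statement may contain a newline.
      do body.push(statement(true));
      while (pos < toks.length && toks[pos] !== "\n");
    }
    return `(if-stmt ${body.join(" ")})`;
  };

  const statement = (strict) => {
    skip(strict);
    return toks[pos] === "if" ? ifStmt(strict) : letStmt(strict);
  };

  const out = [];
  for (skip(false); pos < toks.length; skip(false)) out.push(statement(false));
  return out.join(" ");
}
```

Calling parse on the three accepted examples returns the S-expressions listed above, and the rejected example throws a SyntaxError because the = token is preceded by a newline while strict is true.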
What is a "good way" to define such an if statement?
Solutions
I have two solutions so far that I have come up with.
1: Using an external scanner
The idea here is that we use a zero-width external token to start the single-line if statement, which notifies our scanner struct that we are in "single-line mode", and then terminate the rule with a zero-width single-line-end token to toggle our scanner back to its original state.
Then, in my scanner.c, my void * payload would be a struct Scanner { bool is_newline_sensitive; };
Advantages
One of the main advantages, if this approach could even work at all, (which I have yet to get it to, and will talk about more below), is that it allows me to keep \n in the extras rule. The other advantage, which I argue is more attractive, is that I could potentially reuse this construct in other parts of my language which are whitespace insensitive.
Disadvantages
First off, I have not been able to get this to work completely. I can attach my full code if people become interested in this post, but I figured I would ask for feedback first. Another disadvantage, which is more about my skill level, is that I do not know how to handle error recovery gracefully in my external scanner, and I find that a stateful scanner gets hard to maintain. More on that second point: my understanding is that tree-sitter serializes the state of the scanner inside each token that is recognized, which makes sense, but it is hard for me to keep track of when or if I need to reset my state in case of an error. For example, say tree-sitter has the following sequence of tokens: if ... single_line_sensitive_mode_start ... single_line_sensitive_mode_end. I understand that my scanner has is_newline_sensitive = true over the token range [single_line_sensitive_mode_start, single_line_sensitive_mode_end]. But what would happen if I deleted the single_line_sensitive_mode_start token from the token sequence? The scanner would then have is_newline_sensitive = true from the first token in the consequence statement to $.single_line_sensitive_mode_end. It is unclear to me what the state of my scanner should be in this case. I also know that this is a general problem. As such, I try to avoid state in my external scanner in general.
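To make the scenario concrete, here is a toy model of it in JavaScript. It is a sketch of my understanding, not tree-sitter's implementation: ToyScanner, lex, the token names single_line_start and single_line_end, and the one-byte snapshot are all hypothetical. The only things it borrows from tree-sitter are the ideas that the scanner state is serialized after each recognized token and deserialized before scanning resumes.

```javascript
// Toy model only: one boolean of scanner state, snapshotted after every token.
class ToyScanner {
  constructor() {
    this.newlineSensitive = false;
  }
  // Copy the whole state into a byte buffer (here a single byte).
  serialize() {
    return Uint8Array.of(this.newlineSensitive ? 1 : 0);
  }
  // Restore the state from a buffer written by serialize().
  deserialize(bytes) {
    this.newlineSensitive = bytes.length > 0 && bytes[0] === 1;
  }
  // Update the flag for one token (the real scanner would emit these tokens).
  scan(token) {
    if (token === "single_line_start") this.newlineSensitive = true;
    if (token === "single_line_end") this.newlineSensitive = false;
  }
}

// Lex a token list, optionally resuming from an earlier snapshot, and record
// the snapshot taken after each token.
function lex(tokens, resumeFrom = null) {
  const scanner = new ToyScanner();
  if (resumeFrom) scanner.deserialize(resumeFrom);
  return tokens.map((token) => {
    scanner.scan(token);
    return { token, state: scanner.serialize() };
  });
}

// Original sequence: the flag is set from the start token onwards.
const original = lex(["if", "single_line_start", "let", "single_line_end"]);

// "Delete" the start token and re-lex the tail from the snapshot stored on `if`.
const relexed = lex(["let", "single_line_end"], original[0].state);
console.log(original[2].state[0], relexed[0].state[0]);
```

In this toy model the flag for let is 1 in the original sequence and 0 after the start token is removed, because the state is rebuilt from the snapshot that precedes the edit rather than carried over from the old tokens. Whether real error recovery behaves the same way is exactly the part I am unsure about.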
2: Remove \n from extras and handle it manually everywhere in my grammar.js
This approach is easier for me to reason about but becomes harder to maintain as more statements get added.
The idea is that in my grammar.js, we would instead have:
What I would do in this case is have a function that takes in the no_newline_stmt rule, traverses it recursively, and inserts MAYBE_NEWLINE tokens between all terminals.
In fact, I have done this, and it wasn't too bad, but it still adds a lot of new rules to the parse table.
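As a rough illustration of that traversal, here is a minimal sketch that works on the plain rule objects the tree-sitter DSL produces (objects with a type field plus members or content, the same shape the utility code in this post inspects). The function name allowNewlines and the rule name _maybe_newline are hypothetical, and this is a simplification: it only interleaves inside SEQ rules and does not add newlines between repetitions of a REPEAT.

```javascript
// Hypothetical symbol standing in for the MAYBE_NEWLINE rule; the DSL
// represents `$._maybe_newline` as { type: "SYMBOL", name: "_maybe_newline" }.
const MAYBE_NEWLINE = { type: "SYMBOL", name: "_maybe_newline" };

// Rule types that wrap exactly one inner rule in `content`.
// TOKEN and IMMEDIATE_TOKEN are deliberately excluded: they are lexical.
const WRAPPERS = new Set([
  "REPEAT", "REPEAT1", "PREC", "PREC_LEFT", "PREC_RIGHT",
  "PREC_DYNAMIC", "FIELD", "ALIAS",
]);

// Return a copy of `rule` in which every SEQ has MAYBE_NEWLINE placed
// between adjacent members. The input rule is not mutated.
function allowNewlines(rule) {
  if (rule.type === "SEQ") {
    const members = [];
    rule.members.forEach((member, i) => {
      if (i > 0) members.push(MAYBE_NEWLINE);
      members.push(allowNewlines(member));
    });
    return { ...rule, members };
  }
  if (rule.type === "CHOICE") {
    return { ...rule, members: rule.members.map(allowNewlines) };
  }
  if (WRAPPERS.has(rule.type)) {
    return { ...rule, content: allowNewlines(rule.content) };
  }
  // STRING, PATTERN, SYMBOL, BLANK, TOKEN, ...: left untouched.
  return rule;
}

// Example: seq('let', $.identifier, '=', $.expression) as a plain object.
const letStmt = {
  type: "SEQ",
  members: [
    { type: "STRING", value: "let" },
    { type: "SYMBOL", name: "identifier" },
    { type: "STRING", value: "=" },
    { type: "SYMBOL", name: "expression" },
  ],
};
console.log(allowNewlines(letStmt).members.map((m) => m.value || m.name).join(" "));
```

Unlike the utility further down, this sketch does not follow SYMBOL references into other rules, so it adds no new named rules; doing that is what grows the parse table.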
/// <reference types="tree-sitter-cli/dsl" />
// @ts-check

/**
 * Utility functions to aid in the generation of
 * grammar.js
 */

/**
 * This function takes in a rule,
 * and recursively traverses it,
 * wrapping each token rule with token.immediate
 * The recursion continues for non terminals
 * that are not the value of the current terminal
 * Recursion is done with DFS and a visited dictionary
 * is used to avoid cycles
 * FIXME: excludeList is not actually implemented
 *
 * @param {GrammarSchema<string>} baseGrammar
 * @param {SymbolRule<string>[]} excludeList
 * @returns {Record<string, RuleBuilder<string>>}
 */
module.exports.unspace = function (baseGrammar, excludeList) {
  // first build up a map of unvisited rules,
  // mark everything in the excludeList as visited
  /** @type { { [x: string]: boolean; } } */
  let visited = {}
  for (const eRule of excludeList) {
    visited[eRule.name] = true;
  }
  /** @type {Record<string,RuleBuilder<string>>} */
  let ruleMap = {}
  // Now, given the rule we would like to unspace,
  // access its definition from the grammar
  let _ = _unspace(baseGrammar, visited, sym("expression"), ruleMap);
  // console.dir(ruleMap, { depth: null, colors: true });
  return ruleMap
}

/**
 * @param {RuleOrLiteral} rule
 */
module.exports.repeat_with_commas = function (rule) {
  return seq(rule, repeat(seq(',', rule)))
}

/**
 * @template {string} T
 * @param {GrammarSchema<T>} baseGrammar
 * @param {{ [x: string]: boolean; }} visited
 * @param {Rule} rule
 * @param {Record<string,RuleBuilder<string>>} ruleMap
 * @returns {Rule}
 */
function _unspace(baseGrammar, visited, rule, ruleMap) {
  if (rule.type == 'SYMBOL') {
    if (visited[rule.name]) {
      return alias(sym(rule.name + "_spaceless"), rule);
    }
    let unwrappedRule = baseGrammar.rules[rule.name];
    visited[rule.name] = true;
    const nodes = _unspace(baseGrammar, visited, unwrappedRule, ruleMap);
    ruleMap[rule.name + "_spaceless"] = _$ => nodes;
    return alias(sym(rule.name + "_spaceless"), rule);
  }
  if (rule.type == 'CHOICE') {
    /** @type {ChoiceRule} rule */
    /** @type {Rule[]} */
    let members = [];
    for (const r of rule.members) {
      members.push(_unspace(baseGrammar, visited, r, ruleMap));
    }
    return choice(...members);
  }
  if (rule.type == 'SEQ') {
    /** @type {SeqRule} rule */
    /** @type {Rule[]} */
    let members = [];
    for (const r of rule.members) {
      members.push(_unspace(baseGrammar, visited, r, ruleMap));
    }
    return seq(...members);
  }
  if (rule.type == 'REPEAT') {
    let unwrappedRule = rule.content;
    return repeat(_unspace(baseGrammar, visited, unwrappedRule, ruleMap));
  }
  if (rule.type == 'REPEAT1') {
    let unwrappedRule = rule.content;
    return repeat1(_unspace(baseGrammar, visited, unwrappedRule, ruleMap));
  }
  if (rule.type == 'ALIAS') {
    /** @type {AliasRule} rule */
    return alias(_unspace(baseGrammar, visited, rule.content, ruleMap), rule.value);
  }
  if (rule.type == 'PATTERN') {
    /** @type {PatternRule} rule */
    return token.immediate({ ...rule });
  }
  if (rule.type == 'PREC_LEFT') {
    /** @type {PrecLeftRule} rule */
    return prec.left(rule.value, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'PREC_RIGHT') {
    /** @type {PrecRightRule} rule */
    return prec.right(rule.value, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'STRING') {
    /** @type {StringRule} rule */
    return token.immediate({ ...rule });
  }
  if (rule.type == 'TOKEN') {
    /** @type {TokenRule} rule */
    return token.immediate({ ...rule })
  }
  if (rule.type == 'PREC') {
    /** @type {PrecRule} rule */
    return prec(rule.value, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'FIELD') {
    /** @type {PrecRule} rule */
    return field(rule.name, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'BLANK') {
    /** @type {BlankRule} rule */
    return blank();
  }
  if (rule.type == 'PREC_DYNAMIC') {
    /** @type {PrecDynamicRule} rule */
    return prec.dynamic(rule.value, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'IMMEDIATE_TOKEN') {
    /** @type {ImmediateTokenRule} rule */
    return { ...rule }
  }
  /// This should NEVER happen
  return rule;
}
Update on the above: It seems like masaeedu, #931 (comment), also had this generator idea!
Thanks and thanks for the wonderful software!
Michael