Replies: 1 comment 2 replies
-
Beta Was this translation helpful? Give feedback.
2 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Beta Was this translation helpful? Give feedback.
-
I have been working on a grammar for an existing language -
tree-sitter-logstash
.The Logstash Config language is used to define Logstash pipelines. Its grammar is defined in a
grammar.treetop
file. Treetop grammars are written in a custom language based on parsing expression grammars. The grammar is not very complex, and, I believe, can be parsed with proper Tree-sitter parser.At first, I tried defining every grammar rule at once. The resulting grammar compiled but was not able to parse non-trivial files without errors. After spending some time debugging, I understood that the "all at once" approach did not work, so I decided to implement the grammar rule by rule, starting from the most trivial ones.
Relatively quickly I encountered, I believe, a lexical precedence problem that I am unable to resolve:
Grammar involved:
The
$.plugin
rule is recursive, as$.plugin
can appear in an$.attribute $._value
.Failing test:
Actual CST:
The issue arises because of the
$.name/$.bareword
collision - the parses thefalse
token as a$.name
token, does not find either{
or=>
, errors out, and instead of backtracking to the beginning offalse
and parsing it as$.bareword
it backtracks all the way to the$.attribute
and fails the node.Neither
$.name
, nor$.bareword
can be constrained to a set of keywords. I am not sure that both of the tokens can share the same regex either. Source grammar explicitly sets them differently, and the language spec is a bit vague and has type definitions slightly different from tokens defined in the grammar.My intuition tells me that the issue arises because
$.name/$.bareword
uses similar, but not the same regexes, and Tree-sitter gets confused. And I am unable to unconfuse it.Things I tried to fix the issue:
Setting lexical precedence explicitly (
token(prec(...))
) or implicitly (changing the order of the rules in thegrammar.js
).This does not solve the issue, as setting the lexical precedence of the
$.bareword
above the$.name
produces the inversed error. In such a case, the "Attributes - plugin" test fails. The attributefoo => bar {}
fails to parse, asbar
is parsed as$.bareword
.Using an external scanner to create the
$._name_end
token.The
$.name
token only makes sense in a context where it is followed by brackets or an arrow. I tried to use the external scanner to produce a zero-width lookahead-based token by checking for characters after the$.name
token. This version of a fix can be seen in this grammar.Sadly, it does not solve the issue or affect the parse tree very much. The impact of the change is negligible.
Using
conflicts
to trigger GLR parsing.Tree-sitter rejects conflicts such as:
Produced error message:
Using an external scanner to create the
$._name_start
.I stumbled upon this discussion which describes a very similar issue: due to lexical precedence conflict only one of the tokens gets parsed. The implied solution is to use the external scanner to check for token terminators, but do it before the token. I implemented such a scanner, and it does solve the issue. However, it solves the issue only for the simplest case of the
$.name
token. In source grammar, the$.name
token also includes single or double-quoted strings.I am not really confident that I will be able to implement such a lookahead-based token for the general case of the
$.name
token, based only on a single character of lookahead. It is, undoubtedly, possible, but I want to be sure it is the only way, before doing it.Many other permutations of using parsing precedence, lexical precedence, mixing, matching, and rewriting this small set of rules.
I checked most of the discussions regarding lexical or parsing precedence and tried a lot of ad-hoc solutions, but neither of them worked. I was not able to find a conclusive answer.
Questions I have regarding the issue:
Why is GLR parser not triggering when parsing the
$._value
part of a$.attribute
token?Reading through the docs, I got the impression that GLR parsing is there specifically to resolve ambiguities like this one. But for me, it is not triggering in the place I expect it to trigger, and it seems there is no way to force it to do so. Tree-sitter rejects the
conflicts
I am trying to define.Why did error recovery backtrack to the beginning of the
$.attribute
token, and not to the$._value
part?Mostly a follow-up to the previous question. I got the impression that the parser will try every possible token in the
choice()
block, instead of failing on one of them and then failing the parent node. I even tried "unrolling" the choice into:However, it did not change the parser's behavior.
Is this example even parsable via the
LR(1)
parser?I think it is, as to differentiate the
$.name
from the$.bareword
only one token of lookahead is needed: if the next token is either{
or=>
then it is the$.name
token. Am I understanding it right?Can this issue be solved without the external scanner?
Maybe I am missing some crucial parts of the Tree-sitter DSL or other concepts. I really am not excited about writing any complicated C code.
However, if you can think of an example of such or similar scanners, that can get this job done (even with some tweaking required), I would gladly take a look.
Beta Was this translation helpful? Give feedback.
All reactions