Looking for help understanding and solving lexical precedence issue in a new grammar (`tree-sitter-logstash`) #2710

yehorb · 2023-10-15T18:35:57Z

yehorb
Oct 15, 2023

I have been working on a grammar for an existing language - tree-sitter-logstash.

The Logstash Config language is used to define Logstash pipelines. Its grammar is defined in a grammar.treetop file. Treetop grammars are written in a custom language based on parsing expression grammars. The grammar is not very complex, and, I believe, can be parsed with proper Tree-sitter parser.

At first, I tried defining every grammar rule at once. The resulting grammar compiled but was not able to parse non-trivial files without errors. After spending some time debugging, I understood that the "all at once" approach did not work, so I decided to implement the grammar rule by rule, starting from the most trivial ones.

Relatively quickly I encountered, I believe, a lexical precedence problem that I am unable to resolve:

Grammar involved:

...
plugin: $ => seq(
  $.name,
  '{',
  repeat($.attribute),
  '}'
),

name: $ => /[A-Za-z0-9_-]+/,

attribute: $ => seq(
  $.name,
  '=>',
  $._value
),

_value: $ => choice(
  $.plugin,
  $.bareword,
),

bareword: $ => /[A-Za-z_][A-Za-z0-9_]+/,
...

The $.plugin rule is recursive, as $.plugin can appear in an $.attribute $._value.

Failing test:

=====================
Attributes - bareword
=====================

input { stdin { foo => false } }

---

(config
  (plugin_section
    (plugin_type)
    (plugin
      (name)
      (attribute
        (name)
        (bareword)))))

Actual CST:

(config
  (plugin_section
    (plugin_type)
    (plugin
      (name)
      (ERROR           <- actual
      (attribute       <- expected
        (name)
        (name)))))     <- actual
        (bareword))))) <- expected

The issue arises because of the $.name/$.bareword collision - the parses the false token as a $.name token, does not find either { or =>, errors out, and instead of backtracking to the beginning of false and parsing it as $.bareword it backtracks all the way to the $.attribute and fails the node.

Neither $.name, nor $.bareword can be constrained to a set of keywords. I am not sure that both of the tokens can share the same regex either. Source grammar explicitly sets them differently, and the language spec is a bit vague and has type definitions slightly different from tokens defined in the grammar.

My intuition tells me that the issue arises because $.name/$.bareword uses similar, but not the same regexes, and Tree-sitter gets confused. And I am unable to unconfuse it.

Things I tried to fix the issue:

Setting lexical precedence explicitly (`token(prec(...))`) or implicitly (changing the order of the rules in the `grammar.js`).

This does not solve the issue, as setting the lexical precedence of the $.bareword above the $.name produces the inversed error. In such a case, the "Attributes - plugin" test fails. The attribute foo => bar {} fails to parse, as bar is parsed as $.bareword.

Using an external scanner to create the `$._name_end` token.

The $.name token only makes sense in a context where it is followed by brackets or an arrow. I tried to use the external scanner to produce a zero-width lookahead-based token by checking for characters after the $.name token. This version of a fix can be seen in this grammar.

Sadly, it does not solve the issue or affect the parse tree very much. The impact of the change is negligible.

Using `conflicts` to trigger GLR parsing.

Tree-sitter rejects conflicts such as:

conflicts: $ => [
  [$.bareword, $.name],
  [$.bareword, $.plugin],
]

Produced error message:

Warning: unnecessary conflicts
  `bareword`, `name`
  `bareword`, `plugin`

Using an external scanner to create the `$._name_start`.

I stumbled upon this discussion which describes a very similar issue: due to lexical precedence conflict only one of the tokens gets parsed. The implied solution is to use the external scanner to check for token terminators, but do it before the token. I implemented such a scanner, and it does solve the issue. However, it solves the issue only for the simplest case of the $.name token. In source grammar, the $.name token also includes single or double-quoted strings.

I am not really confident that I will be able to implement such a lookahead-based token for the general case of the $.name token, based only on a single character of lookahead. It is, undoubtedly, possible, but I want to be sure it is the only way, before doing it.

Many other permutations of using parsing precedence, lexical precedence, mixing, matching, and rewriting this small set of rules.

I checked most of the discussions regarding lexical or parsing precedence and tried a lot of ad-hoc solutions, but neither of them worked. I was not able to find a conclusive answer.

Questions I have regarding the issue:

Why is GLR parser not triggering when parsing the `$._value` part of a `$.attribute` token?

Reading through the docs, I got the impression that GLR parsing is there specifically to resolve ambiguities like this one. But for me, it is not triggering in the place I expect it to trigger, and it seems there is no way to force it to do so. Tree-sitter rejects the conflicts I am trying to define.

Why did error recovery backtrack to the beginning of the `$.attribute` token, and not to the `$._value` part?

Mostly a follow-up to the previous question. I got the impression that the parser will try every possible token in the choice() block, instead of failing on one of them and then failing the parent node. I even tried "unrolling" the choice into:

...
attribute: $ => choice(
  seq($.name, '=>', $.plugin),
  seq($.name, '=>', $.bareword)
)
...

However, it did not change the parser's behavior.

Is this example even parsable via the `LR(1)` parser?

I think it is, as to differentiate the $.name from the $.bareword only one token of lookahead is needed: if the next token is either { or => then it is the $.name token. Am I understanding it right?

Can this issue be solved without the external scanner?

Maybe I am missing some crucial parts of the Tree-sitter DSL or other concepts. I really am not excited about writing any complicated C code.

However, if you can think of an example of such or similar scanners, that can get this job done (even with some tweaking required), I would gladly take a look.

ahlinc · 2023-10-16T02:53:18Z

ahlinc
Oct 16, 2023

Try to define your grammar as:

Option 1

module.exports = grammar({
  name: 'logstash',

  extras: $ => [
    /\s/,
    $.comment
  ],

  rules: {
    config: $ => repeat1($.plugin_section),

    comment: $ => token(seq('#', /.*/)),

    plugin_section: $ => seq(
      $.plugin_type,
      '{',
      repeat($.plugin),
      '}'
    ),

    plugin_type: $ => choice('input', 'filter', 'output'),

    plugin: $ => seq(
      alias($._name, $.name),
      '{',
      repeat($.attribute),
      '}'
    ),

    // $.name token only makes sense in context where it is followed by brackets or arrow
    // this rule is expressed in external scanner which produced zero-width lookahead
    // based token.
    _name: $ => seq(/[A-Za-z0-9_-]+/),

    attribute: $ => seq(
      alias($._name, $.name),
      '=>',
      $._value
    ),

    _value: $ => choice(
      $.plugin,
      $.bareword,
    ),

    bareword: $ => choice(
      /[A-Za-z_][A-Za-z0-9_]+/,
      $._name,
    ),
  }
});

Option 2 - with clearer name references

module.exports = grammar({
  name: 'logstash',

  inline: $ => [
    $.name,
  ],

  extras: $ => [
    /\s/,
    $.comment
  ],

  rules: {
    config: $ => repeat1($.plugin_section),

    comment: $ => token(seq('#', /.*/)),

    plugin_section: $ => seq(
      $.plugin_type,
      '{',
      repeat($.plugin),
      '}'
    ),

    plugin_type: $ => choice('input', 'filter', 'output'),

    plugin: $ => seq(
      $.name,
      '{',
      repeat($.attribute),
      '}'
    ),

    // $.name token only makes sense in context where it is followed by brackets or arrow
    // this rule is expressed in external scanner which produced zero-width lookahead
    // based token.
    _name: $ => seq(/[A-Za-z0-9_-]+/),
    name: $ => alias($._name, $.name),

    attribute: $ => seq(
      $.name,
      '=>',
      $._value
    ),

    _value: $ => choice(
      $.plugin,
      $.bareword,
    ),

    bareword: $ => choice(
      /[A-Za-z_][A-Za-z0-9_]+/,
      $._name,
    ),
  }
});

Problem cases parse as the following:

All tests are passed too.

2 replies

yehorb Oct 17, 2023
Author

Thank you for the suggestion, I will try to implement the full parser and see how far can I take it. I took a look at the debug graphs, and they are more or less in line with what I was expecting from my previous implementation. But I can't really wrap my head around how it was achieved. Can you share some insight on what led you to such a solution?

ahlinc Oct 17, 2023

You can look to a similar question in a Discord thread where I've put my explanation: https://discord.com/channels/1063097320771698699/1126899360995737722

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Looking for help understanding and solving lexical precedence issue in a new grammar (`tree-sitter-logstash`) #2710

{{title}}

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

Select a reply

Looking for help understanding and solving lexical precedence issue in a new grammar (tree-sitter-logstash) #2710

yehorb Oct 15, 2023

Things I tried to fix the issue:

Setting lexical precedence explicitly (token(prec(...))) or implicitly (changing the order of the rules in the grammar.js).

Using an external scanner to create the $._name_end token.

Using conflicts to trigger GLR parsing.

Using an external scanner to create the $._name_start.

Many other permutations of using parsing precedence, lexical precedence, mixing, matching, and rewriting this small set of rules.

Questions I have regarding the issue:

Why is GLR parser not triggering when parsing the $._value part of a $.attribute token?

Why did error recovery backtrack to the beginning of the $.attribute token, and not to the $._value part?

Is this example even parsable via the LR(1) parser?

Can this issue be solved without the external scanner?

Replies: 1 comment · 2 replies

ahlinc Oct 16, 2023

yehorb Oct 17, 2023 Author

ahlinc Oct 17, 2023

Looking for help understanding and solving lexical precedence issue in a new grammar (`tree-sitter-logstash`) #2710

yehorb
Oct 15, 2023

Setting lexical precedence explicitly (`token(prec(...))`) or implicitly (changing the order of the rules in the `grammar.js`).

Using an external scanner to create the `$._name_end` token.

Using `conflicts` to trigger GLR parsing.

Using an external scanner to create the `$._name_start`.

Why is GLR parser not triggering when parsing the `$._value` part of a `$.attribute` token?

Why did error recovery backtrack to the beginning of the `$.attribute` token, and not to the `$._value` part?

Is this example even parsable via the `LR(1)` parser?

Replies: 1 comment 2 replies

ahlinc
Oct 16, 2023

yehorb Oct 17, 2023
Author