General strategy for handling extras #3196

michaelfortunato · 2024-03-19T17:08:05Z

Hi,
What is a good way to handle a language that is newline insensitive in general, but newline sensitive in certain rules?

Disclaimer

I have looked at #884, and I know that my question is basically a repeat of #1421. I also know that #931 has been discussed.
I am posting because I was still hoping to get feedback on my particular language, and to continue the extras conversation in general.

Background/Language Spec

I am working on a tree-sitter grammar for a language that:

- Usually allows newlines to be interspersed within a statement.
- EXCEPT that, when a "single-line" if statement is encountered, the consequence statement must be on the same line as the if statement (no part of the consequence can be interrupted by a newline).
- In a single-line if statement, all statements on the same line as the if construct belong to that if construct. So `if 1 let x = 1 let x = 2` is parsed as (if-stmt (let-stmt) (let-stmt)), as opposed to (if-stmt (let-stmt)) (let-stmt), and

if 1 let x = 1
let x = 2

is parsed as (if-stmt (let-stmt)) (let-stmt).

Examples

Here are some examples of strings my language accepts and rejects

ACCEPTS as (let-stmt) (let-stmt):

let x =
1
let x = 1

ACCEPTS as (if-stmt (let-stmt) (let-stmt)) (let-stmt):

if 1 let x = 1 let x = 2 
let x = 3

ACCEPTS as (if-stmt (let-stmt)) (this is what I call a "block-type" if statement, as opposed to a single-line one):

if 1 {
let x = 
1
}

REJECTS

if 1 let x 
= 1

What is a "good way" to define such an if statement?

Solutions

I have come up with two solutions so far.

1: Using an external scanner

The idea here is to use a zero-width external token to start the single-line if statement, which tells the scanner struct that we are in "single-line mode", and then to terminate the rule with a zero-width single-line end token that toggles the scanner back to its original state.

grammar({
  name: "my_cumbersome_language",
  // Zero-width tokens produced by the external scanner (scanner.c)
  externals: $ => [$.single_line_sensitive_mode_start, $.single_line_sensitive_mode_end],
  rules: {
    program: ....

    if_stmt: $ => seq("if", $.expression, choice($.single_line_if, $.block_if)),
    // Newlines are significant only between the start and end tokens
    single_line_if: $ => seq($.single_line_sensitive_mode_start, repeat1($.stmt), $.single_line_sensitive_mode_end),
    block_if: $ => seq("{", $.program, "}"),
    ....
  }
})

Then, in my scanner.c, my void * payload would be a struct Scanner { bool is_newline_sensitive; };
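
To make the intended behaviour of the mode flag concrete, here is a toy recognizer in plain JavaScript. It is only a model of the is_newline_sensitive toggle, not a tree-sitter external scanner (those are written in C), and the names tokenize, parse, and nlBefore are made up for this sketch. It also assumes that a braced block always switches back to newline-insensitive parsing, which the description above does not state.

```javascript
// Toy model only; a real tree-sitter external scanner is written in C.
// Each token records whether a newline preceded it.
function tokenize(src) {
  const re = /(\s*)([A-Za-z_]\w*|\d+|[={}])/y;
  const tokens = [];
  let pos = 0;
  for (let m; (m = re.exec(src)) !== null; pos = re.lastIndex) {
    tokens.push({ text: m[2], nlBefore: m[1].includes("\n") });
  }
  if (src.slice(pos).trim() !== "") {
    throw new SyntaxError(`unexpected input at offset ${pos}`);
  }
  return tokens;
}

function parse(src) {
  const toks = tokenize(src);
  let i = 0;
  let sensitive = false; // plays the role of is_newline_sensitive

  // In sensitive mode, a token preceded by a newline is not visible,
  // which is what ends (or breaks) a single-line construct.
  const peek = () => {
    const t = toks[i];
    return !t || (sensitive && t.nlBefore) ? null : t.text;
  };
  const expect = (ok, what) => {
    const t = peek();
    if (t === null || !ok(t)) throw new SyntaxError(`expected ${what}`);
    i += 1;
    return t;
  };
  const lit = (s) => expect((t) => t === s, `'${s}'`);
  const isNumber = (t) => /^\d+$/.test(t);
  const isName = (t) => /^[A-Za-z_]/.test(t) && t !== "let" && t !== "if";

  // Parse statements for as long as `more()` holds.
  const stmts = (more) => {
    let out = "";
    while (more()) out += " " + stmt();
    return out;
  };

  function stmt() {
    if (peek() === "let") {
      lit("let");
      expect(isName, "identifier");
      lit("=");
      expect(isNumber, "number");
      return "(let-stmt)";
    }
    lit("if");
    expect(isNumber, "number");
    const saved = sensitive;
    let body;
    if (peek() === "{") {
      sensitive = false; // assumption: braced blocks are never newline sensitive
      lit("{");
      body = stmts(() => peek() !== "}");
      lit("}");
    } else {
      sensitive = true; // corresponds to single_line_sensitive_mode_start
      body = stmts(() => peek() !== null);
      if (body === "") throw new SyntaxError("expected a statement on the same line");
    }
    sensitive = saved; // corresponds to single_line_sensitive_mode_end
    return `(if-stmt${body})`;
  }

  return stmts(() => peek() !== null).trim();
}

console.log(parse("if 1 let x = 1 let x = 2 \nlet x = 3"));
// (if-stmt (let-stmt) (let-stmt)) (let-stmt)
```

With this model, parse("if 1 let x \n= 1") throws a SyntaxError, matching the REJECTS example. Saving and restoring the flag around each if statement is one way to keep nested constructs from leaking state, but it says nothing about how tree-sitter would restore scanner state during error recovery.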

Advantages

One of the main advantages, if this approach can be made to work at all (I have not managed that yet, and say more below), is that it allows me to keep \n in the extras rule. The other advantage, which I find more attractive, is that I could potentially reuse this construct in other parts of my language that are newline sensitive.

Disadvantages

First off, I have not been able to get this to work completely. I can attach my full code if people become interested in this post, but I figured I would ask for feedback first. Another disadvantage, which is more about my skill level, is that I do not know how to handle error recovery gracefully in my external scanner, and I find that a stateful scanner gets hard to maintain. On that second point: my understanding is that tree-sitter serializes the state of the scanner with each token that is recognized, which makes sense, but it is hard for me to keep track of when, or whether, I need to reset my state after an error.

For example, say tree-sitter has the following sequence of tokens: if ... single_line_sensitive_mode_start ... single_line_sensitive_mode_end. I understand that my scanner has is_newline_sensitive = true over the token range [single_line_sensitive_mode_start, single_line_sensitive_mode_end]. But what would happen if I deleted the single_line_sensitive_mode_start token from the sequence? The scanner would then have is_newline_sensitive = true from the first token of the consequence statement up to single_line_sensitive_mode_end. It is unclear to me what the state of my scanner should be in this case. I also know that this is a general problem, so I try to avoid state in my external scanners altogether.

2: Remove `\n` from `extras` and handle it manually everywhere in my grammar.js

This approach is easier for me to reason about but becomes harder to maintain as more statements get added.
The idea is that in my grammar.js, we would instead have:

const MAYBE_NEWLINE = optional('\n')
grammar({
  name: "my_cumbersome_language",
  extras: $ => [' '], // no \n
  rules: {
    program: ....

    if_stmt: $ => seq("if", MAYBE_NEWLINE, $.expression, choice($.single_line_if, $.block_if)),
    // No MAYBE_NEWLINE here: the consequence must stay on one line
    single_line_if: $ => repeat1($.no_newline_stmt),
    block_if: $ => seq(MAYBE_NEWLINE, "{", MAYBE_NEWLINE, $.program, MAYBE_NEWLINE, "}", MAYBE_NEWLINE),
    ....
  }
})

What I would do in this case is write a function that takes the no_newline_stmt rule, traverses it recursively, and inserts MAYBE_NEWLINE tokens between all terminals.

In fact I have done this, and it wasn't too bad, but it still adds a lot of new rules to the parse table:

/// <reference types="tree-sitter-cli/dsl" />
// @ts-check
/**
 *
 * Utility functions to aid in the generation of
 * grammar.js
*/

/**
 * Starting from the `expression` rule of `baseGrammar`, recursively
 * traverses the rule tree and wraps each terminal (string, pattern,
 * token) in token.immediate.
 * Non-terminals are followed depth-first, and a visited map is used
 * to avoid cycles, so each rule is rewritten at most once.
 * The rewritten rules are returned under the name `<rule>_spaceless`.
 * FIXME: excludeList is not actually implemented
 *
 * @param {GrammarSchema<string>} baseGrammar
 * @param {SymbolRule<string>[]} excludeList
 * @returns {Record<string, RuleBuilder<string>>}
 */
module.exports.unspace = function(baseGrammar, excludeList) {
  // First build up a map of visited rules,
  // marking everything in the excludeList as visited
  /** @type { { [x: string]: boolean; } } */
  let visited = {}
  for (const eRule of excludeList) {
    visited[eRule.name] = true;
  }
  /** @type {Record<string,RuleBuilder<string>>} */
  let ruleMap = {}
  // Now, given the rule we would like to unspace,
  // access its definition from the grammar
  let _ = _unspace(baseGrammar, visited, sym("expression"), ruleMap);
  // console.dir(ruleMap, { depth: null, colors: true });
  return ruleMap
}


/**
 * @param {RuleOrLiteral} rule
 */
module.exports.repeat_with_commas = function(rule) {
  return seq(rule, repeat(seq(',', rule)))
}

/**
 * @template {string} T
 * @param {GrammarSchema<T>} baseGrammar
 * @param {{ [x: string]: boolean; }} visited
 * @param {Rule} rule
 * @param {Record<string,RuleBuilder<string>>} ruleMap
 * @returns {Rule}
 */
function _unspace(baseGrammar, visited, rule, ruleMap) {
  if (rule.type == 'SYMBOL') {
    if (visited[rule.name]) {
      return alias(sym(rule.name + "_spaceless"), rule);
    }
    let unwrappedRule = baseGrammar.rules[rule.name];
    visited[rule.name] = true;
    const nodes = _unspace(baseGrammar, visited, unwrappedRule, ruleMap);
    ruleMap[rule.name + "_spaceless"] = _$ => nodes;
    return alias(sym(rule.name + "_spaceless"), rule);
  }
  if (rule.type == 'CHOICE') {
    /** @type {ChoiceRule} rule */
    /** @type {Rule[]} */
    let members = [];
    for (const r of rule.members) {
      members.push(_unspace(baseGrammar, visited, r, ruleMap));
    }
    return choice(...members);
  }
  if (rule.type == 'SEQ') {
    /** @type {SeqRule} rule */
    /** @type {Rule[]} */
    let members = [];
    for (const r of rule.members) {
      members.push(_unspace(baseGrammar, visited, r, ruleMap));
    }
    return seq(...members);
  }
  if (rule.type == 'REPEAT') {
    let unwrappedRule = rule.content;
    return repeat(_unspace(baseGrammar, visited, unwrappedRule, ruleMap));
  }

  if (rule.type == 'REPEAT1') {
    let unwrappedRule = rule.content;
    return repeat1(_unspace(baseGrammar, visited, unwrappedRule, ruleMap));
  }
  if (rule.type == 'ALIAS') {
    /** @type {AliasRule} rule */
    return alias(_unspace(baseGrammar, visited, rule.content, ruleMap), rule.value);
  }
  if (rule.type == 'PATTERN') {
    /** @type {PatternRule} rule */
    return token.immediate({ ...rule });
  }
  if (rule.type == 'PREC_LEFT') {
    /** @type {PrecLeftRule} rule */
    return prec.left(rule.value, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'PREC_RIGHT') {
    /** @type {PrecRightRule} rule */
    return prec.right(rule.value, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'STRING') {
    /** @type {StringRule} rule */
    return token.immediate({ ...rule });
  }
  if (rule.type == 'TOKEN') {
    /** @type {TokenRule} rule */
    return token.immediate({ ...rule })
  }
  if (rule.type == 'PREC') {
    /** @type {PrecRule} rule */
    return prec(rule.value, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'FIELD') {
    /** @type {FieldRule} rule */
    return field(rule.name, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'BLANK') {
    /** @type {BlankRule} rule */
    return blank();
  }
  if (rule.type == 'PREC_DYNAMIC') {
    /** @type {PrecDynamicRule} rule */
    return prec.dynamic(rule.value, _unspace(baseGrammar, visited, rule.content, ruleMap));
  }
  if (rule.type == 'IMMEDIATE_TOKEN') {
    /** @type {ImmediateTokenRule} rule */
    return { ...rule }
  }
  /// This should NEVER happen
  return rule;
}
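
The utility above rewrites terminals with token.immediate. The other direction described earlier, inserting MAYBE_NEWLINE between the members of a rule, might look roughly like the following. This is a minimal sketch under some assumptions: it works on the same plain-object rule shapes the utility inspects (rule.type, rule.members, rule.content), the constructors are local stand-ins so the snippet runs outside tree-sitter, and allowNewlines is a hypothetical name. Unlike the utility, it does not follow symbol references.

```javascript
// Stand-ins that build the same plain-object shapes the utility above
// inspects. Inside grammar.js the real DSL functions would be used.
const str = (value) => ({ type: "STRING", value });
const sym = (name) => ({ type: "SYMBOL", name });
const seq = (...members) => ({ type: "SEQ", members });
const choice = (...members) => ({ type: "CHOICE", members });
const optional = (rule) => choice(rule, { type: "BLANK" });

const MAYBE_NEWLINE = optional(str("\n"));

// Hypothetical helper: returns a copy of `rule` in which an optional
// newline is allowed between every pair of adjacent SEQ members.
function allowNewlines(rule) {
  switch (rule.type) {
    case "SEQ":
      return seq(
        ...rule.members.flatMap((m, idx) =>
          idx === 0 ? [allowNewlines(m)] : [MAYBE_NEWLINE, allowNewlines(m)]
        )
      );
    case "CHOICE":
      return choice(...rule.members.map((m) => allowNewlines(m)));
    case "TOKEN":
    case "IMMEDIATE_TOKEN":
      return rule; // tokens are atomic, so their content is left alone
    default:
      // REPEAT, PREC, FIELD, ALIAS, ... carry a single `content` rule;
      // STRING, PATTERN, SYMBOL and BLANK have none and are returned as-is.
      return rule.content
        ? { ...rule, content: allowNewlines(rule.content) }
        : rule;
  }
}

const letStmt = seq(str("let"), sym("identifier"), str("="), sym("expression"));
const relaxed = allowNewlines(letStmt);
console.log(relaxed.members.length); // 7: four members plus three separators
```

A real version would probably also need to decide what to do between repeat iterations and around optional members, where two adjacent optional newlines could end up next to each other and may cause conflicts.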



Update on the above: It seems like masaeedu, #931 (comment), also had this generator idea!

Thanks, and thank you for the wonderful software!
Michael
