Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect result of ts_node_first_child_for_pos in a heredoc_body #139

Open
casouri opened this issue Dec 14, 2022 · 8 comments
Open

Incorrect result of ts_node_first_child_for_pos in a heredoc_body #139

casouri opened this issue Dec 14, 2022 · 8 comments

Comments

@casouri
Copy link

casouri commented Dec 14, 2022

In a following source:

echo <<EOF
text1 $var
text2 $(echo cmd)
EOF

The return value of ts_node_first_child_for_byte(heredoc_body, 22), where heredoc_body is the heredoc_body node, and 22 is at the beginning of "text2", is expected to be the command_substitution node representing $(echo cmd). However, the function returns a null node.

But if I evaluate ts_node_first_child_for_byte(heredoc_body, 12), where 12 is at the beginning of "text1", the return value is expected: the simple_expansion representing $var. So I don't know what could be the cause.

Here is the program I used to test this:

#include <string.h>
#include <stdio.h>
#include <tree_sitter/api.h>

TSLanguage *tree_sitter_bash();

int main() {

  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_bash());


  char *source = "echo <<EOF\n"
	"text1 $var\n"
	"text2 $(echo cmd)\n"
	"EOF\n";


  TSTree *tree = ts_parser_parse_string(parser, NULL, source, strlen(source));

  TSNode root = ts_tree_root_node(tree);
  TSNode heredoc_body = ts_node_child(root, 2);
  /* 12 = beginning of "text1".  */
  TSNode child_at_12 = ts_node_first_child_for_byte(heredoc_body, 12);
  /* 12 = beginning of "text2".  */
  TSNode child_at_22 = ts_node_first_child_for_byte(heredoc_body, 22);

  if (!ts_node_is_null(child_at_12))
	printf("Child at 12 (beg of \"text1\") is %s\n", ts_node_type(child_at_12));
  if (ts_node_is_null(child_at_22))
	printf("Child at 22 (beg of \"text2\") is null: unexpected\n");
  printf("Parse tree:\n%s\n", ts_node_string(root));
  return 0;
}

Compiled using

gcc bash-bug.c -o bash-bug -ltree-sitter -ltree-sitter-bash -L.

If I run it, I get

Child at 12 (beg of "text1") is simple_expansion
Child at 22 (beg of "text2") is null: unexpected
Parse tree:
(program (redirected_statement body: (command name: (command_name (word))) redirect: (heredoc_redirect (heredoc_start))) (heredoc_body (simple_expansion (variable_name)) (command_substitution (command name: (command_name (word)) argument: (word)))))

I'm using the latest master branch of tree-sitter-bash and tree-sitter 0.20.7. TIA!

@maxbrunsfeld
Copy link
Contributor

Thanks for the detailed report @casouri.

@dannyfreeman
Copy link

@maxbrunsfeld

Hello, I believe I have found a similar issue in tree-sitter-clojure: sogaiu/tree-sitter-clojure#32

I have based the reproduction recipe off of the one @casouri submitted. I'm not sure what we can do about it in the grammar itself. Now that it is seen here and in the clojure grammar I am wondering if it is a bug in tree-sitter itself.

@sogaiu
Copy link

sogaiu commented Dec 24, 2022

@casouri I continued investigating @dannyfreeman's sogaiu/tree-sitter-clojure#32 case and have turned up some bits that might be relevant to this issue.

I've summarized a bit here.

Update: On second thought, it might be better to disregard the above for now -- I'm not seeing a difference in debug graph output for this issue. Sorry for the false alarm.

@sogaiu
Copy link

sogaiu commented Dec 25, 2022

@casouri @maxbrunsfeld @dannyfreeman

The current implementation of ts_node__first_child_for_byte is:

static inline TSNode ts_node__first_child_for_byte(
  TSNode self,
  uint32_t goal,
  bool include_anonymous
) {
  TSNode node = self;
  bool did_descend = true;

  while (did_descend) {
    did_descend = false;

    TSNode child;
    NodeChildIterator iterator = ts_node_iterate_children(&node);
    while (ts_node_child_iterator_next(&iterator, &child)) {
      if (ts_node_end_byte(child) > goal) {
        if (ts_node__is_relevant(child, include_anonymous)) {
          return child;
        } else if (ts_node_child_count(child) > 0) {
          did_descend = true;
          node = child;
          break;
        }
      }
    }
  }

  return ts_node__null();
}

IIUC, this implementation doesn't appear to handle this issue's case or that of sogaiu/tree-sitter-clojure#32 because there is some "climbing back up" that needs to happen (which doesn't).

Here's the tail end of debug graph output I get for the source text in the sample code:

issue-139-smaller

In the figure above, there comes a time when the lower heredoc_body_repeat1 is examined. After looking at its two children (simple_expansion and _heredoc_body_middle), because the second of the children -- _heredoc_body_middle -- does not have any children (note for future that did_descend does not get set to true and remains false), the inner while loop is reached a third time, but then ts_node_child_iterator_next() returns false. This ultimately leads to the whole function returning the value of ts_node__null() (as did_descend is false so the outer while loop is exited). It seems to me that rather than do that, there should be some "climbing back up" which should lead to examining command_substitution.

I tried replacing the current implementation of ts_node__first_child_for_byte with this potentially much less efficient (but possibly a bit better behaved) implementation:


static inline TSNode ts_node__first_child_for_byte(
  TSNode self,
  uint32_t goal,
  bool include_anonymous
) {
  TSNode child;
  uint32_t count = ts_node_child_count(self);

  for (uint32_t i = 0; i < count; i++) {
    child = ts_node_child(self, i);
    if (ts_node_end_byte(child) > goal) {
      if (ts_node__is_relevant(child, include_anonymous)) {
        return child;
      }
    }
  }

AFAICT, with this alternative implementation, the sample code seems to behave better (though it may not be efficient if ts_node_child_count gets called repeatedly). This alternative implementation also yielded better results for sogaiu/tree-sitter-clojure#32.

I take it that calling ts_node_child repeatedly is undesirable as it appears similar work will get performed unnecessarily in each subsequent call as subsequent children are requested.

Here's the log.html I got for reference: log.html.gz

@ahlinc
Copy link
Contributor

ahlinc commented Aug 27, 2023

image

The current heredoc implementation ate the words text1 and text2 in the body of the heredoc.
/cc: @amaanq

@amaanq
Copy link
Member

amaanq commented Aug 27, 2023

it's because it's _heredoc_body_middle which is hidden, do we want to expose that as heredoc_content?

@ahlinc
Copy link
Contributor

ahlinc commented Aug 27, 2023

var=VAR
cat <<EOF
    text1 $var
    text2 $(echo cmd)
EOF

image
image

Lost of the prefix words.


cat <<"EOF"
    text1 $var
    text2 $(echo cmd)
EOF

image
image

Lost of the first heredoc line.

@ahlinc
Copy link
Contributor

ahlinc commented Aug 27, 2023

it's because it's _heredoc_body_middle which is hidden, do we want to expose that as heredoc_content?

Yes we want, for all visible things there should be nodes in the tree.

I think the _heredoc_body_middle should be exposed like heredoc_literal.

Also the heredoc_body shouldn't be a single multiline node that may represent the whole heredoc body and instead of it should contain one or more heredoc_literal nodes for every heredoc line (without trailing \n).

Between heredoc_literal nodes there may be various variable or command expansion nodes.

/cc: @amaanq

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants