[BUG] Missed comment attachment during parsing #299

richardmarshall · 2024-04-06T21:20:39Z

Kind of proposals

Parser

Describe the problem

While reviewing and testing out the new formatting functionality (#291) I ran into a number of issues related to comments. There are a lot of places where comments in the input VCL are lost in the parsing process and as a consequence are missing from the formatted output. While this is largely most relevant to the formatting efforts I'm making this a separate issue since at its root this is a parser problem.

To use an over the top example, for the following VCL input:

/* 1 */ declare /* 2 */ local /* 3 */ var.s /* 4 */ STRING /* 5 */ ; /* 6 */

In this case only comments 1 and 6 are captured during parsing and a print of this node would yield

/* 1 */
declare local var.s STRING; /* 6 */

There are many other cases of valid comment positions that get lost during parsing, before semi-colons, trailing comments at the end of files, etc. Some of the loss is simply in the printing and the comments are attached to nodes. It's just the printing logic isn't emitting them, others are completely lost when tokens are read without a node being created for them.

TL;DR

We're currently losing comments but our existing comment attachment points should be flexible enough to correctly capture all possible comments. To do so will require tweaks to the parser to ensure all of them are correctly attached to their relevant node. Will submit a PR with a starting point implementation for discussion.

Comment attachment point analysis

Currently our AST node meta struct can handle a max of 3 comment attachment points (leading, infix, trailing) so it is important to confirm that this is enough to handle all cases. The following is an attempt to cover the VCL syntax and verify that our current model is sufficient to represent all comment positions as well as identifying what nodes and which positions each comment location should be attached to.

Position markers:

<start> - comment immedietly preceding the node
<end> - comment immediatly following the node
<after_keyword> - comment immedietaly following a keyword
<after_bracket> - comment immedietly following a bracket ((, {)
<x> - comment is attached to a child node

Declarations:

Untyped declarations: (acl, table (implicit string), backend, ...)

<start> token <x> name <x> { <after_bracket>
  <x>
} <end>

attachment points: 3

Typed declarations: (table, sub)

<start> token <x> name <x> type <x> { <after_bracket>
  <x>
} <end>

attachment points: 3

Statements:

Single token statements: return (bare), break, fallthrough, esi, ...

<start> token <after_keyword> ; <end>

attachment points: 3

Two token statements: return (value/state), log, unset ...

<start> token <x> value <x> ; <end>

attachment points: 2

Three token statements: error

<start> token <x> value <x> value <x> ; <end>

attachment points: 2

Four token statements: declare, set

<start> declare <after_keyword> local <x> var.f <x> STRING <x> ; <end>

attachment points: 3

<start> set <x> var.s <x> = <x> value <x> ; <end>

attachment points: 2

Block statements:

<start> if <x> () <x> { <x>
  <x>
} <x>
<start> elseif <x> () <x> { <x>
  <x>
} <x>
<start> elsif <x> { <x>
  <x>
} <x>
<start> else <x> { <x>
  <x>
} <x>

attachment points: 1

<start> else <after_keyword> if <x> () <x> { <x>
  <x>
} <x>
<start> switch <x> () <x> { <after_bracket>
  <x>
} <end>
<start> { <after_bracket>
<x>
} <end>

attachment points: 3

Labels:

single token labels: (goto label)

<x> ident <x> : <end>

attachment points: 1

keyword label: (default:)

<start> default: <end>

Note: default label also can't have white space (meaning no comments) between token and colon, need to make sure parser is handling that correctly.

case labels:

<start> token <x> value <x> : <end>
<start> token <x> operator <x> value <x> : <end>

attachment points: 2

Expressions:

Ident:

<start> var.x <end>

attachment points: 2

Unary operator:

<start> ! <x> var.x <x>
<x> 20 <x> % <end>

attachment points: 1

Binary operator:

<x> var.x <x> == <x> var.y <x>
var.x ~ "foo"
var.x + var.y
var.x var.x

Attachment points: 0

Grouped expressions:

<start> (<x> a <x> == <x> b <x>) <end>

attachment points: 2

Function call expression:

<x> foo <x> (<x>) <end>

attachment points: 1

Exception: empty argument list

Function expression AST node stores arguments in a slice so when the expression has no arguments there is nowhere to attach the comments within the argument list. In this case the comment could be attached to the function expression.

<x> foo <x> (<after_bracket>) <end>

attachment points: 2

Others:

acl entries

<x> ip <x> ; <end>
<start> ! <x> ip <x> ; <end>
<x> ip <x> / <x> mask <x> ; <end>
<start> ! <x> ip <x> / <x> mask <x> ; <end>

table entries

<x> key <x> : <x> value <x>
<x> key <x> : <x> value <x> , <end>

backend / director properties

<x> key <x> = <x> value <x> ; <end>
<x> key <x> = <x> { <x> } <x> ; <end>

Ambiguity:

In some cases there is ambiguity around which node a comment should be attached to.

Comment could be an comment for 200 or a comment for "OK"

error 200 <x> "OK";

Comment could be an <after_keyword> comment for the error node or comment for the status code.

error <x> 200;

For cases of single comments between a token and a value such as in the case above I think it makes more sense for the comment to be attached as a comment for the value. As the comment attachment logic for the declare statement doesn't need to worry about the comment and it can be handled by the ident parsing. For the same reason I think for situations like declare local var.s <x> STRING; attaching the comment as an comment for var.s makes the most sense. No need to special case the ident comment attachment logic since there is no formatting operation that would be impacted by which node the comment is attached to.

For multiple comments between tokens should all be comments for 200, all comments for "OK", or split between them.

error 200 <x> <y> <z> "OK";

For single line statements like this again I think there is little value in attempting to apply meaning to the attachment for each of the comments between idents as there is no formatting rules that would be impacted by it.

For comments between two statements should the comment be an comment for the first statement or a comment for the next one.

log "foo";
<x>
log "bar";

Typically comments on their own line are associated with the statements following them vs a statement preceding them. So in situations like this the comment should be attached as a comment for the second log statement.

Similarly when multiple comments are between two statements.

log "foo";
<x>
<y>
<z>
log "bar";

Again for this case I would say the comments should be attached as comments for the second log statement.

Some uncertainty does come up when dealing with comments at the declaration level:

sub foo () {}
<x>

<y>

<z>
sub bar () {}

The simplest solution would be to consider the comments as comments for the second function declaration.

Note: If declarations do not check for trailing comments then a special case handling for EOF would need to be added to attach any remaining unbound comments to the last declaration node found in the source.

Conclusions

So far I have not found any syntax structures that would not be able to be represented with our current 3 comment attachment point model.

This is a lot to get through and much of this is notes for implementing a fix.

The text was updated successfully, but these errors were encountered:

ysugimoto · 2024-04-09T14:27:22Z

I realized that the current comments parsing won't be enough.

Fortunately, the lexer can lex each comment but the parser will lack them.
So I will change saving comments in the ast.Meta struct as map[position]ast.Comments to get comments placed at any positions that you describe.

Example

Here is the case of subroutine:

// a
sub /* b */ vcl_recv /* c */ { // d
  ...
}

&ast.SubroutineDeclaration{
  Meta: &ast.Meta{
    Comments: map[string]ast.Comments{
      "leading": ast.Comments{{Value: "// a" }},
      "before_name": ast.Comments{{Value: "/* b */"}},
      "after_name": ast.Comments{{Value: "/* c */"}},
    },
  },
  ...

The // d comment will be attached to the first statement inside block statement (the same as biomejs)
The comment-position name could be arbitrary and suitable for the statements.

richardmarshall added bug Something isn't working proposal project proposal labels Apr 6, 2024

richardmarshall self-assigned this Apr 6, 2024

This was referenced Apr 7, 2024

Test of new comment attachment model #300

Draft

Feature/implement formatter #291

Merged

ysugimoto mentioned this issue Apr 19, 2024

Improve parser/ast for complex comments #302

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Missed comment attachment during parsing #299

[BUG] Missed comment attachment during parsing #299

richardmarshall commented Apr 6, 2024 •

edited

ysugimoto commented Apr 9, 2024 •

edited

[BUG] Missed comment attachment during parsing #299

[BUG] Missed comment attachment during parsing #299

Comments

richardmarshall commented Apr 6, 2024 • edited

Kind of proposals

Describe the problem

TL;DR

Comment attachment point analysis

Position markers:

Declarations:

Statements:

Block statements:

Labels:

Expressions:

Others:

Ambiguity:

Conclusions

ysugimoto commented Apr 9, 2024 • edited

Example

richardmarshall commented Apr 6, 2024 •

edited

ysugimoto commented Apr 9, 2024 •

edited