Digits after binary/octal #74

blake-regalia · 2019-08-21T05:25:28Z

Not sure how this happens yet but binary and octal numbers suffixed with digits outside their range are not marked as invalid.

// `constant.numeric.binary`, `constant.numeric.decimal`
let bin = 0b0123;

// `constant.numeric.octal`, `constant.numeric.decimal`
let oct = 0o09;

bathos · 2019-08-31T11:14:17Z

This is an example of a more general thing which we currently allow in nearly all cases without attempting to mark invalidity: one assignment expression following another without a semicolon or (if applicable) a possible ASI opportunity. For example:

Although less surprising visually, that example is illustrating the same thing as 0b0123. For 2 2, a space is needed to observe the problem because 22 would lex as a single decimal token. Lexically, 0b0123 is valid source text: it tokenizes as a BinaryIntegerLiteral "0b01" followed by a DecimalLiteral "23" without a hitch. Whitespace isn’t required to appear between tokens and there’s no lookahead assertion after BinaryDigit or anything. But then in the syntactic grammar, both 2 2 and 0b0123 will end up being invalid anyway, and for the same reason, which is that one number token followed by another doesn’t match any production in ES.
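To make the tokenization described above concrete, here is a minimal sketch of a numbers-only lexer. It is an illustration written for this thread, not the real ES lexer and not code from this repo: the names NUMERIC_PATTERNS and lexNumbers are hypothetical, and it models only the literal productions themselves, ignoring any further restriction the specification may place on the character that follows a numeric literal.

```javascript
// Simplified numeric-only lexer (hypothetical helper, for illustration).
// Each pattern is sticky, so it can only match at the current position.
const NUMERIC_PATTERNS = [
  ["BinaryIntegerLiteral", /0[bB][01]+/y],
  ["OctalIntegerLiteral", /0[oO][0-7]+/y],
  ["DecimalLiteral", /[0-9]+/y],
];

function lexNumbers(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    // Whitespace separates tokens but is not itself a token.
    if (/\s/.test(source[pos])) {
      pos++;
      continue;
    }
    let matched = false;
    for (const [type, pattern] of NUMERIC_PATTERNS) {
      pattern.lastIndex = pos;
      const m = pattern.exec(source);
      if (m) {
        tokens.push({ type, text: m[0] });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new SyntaxError(`Unexpected character at index ${pos}`);
    }
  }
  return tokens;
}

console.log(lexNumbers("0b0123"));
console.log(lexNumbers("0o09"));
console.log(lexNumbers("2 2"));
```

Under these assumptions, 0b0123 and 0o09 each come out as two tokens, just as 2 2 does, while 22 stays a single DecimalLiteral.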

So the broader problem has to do with deciding that an expression (or at least, whatever would seem to start one) isn’t legal if the last thing matched was itself an expression. The current ‘allowances’ occur in various ways. For example, we aren’t requiring expression statements to be followed by a semicolon (or an ASI opportunity), and we aren’t requiring array element assignment expressions to be separated by at least one comma.

IIRC, I was originally reluctant to pursue marking ‘unexpected expression continuations’ as invalid for two reasons. One is just the heuristic nature of expression matching in sublime syntax. Though hopefully the expression contexts ultimately implement the same logic (to the extent possible) as the real expression grammar, it’s still tough to say confidently ‘that’s definitely wrong’ when AE contexts didn’t find a way to continue, yet the next token is an ‘AE component’. This is closely related to the second, and primary reason: ASI. There’s really very little in ES Sublime that attempts addressing ASI (much less which attempts addressing it correctly) except to occasionally throw our hands in the air and say ‘well, anything could happen here,’ which is what occurs in these examples. The ASI algorithm + absence of cross-line matching in sublime-syntax ... not best pals.

There’s a precedent for special casing stuff like 0b0123. It’s unambiguously a typo and addressing it doesn’t need to entail wading into the ASI swamp. The existing similar case is that we are expressly disallowing matching 123abc as a decimal followed by an identifier, even though lexically that is correct (for the exact same reason that 0b0123 is). This is done by including {{idEnd}} at the end of the decimal pattern. The same could be done for the binary and octal (and hex) patterns.
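A rough sketch of what appending an identifier-end lookahead does to the binary pattern. The actual definition of {{idEnd}} is not shown in this thread, so ID_END below is only an assumed approximation (a negative lookahead for a word character or $), and BINARY_STRICT, BINARY_LOOSE, and matchAt are hypothetical names used for illustration.

```javascript
// ASSUMPTION: stand-in for the syntax's {{idEnd}} variable, whose real
// definition is not given in this thread.
const ID_END = String.raw`(?![\w$])`;

// Binary literal pattern with and without the trailing lookahead.
const BINARY_STRICT = new RegExp(String.raw`0[bB][01]+` + ID_END, "y");
const BINARY_LOOSE = /0[bB][01]+/y;

// Returns the text matched at `pos`, or null if the pattern does not match there.
function matchAt(pattern, source, pos = 0) {
  pattern.lastIndex = pos;
  const m = pattern.exec(source);
  return m ? m[0] : null;
}

console.log(matchAt(BINARY_LOOSE, "0b0123"));  // matches the "0b01" prefix
console.log(matchAt(BINARY_STRICT, "0b0123")); // no match: a digit follows
console.log(matchAt(BINARY_STRICT, "0b01;"));  // matches "0b01"
```

With the lookahead in place the pattern refuses to match 0b0123 at all, so a later rule can treat the whole run as invalid instead of splitting it into two numeric tokens.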

That would fix the specific cases reported here, but on reviewing the bigger problem, I find that solution feels pretty weak. It’s not just a band-aid which doesn’t prevent the majority of similar cases from occurring — it also doesn’t actually produce the correct output. The existing {{idEnd}} in the decimal pattern should be removed, too, since it causes 123abc to be marked as invalid, yet the real invalid token here is only abc. Failing to scope the first of the two tokens as valid obscures where the mistake really happens. The SyntaxError messages in Firefox communicate this correctly:

So the questions now are sorta ... was I correct to be wary of / give up on asserting ‘this is wrong’ upon encountering another AE right after bottoming out in ae_AFTER_POSTFIX? Is more accurate ASI simulation intractable?

I just had a go at correcting the handling in cases within array literals. This one is quite simple:

It’s also easy to get to this:

...but what’s hard is getting to that without also getting to this:

I suspect tackling ASI in a more legit way is possible. Some of the existing ASI-related logic has clear room for improvement, correction, and unification. I’ll continue playing around with different strategies (which likely will involve meta_include_prototype: false, since linebreaks within comments count for ASI) for a bit. If it turns out to be too tough right now, sprinkling some {{idEnd}} dust on the numeric patterns might be alright as a stopgap.

bathos · 2019-08-31T11:17:33Z

(tagging both 'bug' and 'enhancement' wouldn’t quite capture the nature of this issue like bughancement does)

bathos · 2019-09-29T19:45:29Z

Just an update:

I dug into the ASI question more today. The logic can be generalized to address the whole category in theory, but it still fails under most common circumstances because all the constructs to which ASI can apply also (implicitly) have meta_include_prototype: true. That means they’ll consume whitespace (including newlines) and comments (including newlines) greedily. Putting meta_include_prototype in the asi scope effectively does nothing, then, and the only cases you can really handle right are those where the token which follows an ‘asi newline’ does so immediately (since you can use ^, in that case).

If we could use arbitrary length lookbehinds, this could be addressed with pretty good accuracy without big changes (‘am I, the unexpected token, preceded by a newline followed by zero or more whitespace & inline comment tokens, or what appears to be the tail end of a multiline comment? if so, asi can be considered applicable; pop’). Sublime requires lookbehinds to have fixed length (probably for good reason), so that isn’t an option.
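For illustration only, here is the question from that parenthetical written out in plain JavaScript, where nothing limits how far back one can look. This is a heuristic sketch, not something sublime-syntax can express: precededByAsiNewline and the pattern names are hypothetical, only \n is treated as a line terminator, and comment-like text inside strings or regex literals is not accounted for.

```javascript
// Text of a block comment that stays on one line / that may span lines.
// The (?!\*/) guard stops the body from running past the comment's end.
const BODY_ONE_LINE = String.raw`(?:(?!\*/)[^\n])*`;
const BODY_ANY = String.raw`(?:(?!\*/)[\s\S])*`;

// Spaces, tabs, and single-line block comments: trivia with no newline.
const INLINE_TRIVIA = String.raw`(?:[ \t]|/\*${BODY_ONE_LINE}\*/)*`;

// A block comment containing at least one newline.
const MULTILINE_COMMENT = String.raw`/\*${BODY_ANY}\n${BODY_ANY}\*/`;

// "A newline (or the tail of a multiline comment), then only inline trivia."
const ASI_TRIVIA = new RegExp(
  String.raw`(?:\n|${MULTILINE_COMMENT})${INLINE_TRIVIA}$`
);

// True when the token starting at `index` could be on a new line for ASI purposes.
function precededByAsiNewline(source, index) {
  return ASI_TRIVIA.test(source.slice(0, index));
}

const sameLine = "a /* x */ b";
const nextLine = "a\n  /* x */ b";
const viaComment = "a /* x\n y */ b";
console.log(precededByAsiNewline(sameLine, sameLine.indexOf("b")));
console.log(precededByAsiNewline(nextLine, nextLine.indexOf("b")));
console.log(precededByAsiNewline(viaComment, viaComment.indexOf("b")));
```

Anchoring the pattern at the end of the text before the token is what stands in for the arbitrary-length lookbehind; the fixed-length restriction is exactly what rules this out inside the syntax definition.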

A solution would seem to need to get pushed up to all the points where expressions might or might not continue. All such scopes would have to have meta_include_prototype: false and would add an include of a new scope that, on the basis of newline-including would-be-proto tokens, may shift into additional new ‘asi could happen’ and ‘asi cannot happen’ scopes. I’m not sure how realistic this is — I think it would need to extend its tendrils through everything — though it kinda makes sense that this would be so, since the ASI algorithm is essentially ‘those non-syntactic tokens? they’re syntactic now, maybe’.

blake-regalia · 2019-09-30T03:48:29Z

Really interesting getting this deep into the syntax -- I wonder if this will be yet another aspect of the language that ultimately motivates us to transition to a mostly generated syntax def. We could easily experiment with the feasibility of recursively generating contexts for situations like this.

By the way, I ended up creating this API/extension for generating syntaxes to streamline my work with the LinkedData syntaxes package. The API/extension is largely informed by the techniques I picked up from you and working on this repo (e.g., else_pop, other_illegal, etc). Perhaps the most convenient feature of this library tho is the switch/goto directives, which create auto-lookahead variables for all the possible regexes that can match in the context to which they are transitioning. Overall, it brings the syntax much closer to the productions of a PEG parser.

bathos added the bughancement label Aug 31, 2019
