Digits after binary/octal #74

Open
blake-regalia opened this issue Aug 21, 2019 · 4 comments

@blake-regalia
Collaborator

Not sure how this happens yet but binary and octal numbers suffixed with digits outside their range are not marked as invalid.

// `constant.numeric.binary`, `constant.numeric.decimal`
let bin = 0b0123;

// `constant.numeric.octal`, `constant.numeric.decimal`
let oct = 0o09;

@bathos
Owner

bathos commented Aug 31, 2019

This is an example of a more general thing which we currently allow in nearly all cases without attempting to mark invalidity: one assignment expression following another without a semicolon or (if applicable) a possible ASI opportunity. For example:

image

Although less surprising visually, that example illustrates the same thing as 0b0123. For 2 2, a space is needed to observe the problem because 22 would lex as a single decimal token. Lexically, 0b0123 is valid source text: it tokenizes as a BinaryIntegerLiteral "0b01" followed by a DecimalLiteral "23" without a hitch. Whitespace isn’t required to appear between tokens and there’s no lookahead assertion after BinaryDigit or anything. But then in the syntactic grammar, both 2 2 and 0b0123 will end up being invalid anyway, and for the same reason, which is that one number token followed by another doesn’t match any production in ES.
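
A minimal sketch of the split described above, using a naive longest-match lexer over numeric literals only. The names `NUMERIC_TOKEN` and `lexNumbers` are made up for illustration, and the pattern is a heavily simplified stand-in for the real lexical grammar (no separators, BigInt suffixes, fractions, or exponents). It applies the productions alone and does not model the spec's prose restriction that the character immediately following a NumericLiteral must not be an IdentifierStart or DecimalDigit, so it shows how the text splits rather than what an engine accepts.

```javascript
// Simplified numeric token pattern (assumption: integers only).
// The sticky flag (y) forces each match to start exactly at lastIndex.
const NUMERIC_TOKEN = /0[bB][01]+|0[oO][0-7]+|0[xX][0-9a-fA-F]+|[0-9]+/y;

// Split a string of numeric literals into tokens, skipping whitespace.
function lexNumbers(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    if (/\s/.test(source[pos])) {
      pos += 1;
      continue;
    }
    NUMERIC_TOKEN.lastIndex = pos;
    const match = NUMERIC_TOKEN.exec(source);
    if (match === null) {
      throw new SyntaxError(`unexpected character at offset ${pos}`);
    }
    tokens.push(match[0]);
    pos = NUMERIC_TOKEN.lastIndex;
  }
  return tokens;
}

console.log(lexNumbers('0b0123')); // [ '0b01', '23' ]
console.log(lexNumbers('0o09'));   // [ '0o0', '9' ]
console.log(lexNumbers('2 2'));    // [ '2', '2' ]
console.log(lexNumbers('22'));     // [ '22' ]
```

The binary alternative stops at the first character outside `[01]`, and the remaining digits are picked up by the decimal alternative on the next iteration, which is the same two-token result as `2 2`.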

So the broader problem has to do with deciding that an expression (or at least, whatever would seem to start one) isn’t legal if the last thing matched was itself an expression. The current ‘allowances’ occur in various ways. For example, we aren’t requiring expression statements to be followed by a semicolon (or an ASI opportunity), and we aren’t requiring array element assignment expressions to be separated by at least one comma.

IIRC, I was originally reluctant to pursue marking ‘unexpected expression continuations’ as invalid for two reasons. One is just the heuristic nature of expression matching in sublime syntax. Though hopefully the expression contexts ultimately implement the same logic (to the extent possible) as the real expression grammar, it’s still tough to say confidently ‘that’s definitely wrong’ when AE contexts didn’t find a way to continue, yet the next token is an ‘AE component’. This is closely related to the second, and primary reason: ASI. There’s really very little in ES Sublime that attempts addressing ASI (much less which attempts addressing it correctly) except to occasionally throw our hands in the air and say ‘well, anything could happen here,’ which is what occurs in these examples. The ASI algorithm + absence of cross-line matching in sublime-syntax ... not best pals.

There’s a precedent for special casing stuff like 0b0123. It’s unambiguously a typo and addressing it doesn’t need to entail wading into the ASI swamp. The existing similar case is that we are expressly disallowing matching 123abc as a decimal followed by an identifier, even though lexically that is correct (for the exact same reason that 0b0123 is). This is done by including {{idEnd}} at the end of the decimal pattern. The same could be done for the binary and octal (and hex) patterns.
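
A rough illustration of what an `{{idEnd}}`-style guard does, written as JavaScript regexes rather than sublime-syntax. `ID_END` here is a hypothetical stand-in (a negative lookahead for identifier characters); the actual definition of `{{idEnd}}` in the syntax may differ, and the numeric patterns are simplified.

```javascript
// Hypothetical stand-in for {{idEnd}}: nothing identifier-like may follow.
// ID_Continue covers letters, digits, and underscore, among others.
const ID_END = /(?![$\u200C\u200D\p{ID_Continue}])/u;

// Simplified numeric patterns, each with the guard appended.
const guarded = {
  binary: new RegExp(/0[bB][01]+/.source + ID_END.source, 'uy'),
  octal: new RegExp(/0[oO][0-7]+/.source + ID_END.source, 'uy'),
  hex: new RegExp(/0[xX][0-9a-fA-F]+/.source + ID_END.source, 'uy'),
  decimal: new RegExp(/[0-9]+/.source + ID_END.source, 'uy'),
};

// Return the matched literal at the start of `source`, or null.
function matchGuarded(kind, source) {
  const pattern = guarded[kind];
  pattern.lastIndex = 0;
  const match = pattern.exec(source);
  return match === null ? null : match[0];
}

console.log(matchGuarded('binary', '0b01;'));    // '0b01'
console.log(matchGuarded('binary', '0b0123'));   // null
console.log(matchGuarded('octal', '0o09'));      // null
console.log(matchGuarded('decimal', '123abc'));  // null
```

With the guard in place the literal pattern refuses to match at all when more identifier characters or digits follow, so the whole run falls through to whatever invalid handling comes next.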

That would fix the specific cases reported here, but on reviewing the bigger problem, I find that solution feels pretty weak. It’s not just a band-aid which doesn’t prevent the majority of similar cases from occurring — it also doesn’t actually produce the correct output. The existing {{idEnd}} in the decimal pattern should be removed, too, since it causes 123abc to be marked as invalid, yet the real invalid token here is only abc. Failing to scope the first of the two tokens as valid obscures where the mistake really happens. The SyntaxError messages in Firefox communicate this correctly:

image
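
For comparison, a sketch of the alternative scoping argued for above: keep the first token valid and flag only what is glued onto it. The labels `valid` and `invalid` are placeholders, not real scope names, and `scopeNumericRun` is a hypothetical helper using simplified patterns.

```javascript
// Simplified numeric literal pattern (integers only, an assumption).
const NUMERIC = /0[bB][01]+|0[oO][0-7]+|0[xX][0-9a-fA-F]+|[0-9]+/uy;
// Any run of identifier characters or digits directly after the literal.
const ID_RUN = /[$\u200C\u200D\p{ID_Continue}]+/uy;

// Label the leading literal as valid and any adjoining run as invalid.
function scopeNumericRun(source) {
  NUMERIC.lastIndex = 0;
  const number = NUMERIC.exec(source);
  if (number === null) return [];
  const spans = [{ text: number[0], scope: 'valid' }];
  ID_RUN.lastIndex = number[0].length;
  const trailing = ID_RUN.exec(source);
  if (trailing !== null) {
    spans.push({ text: trailing[0], scope: 'invalid' });
  }
  return spans;
}

console.log(scopeNumericRun('123abc'));
// [ { text: '123', scope: 'valid' }, { text: 'abc', scope: 'invalid' } ]
console.log(scopeNumericRun('0b0123'));
// [ { text: '0b01', scope: 'valid' }, { text: '23', scope: 'invalid' } ]
```

This only handles the directly adjacent case; a following token separated by whitespace is the ASI-shaped problem discussed in the rest of the thread.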

So the questions now are sorta ... was I correct to be wary of / give up on asserting ‘this is wrong’ upon encountering another AE right after bottoming out in ae_AFTER_POSTFIX? Is more accurate ASI simulation intractable?

I just had a go at correcting the handling in cases within array literals. This one is quite simple:

image

It’s also easy to get to this:

image

...but what’s hard is getting to that without also getting to this:

image

I suspect tackling ASI in a more legit way is possible. Some of the existing ASI-related logic has clear room for improvement, correction, and unification. I’ll continue playing around with different strategies (which likely will involve meta_include_prototype: false, since linebreaks within comments count for ASI) for a bit. If it turns out to be too tough right now, sprinkling some {{idEnd}} dust on the numeric patterns might be alright as a stopgap.

@bathos
Owner

bathos commented Aug 31, 2019

(tagging both 'bug' and 'enhancement' wouldn’t quite capture the nature of this issue like bughancement does)

@bathos
Owner

bathos commented Sep 29, 2019

Just an update:

I dug into the ASI question more today. The logic can be generalized to address the whole category in theory, but it still fails under most common circumstances because all the constructs to which ASI can apply also (implicitly) have meta_include_prototype: true. That means they’ll consume whitespace (including newlines) and comments (including newlines) greedily. Putting meta_include_prototype: false in the asi scope effectively does nothing, then, and the only cases you can really handle right are those where the token which follows an ‘asi newline’ does so immediately (since you can use ^, in that case).

If we could use arbitrary length lookbehinds, this could be addressed with pretty good accuracy without big changes (‘am I, the unexpected token, preceded by a newline followed by zero or more whitespace & inline comment tokens, or what appears to be the tail end of a multiline comment? if so, asi can be considered applicable; pop’). Sublime requires lookbehinds to have fixed length (probably for good reason), so that isn’t an option.
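
The quoted check can be sketched outside sublime-syntax as a plain backward look from the offending token. This is only an illustration of the heuristic in JavaScript, not something Sublime can express; `asiMayApply` is a hypothetical helper, and it assumes `\n` is the only line terminator.

```javascript
// Only spaces, tabs, and complete single-line /* ... */ comments.
const ONLY_TRIVIA = /^(?:[ \t]|\/\*(?:(?!\*\/).)*\*\/)*$/;
// What looks like the tail of a multiline comment, then optional trivia.
const COMMENT_TAIL =
  /^(?:(?!\/\*|\*\/).)*\*\/(?:[ \t]|\/\*(?:(?!\*\/).)*\*\/)*$/;

// Heuristic: could a newline before the token at `index` allow ASI?
function asiMayApply(source, index) {
  if (index <= 0) return false;
  const lineStart = source.lastIndexOf('\n', index - 1) + 1;
  if (lineStart === 0) return false; // no newline anywhere before the token
  const prefix = source.slice(lineStart, index);
  return ONLY_TRIVIA.test(prefix) || COMMENT_TAIL.test(prefix);
}

console.log(asiMayApply('a = 1 2', 6));               // false
console.log(asiMayApply('a = 1\n  2', 8));            // true
console.log(asiMayApply('a = 1 /* x\n y */ 2', 17));  // true
console.log(asiMayApply('a = 1\nb 2', 8));            // false
```

The comment-tail case is a guess by design: text ending in `*/` with no opening `/*` on the same line is assumed to close a comment that started on an earlier line.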

A solution would seem to need to get pushed up to all the points where expressions might or might not continue. All such scopes would have to have meta_include_prototype: false and would add an include of a new scope that, on the basis of newline-including would-be-proto tokens, may shift into additional new ‘asi could happen’ and ‘asi cannot happen’ scopes. I’m not sure how realistic this is — I think it would need to extend its tendrils through everything — though it kinda makes sense that this would be so, since the ASI algorithm is essentially ‘those non-syntactic tokens? they’re syntactic now, maybe’.
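
A forward-scanning sketch of that idea: consume the would-be-prototype tokens after an expression and report which state the scan ends in. The state names `asi-could-happen` and `asi-cannot-happen` and the helper `asiStateAfter` are invented for illustration, and the trivia pattern only recognizes whitespace, `//` comments, and `/* */` comments.

```javascript
// Whitespace, a line comment (without its newline), or a block comment.
const TRIVIA = /[ \t\r\n]+|\/\/[^\n]*|\/\*[^]*?\*\//y;

// Scan trivia starting at `exprEnd`; any newline inside it flips the state.
function asiStateAfter(source, exprEnd) {
  let pos = exprEnd;
  let state = 'asi-cannot-happen';
  for (;;) {
    TRIVIA.lastIndex = pos;
    const match = TRIVIA.exec(source);
    if (match === null) break;
    if (match[0].includes('\n')) state = 'asi-could-happen';
    pos = TRIVIA.lastIndex;
  }
  return { state, nextTokenAt: pos };
}

console.log(asiStateAfter('1 /* a */ 2', 1));
// { state: 'asi-cannot-happen', nextTokenAt: 10 }
console.log(asiStateAfter('1 /* a\n b */ 2', 1));
// { state: 'asi-could-happen', nextTokenAt: 13 }
```

In a syntax definition this bookkeeping would have to live in every context where an expression might continue, which is the 'tendrils through everything' cost described above.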

@blake-regalia
Collaborator Author

blake-regalia commented Sep 30, 2019

Really interesting getting this deep into the syntax -- I wonder if this will be yet another aspect of the language that ultimately motivates us to transition to a mostly generated syntax def. We could easily experiment with the feasibility of recursively generating contexts for situations like this.

By the way, I ended up creating this API/extension for generating syntaxes to streamline my work with the LinkedData syntaxes package. The API/extension is largely informed by the techniques I picked up from you and working on this repo (e.g., else_pop, other_illegal, etc). Perhaps the most convenient feature of this library tho is the switch/goto directives, which create auto-lookahead variables for all the possible regexes that can match in the context to which they are transitioning. Overall, it brings the syntax much closer to the productions of a PEG parser.
