Tree-sitter 1.0 Checklist #930

Open
16 of 30 tasks
maxbrunsfeld opened this issue Feb 20, 2021 · 34 comments

@maxbrunsfeld
Contributor

maxbrunsfeld commented Feb 20, 2021

In the not-too-distant future, I'd like to bump Tree-sitter's version to 1.0, indicating a greater degree of stability and completeness. After that I'd like to regenerate all of the parsers in the tree-sitter GitHub org, and bump them to 1.0 as well. Before doing this, there are several important problems with the framework that I think should be fixed.

Tasks

  • Unicode character properties - Support ECMAScript unicode property escapes in regexes.

  • Partial Precedence Orderings - The integer precedence system makes some grammars shockingly difficult to maintain.

    • Enhance the precedence system to allow precedences to be expressed in a pairwise partial ordering instead of requiring a total ordering based on integers. (Allow precedences to be specified using strings and a partial ordering relation #939)
    • Update tree-sitter-javascript and tree-sitter-typescript to use this more flexible precedence scheme. Right now, the integer precedence system is making it very difficult to continue development of tree-sitter-typescript in particular, because of the mix of different conflicts between types and expressions.
    • Dynamic precedence should probably stay integer-only, for simplicity
  • Grammars with many fields, aliases - By historical accident, generated parsers use too small an integer type (uint8_t) for storing nodes' field and alias information. Parsers with large numbers of fields can cause integer overflows (Tree-sitter generates invalid code for grammars with large numbers of fields and/or aliases #511)

    • Start representing nodes' production_id as a uint16_t (Clean up parse table representation, use 16 bits for production_id #943)
    • Strategy - Decide whether we're going to bother to maintain backward compatibility with old generated parsers. If so, the library code will need to become a bit more complicated in order to consume both binary formats.
    • Grammars - Regenerate all the parsers with the new representation.
  • Fix issues with the get_column external scanner API (Fix the behavior of Lexer.get_column #978)

  • CLI Ergonomics

  • Mergeable Git Repos - Make it easier to collaborate on grammars by removing generated files from version control.

  • Documentation

    • Document the ability to match against supertypes in queries with the expression/identifier syntax.
    • Add more thorough explanations of LR conflicts, precedence, and dynamic conflict-resolution with GLR.
    • Make it clear how to use Tree-sitter for basic syntax highlighting without the tree-sitter-highlight Rust crate (just using tree queries directly).
    • Document the tags.scm queries used for code navigation on GitHub. (Document queries/tags.scm #660)
    • Create a CHANGELOG file and start maintaining it. (Wish: CHANGELOG #527)
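
To make the "Partial Precedence Orderings" item above more concrete, here is a minimal sketch of how a pairwise ordering differs from integer precedences. It is not Tree-sitter's API: the helper functions and the precedence names (`member`, `call`, `binary`, `type_annotation`) are hypothetical and only illustrate the idea that two names can be left unordered relative to each other.

```typescript
// Hypothetical sketch: none of these names come from Tree-sitter itself.
type Ordering = "higher" | "lower" | "unordered";
type Relation = Record<string, string[]>;

// Each inner array lists precedence names from highest to lowest.
// Names in different arrays are only related if the arrays share a name.
function buildRelation(chains: string[][]): Relation {
  const outranked: Relation = Object.create(null);
  for (const chain of chains) {
    chain.forEach((name, i) => {
      outranked[name] = (outranked[name] || []).concat(chain.slice(i + 1));
    });
  }
  return outranked;
}

// True if `a` outranks `b` directly or through intermediate names.
function outranks(rel: Relation, a: string, b: string, seen: string[] = []): boolean {
  if (seen.indexOf(a) !== -1) return false; // guard against cycles
  seen.push(a);
  const direct = rel[a] || [];
  if (direct.indexOf(b) !== -1) return true;
  return direct.some((mid) => outranks(rel, mid, b, seen));
}

function compare(rel: Relation, a: string, b: string): Ordering {
  if (outranks(rel, a, b)) return "higher";
  if (outranks(rel, b, a)) return "lower";
  return "unordered";
}

const relation = buildRelation([
  ["member", "call", "binary"],
  ["type_annotation", "binary"],
]);

console.log(compare(relation, "member", "binary")); // higher
console.log(compare(relation, "call", "type_annotation")); // unordered
```

With integer precedences, `call` and `type_annotation` would be forced into some relative order even though the grammar author never needed one; here the pair simply stays unordered.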

Stretch Goals

I'm recording these here even though they are a bit less urgent.

  • Incremental Parsing Perf - Enhance the external scanner API to allow for looser state comparisons, avoiding the catastrophic node-reuse failures seen in the HTML parser (Incremental parsing is ineffective when a new tag is opened tree-sitter-html#23)

    • Figure out if the new scanner function can be made optional (with the parser generator inspecting scanner.c to decide whether to link against a _compare function).
    • Update tree-sitter-html to use this API, improving its incremental performance
  • Native Library, WASM parsers - Add a compile-time option to link the C library against a standard WASM engine (V8, wasmtime, or wasmer). When this feature is enabled, allow the native library to load WASM parsers, marshaling the parse table into native memory, and using WASM execution only for the lexing phase. This will make it more useful to distribute parsers as pre-compiled .wasm files, instead of as C code. The performance cost should be small, because all of the expensive parsing operations will still be native. (Add optional WASM feature to the native library, allowing it to run wasm-compiled parsers via wasmtime #1864)

@maxbrunsfeld
Contributor Author

For anyone who is interested, please let me know if I've left important things off of this list ☝️ .

@maxbrunsfeld maxbrunsfeld pinned this issue Feb 20, 2021
@razzeee
Contributor

razzeee commented Feb 21, 2021

Reads like tag queries are not going to be a 1.0 feature?

@theHamsta
Contributor

An alternative to removing the generated files would be to let a CI bot push them automatically on master. Users could then create mergeable PRs because they would never need to change any generated files. In this repo https://github.com/neovim/nvim-lspconfig/blob/master/.github/workflows/docgen.yml users commit changes to a configuration and a bot updates the documentation after each push on master.

@razzeee
Contributor

razzeee commented Feb 21, 2021

I think #516 should also be addressed, even if the function is marked experimental? At least document the behavior.

@ahlinc
Contributor

ahlinc commented Feb 21, 2021

I would suggest reducing implicitness:

  • Provide dsl.js as a regular file shipped with the tree-sitter-cli npm package and make it possible to require it as a regular JS library. This would make it easier to extend and would reduce confusion for IDEs and their auto-completion. It would also be good to keep the current behavior, where dsl.js is embedded in the tree-sitter binary, for grammar files that don't require dsl.js explicitly; this would let the tree-sitter CLI keep working as a fairly standalone tool. It would also make it possible to separate out the grammar.json generation for an extended DSL, or to debug that generation more simply as a regular Node.js script.
  • On the subject of tree-sitter's independence, it would be good for tree-sitter to have an embedded JS runtime (#465), with a fallback to the system Node.js when explicitly requested through a CLI parameter. IMO a Deno library looks promising.

@ahlinc
Contributor

ahlinc commented Feb 21, 2021

Also, I saw that *.so files always have zeros in their version spec, like libtree-sitter.so.0.0. It would be good if the minimal ABI-compatible version were reflected in the *.so.X.X suffix somehow.

@dcreager
Contributor

Note that the version numbers in those file names aren’t the same as the 1.0 semver release that @maxbrunsfeld is proposing. If there are any backwards incompatible changes as part of putting together this release, we’d bump the SOVERSION to 1; if not, we’d keep it at 0. More details can be found here.

@maxbrunsfeld
Contributor Author

maxbrunsfeld commented Feb 22, 2021

@razzeee Tag queries are already done, but you're right that we still need to document them. I envision those mostly being documented in a GitHub-specific context, since there isn't much generally-useful functionality specific to Tags; it's mostly just a convention for tree queries that GitHub is using for code navigation. All of the broadly-useful stuff has been generalized into the query system. I added that to the TODOs around documentation though.

I think #516 should also be addressed, even if the function is marked experimental?

Yeah, you're right about that API being broken. I'm inclined to just address that for 1.0 by marking the function as half-baked. For our use cases, the API was only ever needed for the Haskell parser, and then we discontinued development of that parser because it was hard to find a good subset of the language that was amenable to parsing with a context-free grammar. It could definitely be made to work some day, but I think it's low-priority for us. There is still a bit of work to do to get it to play properly with incremental parsing.

Nevermind, this got fixed.

@razzeee
Contributor

razzeee commented Feb 22, 2021

@razzeee Tag queries are already done, but you're right that we still need to document them. I envision those mostly being documented in a GitHub-specific context, since there isn't much generally-useful functionality specific to Tags; it's mostly just a convention for tree queries that GitHub is using for code navigation. All of the broadly-useful stuff has been generalized into the query system. I added that to the TODOs around documentation though.

So you don't think tags make sense for others? I hoped that it would help move the queries towards the parser, so that multiple projects could consume and improve them.

I think #516 should also be addressed, even if the function is marked experimental?

Yeah, you're right about that API being broken. I'm inclined to just address that for 1.0 by marking the function as half-baked. For our use cases, the API was only ever needed for the Haskell parser, and then we discontinued development of that parser because it was hard to find a good subset of the language that was amenable to parsing with a context-free grammar. It could definitely be made to work some day, but I think it's low-priority for us. There is still a bit of work to do to get it to play properly with incremental parsing.

Understandable. Do I need to be worried about the incremental parsing bit? We moved our parser to use this on a regular basis now and it seemed good, after figuring out why it always got stuck...

@razzeee
Contributor

razzeee commented Feb 25, 2021

Nice stretch goals would be:

@ubolonton
Contributor

CLI commands - Add new pack and publish subcommands to the Tree-sitter CLI, for uploading tarballs and compiled .wasm files to the GitHub releases API.

This is awesome. Currently for Emacs, I have a custom package that compiles the grammar binaries for the 3 major platforms, and distributes them through GitHub Releases, in a single bundle. Having a standard tool for individual language package to do this on their own would be great.

Will the official language repositories start distributing these binaries through GitHub Releases as well? I think some GitHub actions on top of these subcommands would be very helpful for that.

@maxbrunsfeld
Contributor Author

@ubolonton I might not take on the automation of compilation and storage of binary files (except for wasm) right now. I was mostly planning to use GH releases to store tarballs of generated files like parser.c, to avoid having so many merge conflicts in development.

@WhyNotHugo

WhyNotHugo commented Mar 7, 2021

Add new pack and publish subcommands to the Tree-sitter CLI, for uploading tarballs and compiled .wasm files to the GitHub releases API.

I find this item problematic; what about tree-sitter implementations that are not hosted on GitHub? What's the plan on how those should be redistributed?

Never mind, I see now that this applies only to tree-sitters in this org.

@dcreager
Contributor

dcreager commented Mar 7, 2021

@WhyNotHugo Yes, to confirm, the plan is not to mandate any particular hosting platform. Those commands will be able to produce the generated artifacts without uploading them as a GitHub release.

@maxbrunsfeld
Contributor Author

@razzeee I think you're right that the get_column problem is important. It's especially relevant now that tree-sitter-haskell has been revived from the dead (thanks @tek). I believe I've addressed all of the problems with that API.

@razzeee
Contributor

razzeee commented Mar 12, 2021

While I agree, I feel it's disappointing that it took that for this to happen, as other grammars have been suffering from it. Still, thank you ❤️

@ahlinc
Contributor

ahlinc commented Mar 18, 2021

It would be awesome to automate the release process for all official tree-sitter tools, especially tree-sitter-cli, for all official bindings (Wasm, Rust, Node.js, Python, Haskell, Ruby), and for the Playground with its separately maintained parsers, and to keep all of them in sync with the core tree-sitter library releases. This would help reduce misunderstandings and situations where some things work in one place but not in another.

Versions: badges for the NPM and crates.io packages.

Bindings: badges for the wasm (npm), rust, node, python, haskell, and ruby bindings.

Notes

  • For now, installing tree-sitter-cli from the crate seems like a bad idea; the crate is stuck at a two-year-old version.
  • tree-sitter-highlight 0.19.2 does not compile with tree-sitter 0.19.5 (#1122) demonstrates that changes in tree-sitter's Rust binding require a version bump in every dependent crate that uses the changed parts. Otherwise there needs to be a CI check that tests that the latest release of each dependent builds against all equal or higher versions of the dependency.
  • I can't speak for all bindings, but the Node and Python bindings link statically to the tree-sitter core library, which means they lag behind the core library and don't receive core fixes and logic improvements in step with it. IMO that's an important reason why such updates need to be automated. This doesn't solve the problem of covering the core library's features, but at least bug fixes would be delivered in time.

@XVilka
Contributor

XVilka commented Apr 7, 2021

I am not sure if this is actually possible - it would be also awesome if generated parser/runtime never segfaults. Showing errors, warnings, exiting - yes, but never segfaulting.

@maxbrunsfeld
Contributor Author

I am not sure if this is actually possible - it would be also awesome if generated parser/runtime never segfaults.

Obviously the library should never segfault. AFAIK, that's already the case. I think you're referencing tree-sitter/tree-sitter-c#64, which I can't reproduce after stripping out third-party libraries.

If anyone is seeing Tree-sitter cause a segfault, and you can reproduce the problem, please report it.

@likern

likern commented May 10, 2021

For anyone who is interested, please let me know if I've left important things off of this list.

Add generated bindings for the Zig programming language. It's a successor to the C language.

It provides a lot of safety features, like Rust, and possibly more because of runtime checks.
It's very low-level, like C, but with the syntax, safety, and tooling of a modern language.
Very fast (faster than C)

@casouri

casouri commented Jul 24, 2021

tree-sitter should provide means to replace memory allocation functions at runtime. This allows us to link to tree-sitter as a library instead of embedding it.

@stevenbarragan

+1 for better error messages.
related comment

@CreatCodeBuild

Native Library, WASM parsers

I would love to use WASM in other runtimes. Currently I am only able to use WASM in JS, but I would like to use it in wasmer, and I don't want to use the C version because the same parser is run in different runtimes.

@oovm

oovm commented Oct 5, 2021

For the wasm target, how about wasm-bindgen, which can generate Rust and TypeScript bindings at the same time?

TypeScript typing is really useful when working with the VS Code LSP (Language Server Protocol).

@drwpow

drwpow commented Sep 7, 2022

Suggestion: ESM format

In the interest of an evergreen format for 1.0, I’d like to recommend ESM over CJS (e.g. basically just changing module.exports to export default). Now that ESM is the official module system of JS in all forms and is supported on the web and in Node.js, that’s a breaking change that would be easier to make sooner rather than later.

Happy to help with this if this is a desirable change! But just a suggestion I’ll leave to the author/maintainers to decide 🙂
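
As a rough illustration of the change suggested above, the sketch below shows a grammar definition exported as an ES module default export, with the current CommonJS form kept in a comment for comparison. The `grammar` function here is a local stand-in written only so the example runs on its own (the Tree-sitter CLI normally provides `grammar` as a global when it evaluates a grammar file), and the grammar name and rule are hypothetical.

```typescript
// Stand-in for the `grammar` function that the Tree-sitter CLI normally
// provides as a global; defined here only so the sketch is self-contained.
interface GrammarSpec {
  name: string;
  rules: Record<string, ($: unknown) => unknown>;
}

function grammar(spec: GrammarSpec): GrammarSpec {
  return spec;
}

// CommonJS form used by grammar files today:
//   module.exports = grammar({ name: "example", rules: { ... } });

// ES module form: the same object, exported as the default export.
const exampleGrammar = grammar({
  name: "example", // hypothetical grammar name
  rules: {
    source_file: (_$) => "hello", // hypothetical single-rule grammar
  },
});

export default exampleGrammar;
```

Whether the CLI would load such a file directly is a decision for the maintainers; the sketch only shows that the definition itself is unchanged by the switch.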

@maxbrunsfeld
Contributor Author

Suggestion: ESM format

Yeah, I've been thinking about this too @drwpow. I added this to the list, as well as an item about reducing our coupling to npm in general.

To clarify, do you want this to be WASM engine implementation agnostic, as per your link to wasm-c-api, or is it fine to just embed a specific WASM engine?

@lambdadog I started work on this issue in #1864. I ended up going with a solution that's specifically tied to wasmtime for now.

@kevinbarabash

@maxbrunsfeld I worked around the issue of having to check-in build files by running yarn install and yarn generate as part of the build.rs file. Thankfully yarn generate doesn't clobber this file. One issue I ran into is that binding.gyp cannot be checked in otherwise yarn install fails. I got around this by renaming it to real-binding.gyp and then copying it to binding.gyp after running yarn install. This seems to work even if it is a bit janky. See escalier-lang/escalier#288 to see this approach in action.

@tree-sitter tree-sitter deleted a comment from Kennobi19 Aug 20, 2023
@amaanq amaanq unpinned this issue Aug 30, 2023
@amaanq amaanq pinned this issue Sep 10, 2023
@xiaoma20082008

Maybe this issue, "Standardized node name", needs to be addressed before the release.

@ahlinc ahlinc mentioned this issue Nov 29, 2023
Closed
@dundargoc dundargoc added this to the 1.0 milestone Feb 6, 2024
gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Apr 15, 2024
The ABI break seemed to be unintentional, but adding a subslot will be
useful in the future as a break with version 1.0 of tree-sitter looks
to be planned.

Ref: tree-sitter/tree-sitter#930 (comment)
Bug: https://bugs.gentoo.org/930039
Signed-off-by: Matthew Smith <[email protected]>
@amaanq amaanq unpinned this issue May 7, 2024
@amaanq amaanq pinned this issue May 7, 2024