The tree-sitter packaging mess

25 August 2024

7 min read

Recently I got into packaging tree-sitter. TS (for short) is kind of a framework to write language parsers, which are mainly used for code editors.

As far as I know, TS started as a project from the Atom team. They wanted to substitute the regex-based grammars for something more powerful. TS grammars “understand” the indentifiers, as they are very similar to how languages parse the source code: by identifying variables, class names, etc. This is in contrast to the regex-based grammars, which can produce worse results. This video from 2018 is a very intersting watch, so I recommend you see it to know for about TS: https://www.youtube.com/watch?v=Jes3bD6P0To.

I have 2 uses for packaging TS and the grammars:

My own Neovim distribution. I do my packaging with Nix, but I want to move away from nixpkgs’s packaged plugins. TS is one of the dependencies that are not trivial to package.
Using the tree_sitter rust crate, which can be used to write your own TS applications. As of this writing, this blog post uses TS to do the syntax highlighting, instead of a regex-based engine.

Tree-sitter CLI and grammars

TS grammars are the main units that the TS project produces. A TS grammar is a repository dedicated to writing the parser for the project. For example, we can find TS grammars for C, Python, etc. The TS grammars can be compiled and loaded by your editor (more onto that later). We also have the TS CLI and libraries. The CLI is a simple Rust app, that can be used to set-up new TS grammars from scratch. Everything is clear?

Madness begins

Before we start, I recommend you open YouTube and set some background music while you read this, like this one: https://www.youtube.com/watch?v=IKwstbpvNMc.

JavaScript for the parsers

To begin with, grammars are written in JavaScript and compiled with Node. This might not be an issue by itself, and you could make the point of using TypeScript, but I haven’t written any grammar by myself. But this will be important later.

The JavaScript is compiled into C

Yes, you heard right. The JavaScript grammars declare the grammar, and then they are compiled into a huge C file. When you build this C file, the resulting binary exposes a symbol tree_sitter_lang that can be used by consumers to run the parser.

Copy-pasting C headers

As the generated C depends on some interface with the TS libraries, the CLI drops a tree_sitter.h that is used for the interface. They don’t rely on an installation of TS that is discovered with pkg-config or cmake, but rather drop this file on your grammar.

The generated C code and headers are tracked with git

I can only guess that for “portability”, the huge C file and TS headers are usually checked in into version control. This means, that for every change to the JavaScript grammar, you have to regenerate the C blob and vendor it in the repo.

The CLI tool is written in Rust

This is not a problem, but I just find it funny at this point that they can’t decide on a single language.

The Lisp queries

Oh, did I say that they can’t decide on a language? Spoke too soon! TS also uses Scheme, from the Lisp family. To my understanding, the queries are used by the TS clients to know how the “use” tokens created by the parser. We have 3 types of queries: highlights.scm, used for syntax highlights, injections.scm used for injecting one language into another, and locals.scm, which I don’t know what they are for. Why are these not also expressed in JavaScript? I also don’t know.

Pseudo-npm packaging with a

Because we are doing JS, we also have a package.json for each query. It is extended with off-spec keys, namely tree_sitter, which is used by the TS CLI to know when to apply the grammar. We even have some regex rules for that.

Grammar sub-directories

To make things harder, a single grammar repo might contain multiple grammars. This is the case for the TypeScript repo, which contains grammars for both typescript and tsx. The TS CLI also doesn’t recursively build all the grammars, why would you think it would do that! There are also some grammars that don’t have the package.json in the root of the repo, if you want to make things harder.

`node_modules`

The temptation is too high, and some grammar developers choose to include JS libraries from NPM with a simple npm install. In some way, it is not a problem when the C blob is pre-generated, but the grammars may also depend on queries from node_modules. And the choice of words is precise here: the grammar might depend on some query.scm located in a relative node_modules. I guess we can safely assume it comes from NPM, and it’s not a random file…

If you do packaging with Nix, you have probably know the pain of dealing with this.

Node packages running scripts and connecting to the internet

Funnily enough, as there is no concept of “systems” in NPM, the main tree_sitter JavaScript package connects to the internet to download the TS CLI that is needed for the current architecture, during its installation, ouch! Good thing we can pass --ignore-scripts to npm so that it doesn’t do it.

The TS CLI does weird things

During my testing, I saw that the CLI copies the compiled parser.so into ~/.cache. Don’t ask me why it does that, but I hope it can tell when to invalidate that cache.

Neovim ignores everything

Now we come to Neovim. TS support in neovim is provided by the nvim-treesitter plugin (no dash in the tree-sitter). Please, just forget everything you read in the last minutes, because nvim-treesitter ignores all of that. They only need that you sideload the compiled parser.so by yourself. Or you can use their :TSInstall command that downloads the grammar and compiles it. Because we know that won’t ever go wrong, and you won’t have store paths in your ELF binaries pointing to locked versions in some absolute path. No, totally, don’t use Nix.

Upstream vs downstream queries

And we come to one of the weird things in the TS ecosystem: the queries (Lisp files) that are developed in the grammar’s repository, are not used by Neovim. Instead, we have two copies of the same highlights.scm, that are mostly identical, but have diverged as different people work on them, and the Neovim queries might use specific queries only available in the editor.

Then you run into situations, where the upstream grammar developers don’t care about the grammar, because “why bother if Neovim already provides them”.

NxM problem with language libraries

At this point, it is very apparent that TS has a problem of packaging. But it is starting to make a bigger problem, as we have downstream consumers of the TS libraries. For example, you can use the Rust crate tree_sitter or the Python packaging, for your parsing needs.

But every language repository also distributes the grammars! We have the TS grammars in PyPI, in Crates.io, in NPM, Nuget and probably some more.

My last words

It is sad to see a project that was created by very intelligent people, fall in the packaging hell. I don’t know if the original authors of TS caused this, or people with less experience that came after. In any case, I think starting again from scratch is the only solution to this. And if you are the TS developer that is behind after all of this packaging, Linus has some words for you.

If you are interested in all of these findings, I’ve put some packaging for Nix in https://github.com/viperML/tree-sitter. It’s mostly stuff that I use for myself, but get in contact with me if you want to use it or make it more user-friendly.

Update in October 2024

All grammars just changed from using NPM’s package.json for storing their metadata, to using a new tree-sitter.json changes in tree-sitter-cpp.

Time to change all my build systems — hopefully they keep this new system and don’t change it again…