- In https://replicated.wiki/blog/partII this part is very interesting to me:
> Want to keep LLM .md files in a separate overlay, only make them visible on request? Also easy. CRDT gives the freedom in splitting and joining along all the axes.
I now have a bunch of layers of text / markdown: system prompts, AGENTS.md, SKILL.md, plus user tweaks or full-out replacements of these in every repo or subproject.
Then we want to do things like update the "root" system prompt and have that applied everywhere.
There are analogies in git, CMS templating systems, software package interfaces and versioning. Doing it all with plain text doesn't feel right to me.
Any other approaches to this problem? Or are Beagle, ASTs, and CRDTs really onto something here?
- mmm. interesting and fun concept, but it seems to me like the text is actually the right layer for storing and expressing changes since that is what gets read, changed and reasoned about. why does it make more sense to use asts here?
are these asts fully normalized or do (x) and ((x)) produce different trees, yet still express the same thing?
why change what is being stored and tracked when the language aware metadata for each change can be generated after the fact (or alongside the changes)? (adding transform layers between what appears and what gets stored/tracked seems like it could get confusing?)
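(For what it's worth, the normalization question has a quick answer in Python's stdlib `ast` module, used here purely as an illustration; tree-sitter, which Beagle uses, builds concrete trees where the extra parens would remain visible.)

```python
import ast

# In Python's abstract syntax tree, redundant parentheses are pure
# concrete syntax and disappear: both strings normalize to the
# identical tree.
t1 = ast.dump(ast.parse("(x)"))
t2 = ast.dump(ast.parse("((x))"))
print(t1 == t2)  # → True
```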
- > why does it make more sense to use asts here
For one, it eliminates a class of merge conflict that arises strictly from text formatting.
I always liked the idea of storing code in an abstract form, especially with editors that support edit-time formatting. I enjoy working on other people's code, but I don't think anybody likes the tedium of complying with style guides, especially ones enforced at the SCM level, which adds friction to creating even local, temporary revisions. This kind of thing would obviate that, which is why I also appreciate strict, deterministic formatters like rustfmt. Unison goes a little further, which is neat, but I think they're struggling to gain adoption because of it, even though I'm pretty sure they have better tooling for working outside the whole ecosystem. These decoupled tools are probably a good way to go.
I was messing around with a file-less paradigm that would present a source tree in arbitrary ways, like just showing individual functions, so the things you're working on are co-located rather than spread across files you have to switch between. Kind of like the old VB IDE.
- An AST based conflict resolver could eliminate the same kind of merge conflicts on a text based RCS
- Yeah I suppose that's true, too. You've got to do the conversion at some point. I don't know that you get any benefit from storing the text, transforming it to support whatever ops (deconflicting, etc.), and then transforming back to text, versus just storing the intermediate format. Ideally, this would all be transparent to the user anyway.
- For one merge, yes. The fun starts when you have a sequence of merges. CRDTs put ids on tokens, so things are a bit more deterministic. Imagine a variable rename or a whitespace change; it throws text diffing off completely.
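A toy model of that id-stamping (hypothetical, not Beagle's actual data structure) shows why a rename is harmless: the surface text changes but the identities don't.

```python
import itertools
from dataclasses import dataclass

_ids = itertools.count()

@dataclass
class Token:
    id: int    # stable identity, assigned once when the token is created
    text: str  # surface form, free to change later

def tokenize(src: str) -> list[Token]:
    # Hypothetical: give each token a fresh id on first parse.
    return [Token(next(_ids), t) for t in src.split()]

doc = tokenize("let total = a + b")

# A rename edits text in place; the ids survive, so a later merge
# can match tokens by identity instead of guessing from content.
for tok in doc:
    if tok.text == "total":
        tok.text = "sum"

print([(t.id, t.text) for t in doc])
```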
- I remember someone mentioning a system that operated with ASTs like this in the 70s or 80s. One of the affordances is that the source base did not require a linter. Everyone reading the code can have it formatted the way they like, and it would all still work with other people’s code.
- Related, I’d love an editor that’d let me view/edit identifier names in snake_case and save them as camelCase on disk. If anyone knows of such a thing - please let me know!
- This is actually possible with glasses-mode in Emacs: https://codelearn.me/2025/02/24/emacs-glasses-mode.html
I think it sees very little usage though.
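The round-trip such a view needs is mechanical. A sketch in Python (hypothetical helper names, not how glasses-mode works internally):

```python
import re

def to_snake(name: str) -> str:
    # camelCase -> snake_case for display
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

def to_camel(name: str) -> str:
    # snake_case -> camelCase when writing back to disk
    head, *rest = name.split("_")
    return head + "".join(w.capitalize() for w in rest)

print(to_snake("maxRetryCount"))    # → max_retry_count
print(to_camel("max_retry_count"))  # → maxRetryCount
```

The catch is that the mapping isn't always lossless: names with embedded acronyms (`parseHTML`) don't survive the round trip cleanly, which is probably one reason such features stay niche.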
- Sure. Presumably you could have localized source presentation, too.
But, yeah, I think a personalized development environment with all of your preferences preserved and that don't interfere with whatever the upstream standard is would be a nice upgrade.
- 100% agree. I think AST-driven tooling is very valuable (most big companies have internal tools akin to each operation Beagle provides, and Linux has Coccinelle/spatch, for example), but it's still easier to implement as a layer on top of source code than as the fundamental source of truth.
There are some clever things that can be done with merge/split using CRDTs as the stored transformation, but they're hard to reason about compared to just semantic merge tools, and don't outweigh the cognitive overhead IMO.
Having worked for many years with programming systems natively expressed as trees (often just operation trees and object graphs, discarding the notion of syntax completely), I can say this layer is incredibly difficult for humans to reason about, especially when it comes to diffs, and in the end you usually have to build a system that can produce and act on text-based diffs anyway.
I think there's some notion of these kinds of revision management tools being useful for an LLM, but again, at that point you might as well run them alongside (just perform the source -> AST transformation at each commit) rather than use them as the core storage.
- > but it's still just easier implemented as a layer on top of source code than the fundamental source of truth
Easier but much less valuable.
- you can parse the text at any time pretty much for free and use anything you learn to be smarter about manipulating the text. you can literally replace the default diff program with one that parses the source files to do a better job today.
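One concrete instance: difftastic (`difft`), a tree-sitter-based structural diff tool, can stand in for git's built-in diff today. A config sketch, assuming difft is installed and on PATH:

```shell
# Use difftastic for `git diff` output in this repo.
git config diff.external difft

# Or one-off, without touching config:
GIT_EXTERNAL_DIFF=difft git diff HEAD~1
```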
- This is the fundamental idea behind git - to fully compute/derive diffs from snapshots (commits) and to only store snapshots. While brilliant in some ways - particularly the simplifications it allows in terms of implementation, I’ve always felt that dropping all information about how a new commit was derived from its parent(s) was wasteful. There have been a number of occasions where I wished that git recorded a rename/mv somehow - it’s particularly annoying when you squash some commits and suddenly it no longer recognizes that a file was renamed where previously it was able to determine this. Now your history is broken - “git blame” fails to provide useful information, etc. There are other ways of storing history and revisions which don’t have this issue - git isn’t the end of the line in terms of version control evolution.
- CRDT's trick is metadata. Good old diff guesses the changes by solving the longest-common-subsequence problem. There is always some degree of confusion as changes accumulate. CRDTs can know the exact changes, or at least guess less.
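Python's difflib, which takes a similar longest-matching-subsequence approach, shows the guesswork directly: given a duplicated line, the differ has no way to know which copy is the new one, so it just picks one.

```python
import difflib

old = ["a = 1", "b = 2", "c = 3"]
new = ["a = 1", "b = 2", "b = 2", "c = 3"]

# The differ cannot tell whether the first or the second "b = 2"
# was inserted; lines have no identity, only content, so it guesses.
sm = difflib.SequenceMatcher(a=old, b=new)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(tag, old[i1:i2], "->", new[j1:j2])
```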
- One nice thing about serializing/transmitting AST changes is that it makes it much easier to compose and transform change sets.
The text based diff method works fine if everyone is working off a head, but when you're trying to compose a release from a lot of branches it's usually a huge mess. Text based diffs also make maintaining forks harder.
Git is going to become a big bottleneck as agents get better.
- what do you actually gain over enforced formatting?
first, you should not be composing releases at the end from conflicting branches; you should be integrating branches and testing each one in sequence, then cutting releases. if the base changes under a branch, that branch has to be updated and re-tested. simple as that. storing changes as normalized trees rather than normalized text doesn't really buy you anything except maybe slightly smarter automatic merge conflict resolution, and even then the result needs to be analyzed and tested.
- Diffs are fragile, and while I agree with that process in a world where humans do all the work and you aren't cutting a dozen different releases, I think that's a world we're rapidly moving away from.
- in that case you probably flag a bunch of prs for release and it linearizes their order and rebases and tests each one a step ahead of your review (responding to any changes you make as you go).
- Having a VCS that stores changes as refactorings combined with an editor that reports the refactorings directly to the VCS, without plain text files as intermediate format, would avoid losing information on the way.
The downside is tight coupling between VCS and editor. It will be difficult to convince developers to use anything else than their favourite editor when they want to use your VCS.
I wonder if you can solve it the language-server way, so that each editor that supports refactoring through language-server would support the VCS.
- A somewhat similar project is Unison:
- Mildly pedantic, but AST already stands for Abstract Syntax Tree, so the post title when unabbreviated is Abstract Syntax Tree trees.
- Can it store my PIN numbers and my map of ATM machines also?
- was about to point that out, you beat me to it
- I rushed so much that I didn't have time to do it right. It could have been the AST tree of my PIN number validation algorithm for ATM machines. :-P
- I don't think your original post suffered for the lack of one more TLA acronym.
- Right! I had to get up in the morning, at ten o'clock at night, half an hour before I went to bed... sorry, wrong sketch. I had to set up my PIN number to display on the LCD display of an ATM machine with the instructions printed in PDF format telling me how to add VAT tax all before midday GMT time.
And you try and tell the young people of today about RAS syndrome, they won't believe you!
- The linked page looks like a subsystem of some specific library, I am not sure if it is intended for general use.
If it were intended as a general replacement for general-purpose version control systems, I am not sure how storing the AST is better than storing the original plain text files, since the transformation from text to AST might be lossy. I might want to store files with no AST (e.g. plain text files), files with multiple ASTs (e.g. polyglots), multiple files with the same AST (e.g. files testing different code layouts), or files with broken ASTs (e.g. data files used as test cases). These use cases are trivially supported by storing the original file as is, whereas storing any processed form of the file would require extra work.
- (Author) There is a fall-back general-text codec: tokens, no AST (e.g. for Markdown). If that fails (non UTF8), there is the general-blob final-fallback codec (the git mode).
The way it builds the AST is non-lossy. Additionally, it stamps ids on the nodes, so merges do not get confused by renames, formatting changes, and the like. There is value in preserving structure this way that repeat parsing cannot provide. In big-O terms, working with such an AST and a stack of its patches is not much different from the stacks of binary diffs git uses.
If I have k independent changesets, I have k^2 unplanned interactions and 2^k unplanned change combinations. With a bunch of changesets I have not fully evaluated yet, especially in relation to one another, I would like k-way merges and repeat merges to be seamless, non-intrusive, and deterministic. git's merges are not.
The project is experimental at this point.
- AST of what? Will it read my clojure code's forms as such? What if my source file has a paren balancing error? I feel I'm thinking of this at the wrong level/angle.
- I cannot remember a case, in the last 10 years at least, when I committed code that does not compile. Why should I share that? Also, tree-sitter sort of handles that.
- > code that does not compile. Why should I share that?
If you collect test cases for compilers, for example.
> tree-sitter sort of handles that
My worry is that the stability of committed ASTs would depend on tree-sitter being stable, and it might be difficult to guarantee that for languages that are still in flux. Even the most well-established languages gain new grammar once every few years, sometimes in backward-incompatible ways.
Maybe you meant tree-sitter itself will also be versioned inside this repository?
- Tree-sitter can parse somewhat-bad code.
Also, there is an option to pick a codec for a particular file. Might use tree-sitter-C, might use general-text. The only issue here, you can't change the codec and keep nice diffs.
So, these cases are handled.
- It leans on tree-sitter for language handling, so I wonder if they're actually Concrete Syntax Trees.
- This sounds good in theory, but it means Beagle needs to understand how to parse every language, and keep up with how they evolve. This sounds like a ton of work and a regression could be a disaster. It'll be interesting to see how this progresses though.
- IMO this really isn’t a huge problem for this project specifically, since that part is outsourced to tree-sitter which has a lot of effort behind it to begin with.
I think this project is incredibly cool as a line of research / thought but my general experience in trying to provide human interfaces using abstractions over source code suggests that most people in general and programmers especially are better at reasoning in the source code space. Of course, beagle can generate into the source code space at each user interaction point, but at that point, why not do the opposite thing, which is what we already do with language servers and AST driven (semantic) merge and diff tools?
- It's also just one more facet. The problem already exists for anything else that we already have, like formatters, linters, syntax highlighters, language servers... And it's also not an exclusive choice. If you want to use a dumb editor, there's nothing preventing that. All of the machinery to go back and forth to text exists. Not really a huge departure.
- > AST driven (semantic) merge and diff tools?
Would you say these are commonly in use, and if so what are some "mainstream" examples? IME most people just use git's built-in diff/merge...
- I find Mergiraf pretty pleasant to use and frequently pretty helpful as a time-saver. Handles TOML and Rust for me, and I have way fewer manual interventions, especially after supplementing it with rustfmt rules to not do a bunch of merged use statements in one go. Easy to configure as a jujutsu tool as well.
- Glad to see this. We can do better than git.
- who is "we"? and "better" in what measure?
- Care to elaborate?
- https://github.com/Ataraxy-Labs/sem stores AST operations in git
- what bothers me is, while CRDTs converge, the question is to what. in this case, it seems like there's a last-write-wins semantic, which is very problematic as an implicit assumption for code (or anything where this isn't the explicit invariant)
- It makes a lot of sense for math-focused LLMs to work with higher order symbols - or context-dependent chunking - than tokens. The same is probably true for software.
From "Large Language Models for Mathematicians (2023)" (2025) https://news.ycombinator.com/item?id=42899805 :
> It makes sense for LLMs to work with testable code for symbolic mathematics; CAS Computer Algebra System code instead of LaTeX which only roughly corresponds.
> Are LLMs training on the AST parses of the symbolic expressions, or token co-occurrence? What about training on the relations between code and tests?
There are already token occurrence relations between test functions and the functions under test that they call. What additional information would it be useful to parse and extract and graph rewrite onto source code before training, looking up embeddings, and agent reasoning?