
Building Semantic Version Control in Rust

Rohan Sharma · April 2026

If you ask a developer what they changed in a commit, they'll say something like "I refactored the authentication handler" or "I added a retry to the upload function." They never say "I changed lines 47 through 52," because nobody thinks about code that way. But git does: git has no idea what a function is, or a class, or a method. It just sees lines of text and diffs them the same way it would diff a grocery list.

This is fine most of the time. Git is a brilliant piece of engineering, and the line abstraction has carried us remarkably far, but there's a ceiling to what you can build on top of it, and I keep hitting that ceiling in increasingly painful ways: merge conflicts that aren't real conflicts, diffs full of noise from reformatting, code review tools that dump the entire changeset into an LLM and hope for the best. These aren't bugs in git; they're consequences of the abstraction it chose.

I've been building sem to try a different abstraction. It's a version control CLI in Rust that operates on entities: functions, classes, methods, interfaces. Its companion, weave, is a merge driver that uses the same entity model to resolve conflicts git can't. This post is about the algorithmic decisions I made along the way, and why Rust was the right language for them.

Choosing the right atom

The first question you have to answer when building something like this is: what's the unit of work? You could operate on tokens, or AST nodes, or lines, or files. Each choice implies a different set of tradeoffs, and it's worth thinking carefully about which ones you want.

Files are too coarse: when you say "this file changed," you've thrown away most of the interesting information, and a thousand-line file where someone fixed a typo in one function looks the same as a thousand-line file where someone rewrote the core algorithm. Lines are too fine: a single logical change, like adding a parameter to a function signature and updating the body to use it, spans multiple lines but is semantically one thing. And individual AST nodes are too numerous and too low-level to be useful for anything except compiler internals.

The entity sits in a sweet spot: a function, a class, a method, a type definition. It's the unit that humans naturally think in when they talk about code, and it's also the unit that captures meaningful structure because an entity has a name, a type, dependencies on other entities, and a body that can be hashed and compared. My core data model reflects this directly:

pub struct SemanticEntity {
    pub id: String,
    pub name: String,
    pub entity_type: String,
    pub content: String,
    pub start_line: usize,
    pub end_line: usize,
    pub structural_hash: u64,
    pub parent_id: Option<String>,
    pub dependencies: Vec<String>,
}

One field here deserves special attention: parent_id creates a hierarchy. When I parse a class, I also extract each of its methods as separate entities whose parent_id points back to the class. This turns out to be crucial for merging, because it means two developers editing different methods in the same class produce two independent changes rather than one monolithic conflict. Git can't see that distinction; it just sees one contiguous block of text that both sides touched.
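To make the hierarchy concrete, here's a minimal sketch with a trimmed-down entity struct (not the full model above; the ID strings are illustrative) showing how a class and its methods relate through parent_id:

```rust
// Trimmed-down version of the entity model, just enough to show the hierarchy.
#[derive(Debug, Clone)]
pub struct Entity {
    pub id: String,
    pub name: String,
    pub entity_type: String,
    pub parent_id: Option<String>,
}

/// Collect the direct children of a container entity, e.g. the methods of a class.
pub fn children_of<'a>(entities: &'a [Entity], parent: &str) -> Vec<&'a Entity> {
    entities
        .iter()
        .filter(|e| e.parent_id.as_deref() == Some(parent))
        .collect()
}
```

With this shape, a merge driver can treat each method as an independent unit: two edits to different children of the same class never have to collide at the class level.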

Why tree-sitter, and why one trait per language

To parse code into entities you need a parser, and I considered three approaches: regular expressions, which are fast but fragile and break on edge cases constantly; language-specific compilers, which are accurate but would require embedding rustc, the Go compiler, javac, and so on; and tree-sitter, which turned out to be the right choice. Tree-sitter gives you concrete syntax trees that are good enough for structural analysis, it works on partial and broken code, and it has grammars for essentially every language you'd encounter in a real codebase.

Each grammar is a separate Rust crate, and I currently pull in grammars for Rust, Python, TypeScript, JavaScript, Go, Java, C, C++, Ruby, Swift, Kotlin, and C#, among others. Every grammar produces a tree with language-specific node kinds: the Rust grammar gives you function_item where Python gives you function_definition and Go gives you function_declaration. I handle this with a simple trait: each language plugin maps its grammar's node kinds to entity types, while the rest of the pipeline remains entirely language-agnostic.

fn extract_entities(tree: &Tree, source: &str) -> Vec<SemanticEntity> {
    let mut cursor = tree.root_node().walk();
    let mut entities = Vec::new();
    for node in tree.root_node().children(&mut cursor) {
        match node.kind() {
            "function_item" | "function_definition" => {
                entities.push(build_entity(node, source, "function"));
            }
            "class_definition" | "impl_item" => {
                let class_entity = build_entity(node, source, "class");
                let class_id = class_entity.id.clone();
                entities.push(class_entity);
                extract_children(node, source, &class_id, &mut entities);
            }
            _ => {}
        }
    }
    entities
}
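The per-language trait itself isn't shown above, but a minimal sketch of the shape I'm describing (names like LanguagePlugin and EntityKind are hypothetical) looks like this: each plugin maps node-kind strings to a shared classification, and everything downstream sees only the shared type.

```rust
/// Shared entity classification used by the language-agnostic pipeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntityKind {
    Function,
    Class,
    Other,
}

/// One implementation per language: map tree-sitter node kinds to EntityKind.
pub trait LanguagePlugin {
    fn classify(&self, node_kind: &str) -> EntityKind;
}

pub struct RustPlugin;
impl LanguagePlugin for RustPlugin {
    fn classify(&self, node_kind: &str) -> EntityKind {
        match node_kind {
            "function_item" => EntityKind::Function,
            "impl_item" | "struct_item" => EntityKind::Class,
            _ => EntityKind::Other,
        }
    }
}

pub struct PythonPlugin;
impl LanguagePlugin for PythonPlugin {
    fn classify(&self, node_kind: &str) -> EntityKind {
        match node_kind {
            "function_definition" => EntityKind::Function,
            "class_definition" => EntityKind::Class,
            _ => EntityKind::Other,
        }
    }
}
```

Supporting a new language means writing one more impl block like these; the extractor, hasher, and graph builder never change.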

The design choice to keep parser plugins as thin wrappers around tree-sitter node matching paid off quickly. Adding a new language is just a matter of writing one function that says "this node kind is a function, this one is a class, these are its children." Everything downstream (the hashing, the diffing, the graph building, the merging) doesn't need to know or care what language it's looking at, which is how I went from supporting 4 languages to 26 in about two weeks.

Deciding when code actually changed

Here's a problem that seems simple until you think about it carefully: someone reformats a file and adds a few comments, and git shows a diff with dozens of changed lines. Did the code actually change? In any meaningful sense, no; the program does exactly what it did before.

I wanted a way to tell the difference between cosmetic changes and functional ones, and the answer turned out to be a structural hash. The idea is to normalize the code before hashing it, stripping all comments and collapsing whitespace, and then compute a hash over what remains. If two versions of an entity produce the same structural hash, they're functionally identical regardless of how different the raw text looks.

use xxhash_rust::xxh64;

fn structural_hash(content: &str) -> u64 {
    let normalized = content
        .lines()
        .filter(|line| !is_comment_line(line))
        .map(|line| line.trim())
        .collect::<Vec<_>>()
        .join("\n");
    xxh64::xxh64(normalized.as_bytes(), 0)
}

// Naive check, good enough for the common cases: a line is comment-only
// if it starts with a line-comment marker after trimming.
fn is_comment_line(line: &str) -> bool {
    let t = line.trim_start();
    t.starts_with("//") || t.starts_with("#") || t.starts_with("/*")
}

I chose xxHash64 over SHA-256 or BLAKE3 because I don't need cryptographic properties here. I'm not building a content-addressable store; I just need fast, low-collision hashing of code-sized inputs, typically a few hundred bytes to a few kilobytes, and xxHash64 runs at memory-bandwidth speeds, which means hashing is never the bottleneck in the pipeline. The more interesting decision was what to normalize. I strip comments and whitespace, but I don't normalize identifier names, because renaming a public function is a meaningful change even if the body stays the same. Where you draw that line determines what your tool considers "noise" versus "signal," and getting it wrong in either direction makes the tool useless.
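To see the normalization in action, here's a self-contained demo. It substitutes the standard library's DefaultHasher for xxHash64 so it runs without external crates, and its comment check only handles //-style line comments; the normalization shape is the same as above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Normalize code (drop // comment lines, trim each line, drop blanks)
/// and hash the result. The real tool uses xxHash64; DefaultHasher
/// stands in here so the demo needs no external crates.
pub fn structural_hash_demo(content: &str) -> u64 {
    let normalized = content
        .lines()
        .map(|line| line.trim())
        .filter(|line| !line.is_empty() && !line.starts_with("//"))
        .collect::<Vec<_>>()
        .join("\n");
    let mut hasher = DefaultHasher::new();
    normalized.hash(&mut hasher);
    hasher.finish()
}
```

A reformatted, commented copy of a function hashes identically, while renaming an identifier changes the hash, which is exactly the noise/signal line described above.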

The matching problem

When you diff two versions of a file, you need to figure out which entity in the old version corresponds to which entity in the new version. This is harder than it sounds, because entities can be renamed, moved within the file, or modified so heavily that they barely resemble their former selves. You can't just match by line number, because everything shifts when someone adds a function at the top of the file, and you can't just match by name, because people rename things.

I settled on a three-phase approach that gets progressively more expensive but also progressively more tolerant of changes. Phase one matches by entity ID, which is a combination of file path, entity type, and name; this is O(n) with a hash map and catches every entity that wasn't renamed or moved. Phase two takes the remaining unmatched entities and looks for pairs with identical structural hashes, i.e. the same code under a different name or location, which catches renames and moves and is also O(n). Phase three is the expensive one: for whatever is still unmatched, I compute a Jaccard similarity coefficient over the token sets of each pair.

The Jaccard coefficient measures overlap between two sets: for entities a and b, you tokenize their content by whitespace and compute J(a, b) = |tokens(a) ∩ tokens(b)| / |tokens(a) ∪ tokens(b)|. A score of 1.0 means identical token sets, a score above 0.6 means the entities share enough structure to be considered the same entity after refactoring, and below that I treat them as unrelated.

The reason for the phased approach is computational: Jaccard comparison is O(n²) in the number of unmatched entities, which would be expensive if you ran it on everything, but phases one and two are cheap and typically resolve the vast majority of matches, so by the time you get to phase three n is small, and in practice I almost never run Jaccard on more than a handful of entities per file, which means the whole pipeline feels instant even on large diffs.

use std::collections::HashSet;

fn jaccard_similarity(a: &str, b: &str) -> f64 {
    let tokens_a: HashSet<&str> = a.split_whitespace().collect();
    let tokens_b: HashSet<&str> = b.split_whitespace().collect();
    let intersection = tokens_a.intersection(&tokens_b).count();
    let union = tokens_a.union(&tokens_b).count();
    if union == 0 {
        0.0
    } else {
        intersection as f64 / union as f64
    }
}
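Putting the three phases together, here's a simplified sketch of the matching driver. Entities are reduced to (id, structural_hash, content) tuples, it carries its own local copy of the Jaccard helper so it's self-contained, and the 0.6 threshold is the one from the text:

```rust
use std::collections::{HashMap, HashSet};

/// Minimal stand-in for an entity: (id, structural_hash, content).
type E = (String, u64, String);

fn jaccard(a: &str, b: &str) -> f64 {
    let ta: HashSet<&str> = a.split_whitespace().collect();
    let tb: HashSet<&str> = b.split_whitespace().collect();
    let union = ta.union(&tb).count();
    if union == 0 {
        return 0.0;
    }
    ta.intersection(&tb).count() as f64 / union as f64
}

/// Match old entities to new ones: phase 1 by id, phase 2 by structural
/// hash, phase 3 by token-set similarity above the 0.6 threshold.
pub fn match_entities(old: &[E], new: &[E]) -> Vec<(String, String)> {
    let mut matched = Vec::new();
    let mut used_new: HashSet<usize> = HashSet::new();

    // Phase 1: exact id match via a hash map, O(n).
    let by_id: HashMap<&str, usize> = new
        .iter()
        .enumerate()
        .map(|(j, e)| (e.0.as_str(), j))
        .collect();
    let mut unmatched: Vec<usize> = Vec::new();
    for (i, e) in old.iter().enumerate() {
        match by_id.get(e.0.as_str()) {
            Some(&j) => {
                matched.push((e.0.clone(), new[j].0.clone()));
                used_new.insert(j);
            }
            None => unmatched.push(i),
        }
    }

    // Phase 2: identical structural hash catches renames and moves, O(n).
    let by_hash: HashMap<u64, usize> = new
        .iter()
        .enumerate()
        .filter(|(j, _)| !used_new.contains(j))
        .map(|(j, e)| (e.1, j))
        .collect();
    let mut still_unmatched = Vec::new();
    for &i in &unmatched {
        match by_hash.get(&old[i].1) {
            Some(&j) if !used_new.contains(&j) => {
                matched.push((old[i].0.clone(), new[j].0.clone()));
                used_new.insert(j);
            }
            _ => still_unmatched.push(i),
        }
    }

    // Phase 3: O(n^2) Jaccard over what's left; n is tiny by now.
    for &i in &still_unmatched {
        let best = (0..new.len())
            .filter(|j| !used_new.contains(j))
            .map(|j| (j, jaccard(&old[i].2, &new[j].2)))
            .max_by(|x, y| x.1.partial_cmp(&y.1).unwrap());
        if let Some((j, score)) = best {
            if score > 0.6 {
                matched.push((old[i].0.clone(), new[j].0.clone()));
                used_new.insert(j);
            }
        }
    }
    matched
}
```

Each phase only sees what the cheaper phases couldn't resolve, which is why the quadratic phase almost never runs on more than a handful of entities.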

The dependency graph

If you had to pick one thing that makes entity-level version control genuinely different from line-level diffing, it would be the dependency graph. After parsing all the entities in a repository, I build a directed graph where each node is an entity and each edge means "this entity depends on that one": function A calls function B, class C inherits from class D, module E imports from module F. I use petgraph's DiGraph for this, mostly because it gives me clean APIs for graph reversal and traversal without having to think about the underlying adjacency list representation.

use std::collections::HashMap;
use petgraph::graph::{DiGraph, NodeIndex};
use petgraph::visit::{Bfs, Reversed};

pub struct EntityGraph {
    graph: DiGraph<String, ()>,
    node_map: HashMap<String, NodeIndex>,
}

impl EntityGraph {
    /// Everything that transitively depends on `entity_id`:
    /// BFS over the reversed edges.
    pub fn dependents(&self, entity_id: &str) -> Vec<String> {
        let start = self.node_map[entity_id];
        let reversed = Reversed(&self.graph);
        let mut bfs = Bfs::new(&reversed, start);
        let mut result = Vec::new();
        while let Some(node) = bfs.next(&reversed) {
            if node != start {
                result.push(self.graph[node].clone());
            }
        }
        result
    }
}

The graph answers a question that no amount of text diffing can: if you change this function, what else might break? I call the answer the "blast radius," and it's computed by running BFS on the reversed graph starting from the changed entity; every node you reach is something that transitively depends on the thing you changed. An entity with a blast radius of zero can be modified freely, while an entity that half the codebase depends on needs to be touched with great care. This is the kind of information that's obvious to a senior developer who knows the codebase by heart, but invisible to a line-level diff, and prohibitively expensive for an LLM to figure out by reading source code.

The reason I use BFS rather than DFS is subtle but matters: BFS gives you results ordered by distance from the changed entity, direct callers first, then their callers, and so on outward. This ordering is useful for prioritizing review, because a direct dependent is more likely to be affected by a change than something five hops away, whereas DFS would give you an arbitrary ordering that's harder to reason about.
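The distance ordering is easy to demonstrate with a crate-free sketch that uses a plain adjacency map instead of petgraph. Edges point from an entity to its dependencies, so the blast radius walks the reversed edges, and BFS naturally yields results grouped by hop distance:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// BFS over reversed dependency edges. `deps` maps an entity to the
/// things it depends on; the result is every transitive dependent of
/// `start`, paired with its hop distance, nearest first.
pub fn blast_radius(deps: &HashMap<&str, Vec<&str>>, start: &str) -> Vec<(String, usize)> {
    // Reverse the edges: dependency -> dependents.
    let mut rev: HashMap<&str, Vec<&str>> = HashMap::new();
    for (&from, targets) in deps {
        for &to in targets {
            rev.entry(to).or_default().push(from);
        }
    }
    let mut seen: HashSet<&str> = HashSet::new();
    seen.insert(start);
    let mut queue: VecDeque<(&str, usize)> = VecDeque::new();
    queue.push_back((start, 0));
    let mut result = Vec::new();
    while let Some((node, dist)) = queue.pop_front() {
        if node != start {
            result.push((node.to_string(), dist));
        }
        if let Some(dependents) = rev.get(node) {
            let mut next: Vec<&str> = dependents
                .iter()
                .copied()
                .filter(|d| !seen.contains(*d))
                .collect();
            next.sort(); // deterministic order within a BFS level, for the demo
            for d in next {
                seen.insert(d);
                queue.push_back((d, dist + 1));
            }
        }
    }
    result
}
```

A reviewer (or an agent) can then walk the result front to back, looking at direct callers before anything further out.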

Parallelism as a natural consequence

One of the things that made Rust the right choice for this project is that the parsing problem is embarrassingly parallel in a way that Rust makes trivially safe to exploit. Each file is independent and there's no shared mutable state during parsing: you read the file, build a tree, extract entities, and return them. With Rayon, the whole thing is a one-line change from sequential to parallel:

use std::path::PathBuf;
use rayon::prelude::*;

fn parse_repository(files: Vec<PathBuf>) -> Vec<SemanticEntity> {
    files
        .par_iter()
        .flat_map(|path| {
            let source = std::fs::read_to_string(path).ok()?;
            let lang = detect_language(path)?;
            let mut parser = get_parser(lang);
            let tree = parser.parse(&source, None)?;
            Some(extract_entities(&tree, &source))
        })
        .flatten()
        .collect()
}

What's nice about this is that I didn't have to think about thread safety at all. The borrow checker guarantees that nothing in the parsing pipeline can cause a data race, since none of the closures capture mutable references to shared state; Rayon's work-stealing scheduler handles load balancing automatically, so large files that take longer to parse don't create stragglers; and the only synchronization point is the final .collect(), which assembles all the per-file entity lists into one big vector. In practice this means a large repository with thousands of files parses in seconds, and I get that performance for free, without a single mutex or atomic.

Merging at the right level

The companion tool, weave, takes the same entity model and uses it to merge. It registers as a custom merge driver in .gitattributes, so git calls it automatically during merges, rebases, and cherry-picks. The algorithm takes the three versions of a file that git provides (base, ours, theirs), parses each into entities, matches them across versions using the same three-phase algorithm, and then makes a decision for each entity: if only one side changed it, take that side's version; if neither side changed it, keep the base; and if both sides changed it, you have a real conflict, which is where the interesting algorithmic decision lives.
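The per-entity decision table is simple enough to state in code. Here's a sketch (names hypothetical) that detects "changed" by comparing structural hashes; the same-change-on-both-sides arm is my addition to the rules stated above, since identical edits need no conflict:

```rust
/// Outcome of a three-way merge for a single entity.
#[derive(Debug, PartialEq, Eq)]
pub enum MergeDecision {
    KeepBase,
    TakeOurs,
    TakeTheirs,
    Conflict, // both sides changed: try inner merge, else fall back to git
}

/// Decide per entity from the three structural hashes of base/ours/theirs.
pub fn decide(base: u64, ours: u64, theirs: u64) -> MergeDecision {
    match (ours == base, theirs == base) {
        (true, true) => MergeDecision::KeepBase,
        (false, true) => MergeDecision::TakeOurs,
        (true, false) => MergeDecision::TakeTheirs,
        // Both sides made the same change: no conflict, either copy works.
        (false, false) if ours == theirs => MergeDecision::TakeOurs,
        (false, false) => MergeDecision::Conflict,
    }
}
```

Because the decision runs per entity rather than per file, only entities that land in Conflict ever need the inner-merge machinery.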

Rather than giving up and dumping conflict markers like git does, weave tries one more level of decomposition. If the conflicting entity is a container, like a class with methods, it decomposes it into its children using the parent_id hierarchy and attempts to merge at that level instead. I call this "inner merge," and it means two developers who edited different methods in the same class get a clean merge, because at the method level there's no conflict at all. Git can't do this because it doesn't know where one method ends and the next begins; it just sees a block of text that both sides touched.

The inner merge decision is conservative by design: if decomposing a container into sub-entities doesn't resolve the conflict cleanly, I fall back to git's line-level merge for that entity, because the cardinal rule is that weave must never produce a worse result than git. Either it resolves something git couldn't, or it gets out of the way and lets git handle it. That makes the tool safe to enable globally, since the worst case is identical to not using it at all.

Why Rust

I could have built this in Python or TypeScript (tree-sitter has bindings for both), but three things pushed me toward Rust, and in retrospect all three turned out to matter more than I expected.

First, the type system. The entity model has a lot of invariants: parent IDs must reference real entities, dependency edges must point to existing nodes, structural hashes must be computed before matching. In Python these invariants would be enforced by convention and broken by accident; in Rust they're enforced by the compiler, and I've never had a bug where a parent ID pointed to nothing, because the type system won't let you construct that state.

Second, performance is table stakes for developer tools: if sem diff took two seconds instead of feeling instant, nobody would use it. Rust gives me the performance of C without the memory-safety footguns, Rayon gives me parallelism without the concurrency bugs, and the whole tool feels snappy in a way that matters for adoption.

Third, Cargo workspaces. I have three tools (sem, weave, inspect) that share a core library, and Cargo workspaces let me keep them in one repository with shared compilation, independent versioning, and a single cargo test that runs everything. The dependency graph between my own crates mirrors the dependency graph I build for user code, which is a satisfying kind of dogfooding.

sem/
  crates/
    sem-core/   # Entity model, parsers, graph, diffing
    sem-cli/    # CLI: sem diff, sem blame, sem graph, sem impact
    sem-mcp/    # MCP server for AI agent integration

Try it

# Install
brew install sem-cli
brew install weave

# Entity-level diff
sem diff HEAD~1

# What depends on this function?
sem graph src/main.rs

# What breaks if this function changes?
sem impact src/parser.rs::parse_file

# Enable semantic merging for your repo
weave install

Both are open source and available on Homebrew: sem for entity-level diffing, blame, and impact analysis, and weave for entity-level merging.