
A $0 graph traversal outperforms GPT-5.2 at finding bugs in PRs

Rohan Sharma · March 2026

We planted 141 bugs across 52 real pull requests in 5 open source repos. Then we ran three tools on them: our own, Greptile, and CodeRabbit.

The results surprised us.

Tool              | Recall | How it works
inspect + GPT-5.2 | 95.0%  | Graph triage, then LLM on top 10%
Greptile          | 91.5%  | Full LLM review (API)
CodeRabbit        | 56.0%  | Full LLM review (CLI)

Our tool, inspect, uses zero LLM calls for triage. It builds a dependency graph from tree-sitter parses, runs a BFS from each changed entity, counts how many things could break, and sorts by blast radius. The entire triage step takes 6ms per commit. No network. No tokens. No cost.

The LLM only sees the top entities. On a typical 8-file PR, that's 2 files instead of 8. On a 50-file PR, maybe 5 instead of 50. 92% fewer tokens, better recall.
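A minimal sketch of that selection step, assuming blast-radius scores have already been computed. The function name select_for_review, the scores, and every entity name except merge_entities and update_readme are made up for illustration; the 10% default mirrors the "top 10%" in the table above, and this is not inspect's actual code.

```python
import math


def select_for_review(blast_radius: dict[str, int], fraction: float = 0.10) -> list[str]:
    """Return the highest-risk entities, always keeping at least one."""
    # Sort by descending blast radius; break ties by name so the order is deterministic.
    ranked = sorted(blast_radius, key=lambda name: (-blast_radius[name], name))
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical scores for the entities changed in one PR.
scores = {
    "merge_entities": 12,
    "parse_config": 4,
    "format_output": 1,
    "update_readme": 0,
}

to_review = select_for_review(scores)           # default: top 10%, at least one entity
wider_review = select_for_review(scores, 0.5)   # top half
```

Only the entities in to_review would be handed to the LLM; everything else is skipped without spending a token.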

Why does this work?

Most code review tools send the entire diff to an LLM. The LLM reads everything, tries to understand everything, and comments on everything. The problem is that context windows are finite and the model has no signal for which changes are risky. When you send 500 lines, it can spend as much attention on a renamed variable as on a broken public API.

A dependency graph knows which functions call which. It knows that merge_entities() has 12 callers and update_readme() has zero. This isn't a heuristic or a guess. It's a BFS on a directed graph. Deterministic, reproducible, instant.

When you combine the two, the graph handles triage (what to look at) and the LLM handles judgment (is this actually wrong). Each does what it's good at.

The 100% number

Of the 141 planted bugs, 29 were classified as high-criticality: broken APIs, security flaws, data loss risks. The graph caught all 29. 100% recall on the bugs that actually matter.

It missed some low-severity issues (wrong comments, minor style things). That's fine. Those aren't what breaks production.

How the graph is built

Tree-sitter parses every tracked file into an AST. We extract entities: functions, classes, methods, structs. Then we analyze references and calls between them to build a cross-file dependency graph.
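To make the extraction step concrete, here is a rough sketch that swaps tree-sitter for Python's standard-library ast module, so it handles Python source only and a single file rather than a whole repo. The function build_call_graph and the sample source are hypothetical; the real tool's entity model is not shown in this post.

```python
import ast


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the local functions it calls by name."""
    tree = ast.parse(source)
    # Entities: top-level function definitions only (a simplification).
    functions = {
        node.name: node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    calls: dict[str, set[str]] = {name: set() for name in functions}
    for name, node in functions.items():
        for child in ast.walk(node):
            # Only direct calls like foo(...); attribute calls are ignored here.
            if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                if child.func.id in functions:
                    calls[name].add(child.func.id)
    return calls


SAMPLE = '''
def merge_entities(a, b):
    return a + b

def sync(a, b):
    return merge_entities(a, b)

def update_readme():
    return "docs"
'''

graph = build_call_graph(SAMPLE)
```

Here graph records that sync calls merge_entities, while update_readme neither calls nor is called by anything. A cross-file version would also need to resolve imports, which this sketch leaves out.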

When a commit changes an entity, we traverse the graph outward. Every entity reachable from the changed one gets counted. That count is the blast radius. High blast radius = high risk. Simple.
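The traversal can be sketched as a plain breadth-first search. This version assumes the graph is stored as a mapping from each entity to the entities that depend on it; that edge direction, the function name blast_radius, and the sample graph are assumptions for illustration, not the tool's actual data model.

```python
from collections import deque


def blast_radius(dependents: dict[str, list[str]], changed: str) -> int:
    """Count the entities reachable from `changed`, excluding `changed` itself."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for neighbor in dependents.get(current, []):
            if neighbor not in seen:  # guard against cycles and double counting
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) - 1


# Hypothetical graph: entity -> entities that depend on it.
dependents = {
    "merge_entities": ["sync", "resolve"],
    "sync": ["main"],
    "resolve": ["main"],
    "update_readme": [],
}

high = blast_radius(dependents, "merge_entities")  # reaches sync, resolve, main
low = blast_radius(dependents, "update_readme")    # reaches nothing
```

The seen set means main is counted once even though two paths lead to it, and the same input always gives the same count.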

This works across 21 languages. Same graph, same traversal, same scoring. The only thing that changes per language is the tree-sitter grammar.

inspect.ataraxy-labs.com · Benchmark dataset on HuggingFace