Coderev
Building a compiler-grade code intelligence engine in Rust — AST parsing, symbol graphs, and local AI for semantic search.
- DATE: FEB.04.2026
- READ: 20 MIN
The problem with searching code
Every developer has done this: you’re working in an unfamiliar codebase, you need to understand what calls authenticate_user(), and you reach for grep. You get 47 results — imports, comments, test mocks, string literals, actual invocations, and a variable named re_authenticate_user_token. None of them tell you what you actually want: the call graph.
The standard toolchain gives you two choices. Text search (grep, ripgrep) is fast but structurally blind — it treats code as flat strings. RAG (chunk files, embed, cosine-search) gives semantic fuzzy matching but loses structural precision — it’ll return a paragraph that mentions authentication without telling you which function calls which.
Coderev takes a third path: parse code with a real grammar, build a typed symbol graph, then layer semantic search on top. The result is a system that answers “what calls this function?” with compiler-grade accuracy, and “where is authentication handled?” with embedding-based recall. Neither grep nor RAG can do both.
Architecture: four phases
The indexing pipeline has four phases, each building on the previous one’s output. All results converge into a single SQLite database.
Phase 1: AST extraction
Coderev uses Tree-sitter to parse source files into concrete syntax trees. Each supported language (Python, JavaScript, Rust, Go) has a .scm query file that defines what to extract:
;; python.scm — extract function definitions
(function_definition
name: (identifier) @callable.name
parameters: (parameters) @callable.params
body: (block) @callable.body) @callable.def

The query adapter walks the AST and produces symbols (functions, classes, variables, imports) and edges (calls, contains, references). It also builds a scope graph per file — mapping which symbols are visible where — and flags any reference it can’t resolve locally as an unresolved reference.
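Definitions are only half the picture: a companion query can capture call sites so the adapter has raw material for Calls edges. A sketch of what such a query might look like against the tree-sitter Python grammar (coderev's actual query files may differ):

```scheme
;; python.scm - capture call sites for edge extraction

;; Plain calls: foo(...)
(call
  function: (identifier) @call.name) @call.site

;; Method calls: obj.foo(...) - the attribute node carries
;; both the receiver and the method name
(call
  function: (attribute
    object: (_) @call.receiver
    attribute: (identifier) @call.name)) @call.site
```

Capturing the receiver separately is what lets Phase 2's global linker filter method-call candidates by receiver type.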
Phase 2: global linker
The global linker resolves cross-file references using a three-step algorithm. This is where coderev diverges from text search — it doesn’t match names, it resolves symbols through the same scoping rules a compiler would use.
Step 1 — Local: search the same file’s symbol table. If foo() is called and def foo() exists in the same file, resolve immediately.
Step 2 — Import: if no local match, follow the import chain. from auth import foo resolves foo to whatever auth exports. Aliases are tracked (import foo as bar → bar resolves to foo).
Step 3 — Global: if imports don’t resolve it, search the entire symbol table by name. Method calls filter by receiver type when possible.
If all three fail, the reference is stored as unresolved — Phase 4 handles it probabilistically. Every edge created by Phase 2 carries confidence = 1.0 and resolution_mode = "static".
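The three-step cascade can be sketched in Rust. The types and field names here are hypothetical stand-ins for coderev's real symbol tables, which live in SQLite:

```rust
use std::collections::HashMap;

/// Simplified in-memory symbol tables (illustrative, not coderev's API).
struct Resolver {
    /// file path -> names defined in that file
    local: HashMap<String, Vec<String>>,
    /// (file, imported name) -> resolved target URI (aliases pre-applied)
    imports: HashMap<(String, String), String>,
    /// name -> every URI defining that name, project-wide
    global: HashMap<String, Vec<String>>,
}

impl Resolver {
    /// Three-step resolution: local file, then imports, then global table.
    /// Returns None when the reference stays unresolved (Phase 4 territory).
    fn resolve(&self, file: &str, name: &str) -> Option<String> {
        // Step 1: a same-file definition wins immediately.
        if let Some(syms) = self.local.get(file) {
            if syms.iter().any(|s| s == name) {
                return Some(format!("{file}#{name}"));
            }
        }
        // Step 2: follow the import chain (aliases already normalized).
        if let Some(target) = self.imports.get(&(file.to_string(), name.to_string())) {
            return Some(target.clone());
        }
        // Step 3: fall back to a project-wide name lookup; accept only an
        // unambiguous match, since two candidates mean we can't be sure.
        match self.global.get(name).map(Vec::as_slice) {
            Some([only]) => Some(only.clone()),
            _ => None,
        }
    }
}
```

Every hit from this cascade would carry confidence = 1.0 and resolution_mode = "static"; only the fall-through case reaches the semantic linker.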
Phase 3: embedding generation
Coderev runs FastEmbed locally — the all-MiniLM-L6-v2 transformer model via ONNX Runtime. No API calls, no cloud dependency. Embeddings are generated per symbol using a chunking strategy:
- Head embedding: symbol name + signature + first 1500 characters of the body. This captures intent.
- Body embeddings: the remaining content is split into 1000-character chunks with 100-character overlap.
Large functions get multiple embeddings; small functions get one. All vectors are stored as f32 BLOBs in SQLite, keyed by the symbol’s URI.
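The head/body split reduces to simple windowing. A sketch with the constants from the text (the function name is hypothetical; it works on bytes for brevity, where a real implementation would respect UTF-8 boundaries):

```rust
const HEAD_LEN: usize = 1500;  // head: name + signature + body prefix
const CHUNK_LEN: usize = 1000; // body chunk size
const OVERLAP: usize = 100;    // overlap between consecutive body chunks

/// Split a symbol into a head chunk plus overlapping body chunks.
fn chunk_symbol(name: &str, signature: &str, body: &str) -> Vec<String> {
    let split = body.len().min(HEAD_LEN);
    // Head embedding input: identity plus the first part of the body.
    let mut chunks = vec![format!("{name} {signature} {}", &body[..split])];

    // Remaining content in CHUNK_LEN windows with OVERLAP between them.
    let rest = &body[split..];
    let mut start = 0;
    while start < rest.len() {
        let end = (start + CHUNK_LEN).min(rest.len());
        chunks.push(rest[start..end].to_string());
        if end == rest.len() {
            break;
        }
        start = end - OVERLAP; // step back to create the overlap
    }
    chunks
}
```

A small function yields exactly one chunk (the head); a 4000-character body yields the head plus three overlapping body chunks, each of which gets its own vector.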
Phase 4: semantic linker
The semantic linker picks up where the global linker left off. For each unresolved reference from Phase 2, it:
- Embeds the call site context (the surrounding code)
- Computes cosine similarity against all cached symbol embeddings
- Creates an edge if similarity ≥ 0.6, with confidence = score and resolution_mode = "semantic"
The threshold of 0.6 is deliberately lower than the typical 0.8 used in document retrieval — code symbols have shorter, more ambiguous names, so recall matters more than precision at this stage. Downstream queries can filter by confidence.
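The decision at the heart of Phase 4 is plain cosine similarity over f32 vectors plus a threshold check. A minimal sketch (the threshold comes from the text; the function names are hypothetical):

```rust
const SEMANTIC_THRESHOLD: f32 = 0.6; // lower than the ~0.8 typical for documents

/// Cosine similarity between two equal-length embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Phase-4 decision: an edge is created only above the threshold, and the
/// similarity score itself becomes the edge's confidence.
fn semantic_edge(call_site: &[f32], candidate: &[f32]) -> Option<(f32, &'static str)> {
    let score = cosine(call_site, candidate);
    (score >= SEMANTIC_THRESHOLD).then(|| (score, "semantic"))
}
```

Because the raw score is stored as the confidence, downstream consumers can re-tighten the threshold per query without re-linking.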
The universal intermediate representation
One of coderev’s most consequential design decisions: all language-specific constructs map to five universal symbol types.
| Symbol kind | Python           | JavaScript        | Rust                 | Go                 |
|-------------|------------------|-------------------|----------------------|--------------------|
| Namespace   | module           | module/file       | mod                  | package            |
| Container   | class            | class             | struct / enum        | struct / interface |
| Callable    | def / async def  | function / arrow  | fn / impl fn         | func / method      |
| Value       | variable / const | let / const / var | let / const / static | var / const        |
| Document    | non-code file    | non-code file     | non-code file        | non-code file      |
The query engine operates on these five kinds — it doesn’t know or care whether a Callable was a Python async def or a Rust impl fn. This means the call graph, impact analysis, and search work identically across all four languages.
The tradeoff is real: you lose language-specific nuance. Rust trait bounds, Python decorators, JavaScript hoisting behavior — all flattened away. But for the use cases coderev targets (navigation, impact analysis, AI grounding), the universal model is sufficient and dramatically simpler.
Symbol URIs: deterministic identity
Every symbol gets a globally unique URI:
codescope://blogv1/src/auth/middleware.py#callable:authenticate_user@42

The format is codescope://repo/path#kind:name@line. This serves as the primary key in SQLite and the stable reference in all graph edges. It’s deterministic — the same code always produces the same URI — which makes incremental indexing possible: if a file’s BLAKE3 hash hasn’t changed, skip it.
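Constructing the URI is a pure string operation, which is exactly what makes it deterministic. A sketch (the real constructor may normalize paths differently):

```rust
/// Deterministic symbol URI: codescope://repo/path#kind:name@line
fn symbol_uri(repo: &str, path: &str, kind: &str, name: &str, line: u32) -> String {
    format!("codescope://{repo}/{path}#{kind}:{name}@{line}")
}
```

Because the URI depends only on repo, path, kind, name, and line, re-indexing an unchanged file regenerates identical keys, so the hash-based skip never orphans graph edges.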
Edges: typed and scored
The edge model is simple but expressive:
| Edge kind  | Meaning                     | Example                               |
|------------|-----------------------------|---------------------------------------|
| Defines    | Namespace defines a symbol  | module auth defines authenticate_user |
| Contains   | Container holds a member    | class UserService contains validate() |
| Calls      | Callable invokes another    | login() calls authenticate_user()     |
| References | Symbol reads/uses another   | handler references AUTH_CONFIG        |
| Inherits   | Container extends another   | AdminUser inherits User               |
| Exports    | Module re-exports a symbol  | index.js exports Router               |
Every edge carries two metadata fields: confidence (1.0 for static, 0.6–1.0 for semantic) and resolution_mode (“static” or “semantic”). This lets consumers decide their own precision/recall tradeoff — strict refactoring tools can filter to confidence = 1.0; exploratory AI agents can include everything.
Local-first AI
Coderev is aggressively local-first. The embedding model (all-MiniLM-L6-v2, 23M parameters) runs via ONNX Runtime on CPU. No API keys, no network calls, no data leaving your machine.
The practical implications:
- Privacy: corporate codebases stay on-disk. No embedding API sees your code.
- Offline: works on planes, in air-gapped environments, behind restrictive firewalls.
- Latency: model startup is ~1.4s (cold), but subsequent operations are fast. Batch embedding amortizes the cost.
- Quality tradeoff: MiniLM is smaller than OpenAI’s text-embedding-3-large or Cohere’s models. For code symbols (short, technical, naming-convention-heavy), the smaller model performs well enough. The structural graph compensates for what the embeddings miss.
Storage: everything in SQLite
The entire index lives in a single .coderev/coderev.db file. The schema is straightforward:
CREATE TABLE symbols (
uri TEXT PRIMARY KEY,
kind TEXT, name TEXT, path TEXT,
line_start INT, line_end INT,
doc TEXT, signature TEXT, content TEXT
);
CREATE TABLE edges (
from_uri TEXT, to_uri TEXT,
kind TEXT, confidence REAL,
resolution_mode TEXT
);
CREATE TABLE embeddings (
uri TEXT, id INT,
vector BLOB, -- f32 array
PRIMARY KEY (uri, id)
);

Vector search is currently a linear scan in Rust — load all embeddings into memory, compute cosine similarity, sort. This works for repositories with up to ~10k symbols. Beyond that, an HNSW index would be needed, but for the target use case (single-repo navigation), linear scan is fast enough and avoids the complexity of maintaining a vector index.
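The linear scan is brute force by design. Assuming the embeddings table fits in memory, it reduces to one scoring pass plus a sort; a sketch, not the actual implementation:

```rust
/// Brute-force vector search: score every cached embedding against the
/// query, sort descending, keep the top k. O(n * d) per query.
fn linear_search<'a>(
    query: &[f32],
    index: &'a [(String, Vec<f32>)], // (symbol URI, embedding)
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = index
        .iter()
        .map(|(uri, vec)| (uri.as_str(), cosine(vec, query)))
        .collect();
    // Sort by similarity, highest first; ties broken arbitrarily.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}

/// Cosine similarity between two equal-length f32 vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}
```

For ~10k symbols and 384-dimensional MiniLM vectors, this is a few million multiply-adds per query, which is well within interactive latency on a CPU.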
MCP integration: grounding AI agents
Coderev exposes its graph via the Model Context Protocol, making it available as a tool for Claude Desktop, ChatGPT, and other MCP-compatible agents. The server runs over stdio and exposes four tools:
- search_code(query, limit) — semantic search returning symbol URIs and metadata
- get_callers(uri, depth) — traverse incoming Calls edges
- get_callees(uri, depth) — traverse outgoing Calls edges
- get_impact(uri, depth) — BFS impact analysis with confidence propagation
This is where the architecture pays off. When an AI agent asks “what would break if I changed authenticate_user?”, the MCP tool returns a precise list of affected symbols with confidence scores — not a vague chunk of text that mentions authentication. The agent gets grounded context: specific functions, specific files, specific line numbers.
Compare this to a typical RAG pipeline, where the agent receives “the authentication module handles user validation and session management” and has to guess which functions matter. Coderev’s response is: login() at auth/views.py:34 (confidence=1.0), admin_check() at admin/middleware.py:12 (confidence=0.87).
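Impact analysis itself is a breadth-first walk over incoming edges, multiplying confidences along each path. A sketch against a hypothetical in-memory edge list (the real data comes from the edges table, and the propagation rule here is an assumption about how coderev combines scores):

```rust
use std::collections::{HashMap, VecDeque};

/// Edge as stored: from_uri depends on to_uri, with a confidence score.
struct Edge {
    from: String,
    to: String,
    confidence: f32,
}

/// BFS over incoming edges up to `depth`, propagating confidence
/// multiplicatively: a 0.87 edge reached through another 0.87 edge
/// yields ~0.76 for the transitive dependent.
fn impact(edges: &[Edge], target: &str, depth: u32) -> HashMap<String, f32> {
    // Index edges by callee for fast "who points at me" lookups.
    let mut incoming: HashMap<&str, Vec<&Edge>> = HashMap::new();
    for e in edges {
        incoming.entry(e.to.as_str()).or_default().push(e);
    }

    let mut affected: HashMap<String, f32> = HashMap::new();
    let mut queue = VecDeque::from([(target.to_string(), 1.0_f32, 0_u32)]);
    while let Some((uri, conf, d)) = queue.pop_front() {
        if d == depth {
            continue; // depth limit reached along this path
        }
        for e in incoming.get(uri.as_str()).into_iter().flatten() {
            let c = conf * e.confidence;
            // Keep the best confidence seen for each affected symbol;
            // only re-enqueue when the path improved.
            let entry = affected.entry(e.from.clone()).or_insert(0.0);
            if c > *entry {
                *entry = c;
                queue.push_back((e.from.clone(), c, d + 1));
            }
        }
    }
    affected
}
```

A strict refactoring tool would then drop every entry below 1.0; an exploratory agent would keep the whole map.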
CLI: the developer interface
The CLI is built with Clap and supports four output formats:
# semantic search
coderev search --query "authentication logic" --limit 5
# call graph
coderev callers --uri "codescope://repo/auth.py#callable:login@10" --depth 2
coderev callees --uri "codescope://repo/auth.py#callable:login@10"
# impact analysis
coderev impact --uri "codescope://repo/auth.py#callable:login@10" --depth 3
# output as JSON for tooling
coderev search --query "database connection" --format json

Output formats: human (pretty tables), json (full schema), compact (short keys for bandwidth-constrained contexts), toon (compact with flavor text — because why not).
Benchmarks: precision vs. speed
Benchmarks comparing coderev against ripgrep and a simple RAG pipeline on the same repository:
| Tool       | Latency | Result type            | Precision                |
|------------|---------|------------------------|--------------------------|
| ripgrep    | ~0.02s  | Line matches           | Exact text, no structure |
| Simple RAG | ~0.15s  | Text chunks            | Semantic but vague       |
| coderev    | ~1.5s   | Symbol URIs + metadata | Structural + semantic    |
Coderev is 75x slower than ripgrep for raw latency. That’s the cost of model inference. But the comparison isn’t fair — they answer different questions. Ripgrep tells you “this string appears on line 42.” Coderev tells you “this function is called by these three callers, which are in turn called by this HTTP handler, and if you change it, these seven downstream symbols are affected.”
The latency is dominated by model startup (~1.4s cold). Once warm, embedding operations run at millisecond scale. For interactive use, coderev serve keeps the model hot.
What makes this interesting
The architectural insight is that semantic search without structure is just cosine similarity. Embeddings find similar code, but similarity isn’t the same as relationship. A function that mentions authentication and a function that calls the authenticator produce similar embeddings but have fundamentally different relationships to your change.
By building the deterministic graph first (AST → symbols → edges) and then using embeddings only for what the graph can’t resolve (ambiguous references, semantic queries), coderev gets the best of both worlds: compiler-grade precision where it’s available, AI-assisted recall where it’s needed.
The second insight is that local-first AI is viable for specialized domains. A 23M-parameter model can’t compete with GPT-4 on general reasoning, but for embedding code symbols into a vector space where structural context fills the gaps, it’s sufficient. The privacy and latency benefits of running locally outweigh the quality tradeoff.
Limitations and open questions
Coderev is honest about its gaps:
- Language coverage: Python and JavaScript are well-supported. Rust and Go are experimental — their tree-sitter queries extract basic symbols but miss some edge cases.
- No stdlib/package resolution: import requests is tracked as an import, but coderev doesn’t index the requests package. External symbols are marked but not expanded.
- Linear vector search: works fine for single repositories but doesn’t scale to monorepos with 100k+ symbols. HNSW indexing is the obvious next step.
- No LSP: the query engine can’t integrate with IDEs directly. The MCP server bridges to AI agents, but IDE users still need to use the CLI.
- Tree-sitter query maintenance: each upstream grammar update can break extraction queries. This is the tax on using a grammar-based approach instead of language-specific compilers.
The deeper question is whether the universal intermediate representation is the right abstraction. Flattening all languages to five symbol types works for navigation and impact analysis, but richer type-system queries (“what implements this trait?”, “what satisfies this interface?”) would need language-specific extensions. Whether to stay universal or add type-system layers is an open design decision.
The takeaway
Coderev demonstrates that the space between “grep” and “ask an LLM” is underexplored. By combining deterministic AST analysis with probabilistic embedding search — and keeping everything local — it delivers a code understanding tool that’s precise enough for refactoring and fuzzy enough for exploration.
The key design choices: Tree-sitter for cross-language parsing, a universal symbol model for simplicity, SQLite for portability, local transformers for privacy, and MCP for AI agent integration. Each has a tradeoff, and each tradeoff is deliberate.
Semantic search without a graph is just cosine similarity. Add structure, and you get reasoning.