Code Knowledge Graphs: Why Open-Source Stacks Stall at Enterprise Scale

TL;DR: Every code knowledge graph demo ends with a visualization. Most of them are syntax trees in a fancy renderer. At enterprise scale, the question isn't "is there a graph" but what's in the nodes, what's in the edges, and what survives Monday morning when production code changes. This post scores open-source code knowledge graph stacks (Joern, Kythe, Glean, Stack Graphs, SCIP/LSIF, CodeQL, DIY Neo4j, and agent repo maps) against the five dimensions that actually matter at enterprise scale, then shows what CoreStory does differently.

The KG Everyone Is Showing You Isn't the KG You Need

Walk into any vendor meeting this year and, at some point, you'll see a graph. Nodes and edges, probably rendered in a dark UI with a gold or teal color scheme. Someone will call it a knowledge graph. They'll fly through a visualization and say something like "This is your codebase. You can see how everything connects."

What they're often showing you is a tree-sitter AST fed into a graph renderer. The nodes are tokens. The edges are parse-tree relationships. It looks like intelligence. It isn't.
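To see how little a syntax graph holds, here is a minimal sketch that turns a parse tree into nodes and edges. It uses Python's standard-library ast module as a stand-in for tree-sitter (an assumption made so the example runs without extra dependencies), and the approve_claim function is a hypothetical sample, not code from any real system.

```python
import ast

# Hypothetical sample: a function that encodes a business rule
# (claims over 10,000 go to manual review).
SAMPLE = '''
def approve_claim(claim):
    if claim.amount > 10000:
        return "manual_review"
    return "auto_approve"
'''

def syntax_graph(source):
    """Return (nodes, edges) for the parse tree of `source`.

    nodes maps a node id to its syntax type name.
    edges is a list of (parent type, child type) pairs.
    """
    tree = ast.parse(source)
    nodes, edges = {}, []
    for parent in ast.walk(tree):
        nodes[id(parent)] = type(parent).__name__
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return nodes, edges

nodes, edges = syntax_graph(SAMPLE)
labels = sorted(set(nodes.values()))

print("node types:", labels)
print("edge count:", len(edges))
```

Every node label is a syntax category such as FunctionDef, If, or Compare. Nothing in the graph says "claims above a threshold require manual review"; that rule has to be derived by a layer the parser does not provide.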

The distinction matters enormously when you're dealing with a real enterprise codebase (the kind that has six million lines of code, four programming languages, a hundred stored procedures, JCL batch jobs written in the 1990s, and a team of architects who need to understand it well enough to modernize it safely). At that scale, a syntax graph is noise. You need a knowledge graph that captures how the system actually behaves and why it was built that way.

The question isn't whether there's a graph. It's what's in the nodes, what's in the edges, and what happens to both at 9 a.m. on Monday after 50,000 lines of production code changed over the weekend.


The Five Dimensions a Code KG Is Actually Judged On

Before comparing tools, it helps to name what you're actually evaluating. Here's the rubric — five dimensions that separate a real enterprise code intelligence layer from a science project:

Depth — Does the graph stop at syntax (token relationships), reach semantics (type resolution, data flow), or capture intent and behavior (business rules extracted from code paths, architectural decisions inferred from system behavior)? Most tools claim "semantic" but deliver structural.

Coverage — What artifact types are in scope? A graph that only indexes .java and .py files is missing half the system in most enterprises. Stored procedures, database schemas, batch job definitions, configuration files, and IaC all encode business logic.

Polyglot Scale — Can the graph traverse cross-language call boundaries? A Java service calling a stored procedure that drives a COBOL batch job is a single logical workflow. A tool that treats each language in isolation will miss the dependency entirely.

Freshness — When code changes, what happens to the graph? Full re-indexing every night means the graph is stale for most of the day. Incremental updates that track deltas keep the model current. The question is the staleness budget: how far behind can the graph be before it becomes a liability?

Queryability — Who and what can ask questions of the graph? A raw graph database requires experts to write traversal queries. An AI-native query surface — MCP endpoints, semantic retrieval, natural-language interfaces — opens the graph to agents and non-expert users alike.

This is the rubric. Every tool below scores against it.
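The Polyglot Scale dimension is the easiest one to make concrete. The sketch below walks a small dependency graph twice: once across every artifact type, and once restricted to Java. All node names, and the "kind:name" id convention, are hypothetical and chosen only to mirror the Java service, stored procedure, and COBOL batch example above.

```python
from collections import deque

# Hypothetical "flows to" edges. Node ids use a "<kind>:<name>" convention
# invented for this sketch.
EDGES = [
    ("java:ClaimService.submit", "sql:sp_insert_claim"),
    ("sql:sp_insert_claim", "table:CLAIMS"),
    ("table:CLAIMS", "cobol:CLMBATCH"),
    ("java:ClaimService.submit", "java:AuditLogger.log"),
]

def downstream(start, edges, allowed_kinds=None):
    """Breadth-first search for everything reachable from `start`.

    If `allowed_kinds` is given, nodes of any other kind are ignored,
    which simulates a single-language analysis.
    """
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            kind = nxt.split(":", 1)[0]
            if allowed_kinds is not None and kind not in allowed_kinds:
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

full = downstream("java:ClaimService.submit", EDGES)
java_only = downstream("java:ClaimService.submit", EDGES, allowed_kinds={"java"})

print("all artifacts:", sorted(full))
print("java only:    ", sorted(java_only))
```

The Java-only walk finds the audit logger and nothing else. The COBOL batch job that depends on the same change only appears when the traversal is allowed to cross the stored procedure and the table.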


Competitive Teardown: Named, On the Rubric

Each of the following stacks was built for a specific purpose. That purpose matters — it explains both what they do well and where they fall short when asked to serve as an enterprise code intelligence layer.

Stack	Built For	Depth	Coverage	Polyglot Scale	Freshness	Queryability
Joern / Code Property Graph	Security analysis, vulnerability detection	Syntax + semantic; no business-intent layer	Code files; limited multi-artifact support	Uneven across languages; designed for single-language analysis	Batch re-analysis; not optimized for large incremental updates	Joern query language (Scala-based); no AI-native surface
Kythe (Google OSS)	Cross-reference indexing inside Google's monorepo toolchain	Structural cross-reference; no behavior model	Code files; indexer must be built per language	Multi-language via per-language indexers; cross-boundary traversal needs custom pipeline work	Incremental indexing supported but setup is a platform project	Low-level protobuf serving layer; no out-of-box semantic query surface
Glean (Meta OSS)	Indexing language facts at Meta scale	Language facts (types, references, definitions); no business-rule extraction	Code-centric; consumers build their own analyses on top	Schema-per-language; cross-language analysis is a consumer responsibility	Incremental at Meta scale; self-hosting and scaling is significant work	Angle query language; no AI-native surface
Stack Graphs / tree-sitter	Incremental name resolution for code navigation	Lexical and structural only; nothing about why code exists	Source files only	Per-language grammars; no cross-language traversal	Incremental by design for name resolution; limited to that scope	Code navigation in IDEs; no graph query or AI surface
SCIP / LSIF (Sourcegraph)	Code navigation index for IDEs and search	Symbol graph (definitions, references); not a behavior graph	Source files; no stored procs, schemas, batch	Multi-language symbol resolution; no semantic cross-language dependency graph	Static index generated at build time; refresh cadence depends on CI	LSP-based navigation; no AI query layer
CodeQL (GitHub)	Variant analysis for security research	Deep semantic analysis within a query; not a persistent model	Code files; extensible but requires query authorship	Multi-language with separate databases per language; cross-language queries need custom work	Analysis is per-query, per-run; no persistent intelligence model	QL query language; powerful for security researchers; not for non-experts or AI agents
Neo4j + tree-sitter (DIY)	Whatever the team decides to build	Entirely up to the implementation team	Entirely up to the implementation team	Entirely up to the implementation team	Entirely up to the implementation team	Entirely up to the implementation team
Agent-built repo maps (Aider, etc.)	In-session context for a single coding agent	Structural file/function map for session context	Source files in scope for the session	Session-scoped; no cross-language dependency graph	Rebuilt each session; not persistent	Agent-internal; not queryable externally
OpenGrok	Code search and cross-reference navigation	Symbol-level cross-reference only; no behavioral or intent layer	Source files; no stored procs, schemas, or batch artifacts	Multi-language file search; no semantic cross-language traversal	Index rebuilt on commit; no incremental graph update	Web UI and REST search API; no AI-native query surface
Understand by SciTools	Static analysis, code metrics, dependency visualization	Structural and some semantic (call graphs, data flow); no business-intent extraction	Source code; limited multi-artifact support	Multi-language; cross-language dependency views require manual configuration	Re-analysis required for updates; no incremental graph model	GUI and scripting API (Perl/Python); no AI-native surface
srcML	XML-based source representation used as a KG substrate in research pipelines	Structural (AST as XML); no semantic or intent layer	Source files only	Per-language parsers; no cross-language traversal	Static transformation; freshness depends on the consuming pipeline	XML query tools (XPath/XSLT); no AI-native surface
Spoon (INRIA)	Java source transformation and analysis; a graph substrate for research	Semantic (type resolution, AST rewriting); no behavior or intent model	Java source files only	Java-only by design; no cross-language graph	Analysis is per-run; no persistent model	Java API for programmatic analysis; no AI-native query surface
Gremlin / Apache TinkerPop	Graph traversal framework used to build code KGs on property graph stores	Up to the implementation; provides traversal primitives, not code understanding	Entirely up to the implementation team	Entirely up to the implementation team	Entirely up to the implementation team	Gremlin query language; powerful for graph experts; no AI-native surface out of the box
RDF-based code KGs (Apache Jena, etc.)	Semantic-web-style program analysis, mostly academic	Ontology-driven; rich relationships, but business-rule extraction needs custom ontology design	Varies by schema; typically source code only in practice	Cross-language modeling possible but requires significant ontology engineering	Batch triple-store ingestion; no production-grade incremental update pattern	SPARQL; highly expressive but requires expert users; no AI-native surface
Eclipse JDT / LSP4J	Java-specific code graph construction; building block for Java tooling	Semantic (type resolution, binding, AST); no behavioral or intent layer	Java source files; no multi-language or multi-artifact coverage	Java-only; LSP4J enables language servers but does not bridge language graphs	Incremental compilation model; freshness tied to build cycle	LSP-based navigation; no AI-native query surface
Depends	Dependency analysis producing code dependency graphs across languages	Structural dependency graph (imports, calls); no semantic or intent layer	Source files; limited artifact coverage	Multi-language dependency extraction; cross-language traversal requires integration work	Analysis is per-run; no persistent or incremental model	JSON/CSV export; no AI-native query surface

A few observations worth drawing out:

Joern and CodeQL are excellent security analysis tools. They were designed to answer specific security questions about specific code. That is a different problem than building a persistent intelligence layer about how a system behaves. Joern's Code Property Graph (CPG) is a genuine contribution to the field — but it was designed for vulnerability detection, not for reverse-engineering the business rules in enterprise software like a 20-year-old insurance claims system.

Kythe and Glean represent serious engineering at Google and Meta scale, respectively. The reason neither has significant enterprise adoption outside their origin companies is instructive: standing them up is itself a platform program. Kythe requires a per-language indexer; Glean requires consumers to build their own analysis schemas. Neither comes with a business-intelligence layer out of the box.

The DIY Neo4j path is the one that consumes the most architect time. The conversation usually goes: "We can build this ourselves with Neo4j and tree-sitter over a weekend." In practice, that weekend becomes the pipeline design, then the schema design, then the language coverage gaps, then the freshness problem, then the query layer, then the maintenance burden. Most enterprises that have gone down this path report spending before reaching what a vendor delivers at week one. See this detailed look at how a production-grade code knowledge graph is actually architected (including the five phases that separate working implementations from stalled ones) where the tradeoffs are covered in depth.

Agent repo maps (the kind coding agents like Aider produce on demand) are genuinely useful … for the agent, in that session. They are not persistent, not queryable externally, and not designed for multi-million-line codebases. They are not a knowledge graph; they are session context. Understanding where curated knowledge reaches its limits, and where a structured code graph has to take over, is the clearest way to see why session-scoped maps don't scale.

The broader ecosystem includes a set of tools specifically designed for code intelligence rather than general-purpose graph databases. OpenGrok is widely deployed for code search and cross-reference in enterprise environments, but it is a navigation tool — symbol-level only, no behavioral layer, and no AI-native query surface. Understand by SciTools goes further with call graphs and code metrics but still stops at structural semantics and requires re-analysis for every update. srcML and Spoon are research-grade substrates — useful building blocks in academic pipelines but not production intelligence platforms. Gremlin/Apache TinkerPop and RDF-based stacks (Apache Jena and similar) give you a powerful graph model but place the entire burden of schema design, ingestion, freshness, and query surface on the implementation team — the same DIY problem as Neo4j, just with a different traversal language. Eclipse JDT/LSP4J offers deep Java semantics but is single-language by design. Depends covers multi-language dependency graphs but produces structural exports rather than a queryable persistent intelligence model. In each case, the pattern is the same: capable of producing a graph of some kind; not architected to be an enterprise code intelligence layer.

What CoreStory Does That Those Don't

Scored against the same five dimensions:

Depth: from syntax to intent

Open-source KGs stop at "this function calls that function." CoreStory's Intelligence Model captures not only structural and semantic relationships but also the behavioral and intent layers above them. That means business rules extracted from code paths, architectural decisions inferred from the system's actual runtime behavior, and cross-artifact reasoning that connects, for example, a Java API endpoint to the stored procedure it calls and to the batch job that consumes its output.

This distinction is the difference between a graph that tells you what code exists and a model that tells you what the system does. An architect evaluating a modernization program needs the latter.

Coverage: the full artifact stack

CoreStory indexes code, stored procedures, database schemas, configuration files, JCL batch job definitions, and IaC, not just .java, .py, or .ts files. In most legacy enterprise environments, the business logic is distributed across artifact types. A graph that only sees source code is missing a significant portion of the system's behavior.

Polyglot scale: cross-language graph traversal

CoreStory's ingestion is language-agnostic and traverses cross-language call boundaries. A Java service calling a stored procedure that drives a COBOL batch job is represented as a single connected subgraph, not as three separate single-language analyses that happen to sit next to each other. For legacy enterprises with polyglot stacks, which is most of them, this is the capability that makes the model useful.

Freshness: persistent intelligence that survives change

The Intelligence Model updates incrementally as code changes. It is not rebuilt from scratch on a schedule. More importantly, it persists across sessions and across team turnover. When a senior engineer leaves, their understanding of the system leaves with them — unless CoreStory captures it. The model compounds over time rather than degrading.
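As a generic illustration of the difference between a scheduled rebuild and an incremental update, here is a minimal sketch of delta detection by content hash. It is not a description of CoreStory's internals. The file paths and snapshot contents are hypothetical, and a real pipeline would also have to re-derive the graph edges that touch each changed file.

```python
import hashlib

def fingerprint(files):
    """Map each path to a SHA-256 digest of its contents."""
    return {
        path: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for path, text in files.items()
    }

def delta(old, new):
    """Compare two fingerprint maps and report what needs re-indexing."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in new.keys() & old.keys() if new[p] != old[p]),
    }

# Hypothetical snapshots taken before and after a weekend of changes.
friday = fingerprint({
    "src/ClaimService.java": "class ClaimService { /* v1 */ }",
    "db/sp_insert_claim.sql": "CREATE PROCEDURE sp_insert_claim ...",
    "batch/CLMBATCH.cbl": "IDENTIFICATION DIVISION.",
})
monday = fingerprint({
    "src/ClaimService.java": "class ClaimService { /* v2 */ }",
    "db/sp_insert_claim.sql": "CREATE PROCEDURE sp_insert_claim ...",
    "infra/main.tf": "resource \"aws_s3_bucket\" \"claims\" {}",
})

work = delta(friday, monday)
print(work)
```

Only the paths listed in the delta need to be re-analyzed; the unchanged stored procedure keeps its existing nodes and edges. That is the property to ask about when a vendor or an internal team says "incremental."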

Queryability: MCP, API, and human dashboard

CoreStory exposes the Intelligence Model through an MCP/API surface for AI agents and a dashboard for humans, and both interfaces query the same underlying model. This is the architecture that makes it possible for a coding agent to ask "what are the business rules enforced in this service's validation layer?" and get a grounded, specific answer rather than a hallucinated one. It's also what makes the 44% improvement in task resolution possible for agents grounded in CoreStory's Intelligence Model versus agents operating without it.


Most open-source code KGs stop at the syntactic or semantic layer. CoreStory spans all four layers.


The Buyer's Checklist

The next time you evaluate a code KG vendor or your own team's proposal to build one with OSS, go through these six questions:

What's in your nodes beyond syntax? Ask them to show you a node that contains a business rule, not just a function name.
What artifact types are in scope, and does the graph traverse across them? A Java-only graph is not a system model.
What's your incremental update story when 50,000 lines change overnight? Full re-indexing is not an answer.
What's the query surface for an AI agent, not just a human? A graph DB that requires manual traversal queries is not agent-ready.
What's your largest production deployment, by LOC and language count? Scale claims need to be verified against real deployments.
Show me the benchmark. Any claim about accuracy or agent performance improvement should be backed by a reproducible methodology.

These questions work equally well against an external vendor and against an internal team proposing to build the capability in-house. If the answers are vague on any of them, the graph is probably shallower than advertised.

The honest way to evaluate any code intelligence platform is to bring it your actual codebase — the messy, polyglot, partially documented one, not a sanitized demo environment. That's the only benchmark that matters for your specific system.

If you want to see what's in CoreStory's Intelligence Model that isn't in the tools above, bring us your hardest codebase. Schedule a call with our experts.
