The KG Everyone Is Showing You Isn't the KG You Need
Walk into any vendor meeting this year and, at some point, you'll see a graph. Nodes and edges, probably rendered in a dark UI with a gold or teal color scheme. Someone will call it a knowledge graph. They'll fly through a visualization and say something like "this is your codebase. You can see how everything connects."
What they're often showing you is a tree-sitter AST fed into a graph renderer. The nodes are tokens. The edges are parse-tree relationships. It looks like intelligence. It isn't.
The distinction matters enormously when you're dealing with a real enterprise codebase (the kind that has six million lines of code, four programming languages, a hundred stored procedures, JCL batch jobs written in the 1990s, and a team of architects who need to understand it well enough to modernize it safely). At that scale, a syntax graph is noise. You need a knowledge graph that captures how the system actually behaves and why it was built that way.
The question isn't whether there's a graph but instead: what's in the nodes, what's in the edges, and what happens to both of those things at 9 a.m. on Monday when 50,000 lines of production code changed over the weekend?
The Five Dimensions a Code KG Is Actually Judged On
Before comparing tools, it helps to name what you're actually evaluating. Here's the rubric — five dimensions that separate a real enterprise code intelligence layer from a science project:
Depth — Does the graph stop at syntax (token relationships), reach semantics (type resolution, data flow), or capture intent and behavior (business rules extracted from code paths, architectural decisions inferred from system behavior)? Most tools claim "semantic" but deliver structural.
Coverage — What artifact types are in scope? A graph that only indexes .java and .py files is missing half the system in most enterprises. Stored procedures, database schemas, batch job definitions, configuration files, and IaC all encode business logic.
Polyglot Scale — Can the graph traverse cross-language call boundaries? A Java service calling a stored procedure that drives a COBOL batch job is a single logical workflow. A tool that treats each language in isolation will miss the dependency entirely.
Freshness — When code changes, what happens to the graph? Full re-indexing every night means the graph is stale for most of the day. Incremental updates that track deltas keep the model current. The question is the staleness budget: how far behind can the graph be before it becomes a liability?
Queryability — Who and what can ask questions of the graph? A raw graph database requires experts to write traversal queries. An AI-native query surface — MCP endpoints, semantic retrieval, natural-language interfaces — opens the graph to agents and non-expert users alike.
This is the rubric. Every tool below scores against it.
Competitive Teardown: Named, On the Rubric
Each of the following stacks was built for a specific purpose. That purpose matters — it explains both what they do well and where they fall short when asked to serve as an enterprise code intelligence layer.
A few observations worth drawing out:
Joern and CodeQL are excellent security analysis tools. They were designed to answer specific security questions about specific code. That is a different problem than building a persistent intelligence layer about how a system behaves. Joern's Code Property Graph (CPG) is a genuine contribution to the field — but it was designed for vulnerability detection, not for reverse-engineering the business rules in enterprise software like a 20-year-old insurance claims system.
Kythe and Glean represent serious engineering at Google and Meta scale, respectively. The reason neither has significant enterprise adoption outside their origin companies is instructive: standing them up is itself a platform program. Kythe requires a per-language indexer; Glean requires consumers to build their own analysis schemas. Neither comes with a business-intelligence layer out of the box.
The DIY Neo4j path is the one that consumes the most architect time. The conversation usually goes: "We can build this ourselves with Neo4j and tree-sitter over a weekend." In practice, that weekend becomes the pipeline design, then the schema design, then the language coverage gaps, then the freshness problem, then the query layer, then the maintenance burden. Most enterprises that have gone down this path report spending before reaching what a vendor delivers at week one. See this detailed look at how a production-grade code knowledge graph is actually architected (including the five phases that separate working implementations from stalled ones) where the tradeoffs are covered in depth.
Agent repo maps (the kind coding agents like Aider produce on demand) are genuinely useful … for the agent, in that session. They are not persistent, not queryable externally, and not designed for multi-million-line codebases. They are not a knowledge graph; they are session context. Understanding where curated knowledge reaches its limits, and where a structured code graph has to take over, is the clearest way to see why session-scoped maps don't scale.
The broader ecosystem includes a set of tools specifically designed for code intelligence rather than general-purpose graph databases. OpenGrok is widely deployed for code search and cross-reference in enterprise environments, but it is a navigation tool — symbol-level only, no behavioral layer, and no AI-native query surface. Understand by SciTools goes further with call graphs and code metrics but still stops at structural semantics and requires re-analysis for every update. srcML and Spoon are research-grade substrates — useful building blocks in academic pipelines but not production intelligence platforms. Gremlin/Apache TinkerPop and RDF-based stacks (Apache Jena and similar) give you a powerful graph model but place the entire burden of schema design, ingestion, freshness, and query surface on the implementation team — the same DIY problem as Neo4j, just with a different traversal language. Eclipse JDT/LSP4J offers deep Java semantics but is single-language by design. Depends covers multi-language dependency graphs but produces structural exports rather than a queryable persistent intelligence model. In each case, the pattern is the same: capable of producing a graph of some kind; not architected to be an enterprise code intelligence layer.
What CoreStory Does That Those Don't
Scored against the same five dimensions:
Depth: from syntax to intent
Open-source KGs stop at "this function calls that function." CoreStory's Intelligence Model captures not only structural and semantic relationships, but the behavioral and intent layers above them. That means business rules extracted from code paths, architectural decisions inferred from the system's actual runtime behavior, and cross-artifact reasoning (that connects a Java API endpoint to the stored procedure it calls to the batch job that consumes its output, for example).
This distinction is the difference between a graph that tells you what code exists and a model that tells you what the system does. An architect evaluating a modernization program needs the latter.
Coverage: the full artifact stack
CoreStory indexes code, stored procedures, database schemas, configuration files, JCL batch job definitions, and IaC, not just .java, .py, or .ts files. In most legacy enterprise environments, the business logic is distributed across artifact types. A graph that only sees source code is missing a significant portion of the system's behavior.
Polyglot scale: cross-language graph traversal
CoreStory's ingestion is language-agnostic and traverses cross-language call boundaries. A Java service calling a stored procedure that drives a COBOL batch job is represented as a single connected subgraph, not as three separate single-language analyses that happen to sit next to each other. For legacy enterprises with polyglot stacks (which is most of them) this is the capability that makes the model useful.
Freshness: persistent intelligence that survives change
The Intelligence Model updates incrementally as code changes. It is not rebuilt from scratch on a schedule. More importantly, it persists across sessions and across team turnover. When a senior engineer leaves, their understanding of the system leaves with them — unless CoreStory captures it. The model compounds over time rather than degrading.
Queryability: MCP, API, and human dashboard
CoreStory exposes the Intelligence Model through an MCP/API surface for AI agents and a dashboard for humans, and both interfaces query the same underlying model. This is the architecture that makes it possible for a coding agent to ask "what are the business rules enforced in this service's validation layer?" and get a grounded, specific answer rather than a hallucinated one. It's also what makes the 44% improvement in task resolution possible for agents grounded in CoreStory's Intelligence Model versus agents operating without it.

The Buyer's Checklist
The next time you evaluate a code KG vendor or your own team's proposal to build one with OSS, go through these six questions:
- What's in your nodes beyond syntax? Ask them to show you a node that contains a business rule, not just a function name.
- What artifact types are in scope, and does the graph traverse across them? A Java-only graph is not a system model.
- What's your incremental update story when 50,000 lines change overnight? Full re-indexing is not an answer.
- What's the query surface for an AI agent, not just a human? A graph DB that requires manual traversal queries is not agent-ready.
- What's your largest production deployment, by LOC and language count? Scale claims need to be verified against real deployments.
- Show me the benchmark. Any claim about accuracy or agent performance improvement should be backed by a reproducible methodology.
These questions work equally well against an external vendor and against an internal team proposing to build the capability in-house. If the answers are vague on any of them, the graph is probably shallower than advertised.
The honest way to evaluate any code intelligence platform is to bring it your actual codebase — the messy, polyglot, partially documented one, not a sanitized demo environment. That's the only benchmark that matters for your specific system.
If you want to see what's in CoreStory's Intelligence Model that isn't in the tools above, bring us your hardest codebase. Schedule a call with our experts.







