IBM Bob + CoreStory: An Agentic Code Modernization Case Study

Q: Can you reproduce this on our codebase?

Yes. The same methodology works on any codebase: pick a workflow, write a rubric from the source code, run the agent twice, and score against the rubric.

TL;DR Bob is IBM's AI agent for agentic code modernization. We ran Bob against a representative modernization task on Java Pet Store 1.3.2, then ran the exact same task again with Bob paired with CoreStory's code intelligence layer. Same prompt, same codebase, two runs. On its own, Bob correctly handled 16 of 26 rubric points. With CoreStory, the correct implementation was identified for 25 of 26 — Bob shipped 20 of them and deferred the last 5 only because it hit its bobcoin budget, not because CoreStory missed them. Production-breaking bugs went from 3 to 0. Message handler conversion went from 3/6 to 6/6. Bob's data model went from an in-memory mock to a full JPA model with every field present. Bob is a capable agent on its own. With CoreStory in the loop, the kinds of defects that pause a modernization program stopped happening. This post documents the methodology, the scorecard, and what changed.

Why we ran this comparison

Agentic code modernization is moving fast. IBM's Bob is one of the more capable agents in this space, purpose-built for the specific task of modernizing enterprise applications. CoreStory is the persistent code intelligence layer that humans and AI agents query to understand a system before changing it. The obvious question is whether combining the two produces measurable improvement on the kind of task an enterprise actually cares about.

We wanted a comparison that wasn't a toy. So we picked an old, well-known, architecturally honest reference application, defined a real modernization task on it, wrote the scoring rubric from the source code rather than from a generic checklist, and ran the same task twice. Run one was Bob alone. Run two was Bob with CoreStory connected through MCP. Everything else was identical.

‍

Methodology

The codebase

Java Pet Store 1.3.2. It is old, but architecturally it still looks like the systems enterprise architects deal with every day: four interconnected components, JMS message queues, cross-system contracts written in code rather than documented anywhere obvious, and a healthy amount of implicit business logic tucked inside EJBs and message-driven beans. It is the kind of codebase where the agent's understanding of "what is supposed to happen here" matters more than its ability to generate plausible-looking Java.

The task

We asked Bob to modernize the order processing flow. That single workflow touches all four components, crosses two message queues, and is constrained by 13 integration contracts (26 rubric points, scored 0–2 each) that the modernized implementation must preserve. The prompt was the same in both runs and contained no hand-holding beyond what an internal engineer would naturally write.

The conditions

Two runs. Same prompt. Same target codebase. The only variable was whether Bob had access to CoreStory.

In run one, Bob worked alone, with whatever context it could derive from reading the files itself.

In run two, Bob was given an MCP connection to CoreStory and approximately twenty lines of natural-language instructions telling it that CoreStory was available and that it should consult CoreStory before modifying any component. No other changes. No prompt rewrite, no orchestration glue, no hidden chain of thought.

The rubric: 13 integration contracts in three categories

Scoring an agent on "did the code compile" is a low bar. Plenty of code compiles and looks correct while quietly breaking the system around it. So we built the rubric directly from the source code: 13 integration contracts that the modernized implementation must preserve, sorted into three categories.

Compliance rules. Business logic that silently changes if Bob does not know it exists. A regulatory check skipped here is a fine collected six months later.

Data contracts. The exact field structures that upstream and downstream systems depend on. One field renamed and the message queue stops carrying the right payload.

Dual triggers. Multi-step workflows where skipping one output breaks another system. The order is approved but the supplier is never notified.

Each contract is a thing the modernized code must preserve. None of them are negotiable. All of them are invisible to an agent that has not actually read the system.

‍

Results

The headline scorecard

The numbers below are the unedited output of the two runs:

Metric	Bob solo	Bob + CoreStory
Correct implementations identified	16/26	25/26
Contracts correctly implemented in code	16/26	20/26
Production-breaking bugs	3	0
Message handlers converted	3/6	6/6
Data model	In-memory mock	Full JPA with all fields
False confidence claims	2	0
Self-validation loops	0	3
Wall-clock time	8 minutes	15 minutes
Bobcoins (Bob's internal scoring unit)	10.2	20.47

‍

Where the run with CoreStory pulled ahead

The contract delta is the headline number. On its own, Bob correctly handled about 62% of the rubric (16/26). With CoreStory, the correct implementation was identified for about 96% (25/26) — and Bob shipped 77% (20/26) within its bobcoin budget. The gap between identified and shipped is Bob's budget, not CoreStory's intelligence.

The bug delta is the headline outcome. Bob solo declared "100% message compatibility" while shipping three production-breaking defects. Bob + CoreStory shipped zero.

The data-model delta is the most quietly important one. Bob solo built an in-memory mock that made the build pass. Bob + CoreStory built a full JPA model with every field the upstream systems expect. That difference is invisible during demo and catastrophic during cutover.

The wall-clock delta is the only metric where Bob solo "wins" (8 minutes vs 15 minutes). It is worth noting explicitly: the augmented run takes its extra minutes to do work the solo run skipped, namely research before building and verification before declaring done. Both phases prevented bugs. Eight minutes that ship three production defects is not faster than 15 minutes that ship zero.

What Bob solo got wrong

The clean version of this story would be "Bob missed some edge cases." That would not be accurate. The three bugs Bob solo produced were not edge cases. They were the kind of defects that make customers call support and lawyers write letters.

Hardcoded shipping address. Every approved order in the modernized code told the supplier to ship one unit of a default item to "123 Main St, Anytown, CA." Bob solo had filled in placeholders to make the build pass. Every real customer would have received the wrong product at a fictional address.

Compliance safeguard removed. International orders that should have triggered manual review now auto-approved. The original code held all non-US, non-Japan orders for manual review — a $300 order from France, for instance, should never have auto-approved. Bob solo did not see it, did not preserve it, and did not flag the omission. The result is a regulatory violation waiting to be discovered.

Partial shipments marked complete. Status transitions were incorrectly mapped, so partially fulfilled orders were closed out as complete. Inventory tracking broke quietly. Customers were told their order had arrived when it had not.

Each of those would have shipped behind Bob's "100% message compatibility" claim. Each of those was prevented in the CoreStory-augmented run because Bob had access to the actual integration contracts before it started writing replacements.

The behavior change: Bob paused to check its own work

The most interesting finding in this comparison was not a number. It was a behavior.

Midway through the CoreStory-augmented run, Bob paused its own implementation. Without prompting, it sent CoreStory a six-question review checklist. The questions were specific to the code Bob had just written:

"Have I correctly identified all key EJB components?"

"Are all Message-Driven Beans accounted for?"

"Are there critical components I may have missed?"

CoreStory returned gaps. Bob fixed them before declaring done. That is not how chat-based code generation usually works. That is how a senior engineer works.

We have been calling it the pair programming moment. Bob did not just consume CoreStory's intelligence at the start of the task. Bob reached for it again when it needed a second pair of eyes. Research, then build, then verify. That is the cycle every experienced engineer runs without thinking about it, and in this comparison Bob ran it twice without being asked.

Demo vs. foundation: 9 generic tasks became 18 specific ones

The pre-flight plan Bob produced changed shape between runs. In run one, Bob generated a checklist of 9 generic modernization tasks — the kind of plan you would expect from any agent reading a project README. In run two, Bob generated roughly 18 tasks, each tied to a specific component in Java Pet Store. The plan stopped being a checklist and became an implementation map.

This is the structural difference between an agent that pattern-matches against its training data and an agent that has read the actual system. The output of the planning phase tells you almost everything about what the build phase is going to produce.

Interpretation

Three takeaways from the run.

Bob is a strong base agent. On its own, Bob correctly handled 62% of the rubric (16/26) on a non-trivial enterprise codebase in under ten minutes. That is meaningful. Most generic AI agents would do worse on the same task.

CoreStory closed the gap that mattered. The contracts CoreStory surfaced that Bob missed on its own are exactly where modernization programs get stuck — the compliance rules, the cross-system handoffs, the partial-shipment edge cases. Closing that gap is the difference between a demo that worked once and a foundation that survives cutover.

The pair programming moment is the part to watch. Bob's mid-run self-validation, prompted only by the existence of a queryable intelligence layer, is the behavior we want from every AI agent on every enterprise change. It is what the agent knows about the codebase that makes that behavior possible.

The integration: about twenty lines and an MCP connection

The lift from "Bob solo" to "Bob + CoreStory" is unglamorous. There is no platform migration, no rebuild of Bob, no proprietary SDK to learn. The integration is approximately twenty lines of natural-language instructions and an MCP connection.

The configuration tells Bob three things:

CoreStory exists and has tools available.
When to query it (typically before modifying any component).
Answers from CoreStory are cheaper than reading files directly.

That is the entire integration. Bob does not become a different agent. Bob finally has the senior architect in the room.

What this means for enterprise modernization programs

If you are architecting modernization for a large legacy estate, you have already done the math on AI agents. They are cheaper than consultants and faster than humans. They are also confidently wrong in ways that make them dangerous on systems where the cost of a defect is measured in regulator letters, not Jira tickets.

The standard industry rule of thumb is that a production bug costs ten to a hundred times what a development bug costs. In a modernization program running tens of thousands of LLM-generated changes, that asymmetry compounds. The deltas we observed are not subtle: 9 contracts gained, 3 bugs prevented, 3 message handlers correctly converted, an in-memory mock replaced by a full JPA model. Scaled across an enterprise estate, those deltas are program-level outcomes.

Bob is a strong agent for legacy modernization. CoreStory makes it materially better. The combination is what agentic code modernization looks like when it is ready for an enterprise to run at scale.

What to do next

If you are evaluating AI agents for a modernization program, the test is straightforward: pick a workflow that touches your real integration contracts, write the rubric from the source code, run your agent of choice twice — once on its own and once with a persistent code intelligence layer in front of it. The deltas will tell you whether your current setup is a demo or a foundation.

We are happy to run the same methodology on your codebase. Talk to an expert, or try it yourself.

‍

FAQ

Why Java Pet Store and not a customer codebase?

We needed a comparison anyone can reproduce. Java Pet Store is publicly available, architecturally honest, and contains the kind of cross-system contracts that real enterprise modernization has to preserve. The same methodology applies to a customer codebase. Most enterprise estates produce wider gaps than the ones we observed here.

Why was the Bob + CoreStory run slower (15 minutes versus 8)?

Because it did the work. The extra minutes were spent on research before building and on verification before declaring done. Both phases prevented the production-breaking defects the solo run shipped. The wall-clock metric is the only one where the solo run looks favorable, but it stops looking favorable the moment you weight it by defect cost.

Does CoreStory replace Bob?

No. CoreStory is the intelligence layer Bob queries. It works with Bob, and it works with other agents that support MCP — Claude Code, Codex, Copilot, Cursor, Devin, and any other tool-using agent with a similar interface. The agent stays the agent. CoreStory is what the agent talks to before it touches the code.

How long does it take to ingest a codebase into CoreStory?

Ingestion runs once and then maintains itself. For a codebase the size of Java Pet Store, ingestion is measured in minutes. For multi-million-line enterprise estates, the initial pass takes longer, but only the initial pass. Incremental updates are continuous.

Can you reproduce this on our codebase?

Yes. The same methodology — pick a workflow, write a rubric from the source code, run the agent twice, score against the rubric — works on any codebase. The harder question is what to do once you have the results.

Is this only about modernization?

No. The same pattern applies to feature work, bug triage, security audits, and onboarding. Any task where an AI agent needs to understand the existing system before changing it benefits from a persistent code intelligence layer in front of it.