Artificial Intelligence

Deep Dive: How CoreStory Improves Benchmark Performance for Coding Agents

By John Bender

Understanding Why CoreStory Improves Agent Performance on the Hardest SWE-bench Verified Tasks

Study Period: December 2025
Benchmark: SWE-bench Verified — Hard Mode (45 tasks)
Agents Evaluated: Claude Code, Cursor, GitHub Copilot, Codex, Droid, Devin
Evaluation Design: Controlled A/B comparison (baseline agent vs. CoreStory-augmented agent)

Executive Summary

The following is a controlled, cross-agent evaluation of CoreStory, a code intelligence platform that provides architectural context and specifications to AI coding agents via the Model Context Protocol (MCP).
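For orientation, the sketch below shows roughly how an agent's MCP configuration might register a CoreStory server. The server name, command, and flags are illustrative assumptions, not documented CoreStory settings; only the general mcpServers structure follows the usual MCP client-configuration convention.

```python
# Hypothetical sketch: registering a CoreStory MCP server in an agent's MCP
# client configuration. The "corestory" entry, command, and flags are invented
# for illustration; only the overall mcpServers shape follows MCP convention.
mcp_config = {
    "mcpServers": {
        "corestory": {
            "command": "corestory-mcp",                  # hypothetical launcher
            "args": ["--repo", "/path/to/target/repo"],  # hypothetical flag
        }
    }
}
```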

Across six leading AI coding agents and 45 of the hardest tasks in SWE-bench Verified, CoreStory consistently and materially improved agent performance. All agents showed measurable uplift when augmented with CoreStory, with relative improvements ranging from +14% to +44% and an average relative uplift of ~25%.

More importantly, this evaluation identifies why those improvements occur. CoreStory does not primarily help agents debug failing implementations. Instead, it prevents agents from pursuing incorrect solution paths in the first place by supplying architectural and semantic context that is not recoverable from local code inspection alone.

These results are consistent across agents, repositories, and task types, indicating that CoreStory addresses systematic, model-agnostic failure modes in AI-assisted software maintenance.

1. Results Overview

1.1 Performance Summary

Agent          | Baseline      | CoreStory-Augmented | Absolute Gain | Relative Uplift
Claude Code    | 25 / 45 (56%) | 36 / 45 (80%)       | +11           | +44%
Cursor         | 17 / 45 (38%) | 23 / 45 (51%)       | +6            | +35%
GitHub Copilot | 28 / 45 (62%) | 35 / 45 (78%)       | +7            | +25%
Codex          | 29 / 45 (64%) | 34 / 45 (76%)       | +5            | +17%
Droid          | 35 / 45 (80%) | 40 / 45 (91%)       | +5            | +14%
Devin          | 37 / 45 (82%) | 42 / 45 (93%)       | +5            | +14%

Key observation: Performance gains are universal across agents with very different architectures, autonomy levels, and underlying models.

1.2 Aggregate Statistics

Metric                        | Value
Total red → green flips       | 39
Unique tasks with ≥1 flip     | 24 / 45 (53%)
Average relative uplift       | +25%
Median relative uplift        | +21%
Tasks with ≥3 agents flipping | 3
Tasks with 2 agents flipping  | 6

The fact that over half of the task set saw at least one agent flip from failure to success indicates that CoreStory’s impact is not limited to edge cases.

2. Baseline Capability vs. Uplift

2.1 Non-Linear Relationship

The relationship between baseline agent capability and CoreStory uplift is non-monotonic: the largest relative uplift (+44%, Claude Code) occurs in the middle of the capability range, not at the bottom.

Agents with mid-range baseline performance benefit the most from architectural context injection.

Capability Zone | Agents                      | Avg Baseline | Avg Relative Uplift
Low baseline    | Cursor                      | 38%          | +35%
Mid baseline    | Claude Code, Copilot, Codex | 61%          | +29%
High baseline   | Droid, Devin                | 80%          | +14%

2.2 Implication for Teams

CoreStory provides maximum leverage when agents are:

  • capable enough to implement correct solutions
  • but insufficiently oriented in unfamiliar architectures

This range (roughly 50–70% baseline success) is where most production-deployed AI coding agents operate today.

3. Cross-Agent Convergence as Validation

A key strength of this evaluation is cross-agent convergence: independent agents making the same incorrect assumptions — and receiving the same corrective guidance from CoreStory.

3.1 Multi-Agent Flip Tasks

Task         | Agents                    | Architectural Failure Mode
sphinx-8548  | Claude Code, Droid, Codex | Pipeline stage error
django-15503 | Claude Code, Devin        | Hidden dependency
sphinx-9461  | Claude Code, Devin        | Descriptor protocol
sympy-12489  | Claude Code, Devin        | Call-site enumeration
sphinx-11510 | Claude Code, Droid        | Nested state machine

When multiple agents converge on the same wrong solution, the failure is not model-specific — it is architectural.

3.2 Deep Dive: sphinx-8548

Symptom: Missing docstrings for inherited attributes

Observed agent behavior: All three agents initially attempted to fix docstring retrieval logic.

Actual root cause: Attributes were never collected during member enumeration.

This is a classic pipeline misattribution error:

  • downstream symptom
  • upstream cause

CoreStory redirected all three agents to the same upstream fix (get_class_members()), preventing hours of wasted implementation effort.
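The simplified sketch below illustrates the shape of this failure; it is not the actual Sphinx autodoc code, and everything apart from the get_class_members() name is invented for illustration. The downstream docstring stage looks broken, but the defect is that inherited attributes never enter the pipeline at all.

```python
# Simplified illustration of a pipeline misattribution error (not actual Sphinx code).

# Stage 1 (upstream): enumerate the members of a class.
def get_class_members(cls, include_inherited=False):
    members = {}
    bases = cls.__mro__ if include_inherited else (cls,)
    for klass in reversed(bases):   # more-derived classes win on name clashes
        members.update(vars(klass))
    # Illustrative bug: no caller ever passes include_inherited=True, so
    # inherited attributes are silently dropped before any later stage runs.
    return members

# Stage 2 (downstream): look up docstrings for the members found in stage 1.
def get_docstrings(members):
    return {name: getattr(obj, "__doc__", None) for name, obj in members.items()}

# The symptom ("missing docstrings for inherited attributes") surfaces in stage 2,
# so agents patch get_docstrings(); the correct fix is upstream in stage 1.
```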

4. Value Mechanism Taxonomy

Analysis of all 39 “flips” reveals five recurring mechanisms.

4.1 Mechanism Distribution

Mechanism                         | % of Flips
Wrong solution prevention         | 72%
Hidden dependency discovery       | 46%
Multi-location fix identification | 41%
Semantic distinction revelation   | 36%
Existing pattern reference        | 21%

These mechanisms frequently overlap within a single task.

5. Primary Mechanisms

5.1 Wrong Solution Prevention (72%)

Failure mode:
Agents fix symptoms instead of causes when architectural boundaries are unclear.

Why models fail here:
Local reasoning is rational given limited architectural visibility and available context.

What CoreStory supplies:
Pipeline-level orientation — where data enters, flows, and is filtered.

Result:
Agents abandon incorrect approaches before implementation.

This is the dominant source of CoreStory’s performance gains.

5.2 Hidden Dependency Discovery (46%)

Failure mode:
Non-local coupling causes changes in one component to affect unrelated behavior.

Why models fail here:
Dependencies are implicit, undocumented, and often instantiated indirectly.

What CoreStory supplies:
Semantic dependency graphs that expose propagation paths.

Example: django-15503
With CoreStory, both agents independently discovered that KeyTransformIsNull internally constructs HasKey, a dependency invisible via local inspection.
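A minimal sketch of that kind of coupling is shown below. It mirrors the shape of the dependency rather than the actual Django source; apart from the KeyTransformIsNull and HasKey names cited above, the methods and generated SQL are invented for illustration.

```python
# Simplified illustration of a hidden dependency (not the actual Django source).
class HasKey:
    """Lookup that checks whether a JSON column contains a given key."""
    def __init__(self, lhs, key_name):
        self.lhs, self.key_name = lhs, key_name

    def as_sql(self):
        return f"JSON_HAS_KEY({self.lhs}, %s)", [self.key_name]


class KeyTransformIsNull:
    """Lookup behind field__key__isnull=... on JSON fields."""
    def __init__(self, lhs, key_name, rhs):
        self.lhs, self.key_name, self.rhs = lhs, key_name, rhs

    def as_sql(self):
        if not self.rhs:
            # Hidden coupling: __isnull=False is rewritten as a HasKey lookup,
            # so any change to HasKey silently changes this lookup as well.
            return HasKey(self.lhs, self.key_name).as_sql()
        return f"NOT JSON_HAS_KEY({self.lhs}, %s)", [self.key_name]
```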

5.3 Multi-Location Fix Identification (41%)

Failure mode:
Partial fixes in distributed logic lead to fragile or inconsistent behavior.

What CoreStory supplies:
Complete enumeration of all participating locations.

Example: sympy-12489
A correct fix required coordinated changes across 15+ locations.
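The pattern typically looks like the hypothetical sketch below (class and method names are invented; this is not the actual SymPy code): result construction is hard-coded at many call sites, so the fix only holds once every call site is updated in a coordinated way.

```python
# Hypothetical illustration of a fix distributed across many call sites.
class Perm:
    def __init__(self, mapping):
        self.mapping = list(mapping)

    def inverse(self):
        inv = [0] * len(self.mapping)
        for i, j in enumerate(self.mapping):
            inv[j] = i
        return Perm(inv)            # hard-coded constructor: call site 1 of many

    def compose(self, other):
        return Perm([self.mapping[j] for j in other.mapping])   # call site 2 of many

# Changing construction behavior (e.g. returning type(self)(...) so subclasses
# survive) in only one method leaves every other call site inconsistent; a correct
# fix requires enumerating and updating all of them together.
```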

5.4 Semantic Distinction Revelation (36%)

Failure mode:
Agents confuse constructs with identical syntax but different semantics.

Examples:

  • array index vs. object key
  • replace vs. optimize-through semantics
  • alias vs. annotate behavior

CoreStory clarifies semantic intent, not just structure.
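A minimal Python example of the first distinction: subscripts that look identical but mean different things depending on whether the container is a JSON array or a JSON object.

```python
# Identical-looking subscripts with different semantics (illustrative example).
import json

doc = json.loads('{"items": ["a", "b"], "0": "object value"}')

as_array_index = doc["items"][0]   # positional access into an array  -> "a"
as_object_key = doc["0"]           # string-key lookup into an object -> "object value"

# A patch that treats every numeric-looking subscript as an array index (or the
# reverse) still runs, but silently reads the wrong value.
assert as_array_index == "a"
assert as_object_key == "object value"
```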

6. Efficiency Impact

6.1 Time Reduction (Agent-Reported)

Agent       | Estimated Reduction
Claude Code | 70–80%
Droid       | 60–70%
Copilot     | 60–83%
Cursor      | 60–75%
Devin       | 50–70%
Codex       | 40–50%

Higher-baseline agents show smaller marginal gains, consistent with ceiling effects. Because agents cannot reliably translate their work into wall-clock time, these figures are best read as relative reductions in implementation effort and token expenditure rather than as precise time savings.

6.2 Files Avoided

Across 39 flips, agents avoided reading an estimated 300–500 files, replacing exploratory code archaeology with targeted architectural queries.

7. The “Senior Engineer Effect”

Across all agents, CoreStory was independently described as equivalent to “a senior engineer giving a five-minute architectural walkthrough.”

This reflects the specific class of context CoreStory provides:

  • system boundaries
  • data flow
  • implicit contracts
  • historical patterns
  • known architectural traps

8. Implications for Adoption

When CoreStory Delivers Highest Value

  • Large, unfamiliar codebases
  • Pipeline-heavy frameworks
  • Multi-component bugs
  • Mid-capability agents (50–70% baseline)

When CoreStory Delivers Lower Value

  • Trivial, single-file changes
  • Extremely high-baseline agents
  • Small, fully documented systems

9. Conclusion

This evaluation demonstrates that architectural context injection is a first-order driver of AI coding agent performance on complex software maintenance tasks.

CoreStory:

  • improves success rates by an average relative uplift of ~25%
  • prevents predictable, systematic failure modes
  • operates consistently across agents and domains
  • requires no changes to the agents themselves

For teams deploying AI coding agents on real production systems, these results indicate that architectural intelligence is not an optimization layer — it is foundational infrastructure.

Ready to drive better performance from your coding agents with CoreStory? Join our self-service waitlist.

John Bender is the Director of Product Marketing at CoreStory.