Concept illustration of MirrorCode, a benchmark testing whether AI agents can autonomously rebuild software from specifications and test suites.
On April 10, METR and Epoch AI published preliminary results from MirrorCode, a benchmark that tests whether AI agents can autonomously reimplement existing software from specifications and test suites alone, without access to the original source code. In the headline result, Claude Opus 4.6 successfully rebuilt gotree – a bioinformatics toolkit comprising approximately 16,000 lines of Go code across more than 40 commands – a task that four independent researchers estimated would take a skilled software engineer between two and seventeen weeks without AI assistance.
Why It Matters
MirrorCode directly addresses a core limitation of existing coding benchmarks, which cap out at tasks completable in minutes or hours. The benchmark’s design forces an AI agent to plan, write, test, and iterate across an entire software project – not patch a single bug or complete a single function. The preliminary results push the measured frontier of autonomous coding well past the 12-hour task horizon that METR had previously established for Claude Opus 4.6 on standard bug-fixing evaluations. Critically, the researchers also reported continued performance gains from inference scaling on larger projects – meaning that adding compute, not new training, can extend the horizon further. That finding has direct implications for the economics of frontier model deployment.
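The plan-write-test-iterate loop described above can be sketched as a simple evaluation harness. This is a hypothetical illustration only – MirrorCode's actual harness has not been released, and the names (`propose`, `revise`, `evaluate`) are invented for this sketch. The key constraint it demonstrates is informational: the agent sees the specification and test outcomes, never the original source.

```python
# Hypothetical sketch of a spec-and-tests reimplementation loop.
# All names here are illustrative, not MirrorCode's real API.

def evaluate(agent, spec, test_suite, max_iterations=10):
    """Let an agent rebuild a program from a spec plus tests,
    without ever seeing the original source code."""
    implementation = agent.propose(spec)      # first attempt, from the spec alone
    for _ in range(max_iterations):
        failures = [t for t in test_suite if not t(implementation)]
        if not failures:
            return implementation, True       # every test passes
        # Feed back only test outcomes – never reference code.
        implementation = agent.revise(implementation, spec, failures)
    return implementation, False              # budget exhausted

# Toy demonstration: "rebuild" an absolute-value function from its tests.
class ToyAgent:
    def propose(self, spec):
        return lambda x: x                    # naive first attempt
    def revise(self, impl, spec, failures):
        return lambda x: abs(x)               # corrected after seeing failures

tests = [lambda f: f(3) == 3, lambda f: f(-3) == 3]
impl, passed = evaluate(ToyAgent(), "return |x|", tests)
```

At project scale, the `revise` step is where inference-time compute matters: more attempts and more test-driven iterations per failure are exactly the scaling knob the researchers report continues to pay off.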
What’s Next
METR and Epoch AI plan to release the full benchmark, with additional target programs and broader model comparisons, in the coming weeks. Top-tier models may already be partially saturating the initial task set, which would force the teams to design harder tasks to keep measuring progress. The practical implication for the software industry is direct: if AI can reliably reimplement a 16,000-line production codebase, the cost of duplicating, porting, or modernizing existing software drops substantially. That affects legacy-migration budgets, open-source competition dynamics, and the economics of software outsourcing at scale.
The benchmark also surfaces a subtle legal risk. Reimplementing software from a specification is generally lawful, but AI-assisted reimplementation at this speed may prompt new litigation around trade secret protections – particularly if models are trained on proprietary codebases before producing clean-room rewrites. That question will land on regulators’ desks before the technology reaches broad enterprise adoption.
