The Mirror in the Machine

AI doesn't hallucinate differently than we do — it hallucinates like us

By Samuel S. Kim
March 31, 2026

The failures we attribute to AI-assisted development (hallucination, memory drift, compounding errors) turn out to be disturbingly familiar. They are the same failures we've always built our professions around catching in ourselves.

In the spring of 2025, a software engineer named Domenic Denicola sat down to work on jsdom, an open-source project he had maintained for over a decade: more than a million lines of code that attempt to reproduce a web browser engine in JavaScript. He was participating in a study. For some tasks, he was allowed to use the latest AI coding tools. For others, he worked the way he always had: alone with the code, holding the architecture in his head, tracing logic through layers of abstraction he had written and rewritten over thirteen years. When the study ended, the developers, Denicola among them, estimated that AI had made them about twenty percent faster. The data said otherwise. Across 246 tasks, developers using AI tools took nineteen percent longer to finish their work. They had not merely failed to speed up. They had slowed down, and hadn't noticed.[1]

The study, a randomized controlled trial conducted by the nonprofit research organization METR, was one of several recent findings that have fueled a growing alarm in the software industry. GitClear's analysis of 211 million changed lines of code found that duplicated code blocks increased eightfold between 2020 and 2024, while refactored code (the unglamorous work of improving what already exists) declined by sixty percent.[2] Google's 2024 DORA report observed that increased AI usage quickened code reviews but reduced delivery stability.[3] Ox Security examined three hundred repositories and found ten recurring anti-patterns in eighty to one hundred percent of the AI-generated code.[4] The critics' conclusion has hardened into something approaching consensus: AI-assisted development is generating an unprecedented wave of technical debt, and the codebase of the future will be unmaintainable.

They are not entirely wrong about the debt. But they have drawn the wrong conclusion about the debtor. Because when you examine the specific failures attributed to generative AI — the memory lapses, the confident fabrications, the slow accumulation of errors no one catches — what you find is not a machine with novel pathologies. What you find is a mirror.

In 1974, the psychologist Elizabeth Loftus showed participants a short film of a car accident, then asked some of them how fast the vehicles were going when they "hit" each other. Others were asked the same question with a different verb: "smashed." Those who heard "smashed" estimated higher speeds. A week later, they were more likely to report having seen broken glass at the scene. There had been no broken glass.[5] Over the next five decades, Loftus and her colleagues demonstrated, in study after study, that human memory is not a recording but a reconstruction, one that can be contaminated by suggestion, shaped by expectation, and overwritten by later information without the person ever noticing. In one experiment involving over eight hundred military personnel in a high-stress mock prisoner-of-war exercise, more than half of those shown a misleading photograph identified the wrong person as their interrogator, and did so with confidence.[6]

The technology industry calls it "hallucination" when AI generates plausible but false information, as though the tendency were some novel defect unique to large language models. The term has become a kind of indictment, a reason to distrust everything the machine produces. But Loftus's work, now spanning half a century and replicated across cultures, ages, and conditions, established long ago that human beings do precisely the same thing. We confabulate. We fill gaps with plausible fictions. We defend false memories with the certainty of eyewitnesses, and we do it so naturally that entire professions — fact-checking, peer review, cross-examination, code review — exist for no other reason than to catch us at it. That a system modeled, however loosely, on the pattern-matching and probabilistic reasoning of brains should also produce confident errors is not a scandal. It is an inheritance. The question was never whether AI would hallucinate. It was always whether we would build the verification infrastructure to catch it when it does.

The same parallel holds for memory. Generative AI has no persistent recall; each request arrives as though nothing came before it. During complex, iterative development, the accumulated context must be compressed, a process engineers call compaction, so the model can continue within its limited context window. Compaction works, until it doesn't: details vanish, nuance flattens, and the implementation drifts from the original intent. Neuroscience has a name for the biological version of this process: memory consolidation. During slow-wave sleep, the brain replays the day's experiences, transferring episodic memories from the hippocampus to the neocortex for long-term storage while filtering noise, extracting patterns, and integrating fragments into existing schemas.[7] The process is a compression, and like all compressions, it is lossy. We wake each morning with a version of yesterday, not a transcript of it.[8] We have known this about ourselves for a long time, which is why we take notes, write specifications, build test suites, and conduct design reviews. Not because we distrust one another, but because we understand that memory, human or artificial, is a medium that degrades. The remedy when compaction causes drift is the one we already know: go back to the original specification, compare it to the implementation, and test for discrepancies. The process is not new. Only the partner is.
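
That remedy can be made concrete. The sketch below is a hypothetical illustration, not a tool from any of the studies cited here: it writes a small specification down as named checks, runs them against an implementation that has quietly lost one requirement, and reports what drifted. The `SPEC` table, the `slugify_drifted` function, and the twenty-character limit are all invented for the example.

```python
from typing import Callable

# A function under test: takes a string, returns a string.
Impl = Callable[[str], str]

# Hypothetical original specification, written down as named, executable checks.
SPEC: dict[str, Callable[[Impl], bool]] = {
    "lowercases input": lambda f: f("Hello") == "hello",
    "replaces spaces with hyphens": lambda f: f("a b") == "a-b",
    "truncates to 20 characters": lambda f: len(f("x" * 50)) <= 20,
}


def slugify_drifted(text: str) -> str:
    """An implementation that lost the length limit somewhere along the way."""
    return text.lower().replace(" ", "-")


def find_discrepancies(impl: Impl) -> list[str]:
    """Return the names of the specification checks the implementation fails."""
    return [name for name, check in SPEC.items() if not check(impl)]


print(find_discrepancies(slugify_drifted))  # ['truncates to 20 characters']
```

The point of keeping the specification executable is that it does not depend on anyone, human or model, remembering what was originally intended.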

None of this means the critics are wrong to worry. They are performing a necessary service. But the resistance runs deeper than technical critique, and it deserves to be named.

For many software engineers, the craft of building systems by hand is not merely a skill but an identity. They have spent careers learning to hold complex architectures in their heads, to trace bugs through layers of abstraction, to refactor with a precision that comes only from years of living inside a codebase. The METR finding, that AI introduced cognitive overhead which actually slowed expert developers on familiar codebases, resonates with their lived experience.[1] These are people who can think faster than they can type. An AI assistant, for them, is less a co-pilot than a passenger who keeps reaching for the steering wheel.

And the concern goes beyond productivity. A recent randomized controlled trial found that software engineers who relied on AI assistance while learning a new programming library scored seventeen percent lower on comprehension tests afterward, with the steepest declines in debugging, the very skill required to catch AI's mistakes.[9] If the people maintaining a system cannot explain why decisions were made or how components interact, the system becomes unmaintainable regardless of how cleanly the code compiles. The industry has begun calling this "comprehension debt," and it may prove more dangerous than the duplicated code blocks that make the headlines.

Beneath all of it lies a question the industry has been reluctant to name: if AI can generate functional code faster than a human can type it, what becomes of the career that was built on being the person who types it? That question does not lend itself to a tidy answer. The displacement is real. The grief that comes with watching a craft get commoditized is real. And the instinct to reject the tool that threatens to make you unnecessary is as old as the loom-smashing weavers of the early nineteenth century.

But the weavers did not stop the loom. What they eventually learned — what every displaced craft has eventually learned — is that the value migrated, from the act of weaving to the judgment of what to weave and how well.

The critics are right that AI-generated code, left ungoverned, compounds technical debt at an alarming rate. The emphasis belongs on "ungoverned." A systematic review of sixty-one studies on technical debt in AI-driven systems concluded that sustainable development demands continuous, cross-functional management, not periodic cleanups that lose to feature pressure every quarter.[10] In practice, the solutions are not exotic: architectural constraints defined before AI writes a single line, quality gates that catch what developers might miss, reviews that pair automated speed with human judgment. It is the same discipline that good engineering organizations have always practiced. AI simply raises the tempo enough to make optional practices mandatory.
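
A quality gate of this kind need not be elaborate. As a minimal sketch, assuming a team decides to block changes that repeat the same run of lines, the following hypothetical check counts repeated blocks and fails when they exceed a threshold. The function names, the five-line window, and the zero-tolerance default are illustrative assumptions rather than anything GitClear or the other cited sources prescribe.

```python
from collections import Counter


def duplicated_blocks(source: str, window: int = 5) -> int:
    """Count distinct runs of `window` consecutive non-blank lines that appear more than once."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    # Slide a fixed-size window over the file and tally each run of lines.
    windows = Counter(
        tuple(lines[i:i + window]) for i in range(len(lines) - window + 1)
    )
    return sum(1 for count in windows.values() if count > 1)


def quality_gate(source: str, max_duplicates: int = 0, window: int = 5) -> bool:
    """Return True when the change may proceed, False when it should go back for review."""
    return duplicated_blocks(source, window) <= max_duplicates


# Hypothetical change in which the same five lines were pasted twice.
block = "\n".join(["a = 1", "b = 2", "c = 3", "d = 4", "e = 5"])
print(quality_gate(block))                 # True
print(quality_gate(block + "\n" + block))  # False
```

A check this crude would never replace a reviewer. Its job is only to make sure the reviewer sees the duplication before it merges.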

Somewhere tonight, a software engineer will close a laptop after a long day of reviewing AI-generated pull requests — code that compiles, passes tests, and solves the problem it was asked to solve, but that no one on the team can fully explain. The unease will follow that engineer home, settle into the quiet hours, and take the shape of a question with no comfortable answer: whether the craft is changing into something better, or merely something faster. It is the right question — and the willingness to keep asking it, rather than retreating into either uncritical enthusiasm or reflexive refusal, may matter more than the answer.

Footnotes

  1. Becker, J., Rush, N., Barnes, E., & Rein, D., "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," METR, July 2025. Randomized controlled trial with 16 developers across 246 tasks. Denicola's first-person account published at domenic.me, July 15, 2025. Full study available at: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

  2. GitClear, "AI Copilot Code Quality 2025: Look Back at 12 Months of Data," 2025. Analysis of 211 million changed lines of code from 2020–2024 across repositories at Google, Microsoft, Meta, and enterprise C-Corps.

  3. Google, "2024 Accelerate State of DevOps Report (DORA)," 2024. Found that a 25% increase in AI usage quickened code reviews but resulted in a 7.2% decrease in delivery stability.

  4. Ox Security, "Army of Juniors: The AI Code Security Crisis," October 2025. Review of 300 open-source projects, 50 wholly or partially AI-generated, identifying ten architectural and security anti-patterns.

  5. Loftus, E.F. & Palmer, J.C., "Reconstruction of automobile destruction: An example of the interaction between language and memory," Journal of Verbal Learning and Verbal Behavior, 13(5), 585–589, 1974.

  6. Morgan, C.A. III, Southwick, S., Steffian, G., Hazlett, G.A., & Loftus E.F., "Misinformation can influence memory for recently experienced, highly stressful events," International Journal of Law and Psychiatry, 36(1), 11–17, 2013.

  7. Klinzing, J.G., Niethard, N., & Born, J., "Mechanisms of systems memory consolidation during sleep," Nature Neuroscience, 22, 1598–1610, 2019.

  8. Born, J. & Wilhelm, I., "System consolidation of memory during sleep," Psychological Research, 76, 192–203, 2012.

  9. Shen, J.H. & Tamkin, A., "How AI Impacts Skill Formation," arXiv preprint 2601.20245, produced through the Anthropic Safety Fellows Program, January 2026. Corroborated by Jošt, Taneski, & Karakatič (University of Maribor), a peer-reviewed 2024 study in Applied Sciences finding near-identical results with undergraduate developers.

  10. Lenarduzzi, V. et al., "The Evolution of Technical Debt from DevOps to Generative AI: A multivocal literature review," Journal of Systems and Software, 2025. Systematic analysis of 61 primary sources on technical debt management in AI-driven systems.

Tags

AI, Software Engineering, Cognitive Science, Leadership, Technology
