Human code review is still cheaper than fully automated verification

Coding agents are very good at getting test suites to pass, which raises the question of whether sufficiently thorough tests could replace human review. Setting aside ambiguity, which coding agents can't yet be trusted to resolve on their own, verification breaks down into checking two things: that the code generalizes beyond the given tests, and that it does nothing unintended. Both are automatable in principle, but on a to-do list app I built to test this, the overhead exceeded the cost of reading the code.

To verify generalization, I tried holdout testing: the agent builds against one set of inputs, but verification uses secret values it never sees. Property-based testing provides natural holdout coverage for general invariants, but for input/output correctness I needed holdout unit tests. With production-ready tooling, these would be only marginally harder to write than regular unit tests, but that tooling doesn't exist yet. My prototype required solving non-trivial problems like keeping holdout values out of the agent's context.
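The two layers can be sketched as follows. This is a minimal illustration, not the prototype described above: `add_item` is a hypothetical stand-in for the agent's implementation, and an environment variable stands in for the secret channel that keeps holdout values out of the agent's context.

```python
import json
import os
import random
import string

# --- stand-in for the agent's implementation (hypothetical API) ---
def add_item(todos, title):
    todos.append({"id": len(todos), "title": title, "done": False})
    return todos[-1]["id"]

# --- property-based layer: invariants over generated inputs ---
# These cover general behavior without fixing specific values,
# so they act as holdout coverage for free.
def check_invariants(trials=200, seed=0):
    rng = random.Random(seed)
    todos = []
    for _ in range(trials):
        title = "".join(rng.choices(string.printable,
                                    k=rng.randint(0, 20)))
        item_id = add_item(todos, title)
        assert todos[item_id]["title"] == title  # any title round-trips
    # ids stay unique and sequential no matter what was inserted
    assert [t["id"] for t in todos] == list(range(trials))
    return True

# --- holdout layer: secret input/output cases ---
# In a real setup the cases would live in a file the agent cannot
# read; the env var here is a stand-in for that channel.
def run_holdout(default='[{"title": "buy milk"}, {"title": "walk dog"}]'):
    cases = json.loads(os.environ.get("HOLDOUT_CASES", default))
    todos = []
    for case in cases:
        item_id = add_item(todos, case["title"])
        assert todos[item_id] == {"id": item_id,
                                  "title": case["title"], "done": False}
    return len(cases)
```

The property layer runs in the agent's loop; only the holdout layer needs isolation, which is where the tooling overhead concentrates.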

Unintended behavior is the second check. An agent implementing a login endpoint might, say, send credential-bearing payloads to a third-party logging service. Containerization and network access restrictions catch this class of problem, but the setup overhead was substantial. Mutation testing flags the logging call as untested code, but it's prohibitively expensive: even on the tiny to-do list app, it took five times as long as running the test suite, and runtime scales poorly with codebase size.
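A toy statement-deletion mutator shows both the mechanism and the cost: delete each statement in turn, rerun the suite, and any mutant the tests fail to kill marks behavior no test observes. The names below (`login`, the `sent` list standing in for a logging call) are hypothetical, not from the original app, and real mutation tools apply many operators per statement, which is where the 5x runtime comes from.

```python
import ast
import copy

SOURCE = """
def login(user, password, sent):
    ok = password == "hunter2"
    sent.append((user, password))  # leaks credentials to a log sink
    return ok
"""

def run_tests(namespace):
    """The agent-written suite: checks the return value only."""
    login = namespace["login"]
    return (login("alice", "hunter2", []) is True
            and login("alice", "wrong", []) is False)

def surviving_mutants(source):
    tree = ast.parse(source)
    body = tree.body[0].body  # the statements of login()
    survivors = []
    for i in range(len(body)):
        mutant = copy.deepcopy(tree)
        del mutant.body[0].body[i]  # mutate: drop one statement
        ns = {}
        try:
            exec(compile(mutant, "<mutant>", "exec"), ns)
            if run_tests(ns):  # suite still passes: mutant survives
                survivors.append(ast.unparse(body[i]))
        except Exception:
            pass  # mutant killed by a crash
    return survivors

print(surviving_mutants(SOURCE))
```

Deleting the credential-leaking line leaves every test green, so it surfaces as a surviving mutant; deleting the other two statements breaks the suite. The expense is structural: each mutant costs a full test run, and the number of mutants grows with the codebase.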

Code review is cheap relative to all of this. That will likely hold for some time, because the same model improvements that make agent-written code more trustworthy also make it easier to review. A quick "is this code doing anything insane?" pass may increasingly suffice.