Toward automated verification of unreviewed AI-generated code
I've been wondering what it would take for me to use unreviewed AI-generated code in a production setting.
To that end, I ran an experiment that has changed my mindset from "I must always review AI-generated code" to "I must always verify AI-generated code." By "review" I mean reading the code line by line. By "verify" I mean confirming the code is correct, whether through review, machine-enforceable constraints, or both.
I had a coding agent generate a solution to a simplified FizzBuzz problem. Then, I had it iteratively check its solution against several predefined constraints:
(1) The code must pass property-based tests (see Appendix B for a primer). This constrains the solution space to ensure the requirements are met. This includes tests verifying that no exceptions are raised and tests verifying that latency is sufficiently low.
(2) The code must pass mutation testing (see Appendix C for a primer). Mutation testing is typically used to expand your test suite. However, if we assume our tests are correct, we can instead use it to restrict the code. This constrains the solution space to ensure that only the requirements are met.
(3) The code must have no side effects.
(4) Since I'm using Python, I also enforce type-checking and linting, but a different programming language might not need those checks.
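Check (3) is the trickiest to enforce mechanically. One crude approximation is a denylist-based AST scan; the repo may enforce this differently, and the names `FORBIDDEN_CALLS` and `has_obvious_side_effects` below are illustrative, not taken from it:

```python
import ast

# Assumption: a small denylist is enough for illustration. A real check would
# also need to handle attribute calls, imports, file handles, and more.
FORBIDDEN_CALLS = {"print", "open", "input"}

def has_obvious_side_effects(source: str) -> bool:
    """Return True if the source calls a denylisted builtin or declares globals."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Direct calls to denylisted builtins, e.g. print(...)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                return True
        # A `global x` declaration signals mutation of module-level state
        if isinstance(node, ast.Global):
            return True
    return False
```

For example, `has_obvious_side_effects("print('hi')")` returns True, while a pure fizzbuzz function passes the scan.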
These checks seem sufficient for me to trust the generated code without looking at it. The remaining space of invalid-but-passing programs exists, but it's small and hard to land in by accident.
I was concerned that the generated code would be unmaintainable. However, I'm starting to think that maintainability and readability aren't relevant in this context. We should treat the output like compiled code.
The overhead of setting up these constraints currently outweighs the cost of just reading the code. But it establishes a baseline that can be chipped away at as agents and tooling improve.
The repo fizzbuzz-without-human-review implements these checks in Python, allowing you to try this for yourself.
Appendix B: Primer on property-based testing
Software tests commonly check specific inputs against specific outputs:
def test_returns_fizzbuzz_for_multiples_of_3_and_5() -> None:
    assert fizzbuzz(15) == "FizzBuzz"
    assert fizzbuzz(30) == "FizzBuzz"
Property-based tests run against a wider range of values. The property-based test below (using Hypothesis) runs fizzbuzz with 100 semi-random multiples of both 3 and 5, favoring "interesting" cases like zero or extremely large numbers.
from hypothesis import given, strategies as st

@given(n=st.integers(min_value=1).map(lambda n: n * 3 * 5))
def test_returns_fizzbuzz_for_multiples_of_3_and_5(n: int) -> None:
    assert fizzbuzz(n) == "FizzBuzz"
Compared to testing specific inputs, this approach gives us more confidence that a given "property" of the system holds, at the cost of being slower, nondeterministic, and more complex.
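The same property can be sketched without Hypothesis using only the stdlib random module. This toy version loses Hypothesis's shrinking and its bias toward edge cases, and the fizzbuzz implementation here is an assumed reference version, not the generated one:

```python
import random

def fizzbuzz(n: int) -> str:
    # Assumed reference implementation, for illustration only.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

def test_returns_fizzbuzz_for_multiples_of_3_and_5() -> None:
    rng = random.Random(0)  # seeded so the test stays deterministic
    for _ in range(100):
        n = rng.randint(1, 10**9) * 15  # a random multiple of both 3 and 5
        assert fizzbuzz(n) == "FizzBuzz"
```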
For additional information, the Hypothesis docs are a good starting point.
Appendix C: Primer on mutation testing
Mutation testing tools like mutmut change your code in small ways, like swapping operators or tweaking constants, then re-run your test suite. If your tests fail, the "mutant" code is "killed" (good), and if your tests pass, the mutant "survives" (bad).
As an example, consider the following code:
def double(n: int) -> int:
    print(f"DEBUG n={n}")
    return n * 2

def test_doubles_input() -> None:
    assert double(3) == 6
Mutating print(f"DEBUG n={n}") to print(None) leaves test_doubles_input passing, so the mutant survives. You would fix it by removing the side effect or adding a test for it.
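One way to kill this particular mutant is to test the side effect itself. A sketch using only the stdlib (pytest's capsys fixture would also work):

```python
import contextlib
import io

def double(n: int) -> int:
    print(f"DEBUG n={n}")
    return n * 2

def test_logs_debug_line() -> None:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):  # capture what double() prints
        assert double(3) == 6
    # The print(None) mutant now fails: the captured output no longer matches.
    assert buf.getvalue() == "DEBUG n=3\n"
```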