Toward automated verification of unreviewed AI-generated code
I've been wondering what it would take for me to use unreviewed AI-generated code in a production setting.
To that end, I ran an experiment that has changed my mindset from "I must always review AI-generated code" to "I must always verify AI-generated code." By "review" I mean reading the code line by line. By "verify" I mean confirming the code is correct, whether through review, machine-enforceable constraints, or both.
I had a coding agent generate a solution to a simplified FizzBuzz problem. Then, I had it iteratively check its solution against several predefined constraints:
(1) The code must pass property-based tests (see Appendix B for a primer). This constrains the solution space to ensure the requirements are met. This includes tests verifying that no exceptions are raised and tests verifying that latency is sufficiently low.
(2) The code must pass mutation testing (see Appendix C for a primer). Mutation testing is typically used to expand your test suite. However, if we assume our tests are correct, we can instead use it to restrict the code. This constrains the solution space to ensure that only the requirements are met.
(3) The code must have no side effects.
(4) Since I'm using Python, I also enforce type-checking and linting, but a different programming language might not need those checks.
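Check (3) is the trickiest to enforce mechanically. One crude approximation is a denylist-based AST scan; the repo may enforce this differently, and the names `FORBIDDEN_CALLS` and `has_obvious_side_effects` below are illustrative, not taken from it:

```python
import ast

# Assumption: a small denylist is enough for illustration. A real check would
# also need to handle attribute calls, imports, file handles, and more.
FORBIDDEN_CALLS = {"print", "open", "input"}

def has_obvious_side_effects(source: str) -> bool:
    """Return True if the source calls a denylisted builtin or declares globals."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Direct calls to denylisted builtins, e.g. print(...)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                return True
        # A `global x` declaration signals mutation of module-level state
        if isinstance(node, ast.Global):
            return True
    return False
```

For example, `has_obvious_side_effects("print('hi')")` returns True, while a pure fizzbuzz function passes the scan.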
These checks seem sufficient for me to trust the generated code without looking at it. The remaining space of invalid-but-passing programs exists, but it's small and hard to land in by accident.
I was concerned that the generated code would be unmaintainable. However, I'm starting to think that maintainability and readability aren't relevant in this context. We should treat the output like compiled code.
The overhead of setting up these constraints currently outweighs the cost of just reading the code. But it establishes a baseline that can be chipped away at as agents and tooling improve.
The repo fizzbuzz-without-human-review implements these checks in Python, allowing you to try this for yourself.
Appendix B: Primer on property-based testing
Software tests commonly check specific inputs against specific outputs:
def test_returns_fizzbuzz_for_multiples_of_3_and_5() -> None:
    assert fizzbuzz(15) == "FizzBuzz"
    assert fizzbuzz(30) == "FizzBuzz"
Property-based tests run against a wider range of values. The property-based test below (using Hypothesis) runs fizzbuzz with 100 semi-random multiples of both 3 and 5, favoring "interesting" cases like zero or extremely large numbers.
from hypothesis import given, strategies as st

@given(n=st.integers(min_value=1).map(lambda n: n * 3 * 5))
def test_returns_fizzbuzz_for_multiples_of_3_and_5(n: int) -> None:
    assert fizzbuzz(n) == "FizzBuzz"
Compared to testing specific inputs, this approach gives us more confidence that a given "property" of the system holds, at the cost of being slower, nondeterministic, and more complex.
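The same property can be sketched without Hypothesis using only the stdlib random module. This toy version loses Hypothesis's shrinking and its bias toward edge cases, and the fizzbuzz implementation here is an assumed reference version, not the generated one:

```python
import random

def fizzbuzz(n: int) -> str:
    # Assumed reference implementation, for illustration only.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

def test_returns_fizzbuzz_for_multiples_of_3_and_5() -> None:
    rng = random.Random(0)  # seeded so the test stays deterministic
    for _ in range(100):
        n = rng.randint(1, 10**9) * 15  # a random multiple of both 3 and 5
        assert fizzbuzz(n) == "FizzBuzz"
```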
For additional information, the Hypothesis docs are a good starting point.
Appendix C: Primer on mutation testing
Mutation testing tools like mutmut change your code in small ways, like swapping operators or tweaking constants, then re-run your test suite. If your tests fail, the "mutant" code is "killed" (good), and if your tests pass, the mutant "survives" (bad).
As an example, consider the following code:
def double(n: int) -> int:
    print(f"DEBUG n={n}")
    return n * 2

def test_doubles_input() -> None:
    assert double(3) == 6
Mutating print(f"DEBUG n={n}") to print(None) leaves test_doubles_input passing, so the mutant survives. You would fix it by removing the side effect or adding a test for it.
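One way to kill this particular mutant is to test the side effect itself. A sketch using only the stdlib (pytest's capsys fixture would also work):

```python
import contextlib
import io

def double(n: int) -> int:
    print(f"DEBUG n={n}")
    return n * 2

def test_logs_debug_line() -> None:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):  # capture what double() prints
        assert double(3) == 6
    # The print(None) mutant now fails: the captured output no longer matches.
    assert buf.getvalue() == "DEBUG n=3\n"
```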