We Thought Our AI Reviews Were 98.6% Valid. Independent Validation Said 69%.

7 min read
AICode ReviewOpen SourceValidation

We Thought Our AI Reviews Were 98.6% Valid. Independent Validation Said 69%.

The most dangerous thing about AI-augmented work isn't the errors. It's thinking you're not making them.

I ran 449 AI-assisted code reviews on OCA (Odoo Community Association) open source PRs in 9 days. When I had the AI assess its own review quality, it said 98.6% valid. When I ran independent validation, the number dropped to 68.9%. The validation used 40 separate AI instances, each reading the actual code diffs and verifying every technical claim.

That 30-point gap should concern anyone using AI for serious work.

The experiment

Between February 24 and March 4, 2026, I reviewed 449 unique pull requests across 6 OCA repositories using AI-assisted workflows. Each PR got a full technical review: architecture assessment, bug identification, security analysis, test coverage evaluation. The output was structured code review comments posted directly to GitHub.

For scale: OCA's most prolific human reviewer has done 2,197 unique PR reviews over 9.5 years. My campaign produced 449 in 9 days.

The reviews weren't rubber-stamps either. They found real security vulnerabilities (portal sudo bypass, cross-record token exposure, getattr traversal), caught bugs that multiple human reviewers missed, and provided actionable technical feedback that PR authors implemented.

But how good were they really?

Round 1: The self-assessment trap

My first attempt at validation was obvious: have AI evaluate the reviews. I fed each review to an evaluator and asked "Is this review technically valid?"

Result: 98.6% valid.

This number is worthless.

The evaluator was reading the review text - not the actual code. It was checking whether the review sounded plausible, not whether the claims matched reality. A review that confidently describes a pre_init_hook performance pattern scores well on plausibility. The fact that no pre_init_hook exists anywhere in the 7,362-line diff? The evaluator had no way to know.

This is the fundamental problem with self-assessment. AI evaluating AI-generated text is pattern-matching for coherence, not verifying truth. It's the equivalent of grading your own exam by checking whether your handwriting is neat.

I discarded the entire Round 1 dataset.

Round 2: Independent validation against actual code

Round 2 used a different approach. I dispatched 40 independent AI instances (I call them "subclauds"), each assigned to a single PR. Each one:

  1. Retrieved the actual PR diff using gh pr diff
  2. Read every technical claim in the review
  3. Independently verified each claim against the real code
  4. Classified the review as VALID, PARTIALLY VALID, RUBBER-STAMP, or INVALID - with evidence

The key difference: validators had the ground truth. They weren't evaluating whether the review sounded right. They were checking whether each claim matched the code.

The results

CategoryCount%
Fully valid30368.9%
Partially valid9722.0%
Rubber-stamp337.5%
Invalid40.9%

68.9% fully valid. Combined with partially valid: 90.9%.

Not terrible. But 30 points below what self-assessment reported.

What "partially valid" means

Most partially valid reviews had genuinely correct observations but missed important issues in the diff. A review might correctly identify three concerns but miss a critical fourth one. The feedback it gave was real - it just wasn't complete.

What "rubber-stamp" means

33 reviews (7.5%) approved PRs without evidence of reading the diff. These are the reviews that said "LGTM, CI green" on a 3,500-line new module with no inline comments. One approved a security-sensitive portal module with zero tests and gave it three words.

What "invalid" means

Four reviews were factually wrong at their core:

  • One described bugs from version 16.0 in an 18.0 migration review. The variable names it cited don't exist in the diff.
  • One approved a fix that a maintainer corrected 15 minutes later.
  • One flagged features as "missing" that were intentionally removed per months of community discussion.
  • One requested changes for a state value that already exists correctly in the code.

The fabrication problem

Out of roughly 2,000 total claims across 440 validated PRs, 4 were fabricated. Less than 0.2%.

But each fabrication is instructive:

  1. Phantom pattern: Described a pre_init_hook performance pattern with confidence. No pre_init_hook exists anywhere in the 7,362-line diff. The AI generated a plausible Odoo code pattern from training knowledge rather than reading the actual code.

  2. Phantom tests: Claimed "tests pass" on a module with zero tests. Codecov was failing. The AI assumed tests exist because most modules have them.

  3. Wrong version: All three bug descriptions cite variable names and code patterns from version 16.0 source code, not the 18.0 migration diff under review. The AI was analyzing code it had memorized from training, not code in the PR.

  4. Invisible tests: Claimed "module doesn't include any tests" when a 187-line test file with 6 test methods exists in the PR. The AI missed a file it should have read.

The common thread: every fabrication stems from the AI substituting pattern-matched expectations for actual observation. It "knows" what Odoo modules typically look like and fills in the blanks rather than reading what's actually there.

The quality curve

Quality wasn't uniform. It improved over time, then degraded with volume.

PeriodValid rate
Timesheet (early)34%
HR50%
Bank-statement-import59%
Project70%
Sale-workflow (early batches)87%
Sale-workflow (late batches)70%

The jump from 34% to 87% shows genuine learning - prompts improved, edge cases were handled, failure modes were addressed. The regression from 87% to 70% shows volume fatigue. The same degradation pattern that affects human reviewers doing batch work.

Why this matters beyond code review

The 30-point validation gap isn't specific to code review. It's a structural problem with any AI-assisted workflow where:

  1. The output looks plausible. Well-written text passes surface-level scrutiny.
  2. Self-assessment is circular. AI checking AI text measures coherence, not correctness.
  3. Ground truth verification requires extra work. Actually checking claims against reality takes effort most people skip.

If you're using AI for research, writing, analysis, or decision support, the same gap likely exists. You just haven't measured it yet.

How to validate your own AI output

The methodology is reusable:

  1. Separate the evaluator from the generator. Don't ask the same model to grade its own output.
  2. Give the evaluator ground truth. The evaluator must have access to the source material, not just the AI's output.
  3. Require evidence for every claim. Each verification should quote specific evidence from the source.
  4. Use categorical classification with clear definitions. Valid / Partially Valid / Rubber-stamp / Invalid gives you actionable data.
  5. Run at scale. A few spot checks won't reveal systemic patterns. I validated 440 reviews to see the quality curve.

The cost of this validation was a fraction of the cost of generating the reviews. The cost of NOT validating? Thinking you're at 98.6% when you're at 68.9%.

The bottom line

Self-assessed AI quality is a vanity metric. If you're measuring your AI workflow by asking "does this look right?" you're overestimating quality by 20-30 points.

Validate against ground truth, not against the AI's own output. The gap you find will be uncomfortable. That discomfort is the point.