
I merged 3 PRs last week. The AI reviewer generated 7 review threads across all three.
5 were valid improvements. Null safety checks I’d missed. A cleaner assertion pattern. Consistent formatting on a test utility. I accepted all 5 without hesitation.
2 would have broken every test in CI. I rejected both.
The ratio — 5 good, 2 dangerous — is the whole story of AI code review in test infrastructure. Most suggestions are fine. The ones that aren’t are catastrophic. And the AI presents both with identical confidence.
The Pattern: Removing “Unnecessary” Guards
Both rejected suggestions did the same thing. They identified defensive guards in test code and flagged them as unnecessary conditionals that could be tightened.
Here’s a sanitized version of what the AI saw:
```typescript
// The original test code
test('verify user profile displays correctly', async ({ page }) => {
  await page.goto('/profile');

  // Check if the premium badge exists before asserting on it
  const premiumBadge = page.getByTestId('premium-badge');
  if (await premiumBadge.isVisible({ timeout: 3000 })) {
    await expect(premiumBadge).toHaveText(/Premium|Enterprise/);
  }

  // Always verify the core profile elements
  await expect(page.getByRole('heading', { name: /profile/i })).toBeVisible();
  await expect(page.getByTestId('user-email')).toBeVisible();
});
```

The AI's suggestion:
```typescript
// AI suggestion: "Unnecessary conditional — the premium badge should
// always exist on the profile page. Remove the conditional check
// and assert directly."
test('verify user profile displays correctly', async ({ page }) => {
  await page.goto('/profile');

  // ❌ AI removed the guard — this breaks in staging and dev environments
  const premiumBadge = page.getByTestId('premium-badge');
  await expect(premiumBadge).toBeVisible();
  await expect(premiumBadge).toHaveText(/Premium|Enterprise/);

  await expect(page.getByRole('heading', { name: /profile/i })).toBeVisible();
  await expect(page.getByTestId('user-email')).toBeVisible();
});
```

The AI's logic is clean: if the test navigates to a profile page, the premium badge should exist. The conditional is unnecessary complexity. Tighten the assertion.
Correct — in production. Dead wrong in staging, where the premium badge is behind a feature flag that’s only enabled in the production environment. Dead wrong in dev, where the test data doesn’t include premium accounts. Dead wrong during deploy windows, where the feature flag service takes up to 30 seconds to propagate after deployment.
Why the AI Gets This Wrong
The AI reviews code in isolation. It sees the test file. It might see the component being tested. It does not see — and cannot see — the following:
Multi-environment execution. This test runs across 3 environments (dev, staging, production) with different feature flag states, different data availability, and different deployment timing.
Deploy-time transients. During a rolling deployment, the feature flag service can take 30 seconds to sync. A test that runs during that window sees a different state than a test that runs 60 seconds later.
Test data variability. The staging environment uses anonymized production data snapshots that are refreshed weekly. Not every user in the snapshot is a premium user.
Intentional defensive design. The guard isn’t there because the developer was lazy. It’s there because the test needs to pass across all three environments. The guard is the feature.
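One way to make that defensive intent visible to reviewers, human or AI, is to move the per-environment expectations into an explicit table instead of leaving them implicit in a conditional. A minimal sketch, assuming the three-environment setup described above; the names (`Env`, `ENV_FLAGS`, `isFeatureExpected`) are illustrative, not from the original codebase:

```typescript
// Hypothetical sketch: per-environment feature expectations as data.
// The flag states mirror the situation described in the text.
type Env = 'dev' | 'staging' | 'production';

const ENV_FLAGS: Record<Env, { premiumBadge: boolean }> = {
  dev: { premiumBadge: false },     // test data has no premium accounts
  staging: { premiumBadge: false }, // feature flag only enabled in production
  production: { premiumBadge: true },
};

function isFeatureExpected(env: Env, feature: 'premiumBadge'): boolean {
  return ENV_FLAGS[env][feature];
}
```

A test could then guard the badge assertion with `isFeatureExpected(currentEnv, 'premiumBadge')`, so the reason for the conditional is documented in one place rather than looking like an accidental code smell.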
The AI sees a conditional and thinks “code smell.” An engineer who has been woken up at 2 AM by a CI failure in staging sees a conditional and thinks “someone learned the hard way.”
The Second Rejected Suggestion
The other dangerous suggestion was the same pattern applied to a timeout value:
```typescript
// Original: defensive timeout for slow staging environment
await expect(page.getByTestId('dashboard-loaded')).toBeVisible({
  timeout: 15000, // Staging can take up to 12s after deploy
});

// AI suggestion: "Timeout of 15000ms is excessive. The default 5000ms
// should be sufficient for a visibility check. Reduce to default."
await expect(page.getByTestId('dashboard-loaded')).toBeVisible();
```

The AI is right that 15 seconds is unusual for a visibility check. The AI doesn't know that the staging environment runs on smaller infrastructure, shares a database with the QA environment, and regularly takes 8-12 seconds to render the dashboard after a fresh deployment.
Reducing the timeout to the default 5 seconds would cause the test to fail in staging on every deploy. It would pass locally. It would pass in production. It would fail in the one environment where it matters most for catching regressions before they reach users.
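A defensive timeout is easier to defend, to the next engineer and to an AI reviewer alike, when the environment-specific knowledge lives in one table rather than a magic number scattered through test files. A minimal sketch; the helper name and values are illustrative, mirroring the staging behavior described above:

```typescript
// Hypothetical sketch: per-environment visibility timeouts as data.
type Env = 'dev' | 'staging' | 'production';

const VISIBILITY_TIMEOUT_MS: Record<Env, number> = {
  dev: 5000,      // default is fine locally
  staging: 15000, // dashboard can take 8-12s to render after a deploy
  production: 5000,
};

function visibilityTimeout(env: Env): number {
  return VISIBILITY_TIMEOUT_MS[env];
}
```

A test would then call `toBeVisible({ timeout: visibilityTimeout(env) })`; the table itself becomes the documentation an AI reviewer would need before suggesting "reduce to default".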
The Accept / Validate / Reject Decision Tree
After encountering this pattern multiple times, I built a decision framework for triaging AI code review suggestions on test infrastructure:
Accept Immediately
These suggestions are almost always safe to take without further validation:
- Null safety improvements. Adding optional chaining, null checks, or type narrowing. These make tests more robust across environments, not less.
- Type improvements. Tightening types, removing `any`, adding explicit return types. Pure correctness improvements with no behavioral impact.
- Formatting and naming. Consistent naming conventions, import ordering, test description clarity.
- Assertion clarity. Switching from `toBeTruthy()` to `toBeVisible()`, using more specific matchers. These are better assertions that still pass in all environments.
```typescript
// AI suggestion: use more specific matcher ✅ Accept

// Before
await expect(submitButton).toBeTruthy();

// After
await expect(submitButton).toBeEnabled();
```

Validate Before Accepting
These suggestions might be correct, but need manual verification against multi-environment behavior:
- Assertion tightening. Changing loose matchers to strict matchers. Correct in theory, but might fail in environments with different data.
- Guard removal. Removing conditional checks, optional chaining on assertions, or fallback paths. The guard might exist for a reason the AI can’t see.
- Timeout changes. Reducing timeouts, changing polling intervals, removing explicit waits. These are environment-sensitive by nature.
- Locator changes. Switching from `getByTestId` to `getByRole`. Better practice in general, but the role-based locator might match different elements across environments with different feature flag states.
```typescript
// AI suggestion: tighten assertion ⚠️ Validate first

// Before: loose text matching
await expect(page.getByTestId('status')).toContainText('complete');

// After: exact text matching
await expect(page.getByTestId('status')).toHaveText('Complete');

// Is the text exactly "Complete" in all environments?
// Or does staging show "complete" (lowercase) or "COMPLETE" (uppercase)?
// Check before accepting.
```

Reject
These suggestions are based on assumptions that don’t hold in multi-environment test infrastructure:
- Anything that assumes single-environment execution. If the suggestion only works when you imagine the test running locally, reject it.
- Removing environment-aware conditional logic. Guards that check for element existence before asserting on environment-specific features.
- Removing retry logic for deployment-sensitive operations. If a test retries because the deployment might be in progress, that retry is load-bearing.
```typescript
// AI suggestion: remove retry ❌ Reject

// Before: retry loop for post-deploy state
await expect(async () => {
  await page.reload();
  await expect(page.getByTestId('new-feature')).toBeVisible();
}).toPass({ intervals: [1000, 2000, 5000], timeout: 30000 });

// After (AI): "Unnecessary retry — just assert directly"
await expect(page.getByTestId('new-feature')).toBeVisible();

// The retry exists because this test runs against a deployment that
// takes up to 30 seconds to propagate. The AI has never seen a deploy.
```

The 5-of-7 Rule
Here’s the practical heuristic: in a batch of AI code review suggestions on test infrastructure, approximately 5 out of 7 are valid improvements. The remaining 2 are confidently wrong about something the AI cannot observe.
The ratio isn’t fixed. It varies by codebase complexity and how many environments your tests target. But the pattern is consistent: the majority of suggestions are safe, and a minority would break your pipeline in ways that only manifest outside the AI’s observable context.
This ratio is what makes AI code review both valuable and dangerous. If most suggestions were wrong, you’d ignore them entirely. If all suggestions were right, you’d auto-merge them. The 5-of-7 split means you must evaluate each one individually — and the cost of getting the wrong two accepted is disproportionately high.
An AI Review Triage Checklist
When an AI reviewer suggests changes to your test infrastructure, run through this:
- Does this suggestion change behavior across environments? If the test runs in multiple environments with different data, flags, or deployment states, any suggestion that tightens assertions or removes guards needs manual validation.
- Is this a timeout change? Timeout values in test infrastructure are almost never arbitrary. They encode knowledge about environment-specific performance characteristics. Ask why the current value exists before changing it.
- Does this remove a conditional? Conditionals in test code that check for element existence before asserting are often environment-aware guards, not code smells. Verify that the element exists in all environments before removing the check.
- Would this suggestion survive a staging deploy? Mentally replay the suggestion against your slowest, most resource-constrained environment during a rolling deployment. If it would fail, reject it.
- Can I verify this without running in production? If the only way to validate the suggestion is to merge it and see if CI breaks in staging, that's a strong signal to reject.
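The checklist above can be sketched as a triage function, one branch per category of the decision tree. All names here are hypothetical; in practice a human reviewer would classify each suggestion before routing it:

```typescript
// Hypothetical sketch: the accept / validate / reject decision tree as code.
type Verdict = 'accept' | 'validate' | 'reject';

interface Suggestion {
  removesRetry: boolean;      // drops retry/polling around deploy-sensitive checks
  removesEnvGuard: boolean;   // removes an environment-aware conditional
  changesTimeout: boolean;    // reduces or removes an explicit timeout
  tightensAssertion: boolean; // loose matcher swapped for a strict one
  changesLocator: boolean;    // e.g. getByTestId swapped for getByRole
}

function triage(s: Suggestion): Verdict {
  // Load-bearing retries and environment-aware guards: reject removal.
  if (s.removesRetry || s.removesEnvGuard) return 'reject';
  // Environment-sensitive changes: verify against every environment first.
  if (s.changesTimeout || s.tightensAssertion || s.changesLocator) return 'validate';
  // Everything else (null safety, types, naming, formatting): accept.
  return 'accept';
}
```

The point isn't to automate the judgment away; it's that the categories are stable enough to write down, which is what makes the 5-of-7 discipline repeatable.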
The Deeper Lesson
AI code review is a tool. A good one. The 5 suggestions I accepted last week made genuine improvements to code quality — improvements I might have missed in manual review. I’m not arguing against AI code review.
I’m arguing against treating AI code review as a trusted authority on infrastructure it has never observed. The AI analyzes code. It does not analyze the environments that code runs in, the deployment states those environments can be in, or the historical context for why a particular guard exists.
The human reviewer’s job in an AI-assisted review process isn’t to rubber-stamp suggestions. It’s to be the context bridge — the one who knows that staging takes 12 seconds to render, that the feature flag propagates slowly, and that the conditional check on line 47 is there because someone learned the hard way.
Don’t turn off AI code review. Just stop treating it as infallible. Five of seven. Accept the five. Catch the two. That’s the discipline that keeps your pipeline green.
