680 Runs, Zero Retries: How to Actually Prove a Flaky Test Is Fixed


“The flaky test is fixed — it passed.”
No. It passed once. That’s not fixed.
If the original failure rate was 0.5%, the broken test was already passing 99.5% of the time — a single green run is almost exactly what you’d expect with the bug still in place. And when your CI runs 200 times a week, a 0.5% failure rate means “random” failures every Monday morning. One passing run is not evidence. It’s a coin flip you happened to win.
I run every flaky test fix 680 consecutive times before calling it done. Zero retries. Zero allowedFlakes. If it fails once in 680 runs, the fix isn’t real.
The Math That Changes Your Mind
This is a binomial probability problem. If a test has a true failure rate of p, the probability of it passing n consecutive runs is (1 − p)^n. The probability of seeing at least one failure is 1 − (1 − p)^n.
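As a sanity check, the formula is a few lines of TypeScript (the function names here are illustrative, not from any particular suite):

```typescript
// Probability that a test with true per-run failure rate p passes n runs in a row
function pAllPass(p: number, n: number): number {
  return Math.pow(1 - p, n);
}

// Probability of catching at least one failure in n runs
function pAtLeastOneFailure(p: number, n: number): number {
  return 1 - pAllPass(p, n);
}

// For a 0.5% failure rate: one pass is ~99.5% likely even with the bug present,
// while 680 consecutive passes would only happen ~3.3% of the time.
console.log(pAllPass(0.005, 1).toFixed(4));   // 0.9950
console.log(pAllPass(0.005, 680).toFixed(4)); // 0.0331
```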
For a test with a 0.5% failure rate:
| Consecutive Passes | P(all pass) | P(at least 1 failure) | Confidence the fix is real |
|---|---|---|---|
| 1 | 99.50% | 0.50% | Almost none — you learned nothing |
| 10 | 95.11% | 4.89% | Low |
| 50 | 77.83% | 22.17% | Moderate |
| 100 | 60.58% | 39.42% | Getting somewhere |
| 200 | 36.70% | 63.30% | Coin flip |
| 460 | 9.97% | 90.03% | ~90% confident the fix is real |
| 680 | 3.31% | 96.69% | ~97% confident at 0.5% failure rate |
| 1000 | 0.67% | 99.33% | 99%+ confident |
There’s also a useful shortcut called the Rule of Three: if you observe zero failures in n trials, the upper bound of the failure rate at 95% confidence is approximately 3/n. For n = 680, that’s 3/680 = 0.44%. So 680 passes with zero failures means you can be 95% confident the true failure rate is below 0.44%.
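The Rule of Three is easy to verify numerically: a failure rate sitting exactly at the 3/n bound would survive n clean runs only about 5% of the time, which is where the 95% confidence comes from. A quick sketch (illustrative names, not from the original protocol):

```typescript
// Rule of Three: after n trials with zero failures, the 95% upper bound
// on the true failure rate is approximately 3/n.
function ruleOfThreeUpperBound(n: number): number {
  return 3 / n;
}

const n = 680;
const bound = ruleOfThreeUpperBound(n); // ≈ 0.0044, i.e. 0.44%

// Cross-check: a test failing at exactly that rate passes all 680 runs
// only ~5% of the time.
const survival = Math.pow(1 - bound, n);
console.log(bound.toFixed(4), survival.toFixed(3)); // 0.0044 0.049
```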
Why 680 specifically? It’s a practical number: 8 parallel workers × 85 iterations each. It fits in a single stress run, completes in reasonable time, and pushes the upper bound of likely failure rates below the threshold where a typical CI pipeline would see weekly failures.
The point isn’t that 680 is a magic number. The point is that one is not enough, ten is not enough, and you need to do the math for your specific situation.
Why Most “Fixes” Fail
After months of running this protocol, three patterns emerged in failed fixes — tests that passed once or even fifty times and then failed again.
Pattern 1: The Extended Timeout
The most common non-fix. A test fails because an element takes 6 seconds to appear. The “fix” changes the timeout from 5 seconds to 15 seconds. It passes today. Tomorrow, under CI load with parallel test workers competing for resources, that element takes 18 seconds.
```ts
// ❌ The non-fix: extending the timeout
await page.getByRole('button', { name: 'Submit' }).click();
await expect(page.getByText('Success')).toBeVisible({ timeout: 15000 });
```
```ts
// ✅ The real fix: wait for the actual condition, not an arbitrary timer
await page.getByRole('button', { name: 'Submit' }).click();
await page.waitForResponse(resp => resp.url().includes('/api/submit') && resp.status() === 200);
await expect(page.getByText('Success')).toBeVisible();
```

A timeout isn’t a fix — it’s a bet that the system will always be at least this fast. That bet loses eventually.
Pattern 2: The State Leak
Test B passes when run alone, fails when run after Test A. Test A leaves behind a cookie, a local storage entry, or a database row that changes the starting conditions for Test B. The “fix” adds a beforeEach cleanup. But the cleanup itself is fragile — it clears known state but misses the one artifact that only gets created under specific Test A conditions.
```ts
// ❌ Fragile cleanup: clearing known state
test.beforeEach(async ({ page }) => {
  await page.evaluate(() => localStorage.clear());
});
```
```ts
// ✅ Robust isolation: fresh context per test
// In playwright.config.ts — each test gets a pristine browser context
import type { PlaywrightTestConfig } from '@playwright/test';

const config: PlaywrightTestConfig = {
  use: {
    // Every test starts with zero cookies, zero storage, zero history
    storageState: undefined,
    contextOptions: {
      ignoreHTTPSErrors: true,
    },
  },
  // Fully parallel — no shared state between workers
  fullyParallel: true,
};

export default config;
```

Pattern 3: The Race Condition
A click fires. A network request starts. The test asserts on the result. But the assertion runs before the response arrives. It works on your fast local machine. It fails 0.5% of the time in CI where resource contention adds 200ms of latency to every network hop.
```ts
// ❌ Race condition: assert before the data arrives
await page.getByRole('button', { name: 'Load Data' }).click();
await expect(page.getByTestId('results-count')).toHaveText('42');
```
```ts
// ✅ Wait for the network response, then assert on the rendered result
await page.getByRole('button', { name: 'Load Data' }).click();
await page.waitForResponse(resp => resp.url().includes('/api/data') && resp.status() === 200);
await expect(page.getByTestId('results-count')).toHaveText('42');
```

Stop Saying “Flaky” — Start Classifying
The word “flaky” is a diagnostic dead-end. It stops investigation. It’s the test equivalent of a doctor saying “you’re sick.” Technically true. Completely useless.
Every test failure has a root cause. Classify it:
| Root Cause Class | Symptom | Fix Category |
|---|---|---|
| Timing defect | Passes locally, fails in CI under load | Wait for conditions, not timers |
| State leak | Fails only when run after specific tests | Isolation — fresh context per test |
| Race condition | Fails intermittently on fast assertions | Wait for network/state, then assert |
| Resource contention | Fails in parallel, passes in serial | Worker isolation or resource locks |
| Environment drift | Fails in staging but not dev | Environment-aware fixtures |
Once you name the class, you can search for it. That’s predictive hardening.
Predictive Hardening: Fix Tests Before They Fail
This is the methodology shift that matters more than the 680-run protocol itself.
After fixing 3 known flaky tests that all shared the same root cause — race conditions between click handlers and network responses — I searched the entire suite for the same pattern. Grep for assertion statements that immediately follow click actions without an intervening waitForResponse or waitForLoadState.
```bash
# Find potential race conditions: assertions immediately after clicks
# with no waitForResponse in between
grep -n "\.click()" tests/**/*.spec.ts | while read line; do
  file=$(echo "$line" | cut -d: -f1)
  linenum=$(echo "$line" | cut -d: -f2)
  # Check if the next 3 lines contain a waitForResponse
  nextlines=$(sed -n "$((linenum+1)),$((linenum+3))p" "$file")
  if echo "$nextlines" | grep -q "expect\|toHave\|toBe" && \
     ! echo "$nextlines" | grep -q "waitFor"; then
    echo "POTENTIAL RACE: $file:$linenum"
  fi
done
```

I found 3 more tests with the same vulnerability pattern — and hardened them before they ever failed in CI.
That’s the difference between reactive testing (wait for it to break, then fix it) and predictive hardening (classify the failure pattern, then sweep the suite).
The FAST / STANDARD / EXTENDED Pattern
Replace ad-hoc magic numbers with standardized polling constants. Every waitFor call in the suite references a named constant, not a guess:
```ts
export const TIMEOUTS = {
  FAST: 2_000,      // Elements that should appear immediately after navigation
  STANDARD: 10_000, // API responses and re-renders
  EXTENDED: 30_000, // Complex operations, streaming, file uploads
  STRESS: 60_000,   // Only used in stress test configurations
} as const;
```
```ts
// In tests:
await expect(page.getByRole('heading')).toBeVisible({
  timeout: TIMEOUTS.FAST,
});

await expect(page.getByTestId('search-results')).toHaveCount(10, {
  timeout: TIMEOUTS.STANDARD,
});
```

When a test needs EXTENDED, that’s a signal. It means either the operation is genuinely slow (acceptable) or something in the architecture is blocking (investigate). Named constants make these signals visible in code review.
The Stress Test Protocol
Here’s the exact configuration I use for stress validation:
```ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  workers: 8,
  repeatEach: 85, // 8 × 85 = 680 total runs
  retries: 0,     // Zero tolerance — a single failure invalidates the fix
  timeout: 30_000,
  use: {
    trace: 'on-first-retry', // Won't fire with 0 retries — that's the point
  },
  reporter: [
    ['list'],
    ['json', { outputFile: 'stress-results.json' }],
  ],
});
```

```bash
# Run stress validation against a specific test file
npx playwright test tests/checkout-flow.spec.ts --config=playwright.stress.config.ts

# Verify: all 680 must pass (zero unexpected failures)
cat stress-results.json | jq '.stats | {passed: .expected, failed: .unexpected}'
```

If all 680 pass: the fix is real. Ship it.
If any fail: don’t just look at the failure count. Look at which workers failed and which iterations within those workers. Failures clustered in later iterations suggest resource exhaustion (memory leak, connection pool depletion). Failures scattered randomly suggest the fix is incomplete. Failures only in specific workers suggest an isolation problem.
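That triage can be sketched as a small classifier. This is illustrative only: it assumes the failures have already been flattened into `{worker, iteration}` records, which is not the raw shape of Playwright’s JSON report.

```typescript
// Hypothetical flattened failure record: which worker failed, on which
// repeatEach iteration (0-based). Extracting these from the JSON report
// is left out here because its shape varies by Playwright version.
interface FailureRecord {
  worker: number;
  iteration: number;
}

function diagnoseStressRun(failures: FailureRecord[], totalIterations: number): string {
  if (failures.length === 0) return "all passed: the fix is real";
  // Share of failures landing in the second half of the run
  const lateShare =
    failures.filter(f => f.iteration >= totalIterations / 2).length / failures.length;
  const workersHit = new Set(failures.map(f => f.worker)).size;
  if (lateShare > 0.8) return "clustered late: suspect resource exhaustion";
  if (workersHit === 1) return "single worker: suspect an isolation problem";
  return "scattered: the fix is incomplete";
}

console.log(diagnoseStressRun([], 85));
console.log(diagnoseStressRun([{ worker: 3, iteration: 81 }, { worker: 6, iteration: 84 }], 85));
```

The thresholds (80% late, one worker) are arbitrary starting points; tune them against your own stress-run history.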
The Deeper Lesson
Flaky tests are not a nuisance category. They are a diagnostic opportunity. Every “flaky” test is telling you something specific about your system’s behavior under conditions you didn’t design for. The test that fails intermittently in CI but passes locally is reporting real information: your application behaves differently under resource contention. That’s not the test being unreliable. That’s the test being more honest than your local environment.
The shift from “fix this flaky test” to “classify and eliminate this failure pattern across the entire suite” is the most impactful change I’ve made to my testing methodology. It turns a reactive game of whack-a-mole into a systematic reduction of failure surface area.
Stop calling your tests flaky. Start classifying your failures. And stop calling a fix done because it passed once.
680 runs. Zero retries. Statistical confidence, not hope.