680 Runs, Zero Retries: How to Actually Prove a Flaky Test Is Fixed


“The flaky test is fixed — it passed.”
No. It passed once. That’s not fixed.
If the original failure rate was 0.5%, the broken test was already passing 99.5% of the time — a single green run is almost exactly what you’d expect with the bug still in place. And when your CI runs 200 times a week, a 0.5% failure rate means “random” failures every Monday morning. One passing run is not evidence. It’s a coin flip you happened to win.
I run every flaky test fix 680 consecutive times before calling it done. Zero retries. Zero allowedFlakes. If it fails once in 680 runs, the fix isn’t real.
The Math That Changes Your Mind
This is a binomial probability problem. If a test has a true failure rate of p, the probability of it passing n consecutive runs is (1 − p)^n. The probability of seeing at least one failure is 1 − (1 − p)^n.
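As a sanity check, the formula is a few lines of TypeScript (the function names here are illustrative, not from any particular suite):

```typescript
// Probability that a test with true per-run failure rate p passes n runs in a row
function pAllPass(p: number, n: number): number {
  return Math.pow(1 - p, n);
}

// Probability of catching at least one failure in n runs
function pAtLeastOneFailure(p: number, n: number): number {
  return 1 - pAllPass(p, n);
}

// For a 0.5% failure rate: one pass is ~99.5% likely even with the bug present,
// while 680 consecutive passes would only happen ~3.3% of the time.
console.log(pAllPass(0.005, 1).toFixed(4));   // 0.9950
console.log(pAllPass(0.005, 680).toFixed(4)); // 0.0331
```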
For a test with a 0.5% failure rate:
| Consecutive Passes | P(all pass) | P(at least 1 failure) | Confidence the fix is real |
|---|---|---|---|
| 1 | 99.50% | 0.50% | Almost none — you learned nothing |
| 10 | 95.11% | 4.89% | Low |
| 50 | 77.83% | 22.17% | Moderate |
| 100 | 60.58% | 39.42% | Getting somewhere |
| 200 | 36.70% | 63.30% | Coin flip |
| 460 | 9.97% | 90.03% | ~90% confident the fix is real |
| 680 | 3.31% | 96.69% | ~97% confident at 0.5% failure rate |
| 1000 | 0.67% | 99.33% | 99%+ confident |
There’s also a useful shortcut called the Rule of Three: if you observe zero failures in n trials, the upper bound of the failure rate at 95% confidence is approximately 3/n. For n = 680, that’s 3/680 = 0.44%. So 680 passes with zero failures means you can be 95% confident the true failure rate is below 0.44%.
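The Rule of Three is easy to verify numerically: a failure rate sitting exactly at the 3/n bound would survive n clean runs only about 5% of the time, which is where the 95% confidence comes from. A quick sketch (illustrative names, not from the original protocol):

```typescript
// Rule of Three: after n trials with zero failures, the 95% upper bound
// on the true failure rate is approximately 3/n.
function ruleOfThreeUpperBound(n: number): number {
  return 3 / n;
}

const n = 680;
const bound = ruleOfThreeUpperBound(n); // ≈ 0.0044, i.e. 0.44%

// Cross-check: a test failing at exactly that rate passes all 680 runs
// only ~5% of the time.
const survival = Math.pow(1 - bound, n);
console.log(bound.toFixed(4), survival.toFixed(3)); // 0.0044 0.049
```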
Why 680 specifically? It’s a practical number: 8 parallel workers × 85 iterations each. It fits in a single stress run, completes in reasonable time, and pushes the upper bound of likely failure rates below the threshold where a typical CI pipeline would see weekly failures.
The point isn’t that 680 is a magic number. The point is that one is not enough, ten is not enough, and you need to do the math for your specific situation.
Why Most “Fixes” Fail
After months of running this protocol, three patterns emerged in failed fixes — tests that passed once or even fifty times and then failed again.
Pattern 1: The Extended Timeout
The most common non-fix. A test fails because an element takes 6 seconds to appear. The “fix” changes the timeout from 5 seconds to 15 seconds. It passes today. Tomorrow, under CI load with parallel test workers competing for resources, that element takes 18 seconds.
```ts
// ❌ The non-fix: extending the timeout
await page.getByRole('button', { name: 'Submit' }).click();
await expect(page.getByText('Success')).toBeVisible({ timeout: 15000 });
```
```ts
// ✅ The real fix: wait for the actual condition, not an arbitrary timer
await page.getByRole('button', { name: 'Submit' }).click();
await page.waitForResponse(resp => resp.url().includes('/api/submit') && resp.status() === 200);
await expect(page.getByText('Success')).toBeVisible();
```

A timeout isn’t a fix — it’s a bet that the system will always be at least this fast. That bet loses eventually.
Pattern 2: The State Leak
Test B passes when run alone, fails when run after Test A. Test A leaves behind a cookie, a local storage entry, or a database row that changes the starting conditions for Test B. The “fix” adds a beforeEach cleanup. But the cleanup itself is fragile — it clears known state but misses the one artifact that only gets created under specific Test A conditions.
```ts
// ❌ Fragile cleanup: clearing known state
test.beforeEach(async ({ page }) => {
  await page.evaluate(() => localStorage.clear());
});
```
```ts
// ✅ Robust isolation: fresh context per test
// In playwright.config.ts — each test gets a pristine browser context
import type { PlaywrightTestConfig } from '@playwright/test';

const config: PlaywrightTestConfig = {
  use: {
    // Every test starts with zero cookies, zero storage, zero history
    storageState: undefined,
    contextOptions: {
      ignoreHTTPSErrors: true,
    },
  },
  // Fully parallel — no shared state between workers
  fullyParallel: true,
};

export default config;
```

Pattern 3: The Race Condition
A click fires. A network request starts. The test asserts on the result. But the assertion runs before the response arrives. It works on your fast local machine. It fails 0.5% of the time in CI where resource contention adds 200ms of latency to every network hop.
```ts
// ❌ Race condition: assert before the data arrives
await page.getByRole('button', { name: 'Load Data' }).click();
await expect(page.getByTestId('results-count')).toHaveText('42');
```
```ts
// ✅ Wait for the network response, then assert on the rendered result
await page.getByRole('button', { name: 'Load Data' }).click();
await page.waitForResponse(resp => resp.url().includes('/api/data') && resp.status() === 200);
await expect(page.getByTestId('results-count')).toHaveText('42');
```

Stop Saying “Flaky” — Start Classifying
The word “flaky” is a diagnostic dead-end. It stops investigation. It’s the test equivalent of a doctor saying “you’re sick.” Technically true. Completely useless.
Every test failure has a root cause. Classify it:
| Root Cause Class | Symptom | Fix Category |
|---|---|---|
| Timing defect | Passes locally, fails in CI under load | Wait for conditions, not timers |
| State leak | Fails only when run after specific tests | Isolation — fresh context per test |
| Race condition | Fails intermittently on fast assertions | Wait for network/state, then assert |
| Resource contention | Fails in parallel, passes in serial | Worker isolation or resource locks |
| Environment drift | Fails in staging but not dev | Environment-aware fixtures |
Once you name the class, you can search for it. That’s predictive hardening.
Predictive Hardening: Fix Tests Before They Fail
This is the methodology shift that matters more than the 680-run protocol itself.
After fixing 3 known flaky tests that all shared the same root cause — race conditions between click handlers and network responses — I searched the entire suite for the same pattern. Grep for assertion statements that immediately follow click actions without an intervening waitForResponse or waitForLoadState.
```bash
# Find potential race conditions: assertions immediately after clicks
# with no waitForResponse in between
grep -n "\.click()" tests/**/*.spec.ts | while read line; do
  file=$(echo "$line" | cut -d: -f1)
  linenum=$(echo "$line" | cut -d: -f2)
  # Check if the next 3 lines contain a waitForResponse
  nextlines=$(sed -n "$((linenum+1)),$((linenum+3))p" "$file")
  if echo "$nextlines" | grep -q "expect\|toHave\|toBe" && \
     ! echo "$nextlines" | grep -q "waitFor"; then
    echo "POTENTIAL RACE: $file:$linenum"
  fi
done
```

I found 3 more tests with the same vulnerability pattern — and hardened them before they ever failed in CI.
That’s the difference between reactive testing (wait for it to break, then fix it) and predictive hardening (classify the failure pattern, then sweep the suite).
The FAST / STANDARD / EXTENDED Pattern
Replace ad-hoc magic numbers with standardized polling constants. Every waitFor call in the suite references a named constant, not a guess:
```ts
export const TIMEOUTS = {
  FAST: 2_000,      // Elements that should appear immediately after navigation
  STANDARD: 10_000, // API responses and re-renders
  EXTENDED: 30_000, // Complex operations, streaming, file uploads
  STRESS: 60_000,   // Only used in stress test configurations
} as const;
```
```ts
// In tests:
await expect(page.getByRole('heading')).toBeVisible({
  timeout: TIMEOUTS.FAST,
});

await expect(page.getByTestId('search-results')).toHaveCount(10, {
  timeout: TIMEOUTS.STANDARD,
});
```

When a test needs EXTENDED, that’s a signal. It means either the operation is genuinely slow (acceptable) or something in the architecture is blocking (investigate). Named constants make these signals visible in code review.
The Stress Test Protocol
Here’s the exact configuration I use for stress validation:
```ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  workers: 8,
  repeatEach: 85, // 8 × 85 = 680 total runs
  retries: 0,     // Zero tolerance — a single failure invalidates the fix
  timeout: 30_000,
  use: {
    trace: 'on-first-retry', // Won't fire with 0 retries — that's the point
  },
  reporter: [
    ['list'],
    ['json', { outputFile: 'stress-results.json' }],
  ],
});
```

```bash
# Run stress validation against a specific test file
npx playwright test tests/checkout-flow.spec.ts --config=playwright.stress.config.ts

# Verify: all 680 must pass (zero unexpected failures)
cat stress-results.json | jq '.stats | {passed: .expected, failed: .unexpected}'
```

If all 680 pass: the fix is real. Ship it.
If any fail: don’t just look at the failure count. Look at which workers failed and which iterations within those workers. Failures clustered in later iterations suggest resource exhaustion (memory leak, connection pool depletion). Failures scattered randomly suggest the fix is incomplete. Failures only in specific workers suggest an isolation problem.
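That triage can be sketched as a small classifier. This is illustrative only: it assumes the failures have already been flattened into `{worker, iteration}` records, which is not the raw shape of Playwright’s JSON report.

```typescript
// Hypothetical flattened failure record: which worker failed, on which
// repeatEach iteration (0-based). Extracting these from the JSON report
// is left out here because its shape varies by Playwright version.
interface FailureRecord {
  worker: number;
  iteration: number;
}

function diagnoseStressRun(failures: FailureRecord[], totalIterations: number): string {
  if (failures.length === 0) return "all passed: the fix is real";
  // Share of failures landing in the second half of the run
  const lateShare =
    failures.filter(f => f.iteration >= totalIterations / 2).length / failures.length;
  const workersHit = new Set(failures.map(f => f.worker)).size;
  if (lateShare > 0.8) return "clustered late: suspect resource exhaustion";
  if (workersHit === 1) return "single worker: suspect an isolation problem";
  return "scattered: the fix is incomplete";
}

console.log(diagnoseStressRun([], 85));
console.log(diagnoseStressRun([{ worker: 3, iteration: 81 }, { worker: 6, iteration: 84 }], 85));
```

The thresholds (80% late, one worker) are arbitrary starting points; tune them against your own stress-run history.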
The Deeper Lesson
Flaky tests are not a nuisance category. They are a diagnostic opportunity. Every “flaky” test is telling you something specific about your system’s behavior under conditions you didn’t design for. The test that fails intermittently in CI but passes locally is reporting real information: your application behaves differently under resource contention. That’s not the test being unreliable. That’s the test being more honest than your local environment.
The shift from “fix this flaky test” to “classify and eliminate this failure pattern across the entire suite” is the most impactful change I’ve made to my testing methodology. It turns a reactive game of whack-a-mole into a systematic reduction of failure surface area.
Stop calling your tests flaky. Start classifying your failures. And stop calling a fix done because it passed once.
680 runs. Zero retries. Statistical confidence, not hope.