
The test was simple. Click a button. Wait for a dialog to close. Verify the result.
It passed 95% of the time. In the other 5%, the dialog took 30+ seconds to close. Only in CI. Locally it was fine. The team had already labeled it: flaky.
Most teams would add a longer timeout, set retries: 2, and move on. “CI is slow, tests are flaky, ship it.” I’ve seen that response so many times it has its own muscle memory. But “flaky” is not a diagnosis. It’s the absence of one. So I dug in.
What I found wasn’t a test problem. It was a real UX bug that affects every user in production.
## The Test
Here’s what the test looked like — straightforward Playwright interacting with a mutation dialog:
```typescript
test('user can update record status', async ({ page }) => {
  await page.goto('/records');

  // Open the status update dialog
  await page
    .getByRole('row', { name: /Record-1042/ })
    .getByRole('button', { name: 'Update Status' })
    .click();

  // Select new status and confirm
  await page.getByRole('combobox', { name: 'Status' }).selectOption('approved');
  await page.getByRole('button', { name: 'Confirm' }).click();

  // Wait for dialog to close, then verify the result
  await expect(page.getByRole('dialog')).not.toBeVisible({ timeout: 10000 });
  await expect(
    page.getByRole('row', { name: /Record-1042/ }).getByText('Approved')
  ).toBeVisible();
});
```

The failure always happened at the same line: `await expect(page.getByRole('dialog')).not.toBeVisible({ timeout: 10000 })`. The dialog stayed open for 30+ seconds. The 10-second timeout expired. Test failed.
## The Investigation
Step 1 was verifying the failure was real and not an artifact of CI infrastructure. I ran the test 100 times locally with --repeat-each=100. It passed every time. I ran it with throttled CPU (6x slowdown via Playwright’s emulation) to simulate CI resource constraints. It failed 3 times out of 100. The failure was environment-dependent — resource contention made it manifest.
Step 2 was understanding what the dialog’s close handler actually does. I opened the frontend source and traced the mutation flow.
## The Root Cause: Mutation Lifecycle
The dialog used TanStack Query (React Query) for its mutation. When the user clicks “Confirm,” the mutation fires. On success, the dialog’s onSuccess handler runs. Here’s where the problem lives:
```typescript
// The mutation hook — simplified from the actual codebase
const updateStatus = useMutation({
  mutationFn: (data: StatusUpdate) =>
    api.patch(`/records/${data.id}/status`, { status: data.status }),

  onSuccess: async () => {
    // This is the problem: awaiting all cache invalidations
    // before closing the dialog
    await Promise.all([
      queryClient.invalidateQueries({ queryKey: ['records'] }),
      queryClient.invalidateQueries({ queryKey: ['record-detail'] }),
      queryClient.invalidateQueries({ queryKey: ['record-history'] }),
      queryClient.invalidateQueries({ queryKey: ['dashboard-stats'] }),
      queryClient.invalidateQueries({ queryKey: ['team-metrics'] }),
      queryClient.invalidateQueries({ queryKey: ['audit-log'] }),
      queryClient.invalidateQueries({ queryKey: ['notifications'] }),
      queryClient.invalidateQueries({ queryKey: ['pending-reviews'] }),
    ]);

    // Dialog only closes AFTER all 8 invalidations resolve
    onClose();
  },
});
```

There it is. The `onSuccess` handler calls `Promise.all()` on 8 query cache invalidations. In TanStack Query, `invalidateQueries` doesn't just mark the cache as stale — it triggers a refetch for active queries. Each invalidation fires a network request to reload that data.
The onClose() call — the one that dismisses the dialog — sits after the Promise.all(). The dialog doesn’t close until all 8 refetches complete.
On a fast local machine with low latency to the API server, those 8 requests complete in under a second. In CI, where the test runner shares resources with other parallel workers and the API server is running in a container with limited CPU, those 8 requests take 10-30+ seconds to resolve.
## Why This Is a User Problem, Not a Test Problem
The test was telling the truth. It was reporting exactly what happens to a real user.
Every user who clicks “Confirm” in this dialog waits for 8 cache refetches to complete before the dialog closes. On a fast corporate network, they might wait 1-2 seconds and not notice. On a mobile connection, or during peak load, or when the API is under heavy traffic — they’re staring at an open dialog wondering if their action worked.
The E2E test in CI was experiencing what a user on a slow connection experiences. The test wasn’t flaky. The UX was slow.
## The Fix
The fix separates the dialog close from the cache invalidation. The dialog should close the moment the mutation succeeds (HTTP 200). Cache invalidation is a background concern — it updates the data on the page behind the scenes, but the user’s action is already complete.
```typescript
// AFTER: Dialog closes on success. Cache invalidation is fire-and-forget.
const updateStatus = useMutation({
  mutationFn: (data: StatusUpdate) =>
    api.patch(`/records/${data.id}/status`, { status: data.status }),

  onSuccess: () => {
    // Close the dialog immediately — the user's action succeeded
    onClose();

    // Invalidate caches in the background — don't await
    // TanStack Query will handle the refetches asynchronously
    queryClient.invalidateQueries({ queryKey: ['records'] });
    queryClient.invalidateQueries({ queryKey: ['record-detail'] });
    queryClient.invalidateQueries({ queryKey: ['record-history'] });
    queryClient.invalidateQueries({ queryKey: ['dashboard-stats'] });
    queryClient.invalidateQueries({ queryKey: ['team-metrics'] });
    queryClient.invalidateQueries({ queryKey: ['audit-log'] });
    queryClient.invalidateQueries({ queryKey: ['notifications'] });
    queryClient.invalidateQueries({ queryKey: ['pending-reviews'] });
  },
});
```

The key change: `onClose()` moves to the top of `onSuccess`, and the `invalidateQueries` calls are no longer awaited. The `Promise.all()` wrapper is gone. TanStack Query handles the refetches asynchronously — the data on the page updates in the background as each query resolves, which is what the user expects.
After the fix, the dialog closes in under 200ms. The page data refreshes over the next 1-2 seconds as the cache invalidations complete. The user sees their action confirmed immediately, then sees the updated data flow in.
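The before/after difference is easy to see in a stripped-down timing model (plain TypeScript, no TanStack Query; the eight 50 ms refetches and the event names are invented for illustration):

```typescript
// Minimal model: a mutation whose onSuccess either awaits refetches or fires them off.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function run(awaitRefetches: boolean): Promise<string[]> {
  const events: string[] = [];
  // Eight simulated cache refetches, each taking 50 ms
  const refetchAll = () =>
    Promise.all(
      Array.from({ length: 8 }, () =>
        sleep(50).then(() => events.push('refetch done'))
      )
    );

  if (awaitRefetches) {
    await refetchAll();           // BEFORE: block on all refetches...
    events.push('dialog closed'); // ...then close
  } else {
    events.push('dialog closed'); // AFTER: close immediately...
    await refetchAll();           // ...refetches finish in the background
  }
  return events;
}

async function main() {
  const before = await run(true);
  const after = await run(false);
  console.log('before, last event:', before[before.length - 1]); // close comes last
  console.log('after, first event:', after[0]);                  // close comes first
}
main();
```

In the "before" order the close event trails every refetch; in the "after" order it is the very first thing that happens, which is the behavior the fixed dialog exhibits.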
The test passed 680 consecutive runs after the fix. Zero failures.
## The TanStack Query Lifecycle Trap
This bug pattern exists in any codebase that uses TanStack Query (or similar cache-management libraries) with the following combination:
- A mutation with an `onSuccess` callback
- `invalidateQueries` calls inside `onSuccess` that are `await`-ed
- A UI action (dialog close, navigation, toast) that depends on `onSuccess` completing
The trap is that invalidateQueries returns a Promise. If your onSuccess function is async and you await the invalidation, the mutation stays in a pending-like state — onSuccess hasn’t finished yet, so anything that depends on it completing is blocked.
TanStack Query’s documentation is explicit about this: returning a Promise from onSuccess keeps isPending true until the Promise resolves. This is useful when you want to show a loading state until the data is refreshed. It’s harmful when you accidentally chain UI interactions to network latency.
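That behavior can be modeled in a few lines (a toy runner, not the real library; `isPending`, the 20 ms mutation, and the 300 ms refetch are all invented for illustration):

```typescript
// Toy model of a mutation whose pending state only clears after onSuccess settles.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

let isPending = false;

async function mutate(onSuccess: () => void | Promise<void>): Promise<number> {
  const start = Date.now();
  isPending = true;
  await sleep(20);   // the network mutation itself: fast
  await onSuccess(); // if this returns a Promise, pending lasts until it resolves
  isPending = false;
  return Date.now() - start; // how long the mutation appeared "pending"
}

async function main() {
  // Awaited refetch inside onSuccess: pending spans the whole 300 ms refresh
  const slow = await mutate(async () => { await sleep(300); });
  // Fire-and-forget: pending ends as soon as the mutation itself is done
  const fast = await mutate(() => { void sleep(300); });
  console.log(slow >= 300, fast < 300);
}
main();
```

The slow variant stays pending for the mutation plus the refetch; the fast variant's pending window is just the mutation, even though the same 300 ms refresh still runs.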
## The Diagnostic Pattern
When you encounter a dialog, modal, or navigation that is slow to respond after a mutation, check the onSuccess handler:
```typescript
// 🔍 Red flag: async onSuccess with awaited invalidations
onSuccess: async () => {
  await queryClient.invalidateQueries(/* ... */); // <-- blocks
  closeDialog();                                  // <-- delayed
}
```

```typescript
// ✅ Fix: separate the user-facing action from the background refresh
onSuccess: () => {
  closeDialog();                            // <-- immediate
  queryClient.invalidateQueries(/* ... */); // <-- background
}
```

This pattern applies beyond TanStack Query. Any state management library that ties UI actions to cache refresh operations can produce the same bug: the user waits for data operations that should be invisible to them.
## The Flaky Test Investigation Protocol
When an E2E test times out on a UI interaction — a dialog not closing, a button not becoming enabled, a navigation not completing — the instinct is to increase the timeout. Resist that instinct. The timeout is a symptom. The question is: what is the UI waiting for?
### Step 1: Reproduce Under Constraint
Run the test with CPU throttling to simulate CI conditions. If it fails under throttling but passes normally, the failure is resource-dependent, which means there’s a performance issue hiding under normal conditions.
```typescript
// investigation helper: simulate CI-like resource constraints.
// Playwright has no config option for CPU throttling; in Chromium it goes
// through the Chrome DevTools Protocol.
import { test } from '@playwright/test';

test.beforeEach(async ({ page }) => {
  const client = await page.context().newCDPSession(page);
  await client.send('Emulation.setCPUThrottlingRate', { rate: 6 }); // 6x slowdown
});
```

### Step 2: Trace the Interaction
Use Playwright’s trace viewer to capture what happens between the user action and the expected result. Look for network requests that fire between the click and the dialog close. Count them. Time them.
```shell
# Run with trace on to capture the full interaction timeline
npx playwright test flaky-test.spec.ts --trace=on

# Open the trace viewer
npx playwright show-trace test-results/trace.zip
```

### Step 3: Check the Mutation Lifecycle
Open the frontend source for the component involved. Find the mutation hook. Read the onSuccess, onError, and onSettled handlers. Look for:
- `await` inside `onSuccess` (blocks UI)
- `Promise.all()` wrapping multiple async operations (multiplies latency)
- Network requests that must complete before UI updates (sequential dependency)
### Step 4: Count the Cache Invalidations
If you find `invalidateQueries` calls, count them. Each one triggers a network request. Under constrained resources those requests queue behind browser connection limits and a starved API server, so even a `Promise.all()` effectively serializes: 8 invalidations at roughly 3 seconds each adds up to ~24 seconds of blocking time. That’s your “flaky” test.
### Step 5: Propose the Separation
The fix is almost always the same: separate the user-facing action from the background data refresh. Close the dialog, then invalidate caches. Navigate the page, then refetch data. Show the toast, then update the dashboard.
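The pattern generalizes to a small discipline: do the user-facing thing first, then start the refresh without awaiting it. A sketch (the `completeAction` helper and its parameter names are invented, not an API):

```typescript
// Run the user-facing action synchronously; kick off the data refresh
// without awaiting it, and surface background failures out of band.
function completeAction(
  userFacing: () => void,                    // e.g. close the dialog, show the toast
  backgroundRefresh: () => Promise<unknown>  // e.g. invalidate/refetch queries
): void {
  userFacing(); // the user's action is done the moment the mutation succeeded
  backgroundRefresh().catch((err) => {
    console.error('background refresh failed', err); // log, never block the UI
  });
}
```

In the dialog above this would read `completeAction(onClose, () => Promise.all([/* invalidations */]))`, with the user never waiting on the second argument.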
## The Deeper Lesson
Your E2E suite is the closest thing you have to a real user. It clicks buttons at the speed a user would. It waits for responses the way a user would. It experiences latency the way a user on a constrained network would.
When your E2E suite says something is slow, believe it. When a test times out on a UI interaction, it’s not the test being impatient — it’s the test experiencing your application the way a real user experiences it under non-ideal conditions.
That “flaky” test saved every user from a 30-second hang on a dialog that should close in milliseconds. The investigation took a few hours. The fix was straightforward. The UX improvement affects every user, every time they use that feature.
Stop increasing timeouts. Start investigating root causes. The test is trying to tell you something.
