We Deleted 37 Tests and Our Coverage Got Better

Erik Treviño

We deleted 37 tests last month.

Our coverage got better.

Not “stayed the same” — got better. The suite ran faster, flaked less, and the workflow tests that remained already covered every path the deleted tests were checking. We didn’t lose a single meaningful assertion. We lost dead weight.

Here’s the thing most teams won’t admit: a test suite that only grows is like a codebase that only grows. Eventually the maintenance cost exceeds the value. Every team has a “let’s add more tests” instinct. Almost no team has a “let’s audit what we already have” discipline.

The Four-Label Classification System

I built a test classifier. Not an AI tool — a taxonomy. A systematic way to look at every test in a suite and answer one question: Does this test earn its place?

Every test gets one of four labels:

  • WORKFLOW — Tests a real user journey end-to-end. Login → navigate → perform action → verify result. These are the backbone. Keep.
  • FLUFF — Asserts something already covered by another test at a better layer. The login form validation test that also exists as a unit test. Remove.
  • MERGE — Two tests that share 80% of their setup and assertions but test slightly different branches. Combine into one parameterized test.
  • KEEP_DETAILED — An edge case that genuinely matters (timezone handling, permission boundaries, data migration paths). Keep, but evaluate whether it belongs at the E2E layer or could move to integration/unit.
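If you want to track verdicts as data rather than in a spreadsheet, the four labels fit naturally into a small record type. A minimal sketch (the `AuditEntry` shape and `tally` helper here are illustrative, not part of any real tooling from this audit):

```typescript
// Hypothetical data model for recording one verdict per test.
type TestLabel = 'WORKFLOW' | 'FLUFF' | 'MERGE' | 'KEEP_DETAILED';

interface AuditEntry {
  file: string;       // spec file containing the test
  title: string;      // the test's title string
  label: TestLabel;   // exactly one of the four labels
  coveredBy?: string; // for FLUFF: which test or layer already covers it
  mergeWith?: string; // for MERGE: the sibling test to combine with
}

// Tally verdicts so the results table can be generated straight from data.
function tally(entries: AuditEntry[]): Record<TestLabel, number> {
  const counts: Record<TestLabel, number> = {
    WORKFLOW: 0,
    FLUFF: 0,
    MERGE: 0,
    KEEP_DETAILED: 0,
  };
  for (const entry of entries) {
    counts[entry.label] += 1;
  }
  return counts;
}
```

Forcing every test through the `TestLabel` union is the "no maybe category" rule made concrete: the type system won't accept an unclassified test.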

The taxonomy is deliberately simple. Four labels. No ambiguity scale. No “maybe” category. Every test gets a definitive classification.

What Each Label Looks Like in Practice

WORKFLOW: A Real User Journey

```ts
// WORKFLOW — This test earns its place. It validates a complete user journey
// that can only be tested through a browser.
test('user creates a report and shares it with a teammate', async ({ page }) => {
  await page.goto('/reports');
  await page.getByRole('button', { name: 'New Report' }).click();

  // Fill out the report form
  await page.getByLabel('Report Name').fill('Q1 Sales Summary');
  await page.getByRole('combobox', { name: 'Template' }).selectOption('quarterly');
  await page.getByRole('button', { name: 'Generate' }).click();

  // Wait for the report to generate (involves backend processing)
  await page.waitForResponse(
    (resp) => resp.url().includes('/api/reports') && resp.status() === 201
  );
  await expect(page.getByText('Q1 Sales Summary')).toBeVisible();

  // Share with a teammate
  await page.getByRole('button', { name: 'Share' }).click();
  await page.getByLabel('Email').fill('teammate@company.com');
  await page.getByRole('button', { name: 'Send' }).click();
  await expect(page.getByText('Report shared successfully')).toBeVisible();
});
```

This test exercises navigation, form submission, backend processing, rendering, and a secondary action. It can only run in a browser. It validates a real user story. WORKFLOW.

FLUFF: Already Covered Elsewhere

```ts
// FLUFF — This test validates form validation rules through the browser.
// The exact same validation logic is covered by unit tests on the
// validation schema. The E2E test adds 45 seconds of browser execution
// to verify something the unit test covers in 12 milliseconds.
test('report name field shows error when empty', async ({ page }) => {
  await page.goto('/reports');
  await page.getByRole('button', { name: 'New Report' }).click();
  await page.getByLabel('Report Name').fill('');
  await page.getByLabel('Report Name').blur();
  await expect(page.getByText('Report name is required')).toBeVisible();
});
```

This test is not wrong. The assertion is valid. But the WORKFLOW test above already navigates to this form. If the validation were broken, the workflow test would fail at the form submission step. The fluff test adds 45 seconds of execution time to re-verify a validation rule that a unit test covers in milliseconds.
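For comparison, here is roughly what the fast layer looks like. The post doesn't show the actual validation schema, so this is a hypothetical validator with the same rule:

```typescript
// Hypothetical validator — the real schema behind the form isn't shown in
// this post, so treat these rules as illustrative.
function validateReportName(name: string): string | null {
  if (name.trim().length === 0) return 'Report name is required';
  if (name.length > 120) return 'Report name is too long';
  return null; // valid
}

// The unit-level equivalent of the FLUFF browser test above: same rule,
// no browser, no flake surface, milliseconds instead of 45 seconds.
console.assert(validateReportName('') === 'Report name is required');
console.assert(validateReportName('Q1 Sales Summary') === null);
```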

MERGE: Two Tests That Should Be One

```ts
// MERGE candidate A
test('admin can delete a report', async ({ page }) => {
  await loginAs(page, 'admin');
  await page.goto('/reports');
  await page.getByRole('row', { name: 'Q1 Sales' }).getByRole('button', { name: 'Delete' }).click();
  await page.getByRole('button', { name: 'Confirm' }).click();
  await expect(page.getByRole('row', { name: 'Q1 Sales' })).not.toBeVisible();
});

// MERGE candidate B — 90% identical setup, just checks a different element
test('admin sees delete confirmation dialog', async ({ page }) => {
  await loginAs(page, 'admin');
  await page.goto('/reports');
  await page.getByRole('row', { name: 'Q1 Sales' }).getByRole('button', { name: 'Delete' }).click();
  await expect(page.getByRole('dialog', { name: 'Confirm deletion' })).toBeVisible();
});
```

These two tests share the same setup, the same navigation, the same role, and the same initial action. Test B is a subset of Test A. Merge them — the first test already clicks through the confirmation dialog and verifies the deletion.
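The combined test might look like this (a sketch, not the exact test that shipped): assert the dialog on the way to asserting the deletion.

```ts
// MERGED — one test covers both. Former test B's assertion happens on the
// way to former test A's assertion, so nothing is lost.
test('admin deletes a report via the confirmation dialog', async ({ page }) => {
  await loginAs(page, 'admin');
  await page.goto('/reports');
  await page.getByRole('row', { name: 'Q1 Sales' }).getByRole('button', { name: 'Delete' }).click();

  // From test B: the confirmation dialog appears
  await expect(page.getByRole('dialog', { name: 'Confirm deletion' })).toBeVisible();

  // From test A: confirm, then verify the row is gone
  await page.getByRole('button', { name: 'Confirm' }).click();
  await expect(page.getByRole('row', { name: 'Q1 Sales' })).not.toBeVisible();
});
```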

The Audit Results

I applied this taxonomy to a 180-test E2E suite. Not a single pass — three rounds of classification with manual review at each stage.

| Label | Count | Percentage | Action |
| --- | --- | --- | --- |
| WORKFLOW | 134 | 74% | Keep — these are the suite |
| FLUFF | 37 | 21% | Remove — covered by other layers or by workflow tests |
| MERGE | 2 | 1% | Combine into one test |
| KEEP_DETAILED | 7 | 4% | Keep at E2E, or migrate to integration layer |

21% of the suite was fluff. One in five tests was consuming CI time, contributing to flake rates, and requiring maintenance — while providing zero unique coverage.

The Three-Round Audit Loop

A single-pass classification is not trustworthy. I missed classifications on my first pass, caught them on the second, and refined the taxonomy rules on the third.

Round 1: Initial classification. Apply the four labels to every test file. Use static analysis to flag tests that don’t contain navigation actions (likely fluff), tests that share 80%+ of their locators with another test (likely merge), and tests that only assert on a single element (likely fluff or keep-detailed).

```sh
# Quick static analysis: find tests that never call page.goto() or navigate.
# These are often fluff tests that rely on another test's setup.
grep -rL "page\.goto\|page\.click.*nav\|page\.getByRole.*link" tests/e2e/*.spec.ts
```
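The "shares 80%+ of their locators" flag can be approximated with a few lines of scripting. A sketch (Jaccard similarity over locator lines; the `page.` heuristic and the threshold are illustrative, not what any particular tool uses):

```typescript
// Hypothetical Round 1 helper: estimate how much locator code two tests
// share, to surface MERGE candidates for manual review.
function locatorLines(testBody: string): Set<string> {
  return new Set(
    testBody
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => line.includes('page.'))
  );
}

// Jaccard similarity over the locator lines: 1.0 means identical usage.
function locatorOverlap(a: string, b: string): number {
  const setA = locatorLines(a);
  const setB = locatorLines(b);
  let shared = 0;
  for (const line of setA) if (setB.has(line)) shared += 1;
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Pairs scoring above ~0.8 are worth a manual MERGE review.
```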

Round 2: Manual review of every FLUFF label. For each test labeled FLUFF, answer two questions: (1) Is there a WORKFLOW test that already covers this path? (2) Does a unit or integration test already assert this same behavior? If both answers are yes, the FLUFF label holds. If either is no, re-classify.

Round 3: Refine and re-audit. Update the classification rules based on what you learned in Round 2. Re-run the static analysis. Check for false negatives — tests you labeled WORKFLOW that should actually be MERGE or KEEP_DETAILED.

Three rounds. Not one. The cost of a false positive (deleting a test that provides unique coverage) is much higher than the cost of an extra audit round.

Before and After

Suite Performance

| Metric | Before (180 tests) | After (141 tests) | Change |
| --- | --- | --- | --- |
| Total suite duration | 24 min 12 sec | 17 min 45 sec | -27% |
| Flaky test rate | 3.2% of runs | 1.1% of runs | -66% |
| Unique workflow coverage | 134 paths | 134 paths | No change |
| Weekly maintenance hours | ~4 hrs | ~2.5 hrs | -38% |

The flake rate drop is the most telling metric. The 37 removed tests had the highest flake rate in the suite — because they were testing implementation details through the most fragile layer possible. A form validation rule tested through a browser is sensitive to rendering timing, DOM mutation observers, and focus management. The same rule tested in a unit test has zero flake surface.

What We Didn’t Lose

This is the number that matters: 134 workflow paths before, 134 workflow paths after. Every real user journey that was tested before the audit is still tested after. We didn’t lose coverage — we lost redundancy.

The distinction matters. Coverage that exists at two layers (unit + E2E) is not “more coverage” than coverage that exists at one correct layer. It’s duplicated cost with the same protection.

Why Fluff Tests Accumulate

Fluff tests are not created by careless engineers. They accumulate for structural reasons:

  1. Missing test layers. When your codebase has no component test layer or no API integration test layer, E2E becomes the dumping ground. Every behavior check, every edge case, every “let me just verify this one thing” ends up in E2E because there’s nowhere else for it to go.

  2. The “more tests = better” assumption. Team velocity metrics that count tests added per sprint incentivize quantity over architecture. Nobody gets credit for deleting a test, even when deletion improves the suite.

  3. Copy-paste test creation. A developer copies an existing E2E test as a template, changes the assertion, and now there are two tests that share 90% of their setup. Neither is wrong individually. Together, they’re a merge candidate.

  4. Fear of removing tests. “What if we need it later?” This is the test equivalent of hoarding. If the coverage exists at another layer, removing the E2E test doesn’t remove the safety net — it removes the duplicate.

The Audit Protocol: What You Can Do Monday Morning

  1. Export your test file list. Every E2E test file in your suite, one per line.

  2. Tag each test with its primary assertion. What is this test actually checking? “User can log in.” “Error message appears on invalid input.” “Data loads after navigation.”

  3. Cross-reference against your unit/integration tests. For each E2E assertion, does a faster test already cover the same behavior?

  4. Apply the four labels. WORKFLOW, FLUFF, MERGE, KEEP_DETAILED.

  5. Review FLUFF labels with the team. Don’t delete unilaterally. Show the team which tests you’re proposing to remove and which workflow tests already cover those paths.

  6. Delete in a single PR with before/after metrics. Run the full suite before and after. Measure duration, flake rate, and unique paths covered. The numbers should tell the story.
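The deltas for step 6 are one-liners to compute. A hypothetical helper (the function name and signature are mine):

```typescript
// Turn before/after measurements into the percentage deltas that go in
// the PR description and the metrics table.
function percentChange(before: number, after: number): string {
  const delta = ((after - before) / before) * 100;
  return `${delta >= 0 ? '+' : ''}${delta.toFixed(0)}%`;
}

// Suite duration: 24 min 12 sec -> 17 min 45 sec, expressed in seconds
console.log(percentChange(1452, 1065)); // "-27%"
// Flake rate: 3.2% of runs -> 1.1% of runs
console.log(percentChange(3.2, 1.1)); // "-66%"
```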

The Skill of Subtraction

Every engineer knows how to add a test. Write the setup, write the assertion, watch it pass, commit. It feels productive. The test count goes up. The coverage report shows green.

Removing a test requires a different skill set. You have to understand what the test actually covers (not what its name says), whether that coverage exists elsewhere in the suite, whether the assertion belongs at this layer, and whether removing it creates a gap or just eliminates a duplicate.

That’s harder. It requires understanding the full test architecture, not just the individual test file. It requires confidence that your remaining suite is sufficient — confidence backed by analysis, not hope.

A lean, workflow-focused suite with 141 tests that run in 17 minutes is more valuable than a bloated suite with 180 tests that runs in 24 minutes and flakes twice as often. The smaller suite is faster to run, cheaper to maintain, more reliable to interpret, and covers exactly the same user journeys.

Adding tests is easy. Removing tests is engineering.