Our CI pipeline failed for 13 days straight.
Nobody noticed.
20+ consecutive failures. Red builds every single day. The auto-deploy pipeline was dead. Every E2E run timed out after 30 minutes, burned CI credits, and produced a red badge that nobody looked at.
For 13 days, the team was merging code with zero E2E validation. Every PR that shipped during those two weeks went to production without the test suite that was supposed to catch regressions. The safety net had a hole in it, and the team was walking the tightrope without looking down.
Day 0: The PR That Broke Everything
A frontend refactor landed. Necessary work — the team was consolidating URL endpoint definitions across the application. Route keys that were duplicated across 17 files got centralized into a single configuration module. Clean refactoring. Well-scoped PR.
All code reviews passed. All unit tests passed. All 4 approvers signed off. The most experienced reviewer on the PR even flagged the URL mapping area as a risk — and reviewed it carefully.
The gap still slipped through.
The E2E test infrastructure maintained its own URL mapping. A separate configuration file that mapped route names to actual URLs. When the frontend refactored its route keys, nobody updated the test infrastructure’s mapping. So every E2E test that navigated to a route — which is every E2E test — tried to hit a URL that no longer existed.
```typescript
// test-config/routes.ts — THE BROKEN FILE
// These route keys matched the frontend's OLD naming convention.
// The frontend PR renamed them. This file was not updated.
export const ROUTES = {
  dashboard: '/app/dashboard',
  userProfile: '/app/user/profile',
  settings: '/app/settings/general',
  // ... 14 more routes
  reports: '/app/reports/overview',  // OLD: was renamed to 'reporting'
  analytics: '/app/analytics/main',  // OLD: was renamed to 'insights'
} as const;
```
```typescript
// Every test used these routes:
test('user can view analytics', async ({ page }) => {
  await page.goto(ROUTES.analytics);
  // Navigates to a URL that no longer exists
  // Test times out waiting for a page that will never load
});
```

Days 1-12: The Silent Failure
Here’s what happened for the next 12 days: nothing.
The E2E pipeline ran on every merge to main. It failed. It sent a notification. The notification went to a channel that the team had muted months ago because it was too noisy. The workflow run showed red in the GitHub Actions tab, which nobody checked because the PR checks (unit tests, linting) all passed.
Every day, the pipeline:
- Checked out the code
- Installed dependencies
- Launched the browsers
- Tried to navigate to routes that no longer existed
- Waited 30 seconds for each page to load
- Timed out
- Reported failure
- Repeated for every test in the suite
30 minutes of CI time, burned. Every single day. For almost two weeks.
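The timeout arithmetic in that loop lives in the test runner's configuration. A minimal sketch, assuming Playwright (the fixture syntax in the test snippets suggests it) with illustrative values that mirror the numbers in this post:

```typescript
// playwright.config.ts — a sketch of where the waits described above are configured.
// Values are illustrative, matched to the numbers in this post, not prescriptive.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Each page.goto() waits up to 30 seconds before failing:
    // the per-navigation wait that every broken route burned.
    navigationTimeout: 30_000,
  },
  // Per-test ceiling, so a single dead route cannot hang a test indefinitely.
  timeout: 60_000,
  // Whole-run ceiling: the 30-minute CI timeout that was hit daily.
  globalTimeout: 30 * 60_000,
});
```

With settings like these, a suite where every test's first navigation hangs will grind through test after test until the global ceiling kills the run, which is exactly the 30-minutes-of-red pattern described above.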
Day 13: Noticing
I found it the way most silent failures are found — by accident. I was investigating a different issue and opened the GitHub Actions tab. The wall of red was unmissable once you looked at it. 20+ consecutive failures on the main branch E2E pipeline.
The investigation started with a simple question: when did this start?
```shell
# Find the first failing run on main
gh run list --workflow=e2e.yml --branch=main --limit=30 --json conclusion,createdAt \
  | jq '.[] | select(.conclusion == "failure") | .createdAt' | tail -1

# Result: 13 days ago
```

Then: what changed 13 days ago?
```shell
# Find the merge commit from 13 days ago
git log --oneline --since="13 days ago" --until="12 days ago" --merges

# Cross-reference with the first failure timestamp
git log --oneline --all --after="2026-03-15T00:00:00" --before="2026-03-16T00:00:00"

# Found it: the URL consolidation PR, merged on March 15
```

Then: what exactly broke?
```shell
# Compare the test config routes against the frontend route definitions
# The frontend had renamed keys — the test config still used the old names
git diff HEAD~50..HEAD -- frontend/src/config/routes.ts
git diff HEAD~50..HEAD -- tests/config/routes.ts  # This file had ZERO changes
```

The diff told the story immediately. The frontend routes file showed 17 renamed keys across a 50-commit window. The test routes file showed zero changes in the same window. The mapping had drifted.
The Fix
The fix was 10 lines. Update the test route configuration to match the new frontend route keys.
```typescript
// test-config/routes.ts — FIXED
export const ROUTES = {
  dashboard: '/app/dashboard',
  userProfile: '/app/user/profile',
  settings: '/app/settings/general',
  // Updated to match the frontend refactor
  reporting: '/app/reporting/overview',  // Was 'reports'
  insights: '/app/insights/main',        // Was 'analytics'
  // ...
} as const;
```

Pipeline went from 30-minute timeouts to 4-minute green runs. Instantly.
The fix took 10 minutes. Finding it took 30 minutes of git archaeology. The problem existing for 13 days is the part that keeps me up at night.
The URL Mapping Contract
The real fix isn’t the 10-line route update. The real fix is ensuring this category of failure can’t go undetected again.
URL mappings between your frontend routing and your test infrastructure are a first-class contract. They need their own validation — not just “I hope someone remembers to update the test config.”
Here’s what a contract between frontend routes and test infrastructure looks like in code:
```typescript
// These tests validate that the test route config matches the frontend route config.
// They run in CI on every PR and fail if the mappings drift.

import { ROUTES as TEST_ROUTES } from '../config/routes';
import { ROUTES as APP_ROUTES } from '../../frontend/src/config/routes';

test('test routes must match application routes', () => {
  const testRouteKeys = Object.keys(TEST_ROUTES).sort();
  const appRouteKeys = Object.keys(APP_ROUTES).sort();

  // Every app route should have a corresponding test route, and vice versa
  const missingInTests = appRouteKeys.filter(key => !testRouteKeys.includes(key));
  const staleInTests = testRouteKeys.filter(key => !appRouteKeys.includes(key));

  expect(missingInTests).toEqual([]);
  expect(staleInTests).toEqual([]);
});

test('test route URLs must match application route URLs', () => {
  // Widen the const-asserted object so string keys can index it
  const appRoutes: Record<string, string> = APP_ROUTES;
  for (const [key, url] of Object.entries(TEST_ROUTES)) {
    expect(appRoutes[key]).toBeDefined();
    expect(appRoutes[key]).toBe(url);
  }
});
```

These tests run in under a second. They require no browser. They catch the exact failure that went undetected for 13 days. If the frontend PR had included this contract test in the codebase, the PR itself would have failed CI — because changing route keys in the frontend without updating the test config would have been caught as a contract violation.
The Pipeline Health Monitoring Checklist
The 13-day failure was possible because of multiple gaps in pipeline observability. Here’s what I implemented afterward:
1. Consecutive Failure Alerts
```yaml
# Runs every weekday morning and alerts if 3 or more of the
# last 5 main-branch E2E runs have failed
name: Pipeline Health Check
on:
  schedule:
    - cron: '0 9 * * 1-5'  # Every weekday at 9 AM

jobs:
  check-health:
    runs-on: ubuntu-latest
    steps:
      - name: Check E2E pipeline status
        run: |
          RECENT=$(gh run list --workflow=e2e.yml --branch=main --limit=5 \
            --json conclusion -q '.[].conclusion')
          FAILURES=$(echo "$RECENT" | grep -c "failure" || true)
          if [ "$FAILURES" -ge 3 ]; then
            echo "🚨 E2E pipeline has failed $FAILURES of the last 5 runs"
            # Send alert to team channel
            exit 1
          fi
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

2. Pipeline Success Rate Dashboard
Track the 7-day rolling success rate. If it drops below 80%, something is structurally wrong — not just a flaky test.
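The rolling rate itself is a few lines of code. A sketch, assuming the run conclusions come from `gh run list --json conclusion` and that filtering to the 7-day window happens on the CLI side; the 80% threshold is the one above:

```typescript
// success-rate.ts: compute a success rate from GitHub Actions run conclusions.
// Feed it JSON from something like:
//   gh run list --workflow=e2e.yml --branch=main --limit=50 --json conclusion
type RunConclusion = { conclusion: string };

export function successRate(runs: RunConclusion[]): number {
  if (runs.length === 0) return 1; // no runs in the window: nothing to alarm on
  const passed = runs.filter(r => r.conclusion === 'success').length;
  return passed / runs.length;
}

// Exit non-zero when the rate drops below the 80% threshold,
// so this script can gate a scheduled workflow
const runs: RunConclusion[] = JSON.parse(process.argv[2] ?? '[]');
const rate = successRate(runs);
if (rate < 0.8) {
  console.error(`E2E success rate ${(rate * 100).toFixed(0)}% is below the 80% threshold`);
  process.exit(1);
}
```

Wiring it into a scheduled workflow like the health check above gives you the "structurally wrong" signal without anyone having to open a dashboard.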
3. Maximum Acceptable Failure Duration
Decide as a team: what is the maximum number of consecutive days the E2E pipeline can fail before it becomes a blocking priority? For my team, the answer is now 2 days. If the pipeline is red for 2 consecutive days, it gets escalated.
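That escalation rule is mechanical enough to automate. A sketch of the core check, assuming run data sorted newest-first from the GitHub API; `consecutiveFailingDays` is a hypothetical helper, not an existing tool:

```typescript
// Count how many distinct calendar days the newest unbroken streak of
// failures spans. Assumes runs are sorted newest-first.
type Run = { conclusion: string; createdAt: string }; // createdAt: ISO 8601

export function consecutiveFailingDays(runs: Run[]): number {
  const redDays = new Set<string>();
  for (const run of runs) {
    if (run.conclusion === 'success') break; // the streak ends at the first green run
    redDays.add(run.createdAt.slice(0, 10)); // keep the YYYY-MM-DD part
  }
  return redDays.size;
}

// Escalate when the pipeline has been red for 2 or more consecutive days
export const needsEscalation = (runs: Run[]) => consecutiveFailingDays(runs) >= 2;
```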
4. Notification Channel Discipline
The team had muted the CI notification channel because it was too noisy. That’s a symptom of a different problem — the pipeline was sending alerts for flaky tests alongside genuine failures, and the signal got lost in the noise. The fix: separate channels for “pipeline failure” (hard failures that need attention) and “test instability” (flaky tests being tracked).
5. Contract Tests for Shared Configuration
Every piece of configuration that is shared between application code and test infrastructure needs a contract test. Routes, feature flags, environment variables, API endpoint URLs — if the test suite depends on it staying in sync with the application, validate the sync in CI.
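The route contract test generalizes: any two key sets that must stay in sync can be diffed the same way. A sketch; the env var lists below are hypothetical examples, not from the original incident:

```typescript
// Generic key-drift check for any shared app/test configuration.
export function findDrift(appKeys: string[], testKeys: string[]) {
  return {
    missingInTests: appKeys.filter(k => !testKeys.includes(k)),
    staleInTests: testKeys.filter(k => !appKeys.includes(k)),
  };
}

// Example: env vars the app reads vs. env vars the test harness stubs
const appEnv = ['API_BASE_URL', 'AUTH_CLIENT_ID', 'FEATURE_FLAGS_URL'];
const testEnv = ['API_BASE_URL', 'AUTH_CLIENT_ID']; // FEATURE_FLAGS_URL never stubbed

console.log(findDrift(appEnv, testEnv).missingInTests); // → [ 'FEATURE_FLAGS_URL' ]
```

One `findDrift`-style assertion per shared config file turns every silent desync into a failing PR check.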
Why Silent Failures Are the Most Dangerous
Loud failures get fixed. A test that fails on every PR blocks merging. A deployment that errors out gets rolled back. A build that crashes gets investigated immediately.
Silent failures erode confidence gradually. The pipeline is red, but PRs still merge (because the E2E suite runs on main, not as a PR check). The dashboard shows failures, but the team has stopped looking at it. The auto-deploy is broken, but manual deploys still work, so nobody’s truly blocked.
By day 5, the team has adapted to the broken state. By day 10, the broken state is the state. By day 13, someone notices and the reaction is “oh, that’s been broken for a while” instead of “this is an emergency.”
This is the most dangerous failure mode in test automation. Not the test that flakes occasionally — that’s visible and annoying. The pipeline that fails silently for weeks, training the team to ignore it, until the safety net it provides is purely theoretical.
The Deeper Lesson
If your CI pipeline can fail for two weeks and nobody acts on it, you don’t have CI. You have a cron job sending emails to /dev/null.
Continuous integration means continuous. Not “continuous unless the E2E suite is broken, in which case we’ll just run the unit tests and hope for the best.” The entire point of CI is that it catches failures early. A pipeline that fails silently for 13 days is catching nothing. It’s spending compute time to produce red badges that nobody reads.
The hardest part of fixing this wasn’t the 10-line route update. It wasn’t the contract test. It wasn’t the monitoring workflow. It was noticing. Everything else took an afternoon. The noticing took 13 days — and would have taken longer if I hadn’t been looking at the Actions tab for an unrelated reason.
Build systems that notice for you. Contracts that fail fast. Alerts that reach the right people. The pipeline should never be able to lie to you for 13 days.
Because the pipeline wasn’t flaky. It was broken. And the hardest part wasn’t fixing it — it was noticing.

