The Directive Platform

The Problem

The testing ecosystem bifurcated around AI. Model evaluation tools (promptfoo, DeepEval) assess output quality at the API layer. Browser automation frameworks (Playwright, Cypress) validate UI mechanics. Neither tests what the end user actually experiences: AI-generated content rendered in the browser, subject to the full chain of frontend state management, streaming protocols, error handling, and display logic.

At one cybersecurity company, 174 existing E2E tests covered navigation, forms, and data display, with zero assertions on AI output behavior, safety, or correctness.

The Framework: Behavioral Contract Testing

Behavioral contracts define deterministic boundaries that AI features must satisfy without constraining content. The framework formalizes a contract as a 5-tuple C = (S, T, Σ, Φ, Ψ):

  • S — State machine: transitions must follow declared order (idle → thinking → streaming → complete)
  • T — Timing constraints: response times within production-derived bounds with configurable headroom
  • Σ — Structural schema: response length, required sections, expected format
  • Φ — Safety invariants: PII detection, system prompt leakage prevention, tenant isolation
  • Ψ — Semantic similarity: content alignment with reference descriptions

Contracts are declarative YAML specifications that serve simultaneously as behavioral documentation, test oracles, and compliance evidence.
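To make the tuple concrete, here is a minimal sketch of what one contract could look like once loaded into memory, plus a check for the structural layer Σ. The source describes contracts as YAML; a TypeScript object is used here only to show the shape. Every field name, the feature name, and all numeric values are illustrative assumptions, not the framework's actual format.

```typescript
// Hypothetical in-memory shape of a contract C = (S, T, Σ, Φ, Ψ).
// All field names and values below are assumptions for illustration.
type AiState = "idle" | "thinking" | "streaming" | "complete";

interface BehavioralContract {
  feature: string;
  states: AiState[]; // S: declared transition order
  timing: { maxResponseMs: number; headroomPct: number }; // T
  schema: { minChars: number; maxChars: number; requiredSections: string[] }; // Σ
  safety: { pii: boolean; systemPromptLeak: boolean; tenantIsolation: boolean }; // Φ
  semantic: { reference: string; minSimilarity: number }; // Ψ
}

const exampleContract: BehavioralContract = {
  feature: "breach-analysis-summary",
  states: ["idle", "thinking", "streaming", "complete"],
  timing: { maxResponseMs: 15000, headroomPct: 20 },
  schema: { minChars: 200, maxChars: 6000, requiredSections: ["Summary", "Recommendations"] },
  safety: { pii: true, systemPromptLeak: true, tenantIsolation: true },
  semantic: {
    reference: "An analysis of a credential breach with remediation advice",
    minSimilarity: 0.3,
  },
};

// Σ check: validates length and required sections without asserting on content.
function checkStructure(contract: BehavioralContract, text: string): string[] {
  const violations: string[] = [];
  const { minChars, maxChars, requiredSections } = contract.schema;
  if (text.length < minChars || text.length > maxChars) {
    violations.push(`length ${text.length} outside [${minChars}, ${maxChars}]`);
  }
  for (const section of requiredSections) {
    if (!text.includes(section)) {
      violations.push(`missing section "${section}"`);
    }
  }
  return violations;
}
```

An empty result means the response satisfies the structural layer. Two responses with entirely different wording can both pass, which is the point of constraining structure rather than content.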

The 8 Custom Matchers

The framework implements 8 custom expect() matchers registered via Playwright’s expect.extend():

  1. toTransitionThrough — Validates AI features follow declared state machine sequences
  2. toMeetTimingContract — Enforces response time bounds from production performance data
  3. toMatchAiSchema — Validates structure (length, sections, format) without asserting content
  4. toNotContainPii — Detects SSNs, credit cards (Luhn checksum), API keys, bearer tokens
  5. toRespectTenantBoundary — Prevents cross-tenant identifier leakage
  6. toNotLeakSystemPrompt — Detects system prompt fragments in AI responses (strict + relaxed modes)
  7. toNotExceedTokenBudget — Enforces maximum token generation limits
  8. toBeSemanticallySimilar — Validates semantic alignment using TF-IDF cosine similarity with pluggable providers
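As an illustration of the last matcher, the sketch below computes TF-IDF cosine similarity between a response and a reference description and wraps the score in the { pass, message } result shape that Playwright's expect.extend() expects from a matcher. The tokenizer, the smoothed IDF formula, and the function names are assumptions; the framework's actual provider may differ.

```typescript
// Result shape returned by a custom matcher passed to expect.extend().
interface MatcherResult {
  pass: boolean;
  message: () => string;
}

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// TF-IDF weights for one document, with IDF computed over the given corpus.
// Smoothed IDF (an assumption) keeps terms shared by every document non-zero.
function tfidfVector(tokens: string[], corpus: string[][]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  const vector = new Map<string, number>();
  for (const [term, count] of counts) {
    const docFreq = corpus.filter((doc) => doc.includes(term)).length;
    const idf = Math.log((1 + corpus.length) / (1 + docFreq)) + 1;
    vector.set(term, (count / tokens.length) * idf);
  }
  return vector;
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, weight] of a) {
    dot += weight * (b.get(term) ?? 0);
    normA += weight * weight;
  }
  for (const weight of b.values()) normB += weight * weight;
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function semanticSimilarity(actual: string, reference: string): number {
  const actualTokens = tokenize(actual);
  const referenceTokens = tokenize(reference);
  const corpus = [actualTokens, referenceTokens];
  return cosine(tfidfVector(actualTokens, corpus), tfidfVector(referenceTokens, corpus));
}

// Hypothetical matcher body; a real one would be registered via expect.extend().
function sketchSemanticMatcher(actual: string, reference: string, threshold: number): MatcherResult {
  const score = semanticSimilarity(actual, reference);
  return {
    pass: score >= threshold,
    message: () => `similarity ${score.toFixed(3)} against threshold ${threshold}`,
  };
}
```

Because the score depends only on term overlap, it is deterministic for a given pair of texts, which is what lets a nondeterministic response be checked against a fixed threshold.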

The AiStateObserver

The technical innovation behind state machine validation: a DOM mutation observer injected via Playwright’s addInitScript() API. The observer attaches a MutationObserver to designated DOM elements, recording state transitions with high-resolution timestamps (performance.now()).

A timing problem surfaced during validation: AI responses can arrive in under 16 milliseconds, faster than a single browser animation frame at 60fps (about 16.7ms). In that case the thinking indicator and the first response token render in the same frame. The observer uses configuration-order priority scanning within a single requestAnimationFrame callback to distinguish states that would otherwise be temporally collapsed.

Sub-frame collisions occurred in 100% of validation runs, which indicates that this is a routine characteristic of modern LLM response times rather than an edge case.
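One way to read "configuration-order priority scanning" is sketched below: states detected during a frame are collected into a set, and at the frame boundary they are appended to the transition log in the order the configuration declares them, not the order they happened to be detected. This is an interpretation, not the observer's actual code. The function names are hypothetical, and the browser wiring (a MutationObserver filling the set, requestAnimationFrame calling the flush with performance.now()) is left out so the logic runs anywhere.

```typescript
interface Transition {
  state: string;
  at: number; // timestamp in ms, e.g. from performance.now() in a browser
}

// Append every newly seen state to the log in configuration order.
// Assumption: one response lifecycle per log, so a state is recorded at most once.
function flushFrame(
  configOrder: string[],
  seenThisFrame: Set<string>,
  log: Transition[],
  frameTime: number,
): void {
  const recorded = new Set(log.map((t) => t.state));
  for (const state of configOrder) {
    if (seenThisFrame.has(state) && !recorded.has(state)) {
      log.push({ state, at: frameTime });
    }
  }
}

// S check: the observed sequence must equal the declared sequence.
function followsDeclaredOrder(declared: string[], log: Transition[]): boolean {
  const observed = log.map((t) => t.state);
  return observed.length === declared.length && observed.every((s, i) => s === declared[i]);
}

// Simulated frames: "thinking" and "streaming" collide in the second frame,
// and are deliberately listed out of order in the set.
const declaredOrder = ["idle", "thinking", "streaming", "complete"];
const transitionLog: Transition[] = [];
flushFrame(declaredOrder, new Set(["idle"]), transitionLog, 0);
flushFrame(declaredOrder, new Set(["streaming", "thinking"]), transitionLog, 16.7);
flushFrame(declaredOrder, new Set(["streaming"]), transitionLog, 33.4);
flushFrame(declaredOrder, new Set(["complete"]), transitionLog, 50.1);
```

After the four simulated frames the log reads idle, thinking, streaming, complete, with thinking and streaming sharing the same timestamp. The collided states stay distinguishable because ordering comes from the configuration rather than from timestamps.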

Industrial Validation

Validated against a deployed cybersecurity investigation platform with live Amazon Bedrock AI:

  • 10 contract executions (2 contracts × 5 consecutive runs) — 100% compliance rate
  • Each run produced distinct AI content (responses ranged from 4,577 to 4,893 characters) while all validation layers returned deterministic pass results
  • Response times: 10,916ms to 13,084ms across runs (the slowest run took 19.9% longer than the fastest), all within contract bounds
  • Mean test execution time: 37.1 seconds, of which AI response wait constituted the dominant cost. Contract validation overhead is negligible — CPU-bound, no external API calls, single runtime dependency (js-yaml)
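The timing layer can be sketched as a baseline plus configurable headroom. In the example below the baseline and headroom values are hypothetical; only the two response-time endpoints come from the validation runs above. The last helper reproduces the arithmetic on those endpoints: (13,084 − 10,916) / 10,916 ≈ 19.9%.

```typescript
// Hypothetical timing contract: a production-derived baseline plus headroom.
interface TimingContract {
  baselineMs: number;
  headroomPct: number;
}

// Upper bound a response time must stay under.
function timingBound(contract: TimingContract): number {
  return contract.baselineMs * (1 + contract.headroomPct / 100);
}

function meetsTiming(contract: TimingContract, observedMs: number): boolean {
  return observedMs <= timingBound(contract);
}

// Relative spread between the fastest and slowest sample: (max - min) / min.
function relativeSpread(samplesMs: number[]): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const fastest = Math.min(...samplesMs);
  const slowest = Math.max(...samplesMs);
  return (slowest - fastest) / fastest;
}

// Illustrative values (assumptions): 12s baseline with 25% headroom gives a 15s bound.
const sketchTiming: TimingContract = { baselineMs: 12000, headroomPct: 25 };
const reportedRangeMs = [10916, 13084]; // endpoints reported for the live runs
const allWithinBound = reportedRangeMs.every((ms) => meetsTiming(sketchTiming, ms));
const spreadPct = relativeSpread(reportedRangeMs) * 100;
```

Deriving the bound from a baseline rather than hard-coding it means the contract can be regenerated when production latency shifts, while the headroom absorbs normal run-to-run variation.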

Security Findings

The safety invariant layer identified two previously undetected security issues that the 174 existing E2E tests had missed:

  1. PII patterns in AI responses. The toNotContainPii matcher detected Social Security Number and credit card formats in AI-generated breach analysis output. The AI feature was surfacing raw credential patterns in natural language summaries rather than masking them — a presentation layer failure that only a browser-layer assertion could catch.

  2. System prompt fragment leakage. The toNotLeakSystemPrompt matcher detected fragments of the system prompt appearing in AI error state responses — an information disclosure vulnerability that could expose internal system architecture to end users.
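The kind of check behind the first finding can be sketched as pattern matching on the rendered text, with a Luhn checksum to separate plausible card numbers from arbitrary digit runs. The regular expressions and the 20-character minimum for bearer tokens are assumptions chosen for illustration; the framework's actual patterns are not given in the source.

```typescript
// Luhn checksum: double every second digit from the right, subtract 9 from
// any result above 9, and require the total to be divisible by 10.
function passesLuhn(digits: string): boolean {
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

// Returns the kinds of PII found in the text (illustrative patterns only).
function findPii(text: string): string[] {
  const findings: string[] = [];
  // US Social Security Number format: 3-2-4 digits.
  if (/\b\d{3}-\d{2}-\d{4}\b/.test(text)) findings.push("ssn");
  // 13 to 19 digits, optionally separated by spaces or hyphens, then Luhn-checked.
  const cardCandidates = text.match(/\b\d(?:[ -]?\d){12,18}\b/g) ?? [];
  if (cardCandidates.some((c) => passesLuhn(c.replace(/[ -]/g, "")))) {
    findings.push("credit_card");
  }
  // "Bearer" followed by a long token; the length floor avoids matching prose.
  if (/\bBearer\s+[A-Za-z0-9\-._~+\/]{20,}=*/.test(text)) findings.push("bearer_token");
  return findings;
}
```

For example, findPii applied to a summary containing "4111 1111 1111 1111" (a well-known test card number) reports a credit card, while the same digits ending in 2 fail the checksum and are ignored. Running this against text read from the rendered page is what makes it a browser-layer assertion.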

Compliance-as-Code

Each test execution generates an AI Bill of Materials (AI-BOM) mapping contract layers to regulatory requirements:

  • State machine validation → EU AI Act Article 14 (human oversight)
  • Timing constraints → EU AI Act Article 15 (accuracy, robustness)
  • Safety invariants → EU AI Act Article 9 (risk management)
  • Structural validation → NIST AI RMF Measure 2.6 (AI system performance)
  • Compliance artifacts → EU AI Act Articles 11 & 12 (documentation, recordkeeping)

The same contract that validates AI behavior in CI/CD also generates the documentation required for regulatory audits — without additional tooling or manual documentation effort.
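A minimal sketch of how per-layer results could be turned into an AI-BOM record follows. The layer-to-requirement mapping is taken from the list above; the record shape, field names, and function name are assumptions, since the source does not show the artifact format.

```typescript
type ContractLayer = "state_machine" | "timing" | "safety" | "structure" | "artifacts";

// Mapping copied from the list above.
const REGULATORY_MAP: Record<ContractLayer, string[]> = {
  state_machine: ["EU AI Act Article 14"],
  timing: ["EU AI Act Article 15"],
  safety: ["EU AI Act Article 9"],
  structure: ["NIST AI RMF Measure 2.6"],
  artifacts: ["EU AI Act Article 11", "EU AI Act Article 12"],
};

interface LayerResult {
  layer: ContractLayer;
  passed: boolean;
}

// Hypothetical AI-BOM record shape.
interface AiBom {
  contract: string;
  generatedAt: string;
  compliant: boolean;
  entries: Array<LayerResult & { requirements: string[] }>;
}

// Attach the mapped requirements to each layer result from a test run.
function buildAiBom(contract: string, results: LayerResult[], generatedAt: string): AiBom {
  return {
    contract,
    generatedAt,
    compliant: results.every((r) => r.passed),
    entries: results.map((r) => ({ ...r, requirements: REGULATORY_MAP[r.layer] })),
  };
}

// Example: serialize a run's results as an audit artifact.
const sketchBom = buildAiBom(
  "breach-analysis-summary", // hypothetical contract name
  [
    { layer: "state_machine", passed: true },
    { layer: "safety", passed: false },
  ],
  "2026-01-01T00:00:00Z", // placeholder timestamp
);
const sketchBomJson = JSON.stringify(sketchBom, null, 2);
```

Passing the timestamp in, rather than reading the clock inside the function, keeps the artifact reproducible for a given run.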

Dual-Mode Execution

  • Mocked mode (CI/CD): SSE responses intercepted and replaced with deterministic fixtures for fast, reproducible contract validation
  • Live mode (nightly/staging): Tests execute against real AI endpoints, validating behavioral contracts against actual nondeterministic model output

Both modes execute identical contract assertions.
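For mocked mode, a deterministic fixture has to be encoded as a Server-Sent Events body before it can replace the real stream. The sketch below shows the encoding and a matching parser. The JSON payload shape ({ token }) and the [DONE] sentinel are assumptions, since the application's actual wire format is not given in the source.

```typescript
// Encode fixture tokens as an SSE body: each event is a "data:" line
// terminated by a blank line. The { token } payload is hypothetical.
function toSseBody(tokens: string[]): string {
  const events = tokens.map((token) => `data: ${JSON.stringify({ token })}\n\n`);
  return events.join("") + "data: [DONE]\n\n"; // assumed end-of-stream sentinel
}

// Decode an SSE body back into tokens, skipping the sentinel and blank events.
function parseSseBody(body: string): string[] {
  const tokens: string[] = [];
  for (const event of body.split("\n\n")) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data:")) continue;
      const payload = line.slice("data:".length).trim();
      if (payload === "" || payload === "[DONE]") continue;
      const parsed = JSON.parse(payload) as { token: string };
      tokens.push(parsed.token);
    }
  }
  return tokens;
}

// A fixed fixture yields the same stream on every run.
const fixtureTokens = ["Hel", "lo", " world"];
const fixtureBody = toSseBody(fixtureTokens);
const decodedText = parseSseBody(fixtureBody).join("");
```

In a Playwright test, a body like this could be served from a page.route() handler via route.fulfill() with contentType set to "text/event-stream". In live mode the route is simply not registered, so the same assertions run against the real endpoint.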

The Three Components

SECUR-T — Behavioral Contract Testing Framework

The 8 custom matchers, the AiStateObserver, the YAML contract format, and the compliance artifact generator. 118 tests against live AI endpoints in CI.

CurioEVE — AI-Powered CLI

60+ commands. Generates test specs from acceptance criteria, produces Playwright code from specs, detects anti-patterns, scores sprints for automation feasibility. 80.4% test coverage.

M-O — Full-Stack Test Management Platform

Go/Gin backend + React frontend. 950+ tests, 71.5% coverage, 60+ REST API routes, 10 frontend pages. Flaky-test detection, error clustering, Slack/GitHub/Jira/Confluence integrations. Built in ~3 days. Maturity score rose from 3.0 to 7.8 out of 10 across 8 phases.

The Closed-Loop Pipeline

Ticket → spec → test → CI → results → gap analysis → next sprint. The Directive Platform automates the complete journey from requirement to validated code to coverage gap identification.

Publication

“Behavioral Contract Testing for AI Features at the Browser Layer,” submitted for double-blind review to IEEE AITest 2026, AI Testing in Practice Track.
