How to Test Your Invoice Extraction Pipeline

Extraction pipelines turn unstructured documents into structured data. Your invoice goes in as a PDF; line items, totals, tax IDs, and payment terms come out as JSON — the end-to-end pipeline architecture spanning ingestion through export. The failure mode that matters is not whether your code runs without exceptions; it is whether the extracted values are correct. A unit test confirms your function returns the expected type. An integration test confirms your services communicate. Neither tells you whether your model read "€14,320.00" instead of "€1,432.00" on a German invoice with ambiguous formatting. That error passes every type check and schema validation you have. It enters your ERP, distorts your payables, and nobody catches it until reconciliation, if then.

The problem compounds at scale. A misread vendor tax ID does not throw an error. A transposed line item quantity silently propagates through purchase orders and financial reports. Poor software quality, including data processing errors that cascade through financial systems, cost the U.S. economy an estimated $2.41 trillion in 2022, with technical debt accounting for roughly $1.52 trillion of that figure. Extraction errors are particularly insidious because they corrupt data at the point of entry, before downstream logic can validate it.

Testing an extraction pipeline requires three components: a ground-truth dataset of annotated invoices to compare against, field-level accuracy metrics that distinguish exact matches from partial matches and outright misreads, and automated regression checks that catch accuracy drops in CI before they reach production. The sections that follow build each one in depth. If you are helping a finance team run a buyer-side pilot instead of an engineering test harness, this guide to testing invoice automation before you buy covers the business evaluation workflow.

Building a Ground-Truth Invoice Dataset

Every extraction test needs an expected output to compare against. In extraction testing, that expected output is your ground-truth invoice dataset: a collection of real invoice documents paired with structured files containing the known-correct extracted values for every field. When your pipeline extracts data from invoice #4471, the ground truth tells your test suite exactly what the correct invoice number, date, line items, and totals should be. Without this foundation, accuracy metrics are meaningless and regression tests have nothing to assert against.

Designing the Ground-Truth Schema

Your ground-truth schema should mirror the structure your extraction pipeline actually produces. If your pipeline outputs flat JSON with invoice-level fields and a nested line-items array, your ground truth should follow the same shape. This makes assertion logic straightforward: compare expected versus actual, field by field.

A practical schema covers both invoice-level fields (invoice number, invoice date, due date, vendor name, total amount, currency, tax amount) and line-item fields (description, quantity, unit price, line total). Here is a minimal example:

{
  "source_file": "invoices/acme-2024-0472.pdf",
  "annotations": {
    "invoice_number": "INV-0472",
    "invoice_date": "2024-03-15",
    "due_date": "2024-04-14",
    "vendor_name": "Acme Industrial Supply",
    "currency": "USD",
    "tax_amount": 47.25,
    "total_amount": 547.25,
    "line_items": [
      {
        "description": "Steel bracket, 4-inch",
        "quantity": 50,
        "unit_price": 8.50,
        "line_total": 425.00
      },
      {
        "description": "Shipping & handling",
        "quantity": 1,
        "unit_price": 75.00,
        "line_total": 75.00
      }
    ]
  }
}

Two details matter here. First, normalize your expected values consistently. Dates should use a single format (ISO 8601). Monetary amounts should be numbers, not locale-formatted strings. Vendor names should follow a canonical spelling you decide on up front. Second, define how you represent missing fields. If an invoice has no due date, your schema should use null rather than omitting the key entirely. Omitted keys and null keys behave differently in assertion logic, and conflating them will produce false positives.

Selecting and Annotating Your Initial Set

Start with 20 to 50 invoices that represent the real variation your pipeline handles. You are building a test fixture, not a training set, so coverage of variation matters more than volume. Select invoices that span:

Multiple vendors with different layouts and naming conventions
Different currencies and number formats (comma vs. period decimal separators)
Single-page and multi-page documents
Invoices with few line items and invoices with dozens
Scanned documents versus born-digital PDFs, if your pipeline handles both

Manual ground truth annotation is unavoidable for this initial set. Open each invoice, read every field, and record the correct value in your structured template. This is tedious. It is also the only way to get a reliable baseline.

You will encounter ambiguous cases during annotation. The vendor name on the invoice says "ACME IND. SUPPLY LLC" but your system normalizes it to "Acme Industrial Supply." A date field reads "03/15/24" and could be interpreted differently across locales. Whitespace and special characters in description fields vary. For each ambiguity, make a decision and document it in a conventions file alongside your dataset. "Vendor names use title case, spelled out in full, without legal suffixes." "Dates normalize to YYYY-MM-DD." These conventions become part of your test contract. Without them, half your test failures will be annotation inconsistencies rather than real extraction errors.

Annotation Workflow That Scales

Use a structured template from the start. JSON files following your schema, or a CSV with one row per field per invoice. Freeform notes in a spreadsheet will not survive contact with your assertion code.

For reliability, have two people independently annotate a subset of at least 10 invoices. Compare their outputs. Disagreements reveal genuinely ambiguous fields where your conventions need tightening, and they catch transcription errors that a single annotator would miss. An inter-annotator agreement rate below 95% on a field means your annotation conventions for that field are underspecified.

Store ground-truth files alongside the source invoice documents in version control. A clean directory structure looks like:

test-data/
  invoices/
    acme-2024-0472.pdf
    globex-2024-1100.pdf
  ground-truth/
    acme-2024-0472.json
    globex-2024-1100.json
  annotation-conventions.md

This keeps the dataset versioned with your test code. When someone checks out a commit from three months ago, they get the exact ground truth that was valid at that point, and test results remain reproducible.

Maintaining a Living Dataset

Your ground-truth invoice dataset is not a one-time artifact. It evolves with your pipeline.

When a new edge case surfaces in production (a vendor using a table layout your tests never covered, an invoice in a new currency), add it to the dataset with its annotation. When your extraction schema changes, perhaps you start extracting purchase order numbers or payment terms, update the ground-truth schema to match and backfill annotations for existing invoices where applicable. Tag or version your dataset explicitly so you can tie any test run to the exact ground-truth version it used.

A dataset that stagnates while your pipeline evolves will silently lose coverage. The invoices that cause production failures will be exactly the ones your ground truth never included.

Generating Synthetic Test Invoices for Edge Cases

Real invoices reflect your current vendor mix, which means they cover the formats, currencies, and layouts you already process successfully. The failure modes that cause production incidents are, by definition, the ones your real data underrepresents. Synthetic invoice test data fills this gap by letting you programmatically generate controlled documents where every value is known, so you can systematically probe the exact conditions your pipeline handles worst.

The key advantage of synthetic data generation is that every document is self-annotating. Because you define the generation parameters (vendor name, line item amounts, tax rates, currency codes), those parameters are the ground truth. There is no manual annotation step for synthetic invoices. You generate the PDF and the expected extraction output in a single pass, which means you can produce hundreds of edge-case documents in the time it would take to annotate ten real ones.

Two Approaches to Generation

Template-based generation gives you full control over layout and content. The workflow is straightforward:

Build a set of HTML/CSS invoice templates with placeholder fields for vendor name, address, line items, totals, tax breakdown, and currency symbols.
Populate those placeholders using a data generator library like Faker (available in Python, JavaScript, and most other languages) to produce realistic but controlled values. Generate names, addresses, tax IDs, and amounts that follow the formatting conventions of your target locales.
Render the populated HTML to PDF using a headless browser (Puppeteer, Playwright) or wkhtmltopdf.
Store the generation parameters as a JSON sidecar file, which becomes your ground-truth annotation.

Template-based generation scales well and lets you create visually distinct invoice layouts that mirror the variation you see across real vendors. Build five to ten base templates with different header positions, table structures, and footer layouts to avoid testing against a single format.

Dedicated synthetic document generation tools take this further by producing both the rendered document and structured annotations as a single output. These tools handle layout randomization, font variation, and format diversity automatically, which reduces the engineering effort needed to maintain a template library. If your team does not need pixel-level control over invoice appearance, a dedicated generator can get you to a working synthetic test suite faster.

Edge Case Categories Your Suite Must Cover

Structure your synthetic data generation around these categories, each targeting a specific class of extraction failure:

Multi-currency invoices. Generate invoices in USD, EUR, and GBP at minimum, but also include currencies with non-standard symbols and formatting (Japanese yen with no decimal places, Swiss francs with apostrophe thousand separators). Currency parsing errors compound across every line item.
Multi-page invoices. Create documents where the line item table spans two or three pages. Many extraction pipelines lose rows at page boundaries or duplicate header rows as line items. Test with 20, 50, and 100+ line items to find the threshold where extraction completeness degrades.
Credit notes. Generate documents with negative totals, negative line item amounts, and a document type label of "Credit Note" rather than "Invoice." These test both value extraction (does the pipeline preserve the negative sign?) and document classification (does it correctly identify the document type?).
Mixed-format batches. Produce the same invoice content as both a native PDF and a scanned image (JPEG, TIFF, PNG) to verify your pipeline handles format routing correctly when processing mixed batches.
Low-quality scans. Take your generated PDFs and degrade them programmatically: reduce resolution to 150 DPI or lower, add Gaussian noise, apply slight rotation (1 to 3 degrees), and introduce compression artifacts. This simulates the quality variation in real scanned documents without requiring a physical scanner.
High line-item counts. Invoices with 50+ line items stress extraction completeness. Generate documents with exactly known item counts so you can verify that every row was captured. This is where you discover whether your pipeline truncates results or silently drops items beyond a certain count.
Right-to-left and non-Latin scripts. If your product serves international markets, generate invoices with Arabic or Hebrew text to test bidirectional text handling. Even if these are rare today, a single mishandled document can block an entire accounts payable workflow.

Balancing Real and Synthetic Data

A practical target composition is 60 to 70% real annotated invoices for production representativeness and 30 to 40% synthetic invoices for systematic edge case coverage. The real portion ensures your tests reflect actual vendor formats and quality levels. The synthetic portion ensures you are testing failure modes that production data has not yet surfaced. Adjust the ratio based on where your pipeline actually fails in production.

Choosing the Right Accuracy Metrics for Each Field Type

A binary "did it match?" check will fail you almost immediately. Invoice data spans structured identifiers, free-text fields, monetary values, and dates, each with different legitimate variation. A vendor name extracted as "Acme Corp." versus ground truth "Acme Corp" is not an error. A total of $1,234.56 versus $1,234.57 due to floating-point rounding is not a regression. Your test harness needs matching strategies that reflect how each field type actually behaves.

The practical framework breaks down into three matching strategies, applied per field type.

Exact Match: Identifiers and Codes

Fields with canonical formats that permit zero variation belong here: invoice numbers, PO numbers, currency codes, tax identification numbers. The extracted value must be character-identical to ground truth after normalization (trim whitespace, standardize case). An invoice number of "INV-2024-0881" either matches or it does not. There is no partial credit.

Normalization matters. Before comparing, strip leading and trailing whitespace, collapse internal whitespace to single spaces, and uppercase both values for case-insensitive fields like currency codes. This prevents your tests from flagging "USD " versus "USD" as a failure while still catching genuine extraction errors.

Fuzzy Match: Text Fields

Vendor names, addresses, and line item descriptions rarely extract with perfect character fidelity. OCR artifacts, minor formatting differences, and abbreviation inconsistencies are normal. Penalizing these as failures creates noise that drowns out real problems.

Levenshtein distance (edit distance) gives you a tunable threshold for acceptable variation. Set a similarity ratio above 0.95 or a maximum edit distance of 2 characters for short fields. "Müller & Associates" matching against "Muller & Associates" should pass. "Müller & Associates" matching against "Miller Associates" should fail. The threshold you choose depends on your pipeline's typical OCR quality and the field lengths you encounter. Start conservative (0.95 similarity) and adjust as you accumulate test data showing where legitimate variation actually lands.

For a deeper treatment of how OCR quality affects these thresholds, see our guide on understanding invoice OCR accuracy metrics and benchmarks.

Tolerance-Based Numerical Matching: Amounts and Quantities

Monetary amounts and quantities are vulnerable to floating-point representation differences, rounding during currency conversion, and minor OCR misreads of decimal separators. A rigid exact match on these fields generates false failures.

Set a tolerance window appropriate to the field:

Currency fields: plus or minus $0.01 (one cent). A total of $1,234.56 matching against ground truth of $1,234.57 is a pass.
Quantities: plus or minus 0.5%. A quantity of 1,000 matching against 1,002 within tolerance is a pass.
Tax percentages: plus or minus 0.1 percentage points.

If a value falls outside tolerance, that is a genuine extraction failure worth investigating.

Date Matching: The Format Trap

Dates are a common and frustrating failure mode. Invoices represent dates as "03/28/2026", "28.03.2026", "March 28, 2026", "2026-03-28", and dozens of other formats. Your test will report false failures constantly if it compares raw strings.

Normalize both extracted and ground-truth dates to ISO 8601 (YYYY-MM-DD) before comparison. Your comparison function should parse both values into a date object, then compare the date objects. This eliminates format variation entirely and lets you focus on whether the pipeline extracted the correct date, not whether it formatted it the way your ground truth expected.

Per-Field Reporting and Classification Confusion

Calculate accuracy (matches divided by total test cases) for each field independently. Aggregate accuracy across all fields masks critical weaknesses — a 99% date-extraction rate paired with 85% on vendor names averages to a deceptively healthy 94%. Track per-field accuracy in every test run; this is what reveals where to invest improvement effort and catches regressions in specific field types that aggregate metrics smooth over.

For classification fields (document type, currency code, line-item category), pair the per-field accuracy number with a confusion matrix that shows which values get confused with which others. If your pipeline consistently extracts "EUR" when the actual currency is "CHF", that systematic error has a specific cause (visual similarity in source documents or a training-data gap) that an aggregate accuracy drop will not surface.

Centralizing the Logic

A match(field_type, extracted, expected) function dispatches to the right strategy per field type. Map each field in your schema to one:

invoice_number → exact match
vendor_name → fuzzy match (threshold 0.95)
total_amount → tolerance match (±$0.01)
invoice_date → date normalization then exact match
currency_code → exact match (feed mismatches into a confusion matrix)

Line items need an additional alignment step: match rows by composite key (description similarity + amount) and compare order-independently. A missing row counts as a miss for every field in that row.

Designing Regression Tests That Catch Accuracy Drops

Regression testing for extraction means running the same set of invoices through your pipeline on every change, comparing results against ground truth, and failing the build when per-field accuracy drops below a defined threshold. The goal is to catch the moment accuracy degrades, not to prove the system works. Provider swaps, model updates, preprocessing changes, prompt edits: any of these can silently degrade accuracy on field types you are not manually checking. A regression suite makes that degradation visible and blocking. That matters even more once your runtime acceptance layer includes schema and arithmetic validation rules for extracted invoice data, because extraction changes can shift both model accuracy and downstream auto-accept behavior.

Test Harness Architecture

Your test harness follows a four-step loop for each invoice in the test set:

Send the test invoice through the extraction pipeline. If you are testing against the invoice extraction API, the Python SDK's extract() method handles upload, submission, polling, and download in a single call. Point it at your test invoice directory, pass your prompt configuration, and request JSON output.
Parse the structured output. JSON is the most test-friendly format because it maps directly to your ground-truth data structures without any format conversion. If you want a stricter contract before scoring accuracy, this is also the point to add Pydantic validation for extracted invoice JSON in Python. Each extraction result includes page-level tracking through pages.successful and pages.failed arrays, so your harness knows exactly which pages returned data and which did not.
Apply field-level matching logic. For each extracted field, compare the value against the corresponding ground-truth entry using the metric appropriate to that field type (exact match, normalized match, numeric tolerance, or fuzzy string similarity from the previous section).
Aggregate per-field accuracy scores. Calculate the percentage of test invoices where each field was correctly extracted. This gives you a single accuracy number per field across your entire test set.

Handle partial-success scenarios explicitly. If the API reports that 3 out of 4 pages in a multi-page invoice extracted successfully but one failed, your test should count the failed page as a miss for any fields expected from that page rather than silently skipping it. The API response also includes an ai_uncertainty_notes array that flags assumptions the extraction model made (such as inferring a currency from context or disambiguating similar vendor names). Log these notes alongside test results. They will not trigger a test failure on their own, but they identify invoices where extraction confidence is lower and accuracy is more likely to shift between runs.

The Baseline-and-Threshold Pattern

After your first full test run, record per-field accuracy as your baseline. Store it in a version-controlled JSON file:

{
  "baseline_date": "2026-03-28",
  "extraction_provider": "invoicedataextraction",
  "prompt_version": "v3",
  "sample_size": 250,
  "field_accuracy": {
    "invoice_number": 98.4,
    "invoice_date": 97.2,
    "total_amount": 99.2,
    "currency": 96.8,
    "vendor_name": 95.6,
    "line_item_description": 91.2,
    "line_item_amount": 94.8
  },
  "thresholds": {
    "invoice_number": 96.5,
    "invoice_date": 95.5,
    "total_amount": 97.5,
    "currency": 95.0,
    "vendor_name": 93.5,
    "line_item_description": 89.0,
    "line_item_amount": 93.0
  }
}

Set each threshold 1 to 2 percentage points below the baseline. This tolerance absorbs minor non-deterministic variation in extraction output without masking real regressions. Future test runs load this file, run the full suite, and fail if any field drops below its threshold. On small test sets (under 100 invoices), a 2-percentage-point drop could be within normal sampling variance rather than a real regression. If you are running fewer than 100 invoices, use a wider threshold buffer (3 to 4 percentage points) until you can increase your dataset size.

The baseline file updates only through deliberate commits. When you intentionally improve extraction (better prompts, a new provider, additional preprocessing), run the suite, verify the improvement, and commit the updated baseline. This creates an auditable history of accuracy changes tied to specific code changes in your version control log.

Structuring Tests with pytest

The examples below use Python and pytest, but the architecture (parameterized test cases, baseline comparison, structured reporting) applies to any test framework and language.

Organize your extraction regression tests as parameterized fixtures where each test case is an invoice file paired with its ground-truth file:

import pytest
import json
from invoicedataextraction import InvoiceDataExtraction

BASELINE_PATH = "tests/extraction/baseline.json"
TEST_DATA_DIR = "tests/extraction/invoices/"
GROUND_TRUTH_DIR = "tests/extraction/ground_truth/"

@pytest.fixture(scope="session")
def baseline():
    with open(BASELINE_PATH) as f:
        return json.load(f)

@pytest.fixture(scope="session")
def extraction_results():
    client = InvoiceDataExtraction(api_key="your-test-key")
    result = client.extract(
        folder_path=TEST_DATA_DIR,
        prompt=load_prompt_config(),
        output_structure="per_invoice",
        download={"formats": ["json"], "output_path": "tests/extraction/output/"}
    )
    return parse_extraction_output("tests/extraction/output/")

def test_field_accuracy_above_threshold(extraction_results, baseline):
    accuracy = compute_per_field_accuracy(extraction_results, GROUND_TRUTH_DIR)
    failures = []
    for field, score in accuracy.items():
        threshold = baseline["thresholds"].get(field)
        if threshold and score < threshold:
            failures.append(
                f"{field}: {score:.1f}% (threshold: {threshold}%)"
            )
    assert not failures, f"Fields below threshold:\n" + "\n".join(failures)

Use pytest markers to tag test invoices by category so you can run targeted subsets:

@pytest.mark.multi_currency for invoices with non-USD amounts
@pytest.mark.multi_page for invoices spanning multiple pages
@pytest.mark.credit_note for credit memos and adjustments
@pytest.mark.low_quality for scanned or degraded documents

This lets you isolate which categories regressed. A 2% overall drop might be a 15% drop concentrated entirely in multi-page invoices, which points directly at where the problem is.

Generate a structured test report in JUnit XML (for CI integration) or JSON (for custom dashboards) that records per-invoice, per-field match results. This granular output feeds into downstream analysis: you can sort by field to find which ones are trending downward, or sort by invoice to identify specific documents that consistently break.

Provider Comparison Testing

When evaluating a new extraction provider or model, run the same regression suite against both solutions. Keep the test dataset and ground truth identical. Compare per-field accuracy side by side:

Field	Current Provider	Candidate Provider	Delta
invoice_number	98.4%	97.8%	-0.6%
total_amount	99.2%	99.6%	+0.4%
vendor_name	95.6%	97.1%	+1.5%
line_item_description	91.2%	88.4%	-2.8%

This turns a subjective evaluation into an objective, data-driven decision. The candidate might be better on header fields but worse on line items. Without the test suite, you would discover this only after deploying to production and fielding complaints from your accounting team.

If you are already extracting invoice data with Python and the SDK, adding a second provider to the comparison harness is straightforward: swap the extraction call, keep everything else identical. The same harness applies to prompt changes — lock the prompt configuration as a versioned artifact, run the suite before and after any wording change, and block the merge if any field drops below threshold. This catches the common failure mode where a prompt tweak fixes one invoice format and silently breaks three others.

Running Extraction Tests in CI and Reporting Results

Until your test suite runs in CI with hard accuracy gates, it remains a suggestion rather than a contract. This section covers how to integrate extraction accuracy tests into GitHub Actions so that no pull request merges and no release ships when field-level accuracy drops below your defined thresholds.

GitHub Actions Workflow

A GitHub Actions workflow handles four steps in order — checkout the test dataset and thresholds, run the suite against your staging endpoint, compare per-field accuracy against baseline, exit non-zero on any breach. If you are wiring CI to a third-party service, use an API security due-diligence checklist for document extraction before pointing automated tests at anything production-adjacent. Trigger the workflow on pull requests touching extraction code and on a scheduled cadence for drift detection:

name: Extraction Accuracy Tests

on:
  pull_request:
    paths:
      - 'src/extraction/**'
      - 'prompts/**'
      - 'schemas/**'
  schedule:
    - cron: '0 6 * * 1'  # Weekly Monday 6 AM UTC

jobs:
  extraction-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        config: [default, multi-currency, eu-format]
    steps:
      - uses: actions/checkout@v4

      - name: Cache test dataset
        uses: actions/cache@v4
        with:
          path: tests/fixtures/invoices
          key: test-dataset-${{ hashFiles('tests/fixtures/manifest.json') }}

      - name: Run extraction test suite
        env:
          EXTRACTION_API_KEY: ${{ secrets.EXTRACTION_API_KEY }}
          EXTRACTION_ENDPOINT: ${{ secrets.STAGING_ENDPOINT }}
          TEST_CONFIG: ${{ matrix.config }}
        run: |
          python -m pytest tests/extraction/ \
            --config $TEST_CONFIG \
            --report-output results/$TEST_CONFIG.json \
            --tb=short

      - name: Evaluate accuracy gates
        run: python scripts/check_accuracy_gates.py results/${{ matrix.config }}.json

      - name: Upload accuracy report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: accuracy-report-${{ matrix.config }}
          path: results/

Balancing Thoroughness Against Credit Cost

Extraction tests consume API credits, typically one credit per successfully processed page. Running 500 invoices on every pull request will burn through your budget fast and slow down your feedback loop.

Split your test suite into two tiers:

Smoke tests (every pull request): A focused subset of 10 to 20 invoices covering the most critical field types and your highest-risk edge cases. This runs in under a minute and catches obvious regressions before a reviewer even looks at the code.

Full regression suite (scheduled or release-branch only): The complete 50 to 100 invoice dataset with full edge case coverage. Run this on your weekly schedule and on merges to release branches. This is where you catch subtle accuracy shifts across the long tail of invoice formats.

You can control this with a simple environment flag:

- name: Run smoke tests (PR)
  if: github.event_name == 'pull_request'
  run: pytest tests/extraction/ -m smoke --report-output results/smoke.json

- name: Run full regression suite (scheduled)
  if: github.event_name == 'schedule'
  run: pytest tests/extraction/ --report-output results/full.json

Accuracy Gate Design

When a gate fails, the developer needs to know exactly what broke and by how much. A generic "tests failed" message forces them to dig through logs. Instead, your gate-checking script should output a structured summary table:

╔══════════════════╦══════════╦═══════════╦═══════════╦════════╗
║ Field            ║ Current  ║ Baseline  ║ Threshold ║ Status ║
╠══════════════════╬══════════╬═══════════╬═══════════╬════════╣
║ vendor_name      ║ 97.8%    ║ 98.2%     ║ 97.0%     ║ PASS   ║
║ invoice_number   ║ 99.1%    ║ 99.1%     ║ 98.0%     ║ PASS   ║
║ total_amount     ║ 93.2%    ║ 96.5%     ║ 95.0%     ║ FAIL   ║
║ line_items       ║ 91.7%    ║ 92.0%     ║ 90.0%     ║ PASS   ║
║ tax_amount       ║ 88.4%    ║ 94.1%     ║ 93.0%     ║ FAIL   ║
╚══════════════════╩══════════╩═══════════╩═══════════╩════════╝

PIPELINE BLOCKED: 2 field(s) below threshold.
  - total_amount: 93.2% (threshold 95.0%, dropped 3.3pp from baseline)
  - tax_amount: 88.4% (threshold 93.0%, dropped 5.7pp from baseline)

This fail-fast output tells the developer which fields regressed, the magnitude of the drop in percentage points, and exactly where the threshold sits. No guessing, no log diving.

Stakeholder Reporting

Engineers read pytest output. Product managers and operations leads do not. Generate a human-readable accuracy report as a CI artifact, formatted in Markdown or HTML, that non-technical stakeholders can open directly from the build summary.

A minimal example:

### Extraction Accuracy Report — 2026-03-28

**Overall:** 4 of 6 fields stable or improving. 2 degraded.

| Field        | Accuracy | vs. Baseline | Trend      |
|--------------|----------|--------------|------------|
| vendor_name  | 97.8%    | -0.4pp       | Stable     |
| total_amount | 93.2%    | -3.3pp       | Degrading  |
| tax_amount   | 88.4%    | -5.7pp       | Degrading  |

**Dataset:** 87 invoices (62 real, 25 synthetic).

Upload this as a build artifact so it is accessible from the GitHub Actions run summary without cloning the repository or reading raw logs.

Scheduled Runs for Drift Detection

PR-triggered tests catch regressions caused by code changes in your repository. They will not catch accuracy drift caused by factors outside your codebase: upstream model updates from your extraction provider, infrastructure changes, or subtle shifts in the documents your customers submit.

Schedule your full regression suite to run weekly (or monthly, depending on your tolerance for detection latency). When a scheduled run fails its accuracy gates, it surfaces degradation that no recent commit caused. This is your early warning system. Pipe these failures into your alerting channel (Slack, PagerDuty, email) so the team investigates before customers notice.

The weekly cron trigger in the workflow above handles this. For higher-stakes pipelines processing financial documents, consider running it more frequently or adding a secondary schedule tied to your extraction provider's release notes.