Extraction pipelines turn unstructured documents into structured data. Your invoice goes in as a PDF; line items, totals, tax IDs, and payment terms come out as JSON. The failure mode that matters is not whether your code runs without exceptions. It is whether the extracted values are correct.
This is what makes extraction pipeline testing fundamentally different from standard software testing. A unit test confirms your function returns the expected type. An integration test confirms your services communicate. Neither tells you whether your model read "€14,320.00" instead of "€1,432.00" on a German invoice with ambiguous formatting. That error passes every type check and schema validation you have. It enters your ERP, distorts your payables, and nobody catches it until reconciliation, if then.
The problem compounds at scale. A misread vendor tax ID does not throw an error. A transposed line item quantity silently propagates through purchase orders and financial reports. Poor software quality, including data processing errors that cascade through financial systems, cost the U.S. economy an estimated $2.41 trillion in 2022, with technical debt accounting for roughly $1.52 trillion of that figure. Extraction errors are a particularly insidious contributor because they corrupt data at the point of entry, before any downstream logic has a chance to validate it.
Testing an invoice extraction pipeline requires three core components:
- A ground-truth dataset of annotated invoices. You need real and synthetic invoices where every field has a verified correct value. This is your source of truth for measuring whether extraction output is right or wrong.
- Field-level accuracy metrics. Binary pass/fail is not enough. A date field that returns "2025-03-05" instead of "2025-05-03" (a day/month transposition, both valid ISO dates) is a different kind of error than one that returns null. You need metrics that distinguish exact matches, partial matches, normalization differences, and outright misreads, broken down by field type.
- Automated regression checks in CI. Accuracy must be measured on every code change, model update, or provider swap. A regression test suite that runs in your pipeline catches accuracy drops before they reach production, the same way unit tests catch code regressions.
Together, these three components cover the full testing lifecycle for an invoice extraction pipeline. The sections that follow cover each one in depth: building and curating ground-truth datasets, generating synthetic invoices for edge cases, selecting the right accuracy metrics for different field types, designing regression tests with meaningful thresholds, and wiring everything into CI with clear pass/fail reporting.
Building a Ground-Truth Invoice Dataset
Every extraction test needs an expected output to compare against. In extraction testing, that expected output is your ground-truth invoice dataset: a collection of real invoice documents paired with structured files containing the known-correct extracted values for every field. When your pipeline extracts data from invoice #4471, the ground truth tells your test suite exactly what the correct invoice number, date, line items, and totals should be. Without this foundation, accuracy metrics are meaningless and regression tests have nothing to assert against.
Designing the Ground-Truth Schema
Your ground-truth schema should mirror the structure your extraction pipeline actually produces. If your pipeline outputs flat JSON with invoice-level fields and a nested line-items array, your ground truth should follow the same shape. This makes assertion logic straightforward: compare expected versus actual, field by field.
A practical schema covers both invoice-level fields (invoice number, invoice date, due date, vendor name, total amount, currency, tax amount) and line-item fields (description, quantity, unit price, line total). Here is a minimal example:
```json
{
  "source_file": "invoices/acme-2024-0472.pdf",
  "annotations": {
    "invoice_number": "INV-0472",
    "invoice_date": "2024-03-15",
    "due_date": "2024-04-14",
    "vendor_name": "Acme Industrial Supply",
    "currency": "USD",
    "tax_amount": 47.25,
    "total_amount": 547.25,
    "line_items": [
      {
        "description": "Steel bracket, 4-inch",
        "quantity": 50,
        "unit_price": 8.50,
        "line_total": 425.00
      },
      {
        "description": "Shipping & handling",
        "quantity": 1,
        "unit_price": 75.00,
        "line_total": 75.00
      }
    ]
  }
}
```
Two details matter here. First, normalize your expected values consistently. Dates should use a single format (ISO 8601). Monetary amounts should be numbers, not locale-formatted strings. Vendor names should follow a canonical spelling you decide on up front. Second, define how you represent missing fields. If an invoice has no due date, your schema should use null rather than omitting the key entirely. Omitted keys and null keys behave differently in assertion logic, and conflating them will produce false positives.
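The null-versus-omitted distinction is easy to enforce with a small comparison helper. A minimal sketch (the field names follow the schema above; the tri-state return values are illustrative, not a prescribed API):

```python
def compare_field(expected: dict, actual: dict, key: str):
    """Distinguish 'field present but null' from 'field missing entirely'."""
    if key not in expected:
        raise KeyError(f"ground truth is missing key: {key}")
    if key not in actual:
        # Omitted key: the pipeline never emitted the field at all.
        return ("missing_key", None)
    if expected[key] is None and actual[key] is None:
        # Both agree the field is absent on the document.
        return ("match", None)
    if expected[key] == actual[key]:
        return ("match", actual[key])
    return ("mismatch", actual[key])

# An invoice with no due date: null in ground truth and null in output is a
# match, but output that drops the key entirely is a schema violation.
```

Treating `missing_key` as its own outcome keeps schema drift visible instead of letting it masquerade as a correct null.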
Selecting and Annotating Your Initial Set
Start with 20 to 50 invoices that represent the real variation your pipeline handles. You are building a test fixture, not a training set, so coverage of variation matters more than volume. Select invoices that span:
- Multiple vendors with different layouts and naming conventions
- Different currencies and number formats (comma vs. period decimal separators)
- Single-page and multi-page documents
- Invoices with few line items and invoices with dozens
- Scanned documents versus born-digital PDFs, if your pipeline handles both
Manual ground truth annotation is unavoidable for this initial set. Open each invoice, read every field, and record the correct value in your structured template. This is tedious. It is also the only way to get a reliable baseline.
You will encounter ambiguous cases during annotation. The vendor name on the invoice says "ACME IND. SUPPLY LLC" but your system normalizes it to "Acme Industrial Supply." A date field reads "03/15/24" and could be interpreted differently across locales. Whitespace and special characters in description fields vary. For each ambiguity, make a decision and document it in a conventions file alongside your dataset. "Vendor names use title case, spelled out in full, without legal suffixes." "Dates normalize to YYYY-MM-DD." These conventions become part of your test contract. Without them, half your test failures will be annotation inconsistencies rather than real extraction errors.
Annotation Workflow That Scales
Use a structured template from the start. JSON files following your schema, or a CSV with one row per field per invoice. Freeform notes in a spreadsheet will not survive contact with your assertion code.
For reliability, have two people independently annotate a subset of at least 10 invoices. Compare their outputs. Disagreements reveal genuinely ambiguous fields where your conventions need tightening, and they catch transcription errors that a single annotator would miss. An inter-annotator agreement rate below 95% on a field means your annotation conventions for that field are underspecified.
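Computing that agreement rate takes only a few lines. A sketch, assuming both annotators produced flat dicts following the same schema (the sample values are illustrative):

```python
def agreement_rate(annotations_a: list[dict], annotations_b: list[dict]) -> dict:
    """Per-field agreement between two annotators over the same invoices."""
    matches, totals = {}, {}
    for a, b in zip(annotations_a, annotations_b):
        for field in a:
            totals[field] = totals.get(field, 0) + 1
            if a.get(field) == b.get(field):
                matches[field] = matches.get(field, 0) + 1
    return {f: matches.get(f, 0) / totals[f] for f in totals}

a = [{"invoice_number": "INV-1", "vendor_name": "Acme Industrial Supply"}]
b = [{"invoice_number": "INV-1", "vendor_name": "ACME IND. SUPPLY LLC"}]
rates = agreement_rate(a, b)
# invoice_number agrees; vendor_name does not, which signals that the
# vendor-name convention needs tightening before annotating the full set.
```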
Store ground-truth files alongside the source invoice documents in version control. A clean directory structure looks like:
```
test-data/
  invoices/
    acme-2024-0472.pdf
    globex-2024-1100.pdf
  ground-truth/
    acme-2024-0472.json
    globex-2024-1100.json
  annotation-conventions.md
```
This keeps the dataset versioned with your test code. When someone checks out a commit from three months ago, they get the exact ground truth that was valid at that point, and test results remain reproducible.
Maintaining a Living Dataset
Your ground-truth invoice dataset is not a one-time artifact. It evolves with your pipeline.
When a new edge case surfaces in production (a vendor using a table layout your tests never covered, an invoice in a new currency), add it to the dataset with its annotation. When your extraction schema changes, perhaps you start extracting purchase order numbers or payment terms, update the ground-truth schema to match and backfill annotations for existing invoices where applicable. Tag or version your dataset explicitly so you can tie any test run to the exact ground-truth version it used.
A dataset that stagnates while your pipeline evolves will silently lose coverage. The invoices that cause production failures will be exactly the ones your ground truth never included.
Generating Synthetic Test Invoices for Edge Cases
Real invoices reflect your current vendor mix, which means they cover the formats, currencies, and layouts you already process successfully. The failure modes that cause production incidents are, by definition, the ones your real data underrepresents. Synthetic invoice test data fills this gap by letting you programmatically generate controlled documents where every value is known, so you can systematically probe the exact conditions your pipeline handles worst.
The key advantage of synthetic data generation is that every document is self-annotating. Because you define the generation parameters (vendor name, line item amounts, tax rates, currency codes), those parameters are the ground truth. There is no manual annotation step for synthetic invoices. You generate the PDF and the expected extraction output in a single pass, which means you can produce hundreds of edge-case documents in the time it would take to annotate ten real ones.
Two Approaches to Generation
Template-based generation gives you full control over layout and content. The workflow is straightforward:
- Build a set of HTML/CSS invoice templates with placeholder fields for vendor name, address, line items, totals, tax breakdown, and currency symbols.
- Populate those placeholders using a data generator library like Faker (available in Python, JavaScript, and most other languages) to produce realistic but controlled values. Generate names, addresses, tax IDs, and amounts that follow the formatting conventions of your target locales.
- Render the populated HTML to PDF using a headless browser (Puppeteer, Playwright) or wkhtmltopdf.
- Store the generation parameters as a JSON sidecar file, which becomes your ground-truth annotation.
Template-based generation scales well and lets you create visually distinct invoice layouts that mirror the variation you see across real vendors. Build five to ten base templates with different header positions, table structures, and footer layouts to avoid testing against a single format.
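The populate-and-sidecar steps can be sketched without any rendering dependencies. This example substitutes a seeded random generator and a hardcoded value pool for Faker, and a plain format string for a real template; the vendor names and field set are illustrative. Rendering the HTML to PDF (Puppeteer, Playwright, wkhtmltopdf) happens as a separate step:

```python
import json
import random

def generate_invoice(template: str, rng: random.Random) -> tuple[str, dict]:
    """Fill an HTML template and return (html, ground-truth sidecar).

    The same parameter dict drives both outputs, which is what makes
    synthetic invoices self-annotating."""
    params = {
        "vendor_name": rng.choice(["Acme Industrial Supply", "Globex GmbH"]),
        "invoice_number": f"INV-{rng.randint(1000, 9999)}",
        "currency": rng.choice(["USD", "EUR", "GBP"]),
        "total_amount": round(rng.uniform(50, 5000), 2),
    }
    html = template.format(**params)
    sidecar = {"annotations": params}  # ground truth, no manual annotation
    return html, sidecar

template = "<h1>{vendor_name}</h1><p>{invoice_number}: {currency} {total_amount}</p>"
html, truth = generate_invoice(template, random.Random(42))
# Render `html` to PDF and write `truth` as the JSON sidecar next to it.
```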
Dedicated synthetic document generation tools take this further by producing both the rendered document and structured annotations as a single output. These tools handle layout randomization, font variation, and format diversity automatically, which reduces the engineering effort needed to maintain a template library. If your team does not need pixel-level control over invoice appearance, a dedicated generator can get you to a working synthetic test suite faster.
Edge Case Categories Your Suite Must Cover
Structure your synthetic data generation around these categories, each targeting a specific class of extraction failure:
- Multi-currency invoices. Generate invoices in USD, EUR, and GBP at minimum, but also include currencies with non-standard symbols and formatting (Japanese yen with no decimal places, Swiss francs with apostrophe thousand separators). Currency parsing errors compound across every line item.
- Multi-page invoices. Create documents where the line item table spans two or three pages. Many extraction pipelines lose rows at page boundaries or duplicate header rows as line items. Test with 20, 50, and 100+ line items to find the threshold where extraction completeness degrades.
- Credit notes. Generate documents with negative totals, negative line item amounts, and a document type label of "Credit Note" rather than "Invoice." These test both value extraction (does the pipeline preserve the negative sign?) and document classification (does it correctly identify the document type?).
- Mixed-format batches. Produce the same invoice content as both a native PDF and a scanned image (JPEG, TIFF, PNG) to verify your pipeline handles format routing correctly when processing mixed batches.
- Low-quality scans. Take your generated PDFs and degrade them programmatically: reduce resolution to 150 DPI or lower, add Gaussian noise, apply slight rotation (1 to 3 degrees), and introduce compression artifacts. This simulates the quality variation in real scanned documents without requiring a physical scanner.
- High line-item counts. Invoices with 50+ line items stress extraction completeness. Generate documents with exactly known item counts so you can verify that every row was captured. This is where you discover whether your pipeline truncates results or silently drops items beyond a certain count.
- Right-to-left and non-Latin scripts. If your product serves international markets, generate invoices with Arabic or Hebrew text to test bidirectional text handling. Even if these are rare today, a single mishandled document can block an entire accounts payable workflow.
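The low-quality-scan degradation can be scripted with an imaging library. A sketch using Pillow, assuming your generated pages have already been rasterized to images; the specific parameters (half-resolution round trip, ±3° rotation, 1-pixel blur standing in for noise and compression artifacts) are starting points to tune against your real scan quality:

```python
import random
from PIL import Image, ImageFilter

def degrade_scan(img: Image.Image, rng: random.Random) -> Image.Image:
    """Simulate a low-quality scan: downsample, slight rotation, blur."""
    w, h = img.size
    # Downsample and upsample to discard fine detail, as a 150 DPI scan would.
    img = img.resize((w // 2, h // 2)).resize((w, h))
    # Slight skew, as from a misaligned scanner feed.
    img = img.rotate(rng.uniform(-3, 3), expand=False, fillcolor="white")
    # Gaussian blur stands in for optical softness and compression artifacts.
    return img.filter(ImageFilter.GaussianBlur(radius=1))

page = Image.new("RGB", (800, 1000), "white")
degraded = degrade_scan(page, random.Random(7))
```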
Balancing Real and Synthetic Data
A practical target composition is 60 to 70% real annotated invoices for production representativeness and 30 to 40% synthetic invoices for systematic edge case coverage. The real portion ensures your tests reflect actual vendor formats and quality levels. The synthetic portion ensures you are testing failure modes that production data has not yet surfaced. Adjust the ratio based on where your pipeline actually fails in production.
Choosing the Right Accuracy Metrics for Each Field Type
A binary "did it match?" check will fail you almost immediately. Invoice data spans structured identifiers, free-text fields, monetary values, and dates, each with different legitimate variation. A vendor name extracted as "Acme Corp." versus ground truth "Acme Corp" is not an error. A total of $1,234.56 versus $1,234.57 due to floating-point rounding is not a regression. Your test harness needs matching strategies that reflect how each field type actually behaves.
The practical framework breaks down into three matching strategies, applied per field type.
Exact Match: Identifiers and Codes
Fields with canonical formats that permit zero variation belong here: invoice numbers, PO numbers, currency codes, tax identification numbers. The extracted value must be character-identical to ground truth after normalization (trim whitespace, standardize case). An invoice number of "INV-2024-0881" either matches or it does not. There is no partial credit.
Normalization matters. Before comparing, strip leading and trailing whitespace, collapse internal whitespace to single spaces, and uppercase both values for case-insensitive fields like currency codes. This prevents your tests from flagging "USD " versus "USD" as a failure while still catching genuine extraction errors.
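That normalize-then-compare rule fits in a few lines. A sketch, applying exactly the normalization just described:

```python
def exact_match(extracted: str, expected: str, case_insensitive: bool = False) -> bool:
    """Character-identical comparison after whitespace normalization."""
    def norm(s: str) -> str:
        return " ".join(s.split())  # trim ends, collapse internal whitespace
    a, b = norm(extracted), norm(expected)
    if case_insensitive:
        a, b = a.upper(), b.upper()
    return a == b

exact_match("USD ", "usd", case_insensitive=True)  # normalization noise: pass
exact_match("INV-2024-0881", "INV-2024-0818")      # transposed digits: fail
```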
Fuzzy Match: Text Fields
Vendor names, addresses, and line item descriptions rarely extract with perfect character fidelity. OCR artifacts, minor formatting differences, and abbreviation inconsistencies are normal. Penalizing these as failures creates noise that drowns out real problems.
Levenshtein distance (edit distance) gives you a tunable threshold for acceptable variation. Set a similarity ratio above 0.95 or a maximum edit distance of 2 characters for short fields. "Müller & Associates" matching against "Muller & Associates" should pass. "Müller & Associates" matching against "Miller Associates" should fail. The threshold you choose depends on your pipeline's typical OCR quality and the field lengths you encounter. Start conservative (0.95 similarity) and adjust as you accumulate test data showing where legitimate variation actually lands.
For a deeper treatment of how OCR quality affects these thresholds, see our guide on understanding invoice OCR accuracy metrics and benchmarks.
Tolerance-Based Numerical Matching: Amounts and Quantities
Monetary amounts and quantities are vulnerable to floating-point representation differences, rounding during currency conversion, and minor OCR misreads of decimal separators. A rigid exact match on these fields generates false failures.
Set a tolerance window appropriate to the field:
- Currency fields: plus or minus $0.01 (one cent). A total of $1,234.56 matching against ground truth of $1,234.57 is a pass.
- Quantities: plus or minus 0.5%. A quantity of 1,000 matching against 1,002 within tolerance is a pass.
- Tax percentages: plus or minus 0.1 percentage points.
If a value falls outside tolerance, that is a genuine extraction failure worth investigating.
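A single helper can express all three windows. A sketch (the tiny epsilon guards against float representation exactly at the boundary; the tolerances are the ones listed above):

```python
def tolerance_match(extracted: float, expected: float,
                    abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Pass when the value falls within an absolute or relative window."""
    window = max(abs_tol, rel_tol * abs(expected))
    # 1e-9 guard: a one-cent difference stored as doubles can land a hair
    # above 0.01 and should still count as within tolerance.
    return abs(extracted - expected) <= window + 1e-9

tolerance_match(1234.57, 1234.56, abs_tol=0.01)  # currency, ±one cent: pass
tolerance_match(1002, 1000, rel_tol=0.005)       # quantity, ±0.5%: pass
tolerance_match(1234.99, 1234.56, abs_tol=0.01)  # genuine misread: fail
```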
Date Matching: The Format Trap
Dates are a common and frustrating failure mode. Invoices represent dates as "03/28/2026", "28.03.2026", "March 28, 2026", "2026-03-28", and dozens of other formats. Your test will report false failures constantly if it compares raw strings.
Normalize both extracted and ground-truth dates to ISO 8601 (YYYY-MM-DD) before comparison. Your comparison function should parse both values into a date object, then compare the date objects. This eliminates format variation entirely and lets you focus on whether the pipeline extracted the correct date, not whether it formatted it the way your ground truth expected.
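A minimal parse-then-compare sketch using the standard library. The format list is an assumption to extend as new vendors appear; note that genuinely ambiguous orderings like "03/04/2026" cannot be resolved by parsing alone and must be settled by your annotation conventions:

```python
from datetime import date, datetime

DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y", "%B %d, %Y"]

def parse_invoice_date(raw: str) -> date:
    """Try each known format; raise when none applies."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def dates_match(extracted: str, expected: str) -> bool:
    # Compare date objects, never raw strings.
    return parse_invoice_date(extracted) == parse_invoice_date(expected)

dates_match("03/28/2026", "2026-03-28")  # formats differ, date agrees: pass
```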
Per-Field Accuracy Reporting
Calculate accuracy (matches divided by total test cases) for each field independently. Aggregate accuracy across all fields masks critical weaknesses. A pipeline might extract dates at 99% accuracy but vendor names at 85%. An aggregate score of 94% looks acceptable. The vendor name problem stays hidden until it causes downstream failures in your accounts payable workflow.
Report per-field accuracy in every test run. Track it over time. This is what reveals exactly where to invest improvement effort and what catches regressions in specific field types that aggregate metrics would smooth over.
Confusion Matrices for Classification Fields
Some extraction tasks involve classification: identifying a document as an invoice versus a credit note, extracting the correct currency code from context, or categorizing line items. For these fields, a single accuracy number tells you that errors happen but not which errors.
A confusion matrix shows which values are being confused with which others. If your pipeline consistently extracts "EUR" when the actual currency is "CHF", that is a systematic error with a specific cause (likely visual similarity in the source documents or a training data gap). A confusion matrix surfaces this pattern immediately, while an accuracy drop from 98% to 95% on currency codes tells you almost nothing actionable.
Centralizing the Logic
Implement a match(field_type, extracted, expected) function that dispatches to the correct matching strategy based on field type. Map each field in your schema to one of the three strategies:
- invoice_number → exact match
- vendor_name → fuzzy match (threshold 0.95)
- total_amount → tolerance match (±$0.01)
- invoice_date → date normalization then exact match
- currency_code → exact match (feed mismatches into a confusion matrix)
This centralizes your matching logic in one place. When you discover that a 0.95 fuzzy threshold is too strict for address fields, you adjust one threshold in one function rather than hunting through scattered test assertions. As you learn where your pipeline's boundaries are, the match function evolves with your understanding.
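A minimal sketch of that dispatcher with the strategies inlined. The thresholds and field names mirror the examples above; the date branch assumes ISO-normalized inputs, and a production version would delegate to fuller matchers rather than inline them:

```python
from datetime import datetime
from difflib import SequenceMatcher

def match(field_type: str, extracted, expected) -> bool:
    """Dispatch to the matching strategy mapped to this field type."""
    if field_type in ("invoice_number", "currency_code"):
        # Exact match after trim + uppercase normalization.
        return str(extracted).strip().upper() == str(expected).strip().upper()
    if field_type == "vendor_name":
        # Fuzzy match; threshold calibrated against your OCR quality.
        return SequenceMatcher(None, extracted, expected).ratio() >= 0.9
    if field_type == "total_amount":
        # Tolerance match, ±one cent, with a float-boundary guard.
        return abs(float(extracted) - float(expected)) <= 0.01 + 1e-9
    if field_type == "invoice_date":
        parse = lambda s: datetime.strptime(s, "%Y-%m-%d").date()
        return parse(extracted) == parse(expected)
    raise ValueError(f"no matching strategy for field type: {field_type}")
```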
Line items require an additional alignment step before field matching. When an invoice has 12 line items and the pipeline returns 11, or returns them in a different order, your test harness needs to decide which extracted row maps to which ground-truth row. Match rows using a composite key (description similarity combined with amount) and compare order-independently to avoid false failures from reordered rows. A missing row should count as a miss for every field in that row.
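The alignment step can be sketched as a greedy, order-independent search over a composite key. The 0.8 cutoff and the 0.5 amount bonus are illustrative starting points, not tuned values:

```python
from difflib import SequenceMatcher

def align_line_items(extracted: list[dict], expected: list[dict]):
    """Greedily pair each ground-truth row with its best extracted row."""
    pairs, unmatched = [], list(range(len(extracted)))
    for exp in expected:
        best, best_score = None, 0.0
        for i in unmatched:
            got = extracted[i]
            sim = SequenceMatcher(None, got["description"],
                                  exp["description"]).ratio()
            amount_ok = abs(got["line_total"] - exp["line_total"]) <= 0.01 + 1e-9
            score = sim + (0.5 if amount_ok else 0.0)  # composite key
            if score > best_score:
                best, best_score = i, score
        if best is not None and best_score >= 0.8:
            pairs.append((extracted[best], exp))
            unmatched.remove(best)
        else:
            pairs.append((None, exp))  # missing row: a miss for every field
    return pairs
```

Because each ground-truth row takes its best remaining candidate, reordered rows align correctly and a dropped row surfaces as an explicit `None` rather than cascading mismatches down the table.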
Designing Regression Tests That Catch Accuracy Drops
Regression testing for extraction means running the same set of invoices through your pipeline on every change, comparing results against ground truth, and failing the build when per-field accuracy drops below a defined threshold. The goal is to catch the moment accuracy degrades, not to prove the system works. Provider swaps, model updates, preprocessing changes, prompt edits: any of these can silently degrade accuracy on field types you are not manually checking. A regression suite makes that degradation visible and blocking.
Test Harness Architecture
Your test harness follows a four-step loop for each invoice in the test set:
- Send the test invoice through the extraction pipeline. If you are testing against the invoice extraction API, the Python SDK's extract() method handles upload, submission, polling, and download in a single call. Point it at your test invoice directory, pass your prompt configuration, and request JSON output.
- Parse the structured output. JSON is the most test-friendly format because it maps directly to your ground-truth data structures without any format conversion. Each extraction result includes page-level tracking through pages.successful and pages.failed arrays, so your harness knows exactly which pages returned data and which did not.
- Apply field-level matching logic. For each extracted field, compare the value against the corresponding ground-truth entry using the metric appropriate to that field type (exact match, normalized match, numeric tolerance, or fuzzy string similarity from the previous section).
- Aggregate per-field accuracy scores. Calculate the percentage of test invoices where each field was correctly extracted. This gives you a single accuracy number per field across your entire test set.
Handle partial-success scenarios explicitly. If the API reports that 3 out of 4 pages in a multi-page invoice extracted successfully but one failed, your test should count the failed page as a miss for any fields expected from that page rather than silently skipping it. The API response also includes an ai_uncertainty_notes array that flags assumptions the extraction model made (such as inferring a currency from context or disambiguating similar vendor names). Log these notes alongside test results. They will not trigger a test failure on their own, but they identify invoices where extraction confidence is lower and accuracy is more likely to shift between runs.
The Baseline-and-Threshold Pattern
After your first full test run, record per-field accuracy as your baseline. Store it in a version-controlled JSON file:
```json
{
  "baseline_date": "2026-03-28",
  "extraction_provider": "invoicedataextraction",
  "prompt_version": "v3",
  "sample_size": 250,
  "field_accuracy": {
    "invoice_number": 98.4,
    "invoice_date": 97.2,
    "total_amount": 99.2,
    "currency": 96.8,
    "vendor_name": 95.6,
    "line_item_description": 91.2,
    "line_item_amount": 94.8
  },
  "thresholds": {
    "invoice_number": 96.5,
    "invoice_date": 95.5,
    "total_amount": 97.5,
    "currency": 95.0,
    "vendor_name": 93.5,
    "line_item_description": 89.0,
    "line_item_amount": 93.0
  }
}
```
Set each threshold 1 to 2 percentage points below the baseline. This tolerance absorbs minor non-deterministic variation in extraction output without masking real regressions. Future test runs load this file, run the full suite, and fail if any field drops below its threshold. On small test sets (under 100 invoices), a 2-percentage-point drop could be within normal sampling variance rather than a real regression. If you are running fewer than 100 invoices, use a wider threshold buffer (3 to 4 percentage points) until you can increase your dataset size.
The baseline file updates only through deliberate commits. When you intentionally improve extraction (better prompts, a new provider, additional preprocessing), run the suite, verify the improvement, and commit the updated baseline. This creates an auditable history of accuracy changes tied to specific code changes in your version control log.
Structuring Tests with pytest
The examples below use Python and pytest, but the architecture (parameterized test cases, baseline comparison, structured reporting) applies to any test framework and language.
Organize your extraction regression tests as parameterized fixtures where each test case is an invoice file paired with its ground-truth file:
```python
import json

import pytest

from invoicedataextraction import InvoiceDataExtraction

BASELINE_PATH = "tests/extraction/baseline.json"
TEST_DATA_DIR = "tests/extraction/invoices/"
GROUND_TRUTH_DIR = "tests/extraction/ground_truth/"

@pytest.fixture(scope="session")
def baseline():
    with open(BASELINE_PATH) as f:
        return json.load(f)

@pytest.fixture(scope="session")
def extraction_results():
    # load_prompt_config and parse_extraction_output are your own helpers:
    # one loads the versioned prompt artifact, the other reads the
    # downloaded JSON results back into Python dicts.
    client = InvoiceDataExtraction(api_key="your-test-key")
    client.extract(
        folder_path=TEST_DATA_DIR,
        prompt=load_prompt_config(),
        output_structure="per_invoice",
        download={"formats": ["json"], "output_path": "tests/extraction/output/"},
    )
    return parse_extraction_output("tests/extraction/output/")

def test_field_accuracy_above_threshold(extraction_results, baseline):
    accuracy = compute_per_field_accuracy(extraction_results, GROUND_TRUTH_DIR)
    failures = []
    for field, score in accuracy.items():
        threshold = baseline["thresholds"].get(field)
        if threshold and score < threshold:
            failures.append(f"{field}: {score:.1f}% (threshold: {threshold}%)")
    assert not failures, "Fields below threshold:\n" + "\n".join(failures)
```
Use pytest markers to tag test invoices by category so you can run targeted subsets:
- @pytest.mark.multi_currency for invoices with non-USD amounts
- @pytest.mark.multi_page for invoices spanning multiple pages
- @pytest.mark.credit_note for credit memos and adjustments
- @pytest.mark.low_quality for scanned or degraded documents
This lets you isolate which categories regressed. A 2% overall drop might be a 15% drop concentrated entirely in multi-page invoices, which points directly at where the problem is.
Generate a structured test report in JUnit XML (for CI integration) or JSON (for custom dashboards) that records per-invoice, per-field match results. This granular output feeds into downstream analysis: you can sort by field to find which ones are trending downward, or sort by invoice to identify specific documents that consistently break.
Provider Comparison Testing
When evaluating a new extraction provider or model, run the same regression suite against both solutions. Keep the test dataset and ground truth identical. Compare per-field accuracy side by side:
| Field | Current Provider | Candidate Provider | Delta |
|---|---|---|---|
| invoice_number | 98.4% | 97.8% | -0.6% |
| total_amount | 99.2% | 99.6% | +0.4% |
| vendor_name | 95.6% | 97.1% | +1.5% |
| line_item_description | 91.2% | 88.4% | -2.8% |
This turns a subjective evaluation into an objective, data-driven decision. The candidate might be better on header fields but worse on line items. Without the test suite, you would discover this only after deploying to production and fielding complaints from your accounting team.
If you are already extracting invoice data with Python and the SDK, adding a second provider to the comparison harness is straightforward: swap the extraction call, keep everything else identical.
Prompt Sensitivity Testing
When using natural language prompts for extraction, even small wording changes can shift accuracy in unexpected ways. The extraction API accepts both freeform prompt strings (up to 2,500 characters) and structured field definitions with optional per-field instructions. Either format can introduce subtle accuracy changes when modified.
Include your prompt configuration in the test setup as a versioned artifact. When someone proposes a prompt change, the workflow is:
- Run the regression suite with the current prompt. Record results.
- Update the prompt configuration.
- Run the suite again. Compare field-by-field accuracy.
- If accuracy holds or improves, commit both the new prompt and the updated baseline together.
- If accuracy drops on any field, the test fails. The prompt change is blocked until the regression is resolved.
This pattern prevents a common failure mode: someone tweaks the prompt to fix extraction on one invoice format and unknowingly breaks extraction on three others. The regression suite catches the collateral damage before it reaches production.
Running Extraction Tests in CI and Reporting Results
Until your test suite runs in CI with hard accuracy gates, it remains a suggestion rather than a contract. This section covers how to integrate extraction accuracy tests into GitHub Actions so that no pull request merges and no release ships when field-level accuracy drops below your defined thresholds.
CI Pipeline Structure
Your extraction test stage needs four discrete steps, executed in order:
- Check out the test dataset and baseline thresholds from version control. Your ground-truth invoices, expected field values, and per-field accuracy thresholds all live in the repository (or a referenced artifact store). This guarantees every pipeline run uses the same benchmark.
- Run the extraction test suite against your live or staging extraction endpoint. Each test invoice is submitted, and the returned fields are compared against ground truth.
- Compare results against baseline thresholds. Per-field accuracy scores are calculated and checked against the minimum thresholds you defined (e.g., vendor_name >= 98%, line_item_total >= 95%).
- Exit with a non-zero code if any field fails its threshold. This blocks the pipeline. No ambiguity, no manual review required.
GitHub Actions Workflow
Define a workflow that triggers in two contexts: on pull requests that touch extraction-related code, and on a scheduled cadence for drift detection.
```yaml
name: Extraction Accuracy Tests

on:
  pull_request:
    paths:
      - 'src/extraction/**'
      - 'prompts/**'
      - 'schemas/**'
  schedule:
    - cron: '0 6 * * 1'  # Weekly Monday 6 AM UTC

jobs:
  extraction-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        config: [default, multi-currency, eu-format]
    steps:
      - uses: actions/checkout@v4

      - name: Cache test dataset
        uses: actions/cache@v4
        with:
          path: tests/fixtures/invoices
          key: test-dataset-${{ hashFiles('tests/fixtures/manifest.json') }}

      - name: Run extraction test suite
        env:
          EXTRACTION_API_KEY: ${{ secrets.EXTRACTION_API_KEY }}
          EXTRACTION_ENDPOINT: ${{ secrets.STAGING_ENDPOINT }}
          TEST_CONFIG: ${{ matrix.config }}
        run: |
          python -m pytest tests/extraction/ \
            --config $TEST_CONFIG \
            --report-output results/$TEST_CONFIG.json \
            --tb=short

      - name: Evaluate accuracy gates
        run: python scripts/check_accuracy_gates.py results/${{ matrix.config }}.json

      - name: Upload accuracy report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: accuracy-report-${{ matrix.config }}
          path: results/
```
Balancing Thoroughness Against Credit Cost
Extraction tests consume API credits, typically one credit per successfully processed page. Running 500 invoices on every pull request will burn through your budget fast and slow down your feedback loop.
Split your test suite into two tiers:
Smoke tests (every pull request): A focused subset of 10 to 20 invoices covering the most critical field types and your highest-risk edge cases. This runs in under a minute and catches obvious regressions before a reviewer even looks at the code.
Full regression suite (scheduled or release-branch only): The complete 50 to 100 invoice dataset with full edge case coverage. Run this on your weekly schedule and on merges to release branches. This is where you catch subtle accuracy shifts across the long tail of invoice formats.
You can control this with a simple environment flag:
```yaml
- name: Run smoke tests (PR)
  if: github.event_name == 'pull_request'
  run: pytest tests/extraction/ -m smoke --report-output results/smoke.json

- name: Run full regression suite (scheduled)
  if: github.event_name == 'schedule'
  run: pytest tests/extraction/ --report-output results/full.json
```
Accuracy Gate Design
When a gate fails, the developer needs to know exactly what broke and by how much. A generic "tests failed" message forces them to dig through logs. Instead, your gate-checking script should output a structured summary table:
```text
╔══════════════════╦══════════╦═══════════╦═══════════╦════════╗
║ Field            ║ Current  ║ Baseline  ║ Threshold ║ Status ║
╠══════════════════╬══════════╬═══════════╬═══════════╬════════╣
║ vendor_name      ║ 97.8%    ║ 98.2%     ║ 97.0%     ║ PASS   ║
║ invoice_number   ║ 99.1%    ║ 99.1%     ║ 98.0%     ║ PASS   ║
║ total_amount     ║ 93.2%    ║ 96.5%     ║ 95.0%     ║ FAIL   ║
║ line_items       ║ 91.7%    ║ 92.0%     ║ 90.0%     ║ PASS   ║
║ tax_amount       ║ 88.4%    ║ 94.1%     ║ 93.0%     ║ FAIL   ║
╚══════════════════╩══════════╩═══════════╩═══════════╩════════╝

PIPELINE BLOCKED: 2 field(s) below threshold.
  - total_amount: 93.2% (threshold 95.0%, dropped 3.3pp from baseline)
  - tax_amount: 88.4% (threshold 93.0%, dropped 5.7pp from baseline)
```
This fail-fast output tells the developer which fields regressed, the magnitude of the drop in percentage points, and exactly where the threshold sits. No guessing, no log diving.
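The gate script itself can be small. Here is a minimal sketch of what `scripts/check_accuracy_gates.py` might do, assuming the results JSON maps field names to accuracy percentages; the thresholds and baselines are hardcoded here for illustration (numbers mirror the table above) but would live in a versioned config file in practice:

```python
import json
import sys

# Illustrative gates: field -> (threshold %, baseline accuracy %).
GATES = {
    "vendor_name":    (97.0, 98.2),
    "invoice_number": (98.0, 99.1),
    "total_amount":   (95.0, 96.5),
    "tax_amount":     (93.0, 94.1),
}

def check_gates(current: dict) -> list[str]:
    """Return one failure message per field whose accuracy is below threshold."""
    failures = []
    for field, (threshold, baseline) in GATES.items():
        accuracy = current.get(field, 0.0)
        if accuracy < threshold:
            failures.append(
                f"- {field}: {accuracy:.1f}% (threshold {threshold:.1f}%, "
                f"dropped {baseline - accuracy:.1f}pp from baseline)"
            )
    return failures

if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as f:
        results = json.load(f)  # e.g. {"total_amount": 93.2, ...}
    failed = check_gates(results)
    if failed:
        print(f"PIPELINE BLOCKED: {len(failed)} field(s) below threshold.")
        print("\n".join(failed))
        sys.exit(1)  # nonzero exit fails the CI step
    print("All accuracy gates passed.")
```

The nonzero exit code is what actually blocks the pipeline; the printed summary is for the developer reading the job log.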
## Stakeholder Reporting
Engineers read pytest output. Product managers and operations leads do not. Generate a human-readable accuracy report as a CI artifact, formatted in Markdown or HTML, that non-technical stakeholders can open directly from the build summary.
Here is an example of what that report looks like:
```markdown
### Extraction Accuracy Report — 2026-03-28

**Overall status:** 4 of 6 fields stable or improving. 2 fields degraded.

| Field          | Accuracy | vs. Baseline | Trend     |
|----------------|----------|--------------|-----------|
| vendor_name    | 97.8%    | -0.4pp       | Stable    |
| invoice_number | 99.1%    | +0.0pp       | Stable    |
| total_amount   | 93.2%    | -3.3pp       | Degrading |
| line_items     | 91.7%    | -0.3pp       | Stable    |
| tax_amount     | 88.4%    | -5.7pp       | Degrading |
| due_date       | 96.0%    | +1.2pp       | Improving |

**Dataset:** 87 invoices (62 real, 25 synthetic).
Edge cases covered: multi-currency (12), EU-format dates (8), handwritten (5).
```
Upload this as a build artifact so it is accessible from the GitHub Actions run summary without cloning the repository or reading raw logs.
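One way to produce that table is a small renderer, sketched here under the assumption that accuracy results arrive as `(current, baseline)` pairs per field; the 1-percentage-point trend cutoff is an illustrative choice, not a standard:

```python
def trend(delta_pp: float) -> str:
    """Label the change vs. baseline; the 1pp cutoff is arbitrary."""
    if delta_pp <= -1.0:
        return "Degrading"
    if delta_pp >= 1.0:
        return "Improving"
    return "Stable"

def render_report(results: dict) -> str:
    """results maps field name -> (current accuracy %, baseline accuracy %)."""
    lines = [
        "| Field | Accuracy | vs. Baseline | Trend |",
        "|-------|----------|--------------|-------|",
    ]
    for field, (current, baseline) in results.items():
        delta = current - baseline
        lines.append(
            f"| {field} | {current:.1f}% | {delta:+.1f}pp | {trend(delta)} |"
        )
    return "\n".join(lines)
```

Write the returned string to a `.md` file in your results directory and it rides along with the existing artifact upload step.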
## Scheduled Runs for Drift Detection
PR-triggered tests catch regressions caused by code changes in your repository. They will not catch accuracy drift caused by factors outside your codebase: upstream model updates from your extraction provider, infrastructure changes, or subtle shifts in the documents your customers submit.
Schedule your full regression suite to run weekly (or monthly, depending on your tolerance for detection latency). When a scheduled run fails its accuracy gates, it surfaces degradation that no recent commit caused. This is your early warning system. Pipe these failures into your alerting channel (Slack, PagerDuty, email) so the team investigates before customers notice.
The weekly cron trigger in the workflow above handles this. For higher-stakes pipelines processing financial documents, consider running it more frequently or adding a secondary schedule tied to your extraction provider's release notes.
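A notification step for scheduled failures might look like the following sketch, assuming a `SLACK_WEBHOOK_URL` repository secret; the payload shape follows Slack's incoming-webhook JSON format:

```yaml
- name: Alert on accuracy gate failure
  if: failure() && github.event_name == 'schedule'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  run: |
    curl -X POST -H 'Content-Type: application/json' \
      -d '{"text":"Scheduled extraction run failed its accuracy gates. Check the accuracy report artifact."}' \
      "$SLACK_WEBHOOK_URL"
```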
## About the author
David Harding
Founder, Invoice Data Extraction
David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.