Pydantic Invoice Extraction in Python: Validate JSON Output

Learn how to validate extracted invoice JSON with Pydantic in Python, from schema design and normalization to business-rule handoff.

Topics: API & Developer Integration, Python, Pydantic, schema validation, JSON output, invoice data contracts

Pydantic invoice extraction means taking extracted invoice JSON and validating it against typed Python models before the data reaches your ETL jobs, databases, or finance workflows. At that boundary you enforce required fields, optional nested line items, date and decimal normalization, and readable validation errors, so raw extraction output does not silently drift into downstream automation.

That boundary matters because an extraction can finish successfully and still leave you with string amounts, nullable dates, missing vendor identifiers, or ambiguous line items. If you pass that JSON straight into application code, the failure often shows up later, inside matching logic, approval workflows, ledger imports, or batch jobs, where the root cause is harder to trace and more expensive to fix. According to 2025 developer survey data on AI output accuracy trust, more developers actively distrusted the accuracy of AI tools (46%) than trusted it (33%), which is exactly why AI-generated invoice output needs a verification layer before it reaches financial logic.

The rest of the workflow is about making that boundary useful: stabilize the extraction output, model invoice and line-item variation, validate and normalize the payload, then run business rules only after the schema contract holds.

Start With Stable Extraction Output, Not Ad Hoc Parsing

If you want dependable Python invoice workflows, the extraction step has one job: produce JSON whose field names and row shape stay stable enough for validation. Ad hoc parsing breaks when one run calls a field "total," another calls it "invoice total," and a third flattens line items differently. Pydantic should sit behind a consistent JSON contract, not behind constantly shifting extraction output.

With Invoice Data Extraction, the current Python SDK requires Python 3.9+ and installs with pip install invoicedataextraction-sdk. The InvoiceDataExtraction client gives you a one-call extract(...) method that handles upload, submission, polling, and download in one step, returning structured output as JSON, CSV, or XLSX. If you are evaluating a programmatic handoff from extraction into validation, the invoice extraction API and Python SDK is the point where that raw JSON contract begins.

The most important choice here is the prompt shape. A string prompt is valid, but when schema validation matters you should prefer a dict prompt with explicit field names and field-level instructions. That keeps your extraction output predictable because the named fields are emitted exactly as specified, instead of relying on the model to invent headers from a natural-language request. For invoice data, that usually means declaring fields like Invoice Number, Invoice Date, Vendor Name, and Total Amount, then adding instructions where ambiguity matters, such as telling the extractor to use the issue date rather than the due date or to return dates in YYYY-MM-DD format. That is what makes downstream validation possible rather than aspirational.
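To make that concrete, a dict prompt for the fields above might look like the sketch below. The exact prompt format the SDK accepts is an assumption here; the field names and instructions mirror the ones discussed in this section.

```python
# A dict prompt sketch: explicit field names plus field-level instructions
# where ambiguity matters. Illustrative, not the SDK's authoritative format.
prompt = {
    "Invoice Number": "The unique invoice identifier, exactly as printed.",
    "Invoice Date": "Use the issue date, not the due date, in YYYY-MM-DD format.",
    "Vendor Name": "The legal name of the issuing supplier.",
    "Total Amount": "The grand total including tax, as a plain number.",
}
```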

You also need to choose the raw output shape up front. The output_structure options are automatic, per_invoice, and per_line_item. Think of these as the shape of the JSON you hand to validation, not as the validation layer itself. With automatic, the AI chooses the structure from your prompt and documents. With per_invoice, you get one object per invoice. With per_line_item, you get one object per extracted line item, with invoice context repeated as needed. Pydantic should validate whichever structure you deliberately requested, rather than trying to clean up an unclear extraction mode after the fact.
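To make the structural difference concrete, here is a minimal sketch of the two explicit shapes. The key names are illustrative, not the SDK's guaranteed output.

```python
# per_invoice: one object per invoice, with line items nested inside it.
per_invoice_record = {
    "invoice_number": "INV-1001",
    "total_amount": "1250.00",
    "line_items": [
        {"description": "Widget", "line_total": "1250.00"},
    ],
}

# per_line_item: one flat object per row, with invoice context repeated.
per_line_item_record = {
    "invoice_number": "INV-1001",   # invoice context repeated on each row
    "description": "Widget",
    "line_total": "1250.00",
}
```

The first shape suits document-level validation; the second suits row-level exports and warehouse loads.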

If your architecture needs more control, the SDK also exposes staged methods such as upload_files(...), submit_extraction(...), wait_for_extraction_to_finish(...), and download_output(...). Either way, do not treat the result as "done" just because you got JSON back. The response includes pages.failed_count, which tells you whether any pages failed processing, and ai_uncertainty_notes, which tells you where the model had to make assumptions. Those two signals matter because stable extraction is not just about field names, it is also about knowing when the raw data is incomplete or ambiguous before it reaches your validation layer.

Design Invoice and Line-Item Models for Missing, Optional, and Nested Data

A good invoice extraction Pydantic model starts by treating the invoice as a data contract, not just a convenient Python container. That contract tells your prompt what to extract, tells your JSON output what shape to follow, and tells downstream code what it can trust. If those three layers drift apart, you end up with brittle parsers, special cases per supplier, and finance logic built on guesses.

For a reusable invoice data model, split fields into three categories:

  • Required: values your pipeline cannot proceed without, such as invoice number, invoice date, vendor name, currency, and total amount, provided your use case truly depends on them.
  • Optional: values that may legitimately be missing on some documents, such as due date, PO number, tax rate, or tax amount.
  • Nullable or ambiguous: values that appear in the source document but remain unreadable, blank, or uncertain after extraction.

That distinction matters because whether a field is optional or required in the schema is not the same question as how important it is to the business. A PO number can be operationally useful yet still optional in the schema, because many invoices do not include one. A currency field might need to be required even if some documents omit it, because your downstream systems should fail fast rather than quietly assume a default.

A compact contract might look like this:

from datetime import date
from decimal import Decimal
from pydantic import BaseModel, Field

class InvoiceLineItem(BaseModel):
    invoice_number: str
    line_index: int
    description: str | None = None
    quantity: Decimal | None = None
    unit_price: Decimal | None = None
    line_tax_amount: Decimal | None = None
    line_total: Decimal | None = None

class InvoiceDocument(BaseModel):
    invoice_number: str
    invoice_date: date
    vendor_name: str
    currency: str
    total_amount: Decimal

    due_date: date | None = None
    po_number: str | None = None
    tax_amount: Decimal | None = None
    net_amount: Decimal | None = None

    document_type: str | None = None
    line_items: list[InvoiceLineItem] = Field(default_factory=list)

This is the conceptual contract. The production validator in the next section hardens the same idea with stricter normalization and error handling. Nested line items matter because invoice-level fields and row-level fields have different lifecycles. Your approval workflow may care about the header total, while spend analysis, SKU reconciliation, or ETL loads care about each row. A dedicated Pydantic line item model lets you validate quantity, unit price, and line totals separately without losing the parent invoice context.

You also need to distinguish absent from present but empty. If the supplier never includes a due date, that field is absent. If the document has a due date label but the value is blank, smudged, or contradictory, the field is present but unresolved. Those cases should not collapse into the same outcome. Your contract can represent both with nullable fields plus separate extraction notes, confidence flags, or status metadata, so downstream code knows whether a value was not provided or could not be trusted.
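A small pre-validation sketch can make the absent versus present-but-empty distinction explicit. The key name here is a hypothetical example, not a fixed part of any contract.

```python
# Classify a raw payload field before validation: "absent" means the
# supplier never provided it, "present_but_unresolved" means the label
# existed but the value is blank or unreadable.
def classify_due_date(raw: dict) -> str:
    if "due_date" not in raw:
        return "absent"
    value = raw["due_date"]
    if value is None or str(value).strip() == "":
        return "present_but_unresolved"
    return "present"
```

Downstream code can then treat "absent" as normal and route "present_but_unresolved" to review instead of collapsing both into None.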

In practice, it is often worth repeating a small set of invoice-level identifiers on each line item, especially invoice_number, and sometimes invoice_date, vendor_name, or currency. That duplication is not bad modeling if it supports row-level exports, warehouse joins, or data frame operations where line items are processed independently from the parent object.

This is also where broader invoice schema design comes in. You do not need a long standards debate in your application code, but you do need a stable field structure that can be shared across prompts, JSON output, validators, and storage. If you want a deeper view of comparing invoice JSON schema options and field structures, start there, then keep your Pydantic models as the executable version of that contract. Pydantic can also emit JSON Schema, which makes the same contract easier to document, test, and reuse across services.

Validate and Normalize Raw Invoice JSON Before It Touches Your App

Once you have extraction output, your next job is not mapping fields into application code. It is establishing a contract. In practice, that means every payload crosses a single Pydantic boundary where raw JSON becomes typed invoice objects, or it fails fast with readable errors. That is the core pattern behind invoice JSON validation in Python: your app never handles loose dictionaries full of date strings, currency fragments, or ambiguous totals.

If your extractor returns a Python dict, pass it to Invoice.model_validate(payload). If it returns a JSON string, use Invoice.model_validate_json(raw_json). Either way, the rest of your system should receive an Invoice object with date values, Decimal amounts, normalized currency codes, and consistent line-item data, not stringly typed finance data.

The handoff should be compact and explicit:

result = client.extract(
    folder_path="./invoices",
    prompt=prompt,
    output_structure="per_invoice",
    download={"formats": ["json"], "output_path": "./output"},
)

if result["pages"]["failed_count"] > 0:
    raise RuntimeError(f"extraction failed for pages: {result['pages']['failed']}")

if result["ai_uncertainty_notes"]:
    queue_uncertain_extraction(result["ai_uncertainty_notes"])

records = load_downloaded_json("./output")
invoices = [Invoice.model_validate(record) for record in records]

That sequence matters more than the helper names: run extract(...), inspect the SDK response for failed pages and uncertainty, then validate the invoice payloads that came back from the JSON output.

from datetime import date, datetime
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP
from typing import Literal

from pydantic import BaseModel, ConfigDict, ValidationError, field_validator, model_validator

TWOPLACES = Decimal("0.01")


def parse_decimal(value: object) -> Decimal | None:
    if value in (None, ""):
        return None

    text = str(value).replace(",", "").strip()

    try:
        return Decimal(text).quantize(TWOPLACES, rounding=ROUND_HALF_UP)
    except InvalidOperation as exc:
        raise ValueError(f"invalid decimal value: {value}") from exc


class LineItem(BaseModel):
    model_config = ConfigDict(extra="forbid")

    description: str
    quantity: Decimal | None = None
    unit_price: Decimal | None = None
    line_total: Decimal

    _normalize_numbers = field_validator(
        "quantity", "unit_price", "line_total", mode="before"
    )(parse_decimal)

    @field_validator("description", mode="before")
    @classmethod
    def clean_description(cls, value: object) -> str:
        text = str(value).strip()
        if not text:
            raise ValueError("description is required")
        return text


class Invoice(BaseModel):
    model_config = ConfigDict(extra="forbid")

    invoice_number: str
    document_type: Literal["invoice", "credit_note"] = "invoice"
    invoice_date: date
    currency_code: str
    net_amount: Decimal
    tax_amount: Decimal | None = None
    total_amount: Decimal
    line_items: list[LineItem]

    @field_validator("invoice_number", mode="before")
    @classmethod
    def clean_invoice_number(cls, value: object) -> str:
        text = str(value).strip()
        if not text:
            raise ValueError("invoice_number is required")
        return text

    @field_validator("invoice_date", mode="before")
    @classmethod
    def normalize_invoice_date(cls, value: object) -> date:
        if isinstance(value, date):
            return value

        text = str(value).strip()
        for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"):
            try:
                return datetime.strptime(text, fmt).date()
            except ValueError:
                continue

        raise ValueError("invoice_date must be a valid date")

    @field_validator("currency_code", mode="before")
    @classmethod
    def normalize_currency_code(cls, value: object) -> str:
        code = str(value).strip().upper()
        if len(code) != 3 or not code.isalpha():
            raise ValueError("currency_code must be a 3-letter ISO code")
        return code

    _normalize_amounts = field_validator(
        "net_amount", "tax_amount", "total_amount", mode="before"
    )(parse_decimal)

    @model_validator(mode="after")
    def normalize_credit_note_amounts(self) -> "Invoice":
        if self.document_type == "credit_note":
            self.net_amount = -abs(self.net_amount)
            self.total_amount = -abs(self.total_amount)

            if self.tax_amount is not None:
                self.tax_amount = -abs(self.tax_amount)

            for item in self.line_items:
                item.line_total = -abs(item.line_total)
                if item.unit_price is not None:
                    item.unit_price = -abs(item.unit_price)

        return self

The important part is not the exact model shape. It is where normalization happens. Dates should become date objects at parse time. Monetary values should become Decimal, not float, with consistent precision. Currency codes should be uppercased and validated. Credit-note amounts should have one predictable sign convention. Line totals should be normalized before anything downstream tries to compare, aggregate, or post them. When you validate extracted invoice JSON this way, your ETL jobs, payment logic, and finance exports can trust the contract.

Be careful with type coercion. It should convert acceptable variations, such as "1,250.00" to Decimal("1250.00") or " usd " to "USD". It should not paper over extraction mistakes. If the model quietly turns "O" into 0, accepts an impossible date, or swallows a malformed total as None, you have not solved reliability, you have hidden the defect. Good schema validation is selective about what it fixes and strict about what it rejects.
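A stdlib-only sketch of that selective stance, with hypothetical helper names, shows the line between fixing formatting and hiding defects:

```python
from decimal import Decimal, InvalidOperation

# Accept known formatting variations; reject anything that looks like an
# extraction defect rather than a formatting quirk.
def coerce_amount(raw: str) -> Decimal:
    text = raw.replace(",", "").strip()
    try:
        return Decimal(text)
    except InvalidOperation as exc:
        # "O" instead of "0", stray letters, empty strings: fail loudly.
        raise ValueError(f"unparseable amount: {raw!r}") from exc

def coerce_currency(raw: str) -> str:
    code = raw.strip().upper()
    if len(code) != 3 or not code.isalpha():
        raise ValueError(f"invalid currency code: {raw!r}")
    return code
```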

When validation fails, make the error payload useful enough for operations and prompt tuning. Pydantic already gives you structured validation errors with field locations and messages, so capture them in a format your pipeline can act on.

def parse_invoice(payload: dict) -> Invoice:
    try:
        return Invoice.model_validate(payload)
    except ValidationError as exc:
        normalized_errors = [
            {
                "field": ".".join(str(part) for part in err["loc"]),
                "message": err["msg"],
                "input": err.get("input"),
            }
            for err in exc.errors()
        ]

        raise ValueError(
            {
                "status": "invoice_validation_failed",
                "errors": normalized_errors,
            }
        ) from exc

At this stage you are still handling schema failures, not approval logic or posting rules. If you want dependable invoice pipelines, this boundary should be boringly strict: raw JSON comes in, typed invoice objects come out, and everything else stops there.

Separate Schema Validation From Business Rules and Failure Handling

Once the payload is structurally valid, a different set of checks begins. Your invoice extraction Pydantic model answers one question first: does this JSON conform to the contract your code expects? It does not tell you whether the invoice should be posted, approved, paid, or trusted for downstream finance actions.

That second layer is business-rule validation. It sits after schema validation and asks workflow questions your model should not try to own: Is the invoice number a duplicate? Does the header total equal the sum of line items plus tax? Does the vendor match your master data? Is the VAT treatment plausible for that supplier and jurisdiction? Does this document violate a cross-document policy, such as duplicate amounts across the same period? A record can pass every Pydantic field check and still fail these finance checks. When that happens, route it as a business exception, not as malformed data.

In production, also verify that the extraction itself finished cleanly before your app treats the batch as validated. In the current Python SDK, the result includes pages.failed_count and pages.failed. If pages.failed_count is greater than zero, some pages were not processed and are not included in the output, even if the returned JSON for the successful pages fits your invoice extraction Pydantic model. That means you have two different failure classes to handle:

  • Schema or validation errors: the returned record shape is wrong for your application contract.
  • Extraction coverage failures: the batch is incomplete because specific files or pages failed and appear in pages.failed.

The same idea applies to ai_uncertainty_notes. Treat those notes as workflow signals, not as ignorable metadata. If the extraction succeeded but the result includes uncertainty notes about ambiguous totals, unclear vendor names, or prompt assumptions, you should either refine the prompt for the next run or send the affected records to human review. That gives you a cleaner operational boundary: Pydantic catches contract problems, while uncertainty notes highlight places where the contract may be satisfied but the extracted meaning still deserves scrutiny.

You should also make an explicit choice about output_structure before business logic runs. Choose per_invoice when the next system cares about one approved record per invoice, such as AP posting, invoice approvals, or payment scheduling. Choose per_line_item when downstream automation depends on row-level detail, such as spend analysis, GL coding, purchase-order matching, or category-level controls. If you let the extractor choose automatically, inspect the returned output_structure and branch accordingly, because invoice-level rules and line-level rules are rarely interchangeable.
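That branching can be sketched as follows. The result key name is an assumption based on the SDK description above, and the rule-set names are placeholders.

```python
# Branch on the structure the extractor actually returned, which matters
# when "automatic" mode was allowed to choose.
def pick_rule_set(result: dict) -> str:
    structure = result.get("output_structure", "per_invoice")
    if structure == "per_invoice":
        return "invoice_level_rules"   # AP posting, approvals, payment scheduling
    if structure == "per_line_item":
        return "line_level_rules"      # spend analysis, GL coding, PO matching
    raise ValueError(f"unexpected output_structure: {structure!r}")
```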

A safe production flow is simple: validate the JSON contract with Pydantic, inspect pages.failed_count, pages.failed, ai_uncertainty_notes, and output_structure, then run business-rule validation only on the records that are structurally valid and fully accounted for. That separation keeps malformed payloads, partial extraction failures, and finance-policy exceptions in different queues, which is exactly what you want when validation errors should not corrupt the rest of your automation.

Test the Contract So Your Extraction Pipeline Stays Safe Over Time

Once your extraction flow works on a few invoices, the next risk is not accuracy in isolation. It is silent breakage. A prompt tweak, a new supplier layout, or a small schema change can turn valid-looking invoice JSON into bad downstream data for ETL pipelines, ledger imports, or approval workflows. That is why Pydantic should stay at the center of your Python invoice schema validation strategy long after the first prototype ships.

Keep a fixture set that reflects the invoices you actually process: clean PDFs, scanned invoices, multi-page files, credit notes, supplier variants, and documents that previously caused extraction issues. For each fixture, store the raw extraction response and the expected validated model output. That gives you a stable contract for invoice data in Python, and it lets you catch regressions when fields move, optional values disappear, or nested line items start arriving in a different shape.

Your tests should cover both success and failure paths. Happy-path cases prove that totals, dates, vendor details, and line items normalize correctly. Failure-path cases prove that your boundary behaves safely when the input is wrong or incomplete, including:

  • Missing required fields such as invoice number or invoice date
  • Malformed decimals like currency strings, commas in the wrong place, or non-numeric totals
  • Unexpected date formats that do not match your accepted normalization rules
  • Failed pages or partial document results that should not be treated as complete invoices
  • Uncertainty notes or ambiguous matches that should be surfaced for review instead of silently accepted
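A failure-path test can be sketched with a small stand-in normalizer so the example stays self-contained; a real suite would exercise your actual Pydantic models against stored fixtures.

```python
from datetime import date, datetime

# Stand-in normalizer mirroring the accepted-formats idea from the
# Invoice model earlier in this article.
def normalize_invoice_date(text: str) -> date:
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {text!r}")

def test_accepts_iso_dates() -> None:
    assert normalize_invoice_date("2024-05-01") == date(2024, 5, 1)

def test_rejects_unknown_formats() -> None:
    try:
        normalize_invoice_date("May 1st 2024")
    except ValueError:
        return
    raise AssertionError("unsupported format was silently accepted")
```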

In practice, that means asserting more than "the model parsed." You want to verify whether the record was accepted, rejected, or routed for manual review, and why. This is also where testing validated invoice extraction pipelines becomes useful as a separate discipline, especially once multiple suppliers and document types feed the same service.

Monitoring should mirror the same contract. Track validation failure rate over time, recurring uncertainty patterns, supplier-specific drift, and which fields fail most often. If one vendor suddenly starts producing date parsing errors or empty tax values, that is usually a signal to update your extraction prompt, revise a field definition, or expand the schema to reflect a real document variation. If you later expose the workflow through an API or scheduled service, the same contract should sit behind it, much like when building a FastAPI invoice extraction endpoint with Python, rather than letting raw model output leak into application logic.
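A minimal sketch of that kind of tracking, consuming the normalized error payloads produced at the validation boundary earlier in this article:

```python
from collections import Counter

# Per-supplier, per-field failure counts. A sudden spike for one
# (supplier, field) pair is the signal to revisit that supplier's prompt
# or schema.
failure_counts: Counter[tuple[str, str]] = Counter()

def record_failures(supplier: str, errors: list[dict]) -> None:
    for err in errors:
        failure_counts[(supplier, err["field"])] += 1
```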

If you are productizing this workflow, the safest order of operations is simple:

  1. Stabilize the extraction fields and output shape first.
  2. Validate every payload at the boundary with Pydantic.
  3. Add business-rule checks after schema validation, not inside it.
  4. Introduce the job runner, ETL step, or service interface only after the contract is stable and well tested.

Your next step should be operational, not theoretical: build a fixture suite from real invoices, define pass or fail expectations for each case, start tracking validation outcomes per supplier, and treat every prompt or schema edit as a contract change that must earn its way into production.

About the author

David Harding

Founder, Invoice Data Extraction

David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.

Editorial process

This page is reviewed as part of Invoice Data Extraction's editorial process.

If this page discusses tax, legal, or regulatory requirements, treat it as general information only and confirm current requirements with official guidance before acting. The updated date shown above is the latest editorial review date for this page.
