Vision LLM invoice extraction with Python means sending invoice images or PDFs directly to a multimodal model, such as GPT-4o, Claude, or Gemini, and asking it to return a constrained JSON object that your application can validate before using. In practice, the useful version of this workflow is not just "model in, JSON out." You pair the model response with Pydantic schemas, totals checks, line-item sanity rules, and fallback handling for low-confidence or malformed outputs. That is the core pattern behind modern Python invoice OCR with LLMs, and it is why this article focuses on a multi-provider implementation playbook rather than a single demo.
The main difference between vision LLMs and OCR for invoices is where the structure comes from. A classic OCR pipeline first turns the document into text, then tries to reconstruct fields, tables, and relationships from that text with rules, templates, or post-processing. A vision-first pipeline lets the model read the page as a document, not just as a character stream, so it can reason about layout, labels, table boundaries, and visual context in one pass. That matters on scanned invoices, rotated pages, vendor-specific layouts, and dense line-item tables. A 2025 invoice-processing benchmark found that native image processing reached 92.71% accuracy on scanned invoices, versus 64.03% for parsed-text pipelines. That gap is a practical reason to start with image-native extraction when your input quality and document variety are unpredictable.
For many teams, vision LLM invoice extraction with Python is the better starting point when vendor formats vary, invoices include complex tables, scans are noisy, or regex and template-specific rules keep breaking as new suppliers arrive. It is especially attractive if you already have Python services and want to orchestrate uploads, schema validation, retries, and downstream accounting workflows in code. Traditional OCR still makes sense when you mostly process clean, text-based PDFs, need fully local processing, or run a highly deterministic vendor-specific pipeline where fixed parsing rules are stable and cheap to maintain. If you want that more classic baseline first, see traditional Python invoice extraction approaches.
## Send Invoice Images and PDFs to GPT-4o, Claude, and Gemini
Across OpenAI, Anthropic, and Google, the multimodal invoice extraction Python pattern is mostly the same:
- Load the invoice file.
- Decide whether to send it as an image or as a native PDF or document input.
- Give the model a tight extraction instruction for invoice fields such as invoice number, invoice date, vendor name, tax, total, and line items.
- Capture the model output as JSON so your pipeline can validate it before anything reaches accounting logic.
A practical prompt looks roughly the same across every provider:

```text
Extract structured invoice data from this document. Return invoice_number, invoice_date, vendor_name, currency, subtotal, tax_amount, total_amount, and line_items. For each line item return description, quantity, unit_price, and line_total. If a field is missing, return null instead of guessing.
```
That shared pattern matters because it lets you swap providers without redesigning your whole pipeline. The transport changes. The extraction goal does not. If you want a JavaScript reference point too, compare this with the Node.js version of this vision-LLM workflow.
### OpenAI: image-first flow with structured outputs
For GPT-4o invoice extraction Python, OpenAI is usually the cleanest option when your invoice intake is already image-heavy or you want to keep visual input and schema-constrained output in one API family. In Python integrations, image input can be passed to the Responses API as an image item using a URL, a Base64 data URL, or a file ID. Structured outputs can then be paired with a Pydantic schema so the response lands closer to your target JSON shape.
```python
from openai import OpenAI

client = OpenAI()

prompt = """
Extract invoice_number, invoice_date, vendor_name, tax_amount, total_amount,
and line_items from this invoice. Return valid JSON only.
"""

response = client.responses.create(
    model="gpt-4o",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": prompt},
                {
                    "type": "input_image",
                    "image_url": "data:image/png;base64,BASE64_INVOICE_IMAGE",
                },
            ],
        }
    ],
)

print(response.output_text)
```
OpenAI is a good fit when your invoice intake is already image-heavy, such as JPG scans from email or phone captures. If you reuse the same file across retries or multiple extraction passes, file IDs are usually cleaner than resending the same Base64 payload on every request.
### Claude: images and PDFs through Messages API content blocks
For Claude invoice extraction Python, the main appeal is native document handling when suppliers send full PDFs rather than cropped images. The Messages API accepts both images and PDFs as explicit content blocks. Images can be supplied as base64, URL, or file reference. PDFs can be sent as document blocks by URL, base64, or file_id, which is useful when vendors send long, multi-page invoices and you want the model to see the document as a document, not as a pile of page screenshots. The exact model snapshot changes over time, so keep the request shape stable and swap in the current Claude Sonnet model ID from Anthropic's docs.
```python
from anthropic import Anthropic

client = Anthropic()

message = client.messages.create(
    model="claude-sonnet-model-id",
    max_tokens=2000,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": "BASE64_PDF",
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Extract invoice_number, invoice_date, vendor_name, "
                        "tax_amount, total_amount, and line_items. "
                        "Return JSON only."
                    ),
                },
            ],
        }
    ],
)

print(message.content)
```
That native document path reduces preprocessing when your pipeline receives PDFs directly from ERP exports, vendor portals, or inbox attachments. It also avoids writing your own page-rasterization layer for every multi-page invoice before you even test extraction quality.
### Gemini: inline image data or uploaded files, plus native PDF handling
For Gemini invoice extraction Python, the practical draw is flexible file handling when you want to mix inline image input with uploaded files and native PDF support. Gemini can take inline image data for image-based workflows or uploaded files when you want cleaner reuse and lower request bloat. It also supports PDFs natively, so you do not always need to convert a document into separate images before extraction. As with Claude, use the current Gemini model ID that fits your latency and accuracy target.
```python
from google import genai
from google.genai import types

client = genai.Client()

prompt = """
Extract invoice_number, invoice_date, vendor_name, tax_amount, total_amount,
and line_items from this invoice. Return JSON only.
"""

with open("invoice.pdf", "rb") as f:
    pdf_bytes = f.read()

response = client.models.generate_content(
    model="gemini-model-id",
    contents=[
        prompt,
        types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
    ],
)

print(response.text)
```
For teams comparing Gemini invoice extraction Python with OpenAI and Claude, the practical question is less about headline capability and more about file handling. If your workload is mostly PDFs, native document support lowers the amount of glue code you have to maintain.
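One way to decide between inline bytes and the Files API is a simple size-and-reuse heuristic. The helper below is an illustrative sketch: the threshold is an assumption to tune (Gemini's inline-data request limit is roughly 20 MB total), and the commented `client.files.upload` usage should be verified against the current google-genai docs.

```python
# Rough threshold; inline bytes are resent on every request, so large or
# frequently reused documents are better served by the Files API.
INLINE_LIMIT_BYTES = 15 * 1024 * 1024


def should_use_files_api(size_bytes: int, expected_passes: int) -> bool:
    """Prefer an uploaded file when the payload is large or will be reused.

    Hypothetical heuristic: repeated passes over the same invoice multiply
    both token spend and request latency when the bytes are inlined.
    """
    return size_bytes > INLINE_LIMIT_BYTES or expected_passes > 1


# When it returns True, upload once and reuse the handle across passes:
# uploaded = client.files.upload(file="invoice.pdf")
# response = client.models.generate_content(
#     model="gemini-model-id", contents=[prompt, uploaded]
# )
```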
Here is the compact comparison that usually matters more than vendor marketing:
| Provider | Best input path | Native PDF handling | Schema-control strength | Best when | Watch out for |
|---|---|---|---|---|---|
| OpenAI | Image URL, Base64 data URL, or file ID | Works well if you already rasterize pages or manage file storage yourself | Strong when you want schema-constrained output close to the API boundary | Image-heavy intake, quick prototypes, unified OpenAI stack | Repeated image retries can bloat requests if you keep resending Base64 |
| Claude | Image blocks or document blocks | Strong fit for native PDF workflows | Good, but you still want local validation for finance fields | Suppliers mostly send full PDFs and you want document-level input | You need careful prompt discipline when multiple totals or dates appear |
| Gemini | Inline image data or uploaded files | Strong native PDF path | Good when you want uploaded-file reuse plus structured-output guidance | Mixed image and PDF workloads, cleaner file reuse | Model and file choices can affect latency more than the headline capability suggests |
### What actually changes your Python pipeline
The differences that matter for invoice work are operational, not conceptual:
- Base64 is convenient but heavy. It is fine for prototypes and low-volume jobs, but repeated Base64 payloads can make requests larger and slower, especially with multi-page documents.
- Files APIs help when you reuse documents. If your workflow retries failed parses, runs a second pass for line items, or compares prompts against the same invoice, uploaded files or file references are usually cleaner than embedding the content every time.
- Native PDF support reduces preprocessing. When a provider accepts PDFs directly, you can often skip page rendering, image stitching, and extra storage steps.
- Image input still matters. If your intake is mostly scans, mobile captures, or cropped page images, direct image submission remains the fastest path to a working prototype.
For implementation planning, think of OpenAI, Claude, and Gemini as three ways to run the same extraction loop: submit the invoice, ask for invoice-specific JSON, and hand the response to a validation layer. The provider choice mainly affects how much document plumbing you need around that loop.
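That shared loop can be sketched as one provider-agnostic function, where `submit` wraps whichever API call you chose above. `run_extraction` is a hypothetical helper, shown here with a single JSON-repair retry:

```python
import json
from typing import Callable


def run_extraction(
    submit: Callable[[bytes, str], str],
    invoice_bytes: bytes,
    prompt: str,
    max_attempts: int = 2,
) -> dict:
    """Submit the invoice, parse the model's text output as JSON, retry once.

    `submit` hides the provider-specific transport; everything downstream
    only sees a parsed dict headed for the validation layer.
    """
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = submit(invoice_bytes, prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            # Tighten the instruction before the retry.
            prompt = prompt + "\nReturn raw JSON only, with no surrounding text."
    raise ValueError(f"no parseable JSON after {max_attempts} attempts: {last_error}")
```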
## Force Structured Invoice JSON with Pydantic Before You Trust the Model
Raw model output is where many invoice demos break down. You might get valid JSON once, then see a currency symbol in one run, a localized date in the next, and invented line-item values when a scan is unclear. If you want dependable structured invoice JSON Python workflows, define the contract first and make every provider target the same schema.
A practical pattern is to put your invoice shape in Pydantic, then use that model as both a prompt constraint and a local validation layer. This is the core of solid pydantic invoice extraction, because your downstream code only has to trust one typed object instead of three provider-specific response formats.
```python
from datetime import date
from decimal import Decimal

from pydantic import BaseModel, Field, field_validator, model_validator


class LineItem(BaseModel):
    description: str = Field(
        description="Product or service description exactly as shown on the invoice."
    )
    quantity: str | None = Field(
        default=None,
        description="Digits only when present, for example 2 or 15.5. No units, no text.",
    )
    unit_price: Decimal | None = Field(
        default=None,
        description="Numeric amount only. No currency symbols or thousands separators.",
    )
    line_total: Decimal | None = Field(
        default=None,
        description="Numeric amount only. No currency symbols. Use the invoice currency.",
    )
    tax_amount: Decimal | None = Field(
        default=None,
        description="Line-level tax amount if explicitly shown, otherwise null.",
    )
    sku: str | None = Field(
        default=None,
        description="Product code or SKU if explicitly visible, otherwise null.",
    )

    @field_validator("quantity")
    @classmethod
    def quantity_should_be_numeric_text(cls, value: str | None) -> str | None:
        if value is None:
            return value
        allowed = set("0123456789.")
        if any(ch not in allowed for ch in value):
            raise ValueError("quantity must contain digits only")
        return value


class InvoiceExtraction(BaseModel):
    invoice_number: str = Field(
        description="Invoice identifier exactly as printed on the document."
    )
    vendor_name: str = Field(
        description="Supplier legal or trading name shown on the invoice."
    )
    invoice_date: str | None = Field(
        default=None,
        description="Invoice date in YYYY-MM-DD format. Null if missing or ambiguous.",
    )
    due_date: str | None = Field(
        default=None,
        description="Payment due date in YYYY-MM-DD format. Null if not shown.",
    )
    currency: str | None = Field(
        default=None,
        description="Three-letter ISO currency code such as USD, EUR, or GBP.",
    )
    subtotal: Decimal | None = Field(
        default=None,
        description="Pre-tax invoice subtotal as numeric value only.",
    )
    tax_amount: Decimal | None = Field(
        default=None,
        description="Total invoice tax as numeric value only.",
    )
    total_amount: Decimal = Field(
        description="Final invoice total as numeric value only, no currency symbol."
    )
    line_items: list[LineItem] = Field(
        default_factory=list,
        description="List each invoice line item separately when visible.",
    )
    review_required: bool = Field(
        default=False,
        description="True when any important field is missing, ambiguous, or inconsistent.",
    )
    review_notes: list[str] = Field(
        default_factory=list,
        description="Short notes explaining ambiguity, missing values, or failed checks.",
    )

    @field_validator("invoice_date", "due_date")
    @classmethod
    def must_use_iso_date(cls, value: str | None) -> str | None:
        if value is None:
            return value
        date.fromisoformat(value)  # raises ValueError on non-ISO dates
        return value

    @model_validator(mode="after")
    def totals_should_make_finance_sense(self):
        if (
            self.subtotal is not None
            and self.tax_amount is not None
            and self.subtotal + self.tax_amount != self.total_amount
        ):
            self.review_required = True
            self.review_notes.append("subtotal + tax_amount does not equal total_amount")
        return self
```
The field descriptions matter more than most teams realize. They are not filler. They tell the model exactly what finance-grade formatting looks like: YYYY-MM-DD dates, digits-only quantities where relevant, numeric fields without currency symbols, separate subtotal, tax, and total fields, plus explicit currency handling. Those instructions reduce drift before validation even runs. If you want a deeper walkthrough of this pattern, Pydantic schema validation for invoice JSON covers the schema-design side in more detail.
Provider support changes how strict you can be at generation time, but it does not remove the need for local checks. OpenAI and Gemini can both work well with schema-constrained outputs, which makes invoice schema validation Python pipelines much cleaner because the model is asked to emit the exact shape you expect. Claude is still useful here, but the safer pattern is to instruct it to return JSON matching your schema, then parse and validate locally with Pydantic.
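When a provider offers no schema-constrained output mode, one hedged option is to render the Pydantic schema into the prompt itself. The sketch below uses a minimal stand-in model (`MiniInvoice`) so it is self-contained; in the real pipeline you would pass `InvoiceExtraction` instead, and `schema_prompt` is a hypothetical helper.

```python
import json

from pydantic import BaseModel, Field


# Minimal stand-in for the sketch; the real pipeline passes InvoiceExtraction.
class MiniInvoice(BaseModel):
    invoice_number: str = Field(description="Identifier exactly as printed.")
    total_amount: str = Field(description="Numeric total, no currency symbol.")


def schema_prompt(model_cls: type[BaseModel]) -> str:
    """Render an instruction that pins the model to the Pydantic schema."""
    schema = json.dumps(model_cls.model_json_schema(), indent=2)
    return (
        "Extract the invoice fields and return JSON that validates against "
        "this JSON Schema. Use null for missing values, never guess:\n"
        f"{schema}"
    )
```

Because the field descriptions travel inside `model_json_schema()`, the formatting rules you wrote once in the schema reach the model without being duplicated in prompt text.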
The missing bridge in many examples is the adapter layer that turns any provider response into one validated Python object:
```python
from typing import Any


def normalize_invoice(payload: str | dict[str, Any]) -> InvoiceExtraction:
    if isinstance(payload, str):
        return InvoiceExtraction.model_validate_json(payload)
    return InvoiceExtraction.model_validate(payload)
```
That small boundary is what makes multi-provider swapping practical. Each API client can return its own raw payload type, but the rest of your application only works with the validated InvoiceExtraction object. It is also the right place to raise validation errors, attach review notes, or route the invoice into a human-review queue before it touches finance systems.
You should also decide up front how to handle uncertainty. For invoice extraction, the safest default is never invent missing values. Use nullable fields when the document does not clearly contain the data. Use defaults only when they reflect an explicit business rule, not a model guess. Add a review flag and short notes when a field is ambiguous, when totals do not reconcile, or when line items appear incomplete. That gives your pipeline a controlled failure mode instead of silently poisoning accounting data.
Once every provider is forced through the same Pydantic contract, swapping models becomes much less painful. Your OpenAI, Gemini, and Claude integration layers can differ, but the object passed into the rest of your Python system stays the same. That is what turns a one-off prototype into a maintainable multi-provider extraction workflow.
## Model Line Items, Multi-Page PDFs, and Totals So the Output Stays Usable
Your first modeling decision is whether to return one invoice object with nested line_items or one row per line item. Nested JSON is readable for short invoices:
```json
{
  "invoice_id": "INV-10452",
  "vendor_name": "Northwind Components",
  "invoice_date": "2026-03-12",
  "currency": "USD",
  "subtotal": 1250.00,
  "tax": 125.00,
  "total": 1375.00,
  "line_items": [
    {
      "line_number": 1,
      "description": "Industrial sensor kit",
      "quantity": 5,
      "unit_price": 200.00,
      "line_total": 1000.00
    },
    {
      "line_number": 2,
      "description": "Calibration service",
      "quantity": 1,
      "unit_price": 250.00,
      "line_total": 250.00
    }
  ]
}
```
Dense invoices often work better as one object per line item with invoice-level fields repeated on every row. In practice, line-item-oriented output is easier to validate, filter, export, and regroup later, as long as you keep a stable invoice identifier such as invoice_id, source_file, and document_group_id. Your downstream code can then rebuild invoice-level views without guessing which rows belong together.
A practical rule is simple:
- Use nested arrays when you mainly need one JSON document per invoice.
- Use flat line-item rows when the output feeds databases, spreadsheets, warehouse tables, or reconciliation jobs.
- Preserve a stable invoice ID in both models so you can move between them without lossy transforms.
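Moving between the two shapes is mechanical as long as the invoice identifier travels with every row. A minimal sketch, assuming the nested shape shown above (`flatten_to_rows` is a hypothetical helper; adjust the header keys to your schema):

```python
def flatten_to_rows(invoice: dict) -> list[dict]:
    """Explode nested line_items into flat rows, repeating invoice-level keys.

    Each row carries the invoice identity so downstream code can regroup
    rows into invoice-level views without guessing.
    """
    header = {
        "invoice_id": invoice.get("invoice_id"),
        "vendor_name": invoice.get("vendor_name"),
        "invoice_date": invoice.get("invoice_date"),
        "currency": invoice.get("currency"),
    }
    return [{**header, **item} for item in invoice.get("line_items", [])]
```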
For multi-page PDF invoice extraction in Python, the biggest mistake is treating each page as its own invoice. Real invoices often place the supplier header on page 1, the long item table across middle pages, and totals on the last page. Your pipeline should preserve page order, maintain one document-level identity across all pages, and merge repeated fields cautiously rather than overwriting them blindly.
A reliable approach for multi-page PDF invoices looks like this:
- Keep the pages in original order and pass page metadata through the extraction result.
- Assign one stable invoice_id or document_group_id to the full PDF before post-processing.
- Merge repeated header fields such as vendor name or invoice number only when the values agree or one value is clearly more complete.
- Treat footer totals, tax summaries, and payment terms as document-level fields, not page-level facts.
- Allow line items to continue across page breaks without assuming a new table starts cleanly on each page.
That matters because tables drift at page boundaries. A long description may wrap onto the next visual line. Quantity and unit price columns may remain aligned while the description cell expands vertically. Tax columns may appear only on some pages. Some invoices repeat table headers on every page, while others continue rows with no separator at all. If your schema assumes every extracted row is already complete, you will end up with split descriptions, duplicated items, and totals that never reconcile.
The safest pattern is to model line items with enough fields to repair those cases after extraction: line_number when available, description, quantity, unit_price, tax_rate, tax_amount, line_total, page_number, and a continuation flag or confidence note when a row may have been wrapped or merged. That gives your post-processor somewhere to store uncertainty instead of silently flattening bad output into "valid" JSON.
After extraction, finance checks matter as much as the model response itself. At minimum, you should verify that:
- Subtotal + tax = total, within a small tolerance for rounding.
- Sum of line totals matches the stated subtotal or total, depending on how the document presents tax.
- Credit notes are normalized consistently, either as negative amounts or as a separate document_type with enforced sign rules.
- Currency stays consistent across all pages and all extracted amounts.
Here is the kind of reconciliation pass that catches many production issues before the data hits your accounting workflow:
```python
from decimal import Decimal


def nearly_equal(a: Decimal, b: Decimal, tolerance: Decimal = Decimal("0.02")) -> bool:
    return abs(a - b) <= tolerance


def validate_invoice_totals(invoice: dict) -> list[str]:
    errors: list[str] = []
    subtotal = Decimal(str(invoice.get("subtotal", 0)))
    tax = Decimal(str(invoice.get("tax", 0)))
    total = Decimal(str(invoice.get("total", 0)))
    currency = invoice.get("currency")

    computed_lines = sum(
        Decimal(str(item.get("line_total", 0)))
        for item in invoice.get("line_items", [])
    )

    if not nearly_equal(subtotal + tax, total):
        errors.append("Subtotal plus tax does not reconcile to total")

    if invoice.get("line_items") and not nearly_equal(computed_lines, subtotal):
        errors.append("Line items do not reconcile to subtotal")

    page_currencies = {
        page.get("currency")
        for page in invoice.get("pages", [])
        if page.get("currency")
    }
    if currency:
        page_currencies.add(currency)
    if len(page_currencies) > 1:
        errors.append("Currency is inconsistent across pages")

    if invoice.get("document_type") == "credit_note" and total > 0:
        errors.append("Credit note total should be negative or explicitly normalized")

    return errors
```
This is also where you decide how strict your system should be. Some teams reject documents that fail totals reconciliation. Others accept them with warnings and route them for review. The important point is that usable output is not just extracted output. If your multi-page PDF invoice extraction Python flow cannot explain why the numbers do or do not tie out, you do not yet have production-ready data.
When invoices get messy, focus less on the model's raw response and more on whether your schema preserves enough structure to regroup pages, repair broken rows, and reconcile totals reliably. That is what keeps extracted data usable once the happy-path examples disappear.
## Plan for Scans, Multilingual Layouts, and Other Failure Modes
Once the schema is stable, the next problem is failure routing. The awkward invoices are the ones that look normal to a person but are messy for a vision pipeline: low-resolution scans, phone photos with shadows, skewed pages, compressed PDFs, multilingual layouts, dense line-item grids, and supplier templates with several dates or totals on the same page. These cases often produce clean-looking JSON with financially wrong values, such as the due date captured as the invoice date or VAT dropped to zero because the label was unfamiliar.
Mixed PDFs are another common trap. A single file may contain an email cover sheet, a remittance page, and then the invoice you actually care about. If your pipeline treats every page as equally relevant, you can end up extracting reference numbers or summary totals that never belonged in the accounting record. That is why post-extraction validation rules for invoice APIs matter: they turn validation into a gate before anything reaches ERP imports, payment runs, tax reports, or three-way matching.
A production-grade validation layer should check rules such as:
- Invoice date must parse and fall within a reasonable range for the supplier relationship.
- Due date cannot be earlier than invoice date unless the document is clearly a credit note or adjustment.
- Total should approximately equal subtotal plus tax, within a small tolerance for rounding.
- Currency must be present when amounts are present.
- Line-item sums should reconcile to the invoice total when the document is itemized.
- Tax should not silently default to zero when tax labels or percentages appear elsewhere on the page.
- Page-level metadata should confirm that the extracted page is actually an invoice, not an email thread, statement, or cover sheet.
When validation fails, do not just discard the result. Use a failure-specific fallback path:

- If the issue is skew or poor image quality, retry with a prompt that tells the model to focus on the corrected page orientation and ignore marginal handwriting or background artifacts.
- If the issue is multilingual labeling, retry with explicit field synonyms, such as asking the model to look for supplier tax, invoice number, and total under local-language headings as well as English equivalents.
- If multiple totals are present, run a second pass that asks the model to explain which value is the final payable amount and why.
- If line items look collapsed or incomplete, switch to a table-focused extraction prompt and compare item count, quantity totals, or extended prices against the first pass.
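Those failure-specific retries are easier to maintain as a small lookup of targeted prompt additions. A sketch with hypothetical names and illustrative hint text:

```python
# Targeted prompt additions per failure mode; grow this map as your
# regression set surfaces new weaknesses.
FALLBACK_HINTS = {
    "skew_or_quality": (
        "The scan may be rotated or low quality. Read the corrected page "
        "orientation and ignore marginal handwriting and background artifacts."
    ),
    "multilingual_labels": (
        "Field labels may not be in English. Treat local-language headings for "
        "invoice number, tax, and total as equivalents of the English labels."
    ),
    "multiple_totals": (
        "Several totals appear on the page. Identify the final payable amount "
        "and explain briefly why it is the correct total."
    ),
    "collapsed_line_items": (
        "Focus only on the line-item table. List every row separately, even "
        "when descriptions wrap across visual lines."
    ),
}


def fallback_prompt(base_prompt: str, failure_mode: str) -> str:
    """Build a retry prompt tailored to the detected failure mode."""
    hint = FALLBACK_HINTS.get(failure_mode)
    if hint is None:
        return base_prompt
    return f"{base_prompt}\n\n{hint}"
```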
Low-confidence cases should be routed for review, not forced through the same automation path. In practice, that means flagging invoices where totals do not reconcile, required fields are missing, line-item counts vary across retries, or the model returns hedged explanations about ambiguous labels. Human review should focus on the specific fields that failed validation rather than rechecking the whole document from scratch.
You also want a regression set: a small but growing library of troublesome invoices that repeatedly expose weaknesses in your prompts and validators. Include bad scans, mobile photos, bilingual suppliers, invoices with multiple tax lines, documents with attached email pages, and long multi-page item tables. Run that set whenever you change prompts, schemas, or providers. Without it, you can improve the happy path while quietly making your worst real-world cases less reliable.
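The regression set only pays off if running it is trivial. A minimal scoring sketch (`score_regression` is a hypothetical helper, and the compared fields are an assumption to adjust for your schema):

```python
def score_regression(expected: dict, actual: dict,
                     fields: tuple[str, ...] = (
                         "invoice_number", "invoice_date", "vendor_name",
                         "tax_amount", "total_amount",
                     )) -> list[str]:
    """Compare one extraction against a known-good fixture.

    Run this over every invoice in the regression library whenever prompts,
    schemas, or providers change, and fail the run on any mismatch.
    """
    mismatches: list[str] = []
    for field in fields:
        if actual.get(field) != expected.get(field):
            mismatches.append(
                f"{field}: expected {expected.get(field)!r}, got {actual.get(field)!r}"
            )
    exp_items = expected.get("line_items", [])
    act_items = actual.get("line_items", [])
    if len(exp_items) != len(act_items):
        mismatches.append(
            f"line_items: expected {len(exp_items)} rows, got {len(act_items)}"
        )
    return mismatches
```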
The pattern is simple even if the implementation is not: extract, validate, retry with a targeted prompt, run a second pass when needed, and escalate uncertain cases. For invoice workflows, that discipline matters more than squeezing out one more point of benchmark accuracy, because downstream accounting logic cares about whether the numbers are right, not whether the JSON looked plausible.
## Decide When Raw Model Calls Are Enough and When a Managed SDK Wins
Raw provider calls are often the right starting point when your Python pipeline is still narrow in scope. If you are extracting a few invoices at a time, you already have file storage handled, and your team is comfortable owning prompts, schema validation, retries, and post-processing, calling GPT-4o, Claude, or Gemini directly can be perfectly reasonable. You get maximum control over prompts, response shaping, and provider choice. That matters when invoice extraction is part of your product differentiation rather than an internal workflow.
The trade-off is that invoice extraction stops being "just another API call" surprisingly quickly. Once you move beyond a demo, you usually end up building the same set of operational pieces over and over: upload orchestration for PDFs and images, async polling, retry logic, failed-page detection, output packaging, batch-level job tracking, shared prompt reuse, and rules for how invoice-level versus line-item outputs should flow into downstream systems. None of that is impossible, but it is real engineering work, and it keeps growing after the first successful extraction.
A useful way to compare the two paths is to focus on operational ownership, not model brand:
| Dimension | Raw provider calls | Managed invoice extraction SDK |
|---|---|---|
| Implementation complexity | You own file handling, prompt construction, schema enforcement, and result normalization | The SDK handles the extraction workflow so your code can focus on business logic |
| Latency management | You decide how to queue jobs, poll status, and handle timeouts | Polling and workflow steps are already modeled for extraction tasks |
| Batch handling | You build your own job tracking and batch recovery patterns | Designed for batch processing, including large sessions |
| Retry logic | You define retry policy for uploads, model errors, and partial failures | The extraction workflow already exposes task state and failed-page signals |
| Invoice-specific validation | You still need Pydantic or equivalent checks for fields, types, and totals | You still validate output in your app, but less plumbing sits around the extraction step |
| Export delivery | You usually transform JSON into CSV, XLSX, or downstream database records yourself | Output is available as XLSX, CSV, or JSON |
A more stable way to think about cost and latency is to ask where the operational burden sits:
| Operational question | If you stay direct | When managed extraction often wins |
|---|---|---|
| Image-heavy payload cost | Every retry resends expensive visual input unless you manage file reuse carefully | Better fit when repeated passes and retries are driving up both token spend and orchestration code |
| PDF preprocessing burden | You decide whether to rasterize, upload files, split pages, and merge results | Better fit when document plumbing is taking more time than extraction logic |
| File reuse overhead | You manage file IDs, storage, and second-pass extraction yourself | Better fit when the same invoices are touched by multiple steps or teams |
| Latency and polling | You own queues, backoff, timeouts, and partial failure handling | Better fit when job-state management is becoming a mini-platform inside your app |
| Staff-time tipping point | Usually still fine for narrow internal tools or low-volume flows | Often cheaper once batching, exports, and failed-page handling dominate engineering time |
That is the real threshold: raw model calls feel lightweight until your team is repeatedly rebuilding extraction infrastructure instead of improving extraction quality. A few warning signs usually show up together:
- You are writing upload and polling code in multiple services.
- You need consistent prompts reused across teams or customers.
- You are handling partial failures at the page level instead of only whole-request failures.
- You need export files, not just model JSON, because finance users want spreadsheets.
- You are processing batches large enough that orchestration code now matters as much as prompt quality.
This is where an invoice extraction API Python SDK becomes less about convenience and more about avoiding undifferentiated plumbing. Invoice Data Extraction's Python SDK installs with pip install invoicedataextraction-sdk and supports both a one-call extract(...) flow and staged upload_files(...), submit_extraction(...), and wait_for_extraction_to_finish(...) steps. It returns XLSX, CSV, or JSON, supports per_invoice or per_line_item outputs, and handles sessions of up to 6,000 files. JSON output still needs application-side parsing where typed values matter, and operational signals such as pages.failed_count and AI uncertainty notes make it easier to route low-confidence results for review.
That is why the cost question is not just model tokens or per-page charges. It is model usage plus the staff time required to maintain uploads, retries, polling, downloads, batch recovery, and spreadsheet export. If extraction is a supporting workflow rather than your product, a managed invoice extraction API for production workflows can become cheaper in engineering time before it becomes cheaper on a raw unit-cost spreadsheet. Use invoice API accuracy, speed, and cost benchmarks when you want to benchmark the direct-model path against a purpose-built extraction service.
A practical decision framework looks like this:
- Stay with raw provider calls when volume is modest, latency requirements are simple, and custom extraction behavior is part of your core product value.
- Add your own validation layer if you stay DIY. Pydantic checks, total reconciliation, and exception routing still matter even when model outputs look structured.
- Move to a managed SDK when your team is spending more time on upload orchestration, polling, retries, failed-page handling, exports, and batch operations than on the workflow your users actually care about.
- Choose staged SDK methods instead of one-call extraction when you need tighter control over job lifecycle, queue integration, or downstream processing.
- Treat managed extraction as a focus decision, not a capability surrender. If extraction is not the product you are selling, offloading the plumbing is often the better engineering choice.
About the author
David Harding
Founder, Invoice Data Extraction
David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.