Invoice Extraction with the Python SDK: A Practical Guide

Use the official Python SDK to extract structured data from invoice PDFs — one-call workflow, async polling, prompt control, and XLSX/CSV/JSON output.

Reading Time: 20 min
Topics: API & Developer Integration, Python, SDK, PDF extraction

An invoice extraction Python SDK turns invoice PDFs and images into structured data — XLSX, CSV, or JSON — through a single library call that handles upload, prompt-based extraction, asynchronous polling, and result download. Unlike a DIY OCR pipeline or hand-written vision-LLM code, the SDK takes care of failed-page reporting, batch sessions, and per-invoice or per-line-item output without manual plumbing.

Inputs are PDFs (native and scanned), JPGs, JPEGs, and PNGs. Outputs are XLSX, CSV, or JSON files containing the extracted fields and, when the prompt asks for them, the line items.

The Invoice Data Extraction Python SDK installs from PyPI:

pip install invoicedataextraction-sdk

It requires Python 3.9 or later.

The rest of this article frames where this SDK fits among the three Python integration paths and walks the actual workflow with the operational depth most vendor pages skip.

The three Python paths to invoice extraction

A developer building invoice ingestion in Python is choosing between three paths, and the SERP keeps gesturing at all three without telling you which one is yours. Naming them plainly is the first useful move.

Direct REST integration. Talk to the API over HTTPS, manage the upload session yourself, post the extraction job, poll for completion, and download the output. This is the right path when the team needs full control over HTTP behaviour — custom retry policies, in-house authentication flows, edge proxies, observability hooks the SDK does not surface — or when the runtime is not Python at all (Go, Rust, a Node Lambda) and a Python SDK is simply not on offer. The cost is real: multipart upload sessions, presigned URL expiry, idempotency keys, polling cadence, failed-page handling, retry-on-rate-limit. Every operational layer is yours to own. If you want to see what the bare REST surface actually looks like, our REST API quickstart for invoice extraction walks through it with curl.

DIY OCR or vision-LLM extraction in Python. Build the extraction pipeline yourself, either with classical OCR and PDF-parsing libraries (Tesseract, invoice2data, pdfplumber, Camelot) or with a multimodal model called directly from your code. This path is right when invoice extraction is incidental to a larger LLM pipeline already running in your infrastructure — you have the model client, the prompt scaffolding, the retry logic, the parsing — and adding invoices is a feature within that, not a separate integration. It is also the right path when data must not leave your own environment under any circumstances. The cost is the long tail of edge cases: multi-page invoices that need stitching, line-item reconstruction across page breaks, mixed-language vendors, handwritten notes, scans of varying quality. On top of those, you still own prompt design, schema discipline, batch fan-out, and retries. We have a longer write-up of building invoice extraction with a vision LLM in Python for teams sizing up that path specifically.

The official Python SDK. Use the vendor's library to call the vendor's workflow. Upload, submit, poll, download — composed for you, with failed pages reported, AI uncertainty notes attached, and output delivered in the format you asked for. This is the right path when the team wants the workflow as a building block and is willing to trust the vendor's accuracy and uptime in exchange for skipping the plumbing.

This is not a quality ranking. Each path is correct in different conditions. The article's job from here on is to help you recognise which condition you are in — and to do that, you need a clear picture of what the SDK actually does, which is the next section.

One reason to bias toward Python in the first place is that the developer pool around it is enormous and growing: IEEE Spectrum's 2025 top programming languages ranking puts Python first in both the Spectrum default ranking and the Jobs ranking for 2025, the latter measuring what skills employers are actively hiring for. If your invoice ingestion has to be maintainable by whoever joins the team next year, Python is the safe runtime.

What the SDK actually does: one call versus the staged workflow

The SDK is a wrapper over the underlying official invoice extraction API. The REST API is the source of truth for what the platform can do; the SDK is the ergonomic Python interface to it. Every method on the client maps to one or more REST endpoints, and any capability the API has, the SDK exposes.

Construction takes an API key. The conventional pattern is to read it from an environment variable rather than hard-code it:

import os
from invoicedataextraction import InvoiceDataExtraction

client = InvoiceDataExtraction(api_key=os.environ["INVOICE_DATA_EXTRACTION_API_KEY"])

From there, the SDK gives you two ways to extract: a single composed call, or a sequence of staged calls.

The one-call extract() method

client.extract() orchestrates the entire workflow in a single blocking call. Internally it creates an upload session, uploads each file in parts, submits the extraction job, polls for completion, and downloads the output. Its key parameters:

  • folder_path or files — point it at a folder of invoices or pass an explicit list of file paths.
  • prompt — a natural-language string or a structured field-prompt dict (the next section walks both shapes).
  • output_structure — "automatic", "per_invoice", or "per_line_item".
  • download — a dict like {"formats": ["xlsx", "json"], "output_path": "./output"} controlling which output formats are written locally and to which directory.
  • task_name — a short human-readable label that surfaces in the dashboard alongside web extractions.
  • polling — optional {"interval_ms": 10000, "timeout_ms": None}.
  • on_update — optional callback invoked at each stage transition.
  • exclude_columns — drop columns like source_file from the output if you do not want them.

A typical script looks like this:

result = client.extract(
    folder_path="./inbox",
    prompt="Extract invoice number, invoice date, vendor name, net amount, tax, total. One row per invoice.",
    output_structure="per_invoice",
    download={"formats": ["xlsx"], "output_path": "./output"},
    task_name="ap-batch-2026-05-04",
)

The call returns when the extraction is complete and the requested files are on disk.

The staged workflow

For everything else, the SDK exposes the workflow as separable methods that map directly to REST endpoints:

  • upload_files(...) — creates the upload session and uploads files; returns an upload_session_id and the per-file file_ids.
  • submit_extraction(...) — posts the extraction job against an upload session; returns an extraction_id.
  • wait_for_extraction_to_finish(extraction_id, ...) — blocks while polling until the job reaches a terminal state.
  • check_extraction(extraction_id) — one-shot status read for systems that drive their own polling.
  • download_output(extraction_id, format, file_path) — writes the chosen format to disk.
  • get_download_url(extraction_id, format) — returns a presigned URL (5-minute TTL) you can hand to a downstream consumer.
  • delete_extraction(extraction_id) and get_credits_balance() — housekeeping.

Use the staged workflow when the phases need to be decoupled. A few common shapes:

  • Uploads happen in one service (an ingestion API receiving invoices from a vendor portal), and extraction submission happens in another (a worker reading from a job queue).
  • Partial-failure handling needs to differ per phase — the upload retry policy is not the same as the extraction retry policy.
  • The download step lives in a downstream consumer that just needs a presigned URL with a short TTL, not the bytes.
  • A long batch is queued overnight, and a separate scheduler polls in the morning rather than holding a Python process open.

submit_extraction accepts a submission_id parameter that acts as an idempotency key. Posting the same submission_id twice will not create a second extraction — useful if the submitting worker is at-least-once rather than exactly-once.
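
How the staged calls compose, as a minimal sketch: the method names are the ones listed above, but the keyword arguments and dict-shaped return values here are assumptions, not confirmed signatures.

import uuid

# Phase 1: upload (e.g., in the ingestion service)
upload = client.upload_files(folder_path="./inbox")

# Phase 2: submit with an idempotency key (e.g., in a queue worker).
# Re-posting the same submission_id will not create a second extraction.
submission_id = str(uuid.uuid4())  # persist this alongside the job record
extraction = client.submit_extraction(
    upload_session_id=upload["upload_session_id"],  # assumed keyword name
    prompt="Extract invoice number, vendor name, total. One row per invoice.",
    output_structure="per_invoice",
    submission_id=submission_id,
)

# Phase 3: block until terminal, then write the output to disk
client.wait_for_extraction_to_finish(extraction["extraction_id"])
client.download_output(extraction["extraction_id"], "xlsx", "./output/batch.xlsx")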

For long batches, the on_update callback is the right surface for logging and UI updates. It receives a dict shaped {stage, level, message, progress, extraction_id} at every internal transition (upload start, upload complete, submission accepted, polling tick, download start, completion). Wire it to your logger or your progress bar and you get visibility without inspecting status responses by hand.
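
Wiring it to the standard logging module might look like this (the dict keys are the ones just described; the rest is a sketch):

import logging

logger = logging.getLogger("invoice_ingestion")

def log_update(update: dict) -> None:
    # update carries: stage, level, message, progress, extraction_id
    logger.info(
        "[%s] %s (progress=%s, extraction_id=%s)",
        update["stage"],
        update["message"],
        update.get("progress"),
        update.get("extraction_id"),
    )

result = client.extract(folder_path="./inbox", prompt=prompt, on_update=log_update)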

Controlling the extraction: prompt strings and structured field prompts

The prompt parameter accepts two shapes: a single natural-language string, or a structured Python dict. They produce the same kind of output; they differ in who is composing the prompt and how.

The string prompt

The simplest form is a plain string. The platform is built around the idea that an extraction prompt can be written in finance-team English, not configured through a wizard:

prompt = (
    "Extract invoice number, invoice date, vendor name, net amount, tax, total. "
    "One row per invoice. Format dates as YYYY-MM-DD."
)

A string prompt can carry up to roughly 2,500 characters, which is enough room for field selection, business rules, formatting instructions, and document-handling hints in the same block. Most ingestion pipelines run on a string prompt that someone in the finance team wrote once, refined once or twice against a sample batch, and then saved.

The structured field prompt

The structured form is a Python dict with a fields list and an optional general_prompt string:

prompt = {
    "fields": [
        {"name": "Invoice Number"},
        {"name": "Invoice Date", "prompt": "The date the invoice was issued, NOT the due date."},
        {"name": "Vendor Legal Name", "prompt": "Prefer extracting from the footer if present."},
        {"name": "Total Amount"},
    ],
    "general_prompt": "One row per invoice. Format dates as YYYY-MM-DD.",
}

Each field has a name (which becomes the column header in the output) and an optional per-field prompt that scopes its instruction to that field alone. The general_prompt carries instructions that apply across the whole extraction.

The trade-off is straightforward. Reach for the string form when the extraction is a standard recurring job, when the prompt is being authored by a domain expert in plain language, or when the prompt is being seeded from a saved Prompt Library entry. Reach for the structured form when the prompt is being built programmatically — assembled from a database of required fields, generated from a target schema, composed across a multi-team approval process — and per-field instructions need to live separately from the global one. If your calling system already has a structured representation of the schema it wants, the dict form lets you pass that representation through without flattening it back into prose.
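
For illustration, here is the dict seeded from a schema table. The required_fields rows below are invented; the fields/general_prompt shape is the documented one.

# Hypothetical schema rows, e.g. loaded from a database of required fields
required_fields = [
    ("Invoice Number", None),
    ("Invoice Date", "The date the invoice was issued, NOT the due date."),
    ("Total Amount", None),
]

prompt = {
    "fields": [
        {"name": name, **({"prompt": hint} if hint else {})}
        for name, hint in required_fields
    ],
    "general_prompt": "One row per invoice. Format dates as YYYY-MM-DD.",
}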

Both shapes accept the same kinds of instructions:

  • Field selection and column naming. Name the fields you want; the column headers in the output match the names you give.
  • Output structure. One row per invoice, one row per line item, custom column ordering, joining line-item descriptions into a single cell.
  • Business logic. Defaults ("if Tax Amount is missing, set it to 0"), fallbacks ("find the PO Number in the header; if absent, take it from the Reference field"), conditionals ("if Currency is USD, extract Tax from State Tax; if EUR, from VAT").
  • Document handling. Skip pages whose title is "Email Cover Sheet"; classify document type as Invoice or Credit Note and prefix credit-note invoice numbers with CR-; treat Statements of Account by extracting each invoice in the summary table as its own row.
  • Data formatting. Date format, decimal precision, native Excel typing for numeric columns.

For deeper guidance on writing prompts that hold up across document variation — vendor-specific layouts, multi-language invoices, edge cases — see designing the invoice extraction prompt. The shape of the prompt is the SDK's surface area; the craft of the prompt is its own discipline.

Output shape: per-invoice, per-line-item, and choosing XLSX, CSV, or JSON

Two output decisions sit in front of every extraction call: row granularity and file format. The SDK exposes both as parameters; the right answer depends entirely on what consumes the output.

Row granularity: per-invoice or per-line-item

The output_structure parameter takes three values:

  • "per_invoice" produces one row per invoice. Header fields, totals, vendor — each invoice condenses to a single record.
  • "per_line_item" produces one row per line item. The parent invoice's identifying fields (invoice number, date, vendor, currency) are repeated on every line so each row stands alone.
  • "automatic" lets the AI choose based on the prompt and the document content. The completed response echoes back the effective output_structure, so the caller always knows which shape was actually used.

Match the granularity to the consuming workflow. Per-invoice is right for AP processing, payment runs, vendor reconciliation, and most accounting handoffs where one invoice is one record in the receiving system. Per-line-item is right for line-level spend analysis, expense categorization, GL coding, and any work where the line is the unit of analysis — splitting a single invoice across cost centres, classifying line descriptions into expense categories, reconciling line items to a purchase order. If the downstream task answers questions about the invoice, you want per-invoice. If it answers questions about what was bought, you want per-line-item.

File format: XLSX, CSV, or JSON

All three formats are generated server-side simultaneously when the extraction completes. The download.formats parameter on extract() is a list, so a single run can write XLSX for the finance team and JSON for the downstream service:

result = client.extract(
    folder_path="./inbox",
    prompt=prompt,
    output_structure="per_invoice",
    download={"formats": ["xlsx", "json"], "output_path": "./output"},
)

The decision frame:

  • JSON is right for code paths and APIs — automated downstream consumption, validation pipelines, message queues, downstream LLM steps. Anywhere the next consumer is software, JSON is the format with the least friction.
  • CSV is right for lightweight tabular handoff, easy diffing, and tools that prefer plain text over a binary format. Useful when the receiving system is older accounting software, a data-warehouse loader, or a script that does not want to depend on a spreadsheet library.
  • XLSX is right for finance and accounting handoff where the recipient opens the file in Excel. Numbers come through as numbers, dates as dates, and the file is pivot-ready and visually reviewable without a conversion step. This matters more than developers usually credit: a controller reviewing a month's extracted invoices wants to scan, sort, and pivot, not parse.

Most ranking content for this category defaults to JSON without explaining why, and that default quietly assumes the consumer is another piece of code. If the consumer is a human in finance, XLSX or CSV is often the right answer.
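
When the consumer is code, the JSON path is short. A sketch of downstream consumption, assuming the JSON output is a list of row dicts and that the metadata column is named source_file (as the exclude_columns note earlier suggests); the file path is illustrative:

import json
from collections import defaultdict

with open("./output/extraction.json") as f:
    rows = json.load(f)

# Group extracted rows by the source PDF they came from, for review
by_source = defaultdict(list)
for row in rows:
    by_source[row["source_file"]].append(row)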

The output also carries operational metadata regardless of format. Every row references the source file and page number it came from, which makes cross-referencing back to the original PDF a one-click step rather than a forensic exercise. The completed response also carries the AI's per-task notes about assumptions made during extraction and a list of any pages that failed to process. The next section walks those operational details — the part most vendor pages skip.

The operational layer: polling, failed pages, AI uncertainty notes, and batch limits

Extraction is asynchronous on the server, large batches run for minutes rather than seconds, and individual pages can fail without taking the rest of the batch with them. These are the realities a production pipeline meets in its first week. The SDK exposes them directly rather than papering over them.

Polling

A small batch finishes in seconds; a 6,000-file batch runs for minutes. The one-call extract() method blocks while it polls internally, so for most callers the asynchrony is invisible. The polling parameter accepts {"interval_ms": 10000, "timeout_ms": None} and defaults to a 10-second poll interval with no timeout — set timeout_ms to a finite number when the calling context cannot afford to block indefinitely.

The staged workflow exposes the same blocking helper as wait_for_extraction_to_finish(extraction_id, ...) and a one-shot status read as check_extraction(extraction_id). The one-shot form is the right surface when the calling system already runs its own scheduler — a Celery beat job, a Cloud Scheduler tick, an AWS Lambda EventBridge rule — and wants to drive polling on its own cadence rather than holding a process open.
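
In a scheduler-driven worker, one tick might look like this. check_extraction and download_output are the documented methods; the status field name and its values are assumptions.

def poll_once(extraction_id: str) -> bool:
    """One scheduler tick: returns True once the job is terminal."""
    status = client.check_extraction(extraction_id)
    state = status.get("state")  # assumed field name and values
    if state == "completed":
        client.download_output(extraction_id, "json", f"./output/{extraction_id}.json")
    return state in ("completed", "failed")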

Timeouts

When timeout_ms is set and exceeded, the SDK raises an SdkError carrying the SDK_TIMEOUT_ERROR code. The extraction itself is not cancelled — the server is still working — and the developer can resume polling later with wait_for_extraction_to_finish or check_extraction against the same extraction_id. This is the right model: the client's patience and the server's progress are decoupled, and a slow batch does not orphan its work.
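
A sketch of the resume pattern, assuming the timeout surfaces as an SdkError carrying that code (the error classes are covered below) and that the blocking helper accepts the same polling dict as extract():

from invoicedataextraction.errors import SdkError

try:
    client.wait_for_extraction_to_finish(
        extraction_id, polling={"interval_ms": 10000, "timeout_ms": 600_000}
    )
except SdkError as error:
    if error.body["error"]["code"] == "SDK_TIMEOUT_ERROR":
        # The server is still working; persist the extraction_id and
        # resume polling from a later run instead of failing the job.
        save_for_resume(extraction_id)  # hypothetical persistence hook
    else:
        raise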

Failed-page reporting

Every completed response includes a pages block:

{
  "successful_count": 482,
  "failed_count": 3,
  "successful": [{"file_name": "...", "page": 1}, ...],
  "failed": [
    {"file_name": "vendor-acme-2026-04.pdf", "page": 7},
    {"file_name": "vendor-globex-2026-04.pdf", "page": 2},
    {"file_name": "vendor-initech-2026-04.pdf", "page": 11},
  ],
}

Each failed-page entry names the file and the page number. This is the difference between "something went wrong with this batch" and "page 7 of the Acme invoice did not parse — probably the watermark on the scan; flag it for re-scan and continue." Vendor pages elsewhere on the SERP rarely surface this level of detail; the SDK gives it to you on every run.
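
A sketch of turning that block into a re-scan queue, assuming the completed response is exposed on the result as a plain dict:

pages = result["pages"]  # assumed access path on the result object
for entry in pages["failed"]:
    # e.g. flag "vendor-acme-2026-04.pdf", page 7 for re-scan
    flag_for_rescan(entry["file_name"], entry["page"])  # hypothetical hook
print(f'{pages["successful_count"]} pages extracted, {pages["failed_count"]} flagged')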

AI uncertainty notes

Every completed response also includes an ai_uncertainty_notes array. Each note carries a topic, a description of the assumption the AI made, and a suggested_prompt_additions list with concrete instruction text that would resolve the ambiguity on the next run:

{
  "topic": "Mixed document types",
  "description": "Some files contained both invoices and remittance advice on consecutive pages. Remittance advice was excluded from extraction.",
  "suggested_prompt_additions": [
    {
      "purpose": "Make the document-type rule explicit",
      "instructions": "Skip any pages classified as remittance advice or payment confirmations.",
    }
  ],
}

This is the iteration loop for refining a recurring extraction prompt. Run a representative batch, read the notes, fold the suggested additions into the saved prompt, and the next batch lands cleaner. When the array is empty, the AI made no assumptions worth flagging.
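
The review step can be partially automated. A sketch, assuming the notes are exposed on the result under ai_uncertainty_notes in the shape shown:

for note in result.get("ai_uncertainty_notes", []):
    print(f'[{note["topic"]}] {note["description"]}')
    for addition in note["suggested_prompt_additions"]:
        # Review each suggestion, then fold the keepers into the saved prompt
        print(f'  suggested: {addition["instructions"]}')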

Error handling

The SDK raises two exception classes: SdkError for client-side and validation problems (filesystem, network, upload, timeout) and ApiResponseError for errors the server returned. Both expose a .body matching:

{"success": False, "error": {"code": "...", "message": "...", "retryable": bool, "details": None | dict}}

A practical handler:

from invoicedataextraction.errors import SdkError, ApiResponseError

try:
    result = client.extract(...)
except (SdkError, ApiResponseError) as error:
    code = error.body["error"]["code"]
    if code == "INSUFFICIENT_CREDITS":
        ...  # alert finance, halt the batch
    elif code in ("UNAUTHENTICATED", "API_KEY_EXPIRED", "API_KEY_REVOKED"):
        ...  # rotate the key
    elif code == "INVALID_INPUT":
        ...  # inspect error.body["error"]["details"]["issues"]
    else:
        raise

RATE_LIMITED is auto-retried internally with Retry-After honoured, so application code does not need to handle it directly.

Batch behaviour

Per upload session: up to 6,000 files, individual PDFs up to 150 MB, JPGs and PNGs up to 5 MB, total batch up to 2 GB. A single PDF can carry up to 5,000 pages — useful for the common case where many invoices have been concatenated into one file. The platform identifies document types within mixed batches and filters non-relevant pages such as email cover sheets, remittance advice, and summary pages without a separate instruction, although you can override or refine that behaviour through the prompt. Throughput typically lands at 1–8 seconds per page, often 2 seconds per page or less for jobs over 500 documents.

Credits

The platform consumes 1 credit per successfully processed page. Pages that fail to process do not consume credits, which means the cost of a batch tracks the work that actually completed rather than the work that was attempted. Credits are shared between web and API usage from a single account balance — worth knowing if a team runs an interactive web user (the controller scanning month-end PDFs) and an automated SDK pipeline (the worker ingesting daily vendor email attachments) against the same account. The balance is one pool; both consumers draw from it.
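
Because both consumers draw from one pool, a pre-flight balance check before a large automated batch can prevent a mid-run halt. A sketch, with the return shape of get_credits_balance() assumed:

pending_pages = estimate_page_count("./inbox")  # hypothetical page counter
balance = client.get_credits_balance()  # assumed to return {"balance": int}
if balance["balance"] < pending_pages:
    raise RuntimeError(
        f"Batch needs ~{pending_pages} credits; balance is {balance['balance']}."
    )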

After extraction: validating the structured output

The SDK's job ends when structured output is downloaded. Anything the application's own data model requires beyond that — type coercion, business-rule validation, missing-field policy, currency normalization, vendor canonicalization, idempotent persistence — is application code, not SDK code. Drawing that line cleanly is part of building the integration well.

A validation layer is worth the effort even when the extraction is reliable. The response is structurally consistent, but the values inside it were extracted from semi-structured input. Totals can come through as strings on some invoices and numbers on others depending on the source document. Dates can vary by locale. Vendor names need canonicalization before they hit the AP master, because "Acme, Inc.", "Acme Inc", and "ACME INCORPORATED" are the same supplier and your accounting system needs to know that. Downstream systems will reject records that violate their own constraints — required fields, foreign-key integrity, allowed values — and they will reject them noisily at the moment of insert, when the original context is gone. A typed validation layer at the boundary catches these problems where they are easy to handle, instead of letting them leak into the warehouse and surface as a reconciliation problem two weeks later.

Pydantic is the practical Python choice for this layer. It gives you a typed model of the extracted record, runtime validation on every parsed payload, and a clean place to attach business-rule checks: totals must equal the sum of line items within a tolerance, vendor IDs must resolve against the AP master, currency codes must be valid ISO 4217. The walk-through with concrete model definitions, validators, and integration patterns lives in our dedicated piece on validating extracted invoice JSON with Pydantic. Read it as the next step after the SDK call lands.
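
As a taste of what that layer looks like, a minimal sketch (not the models from the dedicated piece):

from datetime import date
from decimal import Decimal
from pydantic import BaseModel, field_validator

class InvoiceRecord(BaseModel):
    invoice_number: str
    invoice_date: date
    vendor_name: str
    net_amount: Decimal
    tax: Decimal
    total: Decimal

    @field_validator("total")
    @classmethod
    def total_is_consistent(cls, v: Decimal, info) -> Decimal:
        net, tax = info.data.get("net_amount"), info.data.get("tax")
        # Business rule: total must equal net + tax within a cent
        if net is not None and tax is not None and abs(v - (net + tax)) > Decimal("0.01"):
            raise ValueError("total does not equal net_amount + tax within tolerance")
        return v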


When the SDK is the right call

Three honest answers, depending on where you sit.

Use the SDK when the team is shipping in Python, invoice extraction is a primary need rather than an incidental feature, and the upload-submit-poll-download workflow is something you would rather consume than build. The operational layer earns its keep here: failed-page reporting, AI uncertainty notes, batch sessions up to 6,000 files, output-format choice across XLSX, CSV, and JSON. Building that infrastructure in-house is a project; consuming it as a library call is an afternoon.

Use direct REST integration when the runtime is not Python, when the team needs full control over HTTP behaviour (custom retry policies, edge proxies, in-house auth flows, observability the SDK does not surface), or when the integration lives inside an existing service that already owns a robust HTTP client and prefers to keep dependencies thin. The REST surface is the same workflow; you are just declining the ergonomic wrapper.

Use DIY OCR or a vision-LLM when invoice extraction is a small feature inside a larger LLM pipeline already running in your infrastructure, when data must not leave your own environment under any circumstances, or when the team has the operational appetite to own prompt design, retries, batch fan-out, and edge-case handling end to end. For teams sizing this option against the SDK approach, our broader survey of Python invoice extraction options compares the DIY libraries side by side.

If you are still weighing the call, the fastest way to decide is to install the SDK, run a small batch against your own representative invoices, and read both the structured output and the AI uncertainty notes the response comes back with. The iteration loop is shorter from a working extraction than from a planning document.

Extract invoice data to Excel with natural language prompts

Upload your invoices, describe what you need in plain language, and download clean, structured spreadsheets. No templates, no complex configuration.

  • Exceptional accuracy on financial documents
  • 1–8 seconds per page with parallel processing
  • 50 free pages every month — no subscription
  • Any document layout, language, or scan quality
  • Native Excel types — numbers, dates, currencies
  • Files encrypted and auto-deleted within 24 hours