Python developers can extract invoice data using three distinct approaches: template-based parsing with open-source libraries like invoice2data, OCR engines like Tesseract for scanned documents, or API and SDK integration for production-ready extraction that returns structured JSON without template maintenance. Modern extraction SDKs handle the full pipeline (upload, process, return structured data) in a few lines of code, while template-based tools give you granular control at the cost of ongoing configuration work.
Python is the natural fit for this kind of work. According to the JetBrains Python Developers Survey 2024, co-produced with the Python Software Foundation and surveying over 25,000 developers, 51% of all surveyed Python developers are involved in data exploration and processing, making it the language's most common use case alongside web development. The task of extracting invoice data follows the same pattern: take unstructured financial documents, pull out the fields that matter, and feed them into a downstream pipeline.
Each approach occupies a different point on the control-versus-maintenance spectrum. Template-based extraction with invoice2data gives you full control over field mappings but requires a new template for every vendor format. OCR handles scanned documents and images that template parsers cannot read, but the output is raw text that still needs parsing logic. API and SDK-based extraction offloads both the recognition and the structuring — you send a file, get back structured data with labeled fields, and skip the template management entirely. The trade-off is external dependency and per-page pricing.
Template-Based Extraction with invoice2data
The most established open-source Python library for invoice extraction is invoice2data, available on PyPI. It takes a template-based approach: each vendor's invoice layout is described by a YAML template file that defines where fields like invoice number, date, and total amount appear in the document. When you run an extraction, invoice2data pulls raw text from the PDF using an underlying tool (pdftotext, pdfminer, or pdfplumber), then applies regex patterns from the matching template to parse out structured data.
Installation is straightforward:
pip install invoice2data
Basic usage requires just a few lines:
from invoice2data import extract_data
from invoice2data.extract.loader import read_templates
templates = read_templates("path/to/templates/")
result = extract_data("invoice.pdf", templates=templates)
print(result)
The real work lives in the template files. Each YAML template identifies a vendor by keywords found in the document text, then uses regex patterns to capture specific fields:
issuer: Acme Corp
keywords:
  - Acme Corp
  - acme-corp.com
fields:
  invoice_number:
    parser: regex
    regex: Invoice\s*#?\s*(\d+)
  date:
    parser: regex
    regex: Date:\s*(\d{2}/\d{2}/\d{4})
  total:
    parser: regex
    regex: Total Due:\s*\$?([\d,]+\.\d{2})
When invoice2data processes a PDF, it extracts the full text, scans the keyword list to find the right template, then runs each field's regex against the text. The output is a dictionary of extracted values. From there you can load the results into a pandas DataFrame for validation, export them to CSV or Excel, or hand them to downstream accounting workflows.
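A minimal sketch of that hand-off, assuming results shaped like the fields in the template above. The sample dictionaries and the `results_to_csv` helper are illustrative rather than captured invoice2data output, and the standard library's csv module stands in for pandas so the snippet runs without extra dependencies.

```python
import csv
import io

# Hypothetical results, shaped like the fields in the template above.
# Real invoice2data output may carry additional keys.
results = [
    {"issuer": "Acme Corp", "invoice_number": "10482", "date": "03/15/2024", "total": "1,250.00"},
    {"issuer": "Acme Corp", "invoice_number": "10517", "date": "04/02/2024", "total": "980.40"},
]

FIELDS = ["issuer", "invoice_number", "date", "total"]


def results_to_csv(rows, fields=FIELDS):
    """Serialize extraction results to CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore", restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


csv_text = results_to_csv(results)
print(csv_text)
```

With pandas installed, `pandas.DataFrame(results)` accepts the same list of dictionaries if you prefer to validate in a DataFrame first.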
This approach works well in a specific scenario: you process invoices from a small, stable set of vendors whose layouts rarely change. For a company receiving invoices from five suppliers with consistent formatting, writing and maintaining five templates is entirely manageable.
The limitations surface quickly beyond that narrow case:
- Every new vendor requires a new template. There is no generalization. A template written for one supplier's format extracts nothing useful from another supplier's invoice.
- Regex patterns are brittle. If a vendor updates their invoice layout, moves a field, or changes the label from "Invoice #" to "Inv. No.", the template breaks silently or extracts incorrect data.
- No built-in OCR. invoice2data works on native (digitally created) PDFs by default. Scanned invoices or photos require additional configuration with Tesseract, adding setup complexity and a separate dependency.
- Line-item table extraction is painful. Pulling individual line items from invoice tables requires complex template engineering with multiline regex patterns, and results are inconsistent across different table layouts.
- Template maintenance does not scale. Organizations dealing with hundreds of vendor formats face a compounding maintenance burden. Each template must be authored, tested, and updated when vendors change their documents.
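The brittleness described in the list above can be reproduced with the invoice-number pattern from the Acme Corp template. This is a small sketch using Python's `re` module directly rather than invoice2data itself; the three sample layouts are invented for illustration.

```python
import re

# The invoice-number pattern from the Acme Corp template above.
INVOICE_NUMBER = re.compile(r"Invoice\s*#?\s*(\d+)")


def extract_invoice_number(text):
    """Return the first captured invoice number, or None when nothing matches."""
    match = INVOICE_NUMBER.search(text)
    return match.group(1) if match else None


original_layout = "Acme Corp\nInvoice #10482\nDate: 03/15/2024"
relabelled_layout = "Acme Corp\nInv. No. 10482\nDate: 03/15/2024"
reordered_layout = "Acme Corp\nInvoice 2 of 3\nInvoice #10482"

print(extract_invoice_number(original_layout))    # 10482
print(extract_invoice_number(relabelled_layout))  # None
print(extract_invoice_number(reordered_layout))   # 2 (a page counter, silently wrong)
```

The second case fails loudly enough to catch with a None check. The third is the harder one: the pattern matches, so nothing signals that the value is wrong.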
For a Python invoice parser handling a handful of known formats, invoice2data is a reasonable starting point. The library is mature, well-documented, and gives you direct control over extraction logic. But for teams processing invoices from diverse or changing vendor pools, the template-per-vendor model creates an operational bottleneck that grows with every new supplier relationship.
Handling Scanned and Image Invoices with OCR
The template-based approach above assumes your invoice PDF contains actual text. Many invoices, however, arrive as scanned documents or phone photos where the PDF is essentially a wrapped image. pdfplumber, one of several Python libraries built for PDF table extraction (each with different strengths on invoice layouts), can extract text and table structures from native PDFs with embedded text layers, but it returns nothing useful from a scanned page because there is no text to extract. This is where OCR (Optical Character Recognition) enters the picture, converting document images into machine-readable text.
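One way to decide which pages need OCR is to check how much embedded text each page yields. The sketch below assumes you have already collected one text value per page (for example from pdfplumber's `page.extract_text()`, which may give an empty string or None for image-only pages). The `pages_needing_ocr` helper and its 20-character threshold are arbitrary illustrations, not part of any library.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages whose embedded text is missing or too short."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


# Hypothetical per-page text values for three documents.
native = ["Invoice #10482\nDate: 03/15/2024\nTotal Due: $1,250.00"]
scanned = [None, ""]
mixed = native + ["  \n "]

print(pages_needing_ocr(native))   # []
print(pages_needing_ocr(scanned))  # [0, 1]
print(pages_needing_ocr(mixed))    # [1]
```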
The Tesseract approach in Python
Tesseract is the most widely referenced open-source OCR engine for Python invoice processing, though it is no longer the only viable option — engines like EasyOCR, PaddleOCR, and Surya each handle invoices differently. The setup requires both a Python wrapper and the system-level Tesseract binary:
pip install pytesseract Pillow
# Also install Tesseract binary: https://github.com/tesseract-ocr/tesseract
Basic usage is straightforward:
import pytesseract
from PIL import Image
invoice_image = Image.open("scanned_invoice.png")
raw_text = pytesseract.image_to_string(invoice_image)
print(raw_text)
This gives you the full text content of the invoice as a single string. What it does not give you is structured data. The output is a block of text with no understanding of which parts are the invoice number, vendor name, line items, or totals.
Why tutorials from 2020 no longer reflect best practice
If you have searched for Python invoice OCR before, you have likely encountered the widely shared PyImageSearch tutorial pattern from around 2020. That approach chains OpenCV preprocessing (rotation correction, noise removal, thresholding) with Tesseract to improve OCR accuracy on problematic scans. At the time, this was genuinely state-of-the-art for extracting data from PDF invoices in Python. The fundamental problem is what comes after the OCR step.
Raw OCR output is unstructured text. To extract specific invoice fields from it, you need to build your own parsing layer: regex patterns to find invoice numbers, positional heuristics to identify totals, and custom logic for every vendor format you encounter. This is the same template maintenance burden as rule-based tools, but built on shakier ground.
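To make that parsing layer concrete, here is a minimal sketch of what it tends to look like. The `raw_text` sample is invented (real Tesseract output is usually noisier), and the patterns are tuned to this one layout, which is exactly the maintenance problem described above.

```python
import re

# Hypothetical OCR output for one invoice.
raw_text = """ACME CORP
Invoice No: 10482
Invoice Date: 15/03/2024
Due Date: 14/04/2024
Total Due: $1,250.00
"""

PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No|#)[:.]?\s*(\d+)", re.IGNORECASE),
    "invoice_date": re.compile(r"Invoice\s+Date:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
    "total": re.compile(r"Total\s+Due:\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}


def parse_fields(text):
    """Apply each pattern and record None for fields that were not found."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


fields = parse_fields(raw_text)
print(fields)
```

Every new vendor label ("Inv.", "Amount Payable", a different date format) means another pattern or another branch in this dictionary.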
The pain is sharpest with line items and tables. Column alignment in OCR output is unreliable. Multi-line descriptions merge with adjacent columns. A quantity of "1" drifts into the unit price column. Any developer who has processed a batch of real-world invoices through Tesseract has encountered these problems. Low-quality scans, skewed images, and multi-column layouts compound the accuracy issues further.
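The column-drift problem is easy to reproduce without running OCR at all. The sketch below uses a naive splitter that treats runs of two or more spaces as column boundaries. Both sample lines are invented, and the second imitates an OCR result where one gap collapsed to a single space.

```python
import re

COLUMNS = ["description", "quantity", "unit_price", "line_total"]


def split_columns(line):
    """Naive column split: treat runs of two or more spaces as separators."""
    return re.split(r"\s{2,}", line.strip())


clean_line = "Widget A    1    9.99    9.99"
# Hypothetical OCR result where the gap between quantity and unit price
# collapsed to a single space.
drifted_line = "Widget A    1 9.99    9.99"

print(dict(zip(COLUMNS, split_columns(clean_line))))
print(dict(zip(COLUMNS, split_columns(drifted_line))))
```

In the second result the quantity cell holds "1 9.99", the line total lands in the unit price column, and the last column is missing entirely.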
What has changed since then
Vision models and LLM-powered extraction have fundamentally shifted how invoice data extraction works in Python. Instead of a pipeline where OCR produces raw text and custom regex tries to parse it, modern approaches process the invoice image directly, with no separate OCR pass and no OpenCV preprocessing. These models interpret the spatial layout of the page to distinguish header fields from line-item tables, recognize field relationships by context (differentiating an invoice date from a due date based on position and label, not regex), and generalize across vendor formats without per-vendor configuration. The core shift in LLM-powered invoice extraction is that these models reason about document structure rather than pattern-matching against raw text output.
For developers still maintaining Tesseract-based pipelines, the calculus has changed. The engineering effort required to handle OCR accuracy issues, build format-specific parsers, and maintain them across vendor changes is now difficult to justify when API and SDK-based alternatives handle the full extraction pipeline in a single call.
API and SDK-Based Extraction in Python
Template-based and OCR approaches both leave you building and maintaining parsing logic. An invoice data extraction API takes a different approach: you upload documents, describe what to extract, and receive structured output. Your code handles integration, not parsing.
The InvoiceDataExtraction Python SDK wraps the full REST API workflow into a few method calls. Install it with pip (Python 3.9+ required):
pip install invoicedataextraction-sdk
Initialize the client with your API key, which you generate from the user dashboard:
from invoicedataextraction import InvoiceDataExtraction
import os
client = InvoiceDataExtraction(
    api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY"),
)
The free tier includes 50 pages per month with no credit card required, enough to evaluate the SDK against your own invoices.
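One detail in the initialization above is worth guarding: `os.environ.get` returns None when the variable is unset, so a missing key may only surface later as an authentication error. A small sketch of a fail-fast check follows; `load_api_key` is a hypothetical helper, not part of the SDK.

```python
import os

ENV_VAR = "INVOICE_DATA_EXTRACTION_API_KEY"


def load_api_key(environ=os.environ):
    """Return the API key, raising a clear error when it is missing or blank."""
    key = environ.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set the {ENV_VAR} environment variable before running.")
    return key


# Demonstrate the failure path with an empty mapping instead of the real environment.
try:
    load_api_key({})
except RuntimeError as error:
    missing_key_message = str(error)
    print(missing_key_message)
```

You would then pass `api_key=load_api_key()` when constructing the client.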
The One-Call Extract Pattern
The SDK's primary usage pattern is a single extract() call that handles uploading, extraction, polling, and downloading:
result = client.extract(
    folder_path="./invoices",
    prompt="Extract invoice number, date, vendor name, and total amount",
    output_structure="per_invoice",
    download={"formats": ["xlsx", "json"], "output_path": "./output"},
    console_output=True,
)
That one call uploads every PDF, JPG, and PNG from the folder (batches up to 6,000 files per session), submits the extraction job with your natural language prompt, polls until processing completes, and downloads the output files. The prompt parameter accepts plain English instructions describing the fields you need. The AI determines column names, handles layout variations across different vendor formats, and returns consistently structured results.
Structured Prompts for Production Pipelines
Natural language prompts work well for exploration, but production pipelines need deterministic column names and formatting rules. The dict-based prompt format gives you that control:
prompt = {
    "fields": [
        {"name": "Invoice Number"},
        {"name": "Invoice Date", "prompt": "The date issued, NOT due date"},
        {"name": "Total Amount", "prompt": "No currency symbol, 2 decimals"},
    ],
    "general_prompt": "One record per invoice. YYYY-MM-DD dates.",
}
result = client.extract(
    folder_path="./invoices",
    prompt=prompt,
    output_structure="per_invoice",
    download={"formats": ["json"], "output_path": "./output"},
)
Each field's name becomes the exact column header in your output. The optional prompt on each field provides extraction hints, such as disambiguating invoice date from due date or stripping currency symbols. The general_prompt applies cross-field rules. This structured format means your downstream code can rely on consistent column names regardless of how varied the source invoices are.
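Consistent column names also make it cheap to verify that the formatting rules in the prompt were actually followed. The sketch below assumes the records arrive as a flat list of dictionaries keyed by the field names above. The actual layout of the SDK's JSON output is not shown in this guide and may differ, and `validate` is a hypothetical helper.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical records using the column names from the structured prompt above.
records = [
    {"Invoice Number": "10482", "Invoice Date": "2024-03-15", "Total Amount": "1250.00"},
    {"Invoice Number": "10517", "Invoice Date": "15/03/2024", "Total Amount": "$980.4"},
]


def validate(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if not str(record.get("Invoice Number", "")).strip():
        problems.append("missing Invoice Number")
    try:
        datetime.strptime(str(record.get("Invoice Date", "")), "%Y-%m-%d")
    except ValueError:
        problems.append("Invoice Date is not YYYY-MM-DD")
    try:
        amount = Decimal(str(record.get("Total Amount", "")))
        if amount.as_tuple().exponent != -2:
            problems.append("Total Amount does not have 2 decimals")
    except InvalidOperation:
        problems.append("Total Amount is not a plain number")
    return problems


for record in records:
    print(record["Invoice Number"], validate(record))
```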
Line-Item Extraction
Extracting individual line items from invoices is where template-based tools and raw OCR consistently struggle. Multi-line tables, spanning rows, and varying column layouts make reliable line-item parsing one of the hardest problems in document extraction. With the SDK, you change output_structure to "per_line_item":
result = client.extract(
    folder_path="./invoices",
    prompt="Extract line items: description, quantity, unit price, line total",
    output_structure="per_line_item",
    download={"formats": ["json"], "output_path": "./output"},
)
This produces one row per line item in the output, with each row associated with its parent invoice. The JSON output includes source file and page references, so you can trace every extracted line item back to the original document. For accounts payable (AP) teams doing spend analysis or matching line items against purchase orders, this eliminates the complex template engineering that invoice2data would require and the unreliable table detection that raw OCR produces.
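Per-line-item rows lend themselves to a simple arithmetic check before they reach spend analysis. The following sketch groups rows by source file and flags lines where quantity times unit price does not equal the line total. The key names (`source_file`, `quantity`, and so on) are placeholders chosen for illustration, not the SDK's documented schema.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical per-line-item rows; key names are placeholders.
line_items = [
    {"source_file": "acme_10482.pdf", "description": "Widget A",
     "quantity": "2", "unit_price": "9.99", "line_total": "19.98"},
    {"source_file": "acme_10482.pdf", "description": "Widget B",
     "quantity": "1", "unit_price": "5.00", "line_total": "5.00"},
    {"source_file": "globex_771.pdf", "description": "Service fee",
     "quantity": "3", "unit_price": "40.00", "line_total": "125.00"},
]


def summarize(rows):
    """Sum line totals per source file and collect rows whose arithmetic is off."""
    totals = defaultdict(Decimal)
    mismatches = []
    for row in rows:
        expected = Decimal(row["quantity"]) * Decimal(row["unit_price"])
        actual = Decimal(row["line_total"])
        totals[row["source_file"]] += actual
        if expected != actual:
            mismatches.append((row["source_file"], row["description"]))
    return dict(totals), mismatches


totals, mismatches = summarize(line_items)
print(totals)
print(mismatches)
```

Decimal is used instead of float so that currency comparisons are exact.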
Output Formats and Data Pipelines
The SDK can download output as Excel (.xlsx), CSV (.csv), or JSON (.json), specified in the download parameter. For Python data pipelines, JSON output maps directly to Python dicts and lists, ready for processing with pandas or loading into a database. Excel output includes native data typing (numbers stored as numbers, dates as dates), which matters when the output goes to finance teams who need formulas and pivot tables to work immediately. For CSV workflows, detailed guidance is available on extracting invoice data to CSV format.
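As an illustration of the database path, the sketch below loads JSON records into SQLite using only the standard library. The payload is a hypothetical flat list of records. The SDK's real JSON may nest records differently, so the loading step would need to be adjusted to match.

```python
import json
import sqlite3

# Hypothetical JSON payload: a flat list of per-invoice records.
payload = json.dumps([
    {"Invoice Number": "10482", "Invoice Date": "2024-03-15", "Total Amount": "1250.00"},
    {"Invoice Number": "10517", "Invoice Date": "2024-04-02", "Total Amount": "980.40"},
])


def load_into_sqlite(json_text, connection):
    """Upsert records into an invoices table and return the table's row count."""
    records = json.loads(json_text)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS invoices ("
        "invoice_number TEXT PRIMARY KEY, invoice_date TEXT, total_amount TEXT)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO invoices VALUES (?, ?, ?)",
        [(r["Invoice Number"], r["Invoice Date"], r["Total Amount"]) for r in records],
    )
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]


connection = sqlite3.connect(":memory:")
row_count = load_into_sqlite(payload, connection)
print(row_count)
```

Keying the table on the invoice number makes reruns idempotent: loading the same output twice replaces rows rather than duplicating them. Amounts are stored as text here to avoid float rounding.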
Error Handling
Production integrations need to handle failures gracefully. The SDK provides typed exceptions for API errors and client-side issues:
from invoicedataextraction.errors import SdkError, ApiResponseError
try:
    result = client.extract(
        folder_path="./invoices",
        prompt="Extract invoice number, date, and total amount",
        output_structure="per_invoice",
        download={"formats": ["json"], "output_path": "./output"},
    )
except ApiResponseError as error:
    # The API returned an error response; the body carries a machine-readable code.
    print(error.body["error"]["code"])
    print(error.body["error"]["message"])
except SdkError as error:
    # Assumed here to cover the client-side issues mentioned above.
    print(error)
Error codes like INSUFFICIENT_CREDITS, RATE_LIMITED, and UNAUTHENTICATED let you build specific retry and alerting logic. After a successful extraction, confirm every page processed without issues by checking the pages section of the result dictionary.
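A sketch of what code-specific retry logic might look like follows. To keep it runnable without the SDK installed, it uses a stand-in exception class (`FakeApiResponseError`) that mimics the `error.body["error"]["code"]` layout shown above. Treating only RATE_LIMITED as retryable, and the backoff schedule, are assumptions rather than documented guidance.

```python
import time


class FakeApiResponseError(Exception):
    """Stand-in for the SDK's ApiResponseError, exposing the same body layout."""

    def __init__(self, code, message=""):
        super().__init__(message or code)
        self.body = {"error": {"code": code, "message": message}}


RETRYABLE = {"RATE_LIMITED"}  # assumption: the only code worth retrying


def call_with_retry(operation, error_type, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying retryable error codes with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except error_type as error:
            code = error.body["error"]["code"]
            last_attempt = attempt == attempts - 1
            if code not in RETRYABLE or last_attempt:
                raise
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky():
    """Fail twice with RATE_LIMITED, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeApiResponseError("RATE_LIMITED")
    return "ok"


def out_of_credits():
    raise FakeApiResponseError("INSUFFICIENT_CREDITS")


delays = []  # record delays instead of actually sleeping
outcome = call_with_retry(flaky, FakeApiResponseError, sleep=delays.append)
print(outcome, delays)

try:
    call_with_retry(out_of_credits, FakeApiResponseError, sleep=delays.append)
except FakeApiResponseError as error:
    print("not retried:", error.body["error"]["code"])
```

With the real SDK you would pass something like `lambda: client.extract(...)` as the operation and `ApiResponseError` as the error type.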
Staged Methods for Advanced Control
The REST API underpins everything the Python SDK does. While extract() abstracts the full workflow into a single call, the SDK also exposes individual steps for cases where you need finer control, such as uploading files in one process and triggering extraction in another, or integrating the polling step with your own task queue:
upload = client.upload_files(files=["./invoice1.pdf"], console_output=True)
submitted = client.submit_extraction(
    upload_session_id=upload["upload_session_id"],
    file_ids=upload["file_ids"],
    prompt="Extract invoice number and total",
    output_structure="per_invoice",
)
result = client.wait_for_extraction_to_finish(
    extraction_id=submitted["extraction_id"],
    console_output=True,
)
client.download_output(
    extraction_id=submitted["extraction_id"],
    format="xlsx",
    file_path="./output/invoices.xlsx",
)
The staged methods (upload_files, submit_extraction, wait_for_extraction_to_finish, download_output) map directly to the underlying API endpoints, giving you full control over each step of the extraction workflow.
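For the split-process case (upload in one process, submit in another), the two identifiers returned by the upload step are all that needs to cross the boundary. The sketch below persists them to a JSON file. It assumes both values are JSON-serializable, and `save_handoff` and `load_handoff` are hypothetical helpers, not SDK methods.

```python
import json
import tempfile
from pathlib import Path


def save_handoff(upload, path):
    """Persist only the identifiers the submit step needs."""
    handoff = {
        "upload_session_id": upload["upload_session_id"],
        "file_ids": list(upload["file_ids"]),
    }
    Path(path).write_text(json.dumps(handoff), encoding="utf-8")


def load_handoff(path):
    """Read the identifiers back in the process that will submit the extraction."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Placeholder values standing in for the dictionary returned by upload_files().
fake_upload = {"upload_session_id": "session-123", "file_ids": ["file-a", "file-b"]}

with tempfile.TemporaryDirectory() as folder:
    handoff_path = Path(folder) / "handoff.json"
    save_handoff(fake_upload, handoff_path)
    restored = load_handoff(handoff_path)

print(restored)
```

The second process would pass `restored["upload_session_id"]` and `restored["file_ids"]` to `submit_extraction`, as in the staged example above.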
Choosing the Right Approach for Production
Each approach covered in this guide occupies a different point on the control-versus-effort spectrum. The right choice depends on your invoice volume, vendor diversity, accuracy requirements, and how much engineering time you can allocate to building and maintaining extraction infrastructure.
Comparison Across Production Dimensions
| Dimension | Template-Based (invoice2data) | OCR Pipeline | API/SDK |
|---|---|---|---|
| Template maintenance | One template per vendor layout. Manageable for 5-10 stable vendors; unsustainable at hundreds. Templates break when vendors update their invoice format. | Custom parsing logic per format. Every new layout means new regex or positional rules. | No templates or parsing logic required. The extraction model handles format variation. |
| Scanned document support | Requires separate Tesseract configuration and tuning. | Handles scanned documents natively, though accuracy depends heavily on scan quality. | Handles native PDFs, scanned PDFs, and images without separate OCR setup. |
| Line-item extraction | Difficult. Extracting tabular line items requires complex template engineering with positional rules that are fragile across vendors. | Unreliable. Column alignment issues, merged cells, and inconsistent spacing make structured table extraction from raw OCR text error-prone. | Built-in. Line-item extraction is a standard output mode, not a custom engineering effort. |
| Accuracy at scale | High within template scope. Brittle when a vendor changes their layout or a new vendor appears. | Degrades with poor scan quality, non-standard fonts, and multi-column layouts. | Consistent across formats. AI models understand document structure rather than relying on positional rules. |
| Cost model | Free (open source). The real cost is developer hours for template creation, testing, and ongoing maintenance. | Free (open source). Development cost for building parsing logic, handling edge cases, and monitoring accuracy. | Per-page pricing. Eliminates development, maintenance, and accuracy monitoring costs. |
Batch Processing and Scale
For developers processing hundreds or thousands of invoices, the difference in engineering effort is significant. With invoice2data or a raw OCR pipeline, you need to build your own queuing system, implement error recovery, handle parallel processing, and manage retries for failed documents. That is a substantial amount of infrastructure code that has nothing to do with invoice extraction itself.
An SDK-based approach eliminates that entirely. Batch processing of up to 6,000 files per session with parallelized extraction is handled by the service, not your code. Your Python script submits the batch and downloads the results.
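If a backlog exceeds the 6,000-file session limit quoted above, the one piece of batching that plausibly stays on your side is splitting the file list into sessions. A minimal sketch, assuming the limit applies per call and that only the three file types named earlier are accepted. How the SDK itself behaves when given more files is not covered in this guide.

```python
from pathlib import Path

MAX_FILES_PER_SESSION = 6000  # session limit quoted above
SUPPORTED = {".pdf", ".jpg", ".png"}  # file types named earlier in this guide


def plan_batches(paths, batch_size=MAX_FILES_PER_SESSION):
    """Keep supported file types and split them into batches of at most batch_size."""
    supported = sorted(p for p in map(Path, paths) if p.suffix.lower() in SUPPORTED)
    return [supported[i:i + batch_size] for i in range(0, len(supported), batch_size)]


# Hypothetical backlog: 13,000 PDFs plus one file that should be skipped.
backlog = [f"inv_{number:05d}.pdf" for number in range(13000)] + ["notes.txt"]
batches = plan_batches(backlog)
print([len(batch) for batch in batches])
```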
Security and Data Handling
Invoices contain sensitive financial information: vendor payment details, bank account numbers, tax identifiers, and purchase amounts. Your choice of approach has direct implications for data handling.
Self-hosted open-source approaches (invoice2data, Tesseract, and the other open-source OCR engines mentioned earlier) keep all data on your own infrastructure. You control encryption, access, retention, and compliance. For organizations with strict data residency requirements, this may be a deciding factor.
API-based approaches send documents to a third-party service for processing. Before transmitting sensitive financial documents, verify the provider's data handling policies: Where is data processed and stored? Is it encrypted in transit and at rest? What are the retention and deletion periods? Is the data used for model training? Does the provider hold relevant certifications (SOC 2, ISO 27001) and comply with GDPR or other applicable regulations? These are not optional questions for production financial document processing.
When to Use Each Approach
Template-based (invoice2data): You process invoices from a small, stable set of vendors and want full control with no external dependencies.
OCR pipeline: You need raw text extraction from scanned documents as input to another system, not a complete invoice data extraction solution.
API/SDK: You need production-ready extraction across diverse formats without maintaining parsing infrastructure — the practical choice for automating invoice processing in Python at scale.
The Real Cost Calculation
The decision is rarely "free open-source versus paid API." It is developer time building, debugging, and maintaining a custom parsing pipeline versus a per-page API cost. A template-based approach is genuinely free for five vendors, but the total cost of ownership changes when you factor in ongoing template maintenance as vendors update their formats, accuracy monitoring to catch silent extraction failures, and infrastructure costs for batch processing and error recovery.
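The comparison can be written down as simple arithmetic. Every figure in the sketch below is a placeholder chosen to show the shape of the calculation, not a quoted price or a measured effort. Substitute your own hours, rates, and volumes.

```python
from decimal import Decimal


def monthly_cost_in_house(maintenance_hours, hourly_rate, infrastructure):
    """Developer time plus infrastructure for a self-maintained pipeline."""
    return Decimal(maintenance_hours) * Decimal(hourly_rate) + Decimal(infrastructure)


def monthly_cost_api(pages, price_per_page, integration_hours, hourly_rate):
    """Usage fees plus whatever integration upkeep remains."""
    usage = Decimal(pages) * Decimal(price_per_page)
    upkeep = Decimal(integration_hours) * Decimal(hourly_rate)
    return usage + upkeep


# Placeholder inputs only.
in_house = monthly_cost_in_house(maintenance_hours=20, hourly_rate="75", infrastructure="50")
api = monthly_cost_api(pages=5000, price_per_page="0.10", integration_hours=2, hourly_rate="75")

print(in_house, api)
```

With these invented inputs the in-house figure is 1550 and the API figure is 650.00. The point is that maintenance hours usually dominate the first number, so small changes in vendor churn move the result more than the per-page price does.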
For teams with established AP workflows that need to automate invoice processing, the calculation should include all of these costs. If you are evaluating different integration approaches beyond just Python libraries, the same trade-offs apply when choosing between API, SaaS, and ERP-native invoice capture. The best Python library for invoice extraction depends entirely on what "best" means for your specific production requirements.
About the author
David Harding
Founder, Invoice Data Extraction
David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.
Editorial process
This page is reviewed as part of Invoice Data Extraction's editorial process.
If this page discusses tax, legal, or regulatory requirements, treat it as general information only and confirm current requirements with official guidance before acting. The updated date shown above is the latest editorial review date for this page.