Python PDF Table Extraction: pdfplumber vs Camelot vs Tabula

Compare pdfplumber, Camelot, and tabula-py for extracting tables from PDF invoices. Code examples, invoice-specific tests, and a decision framework.

Topics: API & Developer Integration, PDF table extraction, Python, library comparison

Extracting structured line item data from PDF invoices is a different problem than extracting text. You need a library that can detect tabular structure within the PDF's native text layer, map columns to headers, and return clean rows you can feed into a database or ERP system. Three Python libraries dominate this space: pdfplumber, Camelot, and tabula-py. Each takes a fundamentally different approach to finding and parsing tables, and the right choice depends entirely on the invoice formats hitting your pipeline.

Here is how they compare at a high level. pdfplumber gives you pixel-level control over table detection with built-in visual debugging, making it the strongest option for complex invoice layouts where column boundaries are ambiguous or inconsistent. Camelot uses lattice and stream detection modes that excel at well-structured tables with clear cell borders but struggles with the borderless aesthetic layouts common in modern invoices. tabula-py wraps Java's Tabula engine for fast extraction with minimal code, though it requires a JRE dependency and offers less granular control when edge cases surface. The consistent finding across all three: for production pipelines processing invoices from varied vendors, every library demands significant per-format tuning.

Most Python PDF table extraction tutorials test against clean, predictable documents like employee directories or academic paper tables. Invoice PDFs are structurally harder. They introduce patterns that generic benchmarks miss entirely:

  • Variable column counts across vendors. One supplier includes a discount column; another splits tax into state and federal. Your extraction logic has to handle both without manual reconfiguration.
  • Subtotal, tax, and total rows that break table structure. These summary rows span multiple columns or use different alignment, causing libraries to merge them into line item data or drop them entirely.
  • Borderless aesthetic layouts. Design-forward invoices rely on spacing and alignment rather than ruled lines, which defeats lattice-based detection methods.
  • Multi-page continuation tables. A single purchase order can span three pages, and the library needs to stitch those tables together without duplicating headers or losing rows at page breaks.

This comparison tests each library against these real invoice patterns rather than synthetic benchmarks.

One critical scope boundary: pdfplumber, Camelot, and tabula-py all operate on native text-based PDFs where the text layer is already embedded. Scanned invoice images require OCR preprocessing, a fundamentally different toolchain involving Tesseract or cloud vision APIs before any table detection can begin. This article focuses exclusively on native PDF table extraction. If you need a broader view of the Python extraction landscape that covers OCR-based workflows alongside programmatic parsing, see our broader guide to extracting invoice data with Python.

Fortune Business Insights reports that the global intelligent document processing market was valued at USD 10.57 billion in 2025 and is projected to reach USD 91.02 billion by 2034, with finance and accounting representing 45.57% of the total. At that scale, your choice of Python library for invoice table extraction shapes everything downstream in the pipeline.

pdfplumber: Pixel-Level Control for Complex Invoice Layouts

pdfplumber takes a fundamentally different approach from libraries that treat table extraction as pattern matching. It parses the PDF's underlying character-level layout data—every character's position, every line segment's coordinates—and uses coordinate geometry to identify table boundaries. Instead of guessing where tables might be, you define precisely which regions of a page contain tabular data and how cell boundaries should be detected.

This gives you pixel-level control over extraction. You can specify exact vertical and horizontal line positions, adjust snap tolerances for slightly misaligned rules, and target specific page regions while ignoring headers, footers, and totals blocks. For invoice extraction, that precision matters: invoice tables sit alongside addresses, logos, payment terms, and tax summaries, and you need the parser to isolate the line item table cleanly.

Extracting Invoice Line Items with pdfplumber

Here's a practical example that extracts a line item table from an invoice PDF. The table settings are tuned for a typical invoice structure with description, quantity, unit price, and total columns:

import pdfplumber

def extract_invoice_line_items(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]

        table_settings = {
            "vertical_strategy": "lines",
            "horizontal_strategy": "lines",
            "snap_tolerance": 4,
            "intersection_x_tolerance": 8,
            "intersection_y_tolerance": 8,
            "min_words_vertical": 2,
        }

        table = page.extract_table(table_settings)

        if not table:
            return []

        headers = [
            cell.strip().lower() if cell else ""
            for cell in table[0]
        ]
        line_items = []

        for row in table[1:]:
            if not any(row):
                continue
            item = dict(zip(headers, [
                cell.strip() if cell else "" for cell in row
            ]))
            line_items.append(item)

        return line_items


items = extract_invoice_line_items("vendor_invoice.pdf")
for item in items:
    print(item)
# {'description': 'Consulting Services - March',
#  'quantity': '40', 'unit price': '150.00', 'total': '6,000.00'}
# {'description': 'Expense Reimbursement - Travel',
#  'quantity': '1', 'unit price': '842.50', 'total': '842.50'}

The snap_tolerance and intersection tolerance parameters are doing the real work here. Invoice PDFs generated from different tools produce table lines that don't always align to the exact same coordinates. A snap tolerance of 4 points lets pdfplumber treat nearly-aligned lines as the same boundary, which prevents phantom columns or split cells.

For invoices where the table region shares the page with dense surrounding content, you can crop to a bounding box first:

# Target only the line items region of the page
cropped = page.within_bbox((30, 280, 580, 520))
table = cropped.extract_table(table_settings)

Visual Debugging for Invoice Formats

One of pdfplumber's strongest features for invoice work is its visual debugging. You can render the page with detected table boundaries, cell divisions, and character positions overlaid directly onto the page image:

with pdfplumber.open("vendor_invoice.pdf") as pdf:
    page = pdf.pages[0]
    im = page.to_image(resolution=150)
    im.debug_tablefinder(table_settings)
    im.save("debug_output.png")

This produces an annotated image showing exactly which lines pdfplumber detected, where it placed cell boundaries, and which regions it classified as table content. When a vendor's invoice extracts incorrectly—a merged header cell splitting into two columns, or a description field wrapping across rows—the debug image tells you why in seconds. You can then adjust explicit_vertical_lines or explicit_horizontal_lines in your table settings to force correct boundaries for that format.
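Forcing boundaries this way is just a different vertical_strategy. Here is a minimal sketch of a settings builder, assuming you have read the column x-coordinates (in PDF points) off the debug image; the coordinate values shown are hypothetical placeholders, not values from any real invoice:

```python
def explicit_column_settings(column_xs, snap_tolerance=4):
    # Build pdfplumber table settings that force column boundaries at
    # known x-coordinates (PDF points). The coordinates come from reading
    # the debug image and are per-vendor assumptions, not universal values.
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": list(column_xs),
        "horizontal_strategy": "lines",
        "snap_tolerance": snap_tolerance,
    }

# Usage (assuming `page` is a pdfplumber Page object):
# table = page.extract_table(explicit_column_settings([30, 310, 380, 470, 580]))
```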

Where pdfplumber Is the Strongest Choice

pdfplumber excels in scenarios where invoice layouts defy standard grid assumptions. Invoices with merged header cells (a single "Amount" header spanning "Unit Price" and "Tax" subcolumns), variable-height description rows that wrap across multiple lines, or tables without visible borders that rely on whitespace alignment—these are cases where pdfplumber's fine-tuning parameters let you build an extraction config that handles the format reliably.

If you're processing invoices from a known, stable set of vendors, pdfplumber is a strong fit. You invest time upfront to tune table_settings per vendor format, and the result is precise, repeatable extraction. Teams that maintain per-vendor configurations for a stable supplier base tend to find pdfplumber delivers the highest accuracy of the three libraries.

The Vendor Diversity Problem

The same configurability that makes pdfplumber accurate on known formats becomes a liability as vendor count grows. The table settings that extract line items perfectly from Vendor A's invoice will misparse Vendor B's layout. Each new vendor format requires its own tuned configuration: adjusted tolerances, different line detection strategies, sometimes explicit coordinate overrides.

With a small, stable vendor base, this is manageable. But invoice pipelines that regularly onboard new suppliers—or receive invoices from hundreds of vendors—hit a maintenance wall. Every format change, every new vendor, means another round of visual debugging and parameter tuning. The extraction logic itself works; the operational cost of keeping configurations current across a growing vendor portfolio is what eventually pushes teams toward alternatives.
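One way to keep this manageable while the vendor count is still small is a configuration registry keyed by vendor. A sketch, with hypothetical vendor ids and illustrative placeholder settings:

```python
# Per-vendor pdfplumber table settings. Vendor ids and setting values
# here are illustrative; a real registry grows one tuned entry at a time.
VENDOR_TABLE_SETTINGS = {
    "vendor_a": {"vertical_strategy": "lines", "snap_tolerance": 4},
    "vendor_b": {"vertical_strategy": "text", "snap_tolerance": 6},
}

DEFAULT_SETTINGS = {"vertical_strategy": "lines", "snap_tolerance": 3}

def settings_for(vendor_id):
    # Fall back to permissive defaults for unknown vendors; in practice,
    # a new vendor usually ends up needing its own tuned entry anyway.
    return VENDOR_TABLE_SETTINGS.get(vendor_id, DEFAULT_SETTINGS)
```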


Camelot: Lattice and Stream Modes for Invoice Tables

Camelot approaches table detection differently from pdfplumber by offering two distinct extraction modes, each built for a different kind of PDF structure. Choosing the right mode is the single biggest factor in whether Camelot produces usable output from your invoices.

Lattice mode detects tables by tracing the actual drawn lines in a PDF — cell borders, row separators, and column dividers that exist as vector objects in the document. When an invoice has a fully bordered table with visible gridlines, lattice mode identifies cell boundaries with high precision because it's working with explicit structural data rather than guessing.

Stream mode takes the opposite approach. It infers table structure from whitespace gaps and text alignment patterns when no visible borders exist. Stream mode scans for consistent spacing between text elements and groups them into columns and rows based on their spatial positions on the page.

This distinction matters for invoice processing because it maps directly to the two most common invoice designs you'll encounter in production.

Extracting an Invoice Table with Camelot

A basic Camelot extraction in lattice mode looks like this:

import camelot

# Extract tables using lattice mode (default)
tables = camelot.read_pdf("invoice.pdf", flavor="lattice", pages="1")

if tables:
    # Access the first detected table as a DataFrame
    invoice_table = tables[0].df
    print(invoice_table)

    # Check the parsing accuracy report
    print(tables[0].parsing_report)
    # {'accuracy': 97.85, 'whitespace': 12.3, 'order': 1, 'page': 1}

The parsing_report property returns an accuracy score that reflects how cleanly Camelot could map the detected lines to a table grid. For invoices with well-defined borders, you'll typically see accuracy scores above 90%, and the resulting DataFrame will have correctly aligned columns for item descriptions, quantities, unit prices, and totals.

The Stream Mode Problem with Borderless Invoices

Here's where Camelot's dual-mode design creates a practical challenge for invoice pipelines. Many professionally designed invoices — particularly from SaaS platforms, design agencies, and modern accounting tools — use minimal or no visible table borders. They rely on spacing, shading, and typography to visually separate line items rather than drawn gridlines.

These borderless invoices force you into stream mode:

tables = camelot.read_pdf("borderless_invoice.pdf", flavor="stream", pages="1")

Stream mode's whitespace-based detection is significantly less reliable than lattice for real invoice data. The core issue: invoice layouts frequently use variable spacing between columns. A short item description leaves a wide gap before the quantity column, while a long description compresses that gap. Stream mode interprets these inconsistencies as column boundaries shifting, which causes it to misalign columns or merge adjacent fields into a single cell.

Accuracy scores for stream-mode extractions on borderless invoices can drop into the 50–70% range. At that level, you need programmatic validation on every extracted table before the data enters your pipeline:

for table in tables:
    report = table.parsing_report
    if report["accuracy"] < 80:
        # Flag for manual review or fallback extraction
        log_low_confidence_extraction(table, report)

Deployment and Multi-Page Constraints

Two practical constraints affect Camelot in production environments. First, Camelot depends on Ghostscript as a system-level dependency — it's not bundled in the pip install. In containerized deployments, you need to add Ghostscript to your Docker image or system packages, which increases image size and adds a dependency that lives outside your Python dependency management.

Second, Camelot processes one page at a time. Multi-page invoice tables — common with purchase orders containing dozens of line items — require you to detect where a table continues from one page to the next and stitch the resulting DataFrames together manually. There's no built-in continuation detection, so your code needs to handle matching column structures across pages and removing repeated header rows that many invoices print on each page.


tabula-py: Java-Backed Extraction and the JRE Trade-Off

tabula-py is not a native Python extraction engine. It is a Python wrapper around Tabula, a Java library, and that distinction shapes every deployment decision you will make with it. When you call tabula-py's extraction functions, Python spawns a JVM subprocess that does the actual PDF parsing. The structured data flows back into your Python process as a pandas DataFrame. This architecture means every machine in your pipeline, from your laptop to your production server, needs a Java Runtime Environment installed alongside Python.

Here is a basic extraction using tabula-py against an invoice PDF:

import tabula

# Lattice mode for invoices with visible cell borders
dfs = tabula.read_pdf(
    "invoice_009.pdf",
    pages="1",
    lattice=True,
    pandas_options={"header": 0}
)

invoice_table = dfs[0]
print(invoice_table)
#   Item Code     Description  Qty  Unit Price  Amount
# 0   WDG-441   Steel bracket  200        3.25  650.00
# 1   WDG-442  Mounting plate  100        7.80  780.00
# 2   WDG-443  Hex bolt M8x40  500        0.45  225.00

Switching to stream=True instead of lattice=True targets invoices that lack visible borders and rely on whitespace alignment. The API surface is minimal, which is either an advantage or a limitation depending on how much control you need.

The JRE dependency in practice

On a developer's local machine, installing Java is a non-issue. The friction shows up in production. A slim Python Docker image runs 50-80 MB; adding a JRE pushes it to 300-500 MB or more, multiplying across every container instance and slowing cold starts. CI/CD pipelines need Java as a build step. Serverless platforms like Lambda and Cloud Functions impose hard package size limits that a JRE can exceed. Your Dockerfile now manages two language runtimes, and you are tracking security patches for both.

Where tabula-py earns its place

For well-bordered invoice tables with consistent grid layouts, tabula-py is fast and requires almost no configuration. If your invoices come from a single vendor or ERP system with a standardized template, calling read_pdf() with lattice=True may be all you need. The library gets you from PDF to DataFrame in the fewest lines of code of any option in this comparison.

tabula-py also benefits from the broader Tabula ecosystem. Teams already familiar with the Tabula desktop application can transfer that knowledge directly, and there is substantial community documentation around Tabula's extraction behavior.

Where it falls short on invoices

tabula-py offers less fine-grained control than pdfplumber over table detection. When extraction produces garbage data or merged columns, the error output gives you little to diagnose the cause. Debugging becomes trial and error: adjusting area coordinates, toggling between lattice and stream, re-running until the output looks right. Where pdfplumber offers visual rendering of detected lines and characters to pinpoint exactly why extraction failed, tabula-py has no equivalent.


Invoice Patterns That Break Open-Source Table Extraction

Library demos look great when the input is a single, clean, well-structured invoice. Production pipelines process thousands of invoices from hundreds of vendors, and that is where pdfplumber, Camelot, and tabula-py all hit the same walls. Understanding these failure patterns before you commit to a library-based architecture saves months of debugging.

Subtotal and Tax Rows That Corrupt Table Output

Most invoices embed summary rows — subtotal, tax, shipping, total — directly below the line item table or even within it. These rows follow a completely different column structure than the invoice line items above them. A line item row might have columns for quantity, description, unit price, and amount. The subtotal row might span the first three columns with a label and place a value only in the last column.

All three libraries detect these summary rows as part of the same table. The result is misaligned data: values shift into wrong columns, empty cells appear where the parser expected content, and your downstream code cannot distinguish a $4,500 line item total from a $4,500 invoice subtotal. You need post-processing logic that identifies summary rows by pattern (keyword matching on "subtotal," "tax," "total") and separates them from actual line items. This logic is straightforward for one invoice format and becomes a maintenance burden across dozens.
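That post-processing step can be as simple as a keyword filter over the first populated cell of each row. A minimal sketch, assuming English-language labels; the keyword list is illustrative and real pipelines accumulate per-vendor and per-locale additions:

```python
SUMMARY_KEYWORDS = ("subtotal", "tax", "shipping", "total", "amount due")

def split_summary_rows(rows):
    # Separate line items from summary rows by keyword match on the first
    # non-empty cell. Substring matching is crude but illustrates the idea.
    items, summaries = [], []
    for row in rows:
        label = next((c for c in row if c and c.strip()), "").strip().lower()
        if any(keyword in label for keyword in SUMMARY_KEYWORDS):
            summaries.append(row)
        else:
            items.append(row)
    return items, summaries
```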

Multi-Page Line Item Tables

Invoices with large orders routinely span two, five, or twenty pages. pdfplumber and Camelot process each page independently with no awareness that page two's table is a continuation of page one's. The developer is responsible for detecting table continuation across page breaks and stitching the results into a single dataset.

This creates two sub-problems. Some invoice templates repeat the header row on every page, so you need deduplication logic to strip repeated headers from concatenated results. Other templates print headers only on the first page, meaning continuation pages produce raw data with no column labels — you have to map columns by position using the first page's structure as a reference.

tabula-py's pages="all" parameter is a partial improvement. It processes every page in one call, but still returns separate DataFrames per page that require merging. If the table structure shifts slightly between pages (a common occurrence with dynamic PDF generators), the merge breaks.
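For the repeated-header case, the merge logic reduces to dropping any row that matches the first page's header. A minimal sketch over tables as lists of rows (the shape pdfplumber returns; Camelot and tabula-py DataFrames can be converted first with .values.tolist()):

```python
def stitch_pages(page_tables):
    # Merge per-page tables into one, keeping a single header row.
    # Assumes pages that repeat the header print it identically; templates
    # that print headers only on page one need position-based mapping instead.
    if not page_tables:
        return []
    header = page_tables[0][0]
    merged = [header]
    for table in page_tables:
        for row in table:
            if row == header:
                continue  # drop repeated header rows on continuation pages
            merged.append(row)
    return merged
```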

Borderless Tables: The Hardest Pattern for Rule-Based Extraction

When an invoice has visible grid lines, all three libraries can identify cell boundaries with reasonable accuracy. Remove those lines, and extraction reliability drops sharply. This is the single hardest pattern for rule-based extraction, and it is common — many invoices generated from accounting software, ERPs, or email templates use whitespace alignment instead of borders.

pdfplumber requires you to manually specify table boundary coordinates or write heuristic logic to infer them from text positioning. Camelot's stream mode is designed for this scenario and is the most direct approach, but it frequently misaligns columns when spacing is inconsistent or when a description field wraps to multiple lines. tabula-py's stream mode has similar reliability problems. In a comparison of pdfplumber vs Camelot vs tabula-py on borderless invoices, none consistently produce usable output without significant per-template tuning.

Mixed PDF Batches: Text-Based and Scanned Documents

Production invoice pipelines rarely have the luxury of processing only one document type. Vendors send a mix of text-based PDFs (where text is embedded and extractable) and scanned PDFs (where the page is a flat image). None of the three libraries can extract data from scanned PDFs. They operate on embedded text layers — if the PDF contains only an image, they return nothing.

Your pipeline needs document classification logic that runs before extraction: detect whether each incoming invoice is a text-based PDF or a scanned PDF, then route scanned documents through an OCR preprocessing step before table extraction can begin. This adds an entire processing stage with its own accuracy challenges. For guidance on handling scanned documents effectively, developers can reference strategies for improving invoice extraction accuracy.
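The classification itself can hinge on how much text the extractor recovers from a page. A minimal routing sketch; the 20-character threshold is an assumption to tune against your own document mix:

```python
def classify_document(page_text, min_chars=20):
    # Route a page: "text" if it has a usable embedded text layer,
    # "scan" if it is probably a flat image that needs OCR first.
    # min_chars is a heuristic threshold, not a standard value.
    text = (page_text or "").strip()
    return "text" if len(text) >= min_chars else "scan"

# Usage with pdfplumber (any text extractor works the same way):
# with pdfplumber.open(path) as pdf:
#     route = classify_document(pdf.pages[0].extract_text())
```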

The Per-Vendor Tuning Problem at Scale

This is the pattern that compounds all the others. The extraction configuration you build and test for Vendor A's invoice — the table area coordinates, the column boundaries, the summary row detection rules, the page-stitching logic — will almost certainly fail on Vendor B's layout. Different fonts, different column ordering, different spacing, different summary row placement.

As your vendor count grows from 5 to 50 to 500, maintaining separate extraction configurations per vendor format becomes the dominant engineering cost. It is no longer a data extraction problem; it is a configuration management problem. Each vendor's output also requires post-processing to normalize into a consistent schema — mapping "Qty" to "quantity," handling tax as a percentage versus a flat amount, parsing dates in varying formats. For developers building pipelines that need consistent output, converting extracted invoice data to structured JSON is a common next step that adds another layer of per-vendor code.
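The header-normalization piece of that per-vendor code often starts as a plain alias table. A sketch, assuming the target schema names shown here; the alias entries are illustrative and grow with every vendor onboarded:

```python
# Vendor header label -> canonical schema field. Illustrative entries only.
HEADER_ALIASES = {
    "qty": "quantity",
    "quantity": "quantity",
    "description": "description",
    "desc": "description",
    "unit price": "unit_price",
    "rate": "unit_price",
    "amount": "line_total",
    "total": "line_total",
    "line total": "line_total",
}

def normalize_headers(headers):
    # Map extracted header labels onto one schema; unknown labels pass
    # through lowercased so they surface during review instead of vanishing.
    return [HEADER_ALIASES.get(h.strip().lower(), h.strip().lower()) for h in headers]
```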

The practical ceiling of open-source table extraction is not any single library's parsing accuracy. It is the engineering effort required to handle the full diversity of real invoices at scale.


When a Managed Extraction API Replaces the Library Approach

The previous sections expose a pattern: each library requires per-vendor tuning, and that tuning breaks when invoice formats change. At some point, the engineering cost of maintaining extraction rules exceeds the cost of letting someone else solve the problem.

Four thresholds signal it's time to switch:

  1. Vendor format diversity. Your pipeline processes invoices from more than a handful of vendors, each with different table structures, column orders, and labeling conventions. Every new vendor means a new configuration pass through pdfplumber, Camelot, or tabula-py.
  2. Mixed document types. The incoming document mix includes both native text-based PDFs and scanned PDFs or images. You're maintaining separate OCR pipelines alongside your table extraction logic, with classification code to route documents to the right path.
  3. Multi-page complexity. Table stitching across page breaks and subtotal row separation require significant custom code for each vendor format. The heuristics you built for one vendor's three-page invoice fail on another vendor's five-page layout.
  4. Format drift. Vendors update their invoice templates without warning. Your extraction rules break silently, producing incomplete data that surfaces only when someone notices missing line items downstream.

Practical implementation with the Python SDK

Instead of the rule-based parsing that open-source libraries rely on, AI-based extraction analyzes the document's visual and textual structure together, identifying fields and table rows without format-specific rules. The Invoice Data Extraction API provides a Python SDK built on this approach. Install it from PyPI:

pip install invoicedataextraction-sdk

Authenticate with an API key from the dashboard, then call the client.extract() method with either a natural language prompt or structured field definitions:

from invoicedataextraction import InvoiceDataExtraction
import os

client = InvoiceDataExtraction(
    api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY")
)

result = client.extract(
    folder_path="./invoices",
    prompt={
        "fields": [
            {"name": "Invoice Number"},
            {"name": "Invoice Date"},
            {"name": "Vendor Name"},
            {"name": "Line Item Description"},
            {"name": "Quantity"},
            {"name": "Unit Price"},
            {"name": "Line Total"}
        ],
        "general_prompt": "Extract all line items. Dates in YYYY-MM-DD format."
    },
    output_structure="per_invoice",
    download={"formats": ["xlsx", "json"], "output_path": "./output"},
    console_output=True
)

The SDK handles upload, extraction, and result download in that single client.extract() call. Output is available as XLSX, CSV, or JSON. Both native text PDFs and scanned images are processed natively, so you can drop the document classification step and the separate OCR pipeline entirely. Batch processing handles up to 6,000 mixed-format files in one job.

The prompt-driven workflow is what eliminates per-vendor configuration. Instead of writing geometric extraction rules for each invoice layout, you describe what data you need. The same prompt works across vendors because the AI model interprets the document structure rather than relying on fixed coordinate mappings.

The cost trade-off, honestly

The API uses credit-based pricing: one credit per successfully processed page, with credits only consumed on successful extractions. A permanent free tier covers 50 pages per month with full functionality, no credit card required.

For production volumes, this is a recurring per-page cost. Open-source libraries are free to run but carry ongoing maintenance costs: developer time spent building per-vendor configurations, debugging edge cases, handling format changes, and maintaining OCR infrastructure for scanned documents. The trade-off favors a managed API when vendor format diversity makes the cumulative maintenance hours more expensive than the per-page processing cost. For a team processing invoices from three consistent vendors, the math may favor open-source. For a team handling invoices from dozens of vendors with mixed document types, the maintenance cost compounds quickly.

Decision summary

|                     | pdfplumber                      | Camelot                           | tabula-py                   | Managed API              |
|---------------------|---------------------------------|-----------------------------------|-----------------------------|--------------------------|
| Borderless invoices | Manual coordinate tuning        | Stream mode (unreliable)          | Stream mode (unreliable)    | Handled natively         |
| Multi-page tables   | Manual page stitching           | Manual page stitching             | pages="all", needs merging  | Automatic                |
| Debugging tools     | Visual rendering                | Accuracy score only               | None                        | N/A                      |
| Dependencies        | pip only                        | Ghostscript                       | JRE                         | pip only (SDK)           |
| Multi-vendor scale  | Config per vendor               | Config per vendor                 | Config per vendor           | Single prompt            |
| Best fit            | Complex layouts, stable vendors | Bordered tables, standard formats | Quick extraction, JRE OK    | Diverse vendors at scale |

When the best Python library to extract PDF tables still requires a new configuration pass for every vendor, the library itself is no longer the bottleneck worth optimizing.

About the author


David Harding

Founder, Invoice Data Extraction

David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.

Editorial process

This page is reviewed as part of Invoice Data Extraction's editorial process.

If this page discusses tax, legal, or regulatory requirements, treat it as general information only and confirm current requirements with official guidance before acting. The updated date shown above is the latest editorial review date for this page.

Continue Reading

Extract invoice data to Excel with natural language prompts

Upload your invoices, describe what you need in plain language, and download clean, structured spreadsheets. No templates, no complex configuration.

  • Exceptional accuracy on financial documents
  • 1–8 seconds per page with parallel processing
  • 50 free pages every month — no subscription
  • Any document layout, language, or scan quality
  • Native Excel types — numbers, dates, currencies
  • Files encrypted and auto-deleted within 24 hours