Converting invoices to JSON means feeding invoice PDFs or images into an extraction engine that uses AI and OCR to identify fields (invoice number, date, vendor name, line items, tax, totals) and output them as structured JSON. For developers, this typically happens through a REST API call or an SDK method that accepts a file and returns parsed data. Non-technical users can accomplish the same thing by uploading documents to a web platform and downloading the JSON output directly.
JSON is the natural target format for this workflow. According to Cloudflare's analysis of API traffic patterns, JSON accounts for approximately 97% of API request payloads across their global network, dwarfing XML and every other interchange format. If extracted invoice data needs to flow into a database, feed a downstream API, or populate a data pipeline, JSON is what those systems already expect. There is no serialization step, no format translation. The data arrives ready to use.
The practical question is how you get from a stack of invoice files to clean JSON output. This guide covers three paths:
- The JSON output structure itself. Before writing extraction code, you need to understand what the resulting data looks like: which fields are extracted, how line items nest, and what schema choices matter for your downstream consumers.
- SDK-based extraction with Python and Node.js. The fastest route for most developers. A single method call handles upload, AI-powered extraction, and JSON output. Both languages are covered with working code.
- REST API integration for custom pipelines. When you need full control over the HTTP layer, retry logic, or webhook-driven architectures, the REST API gives you direct access to the same extraction engine.
Platforms like Invoice Data Extraction provide all three pathways: a web interface for ad-hoc uploads (no code required), official Python and Node.js SDKs, and a REST API. The underlying AI handles OCR on scanned documents and images, identifies invoice fields and line-item tables, and returns structured JSON you can consume directly.
What Invoice JSON Output Looks Like
Before writing integration code, you need to know exactly what comes back from an extraction. JSON (ECMA-404) gives you typed fields, nested structures, and direct compatibility with document databases and REST APIs.
The structure of your extracted invoice JSON depends on one decision: do you need one object per invoice or one object per line item?
Invoice-Level JSON
When you set output_structure to per_invoice, each invoice produces a single JSON object containing header-level totals and metadata. Here is a realistic example of structured invoice data in JSON format:
[
{
"invoice_number": "INV-2024-03842",
"invoice_date": "2024-11-15",
"due_date": "2024-12-15",
"vendor_name": "Cascade Cloud Services Ltd.",
"vendor_address": "47 Richmond Street, Vancouver, BC V6B 1E3",
"currency": "CAD",
"subtotal": 4250.00,
"tax_amount": 552.50,
"total_amount": 4802.50,
"source_file": "cascade-nov-2024.pdf",
"page": 1
},
{
"invoice_number": "INV-88910",
"invoice_date": "2024-11-18",
"due_date": "2025-01-17",
"vendor_name": "Primewell Industrial Supply",
"vendor_address": "1200 N Harbor Blvd, Suite 300, Fullerton, CA 92832",
"currency": "USD",
"subtotal": 11780.00,
"tax_amount": 1060.20,
"total_amount": 12840.20,
"source_file": "primewell-q4-batch.pdf",
"page": 3
}
]
Each field name in your prompt becomes a key in the output objects. The source_file and page fields trace every record back to its origin document, which matters when you are processing hundreds of invoices in a single batch.
Line-Item-Level JSON
Setting output_structure to per_line_item breaks each invoice into its individual line items. This is the structure you want when invoice line items need to land in a transactions table or feed into a cost-allocation system:
[
{
"invoice_number": "INV-2024-03842",
"invoice_date": "2024-11-15",
"vendor_name": "Cascade Cloud Services Ltd.",
"description": "Dedicated GPU instance (A100) - monthly",
"quantity": 2,
"unit_price": 1850.00,
"line_total": 3700.00,
"source_file": "cascade-nov-2024.pdf",
"page": 1
},
{
"invoice_number": "INV-2024-03842",
"invoice_date": "2024-11-15",
"vendor_name": "Cascade Cloud Services Ltd.",
"description": "Managed backup storage (500 GB)",
"quantity": 1,
"unit_price": 550.00,
"line_total": 550.00,
"source_file": "cascade-nov-2024.pdf",
"page": 1
},
{
"invoice_number": "INV-88910",
"invoice_date": "2024-11-18",
"vendor_name": "Primewell Industrial Supply",
"description": "Stainless steel hex bolts M12x40 (box of 200)",
"quantity": 15,
"unit_price": 42.00,
"line_total": 630.00,
"source_file": "primewell-q4-batch.pdf",
"page": 3
}
]
Notice that invoice_number, invoice_date, and vendor_name repeat on every line-item row. This is deliberate. It produces a flat structure where each JSON object is self-contained.
Flat vs. Nested Invoice JSON Schema
The examples above show a flat structure: every object carries all the context it needs. This maps directly to database rows and works well for bulk inserts into relational tables, data warehouses, or any system expecting tabular data.
A nested structure groups line items under their parent invoice:
{
"invoice_number": "INV-2024-03842",
"invoice_date": "2024-11-15",
"vendor_name": "Cascade Cloud Services Ltd.",
"total_amount": 4802.50,
"line_items": [
{
"description": "Dedicated GPU instance (A100) - monthly",
"quantity": 2,
"unit_price": 1850.00,
"line_total": 3700.00
},
{
"description": "Managed backup storage (500 GB)",
"quantity": 1,
"unit_price": 550.00,
"line_total": 550.00
}
]
}
When to use which:
- Flat is the right choice when your downstream consumer is a SQL database, a pandas DataFrame, or anything that expects uniform rows. It also maps cleanly to CSV if you ever need a fallback format.
- Nested preserves the natural document hierarchy. It is better for API request/response payloads, MongoDB or other document-oriented databases, and any front-end rendering where you display invoices with expandable line-item details.
You can get the flat structure directly from the extraction output and reshape it to nested in your application code, or vice versa. The extraction prompt and output_structure parameter give you control over which shape you start with.
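As a sketch of that reshaping step, here is one way to regroup flat per_line_item records (field names follow the examples above) into the nested shape:

```python
# Regroup flat per_line_item records into one nested object per invoice.
# Header fields repeat on every flat row, so they identify the parent.
HEADER_FIELDS = ("invoice_number", "invoice_date", "vendor_name")

def nest_line_items(flat_records):
    invoices = {}
    for record in flat_records:
        key = record["invoice_number"]
        if key not in invoices:
            # Carry the header-level context once per invoice
            invoices[key] = {f: record[f] for f in HEADER_FIELDS}
            invoices[key]["line_items"] = []
        # Everything that is not header context becomes a line-item field
        item = {k: v for k, v in record.items() if k not in HEADER_FIELDS}
        invoices[key]["line_items"].append(item)
    return list(invoices.values())
```

The inverse direction (nested to flat) is the same loop in reverse: copy the header fields onto each element of line_items.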
Field Formatting and Standardization
Raw invoice PDFs contain dates in dozens of formats ("Nov 15, 2024", "15/11/2024", "2024.11.15") and currency values with inconsistent symbols and separators. Your extraction prompt controls how these get standardized in the JSON output.
Two formatting rules worth enforcing in every extraction:
- Dates as ISO 8601 strings (YYYY-MM-DD). A prompt instruction like "Format all dates as YYYY-MM-DD" ensures consistent sorting, comparison, and parsing across languages. No ambiguity between month-first and day-first conventions.
- Currency amounts as numbers, not strings. Extracting 4802.50 rather than "$4,802.50" or "4.802,50 €" means your code can perform arithmetic immediately without stripping symbols or guessing decimal conventions. Add the prompt directive: "Ensure all currency fields have 2 decimal places."
These formatting choices are not cosmetic. They eliminate an entire category of parsing bugs downstream and make your invoice JSON schema predictable across vendors, currencies, and locales.
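If you are cleaning up values that arrived without those prompt rules (say, from an older export), a small normalizer can enforce the same two conventions after the fact. The date format list and the separator heuristics below are assumptions; extend them for your vendor mix:

```python
import re
from datetime import datetime

# Candidate input formats -- an assumption; extend for your vendors.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y.%m.%d", "%b %d, %Y")

def to_iso_date(raw: str) -> str:
    """Normalize a date string to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def to_amount(raw: str) -> float:
    """Parse '$4,802.50' or '4.802,50 €' into a plain float."""
    s = re.sub(r"[^\d.,-]", "", raw)
    if "," in s and "." in s:
        # Rightmost separator is the decimal point
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "," in s:
        # A lone comma with exactly two trailing digits is a decimal comma
        head, _, tail = s.rpartition(",")
        s = (head.replace(",", "") + "." + tail) if len(tail) == 2 else s.replace(",", "")
    return float(s)
```

Prompt-level standardization remains the better option when you control the extraction, since it fixes the data at the source rather than patching it downstream.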
Extracting Invoice Data to JSON with Python
Install the official SDK (requires Python 3.9+):
pip install invoicedataextraction-sdk
Initialize the client using an API key stored in an environment variable:
import os
from invoicedataextraction import InvoiceDataExtraction
client = InvoiceDataExtraction(
api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY")
)
The SDK's extract() method handles the entire workflow in a single call — uploading files, submitting the extraction job, polling for completion, and downloading results.
result = client.extract(
folder_path="./invoices",
prompt="Extract invoice number, invoice date, vendor name, line items with description, quantity, unit price, and line total",
output_structure="per_line_item",
download={"formats": ["json"], "output_path": "./output"},
console_output=True,
)
Point folder_path at a directory of PDFs (or images, or scanned documents), and the SDK processes every file it finds. Setting output_structure to "per_line_item" produces one JSON record per line item rather than one per invoice — useful when you need to feed rows directly into a database or accounting system. If you need finer control over how tables are parsed from the PDF before converting to JSON, Python libraries like pdfplumber, Camelot, and tabula-py handle that lower layer directly. The download parameter tells the SDK to write JSON files to ./output once extraction finishes.
Structured prompts for precise field control
A plain string prompt works for straightforward extractions. When you need tighter control over field names and formatting, pass a prompt object instead:
result = client.extract(
folder_path="./invoices",
prompt={
"fields": [
{"name": "Invoice Number"},
{"name": "Invoice Date", "prompt": "Format as YYYY-MM-DD"},
{"name": "Vendor Name"},
{"name": "Total Amount", "prompt": "Numeric, no currency symbol, 2 decimal places"},
],
"general_prompt": "One record per invoice. Skip email cover sheets.",
},
download={"formats": ["json"], "output_path": "./output"},
console_output=True,
)
The fields array defines exactly which data points to extract and how to format each one. The general_prompt applies instructions across the entire batch — filtering out non-invoice pages, setting record granularity, or specifying how to handle edge cases.
Error handling
The SDK exposes two exception types worth catching in production code:
from invoicedataextraction.errors import SdkError, ApiResponseError
try:
result = client.extract(
folder_path="./invoices",
prompt="Extract invoice number, date, vendor, and total",
download={"formats": ["json"], "output_path": "./output"},
)
except SdkError as e:
# Client-side issues: invalid parameters, file read errors, network failures
print(f"SDK error: {e}")
except ApiResponseError as e:
# Server-side issues: authentication failure, quota exceeded, malformed request
print(f"API error: {e}")
SdkError covers client-side problems like bad parameters or network failures. ApiResponseError surfaces issues from the API itself — expired keys, exceeded quotas, or invalid job configurations. For full SDK documentation, see the Python SDK reference.
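For transient failures, a retry wrapper is often worth adding around the extract() call. The helper below is not part of the SDK; it is a generic sketch, and which exceptions count as retryable is your call:

```python
import time

def with_retries(fn, retryable=(ConnectionError, TimeoutError),
                 attempts=3, base_delay=1.0):
    """Call fn(), retrying the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # Exhausted: surface the final failure
            time.sleep(base_delay * 2 ** attempt)
```

You might then write result = with_retries(lambda: client.extract(...), retryable=(SdkError,)) to retry client-side failures while letting ApiResponseError (quota, auth) surface immediately.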
Extracting Invoice Data to JSON with Node.js
Install the SDK from npm:
npm install @invoicedataextraction/sdk
The package requires Node.js 18+ and is ESM only, so your package.json needs "type": "module". TypeScript declarations ship with the package — no separate @types install required.
Initialize the client with your API key:
import InvoiceDataExtraction from "@invoicedataextraction/sdk";
const client = new InvoiceDataExtraction({
api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});
The SDK exposes an async extract() method that handles file upload, processing, and download in a single call:
const result = await client.extract({
folder_path: "./invoices",
prompt: "Extract invoice number, invoice date, vendor name, line items with description, quantity, unit price, and line total",
output_structure: "per_line_item",
download: { formats: ["json"], output_path: "./output" },
console_output: true,
});
The parameters work identically to the Python SDK. For finer control over extracted fields, pass a structured prompt object:
const result = await client.extract({
folder_path: "./invoices",
prompt: {
fields: [
{ name: "Invoice Number" },
{ name: "Invoice Date", prompt: "Format as YYYY-MM-DD" },
{ name: "Vendor Name" },
{ name: "Total Amount", prompt: "Numeric, no currency symbol, 2 decimals" },
],
general_prompt: "One record per invoice. Skip cover pages.",
},
output_structure: "per_invoice",
download: { formats: ["json"], output_path: "./output" },
});
One naming convention worth noting: method names use camelCase (extract, uploadFiles, submitExtraction) while option keys use snake_case (folder_path, output_structure, api_key). The snake_case keys match the REST API response format directly, so you can pass API-level parameters without translation.
Handle errors at two levels. SDK and network errors throw exceptions, which you catch with a standard try/catch. Task-level failures — a corrupted PDF, an unreadable scan — are reported in the response object:
try {
const result = await client.extract({
folder_path: "./invoices",
prompt: "Extract invoice number, date, vendor, and total",
download: { formats: ["json"], output_path: "./output" },
});
if (!result.success) {
console.error("Extraction failed:", result);
}
} catch (error) {
console.error("SDK or network error:", error.message);
}
Every method on the client is async and returns a Promise, so the SDK fits into existing Express middleware, serverless functions, or queue workers without blocking. If you want to skip the SDK layer entirely and call vision LLMs like GPT-4o or Claude directly from Node.js, that approach gives you full control over model selection, prompt engineering, and structured output with Zod schemas — though you take on more of the orchestration yourself.
Using the REST API for Custom Invoice-to-JSON Pipelines
The Python and Node.js SDKs handle the multi-step extraction workflow for you, and they are the recommended starting point for most projects. But there are solid reasons to work with the REST API directly. You may be building in Go, Ruby, or Java without an official SDK. Your pipeline may already have an HTTP client layer you want to reuse. Or you may be orchestrating extraction through tools like Airflow or AWS Step Functions where each stage runs independently and SDK abstractions get in the way.
The invoice data extraction API exposes a straightforward five-step workflow. Each step maps to a single HTTP call, which makes it easy to distribute across orchestration tasks or wrap in whatever retry logic your infrastructure already provides.
The API Workflow
Authentication uses a Bearer token in the Authorization header. Generate your API key from the dashboard — the API shares the same credit-based pricing as the web interface with no separate subscription fees.
The extraction sequence works as follows:
- Create an upload session. Send a POST request to /v1/uploads/sessions with metadata about the files you plan to upload. The response returns presigned URLs for each file.
- Upload your invoice files. Use the presigned URLs to upload PDFs, JPGs, or PNGs directly. File limits are 150 MB per PDF, 5 MB per image, and up to 2 GB or 6,000 files per batch.
- Submit the extraction task. POST to /v1/extractions with your prompt and the output_structure parameter. For JSON output, set output_structure to "per_invoice" or "per_line_item" depending on the granularity you need, or use "automatic" to let the engine decide. This call returns immediately with an extraction ID.
- Poll for completion. GET /v1/extractions/{id} until the status indicates the task is finished. All operations are asynchronous — nothing blocks while extraction runs.
- Download the JSON output. Once complete, the response body includes output.json_url alongside xlsx_url and csv_url. Fetch the JSON URL to retrieve your structured invoice data. These download URLs are presigned and expire after 5 minutes, so generate a fresh one if your pipeline has a delay between polling and download.
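Sketched end to end in stdlib Python (no third-party HTTP client), the five steps look like this. The base URL, the JSON field names (id, status, the "completed"/"failed" values), and the upload-session payload are assumptions for illustration; the endpoint paths and 5-second polling interval follow the steps above:

```python
import json
import time
import urllib.request

API_BASE = "https://api.example.com"  # assumption: substitute the real API host

def build_request(method, path, token, payload=None):
    """Construct an authenticated JSON request (pure, so it is easy to test)."""
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        API_BASE + path,
        data=data,
        method=method,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

def api_request(method, path, token, payload=None):
    with urllib.request.urlopen(build_request(method, path, token, payload)) as resp:
        return json.loads(resp.read())

def run_extraction(token, prompt):
    # 1. Create an upload session (real file metadata omitted for brevity)
    session = api_request("POST", "/v1/uploads/sessions", token,
                          {"files": [{"name": "invoice.pdf"}]})
    # 2. Upload each file to its presigned URL from `session` (one PUT per file)...
    # 3. Submit the extraction task; this returns immediately with an ID
    task = api_request("POST", "/v1/extractions", token,
                       {"prompt": prompt, "output_structure": "per_line_item"})
    # 4. Poll for completion, respecting the 5-second minimum interval
    while True:
        status = api_request("GET", f"/v1/extractions/{task['id']}", token)
        if status["status"] in ("completed", "failed"):
            break
        time.sleep(5)
    # 5. Download the JSON output via its short-lived presigned URL
    with urllib.request.urlopen(status["output"]["json_url"]) as resp:
        return json.loads(resp.read())
```

Each function maps to one orchestration stage, so the same helpers slot into an Airflow task or Step Functions state without the loop.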
Rate Limits to Plan Around
Two limits matter most for pipeline design. Extraction submissions are capped at 30 requests per minute, which defines your maximum batch throughput. Polling is allowed at 120 requests per minute, but a minimum interval of 5 seconds between status checks is recommended to avoid unnecessary load. If you are processing large batches, structure your pipeline to submit in controlled bursts and poll with exponential backoff rather than tight loops. At high volume, these design choices also affect your bill — see our breakdown of techniques to reduce invoice extraction API costs at scale for concrete savings estimates.
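The recommended polling behavior can be captured as a delay schedule: start at the 5-second minimum and back off exponentially up to a cap. The growth factor and cap below are assumptions; tune them to your batch sizes:

```python
def poll_delays(minimum=5.0, factor=1.5, cap=60.0):
    """Yield successive polling delays: 5s minimum, exponential growth, capped."""
    delay = minimum
    while True:
        yield delay
        delay = min(delay * factor, cap)
```

Between status checks, sleep for the next yielded delay instead of looping tightly; the schedule stays under the 120 requests/minute polling limit from the first iteration.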
The API quickstart guide for invoice extraction walks through the full setup from key generation to first successful extraction.
JSON vs CSV vs XLSX: Choosing the Right Invoice Data Format
The extraction process itself is format-agnostic. A single extraction task produces results you can download as JSON, CSV, or XLSX simultaneously. The real question is what consumes the data downstream.
| Factor | JSON | CSV | XLSX |
|---|---|---|---|
| Best consumer | Code (APIs, pipelines, databases) | Flat-file imports, spreadsheet tools | People (review, editing, sharing) |
| Nested data (line items) | Native support — line items are arrays within invoice objects | No nesting; requires flattening into one row per line item | Possible via multiple sheets, but clunky programmatically |
| Typical use cases | Webhook payloads, MongoDB/document store ingestion, microservice communication, data lake pipelines | Bulk SQL imports, legacy system integrations expecting delimited files, quick spreadsheet analysis | Finance team review, manual verification before posting, sharing with non-technical stakeholders |
| Schema flexibility | High — accommodates varying fields per invoice without null-padding | Low — every row must share the same column set | Moderate — supports typed columns (dates, currencies) and formatting |
| Programmatic parsing | First-class support in every language | Straightforward but watch for delimiter/encoding edge cases | Requires a library (openpyxl, exceljs, etc.) |
For most developer workflows, JSON is the default choice. It preserves the hierarchical structure of invoice data — vendor details, line items, tax breakdowns — without forcing you to flatten relationships into rows. Any system that consumes data programmatically (REST APIs, message queues, document databases) expects JSON natively.
CSV makes sense when your target is a flat schema. Bulk-loading invoice header data into a SQL table, feeding records into an ETL tool that expects delimited files, or handing data to an analyst who will open it in a spreadsheet — these are CSV's strengths. If your use case involves nested line items, though, you will either need to flatten them (one row per line item, duplicating header fields) or split them across multiple files. For a deeper look at that workflow, see our guide on extracting invoice data to CSV format.
XLSX is the right pick when the next step involves a human. Finance teams reviewing extracted data before it enters an ERP, auditors spot-checking vendor totals, or managers who want a formatted report — all of these favor a spreadsheet file they can open, filter, and annotate without writing code.
Many teams have mixed needs: developers pulling JSON into a pipeline while the finance team downloads XLSX from the same extraction run. Since all three formats come from a single task, there is no cost to supporting both.
Working with Extracted Invoice JSON in Production
Extracting invoice data to JSON is the first half of the problem. The second half — getting that JSON into your database, validating it, and passing it downstream — is where most teams hit friction. The patterns below cover the most common production integration scenarios.
Database Ingestion Patterns
How you store extracted invoice JSON depends on your database engine and how you need to query the data later.
Relational databases (PostgreSQL, MySQL). The flat fields on each invoice object — vendor name, invoice number, date, total — map directly to columns in an INSERT statement or a COPY/LOAD DATA operation. Line items are where you make a design choice: normalize them into a separate line_items table with a foreign key back to the invoice, or store the entire invoice object in a native JSON column. PostgreSQL's jsonb and MySQL's JSON column type both support this approach. With jsonb, you can query individual line items using JSON path expressions without ever denormalizing:
SELECT invoice_number,
item->>'description' AS description,
(item->>'amount')::numeric AS amount
FROM invoices,
jsonb_array_elements(raw_json->'line_items') AS item
WHERE (item->>'amount')::numeric > 1000;
This hybrid approach — structured columns for fields you filter on frequently, a JSON column for the full extraction result — gives you both query performance and schema flexibility.
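A sketch of that hybrid insert in Python: structured columns for the frequently filtered fields plus the full object serialized for a jsonb column. The table and column names are assumptions, and the psycopg-style %s placeholders require a real database connection to execute (omitted here):

```python
import json

# Hypothetical table: structured columns + a raw_json jsonb column
INSERT_SQL = """
    INSERT INTO invoices (invoice_number, vendor_name, invoice_date,
                          total_amount, raw_json)
    VALUES (%s, %s, %s, %s, %s)
"""

def invoice_params(invoice: dict) -> tuple:
    """Build the parameter tuple for one extracted invoice object."""
    return (
        invoice["invoice_number"],
        invoice["vendor_name"],
        invoice["invoice_date"],
        invoice["total_amount"],
        json.dumps(invoice),  # full object, line_items included, into jsonb
    )
```

With a psycopg cursor you would then call cur.execute(INSERT_SQL, invoice_params(inv)), or executemany over a batch of extracted records.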
Document databases (MongoDB, DynamoDB). The nested per-invoice JSON, line_items arrays included, maps directly to documents with no schema transformation. Each extracted invoice becomes a single document. For MongoDB, you insert it as-is. For DynamoDB, you define your partition key (invoice number or vendor ID) and store the rest as attributes. If your extraction pipeline processes invoices in batch, the output array feeds directly into insertMany or BatchWriteItem.
Downstream API Integration
When the destination is an ERP system, accounting platform, or internal microservice rather than a database you control, the extracted JSON serves as the request body for an HTTP POST — or as the source you map from. The same principle applies if you are wrapping invoice extraction in an MCP server so that AI assistants can call it as a tool; the structured JSON output maps directly to the tool's response schema.
If you control the extraction prompt (as with the structured prompt approach covered earlier), you can specify field names and formats that already match the target API's expected schema. An accounting API that expects vendor_name, due_date, and line_items gets exactly those fields from extraction, with no post-processing layer in between. This eliminates an entire mapping step in your data pipeline.
When the target schema differs from your extraction output, a thin transformation function handles the mapping. The key advantage of JSON here over CSV or flat formats is that nested structures like line items, tax breakdowns, and payment terms survive intact through the transformation rather than requiring reassembly from flattened rows.
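A minimal sketch of such a transformation, assuming a hypothetical accounting API schema on the output side; the input field names follow the extraction examples above:

```python
def to_accounting_payload(invoice: dict) -> dict:
    """Map an extracted invoice object to a hypothetical accounting API schema.

    The keys on the left are illustrative target-API names; the keys read
    from `invoice` match the nested extraction output shown earlier.
    """
    return {
        "vendor": invoice["vendor_name"],
        "reference": invoice["invoice_number"],
        "issued_on": invoice["invoice_date"],
        "amount_due": invoice["total_amount"],
        # Nested line items survive the transformation intact
        "lines": [
            {"memo": item["description"],
             "qty": item["quantity"],
             "amount": item["line_total"]}
            for item in invoice.get("line_items", [])
        ],
    }
```

The function stays pure (dict in, dict out), which makes it trivial to unit-test against fixture invoices before wiring it into the POST call.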
Validating Extracted JSON Before Ingestion
OCR and extraction models handle the vast majority of invoices correctly, but production pipelines need a safety net. An invoice with a missing vendor name, a null total, or a date in an unexpected format will cause downstream failures — a rejected database insert, a failed API call, or worse, silently corrupt data.
JSON Schema validation catches these anomalies at the boundary, before processing continues. Define a schema that encodes your requirements:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"required": ["invoice_number", "vendor_name", "date", "total"],
"properties": {
"invoice_number": { "type": "string", "minLength": 1 },
"vendor_name": { "type": "string", "minLength": 1 },
"date": { "type": "string", "format": "date" },
"total": { "type": "number", "minimum": 0 },
"line_items": {
"type": "array",
"items": {
"type": "object",
"required": ["description", "amount"],
"properties": {
"description": { "type": "string" },
"amount": { "type": "number" }
}
}
}
}
}
Validate each extracted invoice against this schema before it enters your database or gets forwarded to a downstream service. In Python, jsonschema.validate() handles this in a single call. In Node.js, libraries like Ajv do the same, and TypeScript teams increasingly reach for Zod-based invoice extraction pipelines that combine schema definition with runtime validation in a single declaration. Records that fail validation get routed to a review queue or error log rather than silently passing through.
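If you would rather not add a validator dependency for a handful of rules, a hand-rolled check covering the same required fields is straightforward. Note this is a stdlib subset of the schema above, not a JSON Schema implementation:

```python
def validate_invoice(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Required non-empty strings (mirrors the schema's required + minLength)
    for field in ("invoice_number", "vendor_name", "date"):
        value = record.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"{field}: missing or empty")
    # total must be a non-negative number (bool is excluded deliberately)
    total = record.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        problems.append("total: must be a non-negative number")
    # Each line item needs a description and a numeric amount
    for i, item in enumerate(record.get("line_items", [])):
        if not isinstance(item.get("description"), str):
            problems.append(f"line_items[{i}].description: missing")
        if not isinstance(item.get("amount"), (int, float)):
            problems.append(f"line_items[{i}].amount: missing")
    return problems
```

Returning a problem list rather than raising makes it easy to attach the failure reasons to the record when routing it to a review queue.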
Batch Processing at Scale
When you process hundreds or thousands of invoices through extraction, every record produces JSON with the same field structure. This consistency is what makes batch operations practical: you can stream results into a bulk INSERT, feed them into a message queue one at a time, or write them to newline-delimited JSON (NDJSON) files for tools like PostgreSQL's COPY or BigQuery's load jobs.
The pattern stays the same regardless of volume: extract, validate, ingest.
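The NDJSON step mentioned above is little more than one json.dumps per record, newline-separated, with no enclosing array:

```python
import json

def to_ndjson(records) -> str:
    """Serialize records as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"
```

Write the result to a file and it is ready for PostgreSQL's COPY or a BigQuery load job; because each line is independent, the same output also streams cleanly into a message queue one record at a time.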
About the author
David Harding
Founder, Invoice Data Extraction
David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.
Editorial process
This page is reviewed as part of Invoice Data Extraction's editorial process.
If this page discusses tax, legal, or regulatory requirements, treat it as general information only and confirm current requirements with official guidance before acting. The updated date shown above is the latest editorial review date for this page.
Related Articles
Explore adjacent guides and reference articles on this topic.
How to Build an MCP Server for Invoice Extraction
Build an MCP server that exposes invoice extraction as a tool for AI assistants. Covers tool definition, API integration, and structured JSON responses.
Python PDF Table Extraction: pdfplumber vs Camelot vs Tabula
Compare pdfplumber, Camelot, and tabula-py for extracting tables from PDF invoices. Code examples, invoice-specific tests, and a decision framework.
How to Reduce Invoice Extraction API Costs at Scale
Seven engineering techniques that reduce invoice extraction API costs by 30-60% at high volume, with estimated savings and implementation priorities for each.