Converting invoices to JSON means feeding invoice PDFs or images into an extraction engine that uses AI and OCR to identify fields (invoice number, date, vendor name, line items, tax, totals) and output them as structured JSON. For developers, this typically happens through a REST API call or an SDK method that accepts a file and returns parsed data. Non-technical users can accomplish the same thing by uploading documents to a web platform and downloading the JSON output directly.
JSON is the natural target format for this workflow. According to Cloudflare's analysis of API traffic patterns, JSON accounts for approximately 97% of API request payloads across their global network, dwarfing XML and every other interchange format. If extracted invoice data needs to flow into a database, feed a downstream API, or populate a data pipeline, JSON is what those systems already expect. There is no serialization step, no format translation. The data arrives ready to use.
The practical question is how you get from a stack of invoice files to clean JSON output. This guide covers three paths:
- The JSON output structure itself. Before writing extraction code, you need to understand what the resulting data looks like: which fields are extracted, how line items nest, and what schema choices matter for your downstream consumers.
- SDK-based extraction with Python and Node.js. The fastest route for most developers. A single method call handles upload, AI-powered extraction, and JSON output. Both languages are covered with working code.
- REST API integration for custom pipelines. When you need full control over the HTTP layer, retry logic, or webhook-driven architectures, the REST API gives you direct access to the same extraction engine.
Platforms like Invoice Data Extraction provide all three pathways: a web interface for ad-hoc uploads (no code required), official Python and Node.js SDKs, and a REST API. The underlying AI handles OCR on scanned documents and images, identifies invoice fields and line-item tables, and returns structured JSON you can consume directly.
What Invoice JSON Output Looks Like
Before writing integration code, you need to know exactly what comes back from an extraction. JSON (ECMA-404) gives you typed fields, nested structures, and direct compatibility with document databases and REST APIs.
The structure of your extracted invoice JSON depends on one decision: do you need one object per invoice or one object per line item?
Invoice-Level JSON
When you set output_structure to per_invoice, each invoice produces a single JSON object containing header-level totals and metadata. Here is a realistic example of structured invoice data in JSON format:
[
{
"invoice_number": "INV-2024-03842",
"invoice_date": "2024-11-15",
"due_date": "2024-12-15",
"vendor_name": "Cascade Cloud Services Ltd.",
"vendor_address": "47 Richmond Street, Vancouver, BC V6B 1E3",
"currency": "CAD",
"subtotal": 4250.00,
"tax_amount": 552.50,
"total_amount": 4802.50,
"source_file": "cascade-nov-2024.pdf",
"page": 1
},
{
"invoice_number": "INV-88910",
"invoice_date": "2024-11-18",
"due_date": "2025-01-17",
"vendor_name": "Primewell Industrial Supply",
"vendor_address": "1200 N Harbor Blvd, Suite 300, Fullerton, CA 92832",
"currency": "USD",
"subtotal": 11780.00,
"tax_amount": 1060.20,
"total_amount": 12840.20,
"source_file": "primewell-q4-batch.pdf",
"page": 3
}
]
Each field name in your prompt becomes a key in the output objects. The source_file and page fields trace every record back to its origin document, which matters when you are processing hundreds of invoices in a single batch.
Line-Item-Level JSON
Setting output_structure to per_line_item breaks each invoice into its individual line items. This is the structure you want when invoice line items need to land in a transactions table or feed into a cost-allocation system:
[
{
"invoice_number": "INV-2024-03842",
"invoice_date": "2024-11-15",
"vendor_name": "Cascade Cloud Services Ltd.",
"description": "Dedicated GPU instance (A100) - monthly",
"quantity": 2,
"unit_price": 1850.00,
"line_total": 3700.00,
"source_file": "cascade-nov-2024.pdf",
"page": 1
},
{
"invoice_number": "INV-2024-03842",
"invoice_date": "2024-11-15",
"vendor_name": "Cascade Cloud Services Ltd.",
"description": "Managed backup storage (500 GB)",
"quantity": 1,
"unit_price": 550.00,
"line_total": 550.00,
"source_file": "cascade-nov-2024.pdf",
"page": 1
},
{
"invoice_number": "INV-88910",
"invoice_date": "2024-11-18",
"vendor_name": "Primewell Industrial Supply",
"description": "Stainless steel hex bolts M12x40 (box of 200)",
"quantity": 15,
"unit_price": 42.00,
"line_total": 630.00,
"source_file": "primewell-q4-batch.pdf",
"page": 3
}
]
Notice that invoice_number, invoice_date, and vendor_name repeat on every line-item row. This is deliberate. It produces a flat structure where each JSON object is self-contained.
Flat vs. Nested Invoice JSON Schema
The examples above show a flat structure: every object carries all the context it needs. This maps directly to database rows and works well for bulk inserts into relational tables, data warehouses, or any system expecting tabular data.
A nested structure groups line items under their parent invoice:
{
"invoice_number": "INV-2024-03842",
"invoice_date": "2024-11-15",
"vendor_name": "Cascade Cloud Services Ltd.",
"total_amount": 4802.50,
"line_items": [
{
"description": "Dedicated GPU instance (A100) - monthly",
"quantity": 2,
"unit_price": 1850.00,
"line_total": 3700.00
},
{
"description": "Managed backup storage (500 GB)",
"quantity": 1,
"unit_price": 550.00,
"line_total": 550.00
}
]
}
When to use which:
- Flat is the right choice when your downstream consumer is a SQL database, a pandas DataFrame, or anything that expects uniform rows. It also maps cleanly to CSV if you ever need a fallback format.
- Nested preserves the natural document hierarchy. It is better for API request/response payloads, MongoDB or other document-oriented databases, and any front-end rendering where you display invoices with expandable line-item details.
You can get the flat structure directly from the extraction output and reshape it to nested in your application code, or vice versa. The extraction prompt and output_structure parameter give you control over which shape you start with.
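As a sketch of that reshaping step, here is one way to regroup flat per_line_item records (field names follow the examples above) into the nested shape:

```python
# Regroup flat per_line_item records into one nested object per invoice.
# Header fields repeat on every flat row, so they identify the parent.
HEADER_FIELDS = ("invoice_number", "invoice_date", "vendor_name")

def nest_line_items(flat_records):
    invoices = {}
    for record in flat_records:
        key = record["invoice_number"]
        if key not in invoices:
            # Carry the header-level context once per invoice
            invoices[key] = {f: record[f] for f in HEADER_FIELDS}
            invoices[key]["line_items"] = []
        # Everything that is not header context becomes a line-item field
        item = {k: v for k, v in record.items() if k not in HEADER_FIELDS}
        invoices[key]["line_items"].append(item)
    return list(invoices.values())
```

The inverse direction (nested to flat) is the same loop in reverse: copy the header fields onto each element of line_items.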
Field Formatting and Standardization
Raw invoice PDFs contain dates in dozens of formats ("Nov 15, 2024", "15/11/2024", "2024.11.15") and currency values with inconsistent symbols and separators. Your extraction prompt controls how these get standardized in the JSON output.
Two formatting rules worth enforcing in every extraction:
- Dates as ISO 8601 strings (YYYY-MM-DD). A prompt instruction like "Format all dates as YYYY-MM-DD" ensures consistent sorting, comparison, and parsing across languages. No ambiguity between month-first and day-first conventions.
- Currency amounts as numbers, not strings. Extracting 4802.50 rather than "$4,802.50" or "4.802,50 €" means your code can perform arithmetic immediately without stripping symbols or guessing decimal conventions. Add the prompt directive: "Ensure all currency fields have 2 decimal places."
These formatting choices are not cosmetic. They eliminate an entire category of parsing bugs downstream and make your invoice JSON schema predictable across vendors, currencies, and locales.
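If you are cleaning up values that arrived without those prompt rules (say, from an older export), a small normalizer can enforce the same two conventions after the fact. The date format list and the separator heuristics below are assumptions; extend them for your vendor mix:

```python
import re
from datetime import datetime

# Candidate input formats -- an assumption; extend for your vendors.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y.%m.%d", "%b %d, %Y")

def to_iso_date(raw: str) -> str:
    """Normalize a date string to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def to_amount(raw: str) -> float:
    """Parse '$4,802.50' or '4.802,50 €' into a plain float."""
    s = re.sub(r"[^\d.,-]", "", raw)
    if "," in s and "." in s:
        # Rightmost separator is the decimal point
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "," in s:
        # A lone comma with exactly two trailing digits is a decimal comma
        head, _, tail = s.rpartition(",")
        s = (head.replace(",", "") + "." + tail) if len(tail) == 2 else s.replace(",", "")
    return float(s)
```

Prompt-level standardization remains the better option when you control the extraction, since it fixes the data at the source rather than patching it downstream.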
Extracting Invoice Data to JSON with Python
Install the official SDK (requires Python 3.9+):
pip install invoicedataextraction-sdk
Initialize the client using an API key stored in an environment variable:
import os
from invoicedataextraction import InvoiceDataExtraction
client = InvoiceDataExtraction(
api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY")
)
The SDK's extract() method handles the entire workflow in a single call — uploading files, submitting the extraction job, polling for completion, and downloading results.
result = client.extract(
folder_path="./invoices",
prompt="Extract invoice number, invoice date, vendor name, line items with description, quantity, unit price, and line total",
output_structure="per_line_item",
download={"formats": ["json"], "output_path": "./output"},
console_output=True,
)
Point folder_path at a directory of PDFs (or images, or scanned documents), and the SDK processes every file it finds. Setting output_structure to "per_line_item" produces one JSON record per line item rather than one per invoice — useful when you need to feed rows directly into a database or accounting system. If you need finer control over how tables are parsed from the PDF before converting to JSON, Python libraries like pdfplumber, Camelot, and tabula-py handle that lower layer directly. The download parameter tells the SDK to write JSON files to ./output once extraction finishes.
Structured prompts for precise field control
A plain string prompt works for straightforward extractions. When you need tighter control over field names and formatting, pass a prompt object instead:
result = client.extract(
folder_path="./invoices",
prompt={
"fields": [
{"name": "Invoice Number"},
{"name": "Invoice Date", "prompt": "Format as YYYY-MM-DD"},
{"name": "Vendor Name"},
{"name": "Total Amount", "prompt": "Numeric, no currency symbol, 2 decimal places"},
],
"general_prompt": "One record per invoice. Skip email cover sheets.",
},
download={"formats": ["json"], "output_path": "./output"},
console_output=True,
)
The fields array defines exactly which data points to extract and how to format each one. The general_prompt applies instructions across the entire batch — filtering out non-invoice pages, setting record granularity, or specifying how to handle edge cases.
Error handling
The SDK exposes two exception types worth catching in production code:
from invoicedataextraction.errors import SdkError, ApiResponseError
try:
result = client.extract(
folder_path="./invoices",
prompt="Extract invoice number, date, vendor, and total",
download={"formats": ["json"], "output_path": "./output"},
)
except SdkError as e:
# Client-side issues: invalid parameters, file read errors, network failures
print(f"SDK error: {e}")
except ApiResponseError as e:
# Server-side issues: authentication failure, quota exceeded, malformed request
print(f"API error: {e}")
SdkError covers client-side problems like bad parameters or network failures. ApiResponseError surfaces issues from the API itself — expired keys, exceeded quotas, or invalid job configurations. For full SDK documentation, see the Python SDK reference.
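For transient failures, a retry wrapper is often worth adding around the extract() call. The helper below is not part of the SDK; it is a generic sketch, and which exceptions count as retryable is your call:

```python
import time

def with_retries(fn, retryable=(ConnectionError, TimeoutError),
                 attempts=3, base_delay=1.0):
    """Call fn(), retrying the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # Exhausted: surface the final failure
            time.sleep(base_delay * 2 ** attempt)
```

You might then write result = with_retries(lambda: client.extract(...), retryable=(SdkError,)) to retry client-side failures while letting ApiResponseError (quota, auth) surface immediately.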
Extracting Invoice Data to JSON with Node.js
Install the SDK from npm:
npm install @invoicedataextraction/sdk
The package requires Node.js 18+ and is ESM only, so your package.json needs "type": "module". TypeScript declarations ship with the package — no separate @types install required.
Initialize the client with your API key:
import InvoiceDataExtraction from "@invoicedataextraction/sdk";
const client = new InvoiceDataExtraction({
api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});
The SDK exposes an async extract() method that handles file upload, processing, and download in a single call:
const result = await client.extract({
folder_path: "./invoices",
prompt: "Extract invoice number, invoice date, vendor name, line items with description, quantity, unit price, and line total",
output_structure: "per_line_item",
download: { formats: ["json"], output_path: "./output" },
console_output: true,
});
The parameters work identically to the Python SDK. For finer control over extracted fields, pass a structured prompt object:
const result = await client.extract({
folder_path: "./invoices",
prompt: {
fields: [
{ name: "Invoice Number" },
{ name: "Invoice Date", prompt: "Format as YYYY-MM-DD" },
{ name: "Vendor Name" },
{ name: "Total Amount", prompt: "Numeric, no currency symbol, 2 decimals" },
],
general_prompt: "One record per invoice. Skip cover pages.",
},
output_structure: "per_invoice",
download: { formats: ["json"], output_path: "./output" },
});
One naming convention worth noting: method names use camelCase (extract, uploadFiles, submitExtraction) while option keys use snake_case (folder_path, output_structure, api_key). The snake_case keys match the REST API response format directly, so you can pass API-level parameters without translation.
Handle errors at two levels. SDK and network errors throw exceptions, which you catch with a standard try/catch. Task-level failures — a corrupted PDF, an unreadable scan — are reported in the response object:
try {
const result = await client.extract({
folder_path: "./invoices",
prompt: "Extract invoice number, date, vendor, and total",
download: { formats: ["json"], output_path: "./output" },
});
if (!result.success) {
console.error("Extraction failed:", result);
}
} catch (error) {
console.error("SDK or network error:", error.message);
}
Every method on the client is async and returns a Promise, so the SDK fits into existing Express middleware, serverless functions, or queue workers without blocking. If you want to skip the SDK layer entirely and call vision LLMs like GPT-4o or Claude directly from Node.js, that approach gives you full control over model selection, prompt engineering, and structured output with Zod schemas — though you take on more of the orchestration yourself.
Using the REST API for Custom Invoice-to-JSON Pipelines
The Python and Node.js SDKs handle the multi-step extraction workflow for you, and they are the recommended starting point for most projects. But there are solid reasons to work with the REST API directly. You may be building in Go, Ruby, or Java without an official SDK. Your pipeline may already have an HTTP client layer you want to reuse. Or you may be orchestrating extraction through tools like Airflow or AWS Step Functions where each stage runs independently and SDK abstractions get in the way.
The invoice data extraction API exposes a straightforward five-step workflow. Each step maps to a single HTTP call, which makes it easy to distribute across orchestration tasks or wrap in whatever retry logic your infrastructure already provides.
The API Workflow
Authentication uses a Bearer token in the Authorization header. Generate your API key from the dashboard — the API shares the same credit-based pricing as the web interface with no separate subscription fees.
The extraction sequence works as follows:
- Create an upload session. Send a POST request to /v1/uploads/sessions with metadata about the files you plan to upload. The response returns presigned URLs for each file.
- Upload your invoice files. Use the presigned URLs to upload PDFs, JPGs, or PNGs directly. File limits are 150 MB per PDF, 5 MB per image, and up to 2 GB or 6,000 files per batch.
- Submit the extraction task. POST to /v1/extractions with your prompt and the output_structure parameter. For JSON output, set output_structure to "per_invoice" or "per_line_item" depending on the granularity you need, or use "automatic" to let the engine decide. This call returns immediately with an extraction ID.
- Poll for completion. GET /v1/extractions/{id} until the status indicates the task is finished. All operations are asynchronous — nothing blocks while extraction runs.
- Download the JSON output. Once complete, the response body includes output.json_url alongside xlsx_url and csv_url. Fetch the JSON URL to retrieve your structured invoice data. These download URLs are presigned and expire after 5 minutes, so generate a fresh one if your pipeline has a delay between polling and download.
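Sketched end to end in stdlib Python (no third-party HTTP client), the five steps look like this. The base URL, the JSON field names (id, status, the "completed"/"failed" values), and the upload-session payload are assumptions for illustration; the endpoint paths and 5-second polling interval follow the steps above:

```python
import json
import time
import urllib.request

API_BASE = "https://api.example.com"  # assumption: substitute the real API host

def build_request(method, path, token, payload=None):
    """Construct an authenticated JSON request (pure, so it is easy to test)."""
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        API_BASE + path,
        data=data,
        method=method,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

def api_request(method, path, token, payload=None):
    with urllib.request.urlopen(build_request(method, path, token, payload)) as resp:
        return json.loads(resp.read())

def run_extraction(token, prompt):
    # 1. Create an upload session (real file metadata omitted for brevity)
    session = api_request("POST", "/v1/uploads/sessions", token,
                          {"files": [{"name": "invoice.pdf"}]})
    # 2. Upload each file to its presigned URL from `session` (one PUT per file)...
    # 3. Submit the extraction task; this returns immediately with an ID
    task = api_request("POST", "/v1/extractions", token,
                       {"prompt": prompt, "output_structure": "per_line_item"})
    # 4. Poll for completion, respecting the 5-second minimum interval
    while True:
        status = api_request("GET", f"/v1/extractions/{task['id']}", token)
        if status["status"] in ("completed", "failed"):
            break
        time.sleep(5)
    # 5. Download the JSON output via its short-lived presigned URL
    with urllib.request.urlopen(status["output"]["json_url"]) as resp:
        return json.loads(resp.read())
```

Each function maps to one orchestration stage, so the same helpers slot into an Airflow task or Step Functions state without the loop.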
Rate Limits to Plan Around
Two limits matter most for pipeline design. Extraction submissions are capped at 30 requests per minute, which defines your maximum batch throughput. Polling is allowed at 120 requests per minute, but a minimum interval of 5 seconds between status checks is recommended to avoid unnecessary load. If you are processing large batches, structure your pipeline to submit in controlled bursts and poll with exponential backoff rather than tight loops. At high volume, these design choices also affect your bill — see our breakdown of techniques to reduce invoice extraction API costs at scale for concrete savings estimates.
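The recommended polling behavior can be captured as a delay schedule: start at the 5-second minimum and back off exponentially up to a cap. The growth factor and cap below are assumptions; tune them to your batch sizes:

```python
def poll_delays(minimum=5.0, factor=1.5, cap=60.0):
    """Yield successive polling delays: 5s minimum, exponential growth, capped."""
    delay = minimum
    while True:
        yield delay
        delay = min(delay * factor, cap)
```

Between status checks, sleep for the next yielded delay instead of looping tightly; the schedule stays under the 120 requests/minute polling limit from the first iteration.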
The API quickstart guide for invoice extraction walks through the full setup from key generation to first successful extraction.
JSON vs CSV vs XLSX: Choosing the Right Invoice Data Format
The extraction process itself is format-agnostic. A single extraction task produces results you can download as JSON, CSV, or XLSX simultaneously. The real question is what consumes the data downstream.
| Factor | JSON | CSV | XLSX |
|---|---|---|---|
| Best consumer | Code (APIs, pipelines, databases) | Flat-file imports, spreadsheet tools | People (review, editing, sharing) |
| Nested data (line items) | Native support — line items are arrays within invoice objects | No nesting; requires flattening into one row per line item | Possible via multiple sheets, but clunky programmatically |
| Typical use cases | Webhook payloads, MongoDB/document store ingestion, microservice communication, data lake pipelines | Bulk SQL imports, legacy system integrations expecting delimited files, quick spreadsheet analysis | Finance team review, manual verification before posting, sharing with non-technical stakeholders |
| Schema flexibility | High — accommodates varying fields per invoice without null-padding | Low — every row must share the same column set | Moderate — supports typed columns (dates, currencies) and formatting |
| Programmatic parsing | First-class support in every language | Straightforward but watch for delimiter/encoding edge cases | Requires a library (openpyxl, exceljs, etc.) |
For most developer workflows, JSON is the default choice. It preserves the hierarchical structure of invoice data — vendor details, line items, tax breakdowns — without forcing you to flatten relationships into rows. Any system that consumes data programmatically (REST APIs, message queues, document databases) expects JSON natively.
CSV makes sense when your target is a flat schema. Bulk-loading invoice header data into a SQL table, feeding records into an ETL tool that expects delimited files, or handing data to an analyst who will open it in a spreadsheet — these are CSV's strengths. If your use case involves nested line items, though, you will either need to flatten them (one row per line item, duplicating header fields) or split them across multiple files. For a deeper look at that workflow, see our guide on extracting invoice data to CSV format.
XLSX is the right pick when the next step involves a human. Finance teams reviewing extracted data before it enters an ERP, auditors spot-checking vendor totals, or managers who want a formatted report — all of these favor a spreadsheet file they can open, filter, and annotate without writing code.
Many teams have mixed needs: developers pulling JSON into a pipeline while the finance team downloads XLSX from the same extraction run. Since all three formats come from a single task, there is no cost to supporting both.
Working with Extracted Invoice JSON in Production
Extracting invoice data to JSON is the first half of the problem. The second half — getting that JSON into your database, validating it, and passing it downstream — is where most teams hit friction. The patterns below cover the most common production integration scenarios.
Database Ingestion Patterns
How you store extracted invoice JSON depends on your database engine and how you need to query the data later.
Relational databases (PostgreSQL, MySQL). The flat fields on each invoice object — vendor name, invoice number, date, total — map directly to columns in an INSERT statement or a COPY/LOAD DATA operation. Line items are where you make a design choice: normalize them into a separate line_items table with a foreign key back to the invoice, or store the entire invoice object in a native JSON column. PostgreSQL's jsonb and MySQL's JSON column type both support this approach. With jsonb, you can query individual line items using JSON path expressions without ever denormalizing:
SELECT invoice_number,
item->>'description' AS description,
(item->>'amount')::numeric AS amount
FROM invoices,
jsonb_array_elements(raw_json->'line_items') AS item
WHERE (item->>'amount')::numeric > 1000;
This hybrid approach — structured columns for fields you filter on frequently, a JSON column for the full extraction result — gives you both query performance and schema flexibility.
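A sketch of that hybrid insert in Python: structured columns for the frequently filtered fields plus the full object serialized for a jsonb column. The table and column names are assumptions, and the psycopg-style %s placeholders require a real database connection to execute (omitted here):

```python
import json

# Hypothetical table: structured columns + a raw_json jsonb column
INSERT_SQL = """
    INSERT INTO invoices (invoice_number, vendor_name, invoice_date,
                          total_amount, raw_json)
    VALUES (%s, %s, %s, %s, %s)
"""

def invoice_params(invoice: dict) -> tuple:
    """Build the parameter tuple for one extracted invoice object."""
    return (
        invoice["invoice_number"],
        invoice["vendor_name"],
        invoice["invoice_date"],
        invoice["total_amount"],
        json.dumps(invoice),  # full object, line_items included, into jsonb
    )
```

With a psycopg cursor you would then call cur.execute(INSERT_SQL, invoice_params(inv)), or executemany over a batch of extracted records.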
Document databases (MongoDB, DynamoDB). The nested per-invoice JSON, line_items arrays included, maps directly to documents with no schema transformation. Each extracted invoice becomes a single document. For MongoDB, you insert it as-is. For DynamoDB, you define your partition key (invoice number or vendor ID) and store the rest as attributes. If your extraction pipeline processes invoices in batch, the output array feeds directly into insertMany or BatchWriteItem.
Downstream API Integration
When the destination is an ERP system, accounting platform, or internal microservice rather than a database you control, the extracted JSON serves as the request body for an HTTP POST — or as the source you map from. The same principle applies if you are wrapping invoice extraction in an MCP server so that AI assistants can call it as a tool; the structured JSON output maps directly to the tool's response schema.
If you control the extraction prompt (as with the structured prompt approach covered earlier), you can specify field names and formats that already match the target API's expected schema. An accounting API that expects vendor_name, due_date, and line_items gets exactly those fields from extraction, with no post-processing layer in between. This eliminates an entire mapping step in your data pipeline.
When the target schema differs from your extraction output, a thin transformation function handles the mapping. The key advantage of JSON here over CSV or flat formats is that nested structures like line items, tax breakdowns, and payment terms survive intact through the transformation rather than requiring reassembly from flattened rows.
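A minimal sketch of such a transformation, assuming a hypothetical accounting API schema on the output side; the input field names follow the extraction examples above:

```python
def to_accounting_payload(invoice: dict) -> dict:
    """Map an extracted invoice object to a hypothetical accounting API schema.

    The keys on the left are illustrative target-API names; the keys read
    from `invoice` match the nested extraction output shown earlier.
    """
    return {
        "vendor": invoice["vendor_name"],
        "reference": invoice["invoice_number"],
        "issued_on": invoice["invoice_date"],
        "amount_due": invoice["total_amount"],
        # Nested line items survive the transformation intact
        "lines": [
            {"memo": item["description"],
             "qty": item["quantity"],
             "amount": item["line_total"]}
            for item in invoice.get("line_items", [])
        ],
    }
```

The function stays pure (dict in, dict out), which makes it trivial to unit-test against fixture invoices before wiring it into the POST call.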
Validating Extracted JSON Before Ingestion
OCR and extraction models handle the vast majority of invoices correctly, but production pipelines need a safety net. An invoice with a missing vendor name, a null total, or a date in an unexpected format will cause downstream failures — a rejected database insert, a failed API call, or worse, silently corrupt data.
JSON Schema validation catches these anomalies at the boundary, before processing continues. Define a schema that encodes your requirements:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"required": ["invoice_number", "vendor_name", "date", "total"],
"properties": {
"invoice_number": { "type": "string", "minLength": 1 },
"vendor_name": { "type": "string", "minLength": 1 },
"date": { "type": "string", "format": "date" },
"total": { "type": "number", "minimum": 0 },
"line_items": {
"type": "array",
"items": {
"type": "object",
"required": ["description", "amount"],
"properties": {
"description": { "type": "string" },
"amount": { "type": "number" }
}
}
}
}
}
Validate each extracted invoice against this schema before it enters your database or gets forwarded to a downstream service. In Python, jsonschema.validate() handles this in a single call. In Node.js, libraries like Ajv do the same, and TypeScript teams increasingly reach for Zod-based invoice extraction pipelines that combine schema definition with runtime validation in a single declaration. Records that fail validation get routed to a review queue or error log rather than silently passing through.
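If you would rather not add a validator dependency for a handful of rules, a hand-rolled check covering the same required fields is straightforward. Note this is a stdlib subset of the schema above, not a JSON Schema implementation:

```python
def validate_invoice(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Required non-empty strings (mirrors the schema's required + minLength)
    for field in ("invoice_number", "vendor_name", "date"):
        value = record.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"{field}: missing or empty")
    # total must be a non-negative number (bool is excluded deliberately)
    total = record.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        problems.append("total: must be a non-negative number")
    # Each line item needs a description and a numeric amount
    for i, item in enumerate(record.get("line_items", [])):
        if not isinstance(item.get("description"), str):
            problems.append(f"line_items[{i}].description: missing")
        if not isinstance(item.get("amount"), (int, float)):
            problems.append(f"line_items[{i}].amount: missing")
    return problems
```

Returning a problem list rather than raising makes it easy to attach the failure reasons to the record when routing it to a review queue.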
Batch Processing at Scale
When you process hundreds or thousands of invoices through extraction, every record produces JSON with the same field structure. This consistency is what makes batch operations practical: you can stream results into a bulk INSERT, feed them into a message queue one at a time, or write them to newline-delimited JSON (NDJSON) files for tools like PostgreSQL's COPY or BigQuery's load jobs.
The pattern stays the same regardless of volume: extract, validate, ingest.
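The NDJSON step mentioned above is little more than one json.dumps per record, newline-separated, with no enclosing array:

```python
import json

def to_ndjson(records) -> str:
    """Serialize records as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"
```

Write the result to a file and it is ready for PostgreSQL's COPY or a BigQuery load job; because each line is independent, the same output also streams cleanly into a message queue one record at a time.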
About the author
David Harding
Founder, Invoice Data Extraction
David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.
Editorial process
This page is reviewed as part of Invoice Data Extraction's editorial process.
If this page discusses tax, legal, or regulatory requirements, treat it as general information only and confirm current requirements with official guidance before acting. The updated date shown above is the latest editorial review date for this page.
Related Articles
Explore adjacent guides and reference articles on this topic.
How to Build an MCP Server for Invoice Extraction
Build an MCP server that exposes invoice extraction as a tool for AI assistants. Covers tool definition, API integration, and structured JSON responses.
Python PDF Table Extraction: pdfplumber vs Camelot vs Tabula
Compare pdfplumber, Camelot, and tabula-py for extracting tables from PDF invoices. Code examples, invoice-specific tests, and a decision framework.
How to Reduce Invoice Extraction API Costs at Scale
Seven engineering techniques that reduce invoice extraction API costs by 30-60% at high volume, with estimated savings and implementation priorities for each.