To extract invoice data with JavaScript, use a dedicated invoice extraction API with a Node.js SDK rather than a generic PDF parsing library. The @invoicedataextraction/sdk package provides a one-call extract() method that accepts invoice PDFs or images and returns structured JSON containing vendor name, invoice number, amounts, dates, and line items:
import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});

const result = await client.extract({
  folder_path: "./invoices",
  prompt: "Extract invoice number, date, vendor name, and total amount",
  output_structure: "per_invoice",
  download: { formats: ["json"], output_path: "./output" },
});
If you have searched npm for "invoice," most of what you have found are packages that generate invoice PDFs from your data (easyinvoice, pdf-invoice, nodeice). Extracting structured data from an existing invoice is the opposite workflow, and finding the right tools requires knowing where to look.
JavaScript and Node.js are a natural fit for invoice extraction workflows. According to Stack Overflow's 2025 Developer Survey, which drew over 49,000 respondents across 177 countries, JavaScript is used by 66% of all developers and Node.js by 49% of professional developers. Invoice processing pipelines frequently live inside Express or Fastify APIs, serverless functions, or full-stack applications where Node.js is already the runtime. Having a native SDK that returns typed JSON means the extracted data slots directly into your existing application logic without format conversion or language bridging.
This guide covers three approaches to extracting invoice data in Node.js: raw PDF text parsing with libraries like pdf-parse, local OCR using Tesseract bindings, and managed extraction APIs purpose-built for financial documents. From there, it walks through production-ready TypeScript examples using the InvoiceDataExtraction SDK, including batch processing for multi-invoice workflows and error handling patterns suited for real deployments.
Why Most npm Search Results Will Not Solve Invoice Extraction
Search npm for "invoice" and you will find dozens of packages. Nearly all of them do the opposite of what you need. Packages like easyinvoice, microinvoice, and pdf-invoice-generator take structured data you already have and produce a PDF from it. Invoice data extraction runs in the other direction: you start with an existing PDF or image and need to pull structured data out of it.
This generation-vs-extraction confusion is not just a naming inconvenience. It means the best npm package for invoice extraction is not findable through the obvious search paths, because the results are dominated by generation tools that solve a fundamentally different problem.
What npm Actually Offers for Extraction
The dedicated npm invoice parser ecosystem is thin. What you will find falls into two categories:
- General-purpose PDF text extractors like pdf-parse and pdfjs-dist can read native (digitally created) PDFs and return their text content as strings. They are legitimate tools, but they do not understand invoices. They extract raw text, not invoice data.
- API client libraries that wrap managed extraction services. These are typically SDKs for cloud-based OCR and document intelligence platforms, not standalone extraction engines.
There is very little in between. No widely adopted npm package takes an invoice PDF as input and returns structured JSON with labeled fields as output without relying on an external service.
The Gap Between Text and Data
What pdf-parse gives you is a wall of concatenated strings with no field labels, no row boundaries, and no semantic structure. Turning that into usable invoice data requires parsing logic specific to each vendor's layout, and that logic breaks the moment you receive invoices from a second vendor with a different format. Scanned invoices return nothing at all, since pdf-parse has no OCR capability.
Dedicated extraction tools fill this gap by combining OCR, layout analysis, and field identification into a single pipeline. These capabilities require document intelligence models trained on invoice layouts, which is why the extraction side of the ecosystem is built around managed APIs rather than self-contained npm packages.
Three Approaches to Invoice Extraction in Node.js
When you need to pull structured data from invoices in a Node.js application, you have three realistic paths. Each involves a fundamentally different trade-off between control, accuracy, and implementation effort. The right choice depends on how many invoice layouts you need to handle, whether your documents are native PDFs or scanned images, and how much parsing logic you want to own.
1. Raw PDF Text Extraction
Libraries like pdf-parse and pdfjs-dist can extract the text layer from native (digitally generated) PDFs. You get back a string of raw text, and from there, you write your own parsing logic to locate invoice fields.
This approach is free, runs entirely locally, and does not depend on any external service. For a single known invoice template where the vendor name always appears on line 3 and the total is always prefixed by "Amount Due:", regex or string splitting can work. You control everything.
The limitations are significant. These libraries perform no OCR at all. A scanned invoice, a photographed receipt, or any image-based PDF returns nothing. There is no layout analysis, no field identification, and no understanding of what "invoice date" means versus "due date." You are writing and maintaining a custom node invoice parser for every template variation you encounter. When a vendor updates their invoice layout, your parsing breaks silently.
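To make the single-template case concrete, here is a minimal sketch of the kind of regex parsing described above. The function name, the "Invoice No:" label, and the sample text are hypothetical; only the "Amount Due:" prefix comes from the example in this section. It assumes the raw text has already been obtained from a PDF text extractor.

```typescript
// Hypothetical single-template parser. The labels it looks for are assumptions
// about one vendor's layout, not a general invoice format.
interface ParsedInvoice {
  invoiceNumber: string | null;
  amountDue: number | null;
}

function parseKnownTemplate(rawText: string): ParsedInvoice {
  const numberMatch = rawText.match(/Invoice No:\s*([A-Z0-9-]+)/i);
  const amountMatch = rawText.match(/Amount Due:\s*\$?\s*([\d,]+\.\d{2})/i);
  const amountText = amountMatch?.[1];

  return {
    invoiceNumber: numberMatch?.[1] ?? null,
    // Strip thousands separators before converting to a number.
    amountDue: amountText !== undefined ? Number(amountText.replace(/,/g, "")) : null,
  };
}

// Text as a PDF text extractor might return it for this one template.
const sampleText = "ACME Supplies Ltd\nInvoice No: INV-1042\nAmount Due: $1,234.50";
const parsed = parseKnownTemplate(sampleText);
console.log(parsed);
```

A second vendor that prints "Total payable" instead of "Amount Due" would come back with null for the amount, which is the fragility this section describes.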
2. Local OCR with Tesseract.js
Tesseract.js brings the Tesseract OCR engine into Node.js, letting you run optical character recognition locally without an external service. This solves the image problem that raw PDF extraction cannot touch: scanned invoices, photos, and image-based PDFs become machine-readable text.
The output, however, is still raw text. Tesseract.js performs invoice OCR in JavaScript at the character recognition level. It does not identify invoice fields, understand table structures, or return structured data. You still need custom post-processing logic to extract vendor names, line items, and totals from the recognized text.
Accuracy depends heavily on input quality. Clean, high-resolution scans in common languages produce reasonable results. Skewed images, low-contrast scans, or handwritten annotations degrade output substantially. Processing time is also a factor: OCR on a single page can take several seconds locally, and total processing time grows roughly in proportion to page count. For a deeper look at how traditional OCR compares to AI-driven extraction, see our guide on comparing AI and OCR approaches for invoice extraction.
3. Managed Invoice Extraction APIs
The third approach offloads OCR, layout analysis, and field extraction to a cloud service accessed through a REST API. You send a file and receive structured JSON with named fields: vendor name, invoice number, dates, line items, tax amounts, totals. No custom parsing code required.
Behind the scenes, these services combine OCR with document understanding models that recognize invoice layouts, table structures, and field relationships. They handle varied document quality, multi-page invoices, and different vendor formats automatically. The difference between getting raw text and getting a parsed data object is the difference between hours of regex engineering and a single API call.
The trade-offs are real. You need an API key and network connectivity. There is a per-page cost. You depend on a third-party service for a critical part of your pipeline. For workflows where uptime and latency requirements are strict, you need to account for network round-trips and design appropriate retry logic.
As a concrete example, the InvoiceDataExtraction platform provides an invoice extraction REST API backed by a multi-model AI system where specialized models validate each other's output for accuracy. Its Node.js SDK offers a one-call extract() method with built-in TypeScript declarations, so you send a document and get typed, structured JSON back without writing any parsing logic. Processing runs at 1 to 8 seconds per page, and the platform includes 50 free pages per month with no credit card required, so you can test it against your own documents before committing.
Which Approach Fits Your Use Case
For a single, predictable PDF template from one vendor, raw text extraction with pdf-parse may be all you need. The cost is zero and the code is straightforward.
For scanned documents from a known source, Tesseract.js plus custom post-processing gives you local control over the entire pipeline, provided you can absorb the accuracy limitations and processing overhead. If you would rather skip the custom parsing step, vision LLMs can extract structured invoice data directly from images by combining OCR and field identification in a single model call.
For production systems receiving invoices from multiple vendors in varying formats and quality levels, a managed API removes the parsing and OCR burden entirely. The per-page cost is typically far less than the engineering time required to build and maintain equivalent extraction logic in-house. If you are weighing deployment models more broadly, our article on choosing between API, SaaS, and ERP invoice capture covers the architectural trade-offs in depth.
Extracting Invoice Data with the Node.js SDK
Install the SDK from npm:
npm install @invoicedataextraction/sdk
Two requirements apply before you write any code. First, the SDK is ESM-only, so your package.json needs "type": "module" or your files need .mjs extensions. Second, you need Node.js 18 or later.
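If you want the runtime requirement to fail fast rather than surface as an obscure error later, a small startup check is one option. The sketch below is an illustration, not part of the SDK; the function name meetsMinimumNode is hypothetical, and the only outside fact it relies on is that Node.js exposes its version as process.versions.node.

```typescript
// Returns true when a Node.js version string (e.g. "18.17.0" or "v20.11.1")
// meets the minimum major version required by this guide.
function meetsMinimumNode(version: string, minimumMajor = 18): boolean {
  const majorText = version.replace(/^v/, "").split(".")[0] ?? "";
  const major = Number.parseInt(majorText, 10);
  return Number.isInteger(major) && major >= minimumMajor;
}

// Typical use at startup:
// if (!meetsMinimumNode(process.versions.node)) {
//   throw new Error(`Node.js 18+ required, found ${process.versions.node}`);
// }
console.log(meetsMinimumNode("18.17.0"));
```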
Initialize the client with your API key stored in an environment variable:
import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});
Basic Extraction with extract()
The extract() method handles the entire workflow in a single call: uploading your files, submitting the extraction job, polling until completion, and downloading the results. Here is a complete, runnable example that extracts invoice data from a folder of PDFs and saves the output as JSON:
import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});

const result = await client.extract({
  folder_path: "./invoices",
  prompt: "Extract invoice number, date, vendor name, and total amount",
  output_structure: "per_invoice",
  download: {
    formats: ["json"],
    output_path: "./output",
  },
  console_output: true,
});

console.log(`Status: ${result.status}`);
console.log(`Pages processed: ${result.pages.successful_count}`);
console.log(`Credits used: ${result.credits_deducted}`);
The string prompt tells the AI what data to extract from each invoice, and the "per_invoice" output structure produces one row per document. The JSON file lands in ./output once extraction finishes. Unlike extraction APIs that return a fixed set of fields, the prompt-driven approach means you control exactly what data the SDK extracts and how the output is structured.
Object Prompts for Precise Field Control
String prompts are fast to write, but the AI chooses the column names in your output. When you need exact field names that match your database schema or downstream API contract, use the object prompt format instead:
const result = await client.extract({
  folder_path: "./invoices",
  prompt: {
    fields: [
      { name: "Invoice Number" },
      { name: "Invoice Date", prompt: "Date issued, not due date, format YYYY-MM-DD" },
      { name: "Vendor Name" },
      { name: "Total Amount", prompt: "No currency symbol, 2 decimal places" },
    ],
    general_prompt: "Extract one record per invoice. Skip any document that is not an invoice.",
  },
  output_structure: "per_invoice",
  download: {
    formats: ["json", "csv"],
    output_path: "./output",
  },
  console_output: true,
});
Each entry in the fields array defines a column in your output. The name property sets the exact column header. The optional prompt property gives the AI specific instructions for that field. The general_prompt applies to the extraction as a whole.
If your Node.js invoice extraction pipeline feeds data into an ERP system expecting columns named Invoice Number and Total Amount, the object prompt guarantees those exact names appear in every JSON or CSV output.
TypeScript Support
The @invoicedataextraction/sdk package ships with TypeScript declarations built in. There is no separate @types package to install. You get autocomplete and compile-time type checking for every method, parameter, and response field out of the box.
import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});

const result = await client.extract({
  folder_path: "./invoices",
  prompt: "Extract invoice number, date, vendor name, net amount, tax, and total",
  output_structure: "per_invoice",
  download: {
    formats: ["json"],
    output_path: "./output",
  },
  console_output: false,
});

if (result.pages.failed_count > 0) {
  console.warn(`Failed to process ${result.pages.failed_count} pages`);
}
TypeScript invoice extraction workflows benefit here because the SDK types catch parameter typos and invalid option values at compile time rather than at runtime. If you pass output_struture instead of output_structure, tsc catches it before you ever run the code. To validate the shape of extracted data at runtime as well, you can pair the SDK with Zod to define and enforce invoice schemas, catching malformed or missing fields before they reach your database.
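As a dependency-free stand-in for the Zod approach mentioned above, the following sketch shows what runtime validation of one extracted row could look like with a plain TypeScript type guard. It assumes the downloaded JSON contains one object per invoice keyed by the field names from the object prompt example; the actual output layout is not specified in this guide, so treat the InvoiceRow shape and both function names as hypothetical.

```typescript
// Assumed row shape: one object per invoice, keyed by the requested field names.
// The real JSON layout produced by the SDK may differ; adjust the keys to match.
interface InvoiceRow {
  "Invoice Number": string;
  "Invoice Date": string; // expected as YYYY-MM-DD
  "Vendor Name": string;
  "Total Amount": string; // expected as digits with 2 decimal places
}

const REQUIRED_KEYS = ["Invoice Number", "Invoice Date", "Vendor Name", "Total Amount"];

// Returns a list of human-readable problems; an empty list means the row is valid.
function validateInvoiceRow(row: unknown): string[] {
  if (typeof row !== "object" || row === null) {
    return ["row is not an object"];
  }
  const record = row as Record<string, unknown>;
  const problems: string[] = [];

  for (const key of REQUIRED_KEYS) {
    const value = record[key];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${key} is missing or empty`);
    }
  }

  // Format checks only run on non-empty strings, so each problem is reported once.
  const date = record["Invoice Date"];
  if (typeof date === "string" && date.trim() !== "" && !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    problems.push("Invoice Date is not in YYYY-MM-DD format");
  }
  const total = record["Total Amount"];
  if (typeof total === "string" && total.trim() !== "" && !/^\d+\.\d{2}$/.test(total)) {
    problems.push("Total Amount is not a plain 2-decimal number");
  }
  return problems;
}

// Type guard so validated rows are typed as InvoiceRow downstream.
function isInvoiceRow(row: unknown): row is InvoiceRow {
  return validateInvoiceRow(row).length === 0;
}
```

Returning a list of problems rather than a boolean makes it easier to log why a row was rejected; a schema library such as Zod gives you the same information with less hand-written code.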
Express.js Endpoint for Invoice Processing
A realistic integration goes beyond a standalone script. Here is an Express.js route that accepts uploaded invoice files via multipart form data, extracts structured data from PDF invoices using the SDK, and returns the results as JSON to the API caller:
// TypeScript source (note the `as` casts): compile it or run it with a TypeScript-aware runner.
import express from "express";
import multer from "multer";
import fs from "fs/promises";
import path from "path";
import { randomUUID } from "crypto";
import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const app = express();
const upload = multer({ dest: "./tmp-uploads" });
const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});

app.post("/extract-invoices", upload.array("invoices"), async (req, res) => {
  const files = (req.files ?? []) as Express.Multer.File[];
  if (files.length === 0) {
    res.status(400).json({ error: "No invoice files uploaded" });
    return;
  }

  // Each request gets its own working directory so concurrent jobs never mix files
  const jobDir = path.join("./tmp-jobs", randomUUID());
  const outputDir = path.join(jobDir, "output");

  try {
    await fs.mkdir(outputDir, { recursive: true });

    // Move uploaded files into the job directory.
    // path.basename() drops any directory components from the client-supplied
    // name so a crafted filename cannot write outside jobDir.
    for (const file of files) {
      const dest = path.join(jobDir, path.basename(file.originalname));
      await fs.rename(file.path, dest);
    }

    const result = await client.extract({
      folder_path: jobDir,
      prompt: {
        fields: [
          { name: "Invoice Number" },
          { name: "Invoice Date", prompt: "Format YYYY-MM-DD" },
          { name: "Vendor Name" },
          { name: "Total Amount", prompt: "No currency symbol, 2 decimal places" },
        ],
        general_prompt: "Extract one record per invoice.",
      },
      output_structure: "per_invoice",
      download: {
        formats: ["json"],
        output_path: outputDir,
      },
      console_output: false,
    });

    // Read the downloaded JSON and return it to the client
    const jsonFiles = (await fs.readdir(outputDir)).filter((f) => f.endsWith(".json"));
    if (jsonFiles.length === 0) {
      throw new Error("Extraction produced no JSON output");
    }
    const extractedData = JSON.parse(await fs.readFile(path.join(outputDir, jsonFiles[0]), "utf-8"));

    res.json({
      status: result.status,
      credits_used: result.credits_deducted,
      pages_processed: result.pages.successful_count,
      pages_failed: result.pages.failed_count,
      data: extractedData,
    });
  } catch (err) {
    res.status(500).json({ error: "Extraction failed", message: (err as Error).message });
  } finally {
    // Remove the job directory whether the request succeeded or failed
    await fs.rm(jobDir, { recursive: true, force: true });
  }
});

app.listen(3000);
This pattern lets any client that can send a POST request with file attachments extract invoice data to JSON through your Node.js service. If your consumers are AI assistants rather than traditional HTTP clients, you can instead expose invoice extraction as an MCP server tool so that models call it directly through the Model Context Protocol. The temporary directories are cleaned up after each request regardless of success or failure.
Understanding the Response
Every extract() call returns a response object with fields you should inspect before passing data downstream:
- status: "completed" when the extraction finished successfully.
- credits_deducted: the number of credits consumed, one per page processed.
- pages.successful_count and pages.failed_count: check failed_count to identify documents that could not be processed. The pages.failed array lists each failed file with its name and page number.
- ai_uncertainty_notes: an array of objects where the AI flags ambiguity it encountered. Each note includes a topic, description, and suggested_prompt_additions you can use to refine your prompt for future runs.
- output: contains download URLs (xlsx_url, csv_url, json_url) for each format you requested. When you use the download option in extract(), the SDK downloads these files automatically, but the URLs remain available if you need to fetch them again.
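One way to act on these fields is to fold them into a single go/no-go summary before data moves downstream. The sketch below is an illustration under assumptions: the interfaces mirror only the fields named above, the entries of pages.failed and the type of suggested_prompt_additions are left loosely typed because this guide does not spell them out, and summarizeResult is a hypothetical helper rather than an SDK function.

```typescript
// Subset of the response described above, typed only as far as this guide specifies.
interface UncertaintyNote {
  topic: string;
  description: string;
  suggested_prompt_additions: unknown; // exact shape not specified in this guide
}

interface ExtractionResultLike {
  status: string;
  credits_deducted: number;
  pages: {
    successful_count: number;
    failed_count: number;
    failed?: unknown[]; // entry shape not specified in this guide
  };
  ai_uncertainty_notes?: UncertaintyNote[];
}

interface ResultSummary {
  ok: boolean;
  warnings: string[];
}

// Collects everything a reviewer should see before the data is trusted.
function summarizeResult(result: ExtractionResultLike): ResultSummary {
  const warnings: string[] = [];
  if (result.status !== "completed") {
    warnings.push(`Unexpected status: ${result.status}`);
  }
  if (result.pages.failed_count > 0) {
    warnings.push(`${result.pages.failed_count} page(s) failed to process`);
  }
  for (const note of result.ai_uncertainty_notes ?? []) {
    warnings.push(`AI uncertainty (${note.topic}): ${note.description}`);
  }
  return { ok: warnings.length === 0, warnings };
}
```

Whether an uncertainty note should block downstream processing or only be logged is a policy decision; the sketch treats every note as a warning.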
Batch Processing and Production Patterns
Once the basic extraction works, the next step is making it production-ready: processing thousands of files per run, handling failures without manual intervention, and producing consistent output for downstream systems.
Processing Large Batches
The SDK accepts entire directories of mixed-format invoices through the folder_path parameter. A single upload session supports up to 6,000 files with a combined size limit of 2 GB. Individual file limits are 150 MB for PDFs and 5 MB for JPG, JPEG, and PNG files.
const result = await client.extract({
  folder_path: "./invoices/2026-q1",
  prompt: invoiceSchema,
  output_structure: "per_line_item",
  download: {
    formats: ["json", "csv"],
    output_path: "./output/2026-q1",
  },
  polling: {
    interval_ms: 15000,
    timeout_ms: null,
  },
  console_output: true,
});
Setting polling.timeout_ms to null disables the timeout entirely, which is appropriate for large batch jobs where processing duration is unpredictable.
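The upload limits quoted above can be checked locally before a long batch starts. This is a minimal sketch, not an SDK feature: checkBatchLimits and the CandidateFile shape are hypothetical, and it assumes 1 MB means 1024 × 1024 bytes, which may not match how the service counts.

```typescript
// Limits quoted in this guide. 1 MB is taken as 1024 * 1024 bytes here, which
// is an assumption; the service may count in decimal megabytes instead.
const MB = 1024 * 1024;
const LIMITS = {
  maxFiles: 6000,
  maxTotalBytes: 2 * 1024 * MB,
  maxPdfBytes: 150 * MB,
  maxImageBytes: 5 * MB,
};

interface CandidateFile {
  name: string;
  sizeBytes: number;
}

// Returns a list of limit violations; an empty list means the batch fits.
function checkBatchLimits(files: CandidateFile[]): string[] {
  const problems: string[] = [];
  if (files.length > LIMITS.maxFiles) {
    problems.push(`Too many files: ${files.length} > ${LIMITS.maxFiles}`);
  }

  let totalBytes = 0;
  for (const file of files) {
    totalBytes += file.sizeBytes;
    const ext = file.name.toLowerCase().split(".").pop() ?? "";
    const isImage = ext === "jpg" || ext === "jpeg" || ext === "png";
    if (ext === "pdf" && file.sizeBytes > LIMITS.maxPdfBytes) {
      problems.push(`${file.name} exceeds the 150 MB PDF limit`);
    } else if (isImage && file.sizeBytes > LIMITS.maxImageBytes) {
      problems.push(`${file.name} exceeds the 5 MB image limit`);
    }
  }

  if (totalBytes > LIMITS.maxTotalBytes) {
    problems.push("Combined size exceeds the 2 GB session limit");
  }
  return problems;
}
```

In a real script the file list would come from reading the folder with fs.readdir and fs.stat; that part is omitted to keep the sketch self-contained.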
The output_structure parameter controls how extracted data is organized:
- "per_invoice" produces one row per invoice document. Use this when you need a flat summary of each invoice for reconciliation or payment scheduling.
- "per_line_item" produces one row per line item, with invoice-level fields (invoice number, date, vendor) repeated on each row. This is the right choice when your downstream system needs to ingest individual line items for inventory or cost-center allocation.
- "automatic" lets the extraction engine decide based on document content. Suitable for exploratory runs or mixed-complexity batches.
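Because "per_line_item" output repeats the invoice-level fields on every row, downstream code often needs to fold the rows back into one object per invoice. The sketch below shows one way to do that. The row shape is an assumption based on the field names used in this guide, not a documented output format, and groupByInvoice is a hypothetical helper.

```typescript
// Assumed per_line_item row shape: invoice-level fields repeated on every row.
// Field names follow the schema used in this guide; the real output may differ.
interface LineItemRow {
  "Invoice Number": string;
  "Vendor Name": string;
  "Line Item Description": string;
  "Line Item Quantity": number;
}

interface GroupedInvoice {
  invoiceNumber: string;
  vendorName: string;
  lineItems: { description: string; quantity: number }[];
}

function groupByInvoice(rows: LineItemRow[]): GroupedInvoice[] {
  const byKey = new Map<string, GroupedInvoice>();
  for (const row of rows) {
    // Invoice numbers are only unique per vendor, so the key combines both.
    const key = `${row["Vendor Name"]}\u0000${row["Invoice Number"]}`;
    let invoice = byKey.get(key);
    if (!invoice) {
      invoice = {
        invoiceNumber: row["Invoice Number"],
        vendorName: row["Vendor Name"],
        lineItems: [],
      };
      byKey.set(key, invoice);
    }
    invoice.lineItems.push({
      description: row["Line Item Description"],
      quantity: row["Line Item Quantity"],
    });
  }
  // Map preserves insertion order, so invoices come back in first-seen order.
  return Array.from(byKey.values());
}
```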
Progress Monitoring with on_update
The on_update callback gives real-time visibility into each stage of a batch job without requiring custom polling logic. The callback receives an object containing stage, level, message, and progress.
const result = await client.extract({
  folder_path: "./invoices/incoming",
  prompt: invoiceSchema,
  output_structure: "per_invoice",
  download: {
    formats: ["json"],
    output_path: "./output",
  },
  on_update: ({ stage, level, message, progress }) => {
    const percent = progress !== null ? ` (${progress}%)` : "";
    console.log(`[${stage}] [${level}] ${message}${percent}`);
    if (level === "error") {
      // Log to your monitoring service
      alertOpsTeam({ stage, message });
    }
  },
});
The stage value progresses through "upload", "submission", "waiting", "download", and "completion". The progress field reports a value from 0 to 100 during stages that support it, or null when progress percentage is unavailable. This is enough to drive a progress bar in a CLI tool or feed status updates into a job queue dashboard.
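For the CLI progress bar case, a formatter along these lines could be passed the update object from on_update. It is a sketch: the UpdateEvent interface mirrors the four fields listed above, renderProgressLine is a hypothetical helper, and the "info" level used in the example is an assumed value (this guide only shows "error").

```typescript
// Mirrors the fields this guide lists for the on_update callback argument.
interface UpdateEvent {
  stage: string;
  level: string;
  message: string;
  progress: number | null;
}

// Renders a fixed-width text bar, or just the message when no percentage is available.
function renderProgressLine(update: UpdateEvent, width = 20): string {
  if (update.progress === null) {
    return `[${update.stage}] ${update.message}`;
  }
  const clamped = Math.min(100, Math.max(0, update.progress));
  const filled = Math.round((clamped / 100) * width);
  const bar = "#".repeat(filled) + "-".repeat(width - filled);
  return `[${update.stage}] [${bar}] ${clamped}% ${update.message}`;
}

// "info" is an assumed level value used only for illustration.
console.log(renderProgressLine({ stage: "upload", level: "info", message: "Uploading", progress: 50 }, 10));
```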
Structured Error Handling
When a call fails, the SDK throws an error carrying a structured error.body.error object containing code, message, retryable, and details. A production error handler should branch on the error code and respect the retryable flag.
async function processInvoiceBatch(folderPath, schema) {
  try {
    const { credits_balance } = await client.getCreditsBalance();
    console.log(`Available credits: ${credits_balance}`);

    const result = await client.extract({
      folder_path: folderPath,
      prompt: schema,
      output_structure: "per_invoice",
      download: {
        formats: ["json"],
        output_path: "./output",
      },
      polling: {
        timeout_ms: null,
      },
    });

    return result;
  } catch (error) {
    const { code, message, retryable } = error.body?.error || {};

    switch (code) {
      case "INSUFFICIENT_CREDITS":
        console.error(`Not enough credits. ${message}`);
        // Notify billing team, do not retry
        break;
      case "SDK_TIMEOUT_ERROR":
        console.error("Extraction timed out. Consider setting timeout_ms to null.");
        break;
      case "SDK_UPLOAD_ERROR":
      case "SDK_NETWORK_ERROR":
        if (retryable) {
          console.warn(`Retryable error (${code}). Scheduling retry.`);
          return scheduleRetry(folderPath, schema);
        }
        break;
      default:
        console.error(`Extraction failed: [${code}] ${message}`);
    }

    throw error;
  }
}
Calling getCreditsBalance() before submitting a large batch lets you confirm that enough credits are available before any files are uploaded. The example above only logs the balance, so compare it against your expected page count if you want to abort early. Credits are consumed at one credit per successfully processed page, and failed pages are not charged. For rate limiting, the SDK handles retries automatically, so your code does not need to implement backoff logic for RATE_LIMITED responses.
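The handler above calls scheduleRetry, which this guide does not define. One possible shape for it is a bounded retry loop with exponential backoff, sketched below. Both helper names are hypothetical, the delay values are arbitrary starting points, and the only SDK behavior assumed is the retryable flag on error.body.error described in this section.

```typescript
// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

type RetryableShape = { body?: { error?: { retryable?: boolean } } } | null | undefined;

// Runs `task`, retrying only when the thrown error is marked retryable in the
// structured error body, and giving up after `maxAttempts` total attempts.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      const retryable = (error as RetryableShape)?.body?.error?.retryable === true;
      if (!retryable || attempt >= maxAttempts) {
        throw error;
      }
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With this in place, a retrying call could be written as withRetry(() => client.extract(options)). The sleep parameter is injectable so the loop can be tested without real delays.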
Repeatable Extraction Configurations
The object prompt format turns extraction rules into version-controlled configuration that you define once and apply to every batch. This eliminates drift between runs and makes schema changes auditable.
// config/invoice-extraction-schema.js
export const invoiceSchema = {
  fields: [
    { name: "Vendor Name" },
    { name: "Invoice Number" },
    { name: "Invoice Date", prompt: "Date issued, NOT due date" },
    { name: "Due Date" },
    { name: "Line Item Description" },
    { name: "Line Item Quantity", prompt: "Numeric value only" },
    { name: "Line Item Unit Price", prompt: "No currency symbol, 2 decimal places" },
    { name: "Total Amount", prompt: "No currency symbol, 2 decimal places" },
    { name: "Tax Amount", prompt: "No currency symbol, 2 decimal places" },
  ],
  general_prompt:
    "Extract one record per line item. Dates in YYYY-MM-DD format. If a field is missing, return null.",
};
Storing the schema as a module means it can be imported into any processing script, tested in isolation, and tracked in version control alongside the code that uses it. When extraction requirements change, you update one file rather than hunting through scattered string prompts.
import { invoiceSchema } from "./config/invoice-extraction-schema.js";

await client.extract({
  folder_path: process.argv[2],
  prompt: invoiceSchema,
  output_structure: "per_line_item",
  download: {
    formats: ["json"],
    output_path: process.argv[3] || "./output",
  },
});
This pattern automates invoice processing in Node.js with a clean separation between extraction logic and configuration, making it straightforward to run the same schema across different batch jobs or environments.
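Keeping the schema in its own module also makes it easy to test in isolation. The following sketch shows the kind of sanity check a unit test or CI step might run against it. The interfaces mirror the object prompt format shown in this guide; findSchemaProblems is a hypothetical helper, and the rules it enforces (non-empty, unique names) are reasonable conventions rather than documented SDK requirements.

```typescript
// Mirrors the object prompt format used throughout this guide.
interface FieldSpec {
  name: string;
  prompt?: string;
}

interface ExtractionSchema {
  fields: FieldSpec[];
  general_prompt?: string;
}

// Returns a list of problems; an empty list means the schema passed every check.
function findSchemaProblems(schema: ExtractionSchema): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();

  if (schema.fields.length === 0) {
    problems.push("schema has no fields");
  }
  for (const field of schema.fields) {
    const name = field.name.trim();
    if (name === "") {
      problems.push("field with empty name");
      continue;
    }
    // Compare case-insensitively so "Vendor Name" and "vendor name" collide.
    const key = name.toLowerCase();
    if (seen.has(key)) {
      problems.push(`duplicate field name: ${name}`);
    }
    seen.add(key);
  }
  return problems;
}
```

Running a check like this before a batch catches accidental duplicate columns at review time instead of in the exported CSV.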
Output Format Selection
The download.formats array accepts any combination of "json", "csv", and "xlsx". You can request multiple formats in a single extraction call.
download: {
  formats: ["json", "xlsx"],
  output_path: "./output/march-2026"
}
JSON fits programmatic pipelines, XLSX works for finance teams reviewing data in spreadsheets, and CSV is the standard for flat-file ingestion into accounting or ERP systems. For a deeper look at the CSV path, see our guide on extracting invoice data to CSV format.
About the author
David Harding
Founder, Invoice Data Extraction
David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.