Invoice Data Extraction Node SDK
Official Node.js SDK for Invoice Data Extraction. Handles file upload, extraction submission, polling, and result download so you can go from local files to structured output in a few lines of code.
- Node.js 18 or later
- ESM only
Install
npm install @invoicedataextraction/sdk
This package is ESM only. Your project's package.json must include "type": "module" (or use .mjs file extensions). TypeScript declarations are included.
Quick Start
import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});

const result = await client.extract({
  folder_path: "./invoices",
  prompt: "Extract invoice number and total",
  output_structure: "per_invoice",
  download: {
    formats: ["xlsx"],
    output_path: "./output",
  },
  console_output: true, // remove to disable console logging
});
extract(...) uploads every file in the folder, submits the extraction, polls until it finishes, and downloads the results. The returned result is the final polling response from the API, which contains the extraction task details such as successfully processed pages, failed pages, and credits deducted.
Generate an API key from your dashboard. Every account includes 50 free pages per month. Additional credits can be purchased on a pay-as-you-go basis with no subscription needed.
Constructor
import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});
| Parameter | Required | Description |
|---|---|---|
api_key | Yes | Your API key. |
base_url | No | API base URL. Defaults to https://api.invoicedataextraction.com/v1. Only needed for testing or non-production environments. |
extract(...)
Run a complete extraction in a single call. Pass a folder path or an array of file paths, tell the SDK what to extract, and optionally download the extracted data to disk as Excel, CSV, or JSON. The method returns the extraction task results — credits deducted, successful and failed pages, and AI uncertainty notes. The SDK handles upload, submission, polling, and download internally.
Underlying API workflow: upload session → submit extraction → poll for results → download output. See File limits for size and count constraints.
const result = await client.extract({
  folder_path: "./invoices",
  prompt: "Extract invoice number, date, vendor name, and total amount",
  output_structure: "per_invoice",
  download: {
    formats: ["xlsx", "json"],
    output_path: "./output",
  },
  console_output: true, // remove to disable console logging
});
Parameters
| Parameter | Required | Description |
|---|---|---|
folder_path | One of folder_path or files | Path to a local folder. The SDK uploads every supported file in the folder (.pdf, .jpg, .jpeg, .png). Not recursive. |
files | One of folder_path or files | Array of local file paths to upload. Supported types: .pdf, .jpg, .jpeg, .png. |
prompt | Yes | Extraction instructions. String or object — see Prompt below. |
output_structure | Yes | Controls how the extracted data is structured — see Output structure below. |
task_name | No | Your label for this extraction (3–40 characters). Appears in the web dashboard. If omitted, the SDK generates one as extraction_YYYYMMDD_HHMMSS. |
exclude_columns | No | Array of system-generated columns to exclude from output. By default, a "Source File" column is added to every row indicating which uploaded file/page the data was extracted from. If your workflow requires an exact output structure, you can exclude it. Valid values: "source_file". |
download | No | Download options — see Download below. If omitted, no files are downloaded. |
polling | No | Polling options — see Polling below. |
console_output | No | Boolean. When true, the SDK logs progress to the console during upload, polling, and download. Off by default. |
on_update | No | Callback function for lifecycle updates — see on_update below. |
Output structure
Controls how the extracted data is structured:
| Value | Meaning |
|---|---|
automatic | The AI decides based on your prompt and documents. |
per_invoice | Each invoice becomes a single row (spreadsheet/CSV) or object (JSON). |
per_line_item | Each individual product/service listed within an invoice becomes its own row (spreadsheet/CSV) or object (JSON). |
Prompt
The prompt tells the AI what data to extract. It can be a string or an object.
String — describe what you want in natural language (max 2,500 characters):
prompt: "Extract invoice number, date, vendor name, and total amount"
With a string, the AI chooses output field names based on your instructions.
Object — use an object when you need exact output field names. Each name is guaranteed to appear exactly as written in the extracted data. You can also add optional per-field and general instructions:
prompt: {
  fields: [
    { name: "Invoice Number" },
    { name: "Invoice Date", prompt: "The date the invoice was issued, NOT the due date" },
    { name: "Vendor Name" },
    { name: "Total Amount", prompt: "No currency symbol, 2 decimal places" },
  ],
  general_prompt: "Extract one record per invoice or credit note. Ignore email cover letters. Dates should be in YYYY-MM-DD format.",
}
Each item in fields:
| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | The name for this data point in the output (2–50 characters). Prefer clear, descriptive names (e.g., "Invoice Number", not "Field A"). |
prompt | string | No | Specific instructions for extracting this data point (3–600 characters). Use this to clarify ambiguities or instruct special handling. |
| Field | Type | Required | Description |
|---|---|---|---|
general_prompt | string | No | Instructions that apply to the full task and across all fields (max 1,500 characters). Use this to provide special handling instructions, specify output formatting, or describe the extraction goal. |
fields must be a non-empty array.
For guidance on writing effective prompts, see the Extraction Guide.
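The limits above can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: validatePrompt is a hypothetical helper, and it assumes the documented character limits correspond to JavaScript string length, which may not match how the API counts characters.

```javascript
// Hypothetical helper (not an SDK method): checks a prompt against the
// limits documented in this section and returns a list of issues.
function validatePrompt(prompt) {
  const issues = [];
  if (typeof prompt === "string") {
    if (prompt.length > 2500) {
      issues.push("string prompt must be at most 2,500 characters");
    }
    return issues;
  }
  if (prompt === null || typeof prompt !== "object") {
    return ["prompt must be a string or an object"];
  }
  if (!Array.isArray(prompt.fields) || prompt.fields.length === 0) {
    issues.push("fields must be a non-empty array");
  } else {
    prompt.fields.forEach((field, index) => {
      const name = typeof field?.name === "string" ? field.name : "";
      if (name.length < 2 || name.length > 50) {
        issues.push(`fields[${index}].name must be 2-50 characters`);
      }
      if (field?.prompt !== undefined) {
        const length = String(field.prompt).length;
        if (length < 3 || length > 600) {
          issues.push(`fields[${index}].prompt must be 3-600 characters`);
        }
      }
    });
  }
  if (
    prompt.general_prompt !== undefined &&
    String(prompt.general_prompt).length > 1500
  ) {
    issues.push("general_prompt must be at most 1,500 characters");
  }
  return issues;
}
```

An empty array means the prompt passed these local checks. The API remains the authority and may still reject a prompt with INVALID_INPUT.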
Download
When download is provided, the SDK downloads output files after a successful extraction.
download: {
  formats: ["xlsx", "csv", "json"],
  output_path: "./output",
}
| Field | Required | Description |
|---|---|---|
formats | Yes | Array of output formats to download. One or more of "xlsx", "csv", "json". |
output_path | Yes | Destination folder for downloaded files. Created automatically if it doesn't exist. |
Downloaded files are named {task_name}_{timestamp}.{format}.
Auto-download is a best-effort convenience. If the extraction completed but a download fails, the SDK surfaces a warning through console_output / on_update and still returns the completed extraction response. You can retry the download later using downloadOutput(...).
Auto-download does not overwrite existing files. If a generated file path already exists, the SDK skips that file and surfaces a warning.
Returns
extract(...) returns the terminal polling response from the API unchanged — for both successful and failed extractions.
Completed:
{
"success": true,
"status": "completed",
"extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"credits_deducted": 25,
"output_structure": "per_invoice",
"pages": {
"successful_count": 10,
"failed_count": 2,
"successful": [
{ "file_name": "invoice-1.pdf", "page": 1 }
],
"failed": [
{ "file_name": "damaged.pdf", "page": 1 }
]
},
"ai_uncertainty_notes": [],
"output": {
"xlsx_url": "https://...",
"csv_url": "https://...",
"json_url": "https://..."
}
}
| Field | Description |
|---|---|
credits_deducted | Credits charged for this extraction (one credit per successful page). |
output_structure | The output structure used: "per_invoice" or "per_line_item". If you submitted "automatic", this tells you what the AI chose. |
pages.successful_count | Number of pages successfully processed. |
pages.failed_count | Number of pages that failed processing. |
pages.successful | List of successfully processed pages. Each item has file_name (the uploaded file name) and page (the page number within that file). |
pages.failed | List of pages that failed processing. Same shape as successful. |
ai_uncertainty_notes | Areas where the AI made assumptions due to ambiguity in your prompt. Empty array if none. Each note has a topic, a description of what was assumed, and a suggested_prompt_additions array of prompt additions you can use to remove the ambiguity in future extractions. Each suggestion has a purpose (why you'd add it) and instructions (prompt text you can add). |
output | Presigned download URLs for each format (xlsx_url, csv_url, json_url). null if not available. URLs expire after 5 minutes — use downloadOutput(...) or getDownloadUrl(...) for a fresh URL. |
File uploads are all-or-nothing — if extract(...) returns without throwing, every file was uploaded successfully. The only failures to check for are in pages.failed, which lists pages that failed during extraction processing. If pages.failed_count is 0, all uploaded files and pages were processed successfully.
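As an illustration of checking pages.failed, the sketch below groups failed pages by file so they can be reported or re-submitted. failedPagesByFile is a hypothetical helper, not an SDK method; it relies only on the response shape shown above.

```javascript
// Hypothetical helper (not an SDK method): groups the entries of
// pages.failed by file name, e.g. { "damaged.pdf": [1, 3] }.
function failedPagesByFile(result) {
  const grouped = {};
  for (const { file_name, page } of result?.pages?.failed ?? []) {
    (grouped[file_name] ??= []).push(page);
  }
  return grouped;
}

// Usage, with `result` being the value returned by extract(...):
//   const failures = failedPagesByFile(result);
//   if (Object.keys(failures).length > 0) console.warn(failures);
```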
Failed:
When the extraction task itself fails, extract(...) returns the failed polling response — it does not throw. The failure details are in the returned response body, not on error.body.
{
"success": false,
"status": "failed",
"extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"error": {
"code": "INSUFFICIENT_CREDITS",
"message": "Insufficient credits to process this extraction.",
"retryable": false,
"details": { "credits_required": 25, "credits_balance": 15, "credits_reserved": 10 }
}
}
See the API docs for the full list of task failure codes.
When extract(...) throws
extract(...) only throws before a terminal extraction response is available — for example if upload, submission, or polling fails due to invalid input, network errors, or a polling timeout. These are SDK/API errors and are read from error.body as described in Errors.
Staged Workflow
extract(...) runs the full pipeline in one call. If you need control over individual steps — for example, uploading files in one part of your system and triggering extraction in another, running multiple extractions against the same uploaded files, or fitting each step into your own error handling and retry logic — use these methods instead:
const upload = await client.uploadFiles({
  files: ["./invoice1.pdf", "./invoice2.pdf"],
});

const submitted = await client.submitExtraction({
  upload_session_id: upload.upload_session_id,
  file_ids: upload.file_ids,
  prompt: "Extract invoice number and total",
  output_structure: "per_invoice",
});

const result = await client.waitForExtractionToFinish({
  extraction_id: submitted.extraction_id,
});

await client.downloadOutput({
  extraction_id: submitted.extraction_id,
  format: "xlsx",
  file_path: "./output/invoices.xlsx",
});
uploadFiles(...)
Upload local files without starting an extraction. Use this when you want to upload once and submit extractions separately — for example, to run different prompts against the same files, or to upload in one part of your system and extract in another.
Underlying API workflow: create upload session → upload file parts → complete each file. See File limits for size and count constraints.
| Parameter | Required | Description |
|---|---|---|
folder_path | One of folder_path or files | Path to a local folder. The SDK uploads every supported file in the folder (.pdf, .jpg, .jpeg, .png). Not recursive. |
files | One of folder_path or files | Array of local file paths to upload. Supported types: .pdf, .jpg, .jpeg, .png. |
upload_session_id | No | Your own session ID. If omitted, the SDK generates one. If an upload fails partway through, that session cannot be resumed — start a new upload with a fresh session ID. |
console_output | No | Boolean. When true, the SDK logs upload progress to the console. |
on_update | No | Callback for upload lifecycle updates — see on_update. |
Returns
{
"upload_session_id": "session_a1b2c3d4-...",
"file_ids": ["file_abc123", "file_def456"]
}
Pass upload_session_id and file_ids to submitExtraction(...) to start an extraction.
File uploads are all-or-nothing. If any file fails to upload, the method throws immediately — there is no partial success state. If uploadFiles(...) returns without throwing, every file was uploaded successfully.
The API checks your credit balance when the upload session is created. If you don't have enough credits, uploadFiles(...) throws INSUFFICIENT_CREDITS before any files are uploaded.
submitExtraction(...)
Submit an extraction task for files that have already been uploaded. The method returns immediately — it does not wait for the extraction to finish.
Underlying API endpoint: POST /extractions.
| Parameter | Required | Description |
|---|---|---|
upload_session_id | Yes | The upload session ID returned by uploadFiles(...). |
file_ids | Yes | Array of file IDs returned by uploadFiles(...). |
prompt | Yes | Extraction instructions. String or object — see Prompt. |
output_structure | Yes | Controls how the extracted data is structured — see Output structure. |
task_name | No | Your label for this extraction (3–40 characters). Appears in the web dashboard. If omitted, the SDK generates one as extraction_YYYYMMDD_HHMMSS. |
exclude_columns | No | Array of system-generated columns to exclude from output. By default, a "Source File" column is added to every row indicating which uploaded file/page the data was extracted from. If your workflow requires an exact output structure, you can exclude it. Valid values: "source_file". |
submission_id | No | Your own idempotency ID for this submission. If omitted, the SDK generates one. If a request fails or times out, retry with the same submission_id to safely retrieve the existing task instead of creating a duplicate. |
Returns
{
"success": true,
"extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"submission_state": "received"
}
The task is now queued for processing. Use extraction_id to poll for results with waitForExtractionToFinish(...) or checkExtraction(...). Submitted tasks also appear in the web dashboard where you can view progress and results.
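The submission_id parameter makes a retry safe to repeat. The sketch below shows one way to use it, under stated assumptions: submitWithRetry and isRetryable are hypothetical helpers, the caller supplies its own submissionId, and a retry is attempted only when the thrown error carries retryable: true in error.body. The backoff delay is arbitrary.

```javascript
// Hypothetical helper: true when a thrown error has a structured body
// that marks it as retryable.
function isRetryable(error) {
  return error?.body?.error?.retryable === true;
}

// Hypothetical helper: re-sends the same submission with the same
// submission_id so a retry cannot create a duplicate task.
async function submitWithRetry(client, params, submissionId, maxAttempts = 3) {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await client.submitExtraction({
        ...params,
        submission_id: submissionId, // identical on every attempt
      });
    } catch (error) {
      if (attempt >= maxAttempts || !isRetryable(error)) throw error;
      // Simple linear backoff before the next attempt (assumed value).
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
}
```

The SDK already retries RATE_LIMITED and transient INTERNAL_ERROR responses internally, so a wrapper like this mainly matters for failures that surface after those retries are exhausted.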
waitForExtractionToFinish(...)
Poll an extraction until it reaches a terminal state (completed or failed). Use this after submitExtraction(...) when you want the SDK to handle the polling loop for you.
Underlying API endpoint: GET /extractions/{extraction_id} (polled repeatedly).
| Parameter | Required | Description |
|---|---|---|
extraction_id | Yes | The extraction ID returned by submitExtraction(...). |
polling | No | Polling options — see Polling. |
console_output | No | Boolean. When true, the SDK logs polling progress to the console. |
on_update | No | Callback for waiting lifecycle updates — see on_update. |
Returns
Returns the terminal polling response from the API unchanged — the same shape documented for extract(...) returns.
When the extraction completes, you get the full result with credits_deducted, pages, ai_uncertainty_notes, and output URLs. When it fails, you get the failed response with error.code and error.message. In both cases the response is returned, not thrown.
If polling.timeout_ms is set and the extraction hasn't finished in time, the method throws SDK_TIMEOUT_ERROR. The extraction may still be processing — you can check later with checkExtraction(...) or from the web dashboard.
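A timeout can be treated as "not finished yet" rather than as a failure. The following sketch assumes SDK_TIMEOUT_ERROR is exposed as error.body.error.code, like the other codes in Errors; waitOrDefer and isSdkTimeout are hypothetical names.

```javascript
// Hypothetical helper: detects the SDK timeout code on a thrown error.
function isSdkTimeout(error) {
  return error?.body?.error?.code === "SDK_TIMEOUT_ERROR";
}

// Hypothetical helper: waits up to timeoutMs, then hands the
// extraction_id back to the caller instead of failing.
async function waitOrDefer(client, extractionId, timeoutMs) {
  try {
    const result = await client.waitForExtractionToFinish({
      extraction_id: extractionId,
      polling: { interval_ms: 10000, timeout_ms: timeoutMs },
    });
    return { finished: true, result };
  } catch (error) {
    if (!isSdkTimeout(error)) throw error;
    // The task may still be processing; check later with checkExtraction(...).
    return { finished: false, extraction_id: extractionId };
  }
}
```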
downloadOutput(...)
Download a single output file for a completed extraction to disk. Use this for manual downloads after using the staged workflow, or to retry a failed auto-download from extract(...).
Underlying API workflow: request a fresh presigned download URL → download the file → write to disk.
| Parameter | Required | Description |
|---|---|---|
extraction_id | Yes | The extraction ID whose output you want to download. |
format | Yes | A single output format: "xlsx", "csv", or "json". |
file_path | Yes | Full destination file path on disk. The file extension must match the requested format. The parent directory is created automatically if it doesn't exist. |
downloadOutput(...) does not overwrite existing files. If file_path already exists, the SDK throws SDK_FILESYSTEM_ERROR with guidance to choose a new path or remove the existing file.
The extraction must be completed before downloading. If the output is not available — for example, the extraction hasn't finished or the format was not generated — the method throws OUTPUT_NOT_AVAILABLE.
Returns
{
"success": true,
"extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"format": "xlsx",
"file_path": "./output/invoices.xlsx"
}
checkExtraction(...)
Check the current status of a submitted extraction without polling. Use this when you want a single point-in-time status check — for example, in a job queue where you check periodically on your own schedule rather than having the SDK poll using waitForExtractionToFinish(...).
Underlying API endpoint: GET /extractions/{extraction_id}.
| Parameter | Required | Description |
|---|---|---|
extraction_id | Yes | The extraction ID to check. |
Returns
Returns the current polling response from the API unchanged. The response may represent a processing, completed, or failed extraction — the same shapes documented for extract(...) returns. A processing response includes a progress field (0–100) indicating approximate completion.
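For a job queue that checks on its own schedule, the response can be reduced to a one-line summary. This is a sketch only: describeStatus is a hypothetical helper, and it assumes an in-progress response carries status: "processing", which the text above implies but does not show.

```javascript
// Hypothetical helper: turns a checkExtraction(...) response into a
// short log line based on its status.
function describeStatus(response) {
  switch (response?.status) {
    case "completed": {
      const ok = response.pages?.successful_count ?? 0;
      const failed = response.pages?.failed_count ?? 0;
      return `completed: ${ok} pages ok, ${failed} failed`;
    }
    case "failed":
      return `failed: ${response.error?.code ?? "UNKNOWN"}`;
    case "processing": // assumed status string for in-progress tasks
      return `processing: ${response.progress ?? 0}%`;
    default:
      return `unrecognized status: ${String(response?.status)}`;
  }
}

// Usage:
//   const response = await client.checkExtraction({ extraction_id });
//   console.log(describeStatus(response));
```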
getDownloadUrl(...)
Request a fresh presigned download URL for an extraction's output. Use this when you want to handle the download yourself rather than using downloadOutput(...).
Underlying API endpoint: GET /extractions/{extraction_id}/output?format={format}.
| Parameter | Required | Description |
|---|---|---|
extraction_id | Yes | The extraction ID whose output you want to download. |
format | Yes | A single output format: "xlsx", "csv", or "json". |
Returns
{
"download_url": "https://storage.example.com/...?X-Amz-Signature=...",
"format": "xlsx",
"expires_in_seconds": 300
}
The URL is a temporary, pre-authenticated link. Make a plain GET request to it — no Authorization header needed. It expires after 5 minutes.
The extraction must be completed before requesting a download URL. If the output is not available, the method throws OUTPUT_NOT_AVAILABLE.
deleteExtraction(...)
Permanently delete an extraction, its output files, and its uploaded source files. Use this when you need to remove data immediately rather than waiting for automatic data retention. Extractions that are currently being processed cannot be deleted.
If you created multiple extractions from the same upload session, deleting one will not affect the others — source files are only removed when no other extraction is using them.
Underlying API endpoint: DELETE /extractions/{extraction_id}.
| Parameter | Required | Description |
|---|---|---|
extraction_id | Yes | The extraction ID to delete. |
Returns
Returns the API response unchanged.
getCreditsBalance()
Check your current credit balance and reserved credits.
Underlying API endpoint: GET /credits/balance.
This method takes no arguments.
Returns
{
"success": true,
"credits_balance": 150,
"credits_reserved": 10
}
| Field | Description |
|---|---|
credits_balance | Your total credit balance (paid + free credits). |
credits_reserved | Credits reserved by extractions currently being processed. Your usable balance is credits_balance minus credits_reserved. |
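The usable balance described above is a simple subtraction. The sketch below applies it to the getCreditsBalance() response; usableCredits and canCoverPages are hypothetical helpers, and canCoverPages assumes one credit per page as stated for credits_deducted.

```javascript
// Hypothetical helper: credits_balance minus credits_reserved.
function usableCredits(balance) {
  return balance.credits_balance - balance.credits_reserved;
}

// Hypothetical helper: assumes each page costs one credit.
function canCoverPages(balance, pageCount) {
  return usableCredits(balance) >= pageCount;
}

// Usage:
//   const balance = await client.getCreditsBalance();
//   if (!canCoverPages(balance, 200)) console.warn("Not enough credits");
```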
File Limits
| Type | Max size |
|---|---|
| PDF | 150 MB |
| JPG / JPEG / PNG | 5 MB |
| Total batch size | 2 GB |
| Max files per session | 6,000 |
Applies to extract(...) and uploadFiles(...).
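These limits can be checked locally before an upload starts. The sketch below is illustrative only: checkFileLimits is a hypothetical helper, it assumes the 150 MB limit applies to PDF files, and it assumes 1 MB means 1024 × 1024 bytes, which this document does not specify. File sizes are passed in by the caller, for example from fs.stat.

```javascript
const MB = 1024 * 1024; // assumption: binary megabytes
const LIMITS = { pdf: 150 * MB, image: 5 * MB, batch: 2048 * MB, files: 6000 };
const IMAGE_TYPES = new Set(["jpg", "jpeg", "png"]);

// Hypothetical helper: files is an array of { name, size } with size
// in bytes. Returns a list of human-readable issues.
function checkFileLimits(files) {
  const issues = [];
  if (files.length > LIMITS.files) {
    issues.push(`batch has ${files.length} files, limit is ${LIMITS.files}`);
  }
  let totalBytes = 0;
  for (const { name, size } of files) {
    totalBytes += size;
    const ext = name.slice(name.lastIndexOf(".") + 1).toLowerCase();
    if (ext === "pdf") {
      if (size > LIMITS.pdf) issues.push(`${name} exceeds the 150 MB PDF limit`);
    } else if (IMAGE_TYPES.has(ext)) {
      if (size > LIMITS.image) issues.push(`${name} exceeds the 5 MB image limit`);
    } else {
      issues.push(`${name} is not a supported file type`);
    }
  }
  if (totalBytes > LIMITS.batch) {
    issues.push("batch exceeds the 2 GB total size limit");
  }
  return issues;
}
```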
Polling
Several methods accept a polling option to control how the SDK polls for extraction status.
| Field | Default | Description |
|---|---|---|
interval_ms | 10000 | Milliseconds between polls. Minimum 5000. |
timeout_ms | null | Maximum time to wait in milliseconds. null means no timeout. |
Used by: extract(...), waitForExtractionToFinish(...).
on_update
Optional callback that receives lifecycle updates across all stages. Use this when you want to handle progress reporting yourself — for example to update a UI, feed a progress bar, or route updates to your own logging instead of the built-in console_output.
on_update({ stage, level, message, progress, extraction_id })
| Field | Description |
|---|---|
stage | Current lifecycle stage: "upload", "submission", "waiting", "download", or "completion". |
level | "info", "warn", or "error". |
message | Human-readable status message. |
progress | Numeric progress when available, otherwise null. |
extraction_id | The extraction ID once available, otherwise null. |
Used by: extract(...), uploadFiles(...), waitForExtractionToFinish(...).
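One possible on_update handler is sketched below. makeUpdateCollector is a hypothetical helper, not part of the SDK; it formats each update as a log line and keeps warnings and errors for later inspection. Displaying progress as a percentage is an assumption about the scale of the progress value.

```javascript
// Hypothetical helper: returns an on_update callback plus the list of
// warn/error updates it has seen so far.
function makeUpdateCollector(log = console.log) {
  const warnings = [];
  const on_update = ({ stage, level, message, progress, extraction_id }) => {
    // progress is null when not available; shown as a percentage otherwise.
    const pct = typeof progress === "number" ? ` ${progress}%` : "";
    log(`[${stage}]${pct} ${message}`);
    if (level === "warn" || level === "error") {
      warnings.push({ stage, level, message, extraction_id });
    }
  };
  return { on_update, warnings };
}

// Usage:
//   const collector = makeUpdateCollector();
//   await client.extract({ /* ...options... */ on_update: collector.on_update });
//   if (collector.warnings.length > 0) console.warn(collector.warnings);
```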
Conventions
- Method names are camelCase:
extract(...),uploadFiles(...),submitExtraction(...). - All option keys are snake_case:
api_key,folder_path,output_structure,task_name,upload_session_id,file_ids,console_output,on_update,file_path. Do not use camelCase equivalents likeapiKeyorfolderPath. - Response fields are snake_case, matching the API exactly. The SDK returns the same JSON shapes as the raw API — if you have the API docs, those response examples are valid for the SDK too.
- The
filesparameter accepts local file paths as strings only. Buffers, streams, and browser file objects are not supported in v1. - ESM only — use
import InvoiceDataExtraction from "@invoicedataextraction/sdk", notrequire().
Rate Limits
All API endpoints are rate limited per API key. The SDK automatically retries rate-limited requests, but you should be aware of the limits if you are making many calls. Sustained overuse will result in a RATE_LIMITED error.
| Endpoints | Limit |
|---|---|
| Upload endpoints (create session, get part URLs, complete upload) | 600 requests per minute |
| Submit extraction | 30 requests per minute |
| Poll extraction status | 120 requests per minute |
| Download output | 30 requests per minute |
| Delete extraction | 30 requests per minute |
| Check credit balance | 60 requests per minute |
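As a worked example of the polling limit: each extraction polled every 10 seconds uses 6 requests per minute, so 120 requests per minute covers about 20 extractions polled in parallel. The sketch below generalizes that arithmetic; maxConcurrentPollers is a hypothetical helper, and it ignores any other status checks made with the same API key.

```javascript
// Hypothetical helper: how many extractions can be polled in parallel
// at a given interval without exceeding a per-minute request limit.
function maxConcurrentPollers(intervalMs, limitPerMinute = 120) {
  const pollsPerMinutePerTask = 60000 / intervalMs;
  return Math.floor(limitPerMinute / pollsPerMinutePerTask);
}

// Usage:
//   maxConcurrentPollers(10000); // default interval_ms
//   maxConcurrentPollers(5000);  // minimum interval_ms
```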
Errors
SDK methods use normal JavaScript promise rejection behavior:
- On failure, a method rejects by throwing a normal JavaScript Error.
- The structured error body is available on error.body.
- error.body uses the same JSON error shape as the API.
Error body shape:
{
"success": false,
"error": {
"code": "SOME_ERROR_CODE",
"message": "Human-readable message.",
"retryable": false,
"details": null
}
}
Read the error like this:
try {
  await client.checkExtraction({ extraction_id });
} catch (error) {
  const sdkError = error?.body;
  if (!sdkError?.error) {
    // Not a structured SDK/API error, so rethrow it unchanged.
    throw error;
  }
  console.log(sdkError.error.code);
  console.log(sdkError.error.message);
  console.log(sdkError.error.retryable);
  console.log(sdkError.error.details);
}
Every error includes a code (machine-readable), message (human-readable), and retryable (whether retrying may succeed). The message is descriptive enough to act on directly in most cases. details provides additional context when available — for example, INVALID_INPUT errors include a details.issues array with the specific validation problems.
INVALID_INPUT can come from either the SDK (caught before the request is sent) or the API. Handle it the same way in both cases.
Authentication errors (UNAUTHENTICATED, API_KEY_EXPIRED, API_KEY_REVOKED) indicate a problem with your API key — generate a new one from your dashboard.
The SDK automatically retries RATE_LIMITED and transient INTERNAL_ERROR responses, but will surface them if retries are exhausted.
Method-specific errors like EXTRACTION_NOT_FOUND, OUTPUT_NOT_AVAILABLE, and INSUFFICIENT_CREDITS are documented in the relevant method sections above. For full endpoint-level error details, see the API docs.
Task failure vs SDK/API failure:
- After an extraction task has been accepted, the task itself can still finish with status: "failed".
- That is a task outcome, not an SDK error.
- checkExtraction(...), waitForExtractionToFinish(...), and extract(...) return the polling response body for task states such as processing, completed, and failed.
- When a task ends with status: "failed", the failure details are in the returned response body, not on error.body.
- error.body is only used when the SDK method/request itself fails: validation errors, authentication errors, network failures, timeouts, or other operational failures.
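The two failure paths can be folded into one result shape so calling code handles them in a single place. This is a sketch under the conventions described above, not an SDK feature; outcomeFromResponse, outcomeFromThrown, and classify are hypothetical names.

```javascript
// Hypothetical helper: a returned polling response, including a task
// that ended with status "failed".
function outcomeFromResponse(response) {
  if (response?.status === "failed") {
    return {
      kind: "task_failed",
      code: response.error?.code ?? null,
      retryable: response.error?.retryable === true,
    };
  }
  return { kind: response?.status ?? "unknown", code: null, retryable: false };
}

// Hypothetical helper: a thrown error with a structured error.body.
function outcomeFromThrown(error) {
  const body = error?.body?.error;
  if (!body) throw error; // not a structured SDK/API error
  return { kind: "sdk_error", code: body.code, retryable: body.retryable === true };
}

// Hypothetical wrapper: run is any function returning an SDK promise,
// for example () => client.extract(options).
async function classify(run) {
  try {
    return outcomeFromResponse(await run());
  } catch (error) {
    return outcomeFromThrown(error);
  }
}
```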
SDK-specific error codes:
| Code | When the SDK uses it |
|---|---|
SDK_FILESYSTEM_ERROR | A local filesystem operation failed, such as reading an input file, creating a directory, or writing a downloaded file. |
SDK_NETWORK_ERROR | A network request failed before the SDK received a valid HTTP response. |
SDK_HTTP_ERROR | The SDK received an unexpected HTTP response shape, such as a non-JSON response or another response that does not match the documented contract. |
SDK_TIMEOUT_ERROR | waitForExtractionToFinish(...) timed out before the extraction finished. |
SDK_DOWNLOAD_ERROR | An SDK-managed download step failed. |
SDK_UPLOAD_ERROR | An SDK-managed upload orchestration step failed. |
Method to API Endpoint Mapping
| SDK Method | Underlying API |
|---|---|
extract(...) | uploadFiles(...) → submitExtraction(...) → waitForExtractionToFinish(...) → downloadOutput(...) |
uploadFiles(...) | POST /uploads/sessions → POST /uploads/sessions/{id}/parts → POST /uploads/sessions/{id}/complete |
submitExtraction(...) | POST /extractions |
waitForExtractionToFinish(...) | GET /extractions/{extraction_id} (polled) |
downloadOutput(...) | GET /extractions/{extraction_id}/output?format={format} → presigned URL download |
checkExtraction(...) | GET /extractions/{extraction_id} |
getDownloadUrl(...) | GET /extractions/{extraction_id}/output?format={format} |
deleteExtraction(...) | DELETE /extractions/{extraction_id} |
getCreditsBalance() | GET /credits/balance |