Invoice Data Extraction API

Programmatic access to our invoice extraction engine. The same AI that powers our Invoice Data Extraction platform, accessible via REST API.

Base URL

api.invoicedataextraction.com/v1

Auth

Bearer token (API key)

Output

xlsx / csv / json
Start with 50 free pages monthly — no credit card required.

LLM-ready documentation

This documentation is structured so that an AI coding assistant can build a complete, working integration for you in any language — Python, Node.js, Go, Java, C#, or anything else. Copy the full docs and paste into your preferred LLM.

Node.js script — ready to run

Get extraction results in minutes. A complete script that handles upload, extraction, polling, and download — no dependencies, just Node.js.

  1. Install Node.js 18+ — download from nodejs.org if you don't have it. To check, open a terminal and run node --version
  2. Download the script — save the Node.js example at the end of this document as extract.js
  3. Open the file in any text editor and set these three variables at the top (the other settings have sensible defaults you can adjust later):
     - API_KEY: Your API key from the dashboard
     - FOLDER_PATH: Path to your local folder of invoices
     - PROMPT: What data to extract (plain text or field definitions)
  4. Run the script — open a terminal, navigate to the folder where you saved the file, and run: node extract.js

Once it's working, you can customize the script to fit your exact workflow — or copy the full docs above into an AI assistant and ask it to adapt the script for you.

Invoice Data Extraction API

Overview

Extracting data from invoices is a three-step process:

  1. Upload — Create an upload session, upload your files in chunks, then complete each upload.
  2. Submit — Submit an extraction task referencing your uploaded files.
  3. Poll — Check the task status until processing completes, then download your results.

Each file in the session is uploaded and completed independently. If a file fails at any stage, you can still upload, complete, and submit the other files.

Extraction tasks submitted via API appear in your web dashboard alongside tasks submitted from the web app — you can view progress, results, and download output from either.

Authentication

All API requests require a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

Generate and manage your API keys from your dashboard at https://invoicedataextraction.com/dashboard?view=API. Every account includes 50 free pages per month.
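A minimal sketch of attaching the header with Node's built-in fetch (the key value and helper name are placeholders; load the real key from an environment variable in production):

```javascript
// Placeholder key; substitute your real key from the dashboard.
const API_KEY = "YOUR_API_KEY";

// Build the standard request headers, merging in any extras.
function authHeaders(extra = {}) {
  return { Authorization: `Bearer ${API_KEY}`, ...extra };
}

// Usage with the built-in fetch in Node.js 18+:
// const res = await fetch("https://api.invoicedataextraction.com/v1/credits/balance", {
//   headers: authHeaders(),
// });
```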

Error Responses

All endpoints return errors in this format:

{
  "success": false,
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable error message.",
    "retryable": false,
    "details": null
  }
}

retryable indicates whether the same request can be retried. When true, the error is transient (e.g., a temporary server issue) and retrying after a short delay may succeed. When false, the request itself is invalid and retrying will produce the same error.
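In code, this flag maps to a simple retry decision. A minimal sketch (the helper name is illustrative):

```javascript
// Decide whether a failed request is worth retrying, based on the
// `error` object from the response body documented above.
function shouldRetry(error, attempt, maxRetries = 3) {
  return Boolean(error && error.retryable) && attempt < maxRetries;
}

shouldRetry({ code: "INTERNAL_ERROR", retryable: true }, 1);   // true: transient
shouldRetry({ code: "UNAUTHENTICATED", retryable: false }, 1); // false: fix the request first
```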

The following errors can be returned by any endpoint:

| Code | Status | Retryable | Message |
| --- | --- | --- | --- |
| UNAUTHENTICATED | 401 | No | Missing or invalid bearer token. |
| API_KEY_EXPIRED | 401 | No | API key has expired. Please create a new key. |
| API_KEY_REVOKED | 401 | No | API key has been revoked. Please create a new key. |
| NOT_FOUND | 404 | No | The requested endpoint does not exist. |
| RATE_LIMITED | 429 | Yes | Too many requests. Retry after the period indicated in the Retry-After header. |
| INTERNAL_ERROR | 500 | Yes | An unexpected error occurred. |

details is always present: it is null when there is no additional context, or an object with error-specific information. For example, INVALID_INPUT errors include validation issues:

{
  "success": false,
  "error": {
    "code": "INVALID_INPUT",
    "message": "Request validation failed. Check details for specific issues.",
    "retryable": false,
    "details": {
      "issues": [
        { "message": "file_name must end with a supported extension: .pdf, .jpg, .jpeg, or .png.", "path": ["files", 0, "file_name"] }
      ]
    }
  }
}

Rate Limits

All endpoints are rate limited per API key. If you exceed the limit, the API returns a 429 status with a Retry-After header indicating how many seconds to wait before retrying.

| Endpoints | Limit |
| --- | --- |
| Upload endpoints (create session, get part URLs, complete upload) | 600 requests per minute |
| Submit extraction | 30 requests per minute |
| Poll extraction status | 120 requests per minute |
| Download output | 30 requests per minute |
| Delete extraction | 30 requests per minute |
| Check credit balance | 60 requests per minute |
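When a 429 arrives, the wait can be taken from the Retry-After header, with a simple backoff as a fallback. A sketch (the helper name is illustrative):

```javascript
// Milliseconds to wait before retrying a rate-limited request.
// `retryAfterHeader` is res.headers.get("Retry-After") (seconds, or null).
function retryDelayMs(retryAfterHeader, attempt) {
  const seconds = Number.parseInt(retryAfterHeader ?? "", 10);
  if (Number.isFinite(seconds) && seconds > 0) return seconds * 1000;
  return 1000 * attempt; // fallback: linear backoff when no header is present
}

retryDelayMs("7", 1);  // 7000: honor the server's hint
retryDelayMs(null, 3); // 3000: no hint, back off by attempt number
```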

Step 1: Create Upload Session

Creates an upload session for one or more files. Returns the part size you should use when chunking files for upload.

Endpoint

POST https://api.invoicedataextraction.com/v1/uploads/sessions

Authentication: Bearer token in the Authorization header.

Authorization: Bearer YOUR_API_KEY

Request Body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| upload_session_id | string | Yes | Your unique identifier for this upload session. Use a different ID for each new session. If a request fails or times out, you can safely retry with the same ID and files — the existing session will be returned without creating duplicates. |
| files | array | Yes | The files you want to upload (1 to 6,000 files). |

Each item in files:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file_id | string | Yes | Your unique identifier for this file within the session. Only letters, numbers, dots, underscores, colons, and hyphens (1-200 characters). You'll use this ID to reference the file when requesting part URLs and completing the upload. |
| file_name | string | Yes | The file name, including extension. Must end in .pdf, .jpg, .jpeg, or .png. |
| file_size_bytes | integer | Yes | The exact size of the file in bytes. |

File Limits

| Limit | Value |
| --- | --- |
| PDF max size | 150 MB |
| JPG / JPEG / PNG max size | 5 MB |
| Total batch size | 2 GB |
| Max files per session | 6,000 |
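A sketch of validating files locally before creating a session; the limits are hardcoded from the table above and interpreted as binary megabytes (an assumption, since the API may measure them decimally):

```javascript
// Per-type size limits from the table above, assuming binary MB.
const MAX_PDF_BYTES = 150 * 1024 * 1024;
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

// Returns null if the file is within its limit, or a description of the problem.
function checkFileSize(fileName, sizeBytes) {
  const isPdf = fileName.toLowerCase().endsWith(".pdf");
  const limit = isPdf ? MAX_PDF_BYTES : MAX_IMAGE_BYTES;
  return sizeBytes <= limit
    ? null
    : `${fileName}: ${sizeBytes} bytes exceeds the ${limit}-byte limit`;
}
```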

Example Request

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "upload_session_id": "sess_001",
    "files": [
      {
        "file_id": "file_001",
        "file_name": "invoice-1.pdf",
        "file_size_bytes": 120450
      },
      {
        "file_id": "file_002",
        "file_name": "receipt.jpg",
        "file_size_bytes": 84200
      }
    ]
  }'

Success Response (200)

{
  "success": true,
  "upload_session_id": "sess_001",
  "files": [
    {
      "file_id": "file_001",
      "file_name": "invoice-1.pdf",
      "part_size": 8388608
    },
    {
      "file_id": "file_002",
      "file_name": "receipt.jpg",
      "part_size": 8388608
    }
  ]
}

part_size is the chunk size in bytes to use when splitting files for multipart upload. This value is the same for all files in the session. Files smaller than part_size are uploaded as a single part.

Error Codes

| Code | Status | Retryable | Message |
| --- | --- | --- | --- |
| DUPLICATE_FILE_NAME | 400 | No | Each file must have a unique file_name. Check details for the duplicates. |
| DUPLICATE_FILE_ID | 400 | No | Each file must have a unique file_id. Check details for the duplicates. |
| FILE_TOO_LARGE | 400 | No | A file exceeds the maximum size for its type. Check details for the file and size limit. |
| TOTAL_UPLOAD_SIZE_LIMIT_EXCEEDED | 400 | No | The combined size of all files exceeds the maximum upload size. Check details for the limit. |
| INSUFFICIENT_CREDITS | 402 | No | Not enough credits for this upload session. Each file requires at least one credit. Check details for your balance. credits_reserved are credits held by extractions currently being processed. |
| SESSION_ALREADY_INITIALIZED | 409 | No | This upload_session_id is already in use. Please use a different upload_session_id. |

Idempotency

You can safely retry a failed or timed-out request using the same upload_session_id and files. If the session was already created, the existing session is returned. If you need a new session with different files, use a different upload_session_id.

Next Step

After creating the upload session, request presigned part URLs for each file to begin uploading.


Step 2: Get Part Upload URLs

For each file, request presigned URLs for the parts you need to upload. You then PUT your file bytes directly to these URLs.

Endpoint

POST https://api.invoicedataextraction.com/v1/uploads/sessions/{upload_session_id}/parts

{upload_session_id} is the ID you provided when creating the upload session in Step 1.

Authentication: Bearer token in the Authorization header.

Request Body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file_id | string | Yes | The file ID you used when creating the upload session. |
| part_numbers | array of integers | Yes | The part numbers you want upload URLs for (1-indexed). |

How to calculate part numbers

Use the part_size from the Step 1 response to determine how many parts your file needs:

total_parts = ceil(file_size_bytes / part_size)
part_numbers = [1, 2, 3, ..., total_parts]

Files smaller than part_size need only one part: [1].

In the examples below, part_size is 8388608 (8 MB):

  • A 120 KB file is smaller than 8 MB, so it needs only part [1].
  • A 20 MB file needs ceil(20_000_000 / 8_388_608) = 3 parts: [1, 2, 3].
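The calculation above as a Node.js helper:

```javascript
// 1-indexed part numbers for a file, given the session's part_size.
function partNumbers(fileSizeBytes, partSize) {
  const totalParts = Math.max(1, Math.ceil(fileSizeBytes / partSize));
  return Array.from({ length: totalParts }, (_, i) => i + 1);
}

partNumbers(120_450, 8_388_608);    // [1]: small file, single part
partNumbers(20_000_000, 8_388_608); // [1, 2, 3]
```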

Example: Small file (single part)

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/parts" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_001",
    "part_numbers": [1]
  }'

Response:
{
  "success": true,
  "upload_session_id": "sess_001",
  "file_id": "file_001",
  "file_name": "invoice-1.pdf",
  "part_size": 8388608,
  "part_urls": [
    {
      "part_number": 1,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    }
  ]
}

Example: Large file (multiple parts; for illustration, assume the file was registered as a 20 MB PDF)

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/parts" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_002",
    "part_numbers": [1, 2, 3]
  }'

Response:
{
  "success": true,
  "upload_session_id": "sess_001",
  "file_id": "file_002",
  "file_name": "large-report.pdf",
  "part_size": 8388608,
  "part_urls": [
    {
      "part_number": 1,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    },
    {
      "part_number": 2,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    },
    {
      "part_number": 3,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    }
  ]
}

Uploading parts

Once you have the presigned URLs, split your file into chunks and upload each one. Each presigned URL is valid for 15 minutes.

How it works

  1. Read the file as binary (Buffer, ArrayBuffer, Uint8Array, etc.).
  2. Slice into chunks of part_size bytes (returned in the Step 1 response). The last chunk will usually be smaller — that's fine.
  3. PUT each chunk to the corresponding presigned URL. Send the raw bytes as the request body — no special headers or encoding needed.
  4. Capture the ETag response header from each PUT response. The ETag is a quoted string (e.g., "d41d8cd98f00b204e9800998ecf8427e"). Keep the quotes — you'll need the exact value in Step 3.

See the full Node.js example at the end of this document.
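The four steps above can be sketched as follows (function names are illustrative; partUrls is the part_urls array from this endpoint, and partSize comes from the Step 1 response):

```javascript
import { readFile } from "fs/promises";

// Step 2 of the list above: slice one chunk out of the file buffer
// (part numbers are 1-indexed; the last chunk may be shorter).
function sliceForPart(buffer, partNumber, partSize) {
  const start = (partNumber - 1) * partSize;
  return buffer.subarray(start, start + partSize);
}

// Steps 1, 3, and 4: read the file, PUT each chunk raw, keep the quoted ETag.
async function uploadParts(filePath, partUrls, partSize) {
  const buffer = await readFile(filePath);
  const completed = [];
  for (const { part_number, url } of partUrls) {
    const res = await fetch(url, {
      method: "PUT",
      body: sliceForPart(buffer, part_number, partSize),
    });
    if (!res.ok) throw new Error(`Part ${part_number} upload failed: ${res.status}`);
    completed.push({ part_number, e_tag: res.headers.get("ETag") });
  }
  return completed; // pass directly as `parts` in Step 3
}
```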

Error Codes

| Code | Status | Retryable | Message |
| --- | --- | --- | --- |
| FILE_NOT_FOUND | 404 | No | This file_id was not registered when the upload session was created. Check the file_id and upload_session_id. |
| FILE_NOT_UPLOADABLE | 409 | No | This file has already been completed or aborted. |

Next Step

After uploading all parts for a file, complete the upload with the ETags from each part.


Step 3: Complete File Upload

After uploading all parts for a file, call this endpoint with the ETags to finalize the upload. Call this once per file.

Endpoint

POST https://api.invoicedataextraction.com/v1/uploads/sessions/{upload_session_id}/complete

{upload_session_id} is the ID you provided when creating the upload session in Step 1.

Authentication: Bearer token in the Authorization header.

Request Body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file_id | string | Yes | The file ID you used when creating the upload session. |
| parts | array | Yes | The part numbers and ETags from your part uploads. |

Each item in parts:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| part_number | integer | Yes | The part number (matches what you requested in Step 2). |
| e_tag | string | Yes | The ETag returned in the response header when you uploaded this part. Include the surrounding quotes (e.g., "\"a1b2c3...\""). |
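In code the quoting takes care of itself: the ETag header value already includes the quotes, and JSON.stringify escapes them into the "\"...\"" form shown above. A sketch of building this request body from the parts collected during upload (helper name illustrative):

```javascript
// `parts` is an array of { part_number, e_tag } where e_tag is the raw
// ETag header value, quotes included, captured from each PUT response.
function buildCompleteBody(fileId, parts) {
  return JSON.stringify({ file_id: fileId, parts });
}

const body = buildCompleteBody("file_001", [{ part_number: 1, e_tag: '"a1b2c3"' }]);
// JSON.parse(body).parts[0].e_tag === '"a1b2c3"' (the quotes survive the round-trip)
```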

Example Request

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/complete" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_001",
    "parts": [
      {
        "part_number": 1,
        "e_tag": "\"a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4\""
      }
    ]
  }'

For a multi-part file:

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/complete" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_002",
    "parts": [
      { "part_number": 1, "e_tag": "\"a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4\"" },
      { "part_number": 2, "e_tag": "\"f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3\"" },
      { "part_number": 3, "e_tag": "\"9876543210ab9876543210ab9876543210\"" }
    ]
  }'

Success Response (200)

{
  "success": true,
  "upload_session_id": "sess_001",
  "file_id": "file_001",
  "file_name": "invoice-1.pdf"
}

Idempotency

If a file has already been completed, calling this endpoint again returns a success response. This makes it safe to retry if your connection drops before you receive the response.

Error Codes

| Code | Status | Retryable | Message |
| --- | --- | --- | --- |
| FILE_NOT_FOUND | 404 | No | This file_id was not registered when the upload session was created. Check the file_id and upload_session_id. |
| FILE_ABORTED | 409 | No | This file has been aborted and can no longer be completed. |
| INVALID_COMPLETION_PARTS | 400 | No | The parts provided to complete this file upload are invalid. Check details for the specific reason. |
| OBJECT_SIZE_MISMATCH | 422 | No | The uploaded file size does not match the file_size_bytes declared when the upload session was created. Check details for the declared and actual sizes. |
| UPLOAD_ID_NOT_FOUND | 409 | No | This upload session is no longer available. Please create a new upload session and re-upload your files. |
| UPLOAD_COMPLETE_FAILED | 502 | Yes | File upload completion failed. This may be a temporary issue — please retry. |

Next Step

After completing all files, submit an extraction task.


Step 4: Submit Extraction Task

Submit an extraction task referencing your uploaded files. You tell the API what data to extract using a prompt.

Endpoint

POST https://api.invoicedataextraction.com/v1/extractions

Authentication: Bearer token in the Authorization header.

Request Body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| submission_id | string | Yes | Your unique identifier for this submission. If a request fails or times out, retry with the same submission_id to safely retrieve the existing task instead of creating a duplicate. Use a different ID for each new extraction task (e.g., a UUID). |
| upload_session_id | string | Yes | The upload session ID from Step 1. |
| file_ids | array of strings | Yes | The file IDs to include in this extraction. Must reference files that were completed in Step 3. |
| task_name | string | Yes | Your own label for this extraction task, for your internal reference (3-40 characters). |
| prompt | string or object | Yes | Your extraction instructions. See below. |
| output_structure | string | Yes | "automatic", "per_invoice", or "per_line_item". |
| options | object | No | Configuration options. See below. |

Output structure

Controls how the extracted data is structured:

| Value | Meaning |
| --- | --- |
| automatic | The AI decides based on your prompt and documents. |
| per_invoice | Each invoice becomes a single row (spreadsheet/CSV) or object (JSON). |
| per_line_item | Each individual product/service listed within an invoice becomes its own row (spreadsheet/CSV) or object (JSON). |

Prompt

The prompt field tells the AI what data to extract from your documents. It can be either a string or an object.

As a string — describe what you want in natural language:

"prompt": "Extract invoice number, date, vendor name, total amount, and all line items with descriptions and amounts"

As an object — define exact output field names, with optional per-field and general instructions:

"prompt": {
  "fields": [
    { "name": "Invoice Number" },
    { "name": "Invoice Date", "prompt": "The date the invoice was issued, NOT the payment due date" },
    { "name": "Vendor Name" },
    { "name": "Total Amount", "prompt": "Do not include currency symbol, use 2 decimal places" }
  ],
  "general_prompt": "One row for each product. Do not extract shipping lines."
}

Use an object when you need exact output field names — each name is guaranteed to appear exactly as written in the extracted data. With a string, the AI chooses field names based on your instructions.

For guidance on writing effective prompts, see the Prompt Guide.

Each item in fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | The name for this data point in the output (2-50 characters). Prefer clear, descriptive names (e.g., "Invoice Number", not "Field A"). |
| prompt | string | No | Specific instructions for extracting this data point. Use this to clarify ambiguities or instruct special handling. |

The prompt object also accepts one top-level field alongside fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| general_prompt | string | No | Instructions that apply to the full task (and across all fields). Use this to provide special handling instructions, specify output structure/formatting, or describe the extraction goal. |

Options

The options object is optional. All fields within it are optional and have sensible defaults.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| exclude_columns | array of strings | [] | System-generated columns to exclude from output files. By default, a "Source File" column is added to every row indicating which uploaded file/page the data was extracted from. If your workflow requires an exact output structure, you can exclude it. Valid values: "source_file". |

Example: String prompt

curl -X POST "https://api.invoicedataextraction.com/v1/extractions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "submission_id": "sub_001",
    "upload_session_id": "sess_001",
    "file_ids": ["file_001", "file_002"],
    "task_name": "January invoices",
    "prompt": "Extract invoice number, date, vendor name, and total amount",
    "output_structure": "per_invoice"
  }'

Example: Object prompt

curl -X POST "https://api.invoicedataextraction.com/v1/extractions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "submission_id": "sub_002",
    "upload_session_id": "sess_001",
    "file_ids": ["file_001", "file_002"],
    "task_name": "January invoices",
    "prompt": {
      "fields": [
        { "name": "Invoice Number" },
        { "name": "Invoice Date", "prompt": "The date the invoice was issued, NOT the payment due date" },
        { "name": "Vendor Name" },
        { "name": "Line Item Description" },
        { "name": "Line Item Amount", "prompt": "Do not include currency symbol, use 2 decimal places" }
      ],
      "general_prompt": "Dates should be in YYYY-MM-DD format. Ignore email cover letters."
    },
    "output_structure": "per_line_item"
  }'

Success Response (202)

{
  "success": true,
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "submission_state": "received"
}

The task is now queued for processing. Use the extraction_id to poll for results (Step 5).

Once submitted, the extraction task also appears in the web dashboard alongside tasks submitted from the web app — you can view its progress and results there.

Idempotency

If a request fails or times out, you can safely retry with the same submission_id. If the task was already created, the existing task is returned without creating a duplicate. Use a different submission_id for each new extraction task.

Next Step

After submitting, poll the task status until processing completes.


Step 5: Poll for Results

After submitting an extraction task, poll this endpoint until the task completes or fails.

Endpoint

GET https://api.invoicedataextraction.com/v1/extractions/{extraction_id}

{extraction_id} is returned in the Step 4 response.

Authentication: Bearer token in the Authorization header.

Response

The response always includes success, status, and extraction_id. The rest of the response depends on the status.

Processing (keep polling)

{
  "success": true,
  "status": "processing",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "progress": 42
}

progress is an integer from 0 to 100 indicating approximate completion. The task is still being processed — wait a few seconds and poll again.

Completed

{
  "success": true,
  "status": "completed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "credits_deducted": 25,
  "output_structure": "per_invoice",
  "pages": {
    "successful_count": 10,
    "failed_count": 2,
    "successful": [
      { "file_name": "invoice-1.pdf", "page": 1 },
      { "file_name": "invoice-1.pdf", "page": 2 }
    ],
    "failed": [
      { "file_name": "damaged.pdf", "page": 1 }
    ]
  },
  "ai_uncertainty_notes": [
    {
      "topic": "Documents to extract from",
      "description": "Your files often contain a 'Tax Invoice' with an attached 'Delivery Note'. I treated the 'Tax Invoice' pages as the main source of data, and ignored the attached 'Delivery Note' pages as supporting context.",
      "suggested_prompt_additions": [
        {
          "purpose": "To confirm this handling",
          "instructions": ["Extract from 'Tax Invoice' only"]
        },
        {
          "purpose": "To extract from both",
          "instructions": ["Extract from 'Tax Invoice' and 'Delivery Note'"]
        }
      ]
    }
  ],
  "output": {
    "xlsx_url": "https://storage.example.com/...?X-Amz-Signature=...",
    "csv_url": "https://storage.example.com/...?X-Amz-Signature=...",
    "json_url": "https://storage.example.com/...?X-Amz-Signature=..."
  }
}

| Field | Description |
| --- | --- |
| credits_deducted | The number of credits charged for this extraction (one credit per successful page). |
| output_structure | The output structure used: "per_invoice" or "per_line_item". If you submitted with "automatic", this tells you what the AI chose. |
| pages.successful_count | Number of pages successfully processed. |
| pages.failed_count | Number of pages that failed processing. |
| pages.successful | List of successfully processed pages. Each item has file_name (the uploaded file name) and page (the page number within that file). |
| pages.failed | List of pages that failed processing. Same shape as successful. |
| ai_uncertainty_notes | Areas where the AI made assumptions due to ambiguity in your prompt. Empty array if none. Each note has a topic, a description of what was assumed, and a suggested_prompt_additions array of prompt additions you can use to remove the ambiguity in future extractions. Each item has a purpose (why you'd add it) and instructions (prompt text you can add). |
| output.xlsx_url | Presigned download URL for the Excel (.xlsx) file. null if not available. |
| output.csv_url | Presigned download URL for the CSV file. null if not available. |
| output.json_url | Presigned download URL for the JSON file. null if not available. |

Download URLs are temporary, pre-authenticated URLs. To download a file, make a plain GET request to the URL — no Authorization header or other authentication needed. URLs expire after 5 minutes. If a URL has expired, use the download endpoint to get a fresh one.
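A sketch of picking and downloading one output file (helper names are illustrative; note the plain GET with no Authorization header):

```javascript
import { writeFile } from "fs/promises";

// Pick the presigned URL for a format ("xlsx", "csv", or "json") from the
// `output` object in the completed response; null if not available.
function outputUrl(output, format) {
  return output[`${format}_url`] ?? null;
}

// Plain GET: the URL is already authenticated, so no bearer header is sent.
async function downloadOutput(url, destPath) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed: ${res.status}`);
  await writeFile(destPath, Buffer.from(await res.arrayBuffer()));
}

// await downloadOutput(outputUrl(result.output, "xlsx"), "results.xlsx");
```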

Failed

When an extraction fails, the response uses the standard error format plus status: "failed" and extraction_id.

INSUFFICIENT_CREDITS: credits_balance is your total credit balance. credits_reserved are credits held by extractions currently being processed (your available credits = balance minus reserved).

{
  "success": false,
  "status": "failed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "error": {
    "code": "INSUFFICIENT_CREDITS",
    "message": "Insufficient credits to process this extraction. Check details for your balance and required credits.",
    "retryable": false,
    "details": {
      "credits_required": 25,
      "credits_balance": 15,
      "credits_reserved": 10
    }
  }
}

FILE_PAGE_LIMIT_EXCEEDED / ENCRYPTED_FILE: details.file_names lists the affected files.

{
  "success": false,
  "status": "failed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "error": {
    "code": "ENCRYPTED_FILE",
    "message": "One or more files are encrypted. Remove the encryption and re-upload. Check details for the affected files.",
    "retryable": false,
    "details": {
      "file_names": ["protected-invoice.pdf"]
    }
  }
}

All other error codes have details: null:

| Code | Retryable | Message |
| --- | --- | --- |
| CONCURRENT_TASK_LIMIT | Yes | Too many extractions running at once. Wait for one to complete, then retry. |
| NO_PAGES_FOUND | No | No extractable pages found. Files may be empty or corrupted. |
| PROMPT_REJECTED | No | The prompt did not describe data extraction. Please revise your prompt. |
| PROMPT_UNCLEAR | No | The AI could not understand the prompt well enough. Please adjust your instructions. |
| FILE_SIZE_LIMIT_EXCEEDED | No | A file exceeded the size limit during processing. Split large files and retry. |
| SUBMISSION_STALLED | Yes | This extraction was not picked up for processing. Please resubmit. |
| EXTRACTION_NOT_FOUND | No | No extraction found for this extraction_id. |
| INTERNAL_ERROR | Yes | An unexpected error occurred. Retry after a short delay. |

Polling Strategy

Poll no more frequently than every 5 seconds. Processing time depends on the number and size of your files.

while status == "processing":
    wait 5+ seconds
    GET /extractions/{extraction_id}

if success == false:
    check error.retryable — if true, wait and resubmit; if false, fix the issue first
else if status == "completed":
    download output files
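The same strategy in Node.js (names mirror these docs; base URL and key handling follow the conventions used elsewhere in this document):

```javascript
const API_BASE = "https://api.invoicedataextraction.com/v1";

// True once the task has left the "processing" state (completed or failed).
function isDone(body) {
  return body.status !== "processing";
}

// Poll every `intervalMs` (at least 5 seconds) until done; returns the final body.
async function pollExtraction(extractionId, apiKey, intervalMs = 5000) {
  for (;;) {
    const res = await fetch(`${API_BASE}/extractions/${extractionId}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const data = await res.json();
    if (isDone(data)) return data;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```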

Next Step

Download the output files using the URLs in the response. If a download URL has expired, request a fresh one.


Step 6: Download Output

If a download URL from the polling response has expired (URLs are valid for 5 minutes), request a fresh one.

Endpoint

GET https://api.invoicedataextraction.com/v1/extractions/{extraction_id}/output?format={format}

Authentication: Bearer token in the Authorization header.

Query Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| format | Yes | xlsx, csv, or json |

Example Request

curl "https://api.invoicedataextraction.com/v1/extractions/a1b2c3d4-e5f6-7890-abcd-ef1234567890/output?format=xlsx" \
  -H "Authorization: Bearer $API_KEY"

Success Response (200)

{
  "download_url": "https://storage.example.com/...?X-Amz-Signature=...",
  "format": "xlsx",
  "expires_in_seconds": 300
}

Error Codes

| Code | Status | Retryable | Message |
| --- | --- | --- | --- |
| EXTRACTION_NOT_FOUND | 404 | No | No extraction found for this extraction_id. |
| OUTPUT_NOT_AVAILABLE | 404 | No | Output is not available. The extraction may not be completed, or this format was not generated. |

Delete Extraction

Permanently deletes an extraction, its output files, and its uploaded source files. Extractions that are currently being processed cannot be deleted.

Note: Deleting an extraction removes the uploaded source files associated with it. If you created multiple extractions from the same upload session, deleting one will not affect the others — source files are only removed when no other extraction is using them.

Our standard data retention policies apply automatically — uploaded documents and processing data are deleted on a schedule. Use this endpoint if you need to delete an extraction and its data immediately rather than waiting for automatic retention.

Endpoint

DELETE https://api.invoicedataextraction.com/v1/extractions/{extraction_id}

Authentication: Bearer token in the Authorization header.

Example Request

curl -X DELETE "https://api.invoicedataextraction.com/v1/extractions/a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  -H "Authorization: Bearer $API_KEY"

Success Response (200)

{
  "success": true
}

Error Codes

| Code | Status | Retryable | Message |
| --- | --- | --- | --- |
| EXTRACTION_NOT_FOUND | 404 | No | No extraction found for this extraction_id. |
| EXTRACTION_IN_PROGRESS | 409 | No | This extraction is currently being processed and cannot be deleted. Wait for it to complete or fail, then try again. |

Check Credit Balance

Returns your current credit balance, including credits reserved by extractions that are currently being processed.

Endpoint

GET https://api.invoicedataextraction.com/v1/credits/balance

Authentication: Bearer token in the Authorization header.

Example Request

curl "https://api.invoicedataextraction.com/v1/credits/balance" \
  -H "Authorization: Bearer $API_KEY"

Success Response (200)

{
  "success": true,
  "credits_balance": 150,
  "credits_reserved": 10
}

| Field | Description |
| --- | --- |
| credits_balance | Your total credit balance (paid + free credits). |
| credits_reserved | Credits reserved by extractions currently being processed. Up to this amount will be deducted when processing completes, depending on the number of successful pages. Your usable balance is credits_balance minus credits_reserved. |
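The usable balance is a simple subtraction over the two response fields; a sketch (the helper name is illustrative):

```javascript
// Credits actually available for a new extraction right now.
function usableCredits({ credits_balance, credits_reserved }) {
  return credits_balance - credits_reserved;
}

usableCredits({ credits_balance: 150, credits_reserved: 10 }); // 140
```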

Node.js Example

A ready-to-run script that handles the full workflow — reads files from a local folder, uploads them, submits an extraction task, polls until completion, and downloads the results. No dependencies beyond Node.js 18+.

Save this as extract.js, set the three configuration variables at the top (API_KEY, FOLDER_PATH, PROMPT), and run with node extract.js. You'll have extraction results in minutes.

import { readdir, readFile, stat, writeFile, mkdir } from "fs/promises";
import { join, extname } from "path";

// ---------------------------------------------------------------------------
// Configuration — set these before running
// ---------------------------------------------------------------------------

// Your API key. Get one at: https://invoicedataextraction.com/dashboard?view=API
// IMPORTANT: This is hardcoded here for simplicity. In production, load from an
// environment variable (e.g. process.env.IDE_API_KEY) and never commit to Git.
const API_KEY = "YOUR_API_KEY";

// Absolute path to the local folder containing the files you want to process.
const FOLDER_PATH = "/Users/you/Documents/invoices";

// Tell the AI what data to extract from each document (plain-text instruction).
const PROMPT = "Extract invoice number, date, vendor name, and total amount";
// For exact output column names, pass an object instead:
//   const PROMPT = { fields: [{ name: "Invoice Number" }, { name: "Total", prompt: "No currency symbol" }], general_prompt: "..." };

// A label for this extraction task (3-40 characters). Used in your dashboard and output filenames.
const TASK_NAME = "My extraction task";

// How rows are grouped in the output: "automatic" (AI decides), "per_invoice", or "per_line_item".
const OUTPUT_STRUCTURE = "automatic";

// Which output formats to download. Any combination of "xlsx", "csv", "json".
const DOWNLOAD_FORMATS = ["xlsx", "csv", "json"];

// ---------------------------------------------------------------------------
// Internal constants — no changes needed
// ---------------------------------------------------------------------------

const API_BASE = "https://api.invoicedataextraction.com/v1";
const SUPPORTED_EXTENSIONS = new Set([".pdf", ".jpg", ".jpeg", ".png"]);
const MAX_RETRIES = 3;

async function apiRequest(path, body) {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    const response = await fetch(`${API_BASE}${path}`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    });
    const text = await response.text();
    let data;
    try {
      data = JSON.parse(text);
    } catch {
      // Non-JSON response — could be a Cloudflare rate limit or infrastructure error.
      // Retry on 429/503, throw on anything else.
      if ((response.status === 429 || response.status === 503) && attempt < MAX_RETRIES) {
        const delayMs = 5000 * attempt;
        console.warn(`Non-JSON ${response.status} response, retrying in ${delayMs / 1000}s...`);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        continue;
      }
      throw new Error(`API returned non-JSON response (${response.status}): ${text.slice(0, 200)}`);
    }
    if (data.success) return data;

    // If the error is retryable and we have attempts left, wait and retry
    if (data.error?.retryable && attempt < MAX_RETRIES) {
      // Use the Retry-After header if present (rate limit responses), otherwise a linearly growing delay
      const retryAfter = response.headers.get("Retry-After");
      const delayMs = retryAfter ? parseInt(retryAfter, 10) * 1000 : 1000 * attempt;
      console.warn(`Retryable error (${data.error.code}), retrying in ${delayMs / 1000}s...`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      continue;
    }

    throw new Error(
      `API error: ${data.error?.code ?? response.status} — ${data.error?.message ?? "unknown error"}` +
        (data.error?.details ? `\nDetails: ${JSON.stringify(data.error.details)}` : "")
    );
  }
}

// ---------------------------------------------------------------------------
// Step 1: Discover local files and create an upload session
// ---------------------------------------------------------------------------

// Scan the folder for supported file types
const entries = await readdir(FOLDER_PATH);
const files = [];

for (const entry of entries) {
  // Skip unsupported file types and subfolders
  const ext = extname(entry).toLowerCase();
  if (!SUPPORTED_EXTENSIONS.has(ext)) continue;
  const filePath = join(FOLDER_PATH, entry);
  const fileStat = await stat(filePath);
  if (!fileStat.isFile()) continue;

  // Add this file to the list with its size in bytes.
  // file_id must be unique within the session and can only contain letters, numbers,
  // dots, underscores, colons, and hyphens (no spaces). Use your own IDs (e.g., database
  // row IDs, UUIDs, or a simple counter).
  files.push({
    file_id: `file_${files.length + 1}`,
    file_name: entry,
    file_size_bytes: fileStat.size,
    localPath: filePath, // kept locally — not sent to the API
  });
}

// Stop early if nothing was found: the part-size lookup below assumes at least one file.
if (files.length === 0) {
  console.error(`No supported files (${[...SUPPORTED_EXTENSIONS].join(", ")}) found in ${FOLDER_PATH}`);
  process.exit(1);
}

// Optional: before uploading, you could calculate the credits required and check
// your balance. Each page costs one credit — for PDFs, count the pages; for
// images, each file is one credit. Then call GET /credits/balance to compare
// against your available balance (credits_balance minus credits_reserved).
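// A minimal sketch of that pre-check (the response field names below follow the
// comment above; verify them against the /credits/balance reference before relying
// on them, and note that estimatedPages is your own page count, not an API value):
//   const balanceRes = await fetch(`${API_BASE}/credits/balance`, {
//     headers: { Authorization: `Bearer ${API_KEY}` },
//   });
//   const { credits_balance, credits_reserved } = await balanceRes.json();
//   if (credits_balance - credits_reserved < estimatedPages) process.exit(1);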

// Generate a unique ID for this upload session (must be different for each new session)
const uploadSessionId = `session_${Date.now()}`;

// Create the upload session — registers all files with the API
let session;
try {
  session = await apiRequest("/uploads/sessions", {
    upload_session_id: uploadSessionId,
    files: files.map(({ file_id, file_name, file_size_bytes }) => ({
      file_id,
      file_name,
      file_size_bytes,
    })),
  });
} catch (error) {
  // Session creation failure is fatal — no files can be uploaded without a session
  console.error(`Failed to create upload session: ${error.message}`);
  process.exit(1);
}

console.log(`Upload session created: ${session.upload_session_id} (${files.length} files)`);

// The chunk size in bytes — always the same for all files in the session, so we read it from the first
const partSize = session.files[0].part_size;

// ---------------------------------------------------------------------------
// Steps 2 & 3: For each file — upload chunks, then complete the upload
// ---------------------------------------------------------------------------

const completedFileIds = [];

for (const file of files) {
  try {
    // Read the entire file into memory as a binary buffer
    const fileBuffer = await readFile(file.localPath);

    // Calculate how many parts this file needs
    const totalParts = Math.ceil(fileBuffer.length / partSize);
    const partNumbers = Array.from({ length: totalParts }, (_, i) => i + 1);

    // Request a presigned upload URL for each part
    const partsData = await apiRequest(`/uploads/sessions/${uploadSessionId}/parts`, {
      file_id: file.file_id,
      part_numbers: partNumbers,
    });

    // Upload each chunk to its presigned URL via PUT
    const completedParts = [];

    for (const { part_number, url } of partsData.part_urls) {
      // Slice the file buffer into a chunk for this part
      const start = (part_number - 1) * partSize;
      const end = Math.min(start + partSize, fileBuffer.length);
      const chunk = fileBuffer.subarray(start, end);

      // PUT the raw bytes directly to the presigned URL
      const putResponse = await fetch(url, { method: "PUT", body: chunk });
      if (!putResponse.ok) {
        const errorBody = await putResponse.text();
        throw new Error(
          `Upload failed for ${file.file_name} part ${part_number}: ${putResponse.status} ${putResponse.statusText}\n${errorBody}`
        );
      }

      // Save the ETag — needed to complete the upload in Step 3. Without it the
      // upload cannot be completed, so fail fast with a clear message.
      const eTag = putResponse.headers.get("etag");
      if (!eTag) {
        throw new Error(`No ETag returned for ${file.file_name} part ${part_number}`);
      }
      completedParts.push({ part_number, e_tag: eTag });
    }

    console.log(`Uploaded: ${file.file_name} (${totalParts} part${totalParts > 1 ? "s" : ""})`);

    // Complete the file upload with the collected ETags
    await apiRequest(`/uploads/sessions/${uploadSessionId}/complete`, {
      file_id: file.file_id,
      parts: completedParts,
    });

    console.log(`Completed: ${file.file_name}`);
    completedFileIds.push(file.file_id);
  } catch (error) {
    // By default, abort on any file failure to avoid silent partial uploads.
    // If you'd prefer to continue with remaining files, remove the process.exit.
    console.error(`Failed: ${file.file_name} — ${error.message}`);
    process.exit(1);
  }
}

// All files uploaded and completed successfully
console.log(`\n${completedFileIds.length} files ready for extraction.`);

// ---------------------------------------------------------------------------
// Steps 4 & 5: Submit the extraction task and poll until it completes
// ---------------------------------------------------------------------------

// Retryable polling errors (e.g., concurrent task limit, temporary server issues) trigger
// a fresh submission. Non-retryable errors require action from you — the log message tells
// you what to fix before re-running the script.

// Optional: human-readable guidance for non-retryable error codes (see error reference above).
// This just improves the console output — the API works the same without it.
const NON_RETRYABLE_GUIDANCE = {
  INSUFFICIENT_CREDITS: "Purchase credits at https://invoicedataextraction.com/dashboard?view=Billing then re-run this script.",
  FILE_PAGE_LIMIT_EXCEEDED: "Split the affected files into smaller documents and re-upload.",
  ENCRYPTED_FILE: "Remove encryption from the affected files and re-upload.",
  NO_PAGES_FOUND: "Check that your files are valid and contain extractable content.",
  PROMPT_REJECTED: "Revise your prompt to clearly describe what data to extract.",
  PROMPT_UNCLEAR: "Revise your prompt with clearer instructions and re-run.",
  FILE_SIZE_LIMIT_EXCEEDED: "Split large files into smaller documents and re-upload.",
};

const MAX_SUBMISSION_ATTEMPTS = 2;
const POLL_INTERVAL_MS = 5000;

let result;

for (let attempt = 1; attempt <= MAX_SUBMISSION_ATTEMPTS; attempt++) {
  // Each attempt needs a unique submission_id
  const submissionId = `sub_${Date.now()}_${attempt}`;

  const run = await apiRequest("/extractions", {
    submission_id: submissionId,
    upload_session_id: uploadSessionId,
    file_ids: completedFileIds,
    task_name: TASK_NAME,
    prompt: PROMPT,
    output_structure: OUTPUT_STRUCTURE,
  });

  console.log(`\nExtraction task submitted (extraction_id: ${run.extraction_id})`);

  // Poll until completed or failed
  let lastFailureCode = null;
  let consecutivePollErrors = 0;
  const MAX_CONSECUTIVE_POLL_ERRORS = 10;
  while (true) {
    const response = await fetch(`${API_BASE}/extractions/${run.extraction_id}`, {
      headers: { Authorization: `Bearer ${API_KEY}` },
    });
    // A gateway error page or rate-limit response may not be JSON. Treat a parse
    // failure as a transient polling issue instead of crashing the script.
    let data = {};
    try {
      data = await response.json();
    } catch {
      // fall through to the polling-issue branch below
    }

    if (data.status === "completed") {
      result = data;
      break;
    }

    if (data.status === "failed") {
      const { code, message, details, retryable } = data.error;
      console.error(`\nExtraction failed: ${code} — ${message}`);
      if (details) console.error(`Details: ${JSON.stringify(details)}`);

      if (!retryable) {
        const guidance = NON_RETRYABLE_GUIDANCE[code] || "Check the error above and re-run when resolved.";
        console.error(`\nAction required: ${guidance}`);
        process.exit(1);
      }

      // Retryable — wait then submit again.
      // Concurrent task limit means we wait longer (5 min) for other processing tasks to finish.
      // Other retryable errors are transient, so a short delay (10s) suffices.
      const delayMs = code === "CONCURRENT_TASK_LIMIT" ? 300_000 : 10_000;
      console.log(`Retrying in ${delayMs / 1000}s (attempt ${attempt}/${MAX_SUBMISSION_ATTEMPTS})...`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      lastFailureCode = code;
      break;
    }

    // Still processing — reset error counter and poll again
    if (data.status === "processing") {
      consecutivePollErrors = 0;
      console.log(`Processing... ${data.progress ?? 0}%`);
    } else {
      consecutivePollErrors++;
      console.warn(`Polling issue (HTTP ${response.status}) — retrying in ${POLL_INTERVAL_MS / 1000}s... (${consecutivePollErrors}/${MAX_CONSECUTIVE_POLL_ERRORS})`);
      if (consecutivePollErrors >= MAX_CONSECUTIVE_POLL_ERRORS) {
        console.error(`\nToo many consecutive polling errors. The extraction may still be processing — check your dashboard or retry later.`);
        process.exit(1);
      }
    }
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }

  if (result) break;

  if (lastFailureCode && attempt === MAX_SUBMISSION_ATTEMPTS) {
    const exitMessage = lastFailureCode === "CONCURRENT_TASK_LIMIT"
      ? `\nStill hitting the concurrent task limit after ${MAX_SUBMISSION_ATTEMPTS} attempts. Wait for your other extractions to finish, then re-run.`
      : `\nGave up after ${MAX_SUBMISSION_ATTEMPTS} attempts. There may be temporary service issues — please wait and try again later.`;
    console.error(exitMessage);
    process.exit(1);
  }
}

console.log(`\nExtraction completed!`);
console.log(`Credits deducted: ${result.credits_deducted}`);
console.log(`Output structure: ${result.output_structure}`);
console.log(`Pages: ${result.pages.successful_count} successful, ${result.pages.failed_count} failed`);

if (result.pages.failed_count > 0) {
  console.warn(`\nWarning: ${result.pages.failed_count} page(s) failed to extract. Data from these pages is missing from the output.`);
  for (const page of result.pages.failed) {
    console.warn(`  - ${page.file_name} (page ${page.page})`);
  }
}

if (result.ai_uncertainty_notes?.length > 0) {
  console.log(`\n--- AI Uncertainty Notes ---`);
  console.log(`The AI made assumptions in ${result.ai_uncertainty_notes.length} area(s). Review these and consider adding the suggested prompt additions to improve future extractions.\n`);
  result.ai_uncertainty_notes.forEach((note, i) => {
    console.log(`  [${i + 1}] ${note.topic}`);
    console.log(`  ${note.description}`);
    for (const suggestion of note.suggested_prompt_additions) {
      console.log(`    → ${suggestion.purpose}: "${suggestion.instructions}"`);
    }
    console.log();
  });
  console.log(`---`);
}

// ---------------------------------------------------------------------------
// Step 6: Download the output files
// ---------------------------------------------------------------------------

const timestamp = new Date().toISOString().replace(/[:.]/g, "-").slice(0, 19);
const safeName = TASK_NAME.replace(/[^a-zA-Z0-9_-]/g, "_");
await mkdir("output", { recursive: true });

for (const format of DOWNLOAD_FORMATS) {
  const url = result.output[`${format}_url`];
  if (!url) {
    console.warn(`No ${format} download available.`);
    continue;
  }
  const response = await fetch(url);
  if (!response.ok) {
    console.error(`Failed to download ${format}: ${response.status}`);
    continue;
  }
  const buffer = Buffer.from(await response.arrayBuffer());
  const outputPath = `output/${safeName}_${timestamp}.${format}`;
  await writeFile(outputPath, buffer);
  console.log(`Downloaded: ${outputPath}`);
}

console.log(`\nDone. Extraction ${result.extraction_id} completed successfully.`);

Client Libraries

Official Node.js and Python SDKs are in development. In the meantime, the API can be used with any HTTP client — see the Node.js example above for a complete reference implementation.
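
If you are adapting the reference script above to another HTTP client or language, the only non-obvious arithmetic is the part slicing in the upload loop. A standalone sketch of that calculation (the function name is illustrative, not part of the API):

```javascript
// Compute the byte ranges for a chunked upload: each part covers [start, end),
// parts are 1-indexed, and the final part may be shorter than partSize.
function partRanges(fileSize, partSize) {
  const totalParts = Math.ceil(fileSize / partSize);
  return Array.from({ length: totalParts }, (_, i) => {
    const start = i * partSize;
    return { part_number: i + 1, start, end: Math.min(start + partSize, fileSize) };
  });
}

// A 25-byte file with a 10-byte part size needs three parts; the last holds 5 bytes.
console.log(partRanges(25, 10));
// → [ { part_number: 1, start: 0, end: 10 },
//     { part_number: 2, start: 10, end: 20 },
//     { part_number: 3, start: 20, end: 25 } ]
```

Each returned range maps directly to one presigned-URL PUT: slice the file at `[start, end)`, upload it, and record the returned ETag against `part_number`.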