REST API Reference

Programmatic access to our invoice extraction engine. The same AI that powers our Invoice Data Extraction platform, accessible via REST API.

Building with Python or Node.js? Start with the Python SDK or Node.js SDK instead.

Base URL: https://api.invoicedataextraction.com/v1
Auth: Bearer token (API key)
Output: xlsx / csv / json

Use an SDK for a simpler integration

Building with Python or Node.js? Use an official SDK instead of calling the REST API directly. SDKs handle file upload, polling, and download automatically — you can go from local files to structured output in a few lines of code.

If you need to call the REST API directly — for example, from a language without an official SDK — the endpoint documentation below is a complete reference.

LLM-ready documentation for the REST API

These REST API docs are structured so that an AI coding assistant can build a complete, working HTTP integration for you in any language. Copy and paste into your preferred LLM.

Invoice Data Extraction API

Overview

Extracting data from invoices is a three-step process:

  1. Upload — Create an upload session, upload your files in chunks, then complete each upload.
  2. Submit — Submit an extraction task referencing your uploaded files.
  3. Poll — Check the task status until processing completes, then download your results.

Each file in the session is uploaded and completed independently. If a file fails at any stage, you can still upload, complete, and submit the other files.

Extraction tasks submitted via API appear in your web dashboard alongside tasks submitted from the web app — you can view progress, results, and download output from either.

Authentication

All API requests require a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

Generate and manage your API keys from your dashboard at https://invoicedataextraction.com/dashboard?view=API. Every account includes 50 free pages per month.

Error Responses

All endpoints return errors in this format:

{
  "success": false,
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable error message.",
    "retryable": false,
    "details": null
  }
}

retryable indicates whether the same request can be retried. When true, the error is transient (e.g., a temporary server issue) and retrying after a short delay may succeed. When false, the request itself is invalid and retrying will produce the same error.
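In client code, this flag maps naturally onto a retry policy: back off and retry when retryable is true, honor the Retry-After header on 429 responses, and surface the error otherwise. A minimal sketch in Node.js (the helper name and backoff values are our own, not part of the API):

```javascript
// Decide whether and when to retry, given the HTTP status, response headers,
// and the standard error envelope ({ success: false, error: { retryable, ... } }).
// Returns a delay in milliseconds, or null for non-retryable errors.
function retryDelayMs(status, headers, body, attempt) {
  // Honor Retry-After (in seconds) on rate-limit responses when provided.
  const retryAfter = headers["retry-after"];
  if (status === 429 && retryAfter) return Number(retryAfter) * 1000;
  // Transient errors: exponential backoff, capped at 30 seconds.
  if (body && body.error && body.error.retryable) {
    return Math.min(1000 * 2 ** attempt, 30000);
  }
  // Non-retryable: the request itself must be fixed before retrying.
  return null;
}
```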

The following errors can be returned by any endpoint:

Code | Status | Retryable | Message
UNAUTHENTICATED | 401 | No | Missing or invalid bearer token.
API_KEY_EXPIRED | 401 | No | API key has expired. Please create a new key.
API_KEY_REVOKED | 401 | No | API key has been revoked. Please create a new key.
NOT_FOUND | 404 | No | The requested endpoint does not exist.
RATE_LIMITED | 429 | Yes | Too many requests. Retry after the period indicated in the Retry-After header.
INTERNAL_ERROR | 500 | Yes | An unexpected error occurred.

details is always present: it is either null (no additional context) or an object with error-specific information. For example, INVALID_INPUT errors include validation issues:

{
  "success": false,
  "error": {
    "code": "INVALID_INPUT",
    "message": "Request validation failed. Check details for specific issues.",
    "retryable": false,
    "details": {
      "issues": [
        { "message": "file_name must end with a supported extension: .pdf, .jpg, .jpeg, or .png.", "path": ["files", 0, "file_name"] }
      ]
    }
  }
}

Rate Limits

All endpoints are rate limited per API key. If you exceed the limit, the API returns a 429 status with a Retry-After header indicating how many seconds to wait before retrying.

Endpoints | Limit
Upload endpoints (create session, get part URLs, complete upload) | 600 requests per minute
Submit extraction | 30 requests per minute
Poll extraction status | 120 requests per minute
Download output | 30 requests per minute
Delete extraction | 30 requests per minute
Check credit balance | 60 requests per minute

Step 1: Create Upload Session

Creates an upload session for one or more files. Returns the part size you should use when chunking files for upload.

Endpoint

POST https://api.invoicedataextraction.com/v1/uploads/sessions

Authentication: Bearer token in the Authorization header.

Authorization: Bearer YOUR_API_KEY

Request Body

Field | Type | Required | Description
upload_session_id | string | Yes | Your unique identifier for this upload session. Use a different ID for each new session. If a request fails or times out, you can safely retry with the same ID and files — the existing session will be returned without creating duplicates.
files | array | Yes | The files you want to upload (1 to 6,000 files).

Each item in files:

Field | Type | Required | Description
file_id | string | Yes | Your unique identifier for this file within the session. Only letters, numbers, dots, underscores, colons, and hyphens (1-200 characters). You'll use this ID to reference the file when requesting part URLs and completing the upload.
file_name | string | Yes | The file name, including extension. Must end in .pdf, .jpg, .jpeg, or .png.
file_size_bytes | integer | Yes | The exact size of the file in bytes.

File Limits

Type | Max Size
PDF | 150 MB
JPG / JPEG / PNG | 5 MB
Total batch size | 2 GB
Max files per session | 6,000
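Since oversized batches are rejected when the session is created, it can be worth checking these limits locally before calling the API. A hedged sketch of such a pre-check (the helper and constants are our own; whether the API interprets MB as 1024- or 1000-based is an assumption here):

```javascript
// Client-side pre-check mirroring the documented limits. Illustrative only;
// the server remains the source of truth for validation.
const MAX_PDF_BYTES = 150 * 1024 * 1024;   // 150 MB (assumed binary MB)
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;   // 5 MB
const MAX_BATCH_BYTES = 2 * 1024 * 1024 * 1024; // 2 GB
const MAX_FILES = 6000;

function validateBatch(files) {
  const issues = [];
  if (files.length < 1 || files.length > MAX_FILES) {
    issues.push(`file count must be between 1 and ${MAX_FILES}`);
  }
  let total = 0;
  for (const f of files) {
    total += f.file_size_bytes;
    const isPdf = /\.pdf$/i.test(f.file_name);
    const isImage = /\.(jpe?g|png)$/i.test(f.file_name);
    if (!isPdf && !isImage) issues.push(`${f.file_name}: unsupported extension`);
    const limit = isPdf ? MAX_PDF_BYTES : MAX_IMAGE_BYTES;
    if ((isPdf || isImage) && f.file_size_bytes > limit) {
      issues.push(`${f.file_name}: exceeds size limit for its type`);
    }
  }
  if (total > MAX_BATCH_BYTES) issues.push("total batch size exceeds 2 GB");
  return issues; // empty array means the batch passes these local checks
}
```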

Example Request

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "upload_session_id": "sess_001",
    "files": [
      {
        "file_id": "file_001",
        "file_name": "invoice-1.pdf",
        "file_size_bytes": 120450
      },
      {
        "file_id": "file_002",
        "file_name": "large-report.pdf",
        "file_size_bytes": 20000000
      }
    ]
  }'

Success Response (200)

{
  "success": true,
  "upload_session_id": "sess_001",
  "files": [
    {
      "file_id": "file_001",
      "file_name": "invoice-1.pdf",
      "part_size": 8388608
    },
    {
      "file_id": "file_002",
      "file_name": "large-report.pdf",
      "part_size": 8388608
    }
  ]
}

part_size is the chunk size in bytes to use when splitting files for multipart upload. This value is the same for all files in the session. Files smaller than part_size are uploaded as a single part.

Error Codes

Code | Status | Retryable | Message
DUPLICATE_FILE_NAME | 400 | No | Each file must have a unique file_name. Check details for the duplicates.
DUPLICATE_FILE_ID | 400 | No | Each file must have a unique file_id. Check details for the duplicates.
FILE_TOO_LARGE | 400 | No | A file exceeds the maximum size for its type. Check details for the file and size limit.
TOTAL_UPLOAD_SIZE_LIMIT_EXCEEDED | 400 | No | The combined size of all files exceeds the maximum upload size. Check details for the limit.
INSUFFICIENT_CREDITS | 402 | No | Not enough credits for this upload session. Each file requires at least one credit. Check details for your balance. credits_reserved are credits held by extractions currently being processed.
SESSION_ALREADY_INITIALIZED | 409 | No | This upload_session_id is already in use. Please use a different upload_session_id.

Idempotency

You can safely retry a failed or timed-out request using the same upload_session_id and files. If the session was already created, the existing session is returned. If you need a new session with different files, use a different upload_session_id.

Next Step

After creating the upload session, request presigned part URLs for each file to begin uploading.


Step 2: Get Part Upload URLs

For each file, request presigned URLs for the parts you need to upload. You then PUT your file bytes directly to these URLs.

Endpoint

POST https://api.invoicedataextraction.com/v1/uploads/sessions/{upload_session_id}/parts

{upload_session_id} is the ID you provided when creating the upload session in Step 1.

Authentication: Bearer token in the Authorization header.

Request Body

Field | Type | Required | Description
file_id | string | Yes | The file ID you used when creating the upload session.
part_numbers | array of integers | Yes | The part numbers you want upload URLs for (1-indexed).

How to calculate part numbers

Use the part_size from the Step 1 response to determine how many parts your file needs:

total_parts = ceil(file_size_bytes / part_size)
part_numbers = [1, 2, 3, ..., total_parts]

Files smaller than part_size need only one part: [1].

In the examples below, part_size is 8388608 (8 MB):

  • A 120 KB file is smaller than 8 MB, so it needs only part [1].
  • A 20 MB file needs ceil(20_000_000 / 8_388_608) = 3 parts: [1, 2, 3].
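The same calculation as a small Node.js helper (illustrative; the function name is ours, and it simply implements the formula above):

```javascript
// Compute the 1-indexed part numbers for a file, given the part_size
// returned by the create-session response: total_parts = ceil(size / part_size).
function partNumbers(fileSizeBytes, partSize) {
  const totalParts = Math.max(1, Math.ceil(fileSizeBytes / partSize));
  return Array.from({ length: totalParts }, (_, i) => i + 1);
}
```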

Example: Small file (single part)

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/parts" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_001",
    "part_numbers": [1]
  }'
{
  "success": true,
  "upload_session_id": "sess_001",
  "file_id": "file_001",
  "file_name": "invoice-1.pdf",
  "part_size": 8388608,
  "part_urls": [
    {
      "part_number": 1,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    }
  ]
}

Example: Large file (multiple parts)

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/parts" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_002",
    "part_numbers": [1, 2, 3]
  }'
{
  "success": true,
  "upload_session_id": "sess_001",
  "file_id": "file_002",
  "file_name": "large-report.pdf",
  "part_size": 8388608,
  "part_urls": [
    {
      "part_number": 1,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    },
    {
      "part_number": 2,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    },
    {
      "part_number": 3,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    }
  ]
}

Uploading parts

Once you have the presigned URLs, split your file into chunks and upload each one. Each presigned URL is valid for 15 minutes.

How it works

  1. Read the file as binary (Buffer, ArrayBuffer, Uint8Array, etc.).
  2. Slice into chunks of part_size bytes (returned in the Step 1 response). The last chunk will usually be smaller — that's fine.
  3. PUT each chunk to the corresponding presigned URL. Send the raw bytes as the request body — no special headers or encoding needed.
  4. Capture the ETag response header from each PUT response. The ETag is a quoted string (e.g., "d41d8cd98f00b204e9800998ecf8427e"). Keep the quotes — you'll need the exact value in Step 3.

See the full Node.js example at the end of this document.
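As a smaller sketch, the slicing in step 2 can be a pure helper (the name is ours, and it assumes the file is already in memory as a Buffer or Uint8Array); each returned chunk is then PUT to its presigned URL and the ETag captured:

```javascript
// Slice a file's bytes into part_size chunks for multipart upload.
// Returns [{ part_number, chunk }] pairs; the last chunk may be shorter.
function sliceIntoParts(bytes, partSize) {
  const parts = [];
  for (let offset = 0, n = 1; offset < bytes.length; offset += partSize, n += 1) {
    parts.push({ part_number: n, chunk: bytes.subarray(offset, offset + partSize) });
  }
  return parts;
}
```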

Error Codes

Code | Status | Retryable | Message
FILE_NOT_FOUND | 404 | No | This file_id was not registered when the upload session was created. Check the file_id and upload_session_id.
FILE_NOT_UPLOADABLE | 409 | No | This file has already been completed or aborted.

Next Step

After uploading all parts for a file, complete the upload with the ETags from each part.


Step 3: Complete File Upload

After uploading all parts for a file, call this endpoint with the ETags to finalize the upload. Call this once per file.

Endpoint

POST https://api.invoicedataextraction.com/v1/uploads/sessions/{upload_session_id}/complete

{upload_session_id} is the ID you provided when creating the upload session in Step 1.

Authentication: Bearer token in the Authorization header.

Request Body

Field | Type | Required | Description
file_id | string | Yes | The file ID you used when creating the upload session.
parts | array | Yes | The part numbers and ETags from your part uploads.

Each item in parts:

Field | Type | Required | Description
part_number | integer | Yes | The part number (matches what you requested in Step 2).
e_tag | string | Yes | The ETag returned in the response header when you uploaded this part. Include the surrounding quotes (e.g., "\"a1b2c3...\"").

Example Request

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/complete" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_001",
    "parts": [
      {
        "part_number": 1,
        "e_tag": "\"a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4\""
      }
    ]
  }'

For a multi-part file:

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/complete" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_002",
    "parts": [
      { "part_number": 1, "e_tag": "\"a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4\"" },
      { "part_number": 2, "e_tag": "\"f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3\"" },
      { "part_number": 3, "e_tag": "\"9876543210ab9876543210ab98765432\"" }
    ]
  }'

Success Response (200)

{
  "success": true,
  "upload_session_id": "sess_001",
  "file_id": "file_001",
  "file_name": "invoice-1.pdf"
}

Idempotency

If a file has already been completed, calling this endpoint again returns a success response. This makes it safe to retry if your connection drops before you receive the response.

Error Codes

Code | Status | Retryable | Message
FILE_NOT_FOUND | 404 | No | This file_id was not registered when the upload session was created. Check the file_id and upload_session_id.
FILE_ABORTED | 409 | No | This file has been aborted and can no longer be completed.
INVALID_COMPLETION_PARTS | 400 | No | The parts provided to complete this file upload are invalid. Check details for the specific reason.
OBJECT_SIZE_MISMATCH | 422 | No | The uploaded file size does not match the file_size_bytes declared when the upload session was created. Check details for the declared and actual sizes.
UPLOAD_ID_NOT_FOUND | 409 | No | This upload session is no longer available. Please create a new upload session and re-upload your files.
UPLOAD_COMPLETE_FAILED | 502 | Yes | File upload completion failed. This may be a temporary issue — please retry.

Next Step

After completing all files, submit an extraction task.


Step 4: Submit Extraction Task

Submit an extraction task referencing your uploaded files. You tell the API what data to extract using a prompt.

Endpoint

POST https://api.invoicedataextraction.com/v1/extractions

Authentication: Bearer token in the Authorization header.

Request Body

Field | Type | Required | Description
submission_id | string | Yes | Your unique identifier for this submission. If a request fails or times out, retry with the same submission_id to safely retrieve the existing task instead of creating a duplicate. Use a different ID for each new extraction task (e.g., a UUID).
upload_session_id | string | Yes | The upload session ID from Step 1.
file_ids | array of strings | Yes | The file IDs to include in this extraction. Must reference files that were completed in Step 3.
task_name | string | Yes | Your own label for this extraction task, for your internal reference (3-40 characters).
prompt | string or object | Yes | Your extraction instructions. See below.
output_structure | string | Yes | "automatic", "per_invoice", or "per_line_item".
options | object | No | Configuration options. See below.

Output structure

Controls how the extracted data is structured:

Value | Meaning
automatic | The AI decides based on your prompt and documents.
per_invoice | Each invoice becomes a single row (spreadsheet/CSV) or object (JSON).
per_line_item | Each individual product/service listed within an invoice becomes its own row (spreadsheet/CSV) or object (JSON).

Prompt

The prompt field tells the AI what data to extract from your documents. It can be either a string or an object.

As a string — describe what you want in natural language (max 2,500 characters):

"prompt": "Extract invoice number, date, vendor name, total amount, and all line items with descriptions and amounts"

As an object — define exact output field names, with optional per-field and general instructions:

"prompt": {
  "fields": [
    { "name": "Invoice Number" },
    { "name": "Invoice Date", "prompt": "The date the invoice was issued, NOT the payment due date" },
    { "name": "Vendor Name" },
    { "name": "Total Amount", "prompt": "Do not include currency symbol, use 2 decimal places" }
  ],
  "general_prompt": "One row for each product. Do not extract shipping lines."
}

Use an object when you need exact output field names — each name is guaranteed to appear exactly as written in the extracted data. With a string, the AI chooses field names based on your instructions.

For guidance on writing effective prompts, see the Extraction Guide.

Each item in fields:

Field | Type | Required | Description
name | string | Yes | The name for this data point in the output (2-50 characters). Prefer clear, descriptive names (e.g., "Invoice Number", not "Field A").
prompt | string | No | Specific instructions for extracting this data point (3–600 characters). Use this to clarify ambiguities or instruct special handling.

Alongside fields, the prompt object also accepts:

Field | Type | Required | Description
general_prompt | string | No | Instructions that apply to the full task and across all fields (max 1,500 characters). Use this to provide special handling instructions, specify output structure/formatting, or describe the extraction goal.

Options

The options object is optional. All fields within it are optional and have sensible defaults.

Field | Type | Default | Description
exclude_columns | array of strings | [] | System-generated columns to exclude from output files. By default, a "Source File" column is added to every row indicating which uploaded file/page the data was extracted from. If your workflow requires an exact output structure, you can exclude it. Valid values: "source_file".

Example: String prompt

curl -X POST "https://api.invoicedataextraction.com/v1/extractions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "submission_id": "sub_001",
    "upload_session_id": "sess_001",
    "file_ids": ["file_001", "file_002"],
    "task_name": "January invoices",
    "prompt": "Extract invoice number, date, vendor name, and total amount",
    "output_structure": "per_invoice"
  }'

Example: Object prompt

curl -X POST "https://api.invoicedataextraction.com/v1/extractions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "submission_id": "sub_002",
    "upload_session_id": "sess_001",
    "file_ids": ["file_001", "file_002"],
    "task_name": "January invoices",
    "prompt": {
      "fields": [
        { "name": "Invoice Number" },
        { "name": "Invoice Date", "prompt": "The date the invoice was issued, NOT the payment due date" },
        { "name": "Vendor Name" },
        { "name": "Line Item Description" },
        { "name": "Line Item Amount", "prompt": "Do not include currency symbol, use 2 decimal places" }
      ],
      "general_prompt": "Dates should be in YYYY-MM-DD format. Ignore email cover letters."
    },
    "output_structure": "per_line_item"
  }'

Success Response (202)

{
  "success": true,
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "submission_state": "received"
}

The task is now queued for processing. Use the extraction_id to poll for results (Step 5).

Once submitted, the extraction task also appears in the web dashboard alongside tasks submitted from the web app — you can view its progress and results there.

Idempotency

If a request fails or times out, you can safely retry with the same submission_id. If the task was already created, the existing task is returned without creating a duplicate. Use a different submission_id for each new extraction task.

Next Step

After submitting, poll the task status until processing completes.


Step 5: Poll for Results

After submitting an extraction task, poll this endpoint until the task completes or fails.

Endpoint

GET https://api.invoicedataextraction.com/v1/extractions/{extraction_id}

{extraction_id} is returned in the Step 4 response.

Authentication: Bearer token in the Authorization header.

Response

The response always includes success, status, and extraction_id. The rest of the response depends on the status.

Processing (keep polling)

{
  "success": true,
  "status": "processing",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "progress": 42
}

progress is an integer from 0 to 100 indicating approximate completion. The task is still being processed — wait a few seconds and poll again.

Completed

{
  "success": true,
  "status": "completed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "credits_deducted": 25,
  "output_structure": "per_invoice",
  "pages": {
    "successful_count": 10,
    "failed_count": 2,
    "successful": [
      { "file_name": "invoice-1.pdf", "page": 1 },
      { "file_name": "invoice-1.pdf", "page": 2 }
    ],
    "failed": [
      { "file_name": "damaged.pdf", "page": 1 }
    ]
  },
  "ai_uncertainty_notes": [
    {
      "topic": "Documents to extract from",
      "description": "Your files often contain a 'Tax Invoice' with an attached 'Delivery Note'. I treated the 'Tax Invoice' pages as the main source of data, and ignored the attached 'Delivery Note' pages as supporting context.",
      "suggested_prompt_additions": [
        {
          "purpose": "To confirm this handling",
          "instructions": ["Extract from 'Tax Invoice' only"]
        },
        {
          "purpose": "To extract from both",
          "instructions": ["Extract from 'Tax Invoice' and 'Delivery Note'"]
        }
      ]
    }
  ],
  "output": {
    "xlsx_url": "https://storage.example.com/...?X-Amz-Signature=...",
    "csv_url": "https://storage.example.com/...?X-Amz-Signature=...",
    "json_url": "https://storage.example.com/...?X-Amz-Signature=..."
  }
}

Field | Description
credits_deducted | The number of credits charged for this extraction (one credit per successful page).
output_structure | The output structure used: "per_invoice" or "per_line_item". If you submitted with "automatic", this tells you what the AI chose.
pages.successful_count | Number of pages successfully processed.
pages.failed_count | Number of pages that failed processing.
pages.successful | List of successfully processed pages. Each item has file_name (the uploaded file name) and page (the page number within that file).
pages.failed | List of pages that failed processing. Same shape as successful.
ai_uncertainty_notes | Areas where the AI made assumptions due to ambiguity in your prompt. Empty array if none. Each note has a topic, a description of what was assumed, and a suggested_prompt_additions array of prompt additions you can use to remove the ambiguity in future extractions. Each item has a purpose (why you'd add it) and instructions (prompt text you can add).
output.xlsx_url | Presigned download URL for the Excel (.xlsx) file. null if not available.
output.csv_url | Presigned download URL for the CSV file. null if not available.
output.json_url | Presigned download URL for the JSON file. null if not available.

Download URLs are temporary, pre-authenticated URLs. To download a file, make a plain GET request to the URL — no Authorization header or other authentication needed. URLs expire after 5 minutes. If a URL has expired, use the download endpoint to get a fresh one.

Failed

When an extraction fails, the response uses the standard error format plus status: "failed" and extraction_id.

INSUFFICIENT_CREDITS: credits_balance is your total credit balance; credits_reserved is credits held by extractions currently being processed (your available credits = balance minus reserved).

{
  "success": false,
  "status": "failed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "error": {
    "code": "INSUFFICIENT_CREDITS",
    "message": "Insufficient credits to process this extraction. Check details for your balance and required credits.",
    "retryable": false,
    "details": {
      "credits_required": 25,
      "credits_balance": 15,
      "credits_reserved": 10
    }
  }
}

FILE_PAGE_LIMIT_EXCEEDED / ENCRYPTED_FILE: details.file_names lists the affected files.

{
  "success": false,
  "status": "failed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "error": {
    "code": "ENCRYPTED_FILE",
    "message": "One or more files are encrypted. Remove the encryption and re-upload. Check details for the affected files.",
    "retryable": false,
    "details": {
      "file_names": ["protected-invoice.pdf"]
    }
  }
}

All other error codes have details: null:

Code | Retryable | Message
CONCURRENT_TASK_LIMIT | Yes | Too many extractions running at once. Wait for one to complete, then retry.
NO_PAGES_FOUND | No | No extractable pages found. Files may be empty or corrupted.
PROMPT_REJECTED | No | The prompt did not describe data extraction. Please revise your prompt.
PROMPT_UNCLEAR | No | The AI could not understand the prompt well enough. Please adjust your instructions.
FILE_SIZE_LIMIT_EXCEEDED | No | A file exceeded the size limit during processing. Split large files and retry.
SUBMISSION_STALLED | Yes | This extraction was not picked up for processing. Please resubmit.
EXTRACTION_NOT_FOUND | No | No extraction found for this extraction_id.
INTERNAL_ERROR | Yes | An unexpected error occurred. Retry after a short delay.

Polling Strategy

Poll no more frequently than every 5 seconds. Processing time depends on the number and size of your files.

while status == "processing":
    wait 5+ seconds
    GET /extractions/{extraction_id}

if success == false:
    check error.retryable — if true, wait and resubmit; if false, fix the issue first
else if status == "completed":
    download output files
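The pseudocode above can be sketched as a testable Node.js function with injected dependencies (getStatus and sleep are placeholders for your HTTP call and timer, not API functions):

```javascript
// Poll the extraction status until it leaves "processing", waiting at least
// intervalMs between requests (the docs recommend 5+ seconds).
async function pollUntilDone(extractionId, getStatus, sleep, intervalMs = 5000) {
  for (;;) {
    const res = await getStatus(extractionId); // parsed JSON response body
    if (res.status !== "processing") return res; // "completed" or "failed"
    await sleep(intervalMs);
  }
}
```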

Next Step

Download the output files using the URLs in the response. If a download URL has expired, request a fresh one.


Step 6: Download Output

If a download URL from the polling response has expired (URLs are valid for 5 minutes), request a fresh one.

Endpoint

GET https://api.invoicedataextraction.com/v1/extractions/{extraction_id}/output?format={format}

Authentication: Bearer token in the Authorization header.

Query Parameters

Parameter | Required | Description
format | Yes | xlsx, csv, or json

Example Request

curl "https://api.invoicedataextraction.com/v1/extractions/a1b2c3d4-e5f6-7890-abcd-ef1234567890/output?format=xlsx" \
  -H "Authorization: Bearer $API_KEY"

Success Response (200)

{
  "download_url": "https://storage.example.com/...?X-Amz-Signature=...",
  "format": "xlsx",
  "expires_in_seconds": 300
}

Error Codes

Code | Status | Retryable | Message
EXTRACTION_NOT_FOUND | 404 | No | No extraction found for this extraction_id.
OUTPUT_NOT_AVAILABLE | 404 | No | Output is not available. The extraction may not be completed, or this format was not generated.

Delete Extraction

Permanently deletes an extraction, its output files, and its uploaded source files. Extractions that are currently being processed cannot be deleted.

Note: Deleting an extraction removes the uploaded source files associated with it. If you created multiple extractions from the same upload session, deleting one will not affect the others — source files are only removed when no other extraction is using them.

Our standard data retention policies apply automatically — uploaded documents and processing data are deleted on a schedule. Use this endpoint if you need to delete an extraction and its data immediately rather than waiting for automatic retention.

Endpoint

DELETE https://api.invoicedataextraction.com/v1/extractions/{extraction_id}

Authentication: Bearer token in the Authorization header.

Example Request

curl -X DELETE "https://api.invoicedataextraction.com/v1/extractions/a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  -H "Authorization: Bearer $API_KEY"

Success Response (200)

{
  "success": true
}

Error Codes

Code | Status | Retryable | Message
EXTRACTION_NOT_FOUND | 404 | No | No extraction found for this extraction_id.
EXTRACTION_IN_PROGRESS | 409 | No | This extraction is currently being processed and cannot be deleted. Wait for it to complete or fail, then try again.

Check Credit Balance

Returns your current credit balance, including credits reserved by extractions that are currently being processed.

Endpoint

GET https://api.invoicedataextraction.com/v1/credits/balance

Authentication: Bearer token in the Authorization header.

Example Request

curl "https://api.invoicedataextraction.com/v1/credits/balance" \
  -H "Authorization: Bearer $API_KEY"

Success Response (200)

{
  "success": true,
  "credits_balance": 150,
  "credits_reserved": 10
}

Field | Description
credits_balance | Your total credit balance (paid + free credits).
credits_reserved | Credits reserved by extractions currently being processed. Up to this amount will be deducted when processing completes, depending on the number of successful pages. Your usable balance is credits_balance minus credits_reserved.

Working with Output Files

You can control the structure and formatting of all output files in two main ways:

  • use output_structure to choose the top-level record shape, such as per_invoice or per_line_item
  • use your prompt to describe the fields, grouping, and overall structure you want, such as "one row per product" or "one row per PO"

You can also use your prompt to:

  • specify missing-value placeholders, such as empty string, N/A, or 0
  • specify formatting requirements, such as YYYY-MM-DD, digits only, or no currency symbol
  • specify the intended output type, such as text, number, date, datetime, boolean, currency, or percentage

These instructions may appear differently across JSON, CSV, and XLSX outputs, but they all affect how the final export is produced.

At a high level:

  • JSON output is string-based.
  • CSV is text-based.
  • XLSX can use native spreadsheet cell types when values can be safely interpreted.

Working with JSON Output

JSON value typing

In the JSON output file, extracted field values are returned as strings.

  • Standard fields are returned as strings.
  • If you ask for a field to contain JSON, that field is returned as a string containing valid JSON.
  • All values inside that JSON are also strings.

If you need numbers, booleans, or dates as typed values, parse them in your own code. If you plan to parse a value, state the formatting clearly in your prompt. For example:

  • "Do not include currency symbol"
  • "Use digits only"
  • "Return true or false"
  • "Use YYYY-MM-DD format"

Structured JSON fields

You can ask for a field to return structured JSON.

Example prompt:

"prompt": {
  "fields": [
    { "name": "Invoice Number" },
    {
      "name": "Line Items",
      "prompt": "Return a JSON array with keys description, quantity, unit_price, and amount. Use digits only for quantity. Use a full stop as the decimal separator. Do not include currency symbols in unit_price or amount. Do not use thousands separators. Use an empty string when a value is missing."
    }
  ]
}

Example JSON output value:

"Line Items": "[{\"description\":\"Widget\",\"quantity\":\"2\",\"unit_price\":\"9.99\",\"amount\":\"19.98\"}]"

In the example above, Line Items is a string whose content is valid JSON.
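To work with the nested data, decode that string with JSON.parse; the inner values are themselves strings (using the example value above):

```javascript
// A row from the JSON output; "Line Items" is a string containing valid JSON.
const row = {
  "Invoice Number": "INV-1001",
  "Line Items": "[{\"description\":\"Widget\",\"quantity\":\"2\",\"unit_price\":\"9.99\",\"amount\":\"19.98\"}]",
};
const lineItems = JSON.parse(row["Line Items"]);
// lineItems[0].quantity is the string "2"; convert with Number(...) as needed.
```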

Use nested line-item JSON like the example above mainly for smaller or simpler cases, such as when there are only a few line items and you want a single invoice-level object.

If you need detailed line item extraction, prefer output_structure: "per_line_item" instead of returning line items inside a nested JSON field.

This is strongly recommended when:

  • invoices may contain around 7 or more line items
  • line items need detailed per-field instructions
  • you want the most reliable line item extraction

In per_line_item, define invoice-level fields and line-item fields as separate top-level fields.

Many workflows can use the per_line_item output directly, with one row/object per line item.

If your workflow needs a nested structure such as { invoice_fields..., line_items: [...] }, include your own stable invoice identifier such as Invoice Number so you can group related line item rows back into invoices in your own system.

Do not rely on Source File alone to group rows into invoices. Source File helps you trace where a row came from, but it is not a stable invoice identifier.

Example prompt for the recommended approach:

{
  "prompt": {
    "fields": [
      { "name": "Invoice Number" },
      { "name": "Invoice Date", "prompt": "Use YYYY-MM-DD format" },
      { "name": "Vendor Name" },
      { "name": "Line Item Description" },
      { "name": "Line Item Quantity", "prompt": "Use digits only" },
      { "name": "Line Item Unit Price" },
      { "name": "Line Item Amount" }
    ],
    "general_prompt": "For amount fields don't use thousands separators, use full stops as the decimal separator and do not include currency symbols."
  },
  "output_structure": "per_line_item"
}

Example JSON output rows:

[
  {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Vendor Name": "Acme Ltd",
    "Line Item Description": "Widget A",
    "Line Item Quantity": "2",
    "Line Item Unit Price": "9.99",
    "Line Item Amount": "19.98"
  },
  {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Vendor Name": "Acme Ltd",
    "Line Item Description": "Widget B",
    "Line Item Quantity": "1",
    "Line Item Unit Price": "5.00",
    "Line Item Amount": "5.00"
  }
]

Both rows above belong to the same invoice because they share the same Invoice Number. If your workflow needs one record per line item, you can use the rows as-is. If your workflow needs a nested invoice structure, you can group rows that share the same invoice identifier to build your own { invoice_fields..., line_items: [...] } structure.
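The grouping step described above can be sketched in a few lines. This assumes every row carries a reliable Invoice Number; the function name groupByInvoice and the nested output shape are illustrative:

```javascript
// Group per_line_item rows into one object per invoice, keyed by Invoice Number.
function groupByInvoice(rows) {
  const invoices = new Map();
  for (const row of rows) {
    // Pull the invoice-level fields out; the rest of the row is the line item
    const {
      "Invoice Number": invoiceNumber,
      "Invoice Date": invoiceDate,
      "Vendor Name": vendorName,
      ...lineItem
    } = row;
    if (!invoices.has(invoiceNumber)) {
      invoices.set(invoiceNumber, {
        invoice_number: invoiceNumber,
        invoice_date: invoiceDate,
        vendor_name: vendorName,
        line_items: [],
      });
    }
    invoices.get(invoiceNumber).line_items.push(lineItem);
  }
  return [...invoices.values()];
}

// The two example rows above collapse into one invoice with two line items
const grouped = groupByInvoice([
  { "Invoice Number": "INV-1001", "Invoice Date": "2025-01-15", "Vendor Name": "Acme Ltd", "Line Item Description": "Widget A", "Line Item Amount": "19.98" },
  { "Invoice Number": "INV-1001", "Invoice Date": "2025-01-15", "Vendor Name": "Acme Ltd", "Line Item Description": "Widget B", "Line Item Amount": "5.00" },
]);
console.log(grouped[0].line_items.length); // 2
```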

CSV Output

CSV is a plain-text export. Every value in the CSV file is written as text.

XLSX Output

XLSX uses the most appropriate spreadsheet cell type for each value by default, and follows explicit prompt instructions where provided.
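Since CSV output (and the string values shown in the JSON examples above) deliver every value as text, downstream code typically converts numeric fields itself. A minimal sketch, using the per_line_item field names from this guide; the helper name toNumber is illustrative:

```javascript
// Convert a string-typed numeric field to a number.
// Number("") evaluates to 0, so treat empty strings as null instead.
function toNumber(value) {
  if (value === "" || value == null) return null;
  const n = Number(value);
  return Number.isNaN(n) ? null : n;
}

const row = { "Line Item Quantity": "2", "Line Item Unit Price": "9.99", "Line Item Amount": "" };
const quantity = toNumber(row["Line Item Quantity"]); // 2
const amount = toNumber(row["Line Item Amount"]);     // null (value was missing)
```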


Node.js Example

A ready-to-run script that handles the full workflow — reads files from a local folder, uploads them, submits an extraction task, polls until completion, and downloads the results. No dependencies beyond Node.js 18+.

Save this as extract.js, set the three configuration variables at the top (API_KEY, FOLDER_PATH, PROMPT), and run with node extract.js. You'll have extraction results in minutes.

import { readdir, readFile, stat, writeFile, mkdir } from "fs/promises";
import { join, extname } from "path";

// ---------------------------------------------------------------------------
// Configuration — set these before running
// ---------------------------------------------------------------------------

// Your API key. Get one at: https://invoicedataextraction.com/dashboard?view=API
// IMPORTANT: This is hardcoded here for simplicity. In production, load from an
// environment variable (e.g. process.env.IDE_API_KEY) and never commit to Git.
const API_KEY = "YOUR_API_KEY";

// Absolute path to the local folder containing the files you want to process.
const FOLDER_PATH = "/Users/you/Documents/invoices";

// Tell the AI what data to extract from each document (plain-text instruction).
const PROMPT = "Extract invoice number, date, vendor name, and total amount";
// For exact output column names, pass an object instead:
//   const PROMPT = { fields: [{ name: "Invoice Number" }, { name: "Total", prompt: "No currency symbol" }], general_prompt: "..." };

// A label for this extraction task (3-40 characters). Used in your dashboard and output filenames.
const TASK_NAME = "My extraction task";

// How rows are grouped in the output: "automatic" (AI decides), "per_invoice", or "per_line_item".
const OUTPUT_STRUCTURE = "automatic";

// Which output formats to download. Any combination of "xlsx", "csv", "json".
const DOWNLOAD_FORMATS = ["xlsx", "csv", "json"];

// ---------------------------------------------------------------------------
// Internal constants — no changes needed
// ---------------------------------------------------------------------------

const API_BASE = "https://api.invoicedataextraction.com/v1";
const SUPPORTED_EXTENSIONS = new Set([".pdf", ".jpg", ".jpeg", ".png"]);
const MAX_RETRIES = 3;

async function apiRequest(path, body) {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    const response = await fetch(`${API_BASE}${path}`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    });
    const text = await response.text();
    let data;
    try {
      data = JSON.parse(text);
    } catch {
      // Non-JSON response — could be a Cloudflare rate limit or infrastructure error.
      // Retry on 429/503, throw on anything else.
      if ((response.status === 429 || response.status === 503) && attempt < MAX_RETRIES) {
        const delayMs = 5000 * attempt;
        console.warn(`Non-JSON ${response.status} response, retrying in ${delayMs / 1000}s...`);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        continue;
      }
      throw new Error(`API returned non-JSON response (${response.status}): ${text.slice(0, 200)}`);
    }
    if (data.success) return data;

    // If the error is retryable and we have attempts left, wait and retry
    if (data.error.retryable && attempt < MAX_RETRIES) {
      // Use the Retry-After header if present (rate limit responses), otherwise exponential backoff
      const retryAfter = response.headers.get("Retry-After");
      const delayMs = retryAfter ? parseInt(retryAfter, 10) * 1000 : 1000 * attempt;
      console.warn(`Retryable error (${data.error.code}), retrying in ${delayMs / 1000}s...`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      continue;
    }

    throw new Error(
      `API error: ${data.error.code} — ${data.error.message}` +
        (data.error.details ? `\nDetails: ${JSON.stringify(data.error.details)}` : "")
    );
  }
}

// ---------------------------------------------------------------------------
// Step 1: Discover local files and create an upload session
// ---------------------------------------------------------------------------

// Scan the folder for supported file types
const entries = await readdir(FOLDER_PATH);
const files = [];

for (const entry of entries) {
  // Skip unsupported file types and subfolders
  const ext = extname(entry).toLowerCase();
  if (!SUPPORTED_EXTENSIONS.has(ext)) continue;
  const filePath = join(FOLDER_PATH, entry);
  const fileStat = await stat(filePath);
  if (!fileStat.isFile()) continue;

  // Add this file to the list with its size in bytes.
  // file_id must be unique within the session and can only contain letters, numbers,
  // dots, underscores, colons, and hyphens (no spaces). Use your own IDs (e.g., database
  // row IDs, UUIDs, or a simple counter).
  files.push({
    file_id: `file_${files.length + 1}`,
    file_name: entry,
    file_size_bytes: fileStat.size,
    localPath: filePath, // kept locally — not sent to the API
  });
}

// No supported files means nothing to upload; exit with a clear message
// rather than creating an empty upload session.
if (files.length === 0) {
  console.error(`No supported files (.pdf, .jpg, .jpeg, .png) found in ${FOLDER_PATH}`);
  process.exit(1);
}

// Optional: before uploading, you could calculate the credits required and check
// your balance. Each page costs one credit — for PDFs, count the pages; for
// images, each file is one credit. Then call GET /credits/balance to compare
// against your available balance (credits_balance minus credits_reserved).

// Generate a unique ID for this upload session (must be different for each new session)
const uploadSessionId = `session_${Date.now()}`;

// Create the upload session — registers all files with the API
let session;
try {
  session = await apiRequest("/uploads/sessions", {
    upload_session_id: uploadSessionId,
    files: files.map(({ file_id, file_name, file_size_bytes }) => ({
      file_id,
      file_name,
      file_size_bytes,
    })),
  });
} catch (error) {
  // Session creation failure is fatal — no files can be uploaded without a session
  console.error(`Failed to create upload session: ${error.message}`);
  process.exit(1);
}

console.log(`Upload session created: ${session.upload_session_id} (${files.length} files)`);

// The chunk size in bytes — always the same for all files in the session, so we read it from the first
const partSize = session.files[0].part_size;

// ---------------------------------------------------------------------------
// Steps 2 & 3: For each file — upload chunks, then complete the upload
// ---------------------------------------------------------------------------

const completedFileIds = [];

for (const file of files) {
  try {
    // Read the entire file into memory as a binary buffer
    const fileBuffer = await readFile(file.localPath);

    // Calculate how many parts this file needs
    const totalParts = Math.ceil(fileBuffer.length / partSize);
    const partNumbers = Array.from({ length: totalParts }, (_, i) => i + 1);

    // Request a presigned upload URL for each part
    const partsData = await apiRequest(`/uploads/sessions/${uploadSessionId}/parts`, {
      file_id: file.file_id,
      part_numbers: partNumbers,
    });

    // Upload each chunk to its presigned URL via PUT
    const completedParts = [];

    for (const { part_number, url } of partsData.part_urls) {
      // Slice the file buffer into a chunk for this part
      const start = (part_number - 1) * partSize;
      const end = Math.min(start + partSize, fileBuffer.length);
      const chunk = fileBuffer.subarray(start, end);

      // PUT the raw bytes directly to the presigned URL
      const putResponse = await fetch(url, { method: "PUT", body: chunk });
      if (!putResponse.ok) {
        const errorBody = await putResponse.text();
        throw new Error(
          `Upload failed for ${file.file_name} part ${part_number}: ${putResponse.status} ${putResponse.statusText}\n${errorBody}`
        );
      }

      // Save the ETag — needed to complete the upload in Step 3
      completedParts.push({
        part_number,
        e_tag: putResponse.headers.get("etag"),
      });
    }

    console.log(`Uploaded: ${file.file_name} (${totalParts} part${totalParts > 1 ? "s" : ""})`);

    // Complete the file upload with the collected ETags
    await apiRequest(`/uploads/sessions/${uploadSessionId}/complete`, {
      file_id: file.file_id,
      parts: completedParts,
    });

    console.log(`Completed: ${file.file_name}`);
    completedFileIds.push(file.file_id);
  } catch (error) {
    // By default, abort on any file failure to avoid silent partial uploads.
    // If you'd prefer to continue with remaining files, remove the process.exit.
    console.error(`Failed: ${file.file_name} — ${error.message}`);
    process.exit(1);
  }
}

// All files uploaded and completed successfully
console.log(`\n${completedFileIds.length} files ready for extraction.`);

// ---------------------------------------------------------------------------
// Steps 4 & 5: Submit the extraction task and poll until it completes
// ---------------------------------------------------------------------------

// Retryable polling errors (e.g., concurrent task limit, temporary server issues) trigger
// a fresh submission. Non-retryable errors require action from you — the log message tells
// you what to fix before re-running the script.

// Optional: human-readable guidance for non-retryable error codes (see error reference above).
// This just improves the console output — the API works the same without it.
const NON_RETRYABLE_GUIDANCE = {
  INSUFFICIENT_CREDITS: "Purchase credits at https://invoicedataextraction.com/dashboard?view=Billing then re-run this script.",
  FILE_PAGE_LIMIT_EXCEEDED: "Split the affected files into smaller documents and re-upload.",
  ENCRYPTED_FILE: "Remove encryption from the affected files and re-upload.",
  NO_PAGES_FOUND: "Check that your files are valid and contain extractable content.",
  PROMPT_REJECTED: "Revise your prompt to clearly describe what data to extract.",
  PROMPT_UNCLEAR: "Revise your prompt with clearer instructions and re-run.",
  FILE_SIZE_LIMIT_EXCEEDED: "Split large files into smaller documents and re-upload.",
};

const MAX_SUBMISSION_ATTEMPTS = 2;
const POLL_INTERVAL_MS = 5000;

let result;

for (let attempt = 1; attempt <= MAX_SUBMISSION_ATTEMPTS; attempt++) {
  // Each attempt needs a unique submission_id
  const submissionId = `sub_${Date.now()}_${attempt}`;

  const run = await apiRequest("/extractions", {
    submission_id: submissionId,
    upload_session_id: uploadSessionId,
    file_ids: completedFileIds,
    task_name: TASK_NAME,
    prompt: PROMPT,
    output_structure: OUTPUT_STRUCTURE,
  });

  console.log(`\nExtraction task submitted (extraction_id: ${run.extraction_id})`);

  // Poll until completed or failed
  let lastFailureCode = null;
  let consecutivePollErrors = 0;
  const MAX_CONSECUTIVE_POLL_ERRORS = 10;
  while (true) {
    // Fetch the current task status. A gateway error page or a network blip can
    // produce a non-JSON body, so parse defensively instead of crashing.
    let data = null;
    let httpStatus = 0;
    try {
      const response = await fetch(`${API_BASE}/extractions/${run.extraction_id}`, {
        headers: { Authorization: `Bearer ${API_KEY}` },
      });
      httpStatus = response.status;
      data = await response.json();
    } catch {
      // Counted as a polling issue below
    }

    if (data?.status === "completed") {
      result = data;
      break;
    }

    if (data?.status === "failed") {
      const { code, message, details, retryable } = data.error;
      console.error(`\nExtraction failed: ${code} — ${message}`);
      if (details) console.error(`Details: ${JSON.stringify(details)}`);

      if (!retryable) {
        const guidance = NON_RETRYABLE_GUIDANCE[code] || "Check the error above and re-run when resolved.";
        console.error(`\nAction required: ${guidance}`);
        process.exit(1);
      }

      // Retryable — wait then submit again.
      // Concurrent task limit means we wait longer (5 min) for other processing tasks to finish.
      // Other retryable errors are transient, so a short delay (10s) suffices.
      const delayMs = code === "CONCURRENT_TASK_LIMIT" ? 300_000 : 10_000;
      console.log(`Retrying in ${delayMs / 1000}s (attempt ${attempt}/${MAX_SUBMISSION_ATTEMPTS})...`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      lastFailureCode = code;
      break;
    }

    // Still processing — reset error counter and poll again
    if (data?.status === "processing") {
      consecutivePollErrors = 0;
      console.log(`Processing... ${data.progress ?? 0}%`);
    } else {
      consecutivePollErrors++;
      console.warn(`Polling issue (HTTP ${httpStatus || "network error"}) — retrying in ${POLL_INTERVAL_MS / 1000}s... (${consecutivePollErrors}/${MAX_CONSECUTIVE_POLL_ERRORS})`);
      if (consecutivePollErrors >= MAX_CONSECUTIVE_POLL_ERRORS) {
        console.error(`\nToo many consecutive polling errors. The extraction may still be processing — check your dashboard or retry later.`);
        process.exit(1);
      }
    }
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }

  if (result) break;

  if (lastFailureCode && attempt === MAX_SUBMISSION_ATTEMPTS) {
    const exitMessage = lastFailureCode === "CONCURRENT_TASK_LIMIT"
      ? `\nStill hitting the concurrent task limit after ${MAX_SUBMISSION_ATTEMPTS} attempts. Wait for your other extractions to finish, then re-run.`
      : `\nGave up after ${MAX_SUBMISSION_ATTEMPTS} attempts. There may be temporary service issues — please wait and try again later.`;
    console.error(exitMessage);
    process.exit(1);
  }
}

console.log(`\nExtraction completed!`);
console.log(`Credits deducted: ${result.credits_deducted}`);
console.log(`Output structure: ${result.output_structure}`);
console.log(`Pages: ${result.pages.successful_count} successful, ${result.pages.failed_count} failed`);

if (result.pages.failed_count > 0) {
  console.warn(`\nWarning: ${result.pages.failed_count} page(s) failed to extract. Data from these pages is missing from the output.`);
  for (const page of result.pages.failed) {
    console.warn(`  - ${page.file_name} (page ${page.page})`);
  }
}

if (result.ai_uncertainty_notes.length > 0) {
  console.log(`\n--- AI Uncertainty Notes ---`);
  console.log(`The AI made assumptions in ${result.ai_uncertainty_notes.length} area(s). Review these and consider adding the suggested prompt additions to improve future extractions.\n`);
  result.ai_uncertainty_notes.forEach((note, i) => {
    console.log(`  [${i + 1}] ${note.topic}`);
    console.log(`  ${note.description}`);
    for (const suggestion of note.suggested_prompt_additions) {
      console.log(`    → ${suggestion.purpose}: "${suggestion.instructions}"`);
    }
    console.log();
  });
  console.log(`---`);
}

// ---------------------------------------------------------------------------
// Step 6: Download the output files
// ---------------------------------------------------------------------------

const timestamp = new Date().toISOString().replace(/[:.]/g, "-").slice(0, 19);
const safeName = TASK_NAME.replace(/[^a-zA-Z0-9_-]/g, "_");
await mkdir("output", { recursive: true });

for (const format of DOWNLOAD_FORMATS) {
  const url = result.output[`${format}_url`];
  if (!url) {
    console.warn(`No ${format} download available.`);
    continue;
  }
  const response = await fetch(url);
  if (!response.ok) {
    console.error(`Failed to download ${format}: ${response.status}`);
    continue;
  }
  const buffer = Buffer.from(await response.arrayBuffer());
  const outputPath = `output/${safeName}_${timestamp}.${format}`;
  await writeFile(outputPath, buffer);
  console.log(`Downloaded: ${outputPath}`);
}

console.log(`\nDone. Extraction ${result.extraction_id} completed successfully.`);
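
The optional pre-upload credits check mentioned in Step 1 could be sketched as below. It calls the GET /credits/balance endpoint and uses the credits_balance and credits_reserved fields described there; the function names and the assumption that those fields sit at the top level of the JSON body are illustrative, not part of the script above:

```javascript
// Available credits = credits_balance minus credits_reserved
function availableCredits({ credits_balance, credits_reserved }) {
  return credits_balance - credits_reserved;
}

// Compare the available balance against an estimated page count before uploading
async function hasEnoughCredits(apiBase, apiKey, requiredPages) {
  const response = await fetch(`${apiBase}/credits/balance`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const data = await response.json();
  return availableCredits(data) >= requiredPages;
}
```

Remember that each page costs one credit: count pages for PDFs, and count each image file as one page.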