REST API Reference

Programmatic access to our invoice extraction engine. The same AI that powers our Invoice Data Extraction platform, accessible via REST API.

Building with Python or Node.js? Start with the Python SDK or Node.js SDK instead.

Base URL: https://api.invoicedataextraction.com/v1
Auth: Bearer token (API key)
Output: xlsx / csv / json

Use an SDK for a simpler integration

Building with Python or Node.js? Use an official SDK instead of calling the REST API directly. SDKs handle file upload, polling, and download automatically — you can go from local files to structured output in a few lines of code.

If you need to call the REST API directly — for example, from a language without an official SDK — the endpoint documentation below is a complete reference.

LLM-ready documentation for the REST API

These REST API docs are structured so that an AI coding assistant can build a complete, working HTTP integration for you in any language. Copy and paste into your preferred LLM.

Invoice Data Extraction API

Overview

Extracting data from invoices is a three-step process:

  1. Upload — Create an upload session, upload your files in chunks, then complete each upload.
  2. Submit — Submit an extraction task referencing your uploaded files.
  3. Poll — Check the task status until processing completes, then download your results.

Each file in the session is uploaded and completed independently. If a file fails at any stage, you can still upload, complete, and submit the other files.

Extraction tasks submitted via API appear in your web dashboard alongside tasks submitted from the web app — you can view progress, results, and download output from either.

Authentication

All API requests require a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

Generate and manage your API keys from your dashboard at https://invoicedataextraction.com/dashboard?view=API. Every account includes 50 free pages per month.

Error Responses

All endpoints return errors in this format:

{
  "success": false,
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable error message.",
    "retryable": false,
    "details": null
  }
}

retryable indicates whether the same request can be retried. When true, the error is transient (e.g., a temporary server issue) and retrying after a short delay may succeed. When false, the request itself is invalid and retrying will produce the same error.
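In client code, this flag maps naturally onto a retry policy: back off and retry when retryable is true, honor the Retry-After header on 429 responses, and surface the error otherwise. A minimal sketch in Node.js (the helper name and backoff values are our own, not part of the API):

```javascript
// Decide whether and when to retry, given the HTTP status, response headers,
// and the standard error envelope ({ success: false, error: { retryable, ... } }).
// Returns a delay in milliseconds, or null for non-retryable errors.
function retryDelayMs(status, headers, body, attempt) {
  // Honor Retry-After (in seconds) on rate-limit responses when provided.
  const retryAfter = headers["retry-after"];
  if (status === 429 && retryAfter) return Number(retryAfter) * 1000;
  // Transient errors: exponential backoff, capped at 30 seconds.
  if (body && body.error && body.error.retryable) {
    return Math.min(1000 * 2 ** attempt, 30000);
  }
  // Non-retryable: the request itself must be fixed before retrying.
  return null;
}
```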

The following errors can be returned by any endpoint:

Code | Status | Retryable | Message
UNAUTHENTICATED | 401 | No | Missing or invalid bearer token.
API_KEY_EXPIRED | 401 | No | API key has expired. Please create a new key.
API_KEY_REVOKED | 401 | No | API key has been revoked. Please create a new key.
NOT_FOUND | 404 | No | The requested endpoint does not exist.
RATE_LIMITED | 429 | Yes | Too many requests. Retry after the period indicated in the Retry-After header.
INTERNAL_ERROR | 500 | Yes | An unexpected error occurred.

details is always present: it is either null (no additional context) or an object with error-specific information. For example, INVALID_INPUT errors include validation issues:

{
  "success": false,
  "error": {
    "code": "INVALID_INPUT",
    "message": "Request validation failed. Check details for specific issues.",
    "retryable": false,
    "details": {
      "issues": [
        { "message": "file_name must end with a supported extension: .pdf, .jpg, .jpeg, or .png.", "path": ["files", 0, "file_name"] }
      ]
    }
  }
}

Rate Limits

All endpoints are rate limited per API key. If you exceed the limit, the API returns a 429 status with a Retry-After header indicating how many seconds to wait before retrying.

Endpoints | Limit
Upload endpoints (create session, get part URLs, complete upload) | 600 requests per minute
Submit extraction | 30 requests per minute
Poll extraction status | 120 requests per minute
Download output | 30 requests per minute
Delete extraction | 30 requests per minute
Check credit balance | 60 requests per minute

Step 1: Create Upload Session

Creates an upload session for one or more files. Returns the part size you should use when chunking files for upload.

Endpoint

POST https://api.invoicedataextraction.com/v1/uploads/sessions

Authentication: Bearer token in the Authorization header.

Authorization: Bearer YOUR_API_KEY

Request Body

Field | Type | Required | Description
upload_session_id | string | Yes | Your unique identifier for this upload session. Use a different ID for each new session. If a request fails or times out, you can safely retry with the same ID and files — the existing session will be returned without creating duplicates.
files | array | Yes | The files you want to upload (1 to 6,000 files).

Each item in files:

Field | Type | Required | Description
file_id | string | Yes | Your unique identifier for this file within the session. Only letters, numbers, dots, underscores, colons, and hyphens (1-200 characters). You'll use this ID to reference the file when requesting part URLs and completing the upload.
file_name | string | Yes | The file name, including extension. Must end in .pdf, .jpg, .jpeg, or .png.
file_size_bytes | integer | Yes | The exact size of the file in bytes.

File Limits

Type | Max Size
PDF | 150 MB
JPG / JPEG / PNG | 5 MB
Total batch size | 2 GB
Max files per session | 6,000
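Since oversized batches are rejected when the session is created, it can be worth checking these limits locally before calling the API. A hedged sketch of such a pre-check (the helper and constants are our own; whether the API interprets MB as 1024- or 1000-based is an assumption here):

```javascript
// Client-side pre-check mirroring the documented limits. Illustrative only;
// the server remains the source of truth for validation.
const MAX_PDF_BYTES = 150 * 1024 * 1024;   // 150 MB (assumed binary MB)
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;   // 5 MB
const MAX_BATCH_BYTES = 2 * 1024 * 1024 * 1024; // 2 GB
const MAX_FILES = 6000;

function validateBatch(files) {
  const issues = [];
  if (files.length < 1 || files.length > MAX_FILES) {
    issues.push(`file count must be between 1 and ${MAX_FILES}`);
  }
  let total = 0;
  for (const f of files) {
    total += f.file_size_bytes;
    const isPdf = /\.pdf$/i.test(f.file_name);
    const isImage = /\.(jpe?g|png)$/i.test(f.file_name);
    if (!isPdf && !isImage) issues.push(`${f.file_name}: unsupported extension`);
    const limit = isPdf ? MAX_PDF_BYTES : MAX_IMAGE_BYTES;
    if ((isPdf || isImage) && f.file_size_bytes > limit) {
      issues.push(`${f.file_name}: exceeds size limit for its type`);
    }
  }
  if (total > MAX_BATCH_BYTES) issues.push("total batch size exceeds 2 GB");
  return issues; // empty array means the batch passes these local checks
}
```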

Example Request

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "upload_session_id": "sess_001",
    "files": [
      {
        "file_id": "file_001",
        "file_name": "invoice-1.pdf",
        "file_size_bytes": 120450
      },
      {
        "file_id": "file_002",
        "file_name": "large-report.pdf",
        "file_size_bytes": 20000000
      }
    ]
  }'

Success Response (200)

{
  "success": true,
  "upload_session_id": "sess_001",
  "files": [
    {
      "file_id": "file_001",
      "file_name": "invoice-1.pdf",
      "part_size": 8388608
    },
    {
      "file_id": "file_002",
      "file_name": "large-report.pdf",
      "part_size": 8388608
    }
  ]
}

part_size is the chunk size in bytes to use when splitting files for multipart upload. This value is the same for all files in the session. Files smaller than part_size are uploaded as a single part.

Error Codes

Code | Status | Retryable | Message
DUPLICATE_FILE_NAME | 400 | No | Each file must have a unique file_name. Check details for the duplicates.
DUPLICATE_FILE_ID | 400 | No | Each file must have a unique file_id. Check details for the duplicates.
FILE_TOO_LARGE | 400 | No | A file exceeds the maximum size for its type. Check details for the file and size limit.
TOTAL_UPLOAD_SIZE_LIMIT_EXCEEDED | 400 | No | The combined size of all files exceeds the maximum upload size. Check details for the limit.
INSUFFICIENT_CREDITS | 402 | No | Not enough credits for this upload session. Each file requires at least one credit. Check details for your balance. credits_reserved are credits held by extractions currently being processed.
SESSION_ALREADY_INITIALIZED | 409 | No | This upload_session_id is already in use. Please use a different upload_session_id.

Idempotency

You can safely retry a failed or timed-out request using the same upload_session_id and files. If the session was already created, the existing session is returned. If you need a new session with different files, use a different upload_session_id.

Next Step

After creating the upload session, request presigned part URLs for each file to begin uploading.


Step 2: Get Part Upload URLs

For each file, request presigned URLs for the parts you need to upload. You then PUT your file bytes directly to these URLs.

Endpoint

POST https://api.invoicedataextraction.com/v1/uploads/sessions/{upload_session_id}/parts

{upload_session_id} is the ID you provided when creating the upload session in Step 1.

Authentication: Bearer token in the Authorization header.

Request Body

Field | Type | Required | Description
file_id | string | Yes | The file ID you used when creating the upload session.
part_numbers | array of integers | Yes | The part numbers you want upload URLs for (1-indexed).

How to calculate part numbers

Use the part_size from the Step 1 response to determine how many parts your file needs:

total_parts = ceil(file_size_bytes / part_size)
part_numbers = [1, 2, 3, ..., total_parts]

Files smaller than part_size need only one part: [1].

In the examples below, part_size is 8388608 (8 MB):

  • A 120 KB file is smaller than 8 MB, so it needs only part [1].
  • A 20 MB file needs ceil(20_000_000 / 8_388_608) = 3 parts: [1, 2, 3].
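The same calculation as a small Node.js helper (illustrative; the function name is ours, and it simply implements the formula above):

```javascript
// Compute the 1-indexed part numbers for a file, given the part_size
// returned by the create-session response: total_parts = ceil(size / part_size).
function partNumbers(fileSizeBytes, partSize) {
  const totalParts = Math.max(1, Math.ceil(fileSizeBytes / partSize));
  return Array.from({ length: totalParts }, (_, i) => i + 1);
}
```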

Example: Small file (single part)

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/parts" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_001",
    "part_numbers": [1]
  }'
{
  "success": true,
  "upload_session_id": "sess_001",
  "file_id": "file_001",
  "file_name": "invoice-1.pdf",
  "part_size": 8388608,
  "part_urls": [
    {
      "part_number": 1,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    }
  ]
}

Example: Large file (multiple parts)

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/parts" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_002",
    "part_numbers": [1, 2, 3]
  }'
{
  "success": true,
  "upload_session_id": "sess_001",
  "file_id": "file_002",
  "file_name": "large-report.pdf",
  "part_size": 8388608,
  "part_urls": [
    {
      "part_number": 1,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    },
    {
      "part_number": 2,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    },
    {
      "part_number": 3,
      "url": "https://storage.example.com/...?X-Amz-Signature=..."
    }
  ]
}

Uploading parts

Once you have the presigned URLs, split your file into chunks and upload each one. Each presigned URL is valid for 15 minutes.

How it works

  1. Read the file as binary (Buffer, ArrayBuffer, Uint8Array, etc.).
  2. Slice into chunks of part_size bytes (returned in the Step 1 response). The last chunk will usually be smaller — that's fine.
  3. PUT each chunk to the corresponding presigned URL. Send the raw bytes as the request body — no special headers or encoding needed.
  4. Capture the ETag response header from each PUT response. The ETag is a quoted string (e.g., "d41d8cd98f00b204e9800998ecf8427e"). Keep the quotes — you'll need the exact value in Step 3.

See the full Node.js example at the end of this document.
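As a smaller sketch, the slicing in step 2 can be a pure helper (the name is ours, and it assumes the file is already in memory as a Buffer or Uint8Array); each returned chunk is then PUT to its presigned URL and the ETag captured:

```javascript
// Slice a file's bytes into part_size chunks for multipart upload.
// Returns [{ part_number, chunk }] pairs; the last chunk may be shorter.
function sliceIntoParts(bytes, partSize) {
  const parts = [];
  for (let offset = 0, n = 1; offset < bytes.length; offset += partSize, n += 1) {
    parts.push({ part_number: n, chunk: bytes.subarray(offset, offset + partSize) });
  }
  return parts;
}
```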

Error Codes

Code | Status | Retryable | Message
FILE_NOT_FOUND | 404 | No | This file_id was not registered when the upload session was created. Check the file_id and upload_session_id.
FILE_NOT_UPLOADABLE | 409 | No | This file has already been completed or aborted.

Next Step

After uploading all parts for a file, complete the upload with the ETags from each part.


Step 3: Complete File Upload

After uploading all parts for a file, call this endpoint with the ETags to finalize the upload. Call this once per file.

Endpoint

POST https://api.invoicedataextraction.com/v1/uploads/sessions/{upload_session_id}/complete

{upload_session_id} is the ID you provided when creating the upload session in Step 1.

Authentication: Bearer token in the Authorization header.

Request Body

Field | Type | Required | Description
file_id | string | Yes | The file ID you used when creating the upload session.
parts | array | Yes | The part numbers and ETags from your part uploads.

Each item in parts:

Field | Type | Required | Description
part_number | integer | Yes | The part number (matches what you requested in Step 2).
e_tag | string | Yes | The ETag returned in the response header when you uploaded this part. Include the surrounding quotes (e.g., "\"a1b2c3...\"").

Example Request

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/complete" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_001",
    "parts": [
      {
        "part_number": 1,
        "e_tag": "\"a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4\""
      }
    ]
  }'

For a multi-part file:

curl -X POST "https://api.invoicedataextraction.com/v1/uploads/sessions/sess_001/complete" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_002",
    "parts": [
      { "part_number": 1, "e_tag": "\"a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4\"" },
      { "part_number": 2, "e_tag": "\"f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3\"" },
      { "part_number": 3, "e_tag": "\"9876543210ab9876543210ab98765432\"" }
    ]
  }'

Success Response (200)

{
  "success": true,
  "upload_session_id": "sess_001",
  "file_id": "file_001",
  "file_name": "invoice-1.pdf"
}

Idempotency

If a file has already been completed, calling this endpoint again returns a success response. This makes it safe to retry if your connection drops before you receive the response.

Error Codes

Code | Status | Retryable | Message
FILE_NOT_FOUND | 404 | No | This file_id was not registered when the upload session was created. Check the file_id and upload_session_id.
FILE_ABORTED | 409 | No | This file has been aborted and can no longer be completed.
INVALID_COMPLETION_PARTS | 400 | No | The parts provided to complete this file upload are invalid. Check details for the specific reason.
OBJECT_SIZE_MISMATCH | 422 | No | The uploaded file size does not match the file_size_bytes declared when the upload session was created. Check details for the declared and actual sizes.
UPLOAD_ID_NOT_FOUND | 409 | No | This upload session is no longer available. Please create a new upload session and re-upload your files.
UPLOAD_COMPLETE_FAILED | 502 | Yes | File upload completion failed. This may be a temporary issue — please retry.

Next Step

After completing all files, submit an extraction task.


Step 4: Submit Extraction Task

Submit an extraction task referencing your uploaded files. You tell the API what data to extract using a prompt.

Endpoint

POST https://api.invoicedataextraction.com/v1/extractions

Authentication: Bearer token in the Authorization header.

Request Body

Field | Type | Required | Description
submission_id | string | Yes | Your unique identifier for this submission. If a request fails or times out, retry with the same submission_id to safely retrieve the existing task instead of creating a duplicate. Use a different ID for each new extraction task (e.g., a UUID).
upload_session_id | string | Yes | The upload session ID from Step 1.
file_ids | array of strings | Yes | The file IDs to include in this extraction. Must reference files that were completed in Step 3.
task_name | string | Yes | Your own label for this extraction task, for your internal reference (3-40 characters).
prompt | string or object | Yes | Your extraction instructions. See below.
output_structure | string | Yes | "automatic", "per_invoice", or "per_line_item".
options | object | No | Configuration options. See below.

Output structure

Controls how the extracted data is structured:

Value | Meaning
automatic | The AI decides based on your prompt and documents.
per_invoice | Each invoice becomes a single row (spreadsheet/CSV) or object (JSON).
per_line_item | Each individual product/service listed within an invoice becomes its own row (spreadsheet/CSV) or object (JSON).

Prompt

The prompt field tells the AI what data to extract from your documents. It can be either a string or an object.

As a string — describe what you want in natural language (max 2,500 characters):

"prompt": "Extract invoice number, date, vendor name, total amount, and all line items with descriptions and amounts"

As an object — define exact output field names, with optional per-field and general instructions:

"prompt": {
  "fields": [
    { "name": "Invoice Number" },
    { "name": "Invoice Date", "prompt": "The date the invoice was issued, NOT the payment due date" },
    { "name": "Vendor Name" },
    { "name": "Total Amount", "prompt": "Do not include currency symbol, use 2 decimal places" }
  ],
  "general_prompt": "One row for each product. Do not extract shipping lines."
}

Use an object when you need exact output field names — each name is guaranteed to appear exactly as written in the extracted data. With a string, the AI chooses field names based on your instructions.

For guidance on writing effective prompts, see the Extraction Guide.

Each item in fields:

Field | Type | Required | Description
name | string | Yes | The name for this data point in the output (2-50 characters). Prefer clear, descriptive names (e.g., "Invoice Number", not "Field A").
prompt | string | No | Specific instructions for extracting this data point (3–600 characters). Use this to clarify ambiguities or instruct special handling.

Alongside fields, the prompt object also accepts:

Field | Type | Required | Description
general_prompt | string | No | Instructions that apply to the full task and across all fields (max 1,500 characters). Use this to provide special handling instructions, specify output structure/formatting, or describe the extraction goal.

Options

The options object is optional. All fields within it are optional and have sensible defaults.

Field | Type | Default | Description
exclude_columns | array of strings | [] | System-generated columns to exclude from output files. By default, a "Source File" column is added to every row indicating which uploaded file/page the data was extracted from. If your workflow requires an exact output structure, you can exclude it. Valid values: "source_file".

Example: String prompt

curl -X POST "https://api.invoicedataextraction.com/v1/extractions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "submission_id": "sub_001",
    "upload_session_id": "sess_001",
    "file_ids": ["file_001", "file_002"],
    "task_name": "January invoices",
    "prompt": "Extract invoice number, date, vendor name, and total amount",
    "output_structure": "per_invoice"
  }'

Example: Object prompt

curl -X POST "https://api.invoicedataextraction.com/v1/extractions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "submission_id": "sub_002",
    "upload_session_id": "sess_001",
    "file_ids": ["file_001", "file_002"],
    "task_name": "January invoices",
    "prompt": {
      "fields": [
        { "name": "Invoice Number" },
        { "name": "Invoice Date", "prompt": "The date the invoice was issued, NOT the payment due date" },
        { "name": "Vendor Name" },
        { "name": "Line Item Description" },
        { "name": "Line Item Amount", "prompt": "Do not include currency symbol, use 2 decimal places" }
      ],
      "general_prompt": "Dates should be in YYYY-MM-DD format. Ignore email cover letters."
    },
    "output_structure": "per_line_item"
  }'

Success Response (202)

{
  "success": true,
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "submission_state": "received"
}

The task is now queued for processing. Use the extraction_id to poll for results (Step 5).

Once submitted, the extraction task also appears in the web dashboard alongside tasks submitted from the web app — you can view its progress and results there.

Idempotency

If a request fails or times out, you can safely retry with the same submission_id. If the task was already created, the existing task is returned without creating a duplicate. Use a different submission_id for each new extraction task.

Next Step

After submitting, poll the task status until processing completes.


Step 5: Poll for Results

After submitting an extraction task, poll this endpoint until the task completes or fails.

Endpoint

GET https://api.invoicedataextraction.com/v1/extractions/{extraction_id}

{extraction_id} is returned in the Step 4 response.

Authentication: Bearer token in the Authorization header.

Response

The response always includes success, status, and extraction_id. The rest of the response depends on the status.

Processing (keep polling)

{
  "success": true,
  "status": "processing",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "progress": 42
}

progress is an integer from 0 to 100 indicating approximate completion. The task is still being processed — wait a few seconds and poll again.

Completed

{
  "success": true,
  "status": "completed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "credits_deducted": 25,
  "output_structure": "per_invoice",
  "pages": {
    "successful_count": 10,
    "failed_count": 2,
    "successful": [
      { "file_name": "invoice-1.pdf", "page": 1 },
      { "file_name": "invoice-1.pdf", "page": 2 }
    ],
    "failed": [
      { "file_name": "damaged.pdf", "page": 1 }
    ]
  },
  "ai_uncertainty_notes": [
    {
      "topic": "Documents to extract from",
      "description": "Your files often contain a 'Tax Invoice' with an attached 'Delivery Note'. I treated the 'Tax Invoice' pages as the main source of data, and ignored the attached 'Delivery Note' pages as supporting context.",
      "suggested_prompt_additions": [
        {
          "purpose": "To confirm this handling",
          "instructions": ["Extract from 'Tax Invoice' only"]
        },
        {
          "purpose": "To extract from both",
          "instructions": ["Extract from 'Tax Invoice' and 'Delivery Note'"]
        }
      ]
    }
  ],
  "output": {
    "xlsx_url": "https://storage.example.com/...?X-Amz-Signature=...",
    "csv_url": "https://storage.example.com/...?X-Amz-Signature=...",
    "json_url": "https://storage.example.com/...?X-Amz-Signature=..."
  }
}

Field | Description
credits_deducted | The number of credits charged for this extraction (one credit per successful page).
output_structure | The output structure used: "per_invoice" or "per_line_item". If you submitted with "automatic", this tells you what the AI chose.
pages.successful_count | Number of pages successfully processed.
pages.failed_count | Number of pages that failed processing.
pages.successful | List of successfully processed pages. Each item has file_name (the uploaded file name) and page (the page number within that file).
pages.failed | List of pages that failed processing. Same shape as successful.
ai_uncertainty_notes | Areas where the AI made assumptions due to ambiguity in your prompt. Empty array if none. Each note has a topic, a description of what was assumed, and a suggested_prompt_additions array of prompt additions you can use to remove the ambiguity in future extractions. Each item has a purpose (why you'd add it) and instructions (prompt text you can add).
output.xlsx_url | Presigned download URL for the Excel (.xlsx) file. null if not available.
output.csv_url | Presigned download URL for the CSV file. null if not available.
output.json_url | Presigned download URL for the JSON file. null if not available.

Download URLs are temporary, pre-authenticated URLs. To download a file, make a plain GET request to the URL — no Authorization header or other authentication needed. URLs expire after 5 minutes. If a URL has expired, use the download endpoint to get a fresh one.

Failed

When an extraction fails, the response uses the standard error format plus status: "failed" and extraction_id.

INSUFFICIENT_CREDITS: credits_balance is your total credit balance; credits_reserved is credits held by extractions currently being processed (your available credits = balance minus reserved).

{
  "success": false,
  "status": "failed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "error": {
    "code": "INSUFFICIENT_CREDITS",
    "message": "Insufficient credits to process this extraction. Check details for your balance and required credits.",
    "retryable": false,
    "details": {
      "credits_required": 25,
      "credits_balance": 15,
      "credits_reserved": 10
    }
  }
}

FILE_PAGE_LIMIT_EXCEEDED / ENCRYPTED_FILE: details.file_names lists the affected files.

{
  "success": false,
  "status": "failed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "error": {
    "code": "ENCRYPTED_FILE",
    "message": "One or more files are encrypted. Remove the encryption and re-upload. Check details for the affected files.",
    "retryable": false,
    "details": {
      "file_names": ["protected-invoice.pdf"]
    }
  }
}

All other error codes have details: null:

Code | Retryable | Message
CONCURRENT_TASK_LIMIT | Yes | Too many extractions running at once. Wait for one to complete, then retry.
NO_PAGES_FOUND | No | No extractable pages found. Files may be empty or corrupted.
PROMPT_REJECTED | No | The prompt did not describe data extraction. Please revise your prompt.
PROMPT_UNCLEAR | No | The AI could not understand the prompt well enough. Please adjust your instructions.
FILE_SIZE_LIMIT_EXCEEDED | No | A file exceeded the size limit during processing. Split large files and retry.
SUBMISSION_STALLED | Yes | This extraction was not picked up for processing. Please resubmit.
EXTRACTION_NOT_FOUND | No | No extraction found for this extraction_id.
INTERNAL_ERROR | Yes | An unexpected error occurred. Retry after a short delay.

Polling Strategy

Poll no more frequently than every 5 seconds. Processing time depends on the number and size of your files.

while status == "processing":
    wait 5+ seconds
    GET /extractions/{extraction_id}

if success == false:
    check error.retryable — if true, wait and resubmit; if false, fix the issue first
else if status == "completed":
    download output files
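The pseudocode above can be sketched as a testable Node.js function with injected dependencies (getStatus and sleep are placeholders for your HTTP call and timer, not API functions):

```javascript
// Poll the extraction status until it leaves "processing", waiting at least
// intervalMs between requests (the docs recommend 5+ seconds).
async function pollUntilDone(extractionId, getStatus, sleep, intervalMs = 5000) {
  for (;;) {
    const res = await getStatus(extractionId); // parsed JSON response body
    if (res.status !== "processing") return res; // "completed" or "failed"
    await sleep(intervalMs);
  }
}
```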

Next Step

Download the output files using the URLs in the response. If a download URL has expired, request a fresh one.


Step 6: Download Output

If a download URL from the polling response has expired (URLs are valid for 5 minutes), request a fresh one.

Endpoint

GET https://api.invoicedataextraction.com/v1/extractions/{extraction_id}/output?format={format}

Authentication: Bearer token in the Authorization header.

Query Parameters

Parameter | Required | Description
format | Yes | xlsx, csv, or json

Example Request

curl "https://api.invoicedataextraction.com/v1/extractions/a1b2c3d4-e5f6-7890-abcd-ef1234567890/output?format=xlsx" \
  -H "Authorization: Bearer $API_KEY"

Success Response (200)

{
  "download_url": "https://storage.example.com/...?X-Amz-Signature=...",
  "format": "xlsx",
  "expires_in_seconds": 300
}

Error Codes

Code | Status | Retryable | Message
EXTRACTION_NOT_FOUND | 404 | No | No extraction found for this extraction_id.
OUTPUT_NOT_AVAILABLE | 404 | No | Output is not available. The extraction may not be completed, or this format was not generated.

Delete Extraction

Permanently deletes an extraction, its output files, and its uploaded source files. Extractions that are currently being processed cannot be deleted.

Note: Deleting an extraction removes the uploaded source files associated with it. If you created multiple extractions from the same upload session, deleting one will not affect the others — source files are only removed when no other extraction is using them.

Our standard data retention policies apply automatically — uploaded documents and processing data are deleted on a schedule. Use this endpoint if you need to delete an extraction and its data immediately rather than waiting for automatic retention.

Endpoint

DELETE https://api.invoicedataextraction.com/v1/extractions/{extraction_id}

Authentication: Bearer token in the Authorization header.

Example Request

curl -X DELETE "https://api.invoicedataextraction.com/v1/extractions/a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  -H "Authorization: Bearer $API_KEY"

Success Response (200)

{
  "success": true
}

Error Codes

Code | Status | Retryable | Message
EXTRACTION_NOT_FOUND | 404 | No | No extraction found for this extraction_id.
EXTRACTION_IN_PROGRESS | 409 | No | This extraction is currently being processed and cannot be deleted. Wait for it to complete or fail, then try again.

Check Credit Balance

Returns your current credit balance, including credits reserved by extractions that are currently being processed.

Endpoint

GET https://api.invoicedataextraction.com/v1/credits/balance

Authentication: Bearer token in the Authorization header.

Example Request

curl "https://api.invoicedataextraction.com/v1/credits/balance" \
  -H "Authorization: Bearer $API_KEY"

Success Response (200)

{
  "success": true,
  "credits_balance": 150,
  "credits_reserved": 10
}

Field | Description
credits_balance | Your total credit balance (paid + free credits).
credits_reserved | Credits reserved by extractions currently being processed. Up to this amount will be deducted when processing completes, depending on the number of successful pages. Your usable balance is credits_balance minus credits_reserved.

Working with Output Files

You can control the structure and formatting of all output files in two main ways:

  • use output_structure to choose the top-level record shape, such as per_invoice or per_line_item
  • use your prompt to describe the fields, grouping, and overall structure you want, such as "one row per product" or "one row per PO"

You can also use your prompt to:

  • specify missing-value placeholders, such as empty string, N/A, or 0
  • specify formatting requirements, such as YYYY-MM-DD, digits only, or no currency symbol
  • specify the intended output type, such as text, number, date, datetime, boolean, currency, or percentage

These instructions may appear differently across JSON, CSV, and XLSX outputs, but they all affect how the final export is produced.

At a high level:

  • JSON output is string-based.
  • CSV is text-based.
  • XLSX can use native spreadsheet cell types when values can be safely interpreted.

Working with JSON Output

JSON value typing

In the JSON output file, extracted field values are returned as strings.

  • Standard fields are returned as strings.
  • If you ask for a field to contain JSON, that field is returned as a string containing valid JSON.
  • All values inside that JSON are also strings.

If you need numbers, booleans, or dates as typed values, parse them in your own code. If you plan to parse a value, state the formatting clearly in your prompt. For example:

  • "Do not include currency symbol"
  • "Use digits only"
  • "Return true or false"
  • "Use YYYY-MM-DD format"

Structured JSON fields

You can ask for a field to return structured JSON.

Example prompt:

"prompt": {
  "fields": [
    { "name": "Invoice Number" },
    {
      "name": "Line Items",
      "prompt": "Return a JSON array with keys description, quantity, unit_price, and amount. Use digits only for quantity. Use a full stop as the decimal separator. Do not include currency symbols in unit_price or amount. Do not use thousands separators. Use an empty string when a value is missing."
    }
  ]
}

Example JSON output value:

"Line Items": "[{\"description\":\"Widget\",\"quantity\":\"2\",\"unit_price\":\"9.99\",\"amount\":\"19.98\"}]"

In the example above, Line Items is a string whose content is valid JSON.
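To work with the nested data, decode that string with JSON.parse; the inner values are themselves strings (using the example value above):

```javascript
// A row from the JSON output; "Line Items" is a string containing valid JSON.
const row = {
  "Invoice Number": "INV-1001",
  "Line Items": "[{\"description\":\"Widget\",\"quantity\":\"2\",\"unit_price\":\"9.99\",\"amount\":\"19.98\"}]",
};
const lineItems = JSON.parse(row["Line Items"]);
// lineItems[0].quantity is the string "2"; convert with Number(...) as needed.
```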

Use nested line-item JSON like the example above mainly for smaller or simpler cases, such as when there are only a few line items and you want a single invoice-level object.

If you need detailed line item extraction, prefer output_structure: "per_line_item" instead of returning line items inside a nested JSON field.

This is strongly recommended when:

  • invoices may contain around 7 or more line items
  • line items need detailed per-field instructions
  • you want the most reliable line item extraction

In per_line_item, define invoice-level fields and line-item fields as separate top-level fields.

Many workflows can use the per_line_item output directly, with one row/object per line item.

If your workflow needs a nested structure such as { invoice_fields..., line_items: [...] }, include your own stable invoice identifier such as Invoice Number so you can group related line item rows back into invoices in your own system.

Do not rely on Source File alone to group rows into invoices. Source File helps you trace where a row came from, but it is not a stable invoice identifier.

Example prompt for the recommended approach:

{
  "prompt": {
    "fields": [
      { "name": "Invoice Number" },
      { "name": "Invoice Date", "prompt": "Use YYYY-MM-DD format" },
      { "name": "Vendor Name" },
      { "name": "Line Item Description" },
      { "name": "Line Item Quantity", "prompt": "Use digits only" },
      { "name": "Line Item Unit Price" },
      { "name": "Line Item Amount" }
    ],
    "general_prompt": "For amount fields don't use thousands separators, use full stops as the decimal separator and do not include currency symbols."
  },
  "output_structure": "per_line_item"
}

Example JSON output rows:

[
  {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Vendor Name": "Acme Ltd",
    "Line Item Description": "Widget A",
    "Line Item Quantity": "2",
    "Line Item Unit Price": "9.99",
    "Line Item Amount": "19.98"
  },
  {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Vendor Name": "Acme Ltd",
    "Line Item Description": "Widget B",
    "Line Item Quantity": "1",
    "Line Item Unit Price": "5.00",
    "Line Item Amount": "5.00"
  }
]

Both rows above belong to the same invoice because they share the same Invoice Number. If your workflow needs one record per line item, you can use the rows as-is. If your workflow needs a nested invoice structure, you can group rows that share the same invoice identifier to build your own { invoice_fields..., line_items: [...] } structure.
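The grouping step described above can be sketched in a few lines. This assumes every row carries a reliable Invoice Number; the function name groupByInvoice and the nested output shape are illustrative:

```javascript
// Group per_line_item rows into one object per invoice, keyed by Invoice Number.
function groupByInvoice(rows) {
  const invoices = new Map();
  for (const row of rows) {
    // Pull the invoice-level fields out; the rest of the row is the line item
    const {
      "Invoice Number": invoiceNumber,
      "Invoice Date": invoiceDate,
      "Vendor Name": vendorName,
      ...lineItem
    } = row;
    if (!invoices.has(invoiceNumber)) {
      invoices.set(invoiceNumber, {
        invoice_number: invoiceNumber,
        invoice_date: invoiceDate,
        vendor_name: vendorName,
        line_items: [],
      });
    }
    invoices.get(invoiceNumber).line_items.push(lineItem);
  }
  return [...invoices.values()];
}

// The two example rows above collapse into one invoice with two line items
const grouped = groupByInvoice([
  { "Invoice Number": "INV-1001", "Invoice Date": "2025-01-15", "Vendor Name": "Acme Ltd", "Line Item Description": "Widget A", "Line Item Amount": "19.98" },
  { "Invoice Number": "INV-1001", "Invoice Date": "2025-01-15", "Vendor Name": "Acme Ltd", "Line Item Description": "Widget B", "Line Item Amount": "5.00" },
]);
console.log(grouped[0].line_items.length); // 2
```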

CSV Output

CSV is a plain-text export. Every value in the CSV file is written as text.

XLSX Output

XLSX uses the most appropriate spreadsheet cell type for each value by default, and follows explicit prompt instructions where provided.
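Since CSV output (and the string values shown in the JSON examples above) deliver every value as text, downstream code typically converts numeric fields itself. A minimal sketch, using the per_line_item field names from this guide; the helper name toNumber is illustrative:

```javascript
// Convert a string-typed numeric field to a number.
// Number("") evaluates to 0, so treat empty strings as null instead.
function toNumber(value) {
  if (value === "" || value == null) return null;
  const n = Number(value);
  return Number.isNaN(n) ? null : n;
}

const row = { "Line Item Quantity": "2", "Line Item Unit Price": "9.99", "Line Item Amount": "" };
const quantity = toNumber(row["Line Item Quantity"]); // 2
const amount = toNumber(row["Line Item Amount"]);     // null (value was missing)
```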


Node.js Example

A ready-to-run script that handles the full workflow — reads files from a local folder, uploads them, submits an extraction task, polls until completion, and downloads the results. No dependencies beyond Node.js 18+.

Save this as extract.js, set the three configuration variables at the top (API_KEY, FOLDER_PATH, PROMPT), and run with node extract.js. You'll have extraction results in minutes.

import { readdir, readFile, stat, writeFile, mkdir } from "fs/promises";
import { join, extname } from "path";

// ---------------------------------------------------------------------------
// Configuration — set these before running
// ---------------------------------------------------------------------------

// Your API key. Get one at: https://invoicedataextraction.com/dashboard?view=API
// IMPORTANT: This is hardcoded here for simplicity. In production, load from an
// environment variable (e.g. process.env.IDE_API_KEY) and never commit to Git.
const API_KEY = "YOUR_API_KEY";

// Absolute path to the local folder containing the files you want to process.
const FOLDER_PATH = "/Users/you/Documents/invoices";

// Tell the AI what data to extract from each document (plain-text instruction).
const PROMPT = "Extract invoice number, date, vendor name, and total amount";
// For exact output column names, pass an object instead:
//   const PROMPT = { fields: [{ name: "Invoice Number" }, { name: "Total", prompt: "No currency symbol" }], general_prompt: "..." };

// A label for this extraction task (3-40 characters). Used in your dashboard and output filenames.
const TASK_NAME = "My extraction task";

// How rows are grouped in the output: "automatic" (AI decides), "per_invoice", or "per_line_item".
const OUTPUT_STRUCTURE = "automatic";

// Which output formats to download. Any combination of "xlsx", "csv", "json".
const DOWNLOAD_FORMATS = ["xlsx", "csv", "json"];

// ---------------------------------------------------------------------------
// Internal constants — no changes needed
// ---------------------------------------------------------------------------

const API_BASE = "https://api.invoicedataextraction.com/v1";
const SUPPORTED_EXTENSIONS = new Set([".pdf", ".jpg", ".jpeg", ".png"]);
const MAX_RETRIES = 3;

async function apiRequest(path, body) {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    const response = await fetch(`${API_BASE}${path}`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    });
    const text = await response.text();
    let data;
    try {
      data = JSON.parse(text);
    } catch {
      // Non-JSON response — could be a Cloudflare rate limit or infrastructure error.
      // Retry on 429/503, throw on anything else.
      if ((response.status === 429 || response.status === 503) && attempt < MAX_RETRIES) {
        const delayMs = 5000 * attempt;
        console.warn(`Non-JSON ${response.status} response, retrying in ${delayMs / 1000}s...`);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        continue;
      }
      throw new Error(`API returned non-JSON response (${response.status}): ${text.slice(0, 200)}`);
    }
    if (data.success) return data;

    // If the error is retryable and we have attempts left, wait and retry
    if (data.error.retryable && attempt < MAX_RETRIES) {
      // Use the Retry-After header if present (rate limit responses), otherwise exponential backoff
      const retryAfter = response.headers.get("Retry-After");
      const delayMs = retryAfter ? parseInt(retryAfter, 10) * 1000 : 1000 * attempt;
      console.warn(`Retryable error (${data.error.code}), retrying in ${delayMs / 1000}s...`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      continue;
    }

    throw new Error(
      `API error: ${data.error.code} — ${data.error.message}` +
        (data.error.details ? `\nDetails: ${JSON.stringify(data.error.details)}` : "")
    );
  }
}

// ---------------------------------------------------------------------------
// Step 1: Discover local files and create an upload session
// ---------------------------------------------------------------------------

// Scan the folder for supported file types
const entries = await readdir(FOLDER_PATH);
const files = [];

for (const entry of entries) {
  // Skip unsupported file types and subfolders
  const ext = extname(entry).toLowerCase();
  if (!SUPPORTED_EXTENSIONS.has(ext)) continue;
  const filePath = join(FOLDER_PATH, entry);
  const fileStat = await stat(filePath);
  if (!fileStat.isFile()) continue;

  // Add this file to the list with its size in bytes.
  // file_id must be unique within the session and can only contain letters, numbers,
  // dots, underscores, colons, and hyphens (no spaces). Use your own IDs (e.g., database
  // row IDs, UUIDs, or a simple counter).
  files.push({
    file_id: `file_${files.length + 1}`,
    file_name: entry,
    file_size_bytes: fileStat.size,
    localPath: filePath, // kept locally — not sent to the API
  });
}

// No supported files means nothing to upload; exit with a clear message
// rather than creating an empty upload session.
if (files.length === 0) {
  console.error(`No supported files (.pdf, .jpg, .jpeg, .png) found in ${FOLDER_PATH}`);
  process.exit(1);
}

// Optional: before uploading, you could calculate the credits required and check
// your balance. Each page costs one credit — for PDFs, count the pages; for
// images, each file is one credit. Then call GET /credits/balance to compare
// against your available balance (credits_balance minus credits_reserved).

// Generate a unique ID for this upload session (must be different for each new session)
const uploadSessionId = `session_${Date.now()}`;

// Create the upload session — registers all files with the API
let session;
try {
  session = await apiRequest("/uploads/sessions", {
    upload_session_id: uploadSessionId,
    files: files.map(({ file_id, file_name, file_size_bytes }) => ({
      file_id,
      file_name,
      file_size_bytes,
    })),
  });
} catch (error) {
  // Session creation failure is fatal — no files can be uploaded without a session
  console.error(`Failed to create upload session: ${error.message}`);
  process.exit(1);
}

console.log(`Upload session created: ${session.upload_session_id} (${files.length} files)`);

// The chunk size in bytes — always the same for all files in the session, so we read it from the first
const partSize = session.files[0].part_size;

// ---------------------------------------------------------------------------
// Steps 2 & 3: For each file — upload chunks, then complete the upload
// ---------------------------------------------------------------------------

const completedFileIds = [];

for (const file of files) {
  try {
    // Read the entire file into memory as a binary buffer
    const fileBuffer = await readFile(file.localPath);

    // Calculate how many parts this file needs
    const totalParts = Math.ceil(fileBuffer.length / partSize);
    const partNumbers = Array.from({ length: totalParts }, (_, i) => i + 1);

    // Request a presigned upload URL for each part
    const partsData = await apiRequest(`/uploads/sessions/${uploadSessionId}/parts`, {
      file_id: file.file_id,
      part_numbers: partNumbers,
    });

    // Upload each chunk to its presigned URL via PUT
    const completedParts = [];

    for (const { part_number, url } of partsData.part_urls) {
      // Slice the file buffer into a chunk for this part
      const start = (part_number - 1) * partSize;
      const end = Math.min(start + partSize, fileBuffer.length);
      const chunk = fileBuffer.subarray(start, end);

      // PUT the raw bytes directly to the presigned URL
      const putResponse = await fetch(url, { method: "PUT", body: chunk });
      if (!putResponse.ok) {
        const errorBody = await putResponse.text();
        throw new Error(
          `Upload failed for ${file.file_name} part ${part_number}: ${putResponse.status} ${putResponse.statusText}\n${errorBody}`
        );
      }

      // Save the ETag — needed to complete the upload in Step 3
      completedParts.push({
        part_number,
        e_tag: putResponse.headers.get("etag"),
      });
    }

    console.log(`Uploaded: ${file.file_name} (${totalParts} part${totalParts > 1 ? "s" : ""})`);

    // Complete the file upload with the collected ETags
    await apiRequest(`/uploads/sessions/${uploadSessionId}/complete`, {
      file_id: file.file_id,
      parts: completedParts,
    });

    console.log(`Completed: ${file.file_name}`);
    completedFileIds.push(file.file_id);
  } catch (error) {
    // By default, abort on any file failure to avoid silent partial uploads.
    // If you'd prefer to continue with remaining files, remove the process.exit.
    console.error(`Failed: ${file.file_name} — ${error.message}`);
    process.exit(1);
  }
}

// All files uploaded and completed successfully
console.log(`\n${completedFileIds.length} files ready for extraction.`);

// ---------------------------------------------------------------------------
// Steps 4 & 5: Submit the extraction task and poll until it completes
// ---------------------------------------------------------------------------

// Retryable polling errors (e.g., concurrent task limit, temporary server issues) trigger
// a fresh submission. Non-retryable errors require action from you — the log message tells
// you what to fix before re-running the script.

// Optional: human-readable guidance for non-retryable error codes (see error reference above).
// This just improves the console output — the API works the same without it.
const NON_RETRYABLE_GUIDANCE = {
  INSUFFICIENT_CREDITS: "Purchase credits at https://invoicedataextraction.com/dashboard?view=Billing then re-run this script.",
  FILE_PAGE_LIMIT_EXCEEDED: "Split the affected files into smaller documents and re-upload.",
  ENCRYPTED_FILE: "Remove encryption from the affected files and re-upload.",
  NO_PAGES_FOUND: "Check that your files are valid and contain extractable content.",
  PROMPT_REJECTED: "Revise your prompt to clearly describe what data to extract.",
  PROMPT_UNCLEAR: "Revise your prompt with clearer instructions and re-run.",
  FILE_SIZE_LIMIT_EXCEEDED: "Split large files into smaller documents and re-upload.",
};

const MAX_SUBMISSION_ATTEMPTS = 2;
const POLL_INTERVAL_MS = 5000;

let result;

for (let attempt = 1; attempt <= MAX_SUBMISSION_ATTEMPTS; attempt++) {
  // Each attempt needs a unique submission_id
  const submissionId = `sub_${Date.now()}_${attempt}`;

  const run = await apiRequest("/extractions", {
    submission_id: submissionId,
    upload_session_id: uploadSessionId,
    file_ids: completedFileIds,
    task_name: TASK_NAME,
    prompt: PROMPT,
    output_structure: OUTPUT_STRUCTURE,
  });

  console.log(`\nExtraction task submitted (extraction_id: ${run.extraction_id})`);

  // Poll until completed or failed
  let lastFailureCode = null;
  let consecutivePollErrors = 0;
  const MAX_CONSECUTIVE_POLL_ERRORS = 10;
  while (true) {
    // Fetch the current task status. A gateway error page or a network blip can
    // produce a non-JSON body, so parse defensively instead of crashing.
    let data = null;
    let httpStatus = 0;
    try {
      const response = await fetch(`${API_BASE}/extractions/${run.extraction_id}`, {
        headers: { Authorization: `Bearer ${API_KEY}` },
      });
      httpStatus = response.status;
      data = await response.json();
    } catch {
      // Counted as a polling issue below
    }

    if (data?.status === "completed") {
      result = data;
      break;
    }

    if (data?.status === "failed") {
      const { code, message, details, retryable } = data.error;
      console.error(`\nExtraction failed: ${code} — ${message}`);
      if (details) console.error(`Details: ${JSON.stringify(details)}`);

      if (!retryable) {
        const guidance = NON_RETRYABLE_GUIDANCE[code] || "Check the error above and re-run when resolved.";
        console.error(`\nAction required: ${guidance}`);
        process.exit(1);
      }

      // Retryable — wait then submit again.
      // Concurrent task limit means we wait longer (5 min) for other processing tasks to finish.
      // Other retryable errors are transient, so a short delay (10s) suffices.
      const delayMs = code === "CONCURRENT_TASK_LIMIT" ? 300_000 : 10_000;
      console.log(`Retrying in ${delayMs / 1000}s (attempt ${attempt}/${MAX_SUBMISSION_ATTEMPTS})...`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      lastFailureCode = code;
      break;
    }

    // Still processing — reset error counter and poll again
    if (data?.status === "processing") {
      consecutivePollErrors = 0;
      console.log(`Processing... ${data.progress ?? 0}%`);
    } else {
      consecutivePollErrors++;
      console.warn(`Polling issue (HTTP ${httpStatus || "network error"}) — retrying in ${POLL_INTERVAL_MS / 1000}s... (${consecutivePollErrors}/${MAX_CONSECUTIVE_POLL_ERRORS})`);
      if (consecutivePollErrors >= MAX_CONSECUTIVE_POLL_ERRORS) {
        console.error(`\nToo many consecutive polling errors. The extraction may still be processing — check your dashboard or retry later.`);
        process.exit(1);
      }
    }
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }

  if (result) break;

  if (lastFailureCode && attempt === MAX_SUBMISSION_ATTEMPTS) {
    const exitMessage = lastFailureCode === "CONCURRENT_TASK_LIMIT"
      ? `\nStill hitting the concurrent task limit after ${MAX_SUBMISSION_ATTEMPTS} attempts. Wait for your other extractions to finish, then re-run.`
      : `\nGave up after ${MAX_SUBMISSION_ATTEMPTS} attempts. There may be temporary service issues — please wait and try again later.`;
    console.error(exitMessage);
    process.exit(1);
  }
}

console.log(`\nExtraction completed!`);
console.log(`Credits deducted: ${result.credits_deducted}`);
console.log(`Output structure: ${result.output_structure}`);
console.log(`Pages: ${result.pages.successful_count} successful, ${result.pages.failed_count} failed`);

if (result.pages.failed_count > 0) {
  console.warn(`\nWarning: ${result.pages.failed_count} page(s) failed to extract. Data from these pages is missing from the output.`);
  for (const page of result.pages.failed) {
    console.warn(`  - ${page.file_name} (page ${page.page})`);
  }
}

if (result.ai_uncertainty_notes.length > 0) {
  console.log(`\n--- AI Uncertainty Notes ---`);
  console.log(`The AI made assumptions in ${result.ai_uncertainty_notes.length} area(s). Review these and consider adding the suggested prompt additions to improve future extractions.\n`);
  result.ai_uncertainty_notes.forEach((note, i) => {
    console.log(`  [${i + 1}] ${note.topic}`);
    console.log(`  ${note.description}`);
    for (const suggestion of note.suggested_prompt_additions) {
      console.log(`    → ${suggestion.purpose}: "${suggestion.instructions}"`);
    }
    console.log();
  });
  console.log(`---`);
}

// ---------------------------------------------------------------------------
// Step 6: Download the output files
// ---------------------------------------------------------------------------

const timestamp = new Date().toISOString().replace(/[:.]/g, "-").slice(0, 19);
const safeName = TASK_NAME.replace(/[^a-zA-Z0-9_-]/g, "_");
await mkdir("output", { recursive: true });

for (const format of DOWNLOAD_FORMATS) {
  const url = result.output[`${format}_url`];
  if (!url) {
    console.warn(`No ${format} download available.`);
    continue;
  }
  const response = await fetch(url);
  if (!response.ok) {
    console.error(`Failed to download ${format}: ${response.status}`);
    continue;
  }
  const buffer = Buffer.from(await response.arrayBuffer());
  const outputPath = `output/${safeName}_${timestamp}.${format}`;
  await writeFile(outputPath, buffer);
  console.log(`Downloaded: ${outputPath}`);
}

console.log(`\nDone. Extraction ${result.extraction_id} completed successfully.`);
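
The optional pre-upload credits check mentioned in Step 1 could be sketched as below. It calls the GET /credits/balance endpoint and uses the credits_balance and credits_reserved fields described there; the function names and the assumption that those fields sit at the top level of the JSON body are illustrative, not part of the script above:

```javascript
// Available credits = credits_balance minus credits_reserved
function availableCredits({ credits_balance, credits_reserved }) {
  return credits_balance - credits_reserved;
}

// Compare the available balance against an estimated page count before uploading
async function hasEnoughCredits(apiBase, apiKey, requiredPages) {
  const response = await fetch(`${apiBase}/credits/balance`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const data = await response.json();
  return availableCredits(data) >= requiredPages;
}
```

Remember that each page costs one credit: count pages for PDFs, and count each image file as one page.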