Invoice Data Extraction Python SDK

Official Python SDK for Invoice Data Extraction. Handles file upload, extraction submission, polling, and result download so you can go from local files to structured output in a few lines of code.

Requires Python 3.9 or later.

Install

pip install invoicedataextraction-sdk

Quick Start

import json
import os
import sys

from invoicedataextraction import InvoiceDataExtraction
from invoicedataextraction.errors import SdkError, ApiResponseError

try:
    client = InvoiceDataExtraction(
        api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY"),
    )

    result = client.extract(
        folder_path="./invoices",
        prompt="Extract invoice number and total",
        output_structure="per_invoice",
        download={
            "formats": ["xlsx", "json"],
            "output_path": "./output",
        },
        console_output=True,  # remove to disable console logging
    )
except (SdkError, ApiResponseError) as error:
    print(json.dumps(error.body, indent=2), file=sys.stderr)
    raise SystemExit(1)

extract(...) uploads your files (pass either folder_path or files), submits the extraction, polls until it finishes, and downloads the results. The returned result is the final polling response from the API. For completed responses, check result["pages"]["failed_count"] to verify that all uploaded pages were processed successfully; if it is greater than 0, inspect result["pages"]["failed"] and result["pages"]["failure_reasons"]. Also check result["review_needed"]["count"] before relying on the extracted data.

Generate an API key from your dashboard. Every account includes 50 free pages per month. Additional credits can be purchased on a pay-as-you-go basis with no subscription needed.

Constructor

import os
from invoicedataextraction import InvoiceDataExtraction

client = InvoiceDataExtraction(
    api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY"),
)

Parameter	Required	Description
`api_key`	Yes	Your API key.
`base_url`	No	API base URL. Defaults to `https://api.invoicedataextraction.com/v1`. Only needed for testing or non-production environments.

`extract(...)`

Run a complete extraction in a single call. Pass a folder path or a list of file paths, tell the SDK what to extract, and optionally download the extracted data to disk as Excel, CSV, or JSON. The method returns the extraction task results — credits deducted, successful and failed pages, Review Needed warnings, and prompt notes in ai_uncertainty_notes. The SDK handles upload, submission, polling, and download internally.

Underlying API workflow: upload session → submit extraction → poll for results → download output. See File limits for size and count constraints.

result = client.extract(
    folder_path="./invoices",
    prompt="Extract invoice number, date, vendor name, and total amount",
    output_structure="per_invoice",
    download={
        "formats": ["xlsx", "json"],
        "output_path": "./output",
    },
    console_output=True,  # remove to disable console logging
)

Parameters

Parameter	Required	Description
`folder_path`	One of `folder_path` or `files`	Path to a local folder. The SDK uploads every supported file in the folder (`.pdf`, `.jpg`, `.jpeg`, `.png`). Not recursive.
`files`	One of `folder_path` or `files`	List of local file paths to upload. Supported types: `.pdf`, `.jpg`, `.jpeg`, `.png`.
`prompt`	Yes	Extraction instructions. String or dict — see Prompt below.
`output_structure`	Yes	Controls how the extracted data is structured — see Output structure below.
`task_name`	No	Your label for this extraction (3–40 characters). Appears in the web dashboard. If omitted, the SDK generates one as `extraction_YYYYMMDD_HHMMSS`.
`exclude_columns`	No	List of system-generated columns to exclude from output. By default, output files include a "Source File" column indicating which uploaded file/page each row was extracted from, and a "Review Needed" column marking rows that need human verification. If your workflow requires an exact output structure, you can exclude either column. Valid values: `"source_file"`, `"review_needed"`. Excluding `"review_needed"` removes only the export column; Review Needed warnings can still be generated and returned in the completed response.
`download`	No	Download options — see Download below. If omitted, no files are downloaded.
`polling`	No	Polling options — see Polling below.
`console_output`	No	Boolean. When `True`, the SDK logs progress to the console during upload, polling, and download. Off by default.
`on_update`	No	Callable for lifecycle updates — see on_update below.

Output structure

Controls how the extracted data is structured:

Value	Meaning
`automatic`	The AI decides based on your prompt and documents.
`per_invoice`	Each invoice becomes a single row (spreadsheet/CSV) or object (JSON).
`per_line_item`	Each individual product/service listed within an invoice becomes its own row (spreadsheet/CSV) or object (JSON).

Prompt

The prompt tells the AI what data to extract. It can be a string or a dict.

String — describe what you want in natural language (max 2,500 characters):

prompt="Extract invoice number, date, vendor name, and total amount"

With a string, the AI chooses output field names based on your instructions.

Dict — use a dict when you need exact output field names. Each name is guaranteed to appear exactly as written in the extracted data. You can also add optional per-field and general instructions:

prompt={
    "fields": [
        {"name": "Invoice Number"},
        {"name": "Invoice Date", "prompt": "The date the invoice was issued, NOT the due date"},
        {"name": "Vendor Name"},
        {"name": "Total Amount", "prompt": "No currency symbol, 2 decimal places"},
    ],
    "general_prompt": "Extract one record per invoice or credit note. Ignore email cover letters. Dates should be in YYYY-MM-DD format.",
}

Each item in fields:

Field	Type	Required	Description
`name`	string	Yes	The name for this data point in the output (2–50 characters). Prefer clear, descriptive names (e.g., "Invoice Number", not "Field A").
`prompt`	string	No	Specific instructions for extracting this data point (3–600 characters). Use this to clarify ambiguities or instruct special handling.

Field	Type	Required	Description
`general_prompt`	string	No	Instructions that apply to the full task and across all fields (max 1,500 characters). Use this to provide special handling instructions, specify output formatting, or describe the extraction goal.

fields must be a non-empty list.
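
The limits above can be checked locally before a prompt is submitted. The following is a minimal sketch, not part of the SDK: `validate_prompt` is a hypothetical helper, and it assumes the documented limits are counted in Python string characters.

```python
def validate_prompt(prompt):
    """Return a list of problems found in a string or dict prompt (empty if none).

    Hypothetical helper, not part of the SDK. It mirrors the documented limits
    and assumes they are counted in Python string characters.
    """
    problems = []
    if isinstance(prompt, str):
        if len(prompt) > 2500:
            problems.append("string prompt exceeds 2,500 characters")
        return problems
    if not isinstance(prompt, dict):
        return ["prompt must be a string or a dict"]

    fields = prompt.get("fields")
    if not isinstance(fields, list) or not fields:
        problems.append("fields must be a non-empty list")
        fields = []
    for index, field in enumerate(fields):
        name = field.get("name", "")
        if not 2 <= len(name) <= 50:
            problems.append(f"fields[{index}].name must be 2-50 characters")
        field_prompt = field.get("prompt")
        if field_prompt is not None and not 3 <= len(field_prompt) <= 600:
            problems.append(f"fields[{index}].prompt must be 3-600 characters")

    general = prompt.get("general_prompt")
    if general is not None and len(general) > 1500:
        problems.append("general_prompt exceeds 1,500 characters")
    return problems
```

An empty list means the prompt passed these local checks; the API remains the authority on what it accepts.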

For guidance on writing effective prompts, see the Extraction Guide.

Download

When download is provided, the SDK downloads output files after a successful extraction.

download={
    "formats": ["xlsx", "csv", "json"],
    "output_path": "./output",
}

Field	Required	Description
`formats`	Yes	List of output formats to download. One or more of `"xlsx"`, `"csv"`, `"json"`.
`output_path`	Yes	Destination folder for downloaded files. Created automatically if it doesn't exist.

Downloaded files are named {task_name}_{timestamp}.{format}.

Auto-download is a best-effort convenience. If the extraction completed but a download fails, the SDK surfaces a warning through console_output / on_update and still returns the completed extraction response. You can retry the download later using download_output(...).

Auto-download does not overwrite existing files. If a generated file path already exists, the SDK skips that file and surfaces a warning.

Returns

extract(...) returns the terminal polling response from the API unchanged — for completed, failed, and cancelled extractions.

Verifying results: When extract(...) returns a completed extraction, check result["pages"]["failed_count"]. If it's 0, every uploaded page was processed successfully and is included in the output. If it's greater than 0, inspect result["pages"]["failed"] and result["pages"]["failure_reasons"] to see which specific files/pages failed and why — those pages are not included in the output. This is the primary check to confirm that everything you submitted was extracted without issue.
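
As a rough illustration of this check, the sketch below groups failed pages by file and collects the failure messages. Both function names are hypothetical helpers rather than SDK methods; they rely only on the response fields documented in this section.

```python
from collections import defaultdict


def failed_pages_by_file(result):
    """Group result["pages"]["failed"] into {file_name: [page, ...]}."""
    grouped = defaultdict(list)
    for entry in result["pages"]["failed"]:
        grouped[entry["file_name"]].append(entry["page"])
    return {name: sorted(pages) for name, pages in grouped.items()}


def failure_messages(result):
    """Map each failure code to its user-facing message."""
    return {
        reason["code"]: reason["message"]
        for reason in result["pages"]["failure_reasons"]
    }
```

Both helpers expect a completed response; failed and cancelled responses have no pages object in the examples shown in this document.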

Completed:

{
  "success": true,
  "status": "completed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "credits_deducted": 25,
  "output_structure": "per_invoice",
  "output_expires_at": "2026-07-14T10:30:00Z",
  "pages": {
    "successful_count": 10,
    "failed_count": 2,
    "successful": [
      { "file_name": "invoice-1.pdf", "page": 1 }
    ],
    "failed": [
      { "file_name": "damaged.pdf", "page": 1 }
    ],
    "failure_reasons": [
      {
        "code": "PROCESSING_FILE_SIZE_LIMIT_EXCEEDED",
        "message": "The upload was accepted, but during processing part of the PDF became too large for our file-processing limit. This can happen when a compressed PDF is processed internally. Split the PDF into smaller page chunks and resubmit.",
        "affected_pages": [
          { "file_name": "damaged.pdf", "pages": [1] }
        ]
      }
    ]
  },
  "ai_uncertainty_notes": [],
  "review_needed": {
    "count": 1,
    "items": [
      {
        "message": "Check whether the extracted total should include the handwritten adjustment near the bottom of the document.",
        "affected_fields": ["Total Amount"],
        "output_row_numbers": [4],
        "source_references": ["invoice-1.pdf (Page 2)"]
      }
    ]
  },
  "output": {
    "xlsx_url": "https://...",
    "csv_url": "https://...",
    "json_url": "https://..."
  }
}

Field	Description
`credits_deducted`	Credits charged for this extraction (one credit per successful page).
`output_structure`	The output structure used: `"per_invoice"` or `"per_line_item"`. If you submitted `"automatic"`, this tells you what the AI chose.
`output_expires_at`	ISO 8601 timestamp marking when the generated output files will be deleted under the 90-day retention policy. After this time, `output.*_url` fields are `None` and `download_output(...)` raises `OUTPUT_EXPIRED`. See Output expiry.
`pages.successful_count`	Number of pages successfully processed.
`pages.failed_count`	Number of pages that failed processing.
`pages.successful`	List of successfully processed pages. Each item has `file_name` (the uploaded file name) and `page` (the page number within that file).
`pages.failed`	List of pages that failed processing. Same shape as `successful`.
`pages.failure_reasons`	Page-failure reason metadata when available. Empty list if none. Each item has `code`, user-facing `message`, and `affected_pages` grouped by uploaded `file_name` with source-file page numbers. The current public `code` value is `"PROCESSING_FILE_SIZE_LIMIT_EXCEEDED"`.
`ai_uncertainty_notes`	Prompt notes: areas where your prompt left room for interpretation and the AI made an assumption about how to apply it to the documents. Empty list if none. Each note has a `topic`, a `description` of what was assumed, and a `suggested_prompt_additions` list of prompt additions you can use to remove the ambiguity in future extractions. Each suggestion has a `purpose` (why you'd add it) and `instructions` (prompt text you can add).
`review_needed`	Result-level warnings for records that need human verification before you rely on the output. Always present on completed responses as `{"count": ..., "items": [...]}`. Check `result["review_needed"]["count"]`; if greater than `0`, route the listed rows for manual verification. Each item has `message`, `affected_fields`, `output_row_numbers`, and `source_references`. `affected_fields` is populated only for field-specific concerns. `output_row_numbers` contains one or more 1-based extracted data row numbers and does not include the Excel/CSV header row.
`output`	Presigned download URLs for each format (`xlsx_url`, `csv_url`, `json_url`). `None` if not available — including when the output has aged past `output_expires_at`. URLs expire after 5 minutes — use `download_output(...)` or `get_download_url(...)` for a fresh URL while output is still retained.

File uploads are all-or-nothing — if extract(...) returns without raising, every file was uploaded successfully. The only failures to check for are in pages.failed and pages.failure_reasons, which describe pages that failed during extraction processing. If pages.failed_count is 0, all uploaded files and pages were processed successfully.

We strongly recommend checking result["review_needed"]["count"] before relying on extracted data. If it is greater than 0, route the listed rows for manual verification in your workflow.
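
Because output_row_numbers counts data rows and excludes the header, the row you see in a spreadsheet is offset by the header. A minimal sketch of that arithmetic follows; `spreadsheet_rows_to_review` is a hypothetical helper, and the single header row in spreadsheet row 1 is an assumption about the downloaded file layout.

```python
def spreadsheet_rows_to_review(result, header_rows=1):
    """Return sorted spreadsheet row numbers that carry Review Needed warnings.

    Assumes the exported sheet starts with `header_rows` header rows
    (one by default), so data row n sits at spreadsheet row n + header_rows.
    """
    rows = set()
    for item in result["review_needed"]["items"]:
        for data_row in item["output_row_numbers"]:
            rows.add(data_row + header_rows)
    return sorted(rows)
```

With the completed example above, data row 4 would map to spreadsheet row 5 under that assumption.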

Failed:

When the extraction task itself fails, extract(...) returns the failed polling response — it does not raise. The failure details are in the returned response body, not on error.body.

{
  "success": false,
  "status": "failed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "error": {
    "code": "INSUFFICIENT_CREDITS",
    "message": "Insufficient credits to process this extraction.",
    "retryable": false,
    "details": { "credits_required": 25, "credits_balance": 15, "credits_reserved": 10 }
  }
}

See the API docs for the full list of task failure codes.

Cancelled:

If an extraction is cancelled from View results in the web app while queued or processing, extract(...) returns the cancelled polling response unchanged. No output files are available.

{
  "success": true,
  "status": "cancelled",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "credits_deducted": 4
}
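
Since all three terminal responses are returned rather than raised, calling code typically branches on status. The sketch below shows one way to do that; `describe_terminal_response` is a hypothetical helper that uses only the fields shown in the examples above.

```python
def describe_terminal_response(result):
    """Summarise a completed, failed, or cancelled response in one line."""
    status = result["status"]
    if status == "completed":
        pages = result["pages"]
        return (
            f"completed: {pages['successful_count']} pages ok, "
            f"{pages['failed_count']} failed, "
            f"{result['review_needed']['count']} review warnings"
        )
    if status == "failed":
        error = result["error"]
        retry = "retryable" if error["retryable"] else "not retryable"
        return f"failed: {error['code']} ({retry})"
    if status == "cancelled":
        return f"cancelled: {result['credits_deducted']} credits deducted"
    raise ValueError(f"not a terminal status: {status!r}")
```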

When `extract(...)` raises

extract(...) only raises before a terminal extraction response is available — for example if upload, submission, or polling fails due to invalid input, network errors, or a polling timeout. These are SDK/API errors and are read from error.body as described in Errors.

Staged Workflow

extract(...) runs the full pipeline in one call. If you need control over individual steps — for example, uploading files in one part of your system and triggering extraction in another, running multiple extractions against the same uploaded files, or fitting each step into your own error handling and retry logic — use these methods instead:

import json
import os
import sys

from invoicedataextraction import InvoiceDataExtraction
from invoicedataextraction.errors import SdkError, ApiResponseError

try:
    client = InvoiceDataExtraction(
        api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY"),
    )

    upload = client.upload_files(
        files=["./invoice1.pdf", "./invoice2.pdf"],
        console_output=True,
    )

    submitted = client.submit_extraction(
        upload_session_id=upload["upload_session_id"],
        file_ids=upload["file_ids"],
        prompt="Extract invoice number and total",
        output_structure="per_invoice",
    )

    result = client.wait_for_extraction_to_finish(
        extraction_id=submitted["extraction_id"],
        console_output=True,
    )

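    # Failed and cancelled tasks are returned, not raised, and have no "pages" or output
    if result["status"] != "completed":
        print(json.dumps(result, indent=2), file=sys.stderr)
        raise SystemExit(1)

    # Verify all pages were processed
    if result["pages"]["failed_count"] > 0:
        print("Some pages failed processing:", result["pages"]["failed"])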

    client.download_output(
        extraction_id=submitted["extraction_id"],
        format="xlsx",
        file_path="./output/invoices.xlsx",
    )
except (SdkError, ApiResponseError) as error:
    print(json.dumps(error.body, indent=2), file=sys.stderr)
    raise SystemExit(1)

`upload_files(...)`

Upload local files without starting an extraction. Use this when you want to upload once and submit extractions separately — for example, to run different prompts against the same files, or to upload in one part of your system and extract in another.

Underlying API workflow: create upload session → upload file parts → complete each file. See File limits for size and count constraints.

Parameter	Required	Description
`folder_path`	One of `folder_path` or `files`	Path to a local folder. The SDK uploads every supported file in the folder (`.pdf`, `.jpg`, `.jpeg`, `.png`). Not recursive.
`files`	One of `folder_path` or `files`	List of local file paths to upload. Supported types: `.pdf`, `.jpg`, `.jpeg`, `.png`.
`upload_session_id`	No	Your own session ID. If omitted, the SDK generates one. If an upload fails partway through, that session cannot be resumed — start a new upload with a fresh session ID.
`console_output`	No	Boolean. When `True`, the SDK logs upload progress to the console.
`on_update`	No	Callable for upload lifecycle updates — see on_update.

Returns

{
  "upload_session_id": "session_a1b2c3d4-...",
  "file_ids": ["file_abc123", "file_def456"]
}

Pass upload_session_id and file_ids to submit_extraction(...) to start an extraction.

File uploads are all-or-nothing. If any file fails to upload, the method raises immediately — there is no partial success state. If upload_files(...) returns without raising, every file was uploaded successfully.

The API checks your credit balance when the upload session is created. If you don't have enough credits, upload_files(...) raises INSUFFICIENT_CREDITS before any files are uploaded.
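
To preview which files a folder_path upload would pick up, the selection rule can be approximated locally. This is a sketch under assumptions, not the SDK's own logic: `list_supported_files` is a hypothetical helper, and matching extensions case-insensitively is a guess that the SDK may not share.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}


def list_supported_files(folder_path):
    """List supported files directly inside folder_path (not recursive).

    Assumption: extensions are compared case-insensitively.
    """
    folder = Path(folder_path)
    return sorted(
        str(path)
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

The returned list can also be passed as files when you want the upload to match exactly what you previewed.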

`submit_extraction(...)`

Submit an extraction task for files that have already been uploaded. The method returns immediately — it does not wait for the extraction to finish.

Underlying API endpoint: POST /extractions.

Parameter	Required	Description
`upload_session_id`	Yes	The upload session ID returned by `upload_files(...)`.
`file_ids`	Yes	List of file IDs returned by `upload_files(...)`.
`prompt`	Yes	Extraction instructions. String or dict — see Prompt.
`output_structure`	Yes	Controls how the extracted data is structured — see Output structure.
`task_name`	No	Your label for this extraction (3–40 characters). Appears in the web dashboard. If omitted, the SDK generates one as `extraction_YYYYMMDD_HHMMSS`.
`exclude_columns`	No	List of system-generated columns to exclude from output. By default, output files include a "Source File" column indicating which uploaded file/page each row was extracted from, and a "Review Needed" column marking rows that need human verification. If your workflow requires an exact output structure, you can exclude either column. Valid values: `"source_file"`, `"review_needed"`. Excluding `"review_needed"` removes only the export column; Review Needed warnings can still be generated and returned in the completed response.
`submission_id`	No	Your own idempotency ID for this submission. If omitted, the SDK generates one. If a request fails or times out, retry with the same `submission_id` to safely retrieve the existing task instead of creating a duplicate.
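
The retry behaviour described for submission_id can be sketched as a small wrapper that fixes one ID and reuses it on every attempt. Everything here is illustrative: `submit_with_retry` is a hypothetical helper, a UUID string is only assumed to be an acceptable submission_id, and the default of retrying on any exception is deliberately broad, so narrow `retry_on` to the errors you consider transient.

```python
import time
import uuid


def submit_with_retry(submit, *, attempts=3, delay_seconds=2.0,
                      retry_on=(Exception,), sleep=time.sleep, **submit_kwargs):
    """Call submit(...) up to `attempts` times with the same submission_id.

    `submit` is any callable with the keyword arguments of
    submit_extraction(...). Reusing one submission_id keeps retries from
    creating duplicate tasks.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # Assumption: a UUID string is a valid submission_id.
    submission_id = submit_kwargs.pop("submission_id", None) or str(uuid.uuid4())
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return submit(submission_id=submission_id, **submit_kwargs)
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error
```

A call would look like submit_with_retry(client.submit_extraction, upload_session_id=..., file_ids=..., prompt=..., output_structure="per_invoice").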

Returns

{
  "success": true,
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "submission_state": "received"
}

The task is now queued for processing. Use extraction_id to poll for results with wait_for_extraction_to_finish(...) or check_extraction(...). Submitted tasks also appear in the web dashboard where you can view progress and results.

`wait_for_extraction_to_finish(...)`

Poll an extraction until it reaches a terminal state (completed, failed, or cancelled). Use this after submit_extraction(...) when you want the SDK to handle the polling loop for you.

Underlying API endpoint: GET /extractions/{extraction_id} (polled repeatedly).

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID returned by `submit_extraction(...)`.
`polling`	No	Polling options — see Polling.
`console_output`	No	Boolean. When `True`, the SDK logs polling progress to the console.
`on_update`	No	Callable for waiting lifecycle updates — see on_update.

Returns

Returns the terminal polling response from the API unchanged — the same shape documented for extract(...) returns.

When the extraction completes, you get the full result with credits_deducted, pages, ai_uncertainty_notes, review_needed, and output URLs. Check result["pages"]["failed_count"] to verify all pages were processed, check result["review_needed"]["count"] for result-level warnings before relying on the data, and use ai_uncertainty_notes for prompt assumptions you may want to clarify in future runs. When it fails, you get the failed response with result["error"]["code"] and result["error"]["message"]. If the task is cancelled from View results in the web app while the SDK is polling, you get {"success": True, "status": "cancelled", "extraction_id": ..., "credits_deducted": ...}. In all cases the terminal response is returned, not raised.

If polling.timeout_ms is set and the extraction hasn't finished in time, the method raises SDK_TIMEOUT_ERROR. The extraction may still be processing — you can check later with check_extraction(...) or from the web dashboard.

`download_output(...)`

Download a single output file for a completed extraction to disk. Use this for manual downloads after using the staged workflow, or to retry a failed auto-download from extract(...).

Underlying API workflow: request a fresh presigned download URL → download the file → write to disk.

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID whose output you want to download.
`format`	Yes	A single output format: `"xlsx"`, `"csv"`, or `"json"`.
`file_path`	Yes	Full destination file path on disk. The file extension must match the requested `format`. The parent directory is created automatically if it doesn't exist.

download_output(...) does not overwrite existing files. If file_path already exists, the SDK raises SDK_FILESYSTEM_ERROR with guidance to choose a new path or remove the existing file.

The extraction must be completed before downloading. If the output is not available — for example, the extraction hasn't finished or the format was not generated — the method raises OUTPUT_NOT_AVAILABLE. If the output existed but has aged past the 90-day retention window, the method raises OUTPUT_EXPIRED. See Output expiry.
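
Because download_output(...) refuses to overwrite, scripts that run repeatedly may want to pick an unused path first. A minimal sketch follows; `next_free_path` is a hypothetical helper, and the `_1`, `_2` suffix scheme is just one possible convention.

```python
from pathlib import Path


def next_free_path(file_path):
    """Return file_path, or a numbered variant of it that does not exist yet.

    The extension is preserved so it still matches the requested format.
    """
    path = Path(file_path)
    candidate = path
    counter = 1
    while candidate.exists():
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        counter += 1
    return str(candidate)
```

Another process could still create the file between this check and the download, in which case the SDK error described above would apply.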

Returns

{
  "success": true,
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "format": "xlsx",
  "file_path": "./output/invoices.xlsx"
}

`check_extraction(...)`

Check the current status of a submitted extraction without polling. Use this when you want a single point-in-time status check — for example, in a job queue where you check periodically on your own schedule rather than having the SDK poll using wait_for_extraction_to_finish(...).

Underlying API endpoint: GET /extractions/{extraction_id}.

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID to check.

Returns

Returns the current polling response from the API unchanged. The response may represent a processing, completed, cancelled, or failed extraction — the same shapes documented for extract(...) returns. A processing response includes a progress field (0–100) indicating approximate completion.

check_extraction(...) wraps the polling endpoint and is intended for "is it done yet?" checks. To retrieve the full record for any extraction in any state (including the original prompt, options, full page-level results, prompt notes in ai_uncertainty_notes, Review Needed warnings, and the full failure message/details), use get_extraction(...).
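
If you drive the checks yourself, a loop along these lines may be enough. It is a sketch rather than the SDK's polling logic: `poll_until_terminal` is a hypothetical helper, `check` is any callable shaped like check_extraction, and the interval and check limit are arbitrary defaults.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_until_terminal(check, extraction_id, *, interval_seconds=5.0,
                        max_checks=60, sleep=time.sleep):
    """Call check(extraction_id=...) until a terminal status is seen.

    Returns the terminal response, or None if max_checks is reached first.
    """
    for attempt in range(max_checks):
        response = check(extraction_id=extraction_id)
        if response.get("status") in TERMINAL_STATUSES:
            return response
        if attempt < max_checks - 1:
            sleep(interval_seconds)
    return None
```

For example, poll_until_terminal(client.check_extraction, extraction_id) checks every five seconds for up to sixty checks; a None return means the task was still processing at the last check.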

`get_download_url(...)`

Request a fresh presigned download URL for an extraction's output. Use this when you want to handle the download yourself rather than using download_output(...).

Underlying API endpoint: GET /extractions/{extraction_id}/output?format={format}.

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID whose output you want to download.
`format`	Yes	A single output format: `"xlsx"`, `"csv"`, or `"json"`.

Returns

{
  "download_url": "https://storage.example.com/...?X-Amz-Signature=...",
  "format": "xlsx",
  "expires_in_seconds": 300
}

The URL is a temporary, pre-authenticated link. Make a plain GET request to it — no Authorization header needed. It expires after 5 minutes.

The extraction must be completed before requesting a download URL. If the output is not available, the method raises OUTPUT_NOT_AVAILABLE. If the output existed but has aged past the 90-day retention window, the method raises OUTPUT_EXPIRED. See Output expiry.
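
Handling the download yourself can be as simple as a standard-library GET. The sketch below assumes nothing about the SDK beyond the download_url field shown above; `save_presigned_url` is a hypothetical helper, and it opens the destination in exclusive mode so that it never overwrites an existing file.

```python
import shutil
import urllib.request
from pathlib import Path


def save_presigned_url(download_url, file_path, timeout_seconds=60):
    """Stream download_url to file_path without an Authorization header.

    Raises FileExistsError if file_path already exists.
    """
    destination = Path(file_path)
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(download_url, timeout=timeout_seconds) as response:
        with destination.open("xb") as handle:
            shutil.copyfileobj(response, handle)
    return str(destination)
```

Request the URL immediately before calling this helper, since the link is only valid for a few minutes.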

`delete_extraction(...)`

Permanently delete an extraction, its output files, and its uploaded source files. Use this when you need to remove data immediately rather than waiting for automatic data retention. Extractions that are currently being processed cannot be deleted.

If you created multiple extractions from the same upload session, deleting one will not affect the others — source files are only removed when no other extraction is using them.

Underlying API endpoint: DELETE /extractions/{extraction_id}.

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID to delete.

Returns

Returns the API response unchanged.

`get_credits_balance()`

Check your current credit balance and reserved credits.

Underlying API endpoint: GET /credits/balance.

This method takes no arguments.

Returns

{
  "success": true,
  "credits_balance": 150,
  "credits_reserved": 10
}

Field	Description
`credits_balance`	Your total credit balance (paid + free credits).
`credits_reserved`	Credits reserved by extractions currently being processed. Your usable balance is `credits_balance` minus `credits_reserved`.
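
The usable-balance arithmetic can be written out as follows. Both functions are hypothetical helpers; `can_cover` assumes one credit per page, which this document states for successful pages, and the API's own credit check remains authoritative.

```python
def usable_credits(balance_response):
    """credits_balance minus credits_reserved."""
    return balance_response["credits_balance"] - balance_response["credits_reserved"]


def can_cover(balance_response, page_count):
    """True if usable credits cover page_count pages at one credit per page."""
    return usable_credits(balance_response) >= page_count
```

With the example response above, the usable balance is 150 minus 10, which is 140 credits.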

`list_extractions(...)`

Retrieve a paginated list of your extractions, with optional filters. Items use a slim shape designed for browsing. For the full record (including the original prompt, options, full page-level results, prompt notes in ai_uncertainty_notes, review_needed, and the full failure message/details), call get_extraction(...) for a specific item.

list_extractions(...) returns a single page. To iterate every matching extraction without writing the cursor loop yourself, use iterate_extractions(...).

Underlying API endpoint: GET /extractions.

page = client.list_extractions(
    status="completed",
    submission_method="api",
    limit=50,
)

for item in page["extractions"]:
    print(item["extraction_id"], item["task_name"], item["created_at"])

if page["has_more"]:
    next_page = client.list_extractions(
        status="completed",
        submission_method="api",
        limit=50,
        cursor=page["next_cursor"],
    )

Parameters

All filters are optional. Call list_extractions() with no arguments to list every extraction visible to your API key.

Parameter	Required	Description
`status`	No	One of `"processing"`, `"completed"`, `"cancelled"`, or `"failed"`. Filter by current status. `cancelled` represents tasks cancelled from the web app while queued or processing.
`submission_method`	No	`"api"` or `"web_app"`. Filter by how the extraction was submitted. The `web_app` value matches the database column verbatim.
`created_after`	No	ISO 8601 string or timezone-aware `datetime.datetime`. Returns extractions created on or after this timestamp. `datetime` values are serialized via `isoformat()` before being sent. Naive datetimes are rejected — the API requires an offset.
`created_before`	No	ISO 8601 string or timezone-aware `datetime.datetime`. Returns extractions created on or before this timestamp.
`limit`	No	Integer from 1 to 100. The number of items to return per page.
`cursor`	No	Opaque pagination token returned as `next_cursor` from a previous page. Treat it as a string and pass it back unchanged.
`scope`	No	`"own"` or `"team"`. Only relevant for Team accounts. Team admins default to team-visible history; pass `"own"` to list only your own extractions. Non-admins can omit it.

For team admins, omitting scope returns team-visible history, equivalent to the dashboard's Team tasks view. Use scope="own" when you want only your own extractions. scope="team" is accepted for explicitness, but only team admins can use it.
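
Since naive datetimes are rejected, date filters are easiest to build from a timezone-aware "now". A minimal sketch, with `created_window` as a hypothetical helper that returns keyword arguments for the two created_* filters:

```python
from datetime import datetime, timedelta, timezone


def created_window(days, now=None):
    """Return created_after/created_before covering the last `days` days.

    `now` must be timezone-aware; it defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if now.tzinfo is None or now.utcoffset() is None:
        raise ValueError("now must be timezone-aware")
    return {"created_after": now - timedelta(days=days), "created_before": now}
```

The result can be unpacked into a call such as client.list_extractions(status="completed", **created_window(7)).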

Returns

{
  "success": true,
  "extractions": [
    {
      "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "submission_id": "sub_abc",
      "task_name": "March invoices",
      "status": "completed",
      "created_at": "2026-04-15T10:30:00Z",
      "submission_method": "api",
      "file_count": 3,
      "file_names_preview": ["a.pdf", "b.pdf", "c.pdf"],
      "file_names_truncated": false,
      "output_structure": "per_invoice",
      "credits_deducted": 5,
      "available_outputs": ["xlsx", "csv", "json"],
      "output_expires_at": "2026-07-14T10:30:00Z"
    }
  ],
  "has_more": true,
  "next_cursor": "eyJjIjoiMjAyNi0wNC0xNVQxMDozMDowMFoiLCJpIjoxMjM0fQ"
}

When no extractions match, the SDK returns {"success": True, "extractions": [], "has_more": False, "next_cursor": None}.

List item shape

Every list item includes these fields:

Field	Description
`extraction_id`	The extraction's UUID. Use this with `get_extraction(...)`, `get_download_url(...)`, etc.
`submission_id`	Your idempotency ID from `submit_extraction(...)`, or `None` for web-app submissions.
`task_name`	The label you gave the extraction at submission, or `None`.
`status`	`"processing"`, `"completed"`, `"cancelled"`, or `"failed"`. `cancelled` represents a task cancelled from the web app while queued or processing.
`created_at`	ISO 8601 timestamp.
`submission_method`	`"api"` or `"web_app"`.
`file_count`	Total number of files uploaded for this extraction.
`file_names_preview`	Up to the first 5 file names, in submission order. Use `get_extraction(...)` to retrieve the full list.
`file_names_truncated`	`True` when `file_count > len(file_names_preview)` (i.e., the extraction has more files than fit in the preview).
`output_structure`	`"per_invoice"`, `"per_line_item"`, `"automatic"` (only while an automatic run is still resolving), or `None` for legacy/unknown rows.

Status-specific fields:

Completed items add credits_deducted (number), available_outputs (a list of "xlsx"/"csv"/"json" indicating which formats can currently be downloaded — empty when the output has aged past output_expires_at), and output_expires_at (ISO 8601 string).
Cancelled items represent tasks cancelled from the web app while queued or processing. They add credits_deducted; no output files are available.
Processing items add progress (0–100).
Failed items add error with the slim {"code": ..., "retryable": ...} shape — for the full message and details, call get_extraction(...) on that extraction.

When a team admin lists team-visible history, every item also includes submitted_by with shape {"email": str | None} identifying the team member who created the extraction. The field is absent in own-only listings. The SDK never exposes a user ID.
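
Because available_outputs only appears on completed items and is empty once output has expired, a filter over list items has to allow for its absence. One possible sketch, with `downloadable_extraction_ids` as a hypothetical helper:

```python
def downloadable_extraction_ids(items, output_format):
    """Return IDs of completed items whose output_format can still be downloaded."""
    return [
        item["extraction_id"]
        for item in items
        if item["status"] == "completed"
        and output_format in item.get("available_outputs", [])
    ]
```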

`iterate_extractions(...)`

Auto-paginating generator over list_extractions(...). Yields one extraction summary record at a time and transparently fetches the next page when the current one is exhausted. Use this when you want to process every matching extraction without writing the cursor loop yourself.

Underlying API endpoint: GET /extractions (paged).

for extraction in client.iterate_extractions(status="completed"):
    print(extraction["extraction_id"], extraction["task_name"])

Parameters

Identical to list_extractions(...), including scope for team-admin listing behavior. A caller-provided cursor is used as the starting point — the iterator manages cursor advancement from that point on.

Behavior

Yields individual records, not pages. The iterator yields each list item directly, in the same order as list_extractions(...) would return them.
Pages are fetched lazily. The iterator does not request the next page until every item from the current page has been yielded. Breaking out of the for loop early (or otherwise terminating the iterator) prevents the next page from being fetched.
Filters are preserved across pages. The original arguments you pass are reused for every page; only cursor advances.
Mid-stream errors propagate. If a page request fails, the iterator raises on the corresponding next() call and the consumer's for loop re-raises. Items already yielded remain yielded.
Defensive guard on bad pagination state. If the API ever returns has_more: True without a usable next_cursor, the iterator raises SDK_HTTP_ERROR rather than risking an infinite loop.

Validation runs eagerly: iterate_extractions(<invalid kwargs>) raises synchronously at the call site before the generator is returned. You don't need to start iterating to discover bad input.
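
To make the paging behaviour concrete, here is a rough sketch of a cursor loop with the same lazy fetching and bad-state guard. It is not the SDK's implementation: `iterate_items` is a hypothetical function, `list_page` is any callable shaped like list_extractions, and the guard raises a plain RuntimeError instead of an SDK error. As an ordinary generator function it also does nothing until iteration starts, unlike the eager validation described above.

```python
def iterate_items(list_page, **filters):
    """Yield items one at a time, fetching the next page only when needed."""
    cursor = filters.pop("cursor", None)
    while True:
        kwargs = dict(filters)
        if cursor is not None:
            kwargs["cursor"] = cursor
        page = list_page(**kwargs)
        yield from page["extractions"]
        if not page["has_more"]:
            return
        cursor = page.get("next_cursor")
        if not cursor:
            # Guard against looping forever on inconsistent pagination state.
            raise RuntimeError("has_more is true but next_cursor is missing")
```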

Return type

A generator object (Python iterator). Usable with for ... in ..., list(...), next(...), etc.
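
The behavior described above can be modelled with a plain generator. This is an illustrative sketch, not the SDK's implementation: fetch_page is a hypothetical callable standing in for one list_extractions(...) request, and the "extractions" key for the page's item list is an assumption. Only has_more and next_cursor are field names taken from the text above.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iterate_records(
    fetch_page: Callable[..., Dict[str, Any]],
    cursor: Optional[str] = None,
    **filters: Any,
) -> Iterator[Dict[str, Any]]:
    """Model of an auto-paginating iterator (hypothetical, for illustration)."""
    # Eager validation: this runs at call time because the outer function
    # is a regular function that only returns a generator.
    if cursor is not None and not isinstance(cursor, str):
        raise ValueError("cursor must be a string")

    def generate() -> Iterator[Dict[str, Any]]:
        next_cursor = cursor
        while True:
            # Filters are reused unchanged; only the cursor advances.
            page = fetch_page(cursor=next_cursor, **filters)
            yield from page["extractions"]  # assumed key for the item list
            if not page.get("has_more"):
                return
            next_cursor = page.get("next_cursor")
            if not next_cursor:
                # Guard against looping forever on a bad pagination state.
                raise RuntimeError("has_more is true but next_cursor is missing")

    return generate()


# Two fake pages standing in for API responses.
PAGES = {
    None: {
        "extractions": [{"extraction_id": "a"}, {"extraction_id": "b"}],
        "has_more": True,
        "next_cursor": "c1",
    },
    "c1": {"extractions": [{"extraction_id": "c"}], "has_more": False},
}
calls = []


def fake_fetch(cursor=None, **filters):
    calls.append(cursor)  # record every page request
    return PAGES[cursor]


records = iterate_records(fake_fetch, status="completed")
first = next(records)   # triggers the first page request
second = next(records)  # served from the page already fetched
```

At this point calls holds a single entry: the second page is requested only when the consumer asks for a third record.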

`get_extraction(...)`

Retrieve a single extraction's full record. The record is the same regardless of state — processing, completed, cancelled, and failed extractions are all returned with success: true and the failure details (when present) on extraction["error"].

Use this when you want the full picture of an extraction: the original prompt and options, the complete file list, page-level results, prompt notes in ai_uncertainty_notes, Review Needed warnings, the full failure error (message and details, not just code and retryable), and available_outputs so you can decide what to download.

get_extraction(...) is record retrieval — distinct from check_extraction(...), which wraps the polling endpoint and is intended for "is it done yet?" checks against in-flight extractions.

Underlying API endpoint: GET /extractions/{extraction_id}/details.

result = client.get_extraction(
    extraction_id="a1b2c3d4-e5f6-7890-abcd-ef1234567890",
)
extraction = result["extraction"]

if extraction["status"] == "failed":
    print(extraction["error"]["code"], extraction["error"]["message"])
elif extraction["status"] == "completed":
    print("Available formats:", extraction["available_outputs"])

Parameters

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID to retrieve.

Returns

{
  "success": true,
  "extraction": {
    "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "submission_id": "sub_abc",
    "task_name": "March invoices",
    "status": "completed",
    "created_at": "2026-04-15T10:30:00Z",
    "submission_method": "api",
    "file_count": 3,
    "file_names": ["a.pdf", "b.pdf", "c.pdf"],
    "output_structure": "per_invoice",
    "prompt": "Extract invoice number, date, vendor, total",
    "options": { "exclude_columns": [] },
    "credits_deducted": 5,
    "available_outputs": ["xlsx", "csv", "json"],
    "output_expires_at": "2026-07-14T10:30:00Z",
    "pages": {
      "successful_count": 6,
      "failed_count": 0,
      "successful": [{ "file_name": "a.pdf", "page": 1 }],
      "failed": [],
      "failure_reasons": []
    },
    "ai_uncertainty_notes": [],
    "review_needed": {
      "count": 0,
      "items": []
    }
  }
}

Record shape

Every record includes these fields:

Field	Description
`extraction_id`, `submission_id`, `task_name`, `created_at`, `submission_method`, `output_structure`	Same semantics as the list item shape.
`file_count`	Total number of files uploaded.
`file_names`	Full list of file names, in submission order (no truncation).
`prompt`	The original `prompt` you submitted: a string, a structured dict (`{"fields": [...], "general_prompt": "..."}`), an empty string for web-app submissions with no explicit prompt, or `None` only for legacy/edge rows.
`options`	Always `{"exclude_columns": [...]}`, even when nothing was excluded.

Status-specific fields:

Completed records add credits_deducted, available_outputs, output_expires_at, full pages (with successful, failed, and failure_reasons lists), prompt notes in ai_uncertainty_notes, and result-level review_needed warnings (same shape as on extract(...) returns).
Cancelled records represent tasks cancelled from the web app while queued or processing. They add credits_deducted; no output files are available.
Processing records add progress (0-100).
Failed records add error with the full shape {"code": ..., "message": ..., "retryable": ..., "details": ...} — get_extraction(...) exposes the full failure information regardless of when the extraction failed.

For team-admin lookups, the record may also include submitted_by with shape {"email": str | None}.

The details endpoint never includes signed download URLs — call download_output(...) or get_download_url(...) to actually download files.

`get_extraction(...)` does not raise on failed extractions

A failed extraction is a valid record. get_extraction(...) returns it like any other state, with the failure details on extraction["error"]. It only raises for request-level failures: invalid input, authentication errors, EXTRACTION_NOT_FOUND, network failures, etc.
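
Because every state comes back as a normal record, callers typically branch on status. The helper below is a minimal sketch that works on plain dicts shaped like the records documented above; summarize_record is a hypothetical name, not part of the SDK, and the sample records are made up for illustration.

```python
from typing import Any, Dict


def summarize_record(extraction: Dict[str, Any]) -> str:
    """Return a one-line summary for a get_extraction(...) record."""
    status = extraction["status"]
    label = extraction.get("task_name") or extraction["extraction_id"]
    if status == "failed":
        error = extraction["error"]
        retry = "retryable" if error["retryable"] else "not retryable"
        return f"{label}: failed with {error['code']} ({retry})"
    if status == "processing":
        return f"{label}: processing, progress {extraction['progress']}"
    if status == "cancelled":
        return f"{label}: cancelled, {extraction['credits_deducted']} credits deducted"
    if status == "completed":
        pages = extraction["pages"]
        # An empty available_outputs list means the output has expired.
        formats = ", ".join(extraction["available_outputs"]) or "none"
        return (
            f"{label}: completed, {pages['successful_count']} pages ok, "
            f"{pages['failed_count']} failed, outputs: {formats}"
        )
    raise ValueError(f"unexpected status: {status}")


# Illustrative records using the documented field names.
failed_record = {
    "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "task_name": "March invoices",
    "status": "failed",
    "error": {
        "code": "INSUFFICIENT_CREDITS",
        "message": "Insufficient credits to process this extraction.",
        "retryable": False,
        "details": None,
    },
}
completed_record = {
    "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "task_name": "March invoices",
    "status": "completed",
    "available_outputs": ["xlsx", "csv", "json"],
    "pages": {"successful_count": 6, "failed_count": 0},
}

failed_summary = summarize_record(failed_record)
completed_summary = summarize_record(completed_record)
```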

Common workflow: list, then download

Browsing past extractions and downloading their output is a two-step pattern. List/details responses don't include signed download URLs. Instead, use list_extractions(...) or iterate_extractions(...) to find the extraction you want, then call get_download_url(...) (or download_output(...)) for a fresh signed URL when you're ready to download:

for extraction in client.iterate_extractions(
    status="completed",
    submission_method="api",
):
    task_name = extraction.get("task_name") or ""
    if task_name.startswith("March invoices"):
        if "xlsx" in extraction["available_outputs"]:
            client.download_output(
                extraction_id=extraction["extraction_id"],
                format="xlsx",
                file_path=f"./march/{extraction['extraction_id']}.xlsx",
            )
        else:
            # available_outputs is empty when output_expires_at has passed —
            # the underlying file has been deleted by the 90-day retention policy.
            print(f"Output no longer available for {extraction['extraction_id']}")
        break

Working with Output Files

You can control the structure and formatting of all output files in two main ways:

use output_structure to choose the top-level record shape, such as per_invoice or per_line_item
use your prompt to describe the fields, grouping, and overall structure you want, such as "one row per product" or "one row per PO"

You can also use your prompt to:

specify missing-value placeholders, such as empty string, N/A, or 0
specify formatting requirements, such as YYYY-MM-DD, digits only, or no currency symbol
specify the intended output type, such as text, number, date, datetime, boolean, currency, or percentage

These instructions may appear differently across JSON, CSV, and XLSX outputs, but they all affect how the final export is produced.

At a high level:

JSON output is string-based.
CSV is text-based.
XLSX can use native spreadsheet cell types when values can be safely interpreted.

Working with JSON Output

JSON value typing

In the JSON output file, extracted field values are returned as strings.

Standard fields are returned as strings.
If you ask for a field to contain JSON, that field is returned as a string containing valid JSON.
All values inside that JSON are also strings.

If you need numbers, booleans, or dates as typed values, parse them in your own code. If you plan to parse a value, state the formatting clearly in your prompt. For example:

"Do not include currency symbol"
"Use digits only"
"Return true or false"
"Use YYYY-MM-DD format"

Structured JSON fields

You can ask for a field to return structured JSON.

Example prompt:

"prompt": {
  "fields": [
    { "name": "Invoice Number" },
    {
      "name": "Line Items",
      "prompt": "Return a JSON array with keys description, quantity, unit_price, and amount. Use digits only for quantity. Use a full stop as the decimal separator. Do not include currency symbols in unit_price or amount. Do not use thousands separators. Use an empty string when a value is missing."
    }
  ]
}

Example JSON output value:

"Line Items": "[{\"description\":\"Widget\",\"quantity\":\"2\",\"unit_price\":\"9.99\",\"amount\":\"19.98\"}]"

In the example above, Line Items is a string whose content is valid JSON.

Use nested line-item JSON like the example above mainly for smaller or simpler cases, such as when there are only a few line items and you want a single invoice-level object.
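
Since every JSON value arrives as a string, typed values and structured fields both need a parsing step on your side. The sketch below shows one way to do that with the standard library. It assumes the prompt asked for YYYY-MM-DD dates, digits-only amounts, true/false booleans, and an empty string for missing values; the "Paid" field and the helper names are hypothetical.

```python
import json
from datetime import date
from decimal import Decimal
from typing import Optional


def to_decimal(value: str) -> Optional[Decimal]:
    """Parse a digits-only amount; an empty string means the value was missing."""
    return Decimal(value) if value else None


def to_bool(value: str) -> Optional[bool]:
    """Parse the strings "true" / "false"; anything else is treated as missing."""
    return {"true": True, "false": False}.get(value.strip().lower())


# Illustrative output row; all values are strings, as in the JSON output file.
row = {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Paid": "true",
    "Line Items": '[{"description":"Widget","quantity":"2","unit_price":"9.99","amount":"19.98"}]',
}

invoice_date = date.fromisoformat(row["Invoice Date"])
paid = to_bool(row["Paid"])

# The structured field is itself a string, so decode it before use.
line_items = json.loads(row["Line Items"])

# Values inside the decoded JSON are still strings and need their own parsing.
line_total = sum(
    (to_decimal(item["amount"]) or Decimal("0") for item in line_items),
    Decimal("0"),
)
```

Decimal is used instead of float so that monetary amounts keep their exact decimal representation.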

Recommended approach for line items

If you need detailed line item extraction, prefer output_structure: "per_line_item" instead of returning line items inside a nested JSON field.

This is strongly recommended when:

invoices may contain around 7 or more line items
line items need detailed per-field instructions
you want the most reliable line item extraction

In per_line_item, define invoice-level fields and line-item fields as separate top-level fields.

Many workflows can use the per_line_item output directly, with one row/object per line item.

If your workflow needs a nested structure such as { invoice_fields..., line_items: [...] }, include your own stable invoice identifier such as Invoice Number so you can group related line item rows back into invoices in your own system.

Do not rely on Source File alone to group rows into invoices. Source File helps you trace where a row came from, but it is not a stable invoice identifier.

Example prompt for the recommended approach:

{
  "prompt": {
    "fields": [
      { "name": "Invoice Number" },
      { "name": "Invoice Date", "prompt": "Use YYYY-MM-DD format" },
      { "name": "Vendor Name" },
      { "name": "Line Item Description" },
      { "name": "Line Item Quantity", "prompt": "Use digits only" },
      { "name": "Line Item Unit Price" },
      { "name": "Line Item Amount" }
    ],
    "general_prompt": "For amount fields don't use thousands separators, use full stops as the decimal separator and do not include currency symbols."
  },
  "output_structure": "per_line_item"
}

Example JSON output rows:

[
  {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Vendor Name": "Acme Ltd",
    "Line Item Description": "Widget A",
    "Line Item Quantity": "2",
    "Line Item Unit Price": "9.99",
    "Line Item Amount": "19.98"
  },
  {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Vendor Name": "Acme Ltd",
    "Line Item Description": "Widget B",
    "Line Item Quantity": "1",
    "Line Item Unit Price": "5.00",
    "Line Item Amount": "5.00"
  }
]

Both rows above belong to the same invoice because they share the same Invoice Number. If your workflow needs one record per line item, you can use the rows as-is. If your workflow needs a nested invoice structure, you can group rows that share the same invoice identifier to build your own { invoice_fields..., line_items: [...] } structure.
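
The grouping step described above can be sketched as follows. This is one possible approach under stated assumptions: the invoice-level field names match the example prompt, Invoice Number is unique per invoice in your data, and nest_line_items is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Fields that describe the invoice as a whole; the rest are line-item fields.
INVOICE_FIELDS = ("Invoice Number", "Invoice Date", "Vendor Name")


def nest_line_items(
    rows: List[Dict[str, str]], key: str = "Invoice Number"
) -> List[Dict[str, Any]]:
    """Group per_line_item rows into one record per invoice."""
    invoices: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        if row[key] not in invoices:
            # First row for this invoice: copy the invoice-level fields.
            invoices[row[key]] = {name: row[name] for name in INVOICE_FIELDS}
            invoices[row[key]]["line_items"] = []
        line_item = {k: v for k, v in row.items() if k not in INVOICE_FIELDS}
        invoices[row[key]]["line_items"].append(line_item)
    # Dicts preserve insertion order, so invoices keep their first-seen order.
    return list(invoices.values())


rows = [
    {
        "Invoice Number": "INV-1001",
        "Invoice Date": "2025-01-15",
        "Vendor Name": "Acme Ltd",
        "Line Item Description": "Widget A",
        "Line Item Quantity": "2",
        "Line Item Unit Price": "9.99",
        "Line Item Amount": "19.98",
    },
    {
        "Invoice Number": "INV-1001",
        "Invoice Date": "2025-01-15",
        "Vendor Name": "Acme Ltd",
        "Line Item Description": "Widget B",
        "Line Item Quantity": "1",
        "Line Item Unit Price": "5.00",
        "Line Item Amount": "5.00",
    },
]

nested = nest_line_items(rows)
```

If invoice numbers can repeat across vendors in your data, a composite key such as vendor name plus invoice number may be a safer grouping key.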

CSV Output

CSV is a plain-text export. Every value in the CSV file is written as text.

XLSX Output

XLSX uses the most appropriate spreadsheet cell type for each value by default, and follows explicit prompt instructions where provided.

File Limits

Type	Max size
PDF	150 MB
JPG / JPEG / PNG	5 MB
Total batch size	2 GB
Max files per session	6,000

These limits apply to extract(...) and upload_files(...).
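
If you want to catch oversized files before starting an upload, a local pre-check along these lines may help. It is a sketch, not SDK behavior: check_file_limits is a hypothetical helper, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the table does not say which unit definition the service uses.

```python
from pathlib import Path
from typing import Dict, List

MB = 1024 * 1024  # assumption: binary megabytes
PER_FILE_LIMITS = {".pdf": 150 * MB, ".jpg": 5 * MB, ".jpeg": 5 * MB, ".png": 5 * MB}
MAX_BATCH_BYTES = 2 * 1024 * MB  # 2 GB total batch size
MAX_FILES = 6000


def check_file_limits(sizes: Dict[str, int]) -> List[str]:
    """Return a list of problems for a mapping of file name to size in bytes."""
    problems = []
    if len(sizes) > MAX_FILES:
        problems.append(f"{len(sizes)} files exceeds the limit of {MAX_FILES}")
    total = 0
    for name, size in sizes.items():
        limit = PER_FILE_LIMITS.get(Path(name).suffix.lower())
        if limit is None:
            problems.append(f"{name}: unsupported file type")
            continue
        if size > limit:
            problems.append(f"{name}: larger than {limit // MB} MB")
        total += size
    if total > MAX_BATCH_BYTES:
        problems.append("total batch size exceeds 2 GB")
    return problems


# Sizes would normally come from the filesystem, for example:
# sizes = {p.name: p.stat().st_size for p in Path("./invoices").iterdir() if p.is_file()}
sizes = {"a.pdf": 10 * MB, "scan.png": 6 * MB, "notes.txt": 100}
problems = check_file_limits(sizes)
```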

Polling

Several methods accept a polling option to control how the SDK polls for extraction status.

Field	Default	Description
`interval_ms`	`10000`	Milliseconds between polls. Minimum `5000`.
`timeout_ms`	`None`	Maximum time to wait in milliseconds. `None` means no timeout.

Used by: extract(...), wait_for_extraction_to_finish(...).
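
As a small illustration, the helper below builds a polling dict from values in seconds and enforces the documented minimum interval locally. polling_options is a hypothetical convenience function, not part of the SDK; only the interval_ms and timeout_ms keys come from the table above.

```python
from typing import Dict, Optional

MIN_INTERVAL_MS = 5000  # documented minimum for interval_ms


def polling_options(
    interval_seconds: float = 10,
    timeout_seconds: Optional[float] = None,
) -> Dict[str, Optional[int]]:
    """Build a polling dict, converting seconds to milliseconds."""
    interval_ms = int(interval_seconds * 1000)
    if interval_ms < MIN_INTERVAL_MS:
        raise ValueError(f"interval must be at least {MIN_INTERVAL_MS} ms")
    timeout_ms = None if timeout_seconds is None else int(timeout_seconds * 1000)
    return {"interval_ms": interval_ms, "timeout_ms": timeout_ms}


# Poll every 15 seconds and give up after 10 minutes.
polling = polling_options(interval_seconds=15, timeout_seconds=600)
```

The resulting dict would then be passed as polling=polling to extract(...) or wait_for_extraction_to_finish(...).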

`on_update`

Optional callable that receives lifecycle updates across all stages. Use this when you want to handle progress reporting yourself — for example to update a UI, feed a progress bar, or route updates to your own logging instead of the built-in console_output.

def on_update(payload):
    # payload is a dict with: stage, level, message, progress, extraction_id
    print(payload["message"])

Field	Description
`stage`	Current lifecycle stage: `"upload"`, `"submission"`, `"waiting"`, `"download"`, or `"completion"`.
`level`	`"info"`, `"warn"`, or `"error"`.
`message`	Human-readable status message.
`progress`	Numeric progress when available, otherwise `None`.
`extraction_id`	The extraction ID once available, otherwise `None`.

Used by: extract(...), upload_files(...), wait_for_extraction_to_finish(...).
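
For example, a callback that routes updates to the standard logging module could look like the sketch below. The logger name is an arbitrary choice, and the mapping from level strings to logging levels is an assumption about how you might want to treat them.

```python
import logging

logger = logging.getLogger("invoice_extraction")  # arbitrary logger name

# Map the documented level strings onto standard logging levels.
LEVELS = {"info": logging.INFO, "warn": logging.WARNING, "error": logging.ERROR}


def on_update(payload):
    """Forward an SDK lifecycle update to the logging module."""
    level = LEVELS.get(payload["level"], logging.INFO)
    progress = payload.get("progress")
    # progress is None when the SDK has no numeric progress to report.
    suffix = "" if progress is None else f" progress={progress}"
    logger.log(level, "[%s] %s%s", payload["stage"], payload["message"], suffix)
```

Pass the function as on_update=on_update to any of the methods listed above, and leave console_output off to avoid duplicate output.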

Output expiry

There are two unrelated time limits on output files. Don't confuse them:

Limit	What expires	Duration	What to do
Signed download URL	The presigned URL itself	5 minutes	Request a fresh URL via `download_output(...)` or `get_download_url(...)`
Output file retention	The generated file in storage	90 days from `created_at`	Re-run the extraction; the original output is gone

output_expires_at (on completed responses) tells you when the underlying file will be deleted. After that timestamp:

output["xlsx_url"], output["csv_url"], output["json_url"] on polling/extract responses are None.
available_outputs on list/details responses is an empty list.
get_download_url(...) and download_output(...) raise OUTPUT_EXPIRED.

OUTPUT_NOT_AVAILABLE is a different error: it means the extraction either hasn't completed, or the requested format was never generated for it. OUTPUT_EXPIRED means the output existed but has aged out of retention.
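
To avoid a request that is bound to fail with OUTPUT_EXPIRED, you can compare output_expires_at with the current time before downloading. The sketch below uses only the standard library; the helper names are hypothetical. Note that datetime.fromisoformat() on Python 3.9 and 3.10 does not accept a trailing "Z", so it is rewritten as an explicit UTC offset first.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-07-14T10:30:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def output_is_expired(output_expires_at: str, now: Optional[datetime] = None) -> bool:
    """Return True once the retention deadline has passed."""
    now = now or datetime.now(timezone.utc)
    return now >= parse_timestamp(output_expires_at)


# The example record above: created on 15 April, retained for 90 days.
created_at = parse_timestamp("2026-04-15T10:30:00Z")
expires_at = created_at + timedelta(days=90)
```

The server-provided output_expires_at remains the authoritative value; the 90-day arithmetic is shown only to illustrate how the two timestamps relate.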

Conventions

Method names are snake_case: extract(...), upload_files(...), submit_extraction(...).
All parameter names are snake_case: api_key, folder_path, output_structure, task_name, upload_session_id, file_ids, console_output, on_update, file_path.
Response fields are snake_case, matching the API exactly. The SDK returns the same JSON shapes as the raw API — if you have the API docs, those response examples are valid for the SDK too.
The files parameter accepts local file paths as strings only. File objects, byte streams, and in-memory buffers are not supported in v1.

Rate Limits

All API endpoints are rate limited per API key. The SDK automatically retries rate-limited requests, but you should be aware of the limits if you are making many calls. Sustained overuse will result in a RATE_LIMITED error.

Endpoints	Limit
Upload endpoints (create session, get part URLs, complete upload)	600 requests per minute
Submit extraction	30 requests per minute
Poll extraction status	120 requests per minute
List extractions	60 requests per minute
Get extraction details	60 requests per minute
Download output	30 requests per minute
Delete extraction	30 requests per minute
Check credit balance	60 requests per minute
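
As a rough planning aid, the limits above can be turned into a budget for concurrent polling. The calculation below is a back-of-the-envelope sketch: it assumes each waiting call polls once per interval and that nothing else uses the same endpoint with the same API key. max_concurrent_waits is a hypothetical helper.

```python
def max_concurrent_waits(limit_per_minute: int, interval_ms: int) -> int:
    """Estimate how many extractions can be polled at once within a rate limit."""
    # Each extraction is polled 60_000 / interval_ms times per minute, so the
    # budget is limit / (60_000 / interval_ms), kept in integer arithmetic.
    return limit_per_minute * interval_ms // 60_000


default_budget = max_concurrent_waits(120, 10_000)  # default 10 s interval
fastest_budget = max_concurrent_waits(120, 5_000)   # minimum 5 s interval
```

With the default interval this gives 20 concurrent waits against the polling limit, and 10 at the minimum interval.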

Errors

SDK methods raise exceptions on failure:

On failure, a method raises an SdkError (for SDK-level and validation errors) or ApiResponseError (for API response errors).
The structured error body is available on error.body.
error.body uses the same JSON error shape as the API.

Note: if you let an exception go uncaught, Python will usually only show the top-level error message in the traceback. To read the full structured SDK/API error payload, catch the exception and inspect error.body.

Error body shape:

{
  "success": false,
  "error": {
    "code": "SOME_ERROR_CODE",
    "message": "Human-readable message.",
    "retryable": false,
    "details": null
  }
}

Read the error like this:

import os

from invoicedataextraction import InvoiceDataExtraction
from invoicedataextraction.errors import SdkError, ApiResponseError

client = InvoiceDataExtraction(
    api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY"),
)

try:
    client.check_extraction(extraction_id="...")
except (SdkError, ApiResponseError) as error:
    # error.body follows the error body shape shown above.
    print(error.body["error"]["code"])
    print(error.body["error"]["message"])
    print(error.body["error"]["retryable"])
    print(error.body["error"]["details"])

Every error includes a code (machine-readable), message (human-readable), and retryable (whether retrying may succeed). The message is descriptive enough to act on directly in most cases. details provides additional context when available — for example, INVALID_INPUT errors include a details.issues list with the specific validation problems.

INVALID_INPUT can come from either the SDK (caught before the request is sent) or the API. Handle it the same way in both cases.

Authentication errors (UNAUTHENTICATED, API_KEY_EXPIRED, API_KEY_REVOKED) indicate a problem with your API key — generate a new one from your dashboard.

The SDK automatically retries RATE_LIMITED and transient INTERNAL_ERROR responses, but will surface them if retries are exhausted.

Method-specific errors like EXTRACTION_NOT_FOUND, OUTPUT_NOT_AVAILABLE, OUTPUT_EXPIRED, EXTRACTION_IN_PROGRESS, and INSUFFICIENT_CREDITS are documented in the relevant method sections above. For full endpoint-level error details, see the API docs.

Extraction task failure codes

When an extraction task itself fails, the failure code comes from the API-owned extraction failure taxonomy documented in the REST API docs. In the SDK, that code appears on result["error"]["code"] from extract(...), check_extraction(...), and wait_for_extraction_to_finish(...), or on extraction["error"]["code"] from get_extraction(...).

Branch on error["code"], error["retryable"], and any returned error["details"] directly. The SDK does not provide subclasses for individual extraction failure codes.

Task failure vs SDK/API failure:

After an extraction task has been accepted, the task itself can still finish with status: "failed".
That is a task outcome, not an SDK error.
check_extraction(...), wait_for_extraction_to_finish(...), and extract(...) return the polling response body for task states such as processing, completed, cancelled, and failed.
When a task ends with status: "failed", the failure details are in the returned response body, not on error.body.
error.body is only used when the SDK method/request itself fails — validation errors, authentication errors, network failures, timeouts, or other operational failures.
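
The two failure channels can be handled in one place, as in the sketch below. So that the example runs without the SDK installed, StandInError is a stand-in for SdkError and ApiResponseError (it only mimics the documented .body attribute), and run_and_classify is a hypothetical helper. In real code you would pass error_types=(SdkError, ApiResponseError).

```python
from typing import Any, Callable, Dict, Tuple


class StandInError(Exception):
    """Stand-in for the SDK exceptions: carries a structured body."""

    def __init__(self, body: Dict[str, Any]) -> None:
        super().__init__(body["error"]["message"])
        self.body = body


def run_and_classify(
    call: Callable[[], Dict[str, Any]],
    error_types: Tuple[type, ...] = (StandInError,),
) -> Tuple[str, Dict[str, Any]]:
    """Run an SDK call and report which kind of outcome occurred."""
    try:
        result = call()
    except error_types as error:
        # The method or request itself failed: details are on error.body.
        return "request_failed", error.body["error"]
    if result["status"] == "failed":
        # The task failed: details are in the returned response body.
        return "task_failed", result["error"]
    return result["status"], result


def failing_request() -> Dict[str, Any]:
    raise StandInError({
        "success": False,
        "error": {"code": "UNAUTHENTICATED", "message": "Invalid API key.",
                  "retryable": False, "details": None},
    })


def failed_task() -> Dict[str, Any]:
    return {
        "success": False,
        "status": "failed",
        "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "error": {"code": "INSUFFICIENT_CREDITS",
                  "message": "Insufficient credits to process this extraction.",
                  "retryable": False, "details": None},
    }


request_outcome = run_and_classify(failing_request)
task_outcome = run_and_classify(failed_task)
```

Both branches return the same error shape (code, message, retryable, details), so downstream handling can branch on the code and retryable fields regardless of the channel.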

SDK-specific error codes:

Code	When the SDK uses it
`SDK_FILESYSTEM_ERROR`	A local filesystem operation failed, such as reading an input file, creating a directory, or writing a downloaded file.
`SDK_NETWORK_ERROR`	A network request failed before the SDK received a valid HTTP response.
`SDK_HTTP_ERROR`	The SDK received an unexpected HTTP response shape, such as a non-JSON response or another response that does not match the documented contract.
`SDK_TIMEOUT_ERROR`	`wait_for_extraction_to_finish(...)` timed out before the extraction finished.
`SDK_DOWNLOAD_ERROR`	An SDK-managed download step failed.
`SDK_UPLOAD_ERROR`	An SDK-managed upload orchestration step failed.

Method to API Endpoint Mapping

SDK Method	Underlying API
`extract(...)`	`upload_files(...)` → `submit_extraction(...)` → `wait_for_extraction_to_finish(...)` → `download_output(...)`
`upload_files(...)`	`POST /uploads/sessions` → `POST /uploads/sessions/{id}/parts` → `POST /uploads/sessions/{id}/complete`
`submit_extraction(...)`	`POST /extractions`
`wait_for_extraction_to_finish(...)`	`GET /extractions/{extraction_id}` (polled)
`download_output(...)`	`GET /extractions/{extraction_id}/output?format={format}` → presigned URL download
`check_extraction(...)`	`GET /extractions/{extraction_id}`
`get_download_url(...)`	`GET /extractions/{extraction_id}/output?format={format}`
`delete_extraction(...)`	`DELETE /extractions/{extraction_id}`
`get_credits_balance()`	`GET /credits/balance`
`list_extractions(...)`	`GET /extractions`
`iterate_extractions(...)`	`GET /extractions` (auto-paginated)
`get_extraction(...)`	`GET /extractions/{extraction_id}/details`

`extract(...)`

Underlying API workflow: upload session → submit extraction → poll for results → download output. See File limits for size and count constraints.

result = client.extract(
    folder_path="./invoices",
    prompt="Extract invoice number, date, vendor name, and total amount",
    output_structure="per_invoice",
    download={
        "formats": ["xlsx", "json"],
        "output_path": "./output",
    },
    console_output=True,  # remove to disable console logging
)

Parameters

Parameter	Required	Description
`folder_path`	One of `folder_path` or `files`	Path to a local folder. The SDK uploads every supported file in the folder (`.pdf`, `.jpg`, `.jpeg`, `.png`). Not recursive.
`files`	One of `folder_path` or `files`	List of local file paths to upload. Supported types: `.pdf`, `.jpg`, `.jpeg`, `.png`.
`prompt`	Yes	Extraction instructions. String or dict — see Prompt below.
`output_structure`	Yes	Controls how the extracted data is structured — see Output structure below.
`task_name`	No	Your label for this extraction (3–40 characters). Appears in the web dashboard. If omitted, the SDK generates one as `extraction_YYYYMMDD_HHMMSS`.
`exclude_columns`	No	List of system-generated columns to exclude from output. By default, output files include a "Source File" column indicating which uploaded file/page each row was extracted from, and a "Review Needed" column marking rows that need human verification. If your workflow requires an exact output structure, you can exclude either column. Valid values: `"source_file"`, `"review_needed"`. Excluding `"review_needed"` removes only the export column; Review Needed warnings can still be generated and returned in the completed response.
`download`	No	Download options — see Download below. If omitted, no files are downloaded.
`polling`	No	Polling options — see Polling below.
`console_output`	No	Boolean. When `True`, the SDK logs progress to the console during upload, polling, and download. Off by default.
`on_update`	No	Callable for lifecycle updates — see on_update below.

Output structure

Controls how the extracted data is structured:

Value	Meaning
`automatic`	The AI decides based on your prompt and documents.
`per_invoice`	Each invoice becomes a single row (spreadsheet/CSV) or object (JSON).
`per_line_item`	Each individual product/service listed within an invoice becomes its own row (spreadsheet/CSV) or object (JSON).

Prompt

The prompt tells the AI what data to extract. It can be a string or a dict.

String — describe what you want in natural language (max 2,500 characters):

prompt="Extract invoice number, date, vendor name, and total amount"

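With a string, the AI chooses output field names based on your instructions.

Dict: name each output field explicitly, with optional per-field instructions and an optional general_prompt: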

prompt={
    "fields": [
        {"name": "Invoice Number"},
        {"name": "Invoice Date", "prompt": "The date the invoice was issued, NOT the due date"},
        {"name": "Vendor Name"},
        {"name": "Total Amount", "prompt": "No currency symbol, 2 decimal places"},
    ],
    "general_prompt": "Extract one record per invoice or credit note. Ignore email cover letters. Dates should be in YYYY-MM-DD format.",
}

Each item in fields:

Field	Type	Required	Description
`name`	string	Yes	The name for this data point in the output (2–50 characters). Prefer clear, descriptive names (e.g., "Invoice Number", not "Field A").
`prompt`	string	No	Specific instructions for extracting this data point (3–600 characters). Use this to clarify ambiguities or instruct special handling.

Field	Type	Required	Description
`general_prompt`	string	No	Instructions that apply to the full task and across all fields (max 1,500 characters). Use this to provide special handling instructions, specify output formatting, or describe the extraction goal.

fields must be a non-empty list.
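
The length limits above can be checked locally before submitting, as in the sketch below. This is a convenience pre-check under stated assumptions, not a replacement for the SDK's or API's own validation: prompt_problems is a hypothetical helper, and it counts characters with len(), which may differ from how the service counts them.

```python
from typing import Any, Dict, List, Union


def prompt_problems(prompt: Union[str, Dict[str, Any]]) -> List[str]:
    """Return a list of limit violations for a string or dict prompt."""
    if isinstance(prompt, str):
        return ["prompt is longer than 2,500 characters"] if len(prompt) > 2500 else []

    fields = prompt.get("fields")
    if not isinstance(fields, list) or not fields:
        return ["fields must be a non-empty list"]

    problems = []
    for index, field in enumerate(fields):
        name = field.get("name", "")
        if not 2 <= len(name) <= 50:
            problems.append(f"fields[{index}].name must be 2-50 characters")
        field_prompt = field.get("prompt")
        if field_prompt is not None and not 3 <= len(field_prompt) <= 600:
            problems.append(f"fields[{index}].prompt must be 3-600 characters")

    general = prompt.get("general_prompt")
    if general is not None and len(general) > 1500:
        problems.append("general_prompt is longer than 1,500 characters")
    return problems


prompt = {
    "fields": [
        {"name": "Invoice Number"},
        {"name": "Total Amount", "prompt": "No currency symbol, 2 decimal places"},
    ],
    "general_prompt": "Dates should be in YYYY-MM-DD format.",
}
problems = prompt_problems(prompt)
```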

For guidance on writing effective prompts, see the Extraction Guide.

Download

When download is provided, the SDK downloads output files after a successful extraction.

download={
    "formats": ["xlsx", "csv", "json"],
    "output_path": "./output",
}

Field	Required	Description
`formats`	Yes	List of output formats to download. One or more of `"xlsx"`, `"csv"`, `"json"`.
`output_path`	Yes	Destination folder for downloaded files. Created automatically if it doesn't exist.

Downloaded files are named {task_name}_{timestamp}.{format}.

Auto-download does not overwrite existing files. If a generated file path already exists, the SDK skips that file and surfaces a warning.

Returns

extract(...) returns the terminal polling response from the API unchanged — for completed, failed, and cancelled extractions.

Completed:

{
  "success": true,
  "status": "completed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "credits_deducted": 25,
  "output_structure": "per_invoice",
  "output_expires_at": "2026-07-14T10:30:00Z",
  "pages": {
    "successful_count": 10,
    "failed_count": 2,
    "successful": [
      { "file_name": "invoice-1.pdf", "page": 1 }
    ],
    "failed": [
      { "file_name": "damaged.pdf", "page": 1 }
    ],
    "failure_reasons": [
      {
        "code": "PROCESSING_FILE_SIZE_LIMIT_EXCEEDED",
        "message": "The upload was accepted, but during processing part of the PDF became too large for our file-processing limit. This can happen when a compressed PDF is processed internally. Split the PDF into smaller page chunks and resubmit.",
        "affected_pages": [
          { "file_name": "damaged.pdf", "pages": [1] }
        ]
      }
    ]
  },
  "ai_uncertainty_notes": [],
  "review_needed": {
    "count": 1,
    "items": [
      {
        "message": "Check whether the extracted total should include the handwritten adjustment near the bottom of the document.",
        "affected_fields": ["Total Amount"],
        "output_row_numbers": [4],
        "source_references": ["invoice-1.pdf (Page 2)"]
      }
    ]
  },
  "output": {
    "xlsx_url": "https://...",
    "csv_url": "https://...",
    "json_url": "https://..."
  }
}

Field	Description
`credits_deducted`	Credits charged for this extraction (one credit per successful page).
`output_structure`	The output structure used: `"per_invoice"` or `"per_line_item"`. If you submitted `"automatic"`, this tells you what the AI chose.
`output_expires_at`	ISO 8601 timestamp marking when the generated output files will be deleted under the 90-day retention policy. After this time, `output.*_url` fields are `None` and `download_output(...)` raises `OUTPUT_EXPIRED`. See Output expiry.
`pages.successful_count`	Number of pages successfully processed.
`pages.failed_count`	Number of pages that failed processing.
`pages.successful`	List of successfully processed pages. Each item has `file_name` (the uploaded file name) and `page` (the page number within that file).
`pages.failed`	List of pages that failed processing. Same shape as `successful`.
`pages.failure_reasons`	Page-failure reason metadata when available. Empty list if none. Each item has `code`, user-facing `message`, and `affected_pages` grouped by uploaded `file_name` with source-file page numbers. The current public `code` value is `"PROCESSING_FILE_SIZE_LIMIT_EXCEEDED"`.
`ai_uncertainty_notes`	Prompt notes: areas where your prompt left room for interpretation and the AI made an assumption about how to apply it to the documents. Empty list if none. Each note has a `topic`, a `description` of what was assumed, and a `suggested_prompt_additions` list of prompt additions you can use to remove the ambiguity in future extractions. Each suggestion has a `purpose` (why you'd add it) and `instructions` (prompt text you can add).
`review_needed`	Result-level warnings for records that need human verification before you rely on the output. Always present on completed responses as `{"count": ..., "items": [...]}`. Check `result["review_needed"]["count"]`; if greater than `0`, route the listed rows for manual verification. Each item has `message`, `affected_fields`, `output_row_numbers`, and `source_references`. `affected_fields` is populated only for field-specific concerns. `output_row_numbers` contains one or more 1-based extracted data row numbers and does not include the Excel/CSV header row.
`output`	Presigned download URLs for each format (`xlsx_url`, `csv_url`, `json_url`). `None` if not available — including when the output has aged past `output_expires_at`. URLs expire after 5 minutes — use `download_output(...)` or `get_download_url(...)` for a fresh URL while output is still retained.

We strongly recommend checking result["review_needed"]["count"] before relying on extracted data. If it is greater than 0, route the listed rows for manual verification in your workflow.
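
One way to act on these warnings is to collect the flagged row numbers and the messages attached to each. The sketch below works on a plain dict shaped like the completed response above; rows_needing_review is a hypothetical helper. Mapping a row number to an entry in the JSON output as rows[row_number - 1] assumes the output file is a list of row objects in the same order, which you should confirm against your own output.

```python
from typing import Any, Dict, List


def rows_needing_review(result: Dict[str, Any]) -> Dict[int, List[str]]:
    """Map each flagged 1-based output row number to its review messages."""
    flagged: Dict[int, List[str]] = {}
    for item in result["review_needed"]["items"]:
        for row_number in item["output_row_numbers"]:
            flagged.setdefault(row_number, []).append(item["message"])
    return flagged


# Trimmed version of the completed response shown above.
result = {
    "status": "completed",
    "review_needed": {
        "count": 1,
        "items": [
            {
                "message": "Check whether the extracted total should include the handwritten adjustment near the bottom of the document.",
                "affected_fields": ["Total Amount"],
                "output_row_numbers": [4],
                "source_references": ["invoice-1.pdf (Page 2)"],
            }
        ],
    },
}

flagged = rows_needing_review(result)
needs_manual_check = result["review_needed"]["count"] > 0
```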

Failed:

When the extraction task itself fails, extract(...) returns the failed polling response — it does not raise. The failure details are in the returned response body, not on error.body.

{
  "success": false,
  "status": "failed",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "error": {
    "code": "INSUFFICIENT_CREDITS",
    "message": "Insufficient credits to process this extraction.",
    "retryable": false,
    "details": { "credits_required": 25, "credits_balance": 15, "credits_reserved": 10 }
  }
}

See the API docs for the full list of task failure codes.

Cancelled:

If an extraction is cancelled from View results in the web app while queued or processing, extract(...) returns the cancelled polling response unchanged. No output files are available.

{
  "success": true,
  "status": "cancelled",
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "credits_deducted": 4
}

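When `extract(...)` raises

extract(...) raises SdkError or ApiResponseError only when the method or one of its requests fails, for example on invalid input, authentication errors, network failures, or timeouts. A task that ends as failed or cancelled is returned as a response, as described above, and does not raise.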

Staged Workflow

import json
import os
import sys

from invoicedataextraction import InvoiceDataExtraction
from invoicedataextraction.errors import SdkError, ApiResponseError

try:
    client = InvoiceDataExtraction(
        api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY"),
    )

    upload = client.upload_files(
        files=["./invoice1.pdf", "./invoice2.pdf"],
        console_output=True,
    )

    submitted = client.submit_extraction(
        upload_session_id=upload["upload_session_id"],
        file_ids=upload["file_ids"],
        prompt="Extract invoice number and total",
        output_structure="per_invoice",
    )

    result = client.wait_for_extraction_to_finish(
        extraction_id=submitted["extraction_id"],
        console_output=True,
    )

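    # Failed and cancelled tasks are returned, not raised, and have no pages or output.
    if result["status"] != "completed":
        print(json.dumps(result, indent=2), file=sys.stderr)
        raise SystemExit(1)

    # Verify all pages were processed
    if result["pages"]["failed_count"] > 0:
        print("Some pages failed processing:", result["pages"]["failed"])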

    client.download_output(
        extraction_id=submitted["extraction_id"],
        format="xlsx",
        file_path="./output/invoices.xlsx",
    )
except (SdkError, ApiResponseError) as error:
    print(json.dumps(error.body, indent=2), file=sys.stderr)
    raise SystemExit(1)

`upload_files(...)`

Underlying API workflow: create upload session → upload file parts → complete each file. See File limits for size and count constraints.

Parameter	Required	Description
`folder_path`	One of `folder_path` or `files`	Path to a local folder. The SDK uploads every supported file in the folder (`.pdf`, `.jpg`, `.jpeg`, `.png`). Not recursive.
`files`	One of `folder_path` or `files`	List of local file paths to upload. Supported types: `.pdf`, `.jpg`, `.jpeg`, `.png`.
`upload_session_id`	No	Your own session ID. If omitted, the SDK generates one. If an upload fails partway through, that session cannot be resumed — start a new upload with a fresh session ID.
`console_output`	No	Boolean. When `True`, the SDK logs upload progress to the console.
`on_update`	No	Callable for upload lifecycle updates — see on_update.

Returns

{
  "upload_session_id": "session_a1b2c3d4-...",
  "file_ids": ["file_abc123", "file_def456"]
}

Pass upload_session_id and file_ids to submit_extraction(...) to start an extraction.

The API checks your credit balance when the upload session is created. If you don't have enough credits, upload_files(...) raises INSUFFICIENT_CREDITS before any files are uploaded.

`submit_extraction(...)`

Submit an extraction task for files that have already been uploaded. The method returns immediately — it does not wait for the extraction to finish.

Underlying API endpoint: POST /extractions.

Parameter	Required	Description
`upload_session_id`	Yes	The upload session ID returned by `upload_files(...)`.
`file_ids`	Yes	List of file IDs returned by `upload_files(...)`.
`prompt`	Yes	Extraction instructions. String or dict — see Prompt.
`output_structure`	Yes	Controls how the extracted data is structured — see Output structure.
`task_name`	No	Your label for this extraction (3–40 characters). Appears in the web dashboard. If omitted, the SDK generates one as `extraction_YYYYMMDD_HHMMSS`.
`exclude_columns`	No	List of system-generated columns to exclude from output. By default, output files include a "Source File" column indicating which uploaded file/page each row was extracted from, and a "Review Needed" column marking rows that need human verification. If your workflow requires an exact output structure, you can exclude either column. Valid values: `"source_file"`, `"review_needed"`. Excluding `"review_needed"` removes only the export column; Review Needed warnings can still be generated and returned in the completed response.
`submission_id`	No	Your own idempotency ID for this submission. If omitted, the SDK generates one. If a request fails or times out, retry with the same `submission_id` to safely retrieve the existing task instead of creating a duplicate.
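
The idempotency behavior of `submission_id` lends itself to a small retry wrapper. The following is a minimal sketch, not part of the SDK: `submit_with_retry` is a hypothetical helper, `submit` is expected to be `client.submit_extraction`, and the generated ID format is an assumption (this document does not specify what a `submission_id` must look like). It retries only when the error body reports `retryable: true`, and it reuses the same ID on every attempt.

```python
import time
import uuid


def submit_with_retry(submit, *, attempts=3, delay_s=2.0, **kwargs):
    """Call submit(...) up to `attempts` times, reusing one submission_id.

    Hypothetical helper: pass client.submit_extraction as `submit`.
    """
    # Assumption: any unique string is acceptable as a submission_id.
    submission_id = kwargs.pop("submission_id", None) or f"sub_{uuid.uuid4().hex}"
    for attempt in range(1, attempts + 1):
        try:
            return submit(submission_id=submission_id, **kwargs)
        except Exception as error:  # narrow to (SdkError, ApiResponseError) in real code
            body = getattr(error, "body", None) or {}
            retryable = (body.get("error") or {}).get("retryable", False)
            if not retryable or attempt == attempts:
                raise
            time.sleep(delay_s)
```

Because the ID never changes between attempts, a retry after a timeout should retrieve the existing task rather than create a duplicate, as described in the table above.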

Returns

{
  "success": true,
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "submission_state": "received"
}

`wait_for_extraction_to_finish(...)`

Poll an extraction until it reaches a terminal state (completed, failed, or cancelled). Use this after submit_extraction(...) when you want the SDK to handle the polling loop for you.

Underlying API endpoint: GET /extractions/{extraction_id} (polled repeatedly).

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID returned by `submit_extraction(...)`.
`polling`	No	Polling options — see Polling.
`console_output`	No	Boolean. When `True`, the SDK logs polling progress to the console.
`on_update`	No	Callable for waiting lifecycle updates — see on_update.

Returns

Returns the terminal polling response from the API unchanged — the same shape documented for extract(...) returns.

`download_output(...)`

Download a single output file for a completed extraction to disk. Use this for manual downloads after using the staged workflow, or to retry a failed auto-download from extract(...).

Underlying API workflow: request a fresh presigned download URL → download the file → write to disk.

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID whose output you want to download.
`format`	Yes	A single output format: `"xlsx"`, `"csv"`, or `"json"`.
`file_path`	Yes	Full destination file path on disk. The file extension must match the requested `format`. The parent directory is created automatically if it doesn't exist.

download_output(...) does not overwrite existing files. If file_path already exists, the SDK raises SDK_FILESYSTEM_ERROR with guidance to choose a new path or remove the existing file.

Returns

{
  "success": true,
  "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "format": "xlsx",
  "file_path": "./output/invoices.xlsx"
}

`check_extraction(...)`

Underlying API endpoint: GET /extractions/{extraction_id}.

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID to check.

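Returns

Returns the current polling response body from the API, whatever state the task is in: processing, completed, cancelled, or failed.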

`get_download_url(...)`

Request a fresh presigned download URL for an extraction's output. Use this when you want to handle the download yourself rather than using download_output(...).

Underlying API endpoint: GET /extractions/{extraction_id}/output?format={format}.

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID whose output you want to download.
`format`	Yes	A single output format: `"xlsx"`, `"csv"`, or `"json"`.

Returns

{
  "download_url": "https://storage.example.com/...?X-Amz-Signature=...",
  "format": "xlsx",
  "expires_in_seconds": 300
}

The URL is a temporary, pre-authenticated link. Make a plain GET request to it — no Authorization header needed. It expires after 5 minutes.
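
If you handle the download yourself, a plain GET with the standard library is enough. The sketch below is one possible approach, not the SDK's implementation: `save_presigned_url` is a hypothetical helper, and it mirrors the no-overwrite behavior of download_output(...) by opening the destination in exclusive-create mode.

```python
import shutil
import urllib.request
from pathlib import Path


def save_presigned_url(download_url, file_path, timeout_s=60):
    """Stream a presigned URL to disk without an Authorization header."""
    destination = Path(file_path)
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(download_url, timeout=timeout_s) as response:
        # "xb" raises FileExistsError instead of overwriting an existing file.
        with open(destination, "xb") as out:
            shutil.copyfileobj(response, out)
    return destination


# Usage (requires a configured client):
# url_info = client.get_download_url(extraction_id="...", format="xlsx")
# save_presigned_url(url_info["download_url"], "./output/invoices.xlsx")
```

Request the URL immediately before downloading, since it is only valid for a few minutes.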

`delete_extraction(...)`

If you created multiple extractions from the same upload session, deleting one will not affect the others — source files are only removed when no other extraction is using them.

Underlying API endpoint: DELETE /extractions/{extraction_id}.

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID to delete.

Returns

Returns the API response unchanged.

`get_credits_balance()`

Check your current credit balance and reserved credits.

Underlying API endpoint: GET /credits/balance.

This method takes no arguments.

Returns

{
  "success": true,
  "credits_balance": 150,
  "credits_reserved": 10
}

Field	Description
`credits_balance`	Your total credit balance (paid + free credits).
`credits_reserved`	Credits reserved by extractions currently being processed. Your usable balance is `credits_balance` minus `credits_reserved`.
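
The usable balance described above is a simple subtraction. A minimal sketch, with `usable_credits` as a hypothetical helper name:

```python
def usable_credits(balance_response):
    """Credits available for new work: total balance minus reserved credits."""
    return balance_response["credits_balance"] - balance_response["credits_reserved"]


# Usage (requires a configured client):
# print(usable_credits(client.get_credits_balance()))
```

With the example response above, this gives 150 - 10 = 140.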

`list_extractions(...)`

list_extractions(...) returns a single page. To iterate every matching extraction without writing the cursor loop yourself, use iterate_extractions(...).

Underlying API endpoint: GET /extractions.

page = client.list_extractions(
    status="completed",
    submission_method="api",
    limit=50,
)

for item in page["extractions"]:
    print(item["extraction_id"], item["task_name"], item["created_at"])

if page["has_more"]:
    next_page = client.list_extractions(
        status="completed",
        submission_method="api",
        limit=50,
        cursor=page["next_cursor"],
    )

Parameters

All filters are optional. Call list_extractions() with no arguments to list every extraction visible to your API key.

Parameter	Required	Description
`status`	No	One of `"processing"`, `"completed"`, `"cancelled"`, or `"failed"`. Filter by current status. `cancelled` represents tasks cancelled from the web app while queued or processing.
`submission_method`	No	`"api"` or `"web_app"`. Filter by how the extraction was submitted. The `web_app` value matches the database column verbatim.
`created_after`	No	ISO 8601 string or timezone-aware `datetime.datetime`. Returns extractions created on or after this timestamp. `datetime` values are serialized via `isoformat()` before being sent. Naive datetimes are rejected — the API requires an offset.
`created_before`	No	ISO 8601 string or timezone-aware `datetime.datetime`. Returns extractions created on or before this timestamp.
`limit`	No	Integer from 1 to 100. The number of items to return per page.
`cursor`	No	Opaque pagination token returned as `next_cursor` from a previous page. Treat it as a string and pass it back unchanged.
`scope`	No	`"own"` or `"team"`. Only relevant for Team accounts. Team admins default to team-visible history; pass `"own"` to list only your own extractions. Non-admins can omit it.
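
Since naive datetimes are rejected, it helps to build filter timestamps in UTC from the start. The sketch below shows one way to do that; `created_since` is a hypothetical helper, not an SDK function.

```python
from datetime import datetime, timedelta, timezone


def created_since(days, now=None):
    """Return a timezone-aware datetime `days` days before `now` (UTC by default)."""
    now = now or datetime.now(timezone.utc)
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    return now - timedelta(days=days)


# Usage (requires a configured client):
# page = client.list_extractions(status="completed", created_after=created_since(7))
```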

Returns

{
  "success": true,
  "extractions": [
    {
      "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "submission_id": "sub_abc",
      "task_name": "March invoices",
      "status": "completed",
      "created_at": "2026-04-15T10:30:00Z",
      "submission_method": "api",
      "file_count": 3,
      "file_names_preview": ["a.pdf", "b.pdf", "c.pdf"],
      "file_names_truncated": false,
      "output_structure": "per_invoice",
      "credits_deducted": 5,
      "available_outputs": ["xlsx", "csv", "json"],
      "output_expires_at": "2026-07-14T10:30:00Z"
    }
  ],
  "has_more": true,
  "next_cursor": "eyJjIjoiMjAyNi0wNC0xNVQxMDozMDowMFoiLCJpIjoxMjM0fQ"
}

When no extractions match, the SDK returns {"success": True, "extractions": [], "has_more": False, "next_cursor": None}.

List item shape

Every list item includes these fields:

Field	Description
`extraction_id`	The extraction's UUID. Use this with `get_extraction(...)`, `get_download_url(...)`, etc.
`submission_id`	Your idempotency ID from `submit_extraction(...)`, or `None` for web-app submissions.
`task_name`	The label you gave the extraction at submission, or `None`.
`status`	`"processing"`, `"completed"`, `"cancelled"`, or `"failed"`. `cancelled` represents a task cancelled from the web app while queued or processing.
`created_at`	ISO 8601 timestamp.
`submission_method`	`"api"` or `"web_app"`.
`file_count`	Total number of files uploaded for this extraction.
`file_names_preview`	Up to the first 5 file names, in submission order. Use `get_extraction(...)` to retrieve the full list.
`file_names_truncated`	`True` when `file_count > len(file_names_preview)` (i.e., the extraction has more files than fit in the preview).
`output_structure`	`"per_invoice"`, `"per_line_item"`, `"automatic"` (only while an automatic run is still resolving), or `None` for legacy/unknown rows.

Status-specific fields:

Completed items add credits_deducted (number), available_outputs (a list of "xlsx"/"csv"/"json" indicating which formats can currently be downloaded — empty when the output has aged past output_expires_at), and output_expires_at (ISO 8601 string).
Cancelled items represent tasks cancelled from the web app while queued or processing. They add credits_deducted; no output files are available.
Processing items add progress (0-100).
Failed items add error with the slim {"code": ..., "retryable": ...} shape — for the full message and details, call get_extraction(...) on that extraction.
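
Because the extra fields depend on `status`, code that renders list items usually branches on it. A minimal sketch, where `summarize_item` is a hypothetical helper and the output wording is purely illustrative:

```python
def summarize_item(item):
    """Build a one-line summary of a list item based on its status."""
    label = item.get("task_name") or item["extraction_id"]
    status = item["status"]
    if status == "completed":
        # An empty available_outputs list means the output has expired.
        outputs = ", ".join(item.get("available_outputs") or []) or "none (expired)"
        return f"{label}: completed, outputs: {outputs}"
    if status == "processing":
        return f"{label}: processing, {item.get('progress')}% done"
    if status == "failed":
        error = item.get("error") or {}
        retry = "retryable" if error.get("retryable") else "not retryable"
        return f"{label}: failed with {error.get('code')} ({retry})"
    return f"{label}: {status}"
```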

`iterate_extractions(...)`

Underlying API endpoint: GET /extractions (paged).

for extraction in client.iterate_extractions(status="completed"):
    print(extraction["extraction_id"], extraction["task_name"])

Parameters

Behavior

Yields individual records, not pages. The iterator yields each list item directly, in the same order as list_extractions(...) would return them.
Pages are fetched lazily. The iterator does not request the next page until every item from the current page has been yielded. Breaking out of the for loop early (or otherwise terminating the iterator) prevents the next page from being fetched.
Filters are preserved across pages. The original arguments you pass are reused for every page; only cursor advances.
Mid-stream errors propagate. If a page request fails, the iterator raises on the corresponding next() call and the consumer's for loop re-raises. Items already yielded remain yielded.
Defensive guard on bad pagination state. If the API ever returns has_more: True without a usable next_cursor, the iterator raises SDK_HTTP_ERROR rather than risking an infinite loop.

Validation runs eagerly: iterate_extractions(<invalid kwargs>) raises synchronously at the call site before the generator is returned. You don't need to start iterating to discover bad input.

Return type

A generator object (Python iterator). Usable with for ... in ..., list(...), next(...), etc.
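
Since the return value is an ordinary iterator, the standard `itertools` tools apply. As a small sketch, `itertools.islice` takes the first few records and then stops pulling items, so no further pages are requested beyond what those items need. `first_n` is a hypothetical helper name.

```python
from itertools import islice


def first_n(extractions, n):
    """Return at most n items from an iterator without consuming any more."""
    return list(islice(extractions, n))


# Usage (requires a configured client):
# recent = first_n(client.iterate_extractions(status="completed"), 10)
```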

`get_extraction(...)`

get_extraction(...) is record retrieval — distinct from check_extraction(...), which wraps the polling endpoint and is intended for "is it done yet?" checks against in-flight extractions.

Underlying API endpoint: GET /extractions/{extraction_id}/details.

result = client.get_extraction(
    extraction_id="a1b2c3d4-e5f6-7890-abcd-ef1234567890",
)
extraction = result["extraction"]

if extraction["status"] == "failed":
    print(extraction["error"]["code"], extraction["error"]["message"])
elif extraction["status"] == "completed":
    print("Available formats:", extraction["available_outputs"])

Parameters

Parameter	Required	Description
`extraction_id`	Yes	The extraction ID to retrieve.

Returns

{
  "success": true,
  "extraction": {
    "extraction_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "submission_id": "sub_abc",
    "task_name": "March invoices",
    "status": "completed",
    "created_at": "2026-04-15T10:30:00Z",
    "submission_method": "api",
    "file_count": 3,
    "file_names": ["a.pdf", "b.pdf", "c.pdf"],
    "output_structure": "per_invoice",
    "prompt": "Extract invoice number, date, vendor, total",
    "options": { "exclude_columns": [] },
    "credits_deducted": 5,
    "available_outputs": ["xlsx", "csv", "json"],
    "output_expires_at": "2026-07-14T10:30:00Z",
    "pages": {
      "successful_count": 6,
      "failed_count": 0,
      "successful": [{ "file_name": "a.pdf", "page": 1 }],
      "failed": [],
      "failure_reasons": []
    },
    "ai_uncertainty_notes": [],
    "review_needed": {
      "count": 0,
      "items": []
    }
  }
}

Record shape

Every record includes these fields:

Field	Description
`extraction_id`, `submission_id`, `task_name`, `created_at`, `submission_method`, `output_structure`	Same semantics as the list item shape.
`file_count`	Total number of files uploaded.
`file_names`	Full list of file names, in submission order (no truncation).
`prompt`	The original `prompt` you submitted: a string, a structured dict (`{"fields": [...], "general_prompt": "..."}`), an empty string for web-app submissions with no explicit prompt, or `None` only for legacy/edge rows.
`options`	Always `{"exclude_columns": [...]}`, even when nothing was excluded.

Status-specific fields:

Completed records add credits_deducted, available_outputs, output_expires_at, full pages (with successful, failed, and failure_reasons lists), prompt notes in ai_uncertainty_notes, and result-level review_needed warnings (same shape as on extract(...) returns).
Cancelled records represent tasks cancelled from the web app while queued or processing. They add credits_deducted; no output files are available.
Processing records add progress (0-100).
Failed records add error with the full shape {"code": ..., "message": ..., "retryable": ..., "details": ...} — get_extraction(...) exposes the full failure information regardless of when the extraction failed.

For team-admin lookups, the record may also include submitted_by with shape {"email": str | None}.

The details endpoint never includes signed download URLs — call download_output(...) or get_download_url(...) to actually download files.

`get_extraction(...)` does not raise on failed extractions

Common workflow: list, then download

for extraction in client.iterate_extractions(
    status="completed",
    submission_method="api",
):
    task_name = extraction.get("task_name") or ""
    if task_name.startswith("March invoices"):
        if "xlsx" in extraction["available_outputs"]:
            client.download_output(
                extraction_id=extraction["extraction_id"],
                format="xlsx",
                file_path=f"./march/{extraction['extraction_id']}.xlsx",
            )
        else:
            # available_outputs is empty when output_expires_at has passed —
            # the underlying file has been deleted by the 90-day retention policy.
            print(f"Output no longer available for {extraction['extraction_id']}")
        break

Working with Output Files

You can control the structure and formatting of all output files in two main ways:

use output_structure to choose the top-level record shape, such as per_invoice or per_line_item
use your prompt to describe the fields, grouping, and overall structure you want, such as "one row per product" or "one row per PO"

You can also use your prompt to:

specify missing-value placeholders, such as empty string, N/A, or 0
specify formatting requirements, such as YYYY-MM-DD, digits only, or no currency symbol
specify the intended output type, such as text, number, date, datetime, boolean, currency, or percentage

These instructions may appear differently across JSON, CSV, and XLSX outputs, but they all affect how the final export is produced.

At a high level:

JSON output is string-based.
CSV is text-based.
XLSX can use native spreadsheet cell types when values can be safely interpreted.

Working with JSON Output

JSON value typing

In the JSON output file, extracted field values are returned as strings.

Standard fields are returned as strings.
If you ask for a field to contain JSON, that field is returned as a string containing valid JSON.
All values inside that JSON are also strings.
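
Since every extracted value arrives as a string, converting to typed values happens in your own code. The sketch below shows one way to do it with the standard library. The helper names are hypothetical, and it assumes the prompt asked for digits with a full stop as the decimal separator, YYYY-MM-DD dates, true or false for booleans, and an empty string for missing values.

```python
from datetime import date
from decimal import Decimal


def parse_amount(value):
    """Parse an amount such as "1234.50"; an empty string means missing."""
    return Decimal(value) if value else None


def parse_iso_date(value):
    """Parse a YYYY-MM-DD date; an empty string means missing."""
    return date.fromisoformat(value) if value else None


def parse_bool(value):
    """Parse "true" or "false" (case-insensitive); anything else is None."""
    normalized = value.strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    return None
```

`Decimal` is used instead of `float` so that currency amounts keep their exact decimal representation.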

If you need numbers, booleans, or dates as typed values, parse them in your own code. If you plan to parse a value, state the formatting clearly in your prompt. For example:

"Do not include currency symbol"
"Use digits only"
"Return true or false"
"Use YYYY-MM-DD format"

Structured JSON fields

You can ask for a field to return structured JSON.

Example prompt:

"prompt": {
  "fields": [
    { "name": "Invoice Number" },
    {
      "name": "Line Items",
      "prompt": "Return a JSON array with keys description, quantity, unit_price, and amount. Use digits only for quantity. Use a full stop as the decimal separator. Do not include currency symbols in unit_price or amount. Do not use thousands separators. Use an empty string when a value is missing."
    }
  ]
}

Example JSON output value:

"Line Items": "[{\"description\":\"Widget\",\"quantity\":\"2\",\"unit_price\":\"9.99\",\"amount\":\"19.98\"}]"

In the example above, Line Items is a string whose content is valid JSON.
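
To work with such a field, decode the string with `json.loads`. A minimal sketch, where `decode_json_field` is a hypothetical helper and the empty-string handling assumes the prompt asked for an empty string when a value is missing:

```python
import json


def decode_json_field(row, field_name):
    """Decode a field whose string value contains JSON; None if empty or absent."""
    raw = row.get(field_name) or ""
    if not raw:
        return None
    return json.loads(raw)


row = {
    "Invoice Number": "INV-1001",
    "Line Items": '[{"description":"Widget","quantity":"2","unit_price":"9.99","amount":"19.98"}]',
}
line_items = decode_json_field(row, "Line Items")
# The decoded values are still strings, for example line_items[0]["quantity"] is "2".
```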

Use nested line-item JSON like the example above mainly for smaller or simpler cases, such as invoices with only a few line items where you want a single invoice-level object.

Recommended approach for line items

If you need detailed line item extraction, prefer output_structure: "per_line_item" instead of returning line items inside a nested JSON field.

This is strongly recommended when:

invoices may contain around 7 or more line items
line items need detailed per-field instructions
you want the most reliable line item extraction

In per_line_item, define invoice-level fields and line-item fields as separate top-level fields.

Many workflows can use the per_line_item output directly, with one row/object per line item.

Do not rely on Source File alone to group rows into invoices. Source File helps you trace where a row came from, but it is not a stable invoice identifier.

Example prompt for the recommended approach:

{
  "prompt": {
    "fields": [
      { "name": "Invoice Number" },
      { "name": "Invoice Date", "prompt": "Use YYYY-MM-DD format" },
      { "name": "Vendor Name" },
      { "name": "Line Item Description" },
      { "name": "Line Item Quantity", "prompt": "Use digits only" },
      { "name": "Line Item Unit Price" },
      { "name": "Line Item Amount" }
    ],
    "general_prompt": "For amount fields don't use thousands separators, use full stops as the decimal separator and do not include currency symbols."
  },
  "output_structure": "per_line_item"
}

Example JSON output rows:

[
  {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Vendor Name": "Acme Ltd",
    "Line Item Description": "Widget A",
    "Line Item Quantity": "2",
    "Line Item Unit Price": "9.99",
    "Line Item Amount": "19.98"
  },
  {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Vendor Name": "Acme Ltd",
    "Line Item Description": "Widget B",
    "Line Item Quantity": "1",
    "Line Item Unit Price": "5.00",
    "Line Item Amount": "5.00"
  }
]
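
If you do need to regroup per_line_item rows into invoices, group on extracted invoice fields rather than on Source File. The sketch below is one possible approach, not an SDK feature: the helper names are hypothetical, and the choice of Invoice Number plus Vendor Name as the grouping key is an assumption that depends on your own prompt fields.

```python
from collections import defaultdict
from decimal import Decimal


def group_line_items(rows, key_fields=("Invoice Number", "Vendor Name")):
    """Group per_line_item rows by a tuple of invoice-level field values."""
    grouped = defaultdict(list)
    for row in rows:
        key = tuple(row.get(field, "") for field in key_fields)
        grouped[key].append(row)
    return dict(grouped)


def invoice_totals(grouped, amount_field="Line Item Amount"):
    """Sum the line item amounts per group, skipping empty values."""
    return {
        key: sum((Decimal(r[amount_field]) for r in items if r.get(amount_field)), Decimal("0"))
        for key, items in grouped.items()
    }
```

Applied to the two example rows above, this yields a single group for ("INV-1001", "Acme Ltd") with a total of 24.98.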

CSV Output

CSV is a plain-text export. Every value in the CSV file is written as text.

XLSX Output

XLSX uses the most appropriate spreadsheet cell type for each value by default, and follows explicit prompt instructions where provided.

File Limits

Type	Max size
PDF	150 MB
JPG / JPEG / PNG	5 MB
Total batch size	2 GB
Max files per session	6,000

These limits apply to extract(...) and upload_files(...).

Polling

Several methods accept a polling option to control how the SDK polls for extraction status.

Field	Default	Description
`interval_ms`	`10000`	Milliseconds between polls. Minimum `5000`.
`timeout_ms`	`None`	Maximum time to wait in milliseconds. `None` means no timeout.

Used by: extract(...), wait_for_extraction_to_finish(...).
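
The sketch below builds a polling value from seconds and enforces the documented minimum interval before the request is made. It assumes `polling` is passed as a dict with the two keys from the table above, in the same way `download` is passed as a dict. `polling_options` is a hypothetical helper.

```python
MIN_INTERVAL_MS = 5000  # documented minimum for interval_ms


def polling_options(interval_s=10, timeout_s=None):
    """Build a polling dict from values given in seconds."""
    interval_ms = int(interval_s * 1000)
    if interval_ms < MIN_INTERVAL_MS:
        raise ValueError(f"interval must be at least {MIN_INTERVAL_MS} ms")
    timeout_ms = None if timeout_s is None else int(timeout_s * 1000)
    return {"interval_ms": interval_ms, "timeout_ms": timeout_ms}


# Usage (requires a configured client):
# result = client.wait_for_extraction_to_finish(
#     extraction_id="...",
#     polling=polling_options(interval_s=15, timeout_s=600),
# )
```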

`on_update`

def on_update(payload):
    # payload is a dict with: stage, level, message, progress, extraction_id
    print(payload["message"])

Field	Description
`stage`	Current lifecycle stage: `"upload"`, `"submission"`, `"waiting"`, `"download"`, or `"completion"`.
`level`	`"info"`, `"warn"`, or `"error"`.
`message`	Human-readable status message.
`progress`	Numeric progress when available, otherwise `None`.
`extraction_id`	The extraction ID once available, otherwise `None`.

Used by: extract(...), upload_files(...), wait_for_extraction_to_finish(...).
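
A callback can also filter updates instead of printing all of them. The following sketch records only warnings and errors. `make_update_recorder` is a hypothetical helper, and it relies only on the payload fields listed in the table above.

```python
LEVEL_ORDER = {"info": 0, "warn": 1, "error": 2}


def make_update_recorder(min_level="warn"):
    """Return (on_update, recorded); the callback keeps updates at or above min_level."""
    threshold = LEVEL_ORDER[min_level]
    recorded = []

    def on_update(payload):
        # Unknown or missing levels are treated as "info".
        if LEVEL_ORDER.get(payload.get("level"), 0) >= threshold:
            recorded.append((payload.get("stage"), payload.get("message")))

    return on_update, recorded


# Usage (requires a configured client):
# on_update, recorded = make_update_recorder()
# client.upload_files(files=["./invoice1.pdf"], on_update=on_update)
```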

Output expiry

There are two unrelated time limits on output files. Don't confuse them:

Limit	What expires	Duration	What to do
Signed download URL	The presigned URL itself	5 minutes	Request a fresh URL via `download_output(...)` or `get_download_url(...)`
Output file retention	The generated file in storage	90 days from `created_at`	Re-run the extraction; the original output is gone

output_expires_at (on completed responses) tells you when the underlying file will be deleted. After that timestamp:

output["xlsx_url"], output["csv_url"], output["json_url"] on polling/extract responses are None.
available_outputs on list/details responses is an empty list.
get_download_url(...) and download_output(...) raise OUTPUT_EXPIRED.
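
To avoid a failed download call, you can compare output_expires_at with the current time first. The sketch below assumes timestamps in the format shown in the response examples (for instance "2026-07-14T10:30:00Z"). The helper names are hypothetical. The trailing "Z" is rewritten because `datetime.fromisoformat` only accepts it from Python 3.11 onwards.

```python
from datetime import datetime, timezone


def parse_api_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing "Z" on Python 3.9+."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def output_is_expired(record, now=None):
    """True or False for records with output_expires_at, None when it is absent."""
    expires_at = record.get("output_expires_at")
    if not expires_at:
        return None
    now = now or datetime.now(timezone.utc)
    return now >= parse_api_timestamp(expires_at)
```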

Conventions

Method names are snake_case: extract(...), upload_files(...), submit_extraction(...).
All parameter names are snake_case: api_key, folder_path, output_structure, task_name, upload_session_id, file_ids, console_output, on_update, file_path.
Response fields are snake_case, matching the API exactly. The SDK returns the same JSON shapes as the raw API — if you have the API docs, those response examples are valid for the SDK too.
The files parameter accepts local file paths as strings only. File objects, byte streams, and in-memory buffers are not supported in v1.

Rate Limits

Endpoints	Limit
Upload endpoints (create session, get part URLs, complete upload)	600 requests per minute
Submit extraction	30 requests per minute
Poll extraction status	120 requests per minute
List extractions	60 requests per minute
Get extraction details	60 requests per minute
Download output	30 requests per minute
Delete extraction	30 requests per minute
Check credit balance	60 requests per minute

Errors

SDK methods raise exceptions on failure:

On failure, a method raises an SdkError (for SDK-level and validation errors) or ApiResponseError (for API response errors).
The structured error body is available on error.body.
error.body uses the same JSON error shape as the API.

Error body shape:

{
  "success": false,
  "error": {
    "code": "SOME_ERROR_CODE",
    "message": "Human-readable message.",
    "retryable": false,
    "details": null
  }
}

Read the error like this:

import os

from invoicedataextraction import InvoiceDataExtraction
from invoicedataextraction.errors import SdkError, ApiResponseError

try:
    client = InvoiceDataExtraction(
        api_key=os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY"),
    )
    client.check_extraction(extraction_id="...")
except (SdkError, ApiResponseError) as error:
    # error.body follows the error body shape shown above
    print(error.body["error"]["code"])
    print(error.body["error"]["message"])
    print(error.body["error"]["retryable"])
    print(error.body["error"]["details"])

INVALID_INPUT can come from either the SDK (caught before the request is sent) or the API. Handle it the same way in both cases.

Authentication errors (UNAUTHENTICATED, API_KEY_EXPIRED, API_KEY_REVOKED) indicate a problem with your API key — generate a new one from your dashboard.

The SDK automatically retries RATE_LIMITED and transient INTERNAL_ERROR responses, but will surface them if retries are exhausted.

Extraction task failure codes

Branch on error["code"], error["retryable"], and any returned error["details"] directly. The SDK does not provide subclasses for individual extraction failure codes.

Task failure vs SDK/API failure:

After an extraction task has been accepted, the task itself can still finish with status: "failed".
That is a task outcome, not an SDK error.
check_extraction(...), wait_for_extraction_to_finish(...), and extract(...) return the polling response body for task states such as processing, completed, cancelled, and failed.
When a task ends with status: "failed", the failure details are in the returned response body, not on error.body.
error.body is only used when the SDK method/request itself fails — validation errors, authentication errors, network failures, timeouts, or other operational failures.
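
In practice this means inspecting the returned response even when no exception was raised. A minimal sketch follows, with `describe_outcome` as a hypothetical helper. It assumes a failed polling response carries its failure details under an `error` key, as the list and details responses do. Adjust it if your responses differ.

```python
def describe_outcome(result):
    """Summarize a terminal polling response returned by the SDK."""
    status = result.get("status")
    if status == "failed":
        # Assumption: failure details live under result["error"].
        error = result.get("error") or {}
        return f"task failed: {error.get('code')} (retryable={error.get('retryable')})"
    if status == "completed":
        problems = []
        failed_pages = (result.get("pages") or {}).get("failed_count", 0)
        review_count = (result.get("review_needed") or {}).get("count", 0)
        if failed_pages > 0:
            problems.append(f"{failed_pages} page(s) failed")
        if review_count > 0:
            problems.append(f"{review_count} item(s) need review")
        return "completed" + (": " + "; ".join(problems) if problems else "")
    return f"task ended with status {status}"
```

Exceptions still need their own `except (SdkError, ApiResponseError)` handler. This helper only covers responses that were returned normally.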

SDK-specific error codes:

Code	When the SDK uses it
`SDK_FILESYSTEM_ERROR`	A local filesystem operation failed, such as reading an input file, creating a directory, or writing a downloaded file.
`SDK_NETWORK_ERROR`	A network request failed before the SDK received a valid HTTP response.
`SDK_HTTP_ERROR`	The SDK received an unexpected HTTP response shape, such as a non-JSON response or another response that does not match the documented contract.
`SDK_TIMEOUT_ERROR`	`wait_for_extraction_to_finish(...)` timed out before the extraction finished.
`SDK_DOWNLOAD_ERROR`	An SDK-managed download step failed.
`SDK_UPLOAD_ERROR`	An SDK-managed upload orchestration step failed.

Method to API Endpoint Mapping

SDK Method	Underlying API
`extract(...)`	`upload_files(...)` → `submit_extraction(...)` → `wait_for_extraction_to_finish(...)` → `download_output(...)`
`upload_files(...)`	`POST /uploads/sessions` → `POST /uploads/sessions/{id}/parts` → `POST /uploads/sessions/{id}/complete`
`submit_extraction(...)`	`POST /extractions`
`wait_for_extraction_to_finish(...)`	`GET /extractions/{extraction_id}` (polled)
`download_output(...)`	`GET /extractions/{extraction_id}/output?format={format}` → presigned URL download
`check_extraction(...)`	`GET /extractions/{extraction_id}`
`get_download_url(...)`	`GET /extractions/{extraction_id}/output?format={format}`
`delete_extraction(...)`	`DELETE /extractions/{extraction_id}`
`get_credits_balance()`	`GET /credits/balance`
`list_extractions(...)`	`GET /extractions`
`iterate_extractions(...)`	`GET /extractions` (auto-paginated)
`get_extraction(...)`	`GET /extractions/{extraction_id}/details`
