Build a Streamlit Invoice Extraction App in Python

Build a Streamlit invoice extraction app in Python with file upload, structured results, validation, and CSV or Excel export using one SDK-based flow.


A Streamlit invoice extraction app usually needs four parts: file upload, an extraction backend, a normalized results view, and CSV or Excel export. That is what makes Streamlit a good fit for this workflow. You can build the interface in Python, iterate quickly, and keep the app useful even before it becomes a broader internal tool.

That pattern is becoming more common across data-heavy tools. JetBrains' 2025 Python framework survey reported Streamlit at 12% usage in 2024, up 4 percentage points from 2023. For internal document workflows, that matters because Streamlit is strong at the parts developers need first: upload controls, stateful interaction, and a review surface that can go from prototype to shared tool without switching languages.

The mistake many tutorials make is treating the UI as the hard part. In practice, the bottleneck is structured extraction. A useful Streamlit invoice extraction guide should show you how to upload invoices, send them to a backend that understands invoice headers and line items, review the output, and export clean files without dragging you through AWS-heavy setup or custom-model plumbing.

The next sections stay focused on that practical path: choose one backend, wire it into the UI, add a review layer, and finish with downloads a finance team can actually use.

Choose the Extraction Backend Before You Build the UI

If you start with the interface and postpone the extraction decision, you usually end up rewriting the app later. The backend determines what your app can display reliably, how much post-processing you have to own, and whether your exports are usable after the first handful of test invoices.

Here is the practical tradeoff:

| Approach | Good fit | Main friction |
| --- | --- | --- |
| DIY OCR pipeline | You need low-level control over OCR and downstream rules | OCR gives you text, not dependable invoice fields, line items, or export-ready structure |
| Raw LLM or direct API orchestration | You need full HTTP control or a non-Python stack | You own upload, submit, poll, download, retries, and schema stability |
| Managed Python SDK | You are building in Python and want the fastest credible path to a working app | You trade some low-level ceremony for a much shorter build path |

That is why Streamlit invoice OCR is often too narrow for the real job. OCR helps you see text on the page. It does not guarantee normalized invoice number, date, vendor, tax, total, or line-item fields in a format your finance team can review or export. If you want a broader survey of core Python invoice extraction approaches beyond the UI layer, that topic is worth reading separately. For this build, the question is narrower: how do you get a usable Streamlit app working fast?

For Python, the cleanest answer is to use the SDK rather than wire the REST API by hand on day one. The REST API uses Bearer-token authentication and a multi-step lifecycle: create an upload session, upload file parts, complete the upload, submit the extraction, poll until it finishes, then download the output. That flow is exactly what you want if you need explicit HTTP control later, and the same engine is available through a production-ready invoice extraction API for Streamlit apps. But for a Python-first tutorial, the SDK is the better fit because it wraps the upload, submission, polling, and download steps into one extraction flow.

It also lets you choose the output structure up front. Use automatic when you are experimenting, per_invoice when each invoice should become one row or one object, and per_line_item when line items are the main deliverable. This tutorial sticks with per_invoice for the core implementation because it keeps the first version of the app predictable. If your real workflow depends on detailed line-item extraction, you can switch the same app to per_line_item once the upload and review flow is in place.
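As a rough illustration of the difference, here is what the two structures might produce for the same invoice. The field names mirror the prompt fields used later in this tutorial, but the records themselves are hypothetical:

```python
# Hypothetical shapes for the same invoice under each output structure.
# per_invoice: one record per invoice.
per_invoice = [
    {"Invoice Number": "INV-1001", "Vendor Name": "Acme Ltd", "Total Amount": "125.00"},
]

# per_line_item: one record per product or service row. Each row carries a
# stable field like Invoice Number so rows can be regrouped by invoice later.
per_line_item = [
    {"Invoice Number": "INV-1001", "Line Item Description": "Widget", "Line Item Amount": "100.00"},
    {"Invoice Number": "INV-1001", "Line Item Description": "Shipping", "Line Item Amount": "25.00"},
]
```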


Set Up the Streamlit App Skeleton

Keep the first version small. You do not need queues, background workers, or a custom frontend to prove the workflow. You need one Streamlit app that can accept invoices, persist them to temporary storage, trigger extraction, and keep the results available across reruns.

The minimal dependency stack is straightforward: Streamlit for the UI, pandas for tabular review and export, openpyxl for Excel output, Pydantic for validation, and the invoice extraction SDK for the backend call. Use one file named app.py for the first pass, then run it locally with this setup:

pip install streamlit pandas openpyxl pydantic invoicedataextraction-sdk
streamlit run app.py

If you prefer Streamlit secrets, create .streamlit/secrets.toml containing the line below. If you would rather use an environment variable, export the same INVOICE_DATA_EXTRACTION_API_KEY value in your shell before starting Streamlit:

INVOICE_DATA_EXTRACTION_API_KEY = "your_api_key_here"

The rest of the tutorial keeps extending this same app.py file, so each code block is cumulative rather than a separate example.

from pathlib import Path
import tempfile

import streamlit as st

st.set_page_config(page_title="Invoice Extraction", layout="wide")
st.title("Invoice Extraction Demo")

if "uploaded_paths" not in st.session_state:
    st.session_state.uploaded_paths = []
if "extraction_result" not in st.session_state:
    st.session_state.extraction_result = None
if "output_dir" not in st.session_state:
    st.session_state.output_dir = None
if "rows" not in st.session_state:
    st.session_state.rows = []
if "validation_errors" not in st.session_state:
    st.session_state.validation_errors = []


def persist_uploads(uploaded_files):
    work_dir = Path(tempfile.mkdtemp(prefix="streamlit-invoices-"))
    saved_paths = []

    for uploaded_file in uploaded_files:
        destination = work_dir / uploaded_file.name
        destination.write_bytes(uploaded_file.getbuffer())
        saved_paths.append(str(destination))

    return saved_paths


uploaded_files = st.file_uploader(
    "Upload invoice PDFs or images",
    type=["pdf", "png", "jpg", "jpeg"],
    accept_multiple_files=True,
)

if uploaded_files:
    st.session_state.uploaded_paths = persist_uploads(uploaded_files)
    st.session_state.extraction_result = None
    st.session_state.output_dir = None
    st.session_state.rows = []
    st.session_state.validation_errors = []

    st.caption("Files ready for extraction")
    st.write([Path(path).name for path in st.session_state.uploaded_paths])

run_extraction = st.button(
    "Extract invoice data",
    disabled=not st.session_state.uploaded_paths,
)

This gives you a working Python invoice upload UI with a clear boundary between the files the user uploaded and the data your app will generate later. That separation matters in a Streamlit invoice processing app because reruns are normal. If uploaded files, extraction results, and downloadable outputs all live in the same bucket of state, one small interaction can force the user to upload everything again.

The other important detail is temporary storage. If you are building a Streamlit PDF extraction app, the upload widget gives you file-like objects, but your backend integration is easier to reason about when it receives real file paths. Persisting uploads to a temporary directory keeps the UI code tidy and makes the next step, calling the extraction layer, much more predictable.

Wire Streamlit to the Python SDK and Extract Structured Invoice Data

Now commit to one backend path and wire it all the way through. For this tutorial, the best fit is the official Python SDK because it handles file upload, extraction submission, polling, and result download in one flow. That keeps the Streamlit code focused on the UI instead of turning the app into a hand-rolled orchestration layer.

Install the package, then initialize the client with an API key from Streamlit secrets or your environment. The extraction prompt should be explicit enough to stabilize the output shape, which is why an object-style prompt is better than a loose one-line string for this build.

pip install invoicedataextraction-sdk

from pathlib import Path
import tempfile

import streamlit as st
from invoicedataextraction import InvoiceDataExtraction


def build_invoice_prompt():
    return {
        "fields": [
            {"name": "Invoice Number"},
            {"name": "Invoice Date", "prompt": "Use YYYY-MM-DD format"},
            {"name": "Vendor Name"},
            {"name": "Net Amount", "prompt": "No currency symbol, 2 decimal places"},
            {"name": "Tax Amount", "prompt": "Use 0 if tax is missing"},
            {"name": "Total Amount", "prompt": "No currency symbol, 2 decimal places"},
        ],
        "general_prompt": (
            "Extract one record per invoice. Ignore email cover sheets. "
            "If a field is missing, leave it blank unless the field instructions say otherwise."
        ),
    }


def run_invoice_extraction(file_paths):
    output_dir = Path(tempfile.mkdtemp(prefix="invoice-output-"))
    # Read the key from the environment first, then fall back to Streamlit
    # secrets, matching the two setup options shown earlier.
    import os

    api_key = os.environ.get("INVOICE_DATA_EXTRACTION_API_KEY") or st.secrets[
        "INVOICE_DATA_EXTRACTION_API_KEY"
    ]
    client = InvoiceDataExtraction(api_key=api_key)

    result = client.extract(
        files=file_paths,
        prompt=build_invoice_prompt(),
        output_structure="per_invoice",
        download={"formats": ["json", "xlsx"], "output_path": str(output_dir)},
        console_output=False,
    )

    if result["status"] == "failed":
        raise RuntimeError(result["error"]["message"])

    return result, output_dir


if run_extraction:
    with st.spinner("Extracting invoice data..."):
        try:
            result, output_dir = run_invoice_extraction(
                st.session_state.uploaded_paths
            )
        except RuntimeError as error:
            st.error(f"Extraction failed: {error}")
        else:
            st.session_state.extraction_result = result
            st.session_state.output_dir = str(output_dir)

There are three practical details worth paying attention to here.

First, the extract method is doing more than a single HTTP request. It wraps the upload, submit, poll, and download lifecycle that you would otherwise manage yourself through the REST API. That is the main reason it is the right choice for a Python tutorial like this one.

Second, the returned result object is operationally important. Check the pages.failed_count value after every run. If it is greater than zero, some pages failed processing and will not be in your output. The result can also include pages.failed for specific file and page failures, ai_uncertainty_notes for ambiguous extractions, and output URLs for the generated files. Those fields are what let your app become reviewable instead of blindly optimistic.

Third, choose the output structure on purpose. Per-invoice output is the cleanest starting point for a dashboard-style app because each invoice becomes one row or one object. Switch to per-line-item when invoice line-item extraction is central to the workflow or when downstream users need each product or service row separately. If you do that, keep a stable field like Invoice Number in the prompt so you can regroup rows by invoice later.
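If you do switch to per-line-item output, the regrouping step is short with pandas. A minimal sketch, assuming each returned row carries the Invoice Number field as recommended above (the rows here are hypothetical stand-ins for the downloaded output):

```python
import pandas as pd

# Hypothetical per_line_item rows: each line item repeats its parent invoice number.
rows = [
    {"Invoice Number": "INV-1001", "Line Item Amount": "100.00"},
    {"Invoice Number": "INV-1001", "Line Item Amount": "25.00"},
    {"Invoice Number": "INV-1002", "Line Item Amount": "40.00"},
]

df = pd.DataFrame(rows)
df["Line Item Amount"] = df["Line Item Amount"].astype(float)

# Regroup line items back to one row per invoice by summing amounts.
per_invoice = df.groupby("Invoice Number", as_index=False)["Line Item Amount"].sum()
```

The same groupby key also works for joining line items back onto invoice-level fields later, which is why keeping Invoice Number on every row is worth the small redundancy.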

The same extraction engine also exists through the web platform and API, so this is not a toy path. It is a UI-first way to reach structured JSON, CSV, or XLSX output without rebuilding the orchestration yourself.

Show Reviewable Results, Validate Fields, and Export CSV or Excel

Once extraction works, the app still needs to answer the user-facing question: can I trust what I am about to download? That is why the review layer matters. A useful Streamlit data extraction dashboard should show normalized invoice rows, surface any failed pages, expose uncertainty notes, and stop obviously broken data before export.

With the per-invoice prompt from the previous section, the downloaded JSON will look roughly like this:

[
  {
    "Invoice Number": "INV-1001",
    "Invoice Date": "2025-01-15",
    "Vendor Name": "Acme Ltd",
    "Net Amount": "100.00",
    "Tax Amount": "25.00",
    "Total Amount": "125.00"
  }
]

That shape is simple enough to review directly, but the app still benefits from one small normalization layer before it renders anything. The same file can also support optional line-item handling later if you switch the extraction to per_line_item and keep Invoice Number on every returned row.

import json
from io import BytesIO
from pathlib import Path

import pandas as pd
import streamlit as st
from pydantic import BaseModel, Field, ValidationError


class InvoiceRow(BaseModel):
    invoice_number: str = Field(alias="Invoice Number")
    invoice_date: str = Field(alias="Invoice Date")
    vendor_name: str = Field(alias="Vendor Name")
    total_amount: str = Field(alias="Total Amount")


def load_rows(output_dir, extraction_result):
    json_files = sorted(
        Path(output_dir).glob("*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not json_files:
        output_urls = extraction_result.get("output", {})
        raise RuntimeError(
            "No local JSON output was downloaded. Retry the SDK download step "
            f"or use the returned json_url: {output_urls.get('json_url')}"
        )
    latest_json = json_files[-1]
    return json.loads(latest_json.read_text(encoding="utf-8"))


def normalize_rows(rows):
    invoice_rows = []
    line_item_rows = []

    for row in rows:
        invoice_rows.append(
            {
                "Invoice Number": row.get("Invoice Number"),
                "Invoice Date": row.get("Invoice Date"),
                "Vendor Name": row.get("Vendor Name"),
                "Net Amount": row.get("Net Amount"),
                "Tax Amount": row.get("Tax Amount"),
                "Total Amount": row.get("Total Amount"),
            }
        )

        if row.get("Line Item Description"):
            line_item_rows.append(
                {
                    "Invoice Number": row.get("Invoice Number"),
                    "Line Item Description": row.get("Line Item Description"),
                    "Line Item Quantity": row.get("Line Item Quantity"),
                    "Line Item Unit Price": row.get("Line Item Unit Price"),
                    "Line Item Amount": row.get("Line Item Amount"),
                }
            )

    return pd.DataFrame(invoice_rows), pd.DataFrame(line_item_rows)


def validate_invoice_rows(invoice_df):
    valid_rows = []
    invalid_rows = []

    for row in invoice_df.to_dict(orient="records"):
        try:
            valid_rows.append(
                InvoiceRow.model_validate(row).model_dump(by_alias=True)
            )
        except ValidationError as error:
            invalid_rows.append({"row": row, "errors": error.errors()})

    return pd.DataFrame(valid_rows), invalid_rows


if st.session_state.extraction_result:
    rows = load_rows(
        st.session_state.output_dir,
        st.session_state.extraction_result,
    )
    invoice_df, line_item_df = normalize_rows(rows)
    validated_invoice_df, invalid_rows = validate_invoice_rows(invoice_df)

    pages = st.session_state.extraction_result["pages"]
    metric_col_1, metric_col_2, metric_col_3 = st.columns(3)
    metric_col_1.metric("Successful pages", pages["successful_count"])
    metric_col_2.metric("Failed pages", pages["failed_count"])
    metric_col_3.metric("Validated rows", len(validated_invoice_df))

    invoice_tab, line_item_tab = st.tabs(["Invoices", "Line items"])
    with invoice_tab:
        st.dataframe(validated_invoice_df, use_container_width=True)
    with line_item_tab:
        if line_item_df.empty:
            st.caption(
                "Switch to per_line_item and add line-item fields when you need item-level review."
            )
        else:
            st.dataframe(line_item_df, use_container_width=True)

    if invalid_rows:
        st.warning(f"{len(invalid_rows)} rows failed validation")
        st.json(invalid_rows)

    if pages["failed_count"] > 0:
        with st.expander("Failed pages"):
            # pages.failed is optional in the result, so read it defensively.
            st.json(pages.get("failed", []))

    if st.session_state.extraction_result.get("ai_uncertainty_notes"):
        with st.expander("AI uncertainty notes"):
            st.json(st.session_state.extraction_result["ai_uncertainty_notes"])

    csv_bytes = validated_invoice_df.to_csv(index=False).encode("utf-8")

    excel_buffer = BytesIO()
    with pd.ExcelWriter(excel_buffer, engine="openpyxl") as writer:
        validated_invoice_df.to_excel(writer, index=False, sheet_name="Invoices")
        if not line_item_df.empty:
            line_item_df.to_excel(writer, index=False, sheet_name="LineItems")

    st.download_button("Download CSV", csv_bytes, "invoices.csv", "text/csv")
    st.download_button(
        "Download Excel",
        excel_buffer.getvalue(),
        "invoices.xlsx",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    )

This is where schema validation earns its place. If invoice date, vendor, or total are mandatory for your workflow, do not wait until a user opens the spreadsheet to discover that those fields are missing or malformed. A companion pattern is covered in Pydantic validation for extracted invoice JSON, but the key idea is simple: validate before export, not after.
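If presence checks are not enough, Pydantic field validators can enforce formats too. A hedged sketch: the date rule below mirrors the YYYY-MM-DD instruction from the extraction prompt, and the model name is illustrative rather than part of the app above:

```python
from datetime import date

from pydantic import BaseModel, Field, field_validator


class StrictInvoiceRow(BaseModel):
    # Aliases match the field names returned by the per-invoice prompt.
    invoice_number: str = Field(alias="Invoice Number")
    invoice_date: str = Field(alias="Invoice Date")

    @field_validator("invoice_date")
    @classmethod
    def must_be_iso_date(cls, value: str) -> str:
        # Raises ValueError (surfaced as a ValidationError) on non-ISO input.
        date.fromisoformat(value)
        return value
```

Swapping this stricter model into validate_invoice_rows turns a malformed date into a reviewable validation error instead of a bad cell in the export.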

It also gives you a cleaner review experience than a single dataframe dump. The status metrics tell the user whether the extraction completed cleanly. Tabs separate invoice-level review from optional line-item review. Expanders keep failed pages and uncertainty notes visible without overwhelming the main table. That makes the app feel closer to a real internal tool than a thin demo.

One accuracy detail is easy to miss: auto-download is a convenience, not a guarantee. If the extraction completed but the JSON file did not land locally, use the returned output URL or a later download step instead of assuming the file will always be there.

At this point, the same app.py file contains upload, extraction, validation, review, and export in one runnable flow. CSV and Excel simply serve different handoff points. CSV is usually the faster path into another script, database load, or automation step. Excel is better when finance users want typed spreadsheet output or a file they can review manually before posting.
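For the script-or-database handoff, the validated frame can go straight into SQLite with pandas. A minimal sketch, with a one-row hypothetical frame standing in for validated_invoice_df:

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the validated export.
df = pd.DataFrame(
    [{"Invoice Number": "INV-1001", "Vendor Name": "Acme Ltd", "Total Amount": 125.0}]
)

# Load the rows into an in-memory database; swap in a file path for real use.
conn = sqlite3.connect(":memory:")
df.to_sql("invoices", conn, index=False)

count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
```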


Turn the Demo Into a Shared Internal Tool

A Streamlit invoice processing demo is often enough when one person is testing prompts, reviewing sample invoices, and exporting results for a small internal workflow. Once other users depend on it, the priorities change. You need secrets stored outside the app code, clearer status messages, retry handling for failed files, and a record of what was processed and when. Those are the differences between a useful prototype and a tool other people can trust.

A good dividing line is whether Streamlit is still doing two jobs at once. If the app both manages the interface and owns the full extraction workflow, it works well for early internal tools with light traffic. If multiple users are uploading files, waiting on longer jobs, or expecting consistent operational behavior, move the extraction logic behind a service boundary and let the UI call it. That is where a headless FastAPI invoice extraction service becomes a cleaner fit, because the UI can focus on uploads, progress, and review while the backend handles orchestration.

The production-minded step is to stop relying only on one-call execution. The Python SDK also supports a staged workflow:

  • upload_files(...)
  • submit_extraction(...)
  • wait_for_extraction_to_finish(...)
  • download_output(...)

Those methods matter when you want queues, worker processes, or background jobs rather than tying the entire extraction lifecycle to one Streamlit request. If your team later needs explicit HTTP control or a non-Python stack, the same engine is available through the REST API, but for Python-based internal tools the staged SDK path is usually the cleaner step up.
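Before reaching for a full queue, a thread pool is often enough to get a long extraction off the Streamlit request thread. This sketch is framework-agnostic: extract_fn is a placeholder for whatever function wraps the staged SDK calls, not an SDK method itself.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Created once per process. In Streamlit you would typically wrap this in
# @st.cache_resource so reruns reuse the same pool.
executor = ThreadPoolExecutor(max_workers=2)


def submit_extraction_job(extract_fn, file_paths) -> Future:
    """Run the extraction off the request thread and return a pollable Future."""
    return executor.submit(extract_fn, file_paths)
```

Stash the returned Future in st.session_state, then check future.done() on each rerun to show progress without blocking the UI.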

Use a short decision framework:

  • Keep everything inside Streamlit if one team uses it, files are modest, and you mainly need a fast internal utility.
  • Add stronger validation, background processing, and audit trails if the app is becoming shared infrastructure for analysts or AP staff.
  • Split the UI from the extraction service when you need clearer security boundaries, better concurrency, or richer product behavior than Streamlit should carry on its own.

That last category is also where interface choice starts to matter. Streamlit is excellent for Python-first internal tools, but if you need more custom document workflows, broader user roles, or a more productized browser experience, a Next.js alternative for upload-and-extract document apps may be the better front end. The practical next move is to decide which stage you are in today, then implement only the upgrade that matches it: secrets and logging for a solo tool, queueing and staged SDK calls for a shared team app, or a separate UI and extraction service when the workflow becomes a real internal platform.

About the author


David Harding

Founder, Invoice Data Extraction

David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.

Editorial process

This page is reviewed as part of Invoice Data Extraction's editorial process.

If this page discusses tax, legal, or regulatory requirements, treat it as general information only and confirm current requirements with official guidance before acting. The updated date shown above is the latest editorial review date for this page.
