Key-value pair (KVP) extraction is the step in a document pipeline that turns the raw text an OCR system reads off an invoice into labeled fields — pairing keys like Invoice Number, Invoice Date, Vendor, Subtotal, Tax, and Total with their corresponding values on the page. It sits between optical character recognition (OCR), which gives you characters with no sense of what any of them mean, and the structured output a spreadsheet, ERP, or validation rule can consume directly.
Run key-value pair extraction against a routine supplier invoice and the output reads like this:
- Invoice Number: INV-2024-0487
- Invoice Date: 2024-11-04
- Vendor Name: Acme Supplies Ltd
- Subtotal: 1,240.00
- Tax (VAT 20%): 248.00
- Total: 1,488.00
Each line is a pair. The left side is the key, which names what the value represents. The right side is the value, which is the actual string or number lifted from the document. The extraction step has decided, for every field, which text on the page belongs under which label — not just that a date appears somewhere, but that 2024-11-04 is the invoice date rather than the due date, the shipment date, or the payment terms reference.
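The pairs above map directly onto a structured record that downstream code can address by name. A minimal sketch, with illustrative field names (real systems define their own schema):

```python
import json

# The extracted pairs from the sample invoice above, as a structured record.
# Field names here are illustrative, not a fixed standard.
extracted = {
    "invoice_number": "INV-2024-0487",
    "invoice_date": "2024-11-04",
    "vendor_name": "Acme Supplies Ltd",
    "subtotal": "1,240.00",
    "tax": "248.00",
    "total": "1,488.00",
}

# Downstream code asks for a field by name instead of parsing raw text.
print(json.dumps(extracted, indent=2))
```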
The concept appears in research papers and product documentation under overlapping names — key information extraction (KIE), field extraction, document parsing — but the mental model is the same: identify the keys a document carries and pull out the value attached to each one. The approach is feasible on invoices and the broader financial-document cluster (purchase orders, utility bills, credit notes, vendor statements, receipts) because the field set is small and mostly predictable even when layouts vary by supplier. It does not extend to government forms, passports, or identity documents, where the regulatory context and field sets differ. Across the rest of this article, invoice key-value extraction is the running example, with detours into adjacent document types only where they illustrate a point more clearly than an invoice would.
Why KVP Output Is Not the Same as Plain OCR Text
Send the same invoice through a pure OCR engine — AWS Textract in text mode, Tesseract, or a vision model asked only to transcribe — and the response is a different kind of object entirely. OCR returns a linearized stream of the characters on the page, usually in reading order, sometimes with positional coordinates attached to each token. On our example supplier invoice the text response reads roughly as:
INVOICE INV-2024-0487 Acme Supplies Ltd 15 Warehouse Lane Bill To Thornhill Trading Ltd Invoice Date 2024-11-04 Due Date 2024-12-04 PO Ref PO-778 Description Quantity Unit Price Line Total A4 Paper 80gsm 20 12.50 250.00 Subtotal 1,240.00 VAT 20% 248.00 Total 1,488.00
Every substring on the invoice made it into the response, but nothing in the response carries a label. The number 1,488.00 is just the last number in the stream; nothing marks it as the grand total rather than a line total, a prior balance, or an unrelated figure in the footer. The date 2024-11-04 is just the first date; nothing distinguishes it from the due date that follows. The system reading this output has all of the text and none of the meaning.
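The ambiguity is easy to demonstrate. A naive pattern match over the linearized stream finds every date and every money-like number, but nothing in the stream says which date is the invoice date or which amount is the grand total:

```python
import re

# The linearized OCR stream from the sample invoice above.
ocr_text = (
    "INVOICE INV-2024-0487 Acme Supplies Ltd 15 Warehouse Lane Bill To "
    "Thornhill Trading Ltd Invoice Date 2024-11-04 Due Date 2024-12-04 "
    "PO Ref PO-778 Description Quantity Unit Price Line Total "
    "A4 Paper 80gsm 20 12.50 250.00 Subtotal 1,240.00 VAT 20% 248.00 "
    "Total 1,488.00"
)

# Find every ISO-style date and every money-like number in the stream.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", ocr_text)
amounts = re.findall(r"\d[\d,]*\.\d{2}", ocr_text)

print(dates)    # two candidate dates, no labels
print(amounts)  # five candidate amounts, no labels
```

Both lists are correct transcriptions and useless as fields: the disambiguation step is exactly the work the KVP layer performs.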
Run the same invoice through a KVP extraction step on top of that OCR layer and the output changes shape. Each field becomes addressable by name. The extraction did three things the OCR layer could not:
- It attached a label. Invoice Number is a key; INV-2024-0487 is the value paired with it. A downstream system asks for the invoice number by name rather than guessing which substring in a text blob is the one it wants.
- It resolved ambiguity. Real invoices routinely carry multiple dates (invoice date, due date, service period, shipment date), multiple reference numbers (invoice number, PO number, account number), and multiple totals (line total, subtotal, tax amount, grand total). A linearized text stream cannot disambiguate these. The extraction step does, and that disambiguation is the work a KVP layer actually performs.
- It produced addressable output. Downstream code queries the total field and gets 1,488.00 back, not the last number in a text blob that happened to look like money. Addressable-by-name is the property that separates a field the rest of your stack can use from a string your stack has to parse a second time.
There is a subtlety worth naming directly. A modern KVP extraction system usually does its own text recognition — either by calling an OCR engine internally or by using a vision model that performs equivalent character recognition as part of its inference. KVP extraction is not a replacement for OCR; it is the labeling layer above it. Teams new to invoice OCR often conflate the two, partly because vendor marketing uses "OCR" loosely to describe the whole pipeline. Separating the layers correctly, with recognition at the bottom and labeling on top, is what lets you reason about which vendor demo is actually solving your problem rather than the one below it.
The consequence for anything downstream is what motivates the rest of this article. An invoice that needs to clear a three-way match against a PO and a delivery note, route to an approver, post to the AP subledger, or flow into a month-end close spreadsheet cannot do any of those things from unlabeled OCR text. It needs fields, not strings. Every subsequent stage in an invoice workflow is written against labeled data, which means the extraction step either delivers labels or quietly becomes the bottleneck of the whole pipeline.
The Four Approaches to Invoice KVP Extraction, and Where Each Breaks
KVP extraction on invoices has moved through four successive regimes in production over roughly the last decade. Each regime solved a real problem its predecessor could not, then ran into a specific failure mode that the next regime addressed. Walking them in order is the fastest way to place any vendor claim, any open-source library, or any internal build into the broader landscape.
Template and zonal extraction
The operator opens a reference invoice from each supplier, draws bounding boxes around every field that matters, and labels them — this box is the Invoice Number, this box is the Total. The system stores the layout and, for every subsequent invoice from that supplier, lifts the text out of those coordinates.
Strengths are real. Templates are fast to run, cheap to host, and fully interpretable — you can point at a failed field and say which box moved. For a narrow, stable supplier base, template extraction can be close to perfect.
The failure mode is invoice reality. Layouts vary across suppliers, which makes every new supplier an engineering task: a human has to sit with a sample invoice and define the boxes. Layouts also drift within a supplier — a new logo bumps the header block down, a new column pushes the table right, a row added to the line-item section shifts the totals below it. The system either fails quietly with wrong values in the right fields, or fails loudly with extraction errors that require the template to be rebuilt. A template-driven pipeline scales linearly with supplier count in engineering cost, which is why it stops being viable at the point invoice processing stops being someone's side task. This is also the regime that template-less invoice extraction exists to displace: the rest of the progression below consists of approaches that do not require a reference layout per supplier.
Schema-based extraction
The next step forward gave up the per-supplier template and replaced it with a target schema — the list of fields to extract and their expected types (Invoice Number as string, Invoice Date as ISO date, Total as decimal). Schema-based key-value extraction combines rules, lexical lookups, positional heuristics, and lightweight machine learning to locate each schema field on an invoice the system has never seen before. The system is no longer asking "what is the text at these coordinates?" — it is asking "where is the invoice number on this document?"
Layout tolerance improves substantially. Suppliers can redesign their invoices without breaking the contract, and new suppliers do not require per-supplier setup. For a portfolio of reasonably well-behaved documents, this regime closes a lot of the gap pure templates left open.
The failure mode is fields with ambiguous identity. Two dates on the page — the invoice date and the due date — both satisfy the schema's date rule; the system has to decide which is which using context that is not always explicit on the document. Two totals, one before and one after an early-payment discount, both match the grand-total pattern. Multi-currency invoices, where the line items run in one currency and the totals in another, break a schema that assumed one currency per document. A schema is a contract the document is expected to honor; invoices in the wild frequently fail to honor it, and each failure mode usually shows up first as a silent mis-extraction rather than an error.
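The target schema itself can be sketched as a set of typed field definitions. This is a hypothetical illustration of the contract idea, not any particular product's format:

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional

# A hypothetical target schema: the contract the document is expected
# to honor, independent of any supplier's layout.
@dataclass
class InvoiceSchema:
    invoice_number: str
    invoice_date: date
    vendor_name: str
    subtotal: Decimal
    tax: Decimal
    total: Decimal
    po_number: Optional[str] = None  # not every invoice carries one

def parse_amount(raw: str) -> Decimal:
    """Normalize a money string like '1,488.00' into an exact Decimal."""
    return Decimal(raw.replace(",", ""))

record = InvoiceSchema(
    invoice_number="INV-2024-0487",
    invoice_date=date(2024, 11, 4),
    vendor_name="Acme Supplies Ltd",
    subtotal=parse_amount("1,240.00"),
    tax=parse_amount("248.00"),
    total=parse_amount("1,488.00"),
)
```

Note what the schema cannot express: which of two dates on the page is `invoice_date`, or which of two candidate totals is `total`. That resolution is the extraction system's job, and it is where this regime fails.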
Layout-aware transformer models
The LayoutLM family and its successors added a structural idea that pure schema extraction did not have: feed the model not just the recognized text but the 2D position of every token on the page, and let it learn the relationship between textual and spatial signals. A layout-aware key-value extraction model reasons about visual structure the way a human reader does — the number at the bottom-right of a totals block is the grand total because of where it sits, not solely because of the word beside it. For semi-structured financial documents, where meaning is carried as much by position as by vocabulary, this closes a large fraction of the remaining gap.
The failure mode is more subtle. These models have traditionally been trained against a fixed set of expected keys, so they handle documents close to their training distribution and struggle when a document introduces a key they have never seen. Real supplier invoices routinely carry long-tail fields — retention percentages on a construction invoice, duty and customs line items on an import document, per-site meter reads on a consolidated utility bill — and a model trained to extract a canonical invoice schema will miss those.
The research community has named this limit directly: IBM Research released KVP10k, a benchmark dataset of over 10,000 richly annotated business-document pages designed to evaluate key-value pair extraction across diverse layouts without relying on predefined keys. The benchmark was built precisely because the previous generation of models was too anchored on enrolled schemas to handle real-world variety.
Vision LLM and prompt-based extraction
The regime currently reshaping the space takes a multimodal model — one that can read the visual content of a document page directly — and couples it with a natural-language prompt that describes what to extract. The operator writes something like "Extract invoice number, invoice date, vendor name, subtotal, tax, total" and the model returns the labeled values. The template, the schema, and the layout-specific model training all collapse into a single inference step, and the model inherits its visual and language understanding from pretraining, so it can handle layouts and keys the operator never had to pre-define.
For an invoice-processing team this is the first regime where a new supplier does not require engineering work at all. It is also the regime that handles long-tail fields directly — the prompt is the configuration, and asking for "and any retention percentage in the footer" is now a viable way to pick up a field the system has never seen.
In practice the operator workflow looks like this. A prompt-based invoice extraction product takes the uploaded invoice plus a natural-language instruction — something as terse as a field list ("Extract Invoice Number, Invoice Date, Vendor Legal Name, Net Amount, VAT Rate, VAT Amount, Total") or as detailed as a full set of formatting and classification rules — and returns the labeled values as a structured spreadsheet, CSV, or JSON file. The same prompt re-applied to the next batch produces the same output shape. That prompt-as-configuration pattern is what distinguishes this regime operationally from the ones above it.
The failure modes are different in character from the earlier regimes. Raw cost per page runs higher than template or schema extraction at equivalent volume, and inference latency on large batches is what forces most production pipelines into asynchronous job queues rather than synchronous request-response. Consistency is the subtler issue: temperature, prompt variation, and output-format drift can all produce different structured output from the same invoice on the same day in a naive prompt-and-model setup. Production-grade use in finance workflows depends on the engineering around the model — schema enforcement, consistent output formatting, validation, retry logic — as much as on the model itself. Tables, multi-value fields, and pathological layouts still need specific handling. This regime is reshaping the space, but it is not a universal solution.
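The engineering around the model can be sketched as a thin wrapper that enforces the output schema and retries on malformed responses. `call_model` here is a hypothetical callable standing in for whatever vision-model API the pipeline uses; the field list and retry policy are illustrative:

```python
import json

REQUIRED_FIELDS = {"invoice_number", "invoice_date", "vendor_name",
                   "subtotal", "tax", "total"}

PROMPT = (
    "Extract invoice number, invoice date, vendor name, subtotal, tax, "
    "and total from this invoice. Respond with a single JSON object "
    "using exactly these keys: " + ", ".join(sorted(REQUIRED_FIELDS))
)

def extract_with_retries(call_model, page_image, max_attempts=3):
    """Wrap a vision-model call with schema enforcement and retry logic.

    call_model is a hypothetical callable (page_image, prompt) -> raw text.
    """
    for _ in range(max_attempts):
        raw = call_model(page_image, PROMPT)
        try:
            fields = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry rather than propagate
        if REQUIRED_FIELDS.issubset(fields):
            return fields  # all required keys present
        # missing keys: fall through and retry
    raise ValueError(f"no valid extraction after {max_attempts} attempts")
```

The wrapper is where reproducibility and format drift get contained; without it, the same invoice can yield differently shaped output from one run to the next.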
Where Key-Value Extraction Stops Being Enough on Real Invoices
The KVP shape — one key, one value — does not fit every piece of data an invoice actually carries. Recognizing which parts of the invoice exceed that shape is what separates a team that planned for the limits of its approach from a team that hits those limits in production.
Line items are rows, not pairs
The line-item table on an invoice is a repeating structure. Each row carries its own description, quantity, unit price, line total, optionally a per-line tax amount, optionally a SKU or product code. A single "Line Items" key does not capture this. Each line item is its own record with its own fields, and the number of records runs from two on a paper-stationery invoice to several hundred on a wholesale bill. Extracting line items reliably is a tabular extraction problem, not a KVP one, and it needs its own handling path. For how that handling works in practice — row identification, column alignment, carry-forward headers across multi-page tables — see our breakdown of invoice line item extraction.
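Represented in code, line items are a list of records, each with its own fields and its own arithmetic. The second row below is hypothetical (the sample invoice's remaining rows were never shown); the shape is the point:

```python
from decimal import Decimal

# Line items are a repeating structure: a list of records, not one pair.
# The second row is a hypothetical filler consistent with the sample
# invoice's subtotal of 1,240.00.
line_items = [
    {"description": "A4 Paper 80gsm", "quantity": 20,
     "unit_price": Decimal("12.50"), "line_total": Decimal("250.00")},
    {"description": "Stapler, heavy duty", "quantity": 9,
     "unit_price": Decimal("110.00"), "line_total": Decimal("990.00")},
]

# Each row supports its own arithmetic check: quantity x unit price.
for row in line_items:
    assert row["quantity"] * row["unit_price"] == row["line_total"]
```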
Multi-value fields
Many invoice fields that look singular at first glance are multi-valued in practice. Tax is the most common example. A UK invoice carrying some goods at the standard VAT rate and others at the zero rate has two tax rows and two tax totals; a single "Tax" key picks one and silently discards the other, which is the kind of error that reaches the finance team three weeks later as a reconciliation gap. Separate billing and shipping addresses on the same invoice break a single "Address" key. Consolidated invoices that cover several purchase orders break a single "PO Number" key. The extraction output has to represent these as lists or sub-records; a naive flat KVP schema either truncates to the first value or picks one of the matches arbitrarily.
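The truncation failure is structural, not a bug in any one system: a flat key-value mapping can hold one value per key, so loading a second tax row silently replaces the first. The amounts below are illustrative:

```python
# A flat KVP mapping holds exactly one value per key: loading two tax
# rows into it silently keeps only the last one written.
flat = {}
flat["tax"] = {"rate": "20%", "amount": "200.00"}
flat["tax"] = {"rate": "0%", "amount": "0.00"}   # overwrites the 20% row

assert len(flat) == 1  # one tax row has vanished without any error

# Representing the field as a list of sub-records keeps both rows.
structured = {
    "tax": [
        {"rate": "20%", "amount": "200.00"},
        {"rate": "0%", "amount": "0.00"},
    ]
}
```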
Implicit or position-coded keys
Some invoices do not explicitly label their fields at all. The grand total may be the number in the bottom-right cell of the totals block with no "Total" label beside it, recognizable only by its position, its font size, and its role at the end of an arithmetic chain. The invoice number may sit in a header block, distinguished from a sales-order reference only by formatting convention. Where the key exists as a layout convention rather than as written text, KVP extraction demands that the system reason about position and structure, not only about matching labels. This is where template and layout-aware regimes earn their keep, and where a literal key-match approach falls over.
Cross-field consistency KVP does not cover
KVP extraction returns fields; it does not, on its own, guarantee that the sum of line totals plus tax equals the grand total, that the currency on the invoice matches the vendor country, that the invoice date precedes the due date, or that the PO number exists in the approved-POs table. Those are validation properties, and they live downstream of the extraction step. A KVP layer can return clean pairs and still hand downstream validation a document that fails every rule.
Where KVP fits
KVP extraction handles invoice-level scalar fields well. The pieces that exceed its shape — line-item tables, multi-value fields, implicit keys, cross-field rules — belong to adjacent capabilities in the same pipeline, not to a KVP layer stretched to absorb them.
From Extracted Pairs to an AP Workflow That Actually Runs
A KVP extraction output that never reaches a validation rule, a review queue, or an ERP posting is a screenshot, not a workflow. Extraction is the first stage in a pipeline that exists to drive action — paying suppliers, closing the month, reconciling spend against budget — and the stages downstream of it determine whether the labeled fields actually translate into anything useful.
Field-level validation
Once fields are labeled, rule-based validation becomes reachable. Invoice date must parse as a date. Total must be positive. Vendor name must match an entry in the approved-supplier table. The sum of line totals plus tax must equal the grand total within tolerance. The VAT rate must correspond to a current tax-jurisdiction rate. These rules are trivial to write against labeled output and impossible to write against unlabeled OCR text — the labels are what make every rule addressable.
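A minimal sketch of such a rule set, assuming the labeled fields have already been parsed into native types (rule set and field names are illustrative, not exhaustive):

```python
from datetime import date
from decimal import Decimal

def validate_invoice(fields, approved_vendors, tolerance=Decimal("0.01")):
    """Return a list of rule failures for one extracted invoice.

    fields is a dict of already-typed values; approved_vendors is a set.
    """
    failures = []
    if fields["total"] <= 0:
        failures.append("total must be positive")
    if fields["vendor_name"] not in approved_vendors:
        failures.append("vendor not in approved-supplier table")
    expected = fields["subtotal"] + fields["tax"]
    if abs(expected - fields["total"]) > tolerance:
        failures.append(
            f"subtotal + tax = {expected}, but total = {fields['total']}"
        )
    if fields["invoice_date"] > fields["due_date"]:
        failures.append("invoice date falls after due date")
    return failures
```

Every rule addresses a field by name, which is precisely what unlabeled OCR text cannot support.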
The rule set belongs somewhere in the pipeline. Some teams run it inside the extraction service; others run it in a dedicated validation layer that consumes extraction output and emits a pass/fail verdict per invoice. The layering is an implementation choice that turns on how the rest of the stack is organized. For the patterns that work when this gets built programmatically, see how to validate extracted invoice data in an API workflow.
Confidence thresholds
Extraction systems that are worth using return a confidence score per field — a numeric indicator of how certain the extraction was about a particular value. The threshold set on that confidence determines which invoices flow through straight-through processing and which land in a review queue. Too low, and invoices with wrong data post to the ERP because nothing flagged them. Too high, and reviewers see invoices the system was essentially sure about, which wastes their time and creates its own backlog.
The right threshold is not a constant. Fields with low tolerance for error — grand totals, vendor identity, currency, bank account details on payment instructions — warrant stricter thresholds than lower-consequence fields like descriptions or internal notes. Some teams tune thresholds per field type; others tune per supplier, giving well-behaved suppliers more latitude than a new or problematic one. Either way, the system needs to return the confidence signal in the first place, which is a structural property of the extraction layer rather than something that can be bolted on afterwards.
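Per-field thresholds can be sketched as a simple routing table; the threshold values below are illustrative, not recommendations:

```python
# High-consequence fields demand more certainty before an invoice
# qualifies for straight-through processing. Values are illustrative.
THRESHOLDS = {
    "total": 0.98,
    "vendor_name": 0.97,
    "currency": 0.97,
    "bank_account": 0.99,
    "description": 0.80,  # low-consequence field gets more latitude
}
DEFAULT_THRESHOLD = 0.90

def route(field_confidences):
    """Return the field names that fall below their threshold.

    An empty list means the invoice can flow straight through; a
    non-empty list means it lands in the review queue with those
    fields flagged.
    """
    return [
        name for name, conf in field_confidences.items()
        if conf < THRESHOLDS.get(name, DEFAULT_THRESHOLD)
    ]
```

Tuning per supplier instead of per field type would simply key the threshold table on (supplier, field) pairs; the routing logic stays the same.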
Review-by-exception routing
When a field fails a validation rule or falls below threshold, the invoice routes to a human reviewer. The reviewer sees the extracted fields, the source document, and the specific failures, fixes what needs fixing, approves, and the invoice continues through the pipeline. The goal of the extraction and validation layers together is to keep this queue small — not empty, because some documents genuinely need human judgment, but scaled to the documents that actually benefit from a human looking at them rather than the ones the system could have resolved alone.
How review queues get structured in practice — field-level review versus whole-invoice review, routing rules that send different exception types to different reviewers, the feedback loop from reviewer corrections back into threshold tuning — sits in a broader treatment of invoice OCR error handling and review routing. The quality of the KVP extraction determines the shape and size of this queue, which in turn determines the operational cost of the whole workflow.
ERP and spreadsheet handoff
The terminal stage for most invoice extraction work is either an ERP posting — into an AP subledger, against GL codes, into an approval workflow — or a structured spreadsheet that the finance team reviews, reconciles, and imports into its systems. Both consume labeled fields by name. The ERP expects Invoice Number, Invoice Date, Net Amount, Tax Amount, Total, Vendor ID, GL Code; the spreadsheet expects columns that match those names and data types the tool can work with directly.
This is where the shape of the KVP output and the downstream consumer's schema have to be reasoned about together. A prompt-based invoice extraction product produces its labeled spreadsheet — Excel, CSV, or JSON — keyed by the exact field names the operator defines in the prompt, which is the handoff that AI-powered invoice data extraction is built to feed.
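The handoff itself is a mapping problem: extraction-layer field names on one side, the consumer's column names on the other. A minimal sketch, with both name sets illustrative:

```python
import csv
import io

# Map extraction-layer field names onto the column names the downstream
# consumer expects. Both sides of this mapping are illustrative.
ERP_COLUMNS = {
    "invoice_number": "Invoice Number",
    "invoice_date": "Invoice Date",
    "subtotal": "Net Amount",
    "tax": "Tax Amount",
    "total": "Total",
}

def to_erp_csv(invoices):
    """Serialize a batch of extracted invoices as a CSV string whose
    header row uses the ERP-facing column names."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(ERP_COLUMNS.values()))
    writer.writeheader()
    for inv in invoices:
        writer.writerow({col: inv.get(field, "")
                         for field, col in ERP_COLUMNS.items()})
    return buf.getvalue()
```

When the extraction layer is prompt-configured, the left-hand side of this mapping is whatever field names the prompt defined, which is what turns the fit into configuration rather than engineering.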
Evaluating a KVP Approach for Your Invoices
The criteria that matter when weighing a KVP approach — whether you are building in-house, buying a platform, or integrating an API — are not the ones in a generic document-AI buyer's guide. They follow directly from the failure modes and downstream requirements covered above, and they are specific enough to apply against your own document set rather than taking a vendor's pre-selected examples on trust.
Accuracy on messy layouts, not pristine ones
Ask the vendor, or your own build, to process a sample of real invoices from your actual supplier mix. Include the scanned ones, the ones with handwritten notes in the margin, the ones where the layout changes quarterly, the phone-photo submissions, the ones where the header got cropped. A system that performs well on a clean reference invoice and poorly on your actual document mix is not fit for purpose. Head-to-head testing on documents you already have tells you more than any curated vendor benchmark.
Line-item and repeating-group behaviour
Does the approach handle line items at all, and how does it represent invoices with multiple tax rates, split shipping and billing addresses, or consolidated billing across several POs? These are the cases vendor demos gloss past and production use hits routinely. A vendor that cannot cleanly demonstrate multi-tax-rate or multi-address handling on a document you control is telling you something about the depth of the underlying system.
Confidence reporting and validation surface
Does the system report a per-field confidence score? Does the output format support arbitrary validation rules, or does it force your validation layer to re-infer structure from unlabeled strings? Is the vendor's own validation layer, if any, configurable to your supplier-specific rules? An extraction approach that returns values with no confidence signal pushes the entire downstream validation burden onto your own code, which is workable at low volume and expensive at scale.
Review workflow
When a field fails, the reviewer needs a side-by-side view of the source document and the extracted fields, corrections that feed back into threshold tuning or retraining signals, and routing that scales to a multi-reviewer team with exception-type rules rather than one queue everyone pulls from. A review surface that requires a human to retype every flagged field is a bottleneck that does not go away as extraction accuracy improves.
Consistency and reproducibility
The same invoice submitted twice to the same approach, with the same configuration, should produce the same extracted output. In template and schema regimes this is close to automatic; in prompt-based regimes, temperature, prompt variation, and output-format drift can all produce different structured output from the same invoice on the same day. Ask the vendor how they enforce reproducibility, or if building in-house, test it yourself against a representative sample before relying on the result in anything that posts to a general ledger.
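Testing this yourself is mechanical: run the sample twice and diff the field values. A minimal sketch of the comparison step, assuming each run returns a flat dict of extracted fields:

```python
def diff_runs(run_a, run_b):
    """Compare two extraction outputs for the same invoice and return
    the fields whose values differ.

    An empty dict means the two runs reproduced each other exactly.
    """
    keys = set(run_a) | set(run_b)
    return {
        k: (run_a.get(k), run_b.get(k))
        for k in keys
        if run_a.get(k) != run_b.get(k)
    }
```

Run this across the whole representative sample, not one document: drift that appears on one invoice in twenty is exactly the kind of inconsistency that surfaces later as a ledger discrepancy.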
Fit to your downstream consumer
Does the output shape match what your ERP, spreadsheet, or integration target actually expects — field names, data types, multi-value representation, currency handling, null conventions on optional fields? A prompt-based extraction product purpose-built for finance workflows outputs a spreadsheet or JSON keyed by the exact field names the operator defines in the prompt itself, which turns the fit into a configuration choice rather than an engineering project.
What to do next
The single most useful next step for a team weighing KVP approaches is not reading more benchmark pages. It is running a head-to-head trial on a representative sample of your own invoices — twenty to fifty documents that span the supplier mix, the layout variability, and the edge cases that actually exist in your pipeline — and scoring each approach against the criteria above. The vendors that handle those documents well on the first pass, with labeled output in a shape your downstream systems can consume, are the ones worth a longer conversation. The ones that do not, regardless of the polish of the demo, are not.