For bookkeeping, capture supplier-invoice fields in five categories: header fields for posting, line-level fields for allocation and matching, control fields for payment integrity, tax fields for VAT/GST and statutory reporting, and audit-trace fields that connect each posted record back to the source document.
The core extractable fields are straightforward. Header capture needs the invoice number, invoice date, supplier identity, buyer entity, net/tax/gross totals, currency, due date, and payment terms. Line-level capture needs description, quantity, unit price, line total, line tax rate, and any product, SKU, project, or cost-center reference the team actually uses. Control capture needs PO numbers, GRN or delivery-note references, contract references, bank details, supplier IDs, and payment-discount terms. Tax capture needs rate breakdowns, statutory tax IDs, reverse-charge or exempt flags, and e-invoicing identifiers where they apply. Audit-trace capture needs the source-file reference, page number, ingestion timestamp, extraction confidence, and version history.
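The five categories can be pictured as one record with five parts. The sketch below uses Python dataclasses to show that shape; every class and field name is illustrative rather than taken from any ERP or standard, and a real schema would keep only the fields the team's workflow actually consumes.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from decimal import Decimal
from typing import Optional


@dataclass
class Header:
    invoice_number: str
    invoice_date: date
    supplier_id: str
    buyer_entity: str
    currency: str  # ISO 4217 code
    net_total: Decimal
    tax_total: Decimal
    gross_total: Decimal
    due_date: Optional[date] = None
    payment_terms: Optional[str] = None


@dataclass
class Line:
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal
    tax_rate: Optional[Decimal] = None
    reference: Optional[str] = None  # SKU, project, or cost center, if used


@dataclass
class Control:
    po_number: Optional[str] = None
    receipt_reference: Optional[str] = None  # GRN or delivery note
    contract_reference: Optional[str] = None
    bank_details: Optional[str] = None
    discount_terms: Optional[str] = None


@dataclass
class Tax:
    rate_breakdown: dict = field(default_factory=dict)  # rate -> tax amount
    supplier_tax_id: Optional[str] = None
    reverse_charge: bool = False
    einvoice_id: Optional[str] = None


@dataclass
class AuditTrace:
    source_file: str
    page: int
    ingested_at: datetime
    confidence: float
    version: int = 1


@dataclass
class InvoiceRecord:
    header: Header
    trace: AuditTrace
    lines: list = field(default_factory=list)
    control: Control = field(default_factory=Control)
    tax: Tax = field(default_factory=Tax)
```

Header and audit-trace parts are mandatory in this sketch because a record cannot be posted or defended without them; the other three parts default to empty so that a header-only invoice is still a valid record.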
The categories matter because each missing field breaks a different job. Missing header fields stop posting. Missing line fields force blanket coding or break line-level matching. Missing control fields let wrong payments move too far through the workflow. Missing tax fields weaken the VAT/GST return and the audit file. Missing audit-trace fields leave AP unable to prove where a posted value came from after the document has moved through extraction, review, and posting.
This split is grounded in established invoice standards. The OASIS UBL 2.1 invoice schema includes invoice-level identifiers and references, invoice lines, tax totals, and allowance/charge structures, which map naturally to the same header, line, tax, and control decisions an in-house AP schema has to make. The same shape shows up across ERP screens, e-invoicing formats, and AP automation tooling because the underlying bookkeeping work is the same.
If you need a flat field-by-field sample rather than a schema decision framework, use the field-by-field invoice data entry walk-through as the companion reference.
Must-Capture Header Fields: The Minimum for an AP Record to Exist
Header fields create the AP record. Without them, there is nothing to post, match, pay, age, or retrieve for audit.
Capture these header fields:
- Invoice number
- Invoice date and, where shown, tax point or supply date
- Supplier legal name and supplier ID
- Buyer legal entity
- Currency
- Net, tax, and gross totals
- Due date
- Payment terms
- Supplier tax ID and bank details when they appear on the invoice
The bookkeeping decision is invoice posting. A wrong invoice date can put the invoice in the wrong VAT or GST period. Loose supplier identity capture can create duplicate supplier accounts, which then breaks supplier-statement reconciliation. Missing net or tax totals force the bookkeeper to derive them from line items, which fails when supplier rounding means the line sum does not tie cleanly to the printed header.
The buyer entity is easy to ignore in small companies and costly to miss in groups. If the same AP team processes invoices for multiple legal entities, the "bill to" entity decides which ledger, VAT registration, approval route, and bank account the invoice belongs to. Payment terms need the same care: capture them deliberately from the document or the supplier master rather than assuming a default. A 30-day default applied to a supplier with negotiated 14-day terms can create avoidable overdue balances; a discount term missed at capture can leave cash savings invisible until it is too late to act.
Normalize header fields before they reach the ledger. Dates should land as ISO YYYY-MM-DD after the supplier locale has been interpreted. Currency should land as an ISO 4217 code such as USD, EUR, or GBP, not as a printed symbol. Supplier identity should resolve to the buyer-side supplier code or a tax ID where one exists, rather than to whichever printed name appears on the document. Invoice numbers need one consistent rule for leading zeros, separators, and supplier prefixes so duplicate-payment checks compare like with like.
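The normalization rules above can be sketched as three small functions. This is a minimal illustration, not a complete implementation: the function names, the symbol map, and the specific invoice-number rule (uppercase, drop separators, strip leading zeros in digit runs) are assumptions, and the supplier's date format is assumed to be known from the supplier master.

```python
import re
from datetime import datetime

# Assumed symbol map for illustration. "$" is ambiguous in practice (USD, AUD,
# CAD, and others), so a real mapping would be resolved per supplier or country.
SYMBOL_TO_ISO = {"$": "USD", "€": "EUR", "£": "GBP"}


def normalize_date(raw: str, supplier_format: str) -> str:
    """Parse with the supplier's known date format; return ISO YYYY-MM-DD."""
    return datetime.strptime(raw.strip(), supplier_format).date().isoformat()


def normalize_currency(raw: str) -> str:
    """Return a three-letter code; does not validate against the ISO 4217 list."""
    token = raw.strip()
    if token in SYMBOL_TO_ISO:
        return SYMBOL_TO_ISO[token]
    token = token.upper()
    if re.fullmatch(r"[A-Z]{3}", token):
        return token
    raise ValueError(f"unrecognized currency: {raw!r}")


def normalize_invoice_number(raw: str) -> str:
    """Assumed rule: uppercase, drop separators, strip leading zeros in digit runs."""
    cleaned = re.sub(r"[^A-Z0-9]", "", raw.upper())
    return re.sub(r"(?<!\d)0+(?=\d)", "", cleaned)
```

The same printed date, 03/04/2025, becomes 2025-04-03 under a day-first format and 2025-03-04 under a month-first format, which is why the supplier locale has to be interpreted before the value reaches the ledger.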
Header capture is not the same as intake. The invoice still has to reach AP before any field decision can be applied; the digital mailroom for AP invoice intake covers that upstream handoff.
Line-Level Fields: When Header Capture Is Not Enough
Line-level capture is needed when the invoice cannot be posted, matched, or reported from header totals alone. It is optional for a single-category recurring service invoice; it is required when lines drive coding, matching, tax treatment, or project allocation.
Capture these line fields when the downstream workflow uses them:
- Line description
- Quantity
- Unit price
- Line total
- Line tax rate and line tax amount
- Product, SKU, or service code
- PO line, job, project, cost center, or account hint
The decision rule is practical. Header-only capture works when the whole invoice posts to one GL account, one cost center, and one tax treatment, and needs no PO-line match. Line capture is required when different lines code to different accounts, when three-way matching runs at PO-line resolution, when a document mixes tax rates, or when project/job profitability depends on each line carrying its own cost attribution.
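The decision rule can be written down as a single predicate. The sketch below is one hypothetical encoding; the `InvoiceProfile` fields are invented names for facts the team would know about a supplier or invoice type, not values read from any system.

```python
from dataclasses import dataclass


@dataclass
class InvoiceProfile:
    gl_accounts: int           # distinct GL accounts the invoice codes to
    cost_centers: int          # distinct cost centers
    tax_rates: int             # distinct tax rates on the document
    po_line_matching: bool     # three-way match runs at PO-line level
    job_costing_by_line: bool  # project/job cost is attributed per line


def needs_line_capture(profile: InvoiceProfile) -> bool:
    """Header-only capture is enough only when every condition is false."""
    return (
        profile.gl_accounts > 1
        or profile.cost_centers > 1
        or profile.tax_rates > 1
        or profile.po_line_matching
        or profile.job_costing_by_line
    )
```

A single-category recurring service invoice would evaluate to `False`; a mixed-rate invoice evaluates to `True` even if everything else is uniform.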
The common failure is hidden until month end. A header-only invoice can look balanced while the costs are in the wrong GL account, the PO match is satisfied at the wrong level, or the mixed-rate VAT/GST detail cannot support the return. Line capture also matters when a supplier rolls freight, discounts, deposits, or credits into the document in ways that affect margin or tax treatment.
The optional fields should stay optional. Capturing SKU, product code, project code, or account hints on every supplier invoice only helps when the downstream system uses those values. A facilities invoice with one service line may not need a product code at all. A construction materials invoice may need SKU and quantity because the job-costing system reconciles materials by line. The schema should make that distinction explicit instead of treating every field printed in a table as automatically worth extracting.
Normalize line data with the same discipline as header data. Unit-of-measure values such as each, EA, pcs, kg, and hrs need canonical treatment where the receiving system depends on them. Rounding rules must say whether header tax or summed line tax is authoritative when they differ by a cent. Discount handling should preserve enough detail to reproduce the supplier's arithmetic when a header discount affects multiple lines.
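A rounding rule of the kind described above might look like the following sketch. The one-cent tolerance and the choice of the header as the default authority are assumptions for illustration; the point is that the rule is explicit and that anything beyond tolerance is raised as an exception rather than silently adjusted.

```python
from decimal import Decimal


def reconcile_tax(header_tax, line_taxes, tolerance=Decimal("0.01"),
                  authoritative="header"):
    """Return the authoritative tax amount, or raise if the gap is too large.

    header_tax: tax total printed in the invoice header
    line_taxes: tax amounts printed on each line
    """
    summed = sum(line_taxes, Decimal("0"))
    if abs(header_tax - summed) > tolerance:
        raise ValueError(
            f"header tax {header_tax} and line tax {summed} differ beyond tolerance"
        )
    return header_tax if authoritative == "header" else summed
```

With a header tax of 4.01 and two lines of 2.00 each, the function returns 4.01 under the header-first rule and 4.00 under the line-first rule; a five-cent gap raises an error for review.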
Getting line data out of supplier tables is its own extraction problem; the table-specific mechanics are covered in extracting invoice line items from table data.
Control and Matching Fields: Stopping Wrong Payments Before They Happen
Control and matching fields exist because AP pays money out. Any serious AP schema has to capture the fields that stop wrong payments before they are posted or released.
Capture these control fields:
- PO number and, where needed, PO line reference
- GRN, delivery-note, or service-entry reference
- Contract reference
- Supplier ID
- Supplier bank account details
- Payment terms, net days, and early-payment discount terms
These fields serve matching, duplicate-payment detection, fraud control, and cash-flow timing. PO and receipt references tie the invoice to what was ordered and received. Supplier ID, invoice number, date, and amount support duplicate checks. Structured bank details let AP compare the payment destination on the invoice against the supplier master before a payment run. Payment terms decide when the invoice should be scheduled and whether an early-payment discount is worth taking.
The failure modes are direct. A PO match fails when one side stores PO-00042 and the other stores PO42. A line-level PO reference captured only at the header can match the wrong receipt line while still appearing clean at the document level. A duplicate invoice can slip through when the same document arrives once as a PDF and once in an email body, with two different invoice-number formats. A fraudulent bank-detail change becomes harder to catch when the account number lives in a footer text blob instead of in structured fields.
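The PO-00042 versus PO42 mismatch is avoidable if both sides pass through the same canonical form before comparison. A minimal sketch follows, assuming PO numbers are an optional letter prefix followed by digits; the function name and the rule itself are illustrative, and values that do not fit the assumed shape are returned cleaned but otherwise untouched so they can be reviewed.

```python
import re


def canonical_po(raw: str) -> str:
    """Uppercase, drop separators, and strip zero padding from the numeric part."""
    cleaned = re.sub(r"[^A-Z0-9]", "", raw.upper())
    match = re.fullmatch(r"([A-Z]*)0*(\d+)", cleaned)
    if match is None:
        return cleaned  # unexpected shape: keep the cleaned value for review
    prefix, number = match.groups()
    return prefix + number
```

Both `canonical_po("PO-00042")` and `canonical_po("po42")` produce `"PO42"`, so the match compares like with like.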
Control capture also needs a source-of-truth rule. Supplier bank details printed on an invoice should not automatically overwrite the supplier master; they should be captured as evidence for comparison. A contract reference should not replace the PO number if the buyer's matching workflow is PO-led. A delivery-note or GRN reference should be captured only when the receiving process uses it to clear the invoice exception. Otherwise AP collects fields that look controlled but do not drive any control.
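The "evidence for comparison" rule for bank details can be sketched as a comparison that never writes to the supplier master. The `BankDetails` shape, the status strings, and the account values in the usage note are hypothetical; real bank details would be split into whatever components the payment country requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BankDetails:
    account_name: str
    account_number: str  # IBAN or local account number, as applicable


def _clean(value: str) -> str:
    """Ignore spacing and case so formatting differences do not look like changes."""
    return "".join(value.split()).upper()


def bank_detail_status(on_invoice: Optional[BankDetails],
                       master: BankDetails) -> str:
    """Compare captured details to the master; the master is never modified."""
    if on_invoice is None:
        return "not_printed"
    if _clean(on_invoice.account_number) == _clean(master.account_number):
        return "match"
    return "manual_approval"
```

For example, an invoice printing `AB12 3456 7890` against a master record of `ab1234567890` returns `"match"`, while any other account number returns `"manual_approval"` and leaves the master as it was.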
Normalize control fields aggressively. Decide how PO numbers are canonicalized, which reference is the matching anchor when both contract and PO numbers appear, and which bank-detail components must be captured separately. Then define the validation rule that consumes the field: what counts as a valid PO match, what triggers a duplicate-payment exception, and what routes a bank-detail change for manual approval. The invoice validation process and checklist covers that consumption layer.
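As one example of a validation rule that consumes normalized fields, a duplicate-payment check can key on supplier ID, canonical invoice number, and gross amount. The sketch below is an assumption about how such a rule could be expressed; the dictionary keys (`doc_id`, `supplier_id`, `invoice_number`, `gross_total`) and the choice of key fields are illustrative, and a production rule would likely add date and fuzzier matching.

```python
import re
from decimal import Decimal


def duplicate_key(supplier_id: str, invoice_number: str, gross_total: Decimal):
    """Build a comparison key from normalized values (assumed rule)."""
    number = re.sub(r"[^A-Z0-9]", "", invoice_number.upper())
    number = re.sub(r"(?<!\d)0+(?=\d)", "", number)  # strip zero padding
    return (supplier_id.strip().upper(), number,
            gross_total.quantize(Decimal("0.01")))


def find_duplicates(invoices):
    """Return (first_doc_id, duplicate_doc_id) pairs for repeated keys."""
    seen = {}
    duplicates = []
    for inv in invoices:
        key = duplicate_key(inv["supplier_id"], inv["invoice_number"],
                            inv["gross_total"])
        if key in seen:
            duplicates.append((seen[key], inv["doc_id"]))
        else:
            seen[key] = inv["doc_id"]
    return duplicates
```

An invoice captured from a PDF as `INV-0042` and again from an email body as `inv42` resolves to the same key, so the second arrival is flagged instead of paid.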
Tax and Compliance Fields: VAT Returns, Reverse Charges, and Statutory IDs
Tax and compliance fields belong in the AP capture schema, not in a separate tax checklist, because the consequences are monetary and audit-visible: a missed input-VAT reclaim costs cash, a mishandled reverse charge triggers a return correction, and a missing statutory ID can become an audit finding months later.
Capture these tax and compliance fields:
- Tax rate per line and per rate group
- Header-level tax breakdown by rate
- Supplier statutory tax ID, such as VAT number, ABN, GSTIN, EIN, or local equivalent
- Reverse-charge, zero-rate, exempt, or out-of-scope flag when shown
- Place-of-supply evidence for cross-border transactions
- E-invoicing identifiers, Peppol IDs, or structured-format references where required
- Withholding-tax indicators where the buyer must withhold before payment
The bookkeeping decisions cluster around the periodic return and the audit file. VAT or GST returns need rate-level breakdowns, not just one header tax amount. Input-tax reclaim depends on retaining the supplier's tax ID against the posted invoice. Reverse-charge accounting depends on capturing the document evidence that tells the buyer to self-account. Withholding tax depends on knowing which supplier and line types fall under the rule.
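The rate-level breakdown a return needs can be built by grouping captured line amounts by rate. The sketch below assumes each line carries its printed net and tax amounts and a rate expressed as a decimal fraction; the function name and tuple layout are illustrative. It sums what the supplier printed rather than recomputing tax.

```python
from collections import defaultdict
from decimal import Decimal


def breakdown_by_rate(lines):
    """Group captured amounts by tax rate.

    lines: iterable of (rate, net, tax) tuples of Decimal values.
    Returns {rate: {"net": total_net, "tax": total_tax}}.
    """
    totals = defaultdict(lambda: [Decimal("0"), Decimal("0")])
    for rate, net, tax in lines:
        totals[rate][0] += net
        totals[rate][1] += tax
    return {rate: {"net": net, "tax": tax} for rate, (net, tax) in totals.items()}
```

A mixed-rate invoice with two standard-rate lines and one reduced-rate line yields two groups, which is the detail a single header tax amount cannot supply.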
The failure pattern is usually not a dramatic posting error. It is a tax return prepared from incomplete evidence: a reclaim held back because the supplier VAT number was not captured, a mixed-rate invoice reported from one aggregate tax value, or a reverse-charge service invoice posted as ordinary input tax because the flag stayed in a PDF footer. For AP teams, the question is not whether the tax rule lives in the invoice schema. It is whether the fields needed to apply that rule survive the handoff from document to ledger.
When captured tax fields need to become posting logic, link them to the tax-code assignment, reverse-charge, and jurisdiction-specific validation rules the AP team actually uses. Examples include automated invoice tax-code assignment, NZ reverse-charge GST on imported services, Brazil CFOP and NCM codes on supplier invoices, and PDF invoice conversion to Peppol PINT A-NZ.
Normalize tax fields before the return engine sees them. A rate captured as 20, 20%, 0.20, or 0.2 may represent the same tax rate, but downstream systems treat those values differently. The schema should pick one canonical representation and convert at capture. It should also distinguish between tax IDs that serve different legal purposes, such as an EU VAT number versus a domestic registration number printed on the same supplier invoice.
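Converting to one canonical rate representation at capture could look like this sketch, which picks a decimal fraction as the canonical form. The heuristic is an assumption and is stated in the code: a trailing percent sign or a bare value above 1 is read as a percentage. Rates below one percent printed without a percent sign would be misread under this rule and need a supplier-specific override.

```python
from decimal import Decimal


def normalize_rate(raw: str) -> Decimal:
    """Return the rate as a decimal fraction, e.g. Decimal('0.2') for 20%.

    Assumed heuristic: "20%" and bare "20" are percentages; "0.20" and
    "0.2" are already fractions.
    """
    text = raw.strip()
    is_percent = text.endswith("%")
    value = Decimal(text.rstrip("%").strip())
    if is_percent or value > 1:
        value = value / Decimal("100")
    return value.normalize()  # drop trailing zeros so 0.20 and 0.2 store alike
```

All four spellings from the paragraph above resolve to the same stored value, so grouping by rate no longer splits one tax rate into several buckets.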
E-invoicing can move some schema decisions outside the finance team. Where a structured format is mandated, the external format defines many fields; the AP schema still has to preserve the values the ledger, tax return, audit file, and reconciliation process consume.
Audit-Trace Fields: Connecting Every Posted Record Back to a Document
Audit-trace fields explain where the posted data came from. They are not usually printed on the supplier invoice, but they are essential when AP has to prove, review, or correct a posting later.
Capture these audit-trace fields:
- Source-file name or document ID
- Page number or page range
- Ingestion timestamp
- Extraction run ID or version
- Extraction confidence by field or record
- Human review status and reviewer identity where a manual correction was made
These fields serve retrieval and accountability. If a supplier disputes a deduction, AP needs the posted record to point back to the exact page that showed the line. If an auditor samples a transaction, the team needs the source file and the version of the extracted data that produced the posting. If a reviewer corrected a field, the record should show what changed and who approved it.
Most field lists skip audit-trace fields because they are produced by the workflow, not printed on the invoice: intake assigns the source-file reference and the ingestion timestamp, extraction assigns the run ID and confidence, and review records who corrected what. That does not make them optional. Without them, the team can have accurate-looking ledger data that cannot be defended when someone asks where it came from.
Field-level confidence is most useful when it triggers action. A low-confidence invoice number may require review before posting because it affects duplicate detection. A low-confidence line description may be acceptable if the PO match and totals are clean. A schema that stores confidence only as one document-level score loses that distinction; a schema that stores field-level confidence but never routes exceptions creates noise.
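Routing on field-level confidence can be sketched as follows. The thresholds, the set of critical fields, and the function name are assumptions for illustration on a 0-1 scale; the structure mirrors the distinction above, where critical fields are always checked and non-critical fields are flagged only when the PO match or totals are not clean.

```python
# Assumed thresholds on a 0-1 scale; tune them to the team's own error rates.
CRITICAL_THRESHOLDS = {
    "invoice_number": 0.95,  # drives duplicate detection
    "gross_total": 0.95,
    "supplier_id": 0.90,
}
SOFT_THRESHOLD = 0.60


def fields_to_review(confidence, po_match_clean, totals_clean):
    """Return the field names that should be routed for manual review.

    confidence: dict mapping field name to extraction confidence (0-1).
    """
    checks_clean = po_match_clean and totals_clean
    flagged = []
    for name, score in confidence.items():
        if name in CRITICAL_THRESHOLDS:
            if score < CRITICAL_THRESHOLDS[name]:
                flagged.append(name)
        elif score < SOFT_THRESHOLD and not checks_clean:
            flagged.append(name)
    return sorted(flagged)
```

A low-confidence line description is ignored while the match and totals hold, but the same score is flagged once either check fails.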
Normalize trace fields like any other schema element. Page references should be stable enough to identify the source page after files are merged or split. Timestamps should use one timezone, usually UTC. Confidence should use one scale across the schema, whether that is 0-1, a percentage, or a categorical low/medium/high value.
Schema Boundaries: What to Capture, Calculate, or Leave Out
The five categories describe the field universe. Three boundary decisions keep the schema usable: what to capture, what to calculate, and what to ignore.
Capture what the supplier wrote on the invoice when that value matters to posting, payment, tax, matching, or audit. Calculate values that are functions of captured fields, such as gross-equals-net-plus-tax checks, PO variance, or coding suggestions derived from supplier, description, and cost-center hints. Do not ask an extraction step to "capture" a value the supplier never printed unless the field is clearly labelled as derived.
This boundary avoids two opposite errors. Capturing a calculated value creates reconciliation work when the captured value disagrees with its inputs. Calculating a value the supplier actually printed can create payment disputes when the buyer's math differs from the supplier's document. GL coding is usually derived from captured evidence; the supplier's printed line total is captured because it is the amount the supplier is asking to be paid.
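One way to keep the capture/calculate boundary visible in the data is to label each value with its origin. The sketch below is hypothetical: `FieldValue`, the `origin` strings, and `gross_variance` are invented names. The check takes only captured inputs and returns a clearly derived result, so a derived number can never be mistaken for something the supplier printed.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class FieldValue:
    name: str
    value: Decimal
    origin: str  # "captured" (printed by the supplier) or "derived" (computed)


def gross_variance(net: FieldValue, tax: FieldValue,
                   gross: FieldValue) -> FieldValue:
    """Compute net + tax - gross from captured values; result is labelled derived."""
    for item in (net, tax, gross):
        if item.origin != "captured":
            raise ValueError(f"{item.name} must be a captured value")
    variance = net.value + tax.value - gross.value
    return FieldValue("gross_variance", variance, "derived")
```

A zero variance confirms the printed totals tie; a non-zero variance is a review signal, not a reason to overwrite the supplier's printed gross.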
Leave out fields that do not support posting, matching, tax, audit retrieval, or reporting. Optional fields captured "just in case" create certain reconciliation work for uncertain future value; the right schema is the smallest set the team's actual decisions depend on.
Document boundaries also belong in the schema decision. One PDF can contain one multi-page invoice, several separate invoices, an invoice plus a statement, or a duplicate copy of the same invoice received through another channel. The schema needs a rule for tying every captured field to the correct logical invoice. Supplier-specific statement workflows, such as turning roofing supplier statements and invoices into Excel before job-costing or QuickBooks review, face the same problem.
Normalization ownership is the last boundary. Some cleanup belongs at extraction because every downstream system needs the same value, such as ISO dates or canonical currency codes. Some cleanup belongs at posting because the ledger has the authoritative supplier master, tax-code table, or PO master. The problem is not which layer owns every rule; the problem is letting both layers apply different rules or letting neither layer own the rule.
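The two ownership failures, rules applied by both layers and rules owned by neither, can be checked mechanically if ownership is written down. The sketch below assumes a simple mapping from rule name to the layers that apply it; the rule names and layer labels are illustrative.

```python
def ownership_problems(rule_owners):
    """Flag rules with no owner or more than one owner.

    rule_owners: dict mapping rule name to a list of layers applying it,
    e.g. {"date_iso": ["extraction"]}.
    """
    problems = {}
    for rule, owners in rule_owners.items():
        if len(owners) == 0:
            problems[rule] = "unowned"
        elif len(owners) > 1:
            problems[rule] = "contested"
    return problems
```

For instance, a mapping in which `po_canonical_form` lists both extraction and posting, and `uom_canonical_form` lists neither, reports the first as contested and the second as unowned, while singly owned rules pass silently.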
Once the five categories are agreed, encode them in the tool that consumes the data: an extraction prompt, CSV import template, manual coding standard, or API field definition. For prompt wording, use a practical invoice extraction prompt structure; for automated batch capture, the automate supplier invoice data extraction workflow applies the same schema consistently across supplier files.