VAT Return Data Extractor: From Invoices to Working Paper

Turn supplier invoices, receipts, and credit notes into a VAT-return-ready working paper: field schema, validation controls, and audit-trail design.

Topics: Tax & Compliance, VAT return preparation, purchase register, supplier invoices, credit notes, VAT working paper, input VAT, reverse charge VAT

A VAT return data extractor pulls VAT-relevant fields from supplier invoices, receipts, and credit notes into a structured working paper that feeds the return. The minimum field set is short enough to remember and complete enough to defend: supplier name, supplier VAT/GST ID, invoice number, tax point, document type, net and VAT amounts split by rate, reverse-charge indicator, non-claimable reason, credit-note reference back to the original invoice, and source filename and page. The same schema is reusable across jurisdictions because each local return differs only in how those fields are aggregated into box totals, not in which fields the working paper has to carry.

That reusability is the point of separating the extraction layer from the return itself. Generic invoice OCR sits below this layer — it converts pixels to text but carries no VAT semantics, so a mixed-rate invoice or a credit note arrives as a flat header capture with no rate split and no sign. Jurisdiction-specific return software sits above this layer — it assumes the data is already structured, leaving the working paper to whoever feeds it. The extractor occupies the middle: it knows what a VAT return needs, but it does not impose a single country's filing form.

The Field Schema a VAT Return Working Paper Needs

The schema below groups fields into four clusters: party and document identification, dates and document type, amounts and rates, and references and review trail. Each field carries a one-line justification tied to what fails downstream — at box-total level or in audit defence — if it is missing or wrong. Use the list as a test against any extractor, including this one.

Party and document identification.

  • Supplier name. The reclaim is recorded against a named supplier. Without the name the row cannot be reconciled against the supplier ledger and any later supplier-side query has no anchor.
  • Supplier VAT/GST ID. In most jurisdictions an input VAT reclaim requires that the supply came from a registered taxable person; a missing or incorrect ID can disqualify the line from the reclaim. The ID is also half of the duplicate-detection key.
  • Invoice number. The uniqueness anchor. Combined with the supplier VAT ID it identifies the document for duplicate checks and provides the traceback the auditor will ask for.

Dates and document type.

  • Tax point and invoice date. The tax point determines the return period; the invoice date is not always the same value. Advance payments, continuous supplies, EU acquisitions, and self-billing arrangements all separate the two. Capture both as distinct fields and apply the tax point for period assignment.
  • Document type as a controlled value. Invoice, credit note, receipt, utility bill, other. The downstream behaviour diverges by type: credit notes are negative and require an original-invoice reference, receipts have evidence thresholds, and utility bills have their own continuous-supply tax-point rules. Free text here is a trap; a controlled value is the foundation that every later validation builds on. The general AP field set (the supplier invoice fields to capture for bookkeeping) sits adjacent to this VAT-specific schema and shares most of the identifier fields, with the VAT working paper layering rate, reverse-charge, and non-claimable fields on top.

Amounts and rates.

  • Net amount, VAT/GST amount, gross amount. The header-level totals every reviewer expects to see, and the basis against which line-level totals reconcile.
  • VAT/GST rate split per line, where the document mixes rates. This is the single field that separates a real working paper from a flat header capture. Mixed-rate invoices are the most common source of return error, and a per-line rate column is the only way to feed the standard, reduced, and zero subtotals that any local return needs.
  • Currency. A supplier invoicing in another currency states VAT in that currency; the return is filed in the home currency. Without a currency field there is nothing to convert, and without a conversion at the right rate-date the reclaim is wrong.
  • Reverse-charge indicator (boolean). A reverse-charge supply belongs in the return with notional output VAT alongside input VAT, not as ordinary input. A boolean flag at row level routes the row into the right box totals later.
  • Exemption or non-claimable reason (coded). Entertainment, partial-exemption restriction, motor expenses, and similar carve-outs exist in every jurisdiction with their own labels. A coded field — entertainment, motor, partial-exemption, exempt-supply, and so on — lets the working paper exclude the row from the reclaim subtotal while keeping it visible and explained. A free-text reason cannot be subtotalled, so the carve-out becomes invisible at review time and surfaces only when an auditor asks.

References and review trail.

  • Credit-note reference back to the original invoice. Without it, the credit cannot be netted against the right period or supplier and the original invoice cannot be marked as adjusted. The deeper treatment is in credit note data extraction fields and controls, which covers the credit-note-specific edge cases.
  • Source filename and page reference. The audit anchor. Every total in the return has to be traceable back to a named PDF and page; a row without this field is a number without provenance.
  • Review status, reviewer, and timestamp. Raw, reviewed, queried, approved — plus the initials and the date of the change. This is what turns the working paper into a record of who saw what and when, which is the difference between a spreadsheet and defensible evidence.

The schema is jurisdiction-agnostic on purpose. Local box numbers — UK VAT100 boxes, Irish VAT3 fields, UAE VAT201 categories — are mappings of these fields, not additions to them. The regional articles handle the mapping work.
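As a rough illustration, the four clusters above could be carried as a single typed record. This is a sketch under stated assumptions, not a prescribed format: the class name `WorkingPaperRow`, the field names, and the enum values are hypothetical, chosen only to mirror the field list.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal
from enum import Enum
from typing import Optional


class DocType(Enum):
    """Controlled document-type values (hypothetical labels)."""
    INVOICE = "invoice"
    CREDIT_NOTE = "credit_note"
    RECEIPT = "receipt"
    UTILITY_BILL = "utility_bill"
    OTHER = "other"


class ReviewStatus(Enum):
    RAW = "raw"
    REVIEWED = "reviewed"
    QUERIED = "queried"
    APPROVED = "approved"


@dataclass
class WorkingPaperRow:
    # Party and document identification
    supplier_name: str
    supplier_vat_id: Optional[str]   # None when the document lacks it
    invoice_number: str
    # Dates and document type
    tax_point: date                  # drives period assignment
    invoice_date: date
    doc_type: DocType
    # Amounts and rates
    net: Decimal
    vat: Decimal
    gross: Decimal
    vat_rate: Decimal                # as a fraction, e.g. Decimal("0.20")
    currency: str
    reverse_charge: bool = False
    non_claimable_reason: Optional[str] = None   # coded, not free text
    # References and review trail
    credit_note_ref: Optional[str] = None
    source_file: str = ""
    source_page: int = 0
    review_status: ReviewStatus = ReviewStatus.RAW
    reviewer: Optional[str] = None
    reviewed_at: Optional[datetime] = None
```

Using an enum for the document type means a free-text value such as "Credit Note" is rejected when the row is built, rather than surfacing later as a validation gap.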

Validation Controls and Common Failure Modes

A working paper with the right fields can still feed a wrong return. The schema is necessary but not sufficient; the validation layer is what catches the specific ways VAT data goes wrong between document and filing. The failure modes below are the ones that surface in review when a partner spots them and in audit when nobody did.

Mixed-rate invoices treated as single-rate. A single supplier invoice can carry items at the standard rate, the reduced rate, and zero rate — utilities, hospitality, and trade-supply invoices routinely do. An extractor that captures only the header VAT total collapses this into one number and misclassifies the standard, reduced, and zero subtotals. The control is per-line rate capture, with a totals reconciliation that checks the line-level sums against the document header and flags any variance.
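A minimal sketch of that totals reconciliation, assuming each extracted line is a dict with `net` and `vat` held as `Decimal`. The function name, the dict keys, and the one-cent tolerance are illustrative assumptions.

```python
from decimal import Decimal


def reconcile_lines(lines, header_net, header_vat, tolerance=Decimal("0.01")):
    """Compare line-level sums with the document header.

    Returns a dict of variances (line sum minus header); an empty dict
    means the document reconciles within the tolerance.
    """
    line_net = sum((line["net"] for line in lines), Decimal("0"))
    line_vat = sum((line["vat"] for line in lines), Decimal("0"))
    variances = {}
    if abs(line_net - header_net) > tolerance:
        variances["net"] = line_net - header_net
    if abs(line_vat - header_vat) > tolerance:
        variances["vat"] = line_vat - header_vat
    return variances


# Illustrative mixed-rate document: standard, reduced, and zero-rated lines.
lines = [
    {"net": Decimal("100.00"), "vat_rate": Decimal("0.20"), "vat": Decimal("20.00")},
    {"net": Decimal("50.00"), "vat_rate": Decimal("0.05"), "vat": Decimal("2.50")},
    {"net": Decimal("30.00"), "vat_rate": Decimal("0.00"), "vat": Decimal("0.00")},
]
print(reconcile_lines(lines, Decimal("180.00"), Decimal("22.50")))  # reconciles
print(reconcile_lines(lines, Decimal("180.00"), Decimal("36.00")))  # VAT variance flagged
```

A flagged variance is a prompt for review, not an automatic correction: either a line was missed or the header was misread.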

Credit notes booked as positive invoices. When a credit note is captured with the same sign as an invoice, the period's input VAT and gross are both overstated: the credit's VAT is added when it should have been subtracted, so the reclaim is overstated by twice the credit's VAT amount. The control is document-type classification at extraction time: credit-note amounts are stored as negative, the document-type field is set to credit note, and the credit-note reference field ties the row back to the original invoice so the adjustment lands against the right supply.
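The sign convention can be sketched as a small normalisation step, assuming rows are dicts and the document type has already been classified. The key names and the `flag` field are hypothetical.

```python
from decimal import Decimal


def normalise_sign(row):
    """Return a copy of the row with credit-note amounts stored as negative."""
    out = dict(row)
    if row["doc_type"] == "credit_note":
        for field in ("net", "vat", "gross"):
            out[field] = -abs(row[field])
        # A credit note with no link to its original invoice needs review.
        if not row.get("credit_note_ref"):
            out["flag"] = "credit note missing original-invoice reference"
    return out


rows = [
    {"doc_type": "invoice", "net": Decimal("1000.00"), "vat": Decimal("200.00"),
     "gross": Decimal("1200.00")},
    {"doc_type": "credit_note", "net": Decimal("200.00"), "vat": Decimal("40.00"),
     "gross": Decimal("240.00"), "credit_note_ref": "INV-001"},
]
input_vat = sum((normalise_sign(r)["vat"] for r in rows), Decimal("0"))
print(input_vat)  # 160.00
```

Using `-abs(...)` makes the step safe to run twice: a credit note that already arrived negative stays negative.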

Non-claimable VAT included in reclaim totals. Entertainment, certain motor expenses, partial-exemption restrictions, and similar carve-outs vary in scope by jurisdiction but exist in every one. If they are reclaimed in error, the recovery is overstated and the carve-out is invisible until an audit asks for proof. The control is a non-claimable reason field, populated at the row, that excludes the row from the reclaim subtotal at the working-paper layer rather than relying on the return preparer to remember at filing time.
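One way to apply that exclusion at the working-paper layer is sketched below. The reason codes are hypothetical labels following the examples in the text; a real list would come from the jurisdiction's own rules.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical coded reasons; actual carve-outs vary by jurisdiction.
NON_CLAIMABLE_CODES = {"entertainment", "motor", "partial_exemption", "exempt_supply"}


def reclaim_summary(rows):
    """Split VAT into a claimable total and excluded subtotals by reason."""
    claimable = Decimal("0")
    excluded = defaultdict(Decimal)
    for row in rows:
        reason = row.get("non_claimable_reason")
        if reason is None:
            claimable += row["vat"]
        elif reason in NON_CLAIMABLE_CODES:
            excluded[reason] += row["vat"]
        else:
            # Free text cannot be subtotalled, so reject it outright.
            raise ValueError(f"uncoded non-claimable reason: {reason!r}")
    return claimable, dict(excluded)
```

The excluded rows stay in the working paper and appear as their own subtotals, so the carve-out is visible at review rather than silently dropped.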

Duplicate invoices. Suppliers occasionally re-issue the same invoice as a replacement; AP teams occasionally re-submit a scan they thought was missed. Without duplicate detection, the same VAT is reclaimed twice. The control is uniqueness on the supplier-VAT-ID-plus-invoice-number tuple, with potential duplicates surfaced for human review rather than silently merged — sometimes the second copy is a corrected re-issue and the original needs to be retired, sometimes it is a true duplicate and one row needs to be deleted.
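A sketch of the duplicate check, assuming one dict per document (run it before any per-rate row expansion, otherwise the lines of a mixed-rate invoice would collide with each other). The normalisation rule, stripping whitespace and upper-casing, is an assumption; real IDs may need more careful cleaning.

```python
from collections import defaultdict


def _norm(value):
    """Strip all whitespace and upper-case, so 'gb 123' matches 'GB123'."""
    return "".join(value.split()).upper()


def find_potential_duplicates(documents):
    """Group document indexes by (supplier VAT ID, invoice number).

    Returns only the keys seen more than once. Nothing is merged or
    deleted: the groups are surfaced for a human decision.
    """
    groups = defaultdict(list)
    for index, doc in enumerate(documents):
        vat_id, number = doc.get("supplier_vat_id"), doc.get("invoice_number")
        if not vat_id or not number:
            continue  # incomplete keys are handled by the missing-ID check
        groups[(_norm(vat_id), _norm(number))].append(index)
    return {key: idxs for key, idxs in groups.items() if len(idxs) > 1}
```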

Foreign-currency totals without home-currency conversion. A supplier invoicing in EUR or USD states VAT in that currency; the return is filed in the home currency. Without a conversion at the correct rate-date (the tax point, not the extraction date and not the payment date), the reclaim is wrong by the exchange-rate movement between the tax point and whichever date was actually used. The control is a currency field on every row and a rate-date convention applied consistently across the period, so the home-currency VAT figure is reproducible and defensible.
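The rate-date convention can be sketched as a lookup keyed on currency and tax point. Everything specific here is an assumption for illustration: the home currency, the shape of the rate table, and the rate value itself, which is made up.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP


def to_home_currency(amount, currency, tax_point, rates, home="GBP"):
    """Convert an amount using the rate published for the tax point.

    `rates` maps (currency, date) to home-currency units per one unit of
    the foreign currency. A missing rate raises KeyError rather than
    falling back to another date, so the gap is visible.
    """
    if currency == home:
        return amount
    rate = rates[(currency, tax_point)]
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Illustrative rate only, not a real published figure.
rates = {("EUR", date(2025, 3, 14)): Decimal("0.8400")}
print(to_home_currency(Decimal("200.00"), "EUR", date(2025, 3, 14), rates))  # 168.00
```

Failing loudly on a missing rate keeps the convention consistent: a silent fallback to the nearest available date would make the figure hard to reproduce.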

Receipts without valid tax-invoice evidence. Below a jurisdictional threshold a receipt may serve as evidence of a supply without carrying the full tax-invoice elements; above the threshold it must carry them. Receipts above the threshold that lack the supplier VAT/GST ID or the tax breakdown are not admissible evidence for the input claim. The control is document-type-aware evidence checks that flag receipts missing the required elements so the reclaim is built only on admissible documents.
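A sketch of a document-type-aware evidence check. The threshold is passed in by the caller because it is jurisdiction-specific; the key names and the choice to compare against the gross amount are assumptions.

```python
from decimal import Decimal


def evidence_issues(row, threshold):
    """List the tax-invoice elements a receipt above the threshold lacks.

    Returns an empty list for non-receipts and for receipts at or below
    the threshold.
    """
    if row["doc_type"] != "receipt" or abs(row["gross"]) <= threshold:
        return []
    issues = []
    if not row.get("supplier_vat_id"):
        issues.append("missing supplier VAT/GST ID")
    if row.get("vat") is None or row.get("vat_rate") is None:
        issues.append("missing tax breakdown")
    return issues
```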

Missing supplier VAT/GST ID. Without the supplier's VAT/GST ID on the captured record, the supply may fail to qualify for reclaim at all, regardless of the other fields. The ID can be missing because the document genuinely lacks it, because the supplier is unregistered, or because the extraction missed a field that is on the page. The control is a hard validation: missing IDs are flagged for review before the working paper is taken forward, and the reviewer decides whether to chase the supplier, exclude the row, or treat it as a non-VAT cost.

Tax point versus invoice date confusion. Most invoices have a tax point that equals the invoice date, which is why the distinction is easy to forget. The cases where it does not — advance payments before the supply, continuous supplies (utilities, leases), self-billed invoices, EU intra-community acquisitions — are also the cases where getting the period wrong sends VAT into the wrong return. The control is capturing tax point and invoice date as separate fields and using the tax point for period assignment, full stop.
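Period assignment from the tax point can be sketched in a few lines. The calendar-quarter convention below is an assumption; many filers have staggered or monthly periods, so `period_of` would be swapped for the filer's own rule.

```python
from datetime import date


def period_of(d):
    """Label a date with its calendar quarter, e.g. '2025-Q1' (assumed convention)."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def return_period(row):
    """Assign the return period from the tax point, never the invoice date."""
    return period_of(row["tax_point"])


# Advance payment: the tax point falls before the invoice is issued.
row = {"tax_point": date(2025, 3, 28), "invoice_date": date(2025, 4, 2)}
print(return_period(row))              # 2025-Q1
print(period_of(row["invoice_date"]))  # 2025-Q2, the wrong return
```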

Reverse-charge supplies misclassified as standard input. Cross-border B2B services and intra-community acquisitions under the reverse-charge mechanism belong in the return with notional output VAT alongside the input VAT claim, not as ordinary input. A row that lacks the reverse-charge indicator is routed into the standard input box and the corresponding output box is silently understated. The control is the reverse-charge boolean field on the row; the deeper edge cases, particularly for EU acquisitions, are covered in the article on intra-community acquisition reverse-charge entries, which walks through the acquisition-specific treatment.
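The routing driven by that flag can be sketched as follows. This assumes the row's `vat_rate` holds the local rate to self-account at (as a fraction) and that the input side is fully recoverable; both are simplifications, and the names are hypothetical.

```python
from decimal import Decimal


def vat_entries(row):
    """Return (input_vat, notional_output_vat) for one working-paper row.

    Reverse-charge rows usually show no VAT on the document, so the
    amount is computed from the net and entered on both sides.
    """
    if row["reverse_charge"]:
        notional = (row["net"] * row["vat_rate"]).quantize(Decimal("0.01"))
        return notional, notional
    return row["vat"], Decimal("0.00")


service = {"reverse_charge": True, "net": Decimal("1000.00"),
           "vat_rate": Decimal("0.20"), "vat": Decimal("0.00")}
print(vat_entries(service))  # input and output sides both carry 200.00
```

Without the flag the same row would contribute nothing to either side, which is how the output box ends up understated.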

Working-Paper Structure and the Audit Trail

The schema and the validation controls define what has to be in the working paper. The structure defines how the rows are arranged so a reviewer can use it, a preparer can map it to a local return, and an auditor can trace any total back to a source.

Row convention. One row per document is the default. Where a document carries mixed rates, the row expands into one row per rate-split line, with the header fields (supplier name, supplier VAT/GST ID, invoice number, tax point, invoice date, document type, currency) repeated on every line so any subset of rows can be totalled or filtered without joining tables. Per-line rows carry their own net, VAT amount, and rate; the document-level gross is reconstructed by sum where needed. The trade-off is row count for self-containment, and it is the right trade — a single working-paper view that supports both invoice-level and line-level totalling beats two tables that have to be reconciled.
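The row convention can be sketched as a flattening step, assuming the extractor returns one dict per document with a `lines` list inside it. That nested shape and the key names are assumptions.

```python
from decimal import Decimal


def expand_rows(document):
    """Flatten one document into one row per rate-split line.

    Header fields are repeated on every row so any subset of rows can be
    filtered or totalled without a join.
    """
    header = {key: value for key, value in document.items() if key != "lines"}
    return [{**header, **line} for line in document["lines"]]


document = {
    "supplier_name": "Acme Ltd",
    "invoice_number": "INV-001",
    "currency": "GBP",
    "lines": [
        {"net": Decimal("100.00"), "vat": Decimal("20.00"), "vat_rate": Decimal("0.20")},
        {"net": Decimal("50.00"), "vat": Decimal("2.50"), "vat_rate": Decimal("0.05")},
    ],
}
rows = expand_rows(document)
gross = sum((r["net"] + r["vat"] for r in rows), Decimal("0"))
print(len(rows), gross)  # 2 172.50
```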

Column order. The principle is that a reviewer reads the row left to right in the order they think about a document. Identifiers come first (supplier name, supplier VAT/GST ID, invoice number), then dates (tax point, invoice date), then document type, then amounts (net, VAT amount, gross, rate, currency), then flags (reverse-charge indicator, non-claimable reason), then references (credit-note reference, source filename and page), then review status. There is no single correct ordering, but the underlying test is the same — a reviewer scanning across a row should be able to rebuild the document's tax position without having to scroll back to a header or jump to another sheet.

Operations the structure must support. Sort and filter by supplier, by VAT rate, by document type, by period, by review status. Subtotal by rate, so the standard, reduced, and zero rate totals are produced as a by-product of the working paper rather than re-derived for the return. Subtotal by non-claimable reason, so the carve-outs are visible as a category rather than buried in row-level flags. Surface unreviewed and queried rows for the reviewer, so a closing pass can see what is outstanding without trawling every row. A spreadsheet with named columns and the rate, reason, and review-status fields populated supports all of these natively; the field design is what enables the operations, not the spreadsheet software.
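Two of these operations, sketched over rows held as dicts. The same results would come from a pivot table in a spreadsheet; the function and key names here are illustrative.

```python
from collections import defaultdict
from decimal import Decimal


def subtotal_by_rate(rows):
    """Sum net and VAT per rate, giving the standard/reduced/zero subtotals."""
    totals = defaultdict(lambda: {"net": Decimal("0"), "vat": Decimal("0")})
    for row in rows:
        totals[row["vat_rate"]]["net"] += row["net"]
        totals[row["vat_rate"]]["vat"] += row["vat"]
    return dict(totals)


def outstanding(rows):
    """Rows a closing pass still has to look at."""
    return [row for row in rows if row["review_status"] in ("raw", "queried")]
```

Both functions rely only on fields the schema already carries, which is the point of the paragraph above: the field design enables the operations.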

Audit trail. Each row carries a source filename and page reference back to the original PDF, so any subtotal that ends up in a return box can be traced to a named source document — the auditor who asks "show me the underlying invoices for the standard-rate input total" gets a row-level answer, not a folder-level one. The reviewer fields — initials, timestamp, status, and any query notes — record who saw the row and when. This is what turns the working paper from a list of numbers into evidence: every figure has a named source on one side and a named reviewer on the other.
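Answering that auditor question from the rows might look like the sketch below, assuming each row carries `source_file` and `source_page`. The helper name and the filenames in the usage are hypothetical.

```python
from decimal import Decimal


def sources_for(rows, predicate):
    """Return the distinct (file, page) pairs behind a filtered subtotal."""
    return sorted({(r["source_file"], r["source_page"]) for r in rows if predicate(r)})


rows = [
    {"vat_rate": Decimal("0.20"), "source_file": "acme_march.pdf", "source_page": 2},
    {"vat_rate": Decimal("0.05"), "source_file": "acme_march.pdf", "source_page": 2},
    {"vat_rate": Decimal("0.20"), "source_file": "bolt_0312.pdf", "source_page": 1},
]
# "Show me the underlying invoices for the standard-rate input total."
print(sources_for(rows, lambda r: r["vat_rate"] == Decimal("0.20")))
```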

The working paper is not the return. The return preparer maps the working paper's subtotals into the local return's box structure — UK VAT100 boxes, Irish VAT3 boxes, UAE VAT201 categories, and so on. The working paper is the layer that survives jurisdiction change (the preparer maps to a different box set), software change (the file moves between tax engines), and team change (the reviewer of record is documented on the row).
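The mapping step can be kept as plain configuration, separate from the working paper. In this sketch the subtotal names and box labels are placeholders, deliberately not real form boxes; a preparer would substitute the local return's own structure.

```python
from decimal import Decimal

# Hypothetical mapping: placeholder labels, not boxes from any real form.
BOX_MAP = {
    "claimable_input_vat": "BOX_INPUT_VAT",
    "reverse_charge_input_vat": "BOX_INPUT_VAT",
    "notional_output_vat": "BOX_OUTPUT_VAT",
    "purchases_net": "BOX_PURCHASES_NET",
}


def map_to_boxes(subtotals, box_map=BOX_MAP):
    """Aggregate named working-paper subtotals into return boxes."""
    boxes = {}
    for name, amount in subtotals.items():
        box = box_map[name]  # KeyError flags an unmapped subtotal
        boxes[box] = boxes.get(box, Decimal("0")) + amount
    return boxes
```

Changing jurisdiction then means supplying a different `box_map`; the rows and subtotals are untouched.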


From a Mixed Batch of Documents to a Filed-Ready Working Paper

In practice, the schema and controls meet a finance team's real input: a folder, an inbox, or a shared drive containing a few hundred to a few thousand documents for the period. That means supplier invoices in PDF, scans of receipts, supplier-issued credit notes, utility bills with their own continuous-supply tax-point quirks, occasional proforma invoices that should not be in the reclaim, and a sprinkling of items in another language or another currency. The working paper has to be assembled from this heterogeneous input, not from a templated feed of one document type at a time.

That assembly requirement is what separates tools that fit the workflow from tools that do not. A workable extractor has to accept a described schema in plain language rather than a fixed template — the field set above is broader than most pre-built templates carry, and a real working paper needs the carve-out fields, the reverse-charge flag, and the credit-note reference that templated tools rarely expose. It has to handle multiple document types in one run, because real batches mix invoices, receipts, credit notes, and utility bills. And it has to export to a structured file the return preparer can take into review — XLSX for an accountant who works in Excel, CSV for an import into accounting software, JSON for a downstream automation.
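For the CSV and JSON cases, the export side can be sketched with the standard library alone (XLSX would need a third-party package and is left out). The function names and the choice to serialise amounts as strings, which avoids float rounding, are assumptions.

```python
import csv
import io
import json
from datetime import date
from decimal import Decimal


def _plain(value):
    """Convert Decimal and date values to strings for export."""
    if isinstance(value, Decimal):
        return str(value)
    if isinstance(value, date):
        return value.isoformat()
    return value


def to_csv(rows, columns):
    """Write rows in a fixed column order; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({col: _plain(row.get(col)) for col in columns})
    return buffer.getvalue()


def to_json(rows):
    """Serialise rows as a JSON array of objects."""
    return json.dumps([{k: _plain(v) for k, v in row.items()} for row in rows], indent=2)
```

Passing the column list explicitly keeps the export in the reviewer's reading order regardless of how the rows were built.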

This is the operational fit for AI-powered invoice data extraction and the prompt-based surface it runs on. The user uploads the mixed batch, describes the VAT working-paper schema in a single prompt, and downloads the structured spreadsheet — the prompt is the configuration, with no template to maintain between periods. Mixed-format batches run to 6,000 files per job and single PDFs to 5,000 pages, with output as XLSX, CSV, or JSON. Every row carries a reference to the source file and page number, so the audit-trail field in the schema is populated automatically rather than added by hand. Where the AI had to resolve an ambiguity — an unflagged credit note, an inferred rate split, an exchange rate applied to a foreign-currency invoice — the extraction notes record the decision against the row. The prompt itself can be saved to a prompt library and applied again next period, so the schema is set once and reused.

What the prompt looks like in practice is a plain-language brief, not code. A finance team preparing a quarterly return might write something along these lines: extract one row per supplier invoice and credit note, with one row per VAT rate where a document mixes rates. Include supplier name, supplier VAT ID, invoice number, tax point, invoice date, document type (invoice, credit note, receipt, utility, other), net amount, VAT amount, gross amount, VAT rate, currency, reverse-charge indicator, non-claimable reason where applicable, original-invoice reference for credit notes, and source filename and page. Treat credit notes as negative with the original-invoice reference populated, and flag any row where the supplier VAT ID is missing. For foreign-currency invoices, keep the document currency on the row so home-currency conversion can be applied at the tax-point rate; classify receipts that lack a supplier VAT ID or a tax breakdown as evidence-deficient. That brief, in prose, is the configuration; the structured spreadsheet on the other side is the working paper.

The operational discipline of a clean working paper is no longer just internal good practice. The regulatory direction of travel is making structured VAT data a baseline expectation across the EU and the wider tax-administration world. According to the European Commission's ViDA 2026 work programme, from 1 July 2030, cross-border B2B transactions in the EU will be subject to Digital Reporting Requirements based on mandatory e-invoicing, and by 1 January 2035 Member States with domestic real-time reporting obligations must align those systems with the cross-border digital reporting system. The structured, field-level VAT data that the working paper carries today is the same data that those Digital Reporting Requirements will demand at transaction speed in a few years — building the discipline now means the workflow already meets the bar when the deadline arrives.

Where the Reusable Schema Meets Local Returns

The schema and the controls above are reusable; the local return defines how those captured fields are aggregated into box totals and which jurisdictional carve-outs apply. The generic working paper is the parent layer, and the regional articles below are the children that handle the mapping to a specific return.

For an EU member-state return, the Cyprus VAT return from supplier invoices and credit notes walks through the same field schema mapped into a standard EU VAT return form, with the credit-note treatment and the input-VAT boxes called out by name. Where the local return is built around an explicit rate-split presentation, the Irish VAT3 return with rate-split working paper shows how the per-line rate column in the working paper feeds the VAT3 boxes directly, and where the EU acquisitions and reverse-charge entries land in the Irish form. For a jurisdiction built around a purchase register, the UAE VAT purchase register from invoices shows the equivalent purchase-register layer that feeds VAT201. The same approach applies to GST return data extraction from invoices in other jurisdictions that follow the purchase-register pattern, with the same field schema and different local labels on the boxes.
