Invoice Data Schema Guide: JSON, UBL, and Peppol Compared

An invoice data schema defines how invoice information is represented in a structured, machine-readable format. It specifies the fields, data types, nesting relationships, and validation rules that govern how systems store, transmit, and interpret invoice data. Whether you're building an internal extraction pipeline or integrating with a government e-invoicing portal, the schema you adopt determines how your code reads and writes every field on an invoice.

Invoice data formats split into three broad families. Flat JSON schemas dominate extraction APIs, internal data pipelines, and modern SaaS integrations, where developers need quick access to invoice fields without navigating deep document hierarchies. UBL 2.1 XML is the structural backbone of Peppol and most EU e-invoicing mandates, built around a richly typed, namespace-heavy document model. And country-specific formats layer regulatory requirements on top of either JSON or XML base structures: India's GST e-invoice schema uses JSON with mandatory tax registration fields, while Taiwan's eGUI format relies on XML with government-defined element sets. Each of these structures the same core invoice data differently depending on its intended use case.

Regardless of standard, virtually every invoice data model shares the same foundational fields:

Invoice identifier (invoice number or unique reference)
Issue date and due date
Supplier/seller party (name, address, tax ID)
Buyer party (name, address, tax ID)
Line items, each containing a description, quantity, unit price, and line total
Tax breakdown with rate, amount, and applicable tax scheme
Document-level totals: net amount, total tax, and gross payable amount

These fields form the shared vocabulary of invoice data. The divergence between formats lies not in what data they capture but in how they organize it. A flat JSON schema might represent the seller as a top-level object with a handful of key-value pairs. UBL encodes the same seller inside a deeply nested AccountingSupplierParty element with namespaced child nodes for postal address, party tax scheme, and legal entity registration. A country-specific schema may then add mandatory fields (like India's GSTIN or Italy's Codice Destinatario) that don't exist in the base standard at all.

For developers building invoice processing systems, this fragmentation is the central design challenge. No single standard invoice data format dominates globally. A pipeline that handles European Peppol invoices, processes documents from Indian suppliers, and also supports converting invoice documents to structured JSON from scanned PDFs must account for three fundamentally different schema structures representing the same underlying data. The same mapping problem appears in sector-specific standards too, which is why a reference on LEDES versions, file structure, and UTBMS-linked fields is useful when legal e-billing data needs to flow into broader invoice pipelines. Cross-format awareness directly shapes your data model, your validation logic, and your transformation layer. The field mapping reference later in this guide maps sixteen common fields across four formats (flat JSON, UBL 2.1, Peppol BIS Billing 3.0, and India's GST e-invoice JSON), showing exactly where the same data lives in each structure.

Flat JSON Schemas and API Response Patterns

Flat JSON is the dominant invoice data format in modern API integrations and internal data pipelines. Where XML-based standards like UBL prioritize completeness and formal extensibility, JSON invoice schemas prioritize developer ergonomics: shallow nesting, human-readable field names, and direct mapping to application objects. Most extraction services, ERP connectors, and internal microservices exchange invoice data as JSON.

Two broad schema approaches define how JSON invoice APIs work in practice:

Fixed-schema APIs return a predetermined set of fields regardless of the input document. Every response includes the same structure — invoice_number, date, vendor_name, line_items[], total, tax, currency, and so on — whether the source invoice contains all those fields or not. Missing values come back as null. This predictability makes fixed schemas straightforward to integrate against: you write your parser once and every response fits. The limitation: if you need a field the API doesn't expose, you're stuck. For a detailed look at one major fixed-schema implementation, see how AWS Textract structures its invoice extraction output.
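As a minimal sketch of the "write your parser once" point, the Python below parses a fixed-schema response into a dataclass and reports which fields came back as null. The field names follow the examples in the paragraph above; the response itself and the helper names (parse_fixed_response, missing_fields) are hypothetical and not taken from any particular API.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Optional


@dataclass
class FixedSchemaInvoice:
    # Every key is always present in the response; absent values arrive as None.
    invoice_number: Optional[str]
    date: Optional[str]
    vendor_name: Optional[str]
    total: Optional[Decimal]
    tax: Optional[Decimal]
    currency: Optional[str]
    line_items: list = field(default_factory=list)


def _to_decimal(value):
    # Keep null as None instead of coercing it to 0, which would hide missing data.
    return None if value is None else Decimal(str(value))


def parse_fixed_response(payload: dict) -> FixedSchemaInvoice:
    # Direct key access is safe because a fixed schema always returns every key.
    return FixedSchemaInvoice(
        invoice_number=payload["invoice_number"],
        date=payload["date"],
        vendor_name=payload["vendor_name"],
        total=_to_decimal(payload["total"]),
        tax=_to_decimal(payload["tax"]),
        currency=payload["currency"],
        line_items=payload["line_items"] or [],
    )


def missing_fields(invoice: FixedSchemaInvoice) -> list:
    # Names of fields the source document did not supply.
    return [name for name, value in vars(invoice).items() if value is None]


# Hypothetical response: the source invoice showed no tax line or currency.
response = {
    "invoice_number": "INV-1001",
    "date": "2025-03-14",
    "vendor_name": "Acme Supplies",
    "line_items": [{"description": "Widgets", "quantity": 2, "unit_price": 12.5}],
    "total": 25.0,
    "tax": None,
    "currency": None,
}

invoice = parse_fixed_response(response)
print(missing_fields(invoice))
```

Keeping null distinct from zero matters downstream: a missing tax value and a tax value of 0 usually call for different handling.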

Prompt-defined or dynamic schemas flip that model. The caller specifies which fields to extract — by name, with optional instructions — and the API returns data keyed to those exact field definitions. Need po_number and payment_terms but not vendor_address? Define only the fields you want. The output schema adapts to your requirements rather than imposing a canonical structure. An invoice data extraction API built on this pattern lets you submit a natural language prompt or structured field list, then returns rows using your field names as columns or JSON keys. The same schema-first pattern carries over to procurement documents — see this guide to purchase order OCR API implementation patterns for an example of defining PO headers and line items before routing the output into matching workflows.
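The caller-defined pattern can be sketched roughly as follows. The request shape (a "fields" list of name/instruction pairs) and both function names are assumptions made for illustration; a real dynamic-schema API will define its own request format. The useful part is the second function, which compares the keys that came back against the keys that were asked for.

```python
import json


def build_extraction_request(fields: dict) -> str:
    # `fields` maps each caller-chosen field name to an optional instruction.
    # The payload shape here is hypothetical, not a real API contract.
    return json.dumps(
        {"fields": [{"name": name, "instruction": hint} for name, hint in fields.items()]}
    )


def check_row_keys(requested: list, row: dict) -> dict:
    # With a dynamic schema, the caller owns the contract, so verify it per row.
    requested_set = set(requested)
    returned_set = set(row)
    return {
        "missing": sorted(requested_set - returned_set),
        "unexpected": sorted(returned_set - requested_set),
    }


wanted = {"po_number": None, "payment_terms": "net days as an integer"}
request_body = build_extraction_request(wanted)

# Hypothetical row returned by the extraction service.
row = {"po_number": "PO-7781", "vendor_address": "1 Main St"}
print(check_row_keys(list(wanted), row))
```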

Schema.org Invoice and JSON-LD

Schema.org's Invoice type provides a formal JSON-LD vocabulary for describing invoices on the web. It defines properties like totalPaymentDue, paymentDueDate, broker, and provider, and is useful for structured data markup that search engines and web crawlers can interpret. However, Schema.org Invoice is a semantic web vocabulary, not a practical data interchange format. Its property names don't align with typical API field conventions, and its nested JSON-LD structure (with @context, @type, and entity references) adds overhead that most data pipelines don't need. You'll encounter it in web page metadata and some B2C invoicing platforms, but rarely as the wire format between backend systems.
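To make the mismatch with API field conventions concrete, here is a small illustrative JSON-LD Invoice and a function that flattens it into pipeline-style keys. The document values are invented, and the flat key names (vendor_name, due_date, total, currency) are assumptions chosen to resemble the API examples earlier in this section.

```python
# Illustrative Schema.org Invoice in JSON-LD form; all values are made up.
jsonld_invoice = {
    "@context": "https://schema.org",
    "@type": "Invoice",
    "provider": {"@type": "Organization", "name": "Acme Supplies"},
    "paymentDueDate": "2025-04-13",
    "totalPaymentDue": {
        "@type": "PriceSpecification",
        "price": 25.0,
        "priceCurrency": "EUR",
    },
}


def flatten_schema_org_invoice(doc: dict) -> dict:
    # Map the nested JSON-LD entities onto flat keys a data pipeline would expect.
    if doc.get("@type") != "Invoice":
        raise ValueError("not a Schema.org Invoice")
    amount = doc.get("totalPaymentDue") or {}
    provider = doc.get("provider") or {}
    return {
        "vendor_name": provider.get("name"),
        "due_date": doc.get("paymentDueDate"),
        "total": amount.get("price"),
        "currency": amount.get("priceCurrency"),
    }


print(flatten_schema_org_invoice(jsonld_invoice))
```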

Validating Extraction Output with JSON Schema

Regardless of whether your source API uses a fixed or dynamic schema, JSON Schema is the standard mechanism for validating the data you receive. A JSON Schema document defines required fields, expected data types, string patterns, numeric ranges, and structural constraints. You can enforce rules like:

invoice_number must be a non-empty string
total must be a number greater than zero
line_items must be an array with at least one object, each containing description and amount
invoice_date must match an ISO 8601 date format

Validating extraction output against a JSON Schema catches malformed data before it reaches your database or accounting system. In Python workflows, Pydantic-based invoice JSON validation is a practical way to normalize extracted fields before they move into business-rule handling. This is especially important with dynamic schemas, where the field definitions can change between extraction tasks and a schema mismatch might otherwise propagate silently.
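The four rules listed above might be written as a JSON Schema document like the one below, shown here as a Python dict. So that the sketch runs without third-party packages, it is paired with a hand-written check_invoice function that enforces the same four rules in plain Python; that function is a stand-in for a real JSON Schema validator, not an implementation of the specification. Note that in JSON Schema the "format" keyword is only enforced when the validator is configured to assert formats.

```python
from datetime import date

# The four rules expressed with standard JSON Schema keywords (draft 2020-12).
INVOICE_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["invoice_number", "total", "line_items", "invoice_date"],
    "properties": {
        "invoice_number": {"type": "string", "minLength": 1},
        "total": {"type": "number", "exclusiveMinimum": 0},
        "invoice_date": {"type": "string", "format": "date"},
        "line_items": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "object", "required": ["description", "amount"]},
        },
    },
}


def check_invoice(data: dict) -> list:
    # Plain-Python equivalent of the rules above; returns a list of error strings.
    errors = []
    number = data.get("invoice_number")
    if not isinstance(number, str) or not number:
        errors.append("invoice_number must be a non-empty string")
    total = data.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total <= 0:
        errors.append("total must be a number greater than zero")
    items = data.get("line_items")
    if not isinstance(items, list) or not items:
        errors.append("line_items must be a non-empty array")
    else:
        for index, item in enumerate(items):
            if not isinstance(item, dict) or not {"description", "amount"} <= set(item):
                errors.append(f"line_items[{index}] must contain description and amount")
    try:
        date.fromisoformat(data.get("invoice_date"))
    except (TypeError, ValueError):
        errors.append("invoice_date must be an ISO 8601 date")
    return errors


good = {
    "invoice_number": "INV-1001",
    "total": 25.0,
    "invoice_date": "2025-03-14",
    "line_items": [{"description": "Widgets", "amount": 25.0}],
}
bad = {"invoice_number": "", "total": 0, "invoice_date": "14/03/2025", "line_items": []}

print(check_invoice(good))
print(check_invoice(bad))
```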

Common API Response Patterns

Most extraction APIs wrap their results in a consistent envelope structure. A typical response includes:

A metadata wrapper containing the request status, page counts, processing timestamps, and configuration details.
An extracted data array where each object represents either one invoice (per-invoice mode) or one line item (per-line-item mode). Field names in these row objects correspond directly to the schema — fixed or caller-defined — used for extraction.
Quality signals such as AI confidence scores, uncertainty notes flagging ambiguous fields, or page-level success/failure indicators. These let consuming applications decide whether to accept extracted values directly, route them for human review, or reject them entirely.

This envelope pattern separates extraction metadata from the data itself and gives downstream consumers enough context to build validation and exception-handling workflows around the extracted fields.
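One way a consumer might act on those quality signals is sketched below. The envelope keys (metadata, status, rows, data, confidence) and the threshold values are assumptions for illustration only; real APIs name and scale these differently.

```python
def route_rows(envelope: dict, accept_at: float = 0.9, review_at: float = 0.6) -> dict:
    # Refuse to process rows from a failed request.
    status = envelope["metadata"]["status"]
    if status != "success":
        raise ValueError(f"extraction failed: {status}")

    routed = {"accept": [], "review": [], "reject": []}
    for row in envelope["rows"]:
        score = row.get("confidence")
        if score is None:
            bucket = "review"  # no signal at all: let a human look
        elif score >= accept_at:
            bucket = "accept"
        elif score >= review_at:
            bucket = "review"
        else:
            bucket = "reject"
        routed[bucket].append(row["data"])
    return routed


# Hypothetical envelope in per-invoice mode.
envelope = {
    "metadata": {"status": "success", "pages": 3},
    "rows": [
        {"data": {"invoice_number": "INV-1"}, "confidence": 0.97},
        {"data": {"invoice_number": "INV-2"}, "confidence": 0.71},
        {"data": {"invoice_number": "INV-3"}, "confidence": 0.2},
        {"data": {"invoice_number": "INV-4"}},
    ],
}

print(route_rows(envelope))
```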

UBL 2.1 XML and Peppol BIS Billing

Universal Business Language (UBL) 2.1 is the OASIS open standard that has become the dominant XML schema for structured business documents, invoices chief among them. Where flat JSON schemas represent an invoice as a single object with arrays, the UBL XML schema for invoices models them as a deep hierarchy that mirrors real-world document relationships.

A UBL Invoice document nests elements in ways that capture meaning through structure. AccountingSupplierParty and AccountingCustomerParty each contain full party details including postal addresses, tax registration identifiers, and contact information. InvoiceLine elements each wrap their own Item (with description, commodity classification, and item-level properties), Price, and per-line tax breakdowns. At the document level, LegalMonetaryTotal aggregates line extension amounts, tax-exclusive and tax-inclusive totals, and the payable amount, while TaxTotal contains one or more TaxSubtotal elements broken down by tax category, rate, and taxable amount.

This nesting is semantically precise. A tax subtotal belongs to a specific tax category. A price belongs to a specific line item. An address belongs to a specific party. The tradeoff is verbosity: a simple three-line invoice that fits in 40 lines of JSON can easily produce 200+ lines of UBL XML. Teams without prior XML tooling experience (schema validation, XPath querying, XSLT transformation) face a steeper integration curve than they would with a JSON-based approach.
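The following sketch reads a handful of values out of a UBL invoice using only the Python standard library, which gives a feel for the namespace handling involved. The XML fragment is heavily trimmed for illustration (it omits the customer party and invoice lines, so it would not pass schema validation), its values are invented, and the flat output keys are assumptions.

```python
import xml.etree.ElementTree as ET

# Standard UBL 2.1 namespaces for the common aggregate and basic components.
NS = {
    "cac": "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2",
    "cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2",
}

UBL_FRAGMENT = """<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
    xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
    xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
  <cbc:ID>INV-1001</cbc:ID>
  <cbc:IssueDate>2025-03-14</cbc:IssueDate>
  <cbc:DocumentCurrencyCode>EUR</cbc:DocumentCurrencyCode>
  <cac:AccountingSupplierParty>
    <cac:Party>
      <cac:PartyName><cbc:Name>Acme Supplies</cbc:Name></cac:PartyName>
      <cac:PartyTaxScheme>
        <cbc:CompanyID>DE123456789</cbc:CompanyID>
        <cac:TaxScheme><cbc:ID>VAT</cbc:ID></cac:TaxScheme>
      </cac:PartyTaxScheme>
    </cac:Party>
  </cac:AccountingSupplierParty>
  <cac:TaxTotal>
    <cbc:TaxAmount currencyID="EUR">4.75</cbc:TaxAmount>
  </cac:TaxTotal>
  <cac:LegalMonetaryTotal>
    <cbc:TaxExclusiveAmount currencyID="EUR">25.00</cbc:TaxExclusiveAmount>
    <cbc:PayableAmount currencyID="EUR">29.75</cbc:PayableAmount>
  </cac:LegalMonetaryTotal>
</Invoice>"""


def flatten_ubl(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)

    def text(path):
        # Returns the element text, or None when the path is absent.
        return root.findtext(path, namespaces=NS)

    supplier = "cac:AccountingSupplierParty/cac:Party"
    return {
        "invoice_number": text("cbc:ID"),
        "invoice_date": text("cbc:IssueDate"),
        "currency": text("cbc:DocumentCurrencyCode"),
        "vendor_name": text(f"{supplier}/cac:PartyName/cbc:Name"),
        "vendor_tax_id": text(f"{supplier}/cac:PartyTaxScheme/cbc:CompanyID"),
        "tax": text("cac:TaxTotal/cbc:TaxAmount"),
        "total": text("cac:LegalMonetaryTotal/cbc:PayableAmount"),
    }


print(flatten_ubl(UBL_FRAGMENT))
```

Seven flat keys already require paths up to four elements deep, which is the verbosity tradeoff described above in miniature.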

Peppol BIS Billing 3.0 is not a separate schema. It is a profile, a constrained subset of UBL 2.1 designed specifically for cross-border e-invoicing over the Peppol network. Peppol takes the full UBL specification and narrows it: certain fields become mandatory (like buyer and seller endpoint identifiers), others remain optional, and some are explicitly forbidden. The critical relationship to understand is directional. Every valid Peppol invoice is a valid UBL 2.1 invoice. But a generic UBL invoice will often fail Peppol validation because it lacks required identifiers or includes disallowed extensions.
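The directional relationship can be illustrated with a rough pre-check that looks for a few elements Peppol requires but generic UBL leaves optional: the customization identifier and the seller and buyer endpoint identifiers. This is only a sketch under those assumptions. It covers a tiny fraction of the Peppol rule set and is no substitute for running the official validation artefacts; the function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "cac": "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2",
    "cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2",
}


def peppol_precheck(xml_text: str) -> list:
    # Returns human-readable problems; an empty list means these few checks passed.
    root = ET.fromstring(xml_text)
    problems = []
    if not (root.findtext("cbc:CustomizationID", namespaces=NS) or "").strip():
        problems.append("missing cbc:CustomizationID")
    for role in ("AccountingSupplierParty", "AccountingCustomerParty"):
        endpoint = root.find(f"cac:{role}/cac:Party/cbc:EndpointID", NS)
        if endpoint is None or not (endpoint.text or "").strip():
            problems.append(f"missing cbc:EndpointID for {role}")
        elif not endpoint.get("schemeID"):
            problems.append(f"cbc:EndpointID for {role} has no schemeID")
    return problems


# A generic UBL fragment: plausible UBL content, but not Peppol-ready.
GENERIC_UBL = """<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
    xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
    xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
  <cbc:ID>INV-1001</cbc:ID>
  <cac:AccountingSupplierParty>
    <cac:Party>
      <cac:PartyName><cbc:Name>Acme Supplies</cbc:Name></cac:PartyName>
    </cac:Party>
  </cac:AccountingSupplierParty>
  <cac:AccountingCustomerParty>
    <cac:Party>
      <cbc:EndpointID>9876543210</cbc:EndpointID>
    </cac:Party>
  </cac:AccountingCustomerParty>
</Invoice>"""

for problem in peppol_precheck(GENERIC_UBL):
    print(problem)
```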

Both Peppol and UBL sit atop a foundational layer: EN 16931, the European standard that defines the semantic data model for electronic invoicing. EN 16931 does not specify XML tags or JSON keys. It specifies business rules and information requirements, the what of e-invoicing. UBL 2.1 and UN/CEFACT CII are the two approved syntax bindings, the how. When a regulation says "invoices must comply with EN 16931," it means the invoice must contain the required business data (seller tax ID, line-level tax categories, document-level monetary totals) and be encoded in one of those two syntaxes.

This three-layer architecture matters for developers making format decisions:

EN 16931 defines the semantic requirements (what data fields exist, which are mandatory, what business rules apply)
UBL 2.1 provides the XML encoding (element names, hierarchy, data types)
Peppol BIS Billing 3.0 constrains UBL for network interoperability (which UBL fields to use, how to populate identifiers, validation rules for cross-border exchange)

Peppol's relevance is no longer limited to the EU. As of July 2025, Peppol documents are exchanged across 65 countries, and 23 countries have established government-led Peppol Authorities to formalize adoption. Singapore, Australia, New Zealand, Japan, and multiple countries across Latin America are active participants. For developers building invoice processing systems with any international scope, understanding the Peppol invoice data model and UBL structure is becoming a baseline requirement rather than a European specialty.

The practical calculus comes down to what you need from your invoice XML schema. UBL's hierarchical structure captures rich relational data (party addresses with scheme-identified IDs, multi-level tax categorization, structured payment terms, delivery location details) that most flat JSON schemas simply omit or flatten into strings. If your system must exchange invoices with government portals, Peppol access points, or European trading partners, UBL is not optional. If you are building internal-only systems with no regulatory compliance requirements, the XML tooling overhead may not be justified.

Country-Specific E-Invoice Schemas

While UBL and Peppol aim to standardize cross-border exchange, dozens of countries have implemented mandatory e-invoicing with their own data schemas, designed not for interchange efficiency but for government visibility into transactions in real time. Where UBL optimizes for buyer-seller document exchange, national schemas optimize for tax compliance reporting. This distinction shapes field requirements, validation rules, and transmission workflows in ways that affect every layer of your pipeline.

India's GSTN E-Invoice Schema

India's GST e-invoice schema (FORM GST INV-01) is one of the most widely encountered country-specific formats and a useful reference point. The GSTN e-invoice schema is JSON-based, containing approximately 120 fields organized into clearly defined groups:

Transaction details — invoice type, supply type, document number, and date
Document period — billing period start and end dates
Seller details — legal name, trade name, GSTIN, address with state code and PIN
Buyer details — mirrored structure with the buyer's GSTIN, address, and place of supply
Item list — line items with HSN (Harmonized System of Nomenclature) codes, quantity, unit price, discount, taxable value, and tax breakdowns (CGST, SGST, IGST, cess)
Value details — document-level totals, tax amounts, rounding, and final payable value
E-way bill information — transport details required when goods move between states

Several aspects of this schema differ fundamentally from UBL. The GSTIN (a 15-character alphanumeric tax identifier) is mandatory for both parties and acts as the primary key for tax authority validation. HSN codes are required at the line level, not optional. And the entire payload is submitted to the Invoice Registration Portal (IRP), which returns a signed IRN (Invoice Reference Number) and QR code. The invoice is not considered legally valid until this round-trip completes.

This means India's e-invoice data structure cannot be treated as a simple field mapping from UBL. The schema encodes regulatory workflow requirements, not just document content.
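A local pre-submission check along these lines can catch obvious problems before the IRP round-trip. The sketch below makes several assumptions that should be verified against the current published schema: the key names SellerDtls, BuyerDtls, and Gstin, and the commonly documented GSTIN layout (two-digit state code, ten-character PAN, entity code, the letter Z, and a check character). It does not verify the GSTIN check character, and the sample values are invented.

```python
import re

# Commonly documented GSTIN layout; the check character is not verified here.
GSTIN_PATTERN = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]")


def pre_submission_issues(payload: dict) -> list:
    # Collect problems that would make the IRP round-trip pointless to attempt.
    issues = []
    for party in ("SellerDtls", "BuyerDtls"):
        gstin = (payload.get(party) or {}).get("Gstin")
        if not isinstance(gstin, str) or not GSTIN_PATTERN.fullmatch(gstin):
            issues.append(f"{party}.Gstin is missing or malformed")
    items = payload.get("ItemList") or []
    if not items:
        issues.append("ItemList must contain at least one item")
    for index, item in enumerate(items):
        if not item.get("HsnCd"):
            issues.append(f"ItemList[{index}].HsnCd is required")
    return issues


# Invented sample payload containing only the keys this sketch inspects.
payload = {
    "SellerDtls": {"Gstin": "27ABCDE1234F1Z5", "LglNm": "Acme Supplies Pvt Ltd"},
    "BuyerDtls": {"Gstin": "29-INVALID", "LglNm": "Example Traders"},
    "ItemList": [{"HsnCd": "8471", "Qty": 2}, {"Qty": 1}],
}

for issue in pre_submission_issues(payload):
    print(issue)
```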

Other Notable National Schemas

India is far from alone. Several other countries maintain distinct e-invoice formats worth knowing:

Saudi Arabia (ZATCA) takes a hybrid approach. ZATCA's e-invoicing mandate builds on UBL 2.1 as its base schema but adds ZATCA-specific extensions for cryptographic stamping, QR code generation, and tax authority integration. If your system already handles UBL, ZATCA compliance is an extension problem rather than a greenfield one.

Brazil's NF-e (Nota Fiscal Eletrônica) uses a proprietary XML schema with over 500 fields covering tax calculations across federal, state, and municipal levels. NF-e mandates began phasing in from 2008 and predate most international standardization efforts, which explains its independent design.

Taiwan's eGUI (Electronic Government Uniform Invoice) defines an XML format specifically for tax reporting to the Ministry of Finance. Like India's approach, it prioritizes government reporting over commercial document exchange.

The pattern that emerges is a spectrum. Some countries adopt UBL as a base and layer national extensions on top (Saudi Arabia, Singapore, Australia). Others define entirely independent schemas shaped by their specific tax systems (India, Brazil, Taiwan). Your integration strategy depends heavily on where a given country falls on this spectrum.

Architectural Implications

A multi-country system needs both a normalization layer (mapping each format to a canonical internal model on ingestion) and an adaptive fallback for unmapped formats. Teams pushing this pattern further build agentic workflows that let AI agents autonomously route, validate, and extract invoices across schemas without per-format configuration. Newer e-invoicing mandates increasingly align with Peppol or EN 16931, but the installed base of country-specific formats, such as India's GSTN and Brazil's NF-e, will persist for years, so plan to support both standardized and proprietary schemas in parallel.
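A minimal sketch of that shape, restricted to JSON inputs for brevity: each known format has a detector and an adapter into a canonical model, and anything unrecognized returns None so the caller can hand it to a fallback path. The CanonicalInvoice fields, the detection heuristics, and the GST key names (DocDtls.No, SellerDtls.LglNm, ValDtls.TotInvVal) are assumptions to verify against the relevant specifications.

```python
import json
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class CanonicalInvoice:
    number: str
    seller_name: str
    total: Decimal
    source_format: str


def from_flat_json(doc: dict) -> CanonicalInvoice:
    return CanonicalInvoice(
        doc["invoice_number"], doc["vendor_name"], Decimal(str(doc["total"])), "flat_json"
    )


def from_gst_json(doc: dict) -> CanonicalInvoice:
    return CanonicalInvoice(
        doc["DocDtls"]["No"],
        doc["SellerDtls"]["LglNm"],
        Decimal(str(doc["ValDtls"]["TotInvVal"])),
        "gst_json",
    )


# Ordered (detector, adapter) pairs; the first detector that matches wins.
ADAPTERS = [
    (lambda d: "SellerDtls" in d and "DocDtls" in d, from_gst_json),
    (lambda d: "invoice_number" in d, from_flat_json),
]


def normalize(raw: str) -> Optional[CanonicalInvoice]:
    # None means "unmapped format": route the document to the fallback path.
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None
    for matches, adapter in ADAPTERS:
        if matches(doc):
            return adapter(doc)
    return None


flat = '{"invoice_number": "INV-1", "vendor_name": "Acme", "total": 118.0}'
gst = (
    '{"DocDtls": {"No": "GST-7"}, "SellerDtls": {"LglNm": "Acme India"},'
    ' "ValDtls": {"TotInvVal": 118.0}}'
)

for raw in (flat, gst, "<Invoice/>"):
    print(normalize(raw))
```

An XML branch (for UBL or NF-e) would slot in the same way, with its own detector and adapter.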

Cross-Format Field Mapping Reference

The table below maps sixteen common invoice fields across four formats: a typical flat JSON schema (as returned by extraction APIs), UBL 2.1 XML, Peppol BIS Billing 3.0, and the Indian GST e-invoice JSON schema. Each cell shows the actual field name, JSON key, or XML element path used in that format.

Reading the Peppol column. Peppol BIS Billing 3.0 is a profile built on top of UBL 2.1, so the element paths are structurally identical in most cases. Where Peppol diverges is not in element naming but in constraint severity: fields that UBL marks as optional (such as cbc:DueDate or cac:PaymentTerms) are often mandated or further restricted by Peppol business rules. A document that validates against the UBL 2.1 schema may still fail Peppol schematron validation if required coded values or identifiers are missing.

Structural differences the table cannot show. A flat mapping captures field names but obscures important architectural gaps between these formats:

Nested party and address data. UBL and Peppol represent supplier and buyer information as deeply nested structures that include postal addresses (cac:PostalAddress), contact details (cac:Contact), and legal registration (cac:PartyLegalEntity). Flat JSON schemas typically collapse these into top-level keys like vendor.address or omit subfields entirely. Any normalization layer needs to decide how to flatten or restructure this hierarchy.
Payment terms and delivery information. UBL provides dedicated elements for payment means (cac:PaymentMeans), payment terms (cac:PaymentTerms), and delivery location (cac:Delivery). Most extraction API JSON schemas either surface these as unstructured text fields or skip them.
India GST-specific fields. The GST e-invoice schema includes HSN/SAC classification codes (ItemList[].HsnCd), e-way bill data (EwbDtls), and a tripartite tax breakdown (CGST, SGST, IGST) that have no direct equivalent in UBL or generic flat JSON schemas. If your system needs to support Indian compliance, these fields must be handled as format-specific extensions rather than mapped to a shared model.
Tax Amount decomposition. The table row for Tax Amount illustrates a common normalization challenge. UBL and Peppol store a single cbc:TaxAmount per tax subtotal, while the India GST schema splits the value across three fields depending on whether the transaction is interstate (IgstVal) or intrastate (CgstVal + SgstVal). A translation adapter must include logic to sum or split these values depending on the target format.
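The sum-or-split logic for tax amounts can be sketched as two small functions. This assumes the document-level keys CgstVal, SgstVal, and IgstVal named above, and it assumes that an intrastate tax amount divides equally between CGST and SGST; confirm both against the applicable rules before relying on them. Cess and per-category UBL tax subtotals are ignored here.

```python
from decimal import Decimal

GST_TAX_KEYS = ("CgstVal", "SgstVal", "IgstVal")


def gst_to_single_tax_amount(val_dtls: dict) -> Decimal:
    # GST -> UBL direction: collapse the three components into one tax amount.
    return sum((Decimal(str(val_dtls.get(key, 0))) for key in GST_TAX_KEYS), Decimal("0"))


def single_tax_amount_to_gst(tax_amount: Decimal, interstate: bool) -> dict:
    # UBL -> GST direction: interstate supplies carry IGST only.
    zero = Decimal("0.00")
    if interstate:
        return {"CgstVal": zero, "SgstVal": zero, "IgstVal": tax_amount}
    # Intrastate: assume an equal CGST/SGST split; any odd cent goes to SGST
    # so that the two parts always add back up to the original amount.
    half = (tax_amount / 2).quantize(Decimal("0.01"))
    return {"CgstVal": half, "SgstVal": tax_amount - half, "IgstVal": zero}


print(gst_to_single_tax_amount({"CgstVal": 9, "SgstVal": 9, "IgstVal": 0}))
print(single_tax_amount_to_gst(Decimal("18.00"), interstate=False))
print(single_tax_amount_to_gst(Decimal("18.00"), interstate=True))
```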

This mapping table is a practical starting point for building a normalization layer or translation adapter between invoice data formats. For each field your system consumes, trace across the row to confirm you are reading the correct path in every format you need to support. When implementing these mappings in code, consider validating invoice data with TypeScript and Zod to enforce your internal invoice data structure at runtime and catch mismatches early in the pipeline.

Choosing the Right Invoice Data Format

The format question is not academic. It determines your validation strategy, your storage model, your parsing dependencies, and how much friction you face when a new trading partner or tax authority enters the picture. The right answer depends on what you are building.

Internal data pipelines and analytics. Use flat JSON with a defined schema. JSON Schema gives you type validation, required-field enforcement, and tooling compatibility across every major language and database. You control the field names, the nesting depth, and the data types. Flat structures are typically simpler to query and more compact to serialize than deeply nested XML, and they integrate directly with columnar stores, BI tools, and event streaming platforms. If your invoice data never leaves your own systems, there is no reason to adopt the structural overhead of XML-based standards.

EU cross-border and Peppol compliance. Use UBL 2.1 profiled to Peppol BIS Billing 3.0. This is non-negotiable when trading partners or government mandates operate within the Peppol network. Your invoices must conform to EN 16931 and pass Peppol's schematron business rule checks.

Country-specific tax compliance. Use whatever the national tax authority mandates: India's GSTN JSON, Saudi Arabia's ZATCA-extended UBL, Brazil's NF-e XML. The tax authority's technical specification is the definitive source, and no internal preference overrides it.

API integration and extraction output. When consuming invoice data from OCR or extraction APIs, design around the API's native output format, which is almost always JSON. If you consume from a single provider, map their response fields to your internal model directly. If you consume from multiple extraction services, define a canonical internal schema first, then write a thin mapping layer for each provider's output. This prevents your downstream logic from coupling to any single vendor's field naming conventions.
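The thin per-provider mapping layer can be as simple as a table of key renames. In this sketch the provider names, their field names, and the canonical field list are all hypothetical; the point is that downstream code only ever sees the canonical keys.

```python
# Canonical internal schema, defined first and independent of any vendor.
CANONICAL_FIELDS = ("invoice_number", "seller_name", "gross_total")

# One rename table per provider: provider key -> canonical key.
PROVIDER_FIELD_MAPS = {
    "provider_a": {
        "invoice_number": "invoice_number",
        "vendor_name": "seller_name",
        "total": "gross_total",
    },
    "provider_b": {
        "InvoiceId": "invoice_number",
        "SupplierName": "seller_name",
        "AmountDue": "gross_total",
    },
}


def to_canonical(provider: str, row: dict) -> dict:
    # Start with every canonical key set to None so the output shape is stable.
    canonical = dict.fromkeys(CANONICAL_FIELDS)
    for source_key, target_key in PROVIDER_FIELD_MAPS[provider].items():
        if source_key in row:
            canonical[target_key] = row[source_key]
    return canonical


row_a = {"invoice_number": "INV-1", "vendor_name": "Acme", "total": 118.0}
row_b = {"InvoiceId": "INV-1", "SupplierName": "Acme", "AmountDue": 118.0, "Extra": "ignored"}

print(to_canonical("provider_a", row_a))
print(to_canonical("provider_b", row_b))
```

Adding a third provider then means adding one more rename table, with no change to anything that consumes the canonical rows.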

Multi-format environments. Most production invoice systems eventually operate across several formats simultaneously. JSON handles internal processing and storage. UBL serves Peppol-connected partners. Country-specific schemas satisfy regulatory submission pipelines, and sectors like defense contracting impose their own structural requirements — DCAA compliant invoicing for government contracts demands cost segregation, CLIN-level breakdowns, and audit-ready documentation that generic schemas don't anticipate. The architecture decision is not "pick one format" but rather: which formats does my system need to support, and what is my canonical internal representation?

Build a normalization layer that ingests each format natively, maps it to your canonical model, and transforms outbound documents into whatever the target system requires. The canonical model becomes your single source of truth for business logic, validation, and reporting. Format-specific concerns stay isolated at the ingestion and emission boundaries.

The practical principle: start with the simplest format that satisfies your immediate requirements. For most teams, that means flat JSON with a well-defined schema. Add UBL when a Peppol integration reaches your roadmap, and country-specific formats when you expand into jurisdictions that mandate them.