TypeScript Invoice Extraction with Zod Validation

Build type-safe invoice extraction pipelines with TypeScript and Zod. Schema design, runtime validation with safeParse, and Node SDK integration.

Topics: API & Developer Integration, TypeScript, Zod, schema validation, Node SDK

Extraction APIs return JSON. TypeScript interfaces give you autocompletion and compiler hints for that JSON, but they evaporate the moment your code runs. A vendor name that arrives as null instead of a string, a tax field missing entirely from the response, a date formatted as "March 5" instead of an ISO string: none of these trigger a type error. The TypeScript compiler has no opinion about what an API actually sends at runtime. Your invoice processing pipeline compiles cleanly, passes review, and then breaks in production when it encounters a real-world document the extraction service handled differently than you expected.

This is the core problem with relying on TypeScript's structural type system alone for invoice data extraction. Interfaces describe a shape, but they make zero guarantees about whether incoming data matches that shape. The gap between compile-time types and runtime reality is where data quality issues live, and for financial data, those issues are expensive.

Zod closes that gap. TypeScript invoice extraction with Zod validation combines compile-time type safety with runtime validation of extraction output. You define your invoice data structures as Zod schemas, covering header fields, line items, and tax breakdowns. Then you use safeParse to validate API responses before they reach downstream systems. When a field is missing, malformed, or the wrong type, safeParse catches it at the extraction boundary and returns a structured error rather than letting bad data propagate into your accounting logic, database writes, or financial reports.

The schema-first approach to document extraction makes the Zod schema the single source of truth for the entire pipeline. That one schema definition serves three distinct roles: compile-time types via z.infer, runtime validation via safeParse, and extraction guidance via .describe() annotations that keep the data contract and the extraction logic aligned in a single place.

This pattern fits naturally into the TypeScript ecosystem that now dominates production Node.js development. TypeScript overtook both Python and JavaScript in August 2025 to become the most used language on GitHub, marking the most significant language shift in more than a decade, according to GitHub's Octoverse 2025 report. For teams already building their backend services, API layers, and data pipelines in TypeScript, adding schema-first validation to the extraction layer is a natural extension of the same type-safe patterns they use everywhere else.

If you already have a working pipeline for extracting invoice data with JavaScript and Node.js, these Zod patterns layer directly on top. You do not need to rewrite your extraction calls or switch libraries. The schema wraps your existing output, adding type-safe invoice data extraction as a validation layer at the boundary where external data enters your system.


Designing Zod Schemas for Invoice Data

A production invoice schema is not a flat object with three fields. Real invoices carry dozens of data points across headers, variable-length line item arrays, multi-rate tax breakdowns, and format variations like credit notes. Zod lets you model all of this complexity as composable, validated schemas where TypeScript types are derived, never hand-maintained.

Start with the invoice header. This is the foundation every other schema composes into:

import { z } from "zod";

const invoiceHeaderSchema = z.object({
  invoiceNumber: z.string().describe("Unique invoice identifier, alphanumeric"),
  invoiceDate: z.string().describe("Invoice issue date, not due date. Format: YYYY-MM-DD"),
  dueDate: z.string().optional().describe("Payment due date. Format: YYYY-MM-DD"),
  vendorName: z.string().describe("Legal name of the issuing vendor"),
  currency: z.string().describe("ISO 4217 currency code, e.g. GBP, USD, EUR"),
  netAmount: z.number().describe("Pre-tax total"),
  taxAmount: z.number().optional().describe("Total tax charged on the invoice"),
  totalAmount: z.number().describe("Final payable amount including tax"),
});

Every .describe() annotation here serves a dual purpose. For developers, it documents what each field means directly in the schema definition. For extraction pipelines, these descriptions can be fed into prompts to tell the extraction system exactly how to interpret ambiguous fields. "Invoice issue date, not due date" prevents a common extraction error where the wrong date gets pulled. This pattern of annotating schemas with extraction-aware descriptions is one of Zod's underrated strengths in data extraction contexts.

Line Item Schemas

Line items are where invoice extraction gets interesting. A single invoice might have three line items or three hundred. Define the line item as a separate schema and compose it in:

const lineItemSchema = z.object({
  description: z.string().describe("Line item description or service name"),
  quantity: z.number().describe("Number of units"),
  unitPrice: z.number().describe("Price per unit before tax"),
  lineTotal: z.number().describe("quantity * unitPrice, pre-tax"),
  productCode: z.string().optional().describe("SKU or product code, if present"),
});

const invoiceSchema = z.object({
  ...invoiceHeaderSchema.shape,
  lineItems: z.array(lineItemSchema).describe("All line items on the invoice"),
});

The separate lineItemSchema definition makes it reusable across different invoice schema variants, and z.array() handles variable-length line item data regardless of whether the extraction returns one item or fifty.

Tax Breakdowns for Multi-Rate and Multi-Jurisdiction Invoices

Tax fields are the most inconsistent part of invoice data. A domestic invoice might have a single VAT line. A cross-border invoice might split tax across multiple jurisdictions with different rates. Some invoices omit tax entirely.

const taxBreakdownSchema = z.object({
  taxType: z.string().describe("Tax type identifier: VAT, GST, Sales Tax, etc."),
  taxRate: z.number().describe("Tax rate as a decimal, e.g. 0.20 for 20%"),
  taxableAmount: z.number().describe("Amount subject to this tax rate"),
  taxAmount: z.number().describe("Calculated tax for this rate"),
  jurisdiction: z.string().optional().describe("Tax jurisdiction if applicable"),
});

const invoiceWithTaxSchema = invoiceSchema.extend({
  taxBreakdowns: z.array(taxBreakdownSchema).optional().default([]),
  reverseCharge: z.boolean().optional().default(false)
    .describe("Whether reverse charge VAT applies"),
});

The combination of z.optional() and z.default() is critical here. Fields like reverseCharge only appear on certain invoice types, but downstream code benefits from always having a value. Setting a default of false means your TypeScript code never needs to check for undefined on that field.

Handling Different Invoice Formats with Discriminated Unions

Not every document in an extraction batch is a standard invoice. Credit notes have negative amounts and carry a credit reason. Zod's z.discriminatedUnion() lets you model these format variations with full type narrowing:

const standardInvoiceSchema = z.object({
  documentType: z.literal("invoice"),
  invoiceNumber: z.string(),
  totalAmount: z.number().min(0),
  lineItems: z.array(lineItemSchema),
});

const creditNoteSchema = z.object({
  documentType: z.literal("credit_note"),
  invoiceNumber: z.string(),
  totalAmount: z.number().negative().describe("Must be negative for credit notes"),
  creditReason: z.string().describe("Reason for the credit, e.g. returned goods"),
  originalInvoiceRef: z.string().optional().describe("Reference to the original invoice"),
  lineItems: z.array(lineItemSchema),
});

const documentSchema = z.discriminatedUnion("documentType", [
  standardInvoiceSchema,
  creditNoteSchema,
]);

When you validate a parsed document against documentSchema, TypeScript knows that if documentType is "credit_note", the creditReason field exists. No type assertions, no runtime guessing.

Deriving Types from Schemas

The key advantage of building Zod invoice schemas is that you never maintain types separately:

type Invoice = z.infer<typeof invoiceSchema>;
type LineItem = z.infer<typeof lineItemSchema>;
type TaxBreakdown = z.infer<typeof taxBreakdownSchema>;
type Document = z.infer<typeof documentSchema>;

When you add a field to the schema, the inferred TypeScript invoice type definition updates automatically. When you mark a field optional, the type reflects it. This eliminates an entire class of bugs where types and validation logic drift apart.

These schema patterns map directly to the field complexity you encounter in production extraction. Defining invoice structures as Zod schemas means you can validate extraction API responses at the boundary, catching missing fields or malformed data before they reach your application logic. This approach is also directly relevant when converting invoice data to structured JSON output, since the Zod schema defines exactly what shape that JSON should take.


Connecting Zod to the Node SDK Extraction Pipeline

The schemas from the previous section define what valid invoice data looks like. Now you need an extraction pipeline that produces data to validate against them. The @invoicedataextraction/sdk package ships with TypeScript declarations built in, so there is no separate @types install step. Your editor gets full autocomplete and type checking from the moment you add it.

Install the SDK and set up the client:

npm install @invoicedataextraction/sdk

import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});

The SDK requires Node.js 18+ and ESM modules. Set "type": "module" in your package.json or use .mjs file extensions.

Deriving Extraction Fields from the Zod Schema

The SDK's structured prompt accepts a fields array where each entry has a name and an optional prompt string describing what the API should extract. The .describe() annotations from your Zod schema map directly into this structure, so your schema drives the extraction request:

function schemaToPromptFields(schema: z.ZodObject<any>) {
  return Object.entries(schema.shape).map(([key, field]) => ({
    name: key,
    prompt: (field as z.ZodTypeAny).description,
  }));
}

const extractionFields = schemaToPromptFields(invoiceHeaderSchema);
// Result:
// [
//   { name: "invoiceNumber", prompt: "Unique invoice identifier, alphanumeric" },
//   { name: "invoiceDate", prompt: "Invoice issue date, not due date. Format: YYYY-MM-DD" },
//   { name: "vendorName", prompt: "Legal name of the issuing vendor" },
//   ...
// ]

This helper reads each key and its .describe() value from the Zod schema shape. The result is a fields array the SDK accepts directly, keeping your extraction request and your validation logic in sync through a single schema definition.

Running the Extraction

The SDK's extract() method handles file upload, task submission, polling, and result download in one call. Pass the structured prompt fields derived from your schema, choose an output structure that matches your data model, and request JSON output:

import { readFile } from "fs/promises";
import path from "path";

const result = await client.extract({
  folder_path: "./invoices",
  prompt: {
    fields: extractionFields,
    general_prompt: "Extract one row per invoice. Format dates as YYYY-MM-DD. Use 0 for missing tax values.",
  },
  output_structure: "per_invoice",
  download: {
    formats: ["json"],
    output_path: "./output",
  },
  console_output: true,
});

The SDK's TypeScript declarations type the result object, giving you compile-time access to properties like result.extraction_id, result.pages.successful_count, and result.output.json_url. This invoice data extraction API with TypeScript support was designed with TypeScript-first workflows in mind, and the SDK reflects that with full type coverage across the response structure.

Validating Downloaded JSON with Zod

Once the extraction completes, read the downloaded JSON and run it through your Zod schema using safeParse. This is where compile-time types and runtime validation converge: the SDK's declarations type the API response metadata, and Zod validates the actual extracted invoice data.

const jsonPath = path.join("./output", "extraction.json");
const raw = JSON.parse(await readFile(jsonPath, "utf-8"));

const invoiceBatchSchema = z.array(invoiceHeaderSchema);
const parsed = invoiceBatchSchema.safeParse(raw);

if (parsed.success) {
  const invoices = parsed.data;
  console.log(`Validated ${invoices.length} invoices`);
} else {
  console.error("Validation failures:", parsed.error.issues);
}

Every invoice that passes safeParse matches your TypeScript type exactly. Every one that fails gives you a structured error with the field path, expected type, and what was actually received. No casting, no type assertions, no runtime surprises.

Developers focused on improving invoice OCR accuracy in extraction pipelines can combine OCR optimization with Zod validation for end-to-end data quality, catching issues at both the extraction layer and the schema enforcement layer.


Validating Extraction Output and Recovering from Failures

Extraction output is inherently uncertain. An invoice might be scanned at an angle, a vendor might use a non-standard layout, or the extraction model might interpret a field differently than you expected. Calling parse on this data is a mistake because it throws on the first validation failure, crashing your pipeline when the correct response is remediation, not termination. Use safeParse instead.

const result = invoiceSchema.safeParse(extractedData);

if (result.success) {
  // result.data is fully typed as z.infer<typeof invoiceSchema>
  processValidInvoice(result.data);
} else {
  // result.error is a ZodError with structured issue details
  handleValidationFailure(result.error);
}

The success discriminant gives you a clean branch: when true, result.data carries the validated and typed invoice. When false, result.error is a ZodError containing an issues array where each issue exposes a path (which field failed), a code (what kind of failure), and a message (human-readable description). This structure lets you route failures programmatically rather than dumping a stack trace.

function handleValidationFailure(error: z.ZodError) {
  for (const issue of error.issues) {
    const fieldPath = issue.path.join(".");

    switch (issue.code) {
      case "invalid_type": {
        // Zod reports a missing required field as invalid_type with received "undefined"
        const typeIssue = issue as z.ZodInvalidTypeIssue;
        if (typeIssue.received === "undefined") {
          console.error(`Missing required field: ${fieldPath}`);
        } else {
          console.warn(`Type mismatch at ${fieldPath}: ${issue.message}`);
        }
        break;
      }
      case "invalid_string":
      case "invalid_date":
        console.warn(`Format error at ${fieldPath}: ${issue.message}`);
        break;
      default:
        console.error(`Validation issue at ${fieldPath}: ${issue.message}`);
    }
  }
}

A missing required field like an invoice number signals that re-extraction with a more specific prompt might recover the data. A type mismatch on a currency amount usually means the schema needs a coercion rule to handle what the extraction engine actually returns.

Recovery Patterns for Common Extraction Issues

Extraction engines frequently return amounts as strings, dates in inconsistent formats, and currency values with symbols attached. Rather than failing validation and requiring manual intervention, build these realities into your schema.

Coerce numeric amounts that arrive as strings. This is the single most common extraction quirk:

const amount = z.coerce.number().nonnegative();

Preprocess currency values to strip symbols before validation. When an amount comes back as "$1,250.00", the raw string will fail a numeric check. Handle this at the schema boundary. Note that the regex below assumes US-style formatting; a European value like "3.400,50" uses a decimal comma and needs locale-aware normalization before this step:

const currencyAmount = z.preprocess(
  (val) => {
    if (typeof val === "string") {
      // Removes currency symbols and US-style thousands separators
      return parseFloat(val.replace(/[^0-9.\-]/g, ""));
    }
    return val;
  },
  z.number().nonnegative()
);

Default missing optional fields when you know the correct fallback:

const taxRate = z.coerce.number().default(0);

Normalize date formats with a transform, so downstream code always receives a consistent representation. Throwing inside a transform escapes safeParse as an uncaught exception, so report failures through the transform's ctx instead:

const invoiceDate = z.string().transform((val, ctx) => {
  const parsed = new Date(val);
  if (isNaN(parsed.getTime())) {
    ctx.addIssue({ code: z.ZodIssueCode.custom, message: "Unparseable date" });
    return z.NEVER;
  }
  return parsed.toISOString().split("T")[0]; // YYYY-MM-DD
});

These patterns compose naturally. A production invoice schema applies several of them together, accepting the messy reality of extraction output while guaranteeing clean types on the other side.

The Lenient-Then-Strict Pattern

For complex invoices, a two-pass validation strategy prevents total data loss when only some fields fail. Parse first with a relaxed schema where most fields are optional, then run the strict schema against the partial result.

const lenientInvoiceSchema = invoiceSchema.partial().extend({
  rawSource: z.string().optional(),
});

const strictResult = invoiceSchema.safeParse(extractedData);

if (!strictResult.success) {
  const lenientResult = lenientInvoiceSchema.safeParse(extractedData);
  if (lenientResult.success) {
    const recovered = lenientResult.data;
    const failedFields = strictResult.error.issues.map((i) => i.path.join("."));

    escalateForReview(recovered, failedFields);
  }
}

This captures every field that did extract correctly and isolates the failures for targeted re-extraction or manual review, rather than discarding the entire invoice.

Catching Semantically Wrong Data with Refinements

Structural validation alone has a critical blind spot: data that is the right type but the wrong value. This is where extraction hallucination becomes a practical concern. An extraction model can confidently return a total of $500 when the net amount is $450 and tax is $90. Zod's type checks will pass, but the invoice is wrong. The .refine() method catches errors that structural validation misses.

const validatedInvoice = z
  .object({
    netAmount: z.coerce.number(),
    taxAmount: z.coerce.number(),
    totalAmount: z.coerce.number(),
    invoiceDate: z.string(),
  })
  .refine(
    (inv) => Math.abs(inv.netAmount + inv.taxAmount - inv.totalAmount) < 0.02,
    { message: "Total does not equal net + tax", path: ["totalAmount"] }
  )
  .refine(
    (inv) => new Date(inv.invoiceDate) <= new Date(),
    { message: "Invoice date is in the future", path: ["invoiceDate"] }
  );

The tolerance of 0.02 on the arithmetic check accounts for floating-point rounding in currency conversions. The future-date check catches a common OCR misread where a "2" becomes a "7" in the year. These refinements encode business rules that structural schemas cannot express, and they run after type validation so you can safely access typed fields in the predicate.

Batch Validation with Typed Result Collection

Production systems process invoices in batches, not one at a time. When your pipeline extracts data from dozens or hundreds of documents, a single validation failure should not halt the entire batch. Collect results into typed success and failure arrays so the caller gets a clear picture of what validated and what needs attention.

interface ValidationSuccess {
  index: number;
  data: z.infer<typeof invoiceSchema>;
  source: string;
}

interface ValidationFailure {
  index: number;
  errors: z.ZodIssue[];
  rawData: unknown;
  source: string;
}

function validateBatch(
  invoices: unknown[],
  sources: string[]
): { valid: ValidationSuccess[]; invalid: ValidationFailure[] } {
  const valid: ValidationSuccess[] = [];
  const invalid: ValidationFailure[] = [];

  invoices.forEach((invoice, index) => {
    const result = invoiceSchema.safeParse(invoice);
    if (result.success) {
      valid.push({ index, data: result.data, source: sources[index] });
    } else {
      invalid.push({
        index,
        errors: result.error.issues,
        rawData: invoice,
        source: sources[index],
      });
    }
  });

  return { valid, invalid };
}

This pattern integrates directly with the SDK response structure. The extraction response tracks failed pages in its pages field, which reports successful_count, failed_count, and the lists of successful and failed page identifiers. Pages that failed extraction never produce data to validate, so your batch validator only receives pages that the extraction engine processed successfully. Cross-referencing the SDK's failed pages list with your Zod validation failures gives you two distinct categories: pages the engine could not process at all, and pages it processed but where the output did not meet your schema requirements.

The SDK response also includes an ai_uncertainty_notes array, where each entry describes a field or assumption the extraction engine was uncertain about. When a field appears in both the uncertainty notes and your Zod validation failures, that correlation is a strong signal for manual review. When a field appears in uncertainty notes but passes Zod validation, consider applying stricter refinement checks to that field, since the engine itself was not confident in the result.

// uncertaintyNotes is the ai_uncertainty_notes array from the SDK extraction response
const { valid, invalid } = validateBatch(extractedInvoices, pageSources);

console.log(`Batch complete: ${valid.length} validated, ${invalid.length} failed`);

for (const failure of invalid) {
  const uncertainFields = uncertaintyNotes
    .filter((note) =>
      failure.errors.some((e) => e.path.join(".").includes(note.topic))
    );

  if (uncertainFields.length > 0) {
    console.log(
      `Invoice ${failure.source}: extraction uncertainty confirmed by validation`,
      uncertainFields.map((n) => n.topic)
    );
  }
}

The valid array feeds directly into your downstream pipeline. The invalid array feeds into a review queue, re-extraction with adjusted prompts, or the lenient-then-strict pattern to recover partial data. Neither path blocks the other.

The schema defines the contract. safeParse enforces it at the extraction boundary. Recovery patterns handle the gap between what the extraction engine returns and what your downstream systems require. Together, they give you a TypeScript extraction pipeline where every invoice that reaches your application code has been validated at runtime, not just at compile time.

About the author


David Harding

Founder, Invoice Data Extraction

David Harding is the founder of Invoice Data Extraction and a software developer with experience building finance-related systems. He oversees the product and the site's editorial process, with a focus on practical invoice workflows, document automation, and software-specific processing guidance.

Editorial process

This page is reviewed as part of Invoice Data Extraction's editorial process.

If this page discusses tax, legal, or regulatory requirements, treat it as general information only and confirm current requirements with official guidance before acting. The updated date shown above is the latest editorial review date for this page.
