Utility Bill OCR API: Developer Guide to JSON Extraction

A developer guide to designing a utility bill OCR API: JSON extraction, schema design, validation rules, and ERP or energy-system integration workflows.

Topics: API & Developer Integration, Utility Bills, JSON schema, meter readings, multi-provider parsing, energy data extraction

A utility bill OCR API converts electricity bills, gas bills, water bills, and telecom bills from PDFs or images into structured JSON that your downstream systems can trust. In production, that means more than text capture. A serious utility bill extraction API should return normalized fields such as account number, service address, billing period, usage, meter readings, line-item charges, taxes, and total amount due, then move that data through a predictable flow: upload, classify, extract into a defined schema, validate, and only then send it into ERP sync, cost allocation, energy reporting, or exception queues.

For developers, the minimum useful output usually includes:

  • Document identity: provider name, account number, statement date, invoice or bill number
  • Location context: service address, site ID, customer name, mailing address when present
  • Period coverage: billing period start, billing period end, due date
  • Consumption data: usage quantity, usage unit, meter number, current and prior meter readings, read dates, read type
  • Charge breakdown: supply charges, delivery charges, demand charges, taxes, surcharges, late fees, credits, adjustments
  • Totals: subtotal, tax total, total amount due, prior balance, payments received, closing balance

In finance and operations workflows, a readable utility bill is not enough. The JSON must preserve account, location, billing-period, usage, meter, and charge relationships so downstream systems can compare periods, allocate costs, reconcile totals, and flag bad meter reads.

Utility bills are harder than standard invoice OCR because providers describe the same commercial facts in very different layouts. One statement may show a single account summary. Another may include multiple meters, multiple service addresses, rate tiers, demand charges, estimated versus actual reads, or provider-specific fee tables. Telecom bills belong in the same family even when they replace meter reads with circuit IDs, plan charges, and usage blocks.

That variability changes API design. The extractor has to preserve the relationship between the account, location, period, usage block, and charge line, not just lift text off the page. If your schema loses those relationships, downstream validation becomes guesswork.

The rest of this guide focuses on how to build that correctly: define a schema that matches real utility bills, use output structure and prompts to handle provider variation, and validate the extracted JSON before it feeds finance or energy systems.


Design a Schema That Matches Real Utility Bills

For production, do not collapse utility-bill extraction into a thin record with only provider, account number, bill date, and total. Real bills carry document-level facts, service-location facts, meter-level facts, usage calculations, and provider-specific charge logic. Separate those layers so the JSON can survive downstream mapping and audit review.

A practical schema usually works best with four distinct groups:

  • Document-level fields: provider identity, account number, utility type, currency, bill date, due date, billing period start, billing period end, statement number if present, total amount due, and the service address. Treat service address separately from mailing or remittance addresses, because utility bills often contain both.
  • Service point and meter detail: service point IDs, premise IDs, meter identifiers, meter read dates, previous and current readings, units, usage totals, and any demand measurements. On commercial electric bills, demand may matter as much as consumption, so your model should leave room for meter readings and demand charges without forcing them onto every bill.
  • Rate and usage structure: tariff or plan name, rate class, season, time-of-use window, and rate tiers where applicable. A serious schema should handle one usage total and also tiered breakdowns such as first-block and second-block consumption.
  • Charge detail: fixed charges, delivery charges, supply charges, taxes, fees, riders, surcharges, adjustments, credits, and line-item totals, plus the final amount due. Keep line items separate from summary totals so finance teams can reconcile both the bill detail and the payable amount.

That structure matters because provider variation is real, and pretending otherwise creates brittle integrations. Some bills include one meter; others include multiple service points on the same statement. Some water and gas providers expose clear previous and current reads; others show only consumption totals. Some commercial electric bills break out demand, transmission, and rider charges in detail; others roll taxes and surcharges into broader buckets. Regional format differences compound this: UK, EU, and US bills each carry their own field conventions and common failure modes, which are worth mapping before you lock the schema. Your schema should normalize the concepts, not force every bill into identical completeness. In practice that means optional fields, repeatable arrays for service points and charges, and stable top-level keys even when certain providers leave sections blank.
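As a rough illustration of that layering, the sketch below shows one possible normalized bill object. Every key name (`document`, `servicePoints`, `charges`, `demand`, and so on) and the sample values are hypothetical choices made for this example, not a shape defined by any particular API. The point is only that top-level keys stay stable while provider-dependent sections are left as null or as empty arrays.

```javascript
// Hypothetical normalized shape: all key names are illustrative assumptions.
function emptyBill() {
  return {
    document: {
      provider: null,
      accountNumber: null,
      utilityType: null, // e.g. "electricity", "gas", "water", "telecom"
      currency: null,
      billDate: null,
      dueDate: null,
      billingPeriodStart: null,
      billingPeriodEnd: null,
      statementNumber: null, // optional: not every provider prints one
      serviceAddress: null, // kept apart from mailing or remittance addresses
      totalAmountDue: null,
    },
    servicePoints: [], // repeatable: one entry per service point or meter
    charges: [], // repeatable: one entry per charge line
  };
}

// A made-up single-meter water bill that prints only a consumption total.
const waterBill = emptyBill();
waterBill.document.utilityType = "water";
waterBill.servicePoints.push({
  serviceAddress: "1200 Market St, Suite 400",
  meterNumber: "W-1001",
  previousRead: null, // this provider did not print reads
  currentRead: null,
  usage: { quantity: "4200", unit: "gallons" },
  demand: null, // only relevant on some commercial electric bills
});
```

A multi-meter statement would push more entries into `servicePoints` without changing any top-level key, which is what keeps downstream mappings stable.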

A useful rule is to normalize names and meaning, while preserving enough raw detail for reconciliation. For example, map "service from," "service period," and "billing cycle" into billing period start and end, but keep the original label or source text if your downstream users may need to audit the extraction. Do the same for rate labels, tariff names, and bundled fee descriptions. That keeps your utility bill data extraction API consistent across providers without hiding the messy reality of the source documents.
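A minimal sketch of that rule, assuming you keep a lookup of known provider labels. The function name, the `concept` value, and the returned shape are hypothetical; only the three example labels come from the paragraph above.

```javascript
// Source labels that all mean "billing period" (the three examples from the text).
const BILLING_PERIOD_LABELS = new Set(["service from", "service period", "billing cycle"]);

// Maps a provider label onto a canonical concept while keeping the raw text.
// `concept` is null when the label is not recognized.
function normalizeLabel(rawLabel) {
  // Lowercase, trim, and drop trailing colons or spaces such as "Service From:".
  const key = rawLabel.trim().toLowerCase().replace(/[:\s]+$/, "");
  return {
    concept: BILLING_PERIOD_LABELS.has(key) ? "billing_period" : null,
    sourceLabel: rawLabel, // preserved verbatim for audit
  };
}
```

Unrecognized labels come back with a null concept instead of being guessed, so they can be reviewed and added to the lookup deliberately.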

There is also a typing nuance developers should plan for up front. In Invoice Data Extraction's JSON output, values are returned as strings, so your prompt should explicitly define formatting rules, such as dates in YYYY-MM-DD, decimal amounts with two places, normalized unit labels, and boolean-like fields rendered as true or false strings. Then parse those strings into typed dates, decimals, integers, and booleans in your own application code. That extra parsing step is worth designing intentionally, because it gives you stable validation logic across utility providers instead of trusting OCR output to guess native types correctly.
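The parsing step could look something like the following, assuming your prompt enforces YYYY-MM-DD dates, two-decimal amounts without currency symbols, and "true"/"false" strings. The helper names are hypothetical, and storing money as integer cents is a design choice of this sketch rather than anything the API requires.

```javascript
// Parses "YYYY-MM-DD" into a UTC Date, or returns null if malformed or impossible.
function parseIsoDate(text) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(text ?? "");
  if (!match) return null;
  const [year, month, day] = match.slice(1).map(Number);
  const date = new Date(Date.UTC(year, month - 1, day));
  // Reject rollovers: an impossible day would otherwise slide into the next month.
  const valid =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return valid ? date : null;
}

// Parses a two-decimal amount string into integer cents to avoid float drift.
function parseAmountToCents(text) {
  const match = /^(-?)(\d+)\.(\d{2})$/.exec(text ?? "");
  if (!match) return null;
  const cents = Number(match[2]) * 100 + Number(match[3]);
  return match[1] === "-" ? -cents : cents;
}

// Parses boolean-like strings; anything other than "true" or "false" is null.
function parseBooleanString(text) {
  if (text === "true") return true;
  if (text === "false") return false;
  return null;
}
```

Returning null instead of throwing lets the validation layer report every malformed field on a bill in one pass.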

Nested structured JSON inside a single field can work in small doses. For example, if you are exporting one row per bill into a tabular system, a compact nested payload for tier details or tax breakdowns may be acceptable. But that should stay the exception, not the default. Once you are dealing with multiple meters, multiple service addresses, tiered usage, and heterogeneous charge lines, first-class arrays and objects are much easier to validate, map into ERP or energy systems, and explain during audits.

Use Output Structure and Prompts to Handle Provider Variation

Provider variation is usually less about whether the OCR can read the page and more about how the API models repeated utility data. A power bill with one service address and one total can be represented cleanly as a single bill record. A utility statement with multiple meters, rate tiers, demand charges, credits, taxes, and adjustments cannot. If you force both document types into the same flat output, your downstream logic ends up rebuilding the structure the extractor threw away.

Use invoice-level output when the bill is mostly a header document with a few summary values you care about: account number, billing period, service address, total usage, total amount due, and maybe one or two aggregate tax fields. This works well when your goal is ERP sync, monthly accruals, or site-level reporting where each PDF should become one record. It is also the safer choice when repeating tables are inconsistent but not operationally important.

Use line-item output when the repeating rows are the business data. That is the better choice when you need each meter row, usage block, charge, or adjustment to carry its own fields into downstream analysis. In practice, line-item-style extraction becomes preferable when a bill contains many repeating charges, when each charge or usage block must be categorized separately, or when each meter row needs its own service period, unit of measure, usage quantity, unit rate, and amount. If your workflow depends on a utility bill line item extraction API rather than just a document summary, one row per repeated element is usually the right model.

The choice becomes clearer once you look at the fields. Meter readings often need start read, end read, read type, and read date on the same row. Rate tiers need quantity, tier label, unit rate, and subtotal together. Demand charges often have their own measurement basis, such as peak demand, billing demand, or contracted demand, plus a separate rate and amount. Estimated versus actual reads can materially change how finance or operations treats the record, so that flag should stay attached to the specific meter or usage row, not buried in bill-level notes. Taxes and adjustments are the same story: if you only need the final amount, invoice-level output is fine; if you need tax type, jurisdiction, surcharge category, or adjustment reason, extract them as separate rows or grouped line items.

As a practical rule, use output_structure: "per_line_item" when the bill may contain around seven or more line items, or when repeated rows need detailed field instructions. This preserves charge, meter, and usage relationships instead of flattening them too early. It still matters on short bills when two or three meters span multiple service periods.
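If you want that rule of thumb in code, a small helper along these lines may be enough. The function and parameter names are hypothetical; the threshold of seven and the two `output_structure` values come from this guide.

```javascript
const LINE_ITEM_THRESHOLD = 7; // "around seven or more" from the rule of thumb

// Picks an output_structure value from what you expect the bill to contain.
function chooseOutputStructure({ expectedLineItems, needsRowLevelFields }) {
  if (needsRowLevelFields || expectedLineItems >= LINE_ITEM_THRESHOLD) {
    return "per_line_item";
  }
  return "per_invoice";
}
```

The `needsRowLevelFields` flag covers the short-bill case, where only a few rows exist but each meter or charge still needs its own fields.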

This is also where prompt-controlled extraction beats template-heavy design, but the right choice depends on the workload:

  • Templates are acceptable when you only process one or two stable provider layouts and the fields rarely move.
  • Prompt-driven extraction is better when providers revise their layouts, one batch mixes electricity, gas, water, and telecom bills, or the same charge logic appears in different table shapes.
  • A hybrid model works well when you use prompts to capture the core fields, then apply provider-specific validation or mapping rules after extraction instead of maintaining a full template for every layout.

For straightforward jobs, a plain-language prompt is often enough: ask for one row per meter charge, include account number and billing period on every row, separate taxes and adjustments, and return dates in a single format. When you need tighter control, use a structured prompt object with explicit field definitions. That lets you pin down field names, row grouping, required formats, and repeated-field behavior so the JSON stays stable across providers.

Implement the Upload-to-Result Workflow

To extract utility bill data programmatically, treat the integration as a staged document-processing job, not a single OCR call. With Invoice Data Extraction, the REST API sits under https://api.invoicedataextraction.com/v1, every request is authenticated with your API key as a Bearer token, and the workflow is predictable enough to plug into queue workers, ingestion services, or customer-facing upload flows.

  1. Authenticate and register the batch. Start by creating an upload session with the files you want to process. This is where you register a mixed-provider batch, assign your own session and file IDs, and get back the part size for upload. For utility bills, this is the right place to group electricity, gas, water, telecom, or district energy files that all need to land in the same downstream schema. The practical limits matter for planning: a session can hold up to 6,000 files, PDFs can be up to 150 MB, and images can be up to 5 MB each.

  2. Upload the files, then complete each file upload. The REST flow is not a blind file POST. You request upload URLs for each file part, upload the bytes, then explicitly complete each file in the session. That staged design is useful when provider layouts vary and file sizes vary with them. A 2-page water bill image and a 90-page scanned campus utility packet can move through the same upload workflow without forcing you into a different integration path.

  3. Submit the extraction task with utility-specific instructions. Once the files are uploaded, submit an extraction task that references the upload session and file IDs. This is where your schema work from earlier sections plugs into the API. The docs show both a string prompt path and an object prompt path. The string form is fine for fast prototyping, but utility bill extraction usually benefits from the object form because you can lock exact field names such as Account Number, Billing Period Start, Billing Period End, Meter Number, Previous Read, Current Read, Usage, Service Address, Rate Class, Charges, Taxes, and Total Due. You can also add task-level instructions such as date formatting, digits-only account numbers, one row per bill, or one row per charge line. If you need bill-level summaries, choose per_invoice. If you need every supply charge, delivery charge, tax, and adjustment broken out for cost allocation or energy reporting, choose per_line_item.

  4. Poll for status until the task reaches a terminal state. After submission, you poll the extraction by ID instead of holding the original request open. That makes the workflow safer for background jobs and webhook-driven systems. The docs recommend polling no more frequently than every five seconds. In practice, this is also where a production integration separates transport success from data readiness: the file upload may have worked, but you still need to inspect processing status, failed pages, and any AI uncertainty notes before the result is allowed into an ERP, energy dashboard, or reconciliation pipeline.

  5. Download the result in the format your downstream system expects. When processing completes, you can download JSON, CSV, or XLSX output. JSON is usually the right handoff for application code and validators, CSV works well for flat-file imports and exception review, and XLSX is useful when operations teams need a human-readable audit file. The key point for a utility bill parser API is that extraction is only one half of the job. The output has to be shaped so your validation layer can reject bad billing periods, impossible usage values, missing account numbers, or totals that do not reconcile before insertion.
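The polling step can be sketched without tying it to a specific client. In the example below, `getStatus` is a hypothetical function you supply that fetches the extraction status, and the state names `"completed"` and `"failed"` are assumptions, so check the API reference for the real values. The only detail taken from this guide is the five-second minimum between polls.

```javascript
const MIN_POLL_INTERVAL_MS = 5000; // do not poll more often than every five seconds
const TERMINAL_STATES = new Set(["completed", "failed"]); // assumed state names

// Polls until the task reaches a terminal state or maxAttempts is exhausted.
// `sleep` is injectable so the loop can be tested without real delays.
async function pollUntilTerminal(getStatus, options = {}) {
  const {
    intervalMs = MIN_POLL_INTERVAL_MS,
    maxAttempts = 120,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;
  // Clamp the interval so a misconfigured caller cannot poll too aggressively.
  const waitMs = Math.max(intervalMs, MIN_POLL_INTERVAL_MS);

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await getStatus();
    if (TERMINAL_STATES.has(status.state)) {
      return { ...status, attempts: attempt };
    }
    await sleep(waitMs);
  }
  throw new Error(`Task did not reach a terminal state after ${maxAttempts} polls`);
}
```

A terminal state only means processing has finished. Failed pages and uncertainty notes still need to be inspected before the result moves downstream.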

If you do not want to hand-roll every REST step, the official SDKs abstract this workflow. The Python SDK supports Python 3.9+, the Node SDK supports Node.js 18+ and is ESM-only, and both expose a one-call extract() path for fast implementation plus lower-level staged methods when you need custom retry logic, split upload and processing services, or provider-specific validation checkpoints. The same pattern described in this core invoice and document extraction API workflow applies here, but for utility data you should be stricter about field definitions and output structure. If you want the broader platform context, this fits naturally within a document extraction API for finance workflows where utility bills are handled as operational finance documents, not just generic OCR inputs.

A compact Node SDK example looks like this:

// The Node SDK is ESM-only, so run this as an ES module (top-level await is used below).
import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY,
});

const result = await client.extract({
  files: [
    "./utility-bills/march-electricity.pdf",
    "./utility-bills/march-water.pdf",
  ],
  task_name: "utility_bill_batch_mar_2026",
  prompt: {
    fields: [
      { name: "Account Number" },
      { name: "Service Address" },
      { name: "Billing Period Start", prompt: "Use YYYY-MM-DD format" },
      { name: "Billing Period End", prompt: "Use YYYY-MM-DD format" },
      { name: "Meter Number" },
      { name: "Current Read" },
      { name: "Previous Read" },
      { name: "Usage Quantity", prompt: "Digits and decimal point only" },
      { name: "Usage Unit" },
      { name: "Charge Type" },
      { name: "Charge Amount", prompt: "No currency symbol, 2 decimal places" },
      { name: "Total Amount Due", prompt: "No currency symbol, 2 decimal places" }
    ],
    general_prompt:
      "Extract one row per charge line. Repeat account number, service address, billing period, and meter number on each row. Preserve estimated versus actual read status when shown.",
  },
  output_structure: "per_line_item",
  download: {
    formats: ["json"],
    output_path: "./output",
  },
});

// Check for pages that failed processing before trusting the downloaded output.
if (result.pages.failed_count > 0) {
  console.log(result.pages.failed);
}

Each row in the downloaded JSON still holds string values by design, so expect a shape like this:

[
  {
    "Account Number": "55302814",
    "Service Address": "1200 Market St, Suite 400",
    "Billing Period Start": "2026-02-01",
    "Billing Period End": "2026-02-28",
    "Meter Number": "E-774102",
    "Current Read": "24581",
    "Previous Read": "23894",
    "Usage Quantity": "687",
    "Usage Unit": "kWh",
    "Charge Type": "Delivery Charge",
    "Charge Amount": "84.33",
    "Total Amount Due": "216.48"
  }
]
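Because per-line-item output repeats the bill-level fields on every row, a common next step is to fold the rows back into one object per bill. The sketch below assumes the field names from the prompt above. The grouping key and the output shape are hypothetical choices, and values are left as strings for a later parsing step.

```javascript
// Fields that repeat on every row and identify the parent bill (an assumed key).
const BILL_KEY_FIELDS = ["Account Number", "Billing Period Start", "Billing Period End"];

// Groups flat per-line-item rows into one object per bill with a charges array.
function groupRowsIntoBills(rows) {
  const bills = new Map();
  for (const row of rows) {
    const key = BILL_KEY_FIELDS.map((field) => row[field] ?? "").join("|");
    if (!bills.has(key)) {
      bills.set(key, {
        accountNumber: row["Account Number"],
        serviceAddress: row["Service Address"],
        billingPeriodStart: row["Billing Period Start"],
        billingPeriodEnd: row["Billing Period End"],
        totalAmountDue: row["Total Amount Due"],
        charges: [],
      });
    }
    // Keep the meter number on the charge so meter-level relationships survive.
    bills.get(key).charges.push({
      meterNumber: row["Meter Number"],
      chargeType: row["Charge Type"],
      chargeAmount: row["Charge Amount"],
    });
  }
  return [...bills.values()];
}
```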

For telecom bills, the same pattern usually swaps meter-specific fields for circuit IDs, plan charges, data or call usage, and line-level taxes while keeping the same upload, prompt, validate, and handoff flow. If the downstream goal is departmental chargeback rather than API integration alone, this guide to splitting itemized phone bills into Excel for cost allocation shows how to structure those telecom charges row by row. If you need the same utility-bill data in a finance-ready spreadsheet instead of JSON, this walkthrough on turning utility bill PDFs into bookkeeping rows covers the fields worth carrying into Excel.


Validate the Data Before It Feeds Finance or Energy Systems

A utility bill OCR API is not ready for production just because it returns JSON. It is ready when the JSON survives a utility bill validation workflow that catches field-level mistakes, billing anomalies, and provider-specific edge cases before the data reaches finance or operations. The stakes are not trivial. EIA data on commercial building energy spending shows that the nation's 5.9 million commercial buildings spent $141 billion on energy in 2018. If your extracted data feeds accruals, chargebacks, benchmarking, or payment decisions, weak validation can turn one bad parse into a real accounting or reporting error.

A serious post-extraction checklist should validate more than field presence:

  • Account and site identity: Confirm the account number matches the expected customer or property record, and that the service address maps to the correct site, building, or cost center.
  • Billing period completeness: Require a billing period start date and end date, validate that the span is plausible, and flag gaps or overlaps if you ingest recurring bills for the same account.
  • Usage and unit plausibility: Check that usage is numeric, non-negative where appropriate, and paired with the correct unit such as kWh, therms, gallons, or kW. A value without a unit is not reliable enough for downstream use.
  • Meter read sequence: Validate previous and current meter readings, read dates, and read types. If the current read is lower than the previous read, you need a documented reason such as rollover, meter replacement, or corrected prior billing.
  • Charge reconciliation: Recalculate whether supply charges, delivery charges, riders, taxes, fees, and adjustments add up to the subtotal and final amount due within a defined tolerance.
  • Tax and fee handling: Separate taxes from non-tax regulatory fees where possible, because ERP integration and reporting logic often treat them differently.
  • Credits and adjustments: Detect negative line items, bill corrections, and prior-period adjustments so they are not accidentally treated as standard current-period expense.
  • Duplicate bill detection: Compare account number, service address, billing period, amount due, invoice or statement number, and source file fingerprint so you do not post the same bill twice.
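Three of the checks above (billing period, meter read sequence, charge reconciliation) can be sketched as small pure functions. This is an illustration under stated assumptions: inputs are already parsed (Date objects, numeric reads, integer cents), the 20 to 40 day window assumes roughly monthly billing, and the one-cent tolerance is an arbitrary default. All function names are hypothetical.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Billing period: both dates required, end after start, span within a plausible window.
function checkBillingPeriod(start, end, { minDays = 20, maxDays = 40 } = {}) {
  if (!(start instanceof Date) || !(end instanceof Date)) {
    return ["billing period start or end date is missing"];
  }
  const days = Math.round((end.getTime() - start.getTime()) / MS_PER_DAY);
  if (!Number.isFinite(days) || days <= 0) {
    return ["billing period end is not after start"];
  }
  if (days < minDays || days > maxDays) {
    return [`billing period span of ${days} days is outside ${minDays}-${maxDays}`];
  }
  return [];
}

// Meter sequence: a lower current read needs a documented reason
// such as rollover, meter replacement, or corrected prior billing.
function checkMeterSequence({ previousRead, currentRead, exceptionReason = null }) {
  if (currentRead < previousRead && !exceptionReason) {
    return ["current read is lower than previous read with no documented reason"];
  }
  return [];
}

// Charge reconciliation: line items must sum to the stated total within a tolerance.
function reconcileCharges(lineItemCents, statedTotalCents, toleranceCents = 1) {
  const sum = lineItemCents.reduce((acc, cents) => acc + cents, 0);
  const difference = sum - statedTotalCents;
  return { sum, difference, ok: Math.abs(difference) <= toleranceCents };
}
```

Each check returns its findings instead of throwing, so a pipeline can collect every issue on a bill before deciding how to route it.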

Those rules need to flex when the document is more complex than a single-site, single-meter statement. A bill with multiple meters should be validated at both the header level and the meter level, because the invoice total can reconcile even when one meter's usage or service address is attached to the wrong child record. A consolidated bill covering multiple service locations should require a service address on each meter or service block, not just in the document header. Demand-charge bills need extra checks, including whether billed demand, demand period, rate, and demand charge align. Estimated reads should not be treated the same as actual reads either. If the bill is explicitly marked estimated, your pipeline should preserve that status and route the record differently for review, forecasting, or later true-up.
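A sketch of how a pipeline might route individual meter records under those rules. The `readType` values `"actual"` and `"estimated"`, the route names, and the record shape are all assumptions made for illustration.

```javascript
// Routes one meter record. Consolidated bills must carry a service address
// on each meter block; estimated reads are kept apart from actual reads.
function routeMeterRecord(meter, { isConsolidatedBill = false } = {}) {
  const issues = [];
  if (isConsolidatedBill && !meter.serviceAddress) {
    issues.push("consolidated bill: meter block has no service address");
  }
  if (issues.length > 0) {
    return { route: "exception", issues };
  }
  if (meter.readType === "estimated") {
    // Preserve the status so a later true-up can find this record.
    return { route: "review_estimated", issues };
  }
  return { route: "standard", issues };
}
```

Running this per meter, in addition to the header-level checks, is what catches a bill whose total reconciles while one child record is attached to the wrong address.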

This is where document evidence matters. Invoice Data Extraction surfaces processed-page details and AI uncertainty notes through the API, and its downloadable spreadsheet outputs include source file and page references. When a validation rule flags a broken read sequence, an unexpected unit, or a charge mismatch, your reviewer can jump straight to the relevant page instead of reopening the entire PDF and guessing how the parser got there.

The same control mindset used in a broader utility bill validation and reconciliation workflow should sit between extraction and posting. It also overlaps with the more general post-extraction validation rules for API workflows, but utility bills need extra attention to meter readings, service address mapping, and billing-period logic because those fields drive both financial and operational decisions.

Once validation passes, the JSON becomes safe to hand off to downstream systems with much less manual cleanup. For ERP integration, that usually means creating a vendor bill or journal only when account identity, billing period, tax handling, and total reconciliation all pass. For AP review, failed rules should open an exception queue with the exact field, reason, and source-page reference. For cost allocation, you can split charges by service address, meter, property, or department instead of booking the whole bill to one overhead bucket. For energy management platforms, validated usage, demand, and cost data can be trended by site and compared across billing periods without mixing incompatible units or duplicate statements. For ESG reporting, you need traceable usage values, preserved read types, and clean site attribution so consumption data can support emissions calculations and audit review.
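The handoff decision can be reduced to a small gate: post only when every rule passes, otherwise open an exception that names the field, the reason, and the source reference. The rule-result shape and the action names below are hypothetical, and `sourceFile` and `sourcePage` are assumed to come from whatever source references your extraction output provides.

```javascript
// Decides whether a validated bill is posted or sent to the exception queue.
// Each rule result is assumed to look like:
// { field, passed, reason, sourceFile, sourcePage }
function decideHandoff(bill, ruleResults) {
  const failures = ruleResults.filter((result) => !result.passed);
  if (failures.length === 0) {
    return { action: "post_to_erp", accountNumber: bill.accountNumber, exceptions: [] };
  }
  return {
    action: "open_exception",
    accountNumber: bill.accountNumber,
    // Give reviewers the exact field, reason, and page to look at.
    exceptions: failures.map((failure) => ({
      field: failure.field,
      reason: failure.reason,
      sourceFile: failure.sourceFile ?? null,
      sourcePage: failure.sourcePage ?? null,
    })),
  };
}
```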

Validation is where extracted JSON becomes finance-ready: it should block records with account, billing-period, meter, usage, charge, or duplicate-bill problems before handoff to ERP, operations, or reporting systems.
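For the duplicate-bill part of that gate, one simple approach is a composite key built from the fields named in the checklist. The sketch below is an assumption-laden illustration: the property names are hypothetical, and it leaves out the source-file fingerprint, which you would typically add as a separate hash comparison.

```javascript
// Builds a comparison key from the identifying fields of a bill.
function duplicateKey(bill) {
  return [
    bill.accountNumber,
    bill.serviceAddress,
    bill.billingPeriodStart,
    bill.billingPeriodEnd,
    bill.totalAmountDue,
    bill.statementNumber,
  ]
    .map((part) => String(part ?? "").trim().toLowerCase())
    .join("|");
}

// Returns each bill whose key was already seen, paired with the first occurrence.
function findDuplicates(bills) {
  const seen = new Map();
  const duplicates = [];
  for (const bill of bills) {
    const key = duplicateKey(bill);
    if (seen.has(key)) {
      duplicates.push({ duplicate: bill, firstSeen: seen.get(key) });
    } else {
      seen.set(key, bill);
    }
  }
  return duplicates;
}
```

An exact-match key will miss near-duplicates such as a reissued bill with a corrected total, so treat it as a first filter, not a complete check.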

Production Checklist for Utility Bill OCR APIs

When you evaluate a utility bill OCR API for production, test the workflow against the documents you actually receive rather than a polished sample. If you are still mapping the broader landscape, this overview of workflow options for automated utility data capture compares OCR APIs against bill-management platforms, ESG-focused tools, and AP-reuse approaches before you commit to one path.

Use a representative pilot batch with electricity, gas, water, telecom, scanned PDF, native PDF, multi-meter, estimated-read, and unusual charge-table examples. Then check whether the API can:

  1. Preserve multiple service addresses, service points, meters, rate tiers, taxes, credits, and provider-specific fee tables without collapsing them into one generic summary.
  2. Let you control field names, output structure, formatting, and extraction rules without writing a separate parser for every provider.
  3. Return stable JSON typing, predictable line-item behaviour, and exception details that downstream systems can consume. Many of the same design concerns covered in bank statement extraction API design patterns apply here too.
  4. Provide source-to-output traceability, failed-page reporting, and uncertainty notes so reviewers can verify flagged values quickly.
  5. Support your deployment model, including API access, SDK behaviour, credit handling, and any UI-to-API prompt testing your product or operations team needs.

Be sceptical of unsupported absolutes such as "99% accuracy on all utility providers." The right API is the one that gives you usable JSON, reliable exceptions, and less custom cleanup code after the pilot than before it.
