Build an AP Automation Agent with the OpenAI Agents SDK

AP automation on the OpenAI Agents SDK works because the SDK's primitives map almost one-to-one onto the roles inside an accounts payable workflow. An Agent is an LLM with instructions and a set of tools, and each one plays a single AP role. A @function_tool is a Python function the agent can call, which is how an extractor agent reaches an invoice API, how a matcher reaches a purchase order ledger, and how an approver writes to an ERP. Handoffs let one agent delegate to another, so the extractor can pass a parsed invoice to a three-way matcher and the matcher can pass a clean match result to an approver. Guardrails gate decisions on input and output, which is where a high-value-invoice rule or a vendor-master check belongs. Sessions hold conversational memory keyed to a batch ID, which is how an overnight job survives a worker restart. Runner is the loop driver that runs the whole thing as one call, and the SDK's built-in tracing records every step in a format auditors can read.

This is taken straight from the documented primitives of the OpenAI Agents SDK, which describes the core building blocks as agents (LLMs equipped with instructions and tools), handoffs that let agents delegate to other agents, guardrails for input and output validation, a runner that manages the tool-call loop, sessions for persistent memory, and built-in tracing for visualizing and debugging workflows. Each AP role lines up cleanly with one of those primitives — extraction with a tool, validation with another agent reached by handoff, approval with guardrails, batch resumption with sessions, audit with tracing. The match is close enough that the SDK reads like it was designed for AP, even though it wasn't.

What this article does, end to end, is build a working pipeline that takes one vendor invoice PDF and ends with a posted ERP record. Python is the primary path because that's where the SDK has the most surface area and where the majority of teams ship; TypeScript appears alongside Python at the points where the integration shape differs materially (the @openai/agents package, the Node invoice extraction SDK, the tool() helper with Zod schemas). The reader of this guide already operates on OpenAI infrastructure and wants the code shape — Agent, @function_tool, Runner.run, handoffs, guardrails, sessions, tracing — rather than a framework-choice debate. Readers who landed here from a different ecosystem will find an honest pointer to the Claude Agent SDK and LangGraph equivalents at the end. The piece otherwise stays inside the OpenAI build.

For readers who want the cross-framework conceptual ground first, the framework-neutral agentic invoice processing patterns article covers the agent-shaped AP architecture in vendor-agnostic terms. This guide is the OpenAI-specific instantiation of those patterns — same overall shape, written against the actual primitives the OpenAI Agents SDK exposes.

The build below targets the current SDK, the current Responses API, and a model split in which extraction runs on the strongest available reasoning model and the structured-decision roles run on a faster, lower-cost model: concretely gpt-5.5 for extraction and gpt-5.2 for the validator and approver. The pattern is the point, not the specific model strings, which will rotate as OpenAI ships new versions.

A working GPT accounts payable agent on this stack has three agents talking to each other through handoffs, three tool integrations into real systems (the invoice extractor, the PO ledger, the ERP), two guardrails on the approver, one session per batch, and one trace per invoice that is also the audit log.

Setup and the Minimal AP Agent

Install the Python SDK with pip install openai-agents (Python 3.9 or later). For TypeScript, npm install @openai/agents (Node 18 or later). Both SDKs read OPENAI_API_KEY from the environment, so export it once before the first run:

export OPENAI_API_KEY=sk-...

The smallest useful AP-flavored agent looks like this:

from agents import Agent, Runner

reconciler = Agent(
    name="Invoice Reconciler",
    instructions=(
        "You receive a brief invoice summary as text. "
        "Sum the line totals and compare to the stated invoice total. "
        "Return 'reconciled' if they match within $0.01, otherwise return "
        "'mismatch' and explain which line is the likely cause."
    ),
    model="gpt-5.5",
)

result = Runner.run_sync(
    reconciler,
    "Invoice INV-9034: line 1 widgets 10 @ $12.50 = $125.00, "
    "line 2 freight = $18.00, stated total = $143.00.",
)
print(result.final_output)

That is a complete program. The Agent is the role; Runner.run_sync drives the agent's tool-call loop and returns a result object whose final_output field carries whatever the agent decided to return. Even with no tools attached, the agent is doing real work (reading a free-text invoice summary, applying its instructions, and returning a verdict as plain text, since no output_type is set), and the SDK is running the model-call loop on the developer's behalf.

What Runner actually does on each turn is straightforward. It sends the current conversation (instructions, prior turns, any tool results) to the model, reads the model's response, and decides what to do next. If the response contains a tool call, the runner invokes the tool, captures its return value, appends the result to the conversation, and loops. If the response contains a handoff, the runner switches to the target agent and continues. If the response contains neither, the runner stops and returns the final_output. The whole loop is invisible to the caller — one Runner.run_sync call produces one result.
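
The control flow described above can be pictured with a toy driver. This is only a sketch of the loop's shape, not the SDK's implementation: ToyAgent, ModelStep, toy_run, and the scripted decide functions are hypothetical stand-ins for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ModelStep:
    """One scripted 'model response': a tool call, a handoff, or a final answer."""
    tool: Optional[str] = None
    tool_arg: Optional[str] = None
    handoff_to: Optional[str] = None
    final_output: Optional[str] = None


@dataclass
class ToyAgent:
    name: str
    decide: Callable[[List[str]], ModelStep]  # stand-in for the model call
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)


def toy_run(start: ToyAgent, agents: Dict[str, ToyAgent], user_input: str, max_turns: int = 10):
    """Loop until an agent returns a final output, following tool calls and handoffs."""
    history = [f"user: {user_input}"]
    agent = start
    for _ in range(max_turns):
        step = agent.decide(history)
        if step.tool is not None:
            # Tool call: run it, append the result, and loop again.
            result = agent.tools[step.tool](step.tool_arg or "")
            history.append(f"tool {step.tool}: {result}")
            continue
        if step.handoff_to is not None:
            # Handoff: switch the active agent and keep the same history.
            history.append(f"handoff: {agent.name} -> {step.handoff_to}")
            agent = agents[step.handoff_to]
            continue
        # Neither a tool call nor a handoff: this is the final output.
        return step.final_output, agent.name, history
    raise RuntimeError("max_turns exceeded")


def extractor_decide(history: List[str]) -> ModelStep:
    if not any(item.startswith("tool extract") for item in history):
        return ModelStep(tool="extract", tool_arg="/tmp/inv.pdf")
    return ModelStep(handoff_to="matcher")


def matcher_decide(history: List[str]) -> ModelStep:
    return ModelStep(final_output="matched")


extractor = ToyAgent("extractor", extractor_decide, {"extract": lambda path: f"parsed {path}"})
matcher = ToyAgent("matcher", matcher_decide)
output, last_agent, history = toy_run(extractor, {"matcher": matcher}, "process /tmp/inv.pdf")
```

The caller sees one call and one result, which mirrors how a single Runner.run_sync call hides the intermediate turns.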

Production AP code should use the async variant, await Runner.run(agent, input), so the worker can process multiple invoices concurrently without one extraction call blocking the next. Runner.run_sync is convenient for examples and for one-off scripts; the rest of this guide uses it in code blocks to keep the prose tight, but the same shape works under asyncio.run with Runner.run.
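
A minimal sketch of that concurrency pattern follows, assuming a bounded worker pool is what the deployment wants. To keep it runnable without the SDK or an API key, process_invoice is a hypothetical stand-in for the awaited Runner.run call shown in its comment; process_batch and max_concurrency are illustrative names, not SDK features.

```python
import asyncio
from typing import List


async def process_invoice(pdf_path: str) -> str:
    # Hypothetical stand-in for the real call, which would look like:
    #   result = await Runner.run(agent, f"Process the invoice at {pdf_path}")
    #   return result.final_output
    await asyncio.sleep(0)
    return f"processed {pdf_path}"


async def process_batch(paths: List[str], max_concurrency: int = 5) -> List[object]:
    """Process invoices concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: str) -> str:
        async with semaphore:
            return await process_invoice(path)

    # return_exceptions=True keeps one failed invoice from cancelling the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(bounded(p) for p in paths), return_exceptions=True)


results = asyncio.run(process_batch(["a.pdf", "b.pdf", "c.pdf"], max_concurrency=2))
```

asyncio.gather returns results in input order, so each entry can be matched back to its source file even though the calls overlap.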

There is a single-step alternative worth naming for context. The OpenAI Responses API, called directly with client.responses.parse(...) and a Pydantic model passed as text_format, will return structured invoice data from one model call without any of the SDK's agent or runner machinery. That single-shot path is the right tool when extraction is the only thing the program does; see single-shot OpenAI Structured Outputs extraction in Node.js for that approach end to end. The Agents SDK earns its weight when there are multiple steps after extraction (validation, threshold-gated approval, ERP posting) that benefit from running inside one orchestrated trace rather than as separately-instrumented HTTP calls.

The agent above has no real tools and no handoffs, which is the floor — every subsequent AP role is this same Agent constructor with tools, output types, guardrails, or handoffs added.

The Extraction Agent: Wrapping the Invoice API as a @function_tool

The most idiomatic way to give an Agent real extraction power is the @function_tool decorator. Import it with from agents import function_tool, decorate a regular Python function, and the SDK turns the function into a tool the agent can call. The function's type hints become the tool's JSON schema, the docstring becomes the tool description the model reads when deciding whether to call it, and the function's return value comes back to the agent as the tool result.

For an extraction tool, the body of the function should talk to a real extractor rather than re-implementing OCR or single-call Responses API code inside the agent loop. The invoice extraction service this guide uses exposes both a REST API and an official Python SDK. The SDK is the recommended integration shape — it handles chunked upload, polling, output download, and result typing in one call — so the tool body uses the SDK. The underlying invoice data extraction API is what the SDK wraps; both are documented for direct use, but inside a @function_tool the SDK is the shorter, type-safer path.

Define the structured output first. The agent's output_type is a Pydantic model — result.final_output then comes back as an Invoice instance with typed fields rather than a raw dict:

from typing import List
from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    line_total: float

class Invoice(BaseModel):
    invoice_number: str
    invoice_date: str
    vendor_name: str
    net_amount: float
    tax_amount: float
    total_amount: float
    currency: str
    line_items: List[LineItem]

The extraction tool itself:

import os
import json
import urllib.request
from agents import function_tool
from invoicedataextraction import InvoiceDataExtraction

extractor = InvoiceDataExtraction(
    api_key=os.environ["INVOICE_DATA_EXTRACTION_API_KEY"],
)

@function_tool
def extract_invoice(pdf_path: str) -> Invoice:
    """Extract structured invoice data from a local PDF file path.

    Args:
        pdf_path: Absolute path to an invoice PDF on the local filesystem.

    Returns:
        Invoice: Parsed invoice with header fields and line items.
    """
    result = extractor.extract(
        files=[pdf_path],
        prompt=(
            "Extract invoice number, invoice date, vendor name, "
            "net amount, tax, total, and one row per line item with "
            "description, quantity, unit price, and line total."
        ),
        output_structure="per_invoice",
    )
    with urllib.request.urlopen(result["output"]["json_url"]) as response:
        payload = json.loads(response.read())
    return Invoice(**payload[0])

The agent that uses the tool:

from agents import Agent

invoice_extractor = Agent(
    name="Invoice Extractor",
    instructions=(
        "You receive a local PDF path for a vendor invoice. "
        "Call extract_invoice exactly once with that path and "
        "return the parsed Invoice as your final output."
    ),
    tools=[extract_invoice],
    output_type=Invoice,
    model="gpt-5.5",
)

A few details that matter in production. The extraction SDK accepts file path strings only, not byte buffers or streams — for serverless deployments where the source PDF arrives over HTTP, the @function_tool body should write the PDF to a tempfile before calling extract. The output_structure parameter takes automatic, per_invoice, or per_line_item depending on whether the agent needs one row per invoice (the default for header-only AP) or one row per line item (for line-level matching, which the validator agent in the next section uses). The SDK also exposes staged methods — upload_files, submit_extraction, wait_for_extraction_to_finish, download_output — if the agent needs progress callbacks or longer polling windows than the one-call extract provides. For an agent that processes a single invoice per turn, the one-call shape is the right default.
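
For the serverless case mentioned above, one way to bridge bytes to a file path is a small helper like the one below. It is a sketch using only the standard library; with_temp_pdf is a hypothetical name, and the callable passed in stands for whatever needs a path (inside the tool body, that would be a function wrapping the extractor.extract call).

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def with_temp_pdf(pdf_bytes: bytes, use_path: Callable[[str], T]) -> T:
    """Write pdf_bytes to a temporary .pdf file, call use_path(path), then delete the file."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(pdf_bytes)
        return use_path(path)
    finally:
        # Remove the file even if use_path raises, so failed runs leave nothing behind.
        os.remove(path)


seen = {}


def record(path: str) -> int:
    seen["path"] = path
    seen["existed"] = os.path.exists(path)
    return os.path.getsize(path)


size = with_temp_pdf(b"%PDF-1.4 demo", record)
```

Deleting in a finally block matters on serverless platforms, where temporary storage is small and may be reused across invocations.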

For teams that prefer direct REST over the SDK, the same workflow runs over the documented endpoints; the REST quickstart for the invoice extraction API covers the upload-session, multipart upload, extraction-submit, and polling pattern in detail. The official Python SDK for invoice extraction covers the SDK methods, the result envelope, and the SDK error types in depth. Either path produces the same typed payload; the choice is mostly about how much HTTP plumbing the developer wants to own.

The TypeScript shape is similar. Use the tool() helper from @openai/agents with a Zod schema for the parameters, and the Node SDK (npm install @invoicedataextraction/sdk) for the body:

import { Agent, tool } from "@openai/agents";
import { z } from "zod";
import InvoiceDataExtraction from "@invoicedataextraction/sdk";

const client = new InvoiceDataExtraction({
  api_key: process.env.INVOICE_DATA_EXTRACTION_API_KEY!,
});

const extractInvoice = tool({
  name: "extract_invoice",
  description: "Extract structured invoice data from a local PDF path.",
  parameters: z.object({ pdf_path: z.string() }),
  execute: async ({ pdf_path }) => {
    const result = await client.extract({
      files: [pdf_path],
      prompt: "Extract invoice number, date, vendor, net, tax, total, line items.",
      output_structure: "per_invoice",
    });
    const response = await fetch(result.output.json_url);
    const payload = await response.json();
    return payload[0];
  },
});

Node's SDK is Promise-returning and the agent runtime awaits the tool correctly without further wiring. Both runtimes share the underlying API limits worth knowing inside the tool body: up to 6,000 files per extraction session, single PDFs up to 5,000 pages, and JPG or PNG up to 5 MB per file. For an agent processing one invoice per turn these limits are background; for a batch agent that hands the SDK a list of files in one call, the batch limit becomes the natural unit of work.
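
A batch agent can check those limits before it calls the extractor. The sketch below assumes that "5 MB" means 5 MiB and that file sizes are already known; session_batches, oversized_images, and the .jpeg suffix are illustrative choices rather than part of either SDK, and the page-count limit is not checked because it needs a PDF parser.

```python
from typing import Dict, Iterator, List

MAX_FILES_PER_SESSION = 6_000        # limit quoted in the text above
MAX_IMAGE_BYTES = 5 * 1024 * 1024    # assumes "5 MB" means 5 MiB; confirm against the API docs
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")


def oversized_images(sizes_by_path: Dict[str, int]) -> List[str]:
    """Return image paths whose size exceeds the per-file image limit."""
    return sorted(
        path
        for path, size in sizes_by_path.items()
        if path.lower().endswith(IMAGE_SUFFIXES) and size > MAX_IMAGE_BYTES
    )


def session_batches(paths: List[str], batch_size: int = MAX_FILES_PER_SESSION) -> Iterator[List[str]]:
    """Split a file list into chunks that each fit in one extraction session."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


paths = [f"inv-{i}.pdf" for i in range(13_000)]
batches = list(session_batches(paths))
too_big = oversized_images(
    {"scan.png": 6 * 1024 * 1024, "ok.jpg": 1024, "big.pdf": 50 * 1024 * 1024}
)
```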

The Validator Agent: Tools for Three-Way Matching

Once the extractor agent returns an Invoice, the next role is the validator that runs three-way matching: comparing the invoice against the related purchase order and goods receipt to confirm that what was ordered, what arrived, and what is being billed all agree. In SDK terms that is three tools and one agent that uses them.

The tools each return a typed Pydantic model so the agent can reason about structured results rather than raw text:

from typing import List, Optional
from pydantic import BaseModel
from agents import function_tool

class POLineItem(BaseModel):
    sku: str
    description: str
    quantity_ordered: float
    unit_price: float

class PurchaseOrder(BaseModel):
    po_number: str
    vendor_name: str
    line_items: List[POLineItem]

class ReceiptLine(BaseModel):
    sku: str
    quantity_received: float

class GoodsReceipt(BaseModel):
    po_number: str
    received_date: str
    line_items: List[ReceiptLine]

# erp_client is assumed to be an application-level ERP client configured elsewhere.
@function_tool
def lookup_po(po_number: str) -> PurchaseOrder:
    """Fetch the purchase order with the given PO number from the ERP."""
    return erp_client.get_purchase_order(po_number)

@function_tool
def lookup_receipt(po_number: str) -> Optional[GoodsReceipt]:
    """Fetch the goods receipt for the given PO number, or None if no goods were received yet."""
    return erp_client.get_goods_receipt(po_number)

The match calculation is its own tool so the agent can call it explicitly and the result lands in the trace as a named step:

class LineVariance(BaseModel):
    sku: str
    quantity_delta: float
    unit_price_delta: float
    line_total_delta: float

class MatchResult(BaseModel):
    overall_pass: bool
    failed_checks: List[str]
    line_variances: List[LineVariance]

@function_tool
def calculate_match(
    invoice: Invoice,
    po: PurchaseOrder,
    receipt: Optional[GoodsReceipt],
    quantity_tolerance_pct: float = 2.0,
    price_tolerance_pct: float = 1.0,
) -> MatchResult:
    """Run three-way matching with configurable tolerances and return a structured MatchResult."""
    ...

The tolerances are arguments to the tool rather than constants buried in its body. Most AP teams allow small variances on quantity and unit price — common defaults are around 1 to 2 percent — but the right values vary by vendor category and contract terms. Exposing tolerances as parameters lets the agent (or the calling code, via instructions) set them per run without re-deploying the tool.
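
The article leaves the body of calculate_match elided, so the following is only one possible shape for the tolerance logic, not the author's implementation. It uses plain dataclasses instead of the Pydantic models, and it assumes invoice lines carry a sku so they can be joined to PO lines; Line, pct_delta, and match_lines are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Line:
    sku: str
    quantity: float
    unit_price: float


def pct_delta(actual: float, expected: float) -> float:
    """Absolute percentage difference of actual relative to expected."""
    if expected == 0:
        return 0.0 if actual == 0 else float("inf")
    return abs(actual - expected) / abs(expected) * 100.0


def match_lines(
    invoice_lines: List[Line],
    po_lines: List[Line],
    received_qty: Optional[Dict[str, float]],
    quantity_tolerance_pct: float = 2.0,
    price_tolerance_pct: float = 1.0,
) -> List[str]:
    """Return the failed checks; an empty list means the match passes."""
    po_by_sku = {line.sku: line for line in po_lines}
    failed: List[str] = []
    for line in invoice_lines:
        po_line = po_by_sku.get(line.sku)
        if po_line is None:
            failed.append(f"{line.sku}: not on purchase order")
            continue
        if pct_delta(line.unit_price, po_line.unit_price) > price_tolerance_pct:
            failed.append(f"{line.sku}: unit price outside tolerance")
        # Three-way: billed quantity vs received quantity.
        # Two-way (received_qty is None): billed quantity vs ordered quantity.
        if received_qty is None:
            expected_qty = po_line.quantity
        else:
            expected_qty = received_qty.get(line.sku, 0.0)
        if pct_delta(line.quantity, expected_qty) > quantity_tolerance_pct:
            failed.append(f"{line.sku}: quantity outside tolerance")
    return failed


po = [Line("W-1", 100, 12.50)]
invoice = [Line("W-1", 101, 12.55)]
clean = match_lines(invoice, po, {"W-1": 100})
short_receipt = match_lines(invoice, po, {"W-1": 90})
two_way = match_lines(invoice, po, None)
```

In the first call the price is off by 0.4 percent and the quantity by 1 percent, both inside the defaults; in the second, billing 101 units against 90 received is well outside the 2 percent quantity tolerance.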

The validator agent itself:

validator = Agent(
    name="Three-Way Matcher",
    instructions=(
        "You receive a structured Invoice. Look up the related purchase order "
        "using lookup_po, look up the goods receipt using lookup_receipt, then "
        "call calculate_match. Return the MatchResult as your final output. "
        "If lookup_receipt returns None, treat this as a service invoice and "
        "run two-way matching: still call calculate_match, passing None for receipt. "
        "If lookup_po raises because the PO number is unknown, return a "
        "MatchResult with overall_pass=False and a single failed_check explaining "
        "the missing PO."
    ),
    tools=[lookup_po, lookup_receipt, calculate_match],
    output_type=MatchResult,
    model="gpt-5.2",
)

A few notes on what the SDK is actually doing here. The tool-call loop will execute these three tools in whatever order the model decides, so the instructions guide the model toward the natural order rather than encoding the order in code. This matters in cases where the model legitimately should reorder — for example, an instruction extension like "if the invoice has no PO number on it, skip lookup_po and return a MatchResult flagging the missing PO" lets the agent handle non-PO invoices without a separate code path. A hardcoded chain in Python cannot do that without a branch the developer has to maintain; the agent does it by reading the instructions.

Service invoices without a goods receipt — software subscriptions, professional fees, utilities — are two-way matching, not three-way. lookup_receipt returning None is the signal, and the instructions tell the agent how to react. This is one of the places where SDK instructions become real AP business logic rather than scaffolding.

What happens when a tool fails matters too. lookup_po will raise if the PO number is unknown or the ERP is unreachable. By default, the SDK surfaces the exception to the model as an error string, the model decides what to do, and the agent continues. For predictable AP failures (PO not found, vendor not in master, ERP timeout) the failure_error_function parameter on @function_tool lets the developer return a structured error message the model can reason about, rather than a raw stack trace. For unrecoverable failures, pass failure_error_function=None to re-raise and let the worker's error handling catch it.
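
A sketch of such an error function follows. It assumes the callback receives the run context and the raised exception and returns the string shown to the model; PONotFoundError and ERPUnavailableError are hypothetical exception types for this guide's erp_client, and the JSON field names are illustrative.

```python
import json


class PONotFoundError(Exception):
    """Hypothetical: raised by the ERP client when a PO number does not exist."""


class ERPUnavailableError(Exception):
    """Hypothetical: raised by the ERP client when the ERP cannot be reached."""


def ap_tool_error(ctx, error: Exception) -> str:
    """Turn a tool exception into a short structured message the model can reason about."""
    if isinstance(error, PONotFoundError):
        code, retryable = "po_not_found", False
    elif isinstance(error, (ERPUnavailableError, TimeoutError)):
        code, retryable = "erp_unavailable", True
    else:
        code, retryable = "unexpected_error", False
    return json.dumps({"error_code": code, "retryable": retryable, "detail": str(error)})


# Attaching it requires the SDK, so it is shown as a comment to keep this block dependency-free:
#   @function_tool(failure_error_function=ap_tool_error)
#   def lookup_po(po_number: str) -> PurchaseOrder: ...

message = ap_tool_error(None, PONotFoundError("PO-7781 is not in the ERP"))
```

A stable error_code gives the agent's instructions something concrete to branch on, for example returning a failed MatchResult on po_not_found but retrying on erp_unavailable.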

The deeper logic of which fields are compared, how header-level versus line-level variance is weighted, and which checks belong in the matcher versus the approver is covered in post-extraction invoice validation rules. The agent itself is a thin orchestration over those rules — it decides which tools to call and in what order, then returns the structured MatchResult the next agent in the pipeline acts on.

The Approver Agent and Guardrails for High-Value Invoices

The approver is the agent that converts a passing MatchResult into either an auto-approval and ERP posting or an escalation to a human reviewer. This is where the SDK's guardrails primitive earns its place: a guardrail is a named, audit-visible check that runs on the agent's input (input guardrail) or on its final output (output guardrail), and trips a tripwire when its condition is violated. A tripped tripwire halts the run by raising an exception (InputGuardrailTripwireTriggered or OutputGuardrailTripwireTriggered) to the code that called the runner, so the calling code is where a blocked invoice gets routed to human review. A tripped guardrail is recorded in the trace under its name, which gives auditors a clean line of sight into why a particular invoice was escalated, something a conditional buried in a tool body cannot provide.

The approver's tools are the same shape as the validator's:

class ApprovalDecision(BaseModel):
    approved: bool
    approver_id: str
    decision_timestamp: str
    notes: Optional[str] = None

class ERPPostingReceipt(BaseModel):
    erp_voucher_id: str
    posting_timestamp: str

def is_approved_vendor(vendor_name: str) -> bool:
    """Plain helper so both the tool and the guardrail can call it."""
    return vendor_master.has_approved_vendor(vendor_name)

@function_tool
def check_vendor_master(vendor_name: str) -> bool:
    """Return True if the vendor exists in the vendor master and is approved for payment."""
    return is_approved_vendor(vendor_name)

@function_tool
def request_human_approval(invoice: Invoice, reason: str) -> ApprovalDecision:
    """Queue a human approval request and return once a reviewer has decided."""
    return approval_queue.submit_and_wait(invoice=invoice, reason=reason)

@function_tool
def post_to_erp(invoice: Invoice, approval_decision: ApprovalDecision) -> ERPPostingReceipt:
    """Post the approved invoice to the ERP. Requires a positive approval_decision."""
    return erp_client.post_invoice(invoice, approval_decision)

The high-value-invoice guardrail is the canonical example. Any invoice above a configurable threshold — $50,000 in this build — cannot auto-approve regardless of how clean the MatchResult is. The structured Invoice and MatchResult reach the guardrail through the agent's run context, which is the SDK's way of passing typed state alongside the conversation. Define a context class and a Runner context= value when the pipeline starts, then read it inside each guardrail:

from dataclasses import dataclass
from typing import Optional
from agents import input_guardrail, GuardrailFunctionOutput, RunContextWrapper

@dataclass
class ApproverContext:
    invoice: Optional[Invoice] = None
    match_result: Optional[MatchResult] = None

HIGH_VALUE_THRESHOLD = 50_000.0

@input_guardrail
async def high_value_threshold(
    ctx: RunContextWrapper[ApproverContext],
    agent: Agent,
    input,
) -> GuardrailFunctionOutput:
    invoice = ctx.context.invoice
    over_threshold = invoice.total_amount > HIGH_VALUE_THRESHOLD
    return GuardrailFunctionOutput(
        output_info={
            "invoice_total": invoice.total_amount,
            "threshold": HIGH_VALUE_THRESHOLD,
        },
        tripwire_triggered=over_threshold,
    )

The vendor-master check follows the same shape and is more than a hygiene rule. AP fraud schemes consistently exploit invoices submitted by vendors that are not on the master file, often paid by an approver moving quickly through a queue. A guardrail that blocks the agent from auto-approving any unknown vendor — and surfaces that block in the trace by name — is a real control, not a toy example:

@input_guardrail
async def vendor_master_check(
    ctx: RunContextWrapper[ApproverContext],
    agent: Agent,
    input,
) -> GuardrailFunctionOutput:
    vendor = ctx.context.invoice.vendor_name
    on_master = is_approved_vendor(vendor)
    return GuardrailFunctionOutput(
        output_info={"vendor": vendor, "on_master": on_master},
        tripwire_triggered=not on_master,
    )

Wire both guardrails onto the approver:

approver = Agent(
    name="AP Approver",
    instructions=(
        "You receive an Invoice and a MatchResult. "
        "A tripped input guardrail stops the run before you act, "
        "so you only decide on invoices that passed both guardrails. "
        "If overall_pass is True, "
        "call post_to_erp with an auto-approval ApprovalDecision. "
        "If overall_pass is False, "
        "call request_human_approval with a reason that names the failing check, "
        "then once you have a positive ApprovalDecision call post_to_erp. "
        "If the human reviewer rejects, return a final summary stating "
        "the rejection and do not post."
    ),
    tools=[request_human_approval, post_to_erp],
    input_guardrails=[high_value_threshold, vendor_master_check],
    model="gpt-5.2",
)

A pattern worth being explicit about: post_to_erp is not a guardrail. Guardrails are read-only checks that gate the agent, not actions. The decision to post belongs in the agent's tool-call sequence so the trace records the posting and the prior approval as a coherent chain.

From the runner's point of view, request_human_approval is simply a long-running tool call: the run stays open while the tool waits, and the agent continues once the tool returns the reviewer's decision. What the SDK does not build is the approval UI itself. Whether that is an email link reviewers click, a Slack approval button, a dedicated AP admin panel, or a row in a queue table a controller reviews each morning is the developer's job. The SDK gives back hours of agent-state plumbing; it does not give back the design of the approval surface.

There is one principled reason to put the guardrails here rather than as if statements at the top of request_human_approval or post_to_erp. A conditional inside a tool body produces no audit artifact beyond the tool's return value — an auditor reading the trace sees only that the tool returned something, not that a specific named check was evaluated. A guardrail, by contrast, appears in the trace as a named check with its input data and its tripwire outcome. For SOX and SOC 2 environments where the question "was this control evaluated for this invoice" must be answerable from the audit log alone, that named visibility is the difference between a control that exists and a control that can be evidenced.
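
When a tripwire trips, the SDK raises an exception to the code that called the runner, so the escalation reason has to be built at the call site. The sketch below assumes the exception exposes guardrail_result with a guardrail (offering get_name()) and an output carrying output_info, which is how the SDK's InputGuardrailTripwireTriggered is understood to be shaped; escalation_reason and queue_for_review are hypothetical helpers, and the fake exception exists only to exercise the helper without the SDK installed.

```python
from types import SimpleNamespace


def escalation_reason(exc) -> str:
    """Build a reviewer-facing reason from a tripped-guardrail exception."""
    result = exc.guardrail_result
    name = result.guardrail.get_name()
    return f"guardrail '{name}' tripped: {result.output.output_info}"


# Call-site sketch (requires the SDK; queue_for_review is hypothetical):
#   try:
#       result = await Runner.run(approver, payload, context=ctx)
#   except InputGuardrailTripwireTriggered as exc:
#       queue_for_review(ctx.invoice, escalation_reason(exc))

# Duck-typed stand-in for the SDK exception, used only to exercise the helper.
fake_exc = SimpleNamespace(
    guardrail_result=SimpleNamespace(
        guardrail=SimpleNamespace(get_name=lambda: "high_value_threshold"),
        output=SimpleNamespace(
            output_info={"invoice_total": 72000.0, "threshold": 50000.0}
        ),
    )
)
reason = escalation_reason(fake_exc)
```

Passing the guardrail's name and output_info through to the review queue keeps the human reviewer's view consistent with what the trace records.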

Handoffs: Composing the AP Pipeline

The extractor, validator, and approver agents are three independent roles until they are wired together with handoffs. The wiring itself is one parameter on each agent:

# Defined once, then re-created with handoffs wired in.
validator_with_handoffs = validator.clone(handoffs=[approver])
extractor_with_handoffs = invoice_extractor.clone(handoffs=[validator_with_handoffs])

# One call runs the whole pipeline. The shared ApproverContext is
# populated by the extractor's and validator's tools as they run, so
# the approver's guardrails see a fully filled context when they fire.
result = Runner.run_sync(
    extractor_with_handoffs,
    "Process the invoice at /var/inbox/invoice-9034.pdf",
    context=ApproverContext(),
)
print(result.final_output)
print(result.last_agent.name)

The extractor's extract_invoice tool and the validator's calculate_match tool each accept a leading ctx: RunContextWrapper[ApproverContext] parameter (the @function_tool decorator detects it and injects the wrapper at call time), and write into ctx.context.invoice and ctx.context.match_result respectively. By the time the approver's guardrails fire, the context is fully populated.

Runner.run walks the agents in sequence. The runner invokes the extractor; the extractor's model calls extract_invoice with the PDF path; the SDK runs the tool, captures the typed Invoice, and feeds it back to the extractor's next model turn. The extractor's instructions tell it to hand off to the matcher once it has a structured Invoice, so the model emits a handoff and the runner switches context to the validator. The validator runs its two lookup tools and calculate_match, produces a MatchResult, and hands off to the approver. The approver's input guardrails are meant to fire first (high_value_threshold checks the invoice total, vendor_master_check confirms the vendor is approved), and then the agent either calls post_to_erp directly or routes through request_human_approval. One caveat to verify against the installed SDK version: the SDK's guardrail documentation states that an agent's input guardrails run only when that agent is the first agent in a run, so a pipeline that depends on these controls should either start the approver with its own Runner.run call, passing the same context, or confirm that the guardrail spans actually appear in the trace. result.final_output carries whatever the last agent returned; result.last_agent tells the caller which agent produced it, which matters when error handling needs to distinguish "approver rejected" from "validator escalated".

Two alternative architectures are worth comparing on the merits, because they will both look reasonable on the whiteboard.

The first is a hardcoded chain — an extractor agent, then a validator agent, then an approver agent, each called by separate Runner.run invocations in plain Python, with the developer threading the result of one into the input of the next. This is simpler in one direction: the step order is deterministic, and the code reads top to bottom. It is worse in two directions. Every invoice takes exactly the same path, even when context strongly suggests a different one — a known repeat invoice from a trusted vendor on a small recurring contract has a different risk profile from a first-time invoice in an unfamiliar currency, and a hardcoded chain treats them identically. And the audit picture is fragmented: each Runner.run produces its own trace, the connection between them lives in the developer's own logging, and an auditor has to reconstruct the pipeline from three separate artifacts rather than reading one.

The second is the single mega-agent — one Agent with all the tools (extract_invoice, lookup_po, lookup_receipt, calculate_match, check_vendor_master, request_human_approval, post_to_erp) and one long instructions block covering every responsibility. This compiles. It runs. It produces a trace. It is still worse than three agents with handoffs, because role separation is gone. Every tool sits in every prompt the model sees, the instructions balloon to several pages to cover the full responsibility surface, and the model's judgment on extraction is no longer independent of its judgment on approval — the same context shaping how the model reads a line item also shapes whether it decides to escalate, which is the wrong direction for separation of duties. An auditor reading the trace cannot tell where extraction ended and approval began; they read one undifferentiated stream of tool calls. Multi-agent with handoffs is more code up front and more correct over the long run.

One architectural detail to be honest about. A handoff is an LLM decision, not a guaranteed deterministic transition. The extractor's model normally hands off to the validator because the instructions tell it to, but a model that misreads its instructions or is given an ambiguous input could in principle return a final_output rather than hand off, or hand off to the wrong agent. In production this happens rarely and the trace makes it obvious when it does, but it can happen. Where strict determinism is a regulatory requirement — for example, an AP control that must prove every invoice over a certain threshold passed through the same set of named agents in the same order — the answer is to constrain the handoff with an output_type schema that requires the structured next-step decision, or to run the pipeline as explicit Runner.run calls per step rather than one handoff chain. The SDK's flexibility is a strength for the common case and a risk when used without thinking about which mode the workflow needs.
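
The explicit per-step alternative can be sketched as a fixed list of steps run in order, with the visited path recorded for the audit log. This is an illustration under stated assumptions: run_fixed_pipeline is a hypothetical helper, and the lambdas stand in for separate Runner.run calls that each return an agent's final_output.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[object], object]]


def run_fixed_pipeline(initial_input: object, steps: List[Step]):
    """Run every step in the given order and record which steps were visited."""
    visited: List[str] = []
    payload = initial_input
    for name, run_step in steps:
        # In a real build this would be one Runner.run call per agent,
        # passing the previous agent's final_output as the next input.
        payload = run_step(payload)
        visited.append(name)
    return payload, visited


steps: List[Step] = [
    ("Invoice Extractor", lambda path: {"invoice_number": "INV-9034", "source": path}),
    ("Three-Way Matcher", lambda invoice: {**invoice, "overall_pass": True}),
    ("AP Approver", lambda matched: "posted" if matched["overall_pass"] else "escalated"),
]
outcome, visited = run_fixed_pipeline("/var/inbox/invoice-9034.pdf", steps)
```

Because the order lives in code rather than in model output, the visited list is the same for every invoice. The trade-off is the fragmented trace described earlier.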

Sessions and Tracing for Batch Idempotency and Audit

Two operational concerns dominate any AP automation that actually ships: surviving worker crashes mid-batch without reprocessing already-posted invoices, and producing an audit trail an external reviewer can read. The SDK's sessions handle the first; its built-in tracing handles the second.

The session pattern for a batch is one line of setup and one parameter on the runner:

from agents import SQLiteSession

batch_id = "2026-05-22-overnight-batch"
session = SQLiteSession(batch_id, "ap_batch_sessions.db")

result = Runner.run_sync(
    extractor_with_handoffs,
    "Process the invoice at /var/inbox/invoice-9034.pdf",
    session=session,
)

The session is keyed on batch_id, and the conversation history (every model turn, every tool call, every tool result, every handoff) is persisted to ap_batch_sessions.db. When the worker crashes mid-batch and the process restarts, instantiating SQLiteSession(batch_id, "ap_batch_sessions.db") with the same arguments reattaches to the same history, and the next Runner.run call starts with those earlier turns already in its context rather than from an empty conversation. For an overnight batch of several thousand invoices, that turns a worker crash from a full reprocessing job into a continuation of the same job.

Be specific about what sessions persist and what they do not. Sessions persist the agent's conversational state: what the model said, what the tools returned, which handoffs fired. They do not persist external side effects. If the approver agent already called post_to_erp for invoice 9034 before the crash, the ERP holds the voucher and the session holds the record of the tool call. On resume, the agent should not call post_to_erp again for that invoice, because the prior tool call is in the conversation history, but that is model behavior rather than a guarantee. Sessions are not a duplicate-payment safeguard on their own. The post_to_erp tool body itself should send an idempotency key the ERP can deduplicate on (a hash of the invoice number plus vendor plus posting date is a common choice), so that any path that re-calls the tool (a session bug, a manual replay, a coding mistake) cannot produce a second voucher. Sessions handle the agent-state question; idempotency on the side-effect-producing tools handles the duplicate-action question.
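
The idempotency key described above can be derived with the standard library alone. This is a minimal sketch: erp_idempotency_key is a hypothetical helper, the normalization rules are illustrative, and how the ERP accepts the key (header, field, or reference number) depends on the ERP.

```python
import hashlib


def erp_idempotency_key(invoice_number: str, vendor_name: str, posting_date: str) -> str:
    """Derive a stable key from invoice number, vendor, and posting date."""
    parts = [
        invoice_number.strip().upper(),
        " ".join(vendor_name.split()).upper(),  # collapse whitespace, ignore case
        posting_date.strip(),
    ]
    # A separator that cannot appear in the fields avoids collisions such as
    # ("AB", "C") versus ("A", "BC").
    canonical = "\x1f".join(parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key_a = erp_idempotency_key("INV-9034", "Acme  Widgets Ltd", "2026-05-22")
key_b = erp_idempotency_key(" inv-9034 ", "acme widgets ltd", "2026-05-22")
key_c = erp_idempotency_key("INV-9035", "Acme Widgets Ltd", "2026-05-22")
```

The posting date has to be a value that stays fixed across retries, such as the batch's posting date, rather than the wall-clock time of the call. Otherwise a replay on the next day would produce a different key and defeat the deduplication.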

The in-memory variant SQLiteSession("batch_id") (no path argument) dies with the process and is fine for unit tests and ad-hoc scripts. Production AP almost always wants the file-backed variant so a worker restart preserves state. On long batches with many invoices per session the history grows without bound, and replaying every prior turn back into the model on each invoice is both expensive and noisy — the right pattern is one session per invoice or per natural batch boundary rather than one session per worker lifetime, so each invoice's processing draws from its own relevant window.

Tracing is on by default. Every Agent invocation, every @function_tool call, every handoff, every guardrail check, and every model turn is recorded with timing, input, and output. The trace dashboard renders this as a tree: one root span per Runner.run, child spans per agent activation, grandchild spans per tool call. For one invoice's pipeline run, the trace contains, in order: the Invoice Extractor agent activation; the extract_invoice tool span with the PDF path as input and the structured Invoice as output; the handoff to Three-Way Matcher; three lookup spans (lookup_po, lookup_receipt, calculate_match) with their typed inputs and the MatchResult returned; the handoff to AP Approver; named guardrail spans (high_value_threshold: passed, vendor_master_check: passed); the post_to_erp span with the ERPPostingReceipt returned. Each span carries a timestamp; the whole trace replays the pipeline end to end.

For SOX, SOC 2, and GoBD compliance, this trace is the audit log. An auditor asking "which guardrails fired for invoice 9034, in what order, with what evidence" answers that question by opening one trace. The named guardrail check appears as its own span with tripwire_triggered: false, so the answer to "was the high-value threshold control evaluated for this invoice" is visible without any inference. This is why earlier sections placed guardrails as SDK primitives rather than conditionals — the trace artifact is the difference.

Most regulated environments will need traces shipped to an in-house log store rather than relied on as a vendor-side artifact. The SDK exposes a trace context manager under agents.tracing for explicit scoping; a trace ID can be generated up front with gen_trace_id and passed to that context manager, so the application knows which trace belongs to which invoice. For export, add_trace_processor registers a custom processor that receives each trace and span as it completes. A common pattern is a small processor, or a background process fed by one, that writes completed traces to the customer's existing audit store (a SIEM, an S3 archive bucket, a compliance database), so the audit-retention policy lives where the rest of the customer's compliance evidence lives. The trace dashboard remains the working surface for debugging; the in-house store is the system of record.
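One way to picture the in-house export is a sink that appends each completed trace and span to a JSON Lines file. The sketch below uses only the standard library so it runs on its own. The class name JsonlAuditSink and the file name are hypothetical. The hook names and the export() method mirror the SDK's tracing-processor interface as an assumption to verify against the installed version; in a real pipeline the class would subclass the SDK's processor base class and be registered with add_trace_processor.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional


class JsonlAuditSink:
    """Append exported traces and spans to a JSON Lines audit file (sketch)."""

    def __init__(self, path: str) -> None:
        self._path = path

    def _append(self, kind: str, record: Optional[Dict[str, Any]]) -> None:
        if record is None:  # nothing exportable for this item
            return
        # default=str keeps non-JSON values (timestamps, enums) from raising.
        line = json.dumps({"kind": kind, "record": record}, sort_keys=True, default=str)
        with open(self._path, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")

    # Hook names below are assumed to match the SDK's processor interface.
    def on_trace_start(self, trace: Any) -> None:
        pass

    def on_trace_end(self, trace: Any) -> None:
        self._append("trace", trace.export())

    def on_span_start(self, span: Any) -> None:
        pass

    def on_span_end(self, span: Any) -> None:
        self._append("span", span.export())

    def shutdown(self) -> None:
        pass

    def force_flush(self) -> None:
        pass


class _StandInSpan:
    """Test double standing in for an SDK span; the payload shape is invented."""

    def export(self) -> Dict[str, Any]:
        return {"span_data": {"type": "guardrail", "name": "high_value_threshold", "triggered": False}}


audit_path = os.path.join(tempfile.mkdtemp(), "ap_audit.jsonl")
sink = JsonlAuditSink(audit_path)
sink.on_span_end(_StandInSpan())
with open(audit_path, encoding="utf-8") as fh:
    audit_lines = [json.loads(line) for line in fh]
```

A local file is only the simplest possible target; the same hooks could forward records to a SIEM or an archive bucket instead.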

Where these patterns sit inside a broader processing system — queue topology, dead-letter handling, retry policy, observability beyond the agent layer — is covered in the reference architecture for invoice processing pipelines. The session-plus-trace pattern above is the agent-layer answer to idempotency and audit; the surrounding plumbing is the system-layer answer to throughput and reliability.

Production Scope: Cost, Latency, and Where the SDK Stops

A working pipeline against a single invoice is not the same thing as a production AP system. The SDK ends at the agent loop; the rest is the developer's. The honest scope notes below cover the questions that come up between a prototype and a deployment.

Cost per invoice. For a typical AP invoice — three to six line items, two to three pages — the pipeline consumes a few thousand input tokens for the extracted invoice body plus the validator's lookup results, plus a few hundred output tokens across the three agents' decisions. On the current gpt-5.5 family the cost lands in the low single-digit US-cents range per invoice, dominated by the extraction step where the model first sees the structured invoice. The validator and approver agents are cheaper because their inputs are smaller (a MatchResult is far shorter than the full Invoice) and their outputs are short structured decisions. Specific numbers will shift as OpenAI pricing changes, but the shape — extraction is the most expensive step, validation and approval are cheaper — is stable across model generations. For high-volume AP teams pushing tens of thousands of invoices a month, dropping the validator and approver to gpt-5.2 (lower per-token cost, sufficient reasoning for the structured decisions these agents make) and keeping extraction on gpt-5.5 is the common cost-shape optimisation.
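The cost shape is easy to check with a back-of-envelope calculation. In the sketch below every number is a placeholder: the token counts are picked from the ranges described above, and the per-million-token prices are invented for illustration and are not OpenAI's prices. Substitute current pricing and measured token usage before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepUsage:
    """Token usage and prices for one agent step (all values are placeholders)."""

    name: str
    input_tokens: int
    output_tokens: int
    usd_per_million_input: float
    usd_per_million_output: float

    def cost_usd(self) -> float:
        # Cost = tokens * price per token, with prices quoted per million tokens.
        return (
            self.input_tokens * self.usd_per_million_input
            + self.output_tokens * self.usd_per_million_output
        ) / 1_000_000


# Hypothetical usage and prices: the extractor runs on a pricier model and
# sees the most tokens; the validator and approver run on a cheaper one.
steps = [
    StepUsage("extractor", 3000, 200, 5.0, 15.0),
    StepUsage("validator", 1000, 100, 2.0, 8.0),
    StepUsage("approver", 500, 100, 2.0, 8.0),
]

cost_by_step = {step.name: step.cost_usd() for step in steps}
total_cents = 100 * sum(cost_by_step.values())
most_expensive = max(cost_by_step, key=cost_by_step.get)
```

With these placeholder inputs the total comes to about 2.26 US cents per invoice and the extractor is the most expensive step, which matches the shape described above without claiming anything about real prices.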

Latency per invoice. The extraction tool call dominates wall-clock time, because the underlying extraction service processes a page in roughly one to eight seconds (closer to two seconds per page on larger batches, where parallelisation inside the service is more efficient). For a two-page invoice, that is two to sixteen seconds before the agent loop continues. The validator and approver each add a few seconds of model time across their tool calls and decision turns. End-to-end, a single invoice through the full pipeline serially is tens of seconds, not single-digit seconds. The right parallelism boundary is per invoice, not per pipeline step — process many invoices concurrently with await Runner.run on each, rather than trying to parallelise within one invoice's three sequential agents. The extractor, validator, and approver depend on each other's outputs; the invoices in a batch do not.
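The per-invoice parallelism boundary can be sketched with asyncio. To keep the example runnable without the SDK or network access, fake_pipeline below is a stand-in for the real call (something like await Runner.run on the extractor agent with the invoice path); the function names, the concurrency limit, and the file names are all assumptions.

```python
import asyncio
from typing import Awaitable, Callable, List


async def process_batch(
    invoice_paths: List[str],
    process_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 8,
) -> List[object]:
    """Run one full pipeline per invoice, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> str:
        # The semaphore caps in-flight invoices so the extraction service
        # and the model API are not flooded by a large batch.
        async with semaphore:
            return await process_one(path)

    # return_exceptions=True returns failures in place, so one bad invoice
    # does not hide the results of the others. Results keep input order.
    return await asyncio.gather(
        *(guarded(path) for path in invoice_paths),
        return_exceptions=True,
    )


async def fake_pipeline(path: str) -> str:
    """Stand-in for the three-agent pipeline run on a single invoice."""
    await asyncio.sleep(0.01)
    if path.endswith("bad.pdf"):
        raise ValueError(f"could not process {path}")
    return f"posted:{path}"


paths = ["a.pdf", "bad.pdf", "c.pdf"]
results = asyncio.run(process_batch(paths, fake_pipeline, max_concurrency=2))
```

Each invoice still runs its extractor, validator, and approver in sequence inside process_one; only the invoices overlap.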

ERP write-token security. The token that authenticates post_to_erp to the ERP is the credential that creates payment-eligible vouchers. It must never appear in any context the LLM can see. The pattern is: the tool body reads the token from the host environment (a secrets manager mount, an environment variable in the worker process, a per-request fetch from a vault), uses it inside the tool body to make the HTTP call to the ERP, and never includes it in tool arguments, tool return values, or any other field the agent reads. Models can be prompted, accidentally or maliciously, into echoing context they have seen; an AP write token in the agent's conversation history is the worst possible thing for that vector to reach. Treat the boundary between tool body (token-aware) and tool interface (token-blind) as load-bearing.
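The token-aware body and token-blind interface can be sketched as follows. The environment variable name ERP_WRITE_TOKEN, the /vouchers path, the response fields, and the send callable are hypothetical; in the real pipeline the body would sit behind @function_tool with an HTTP client bound in application code, so that neither the client nor the token is a tool argument.

```python
import os
from typing import Callable, Dict

# send(path, headers, payload) -> response dict; a stand-in for an HTTP client.
Sender = Callable[[str, Dict[str, str], Dict[str, object]], Dict[str, object]]


def post_to_erp_body(invoice_number: str, amount_cents: int, send: Sender) -> Dict[str, object]:
    """Tool body: token-aware inside, token-blind at the interface."""
    # Read the credential from the host environment at call time. It never
    # arrives as an argument, so it never passes through the model.
    token = os.environ["ERP_WRITE_TOKEN"]  # hypothetical variable name
    headers = {"Authorization": f"Bearer {token}"}
    payload = {"invoice_number": invoice_number, "amount_cents": amount_cents}
    response = send("/vouchers", headers, payload)  # hypothetical endpoint
    # Return an explicit allow-list of fields instead of the raw response,
    # in case the ERP echoes request headers or other sensitive data.
    return {"voucher_id": response["voucher_id"], "status": response["status"]}


def fake_send(path: str, headers: Dict[str, str], payload: Dict[str, object]) -> Dict[str, object]:
    """Test double for the ERP that (badly) echoes the request headers."""
    return {"voucher_id": "VCH-1", "status": "posted", "debug_headers": headers}


os.environ["ERP_WRITE_TOKEN"] = "demo-token-not-a-real-secret"
receipt = post_to_erp_body("INV-9034", 125000, fake_send)
```

The allow-list on the return value matters as much as the environment read: whatever the tool returns lands in the conversation history and, with sessions enabled, in the session store.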

Long-running approvals. When request_human_approval blocks for hours or days because the approver is out of office or the escalation chain is long, do not hold the agent loop open on it. The right pattern is for the tool to return a pending ApprovalDecision carrying an approval ID. That pending result is recorded in the session along with the rest of the run, the worker releases the request, and a webhook from the approval UI later resumes the work by re-invoking Runner.run with the same session and a synthetic input indicating the human decision. The SDK supports the resume side cleanly because of how sessions persist conversation state; what the SDK does not build is the webhook listener that catches the approval signal, the storage of pending approval IDs across worker restarts, or the approval UI itself. Those are application code.
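The application-side pieces named above (storing pending approval IDs and turning a webhook into a synthetic input) might look like the sketch below. It uses sqlite3 from the standard library. The ApprovalDecision dataclass is a simplified stand-in for the article's type, and the PendingApprovalStore class, its table layout, and the wording of the synthetic message are assumptions.

```python
import sqlite3
import uuid
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ApprovalDecision:
    """Simplified stand-in for the pipeline's ApprovalDecision type."""

    approval_id: str
    status: str  # "pending", "approved", or "rejected"


class PendingApprovalStore:
    """Maps approval IDs to the session that is waiting on them (sketch)."""

    def __init__(self, db_path: str = ":memory:") -> None:
        # Use a file path in production so pending approvals survive restarts.
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS pending_approvals ("
            "approval_id TEXT PRIMARY KEY, "
            "session_id TEXT NOT NULL, "
            "invoice_number TEXT NOT NULL)"
        )
        self._db.commit()

    def open_request(self, session_id: str, invoice_number: str) -> ApprovalDecision:
        """Called from the tool body: record the request and return immediately."""
        approval_id = uuid.uuid4().hex
        self._db.execute(
            "INSERT INTO pending_approvals VALUES (?, ?, ?)",
            (approval_id, session_id, invoice_number),
        )
        self._db.commit()
        return ApprovalDecision(approval_id, "pending")

    def resolve(self, approval_id: str, approved: bool) -> Optional[Tuple[str, str]]:
        """Called from the webhook: return (session_id, synthetic input) or None."""
        row = self._db.execute(
            "SELECT session_id, invoice_number FROM pending_approvals WHERE approval_id = ?",
            (approval_id,),
        ).fetchone()
        if row is None:
            return None  # unknown or already resolved: ignore duplicate webhooks
        self._db.execute("DELETE FROM pending_approvals WHERE approval_id = ?", (approval_id,))
        self._db.commit()
        session_id, invoice_number = row
        verdict = "approved" if approved else "rejected"
        message = f"Human decision for invoice {invoice_number} (approval {approval_id}): {verdict}."
        return session_id, message


store = PendingApprovalStore()
pending = store.open_request("2026-05-22-overnight-batch", "INV-9034")
resume = store.resolve(pending.approval_id, approved=True)
duplicate = store.resolve(pending.approval_id, approved=True)
```

The webhook handler would take the returned session ID and message, reopen the session, and pass the message as the input to the next Runner.run call.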

PII in invoice line items. Invoices routinely include personal data — consultant names on professional-services invoices, employee references on expense reimbursements, recipient details on payroll-adjacent items, addresses on shipping-related charges. Two things follow. First, the OpenAI organisation should be configured for the data region the customer's compliance program requires (EU, UK, US), and the chosen model must be available in that region — not all model versions land in every region simultaneously, and an extraction agent in the wrong region is a compliance gap the trace dashboard will not flag. Second, for AP categories where line-item PII is unusually sensitive (legal services with named matter parties, HR-adjacent payments), redact the relevant fields at the extraction tool layer before the structured Invoice reaches the downstream agents — the matcher and approver rarely need the personal identifiers; they need the financial structure.
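Redaction at the extraction tool layer can be as small as a function applied to the extracted invoice before the tool returns it. The sketch below works on a plain dictionary; the field names (line_items, consultant_name, matter_party) and the placeholder string are hypothetical and would need to match the actual Invoice schema.

```python
import copy
from typing import Any, Dict, Iterable

REDACTED = "[REDACTED]"


def redact_line_item_pii(invoice: Dict[str, Any], pii_fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of the invoice with the named line-item fields masked."""
    pii = set(pii_fields)
    # Deep copy so the unredacted original can still be archived separately
    # under whatever access controls the compliance program requires.
    cleaned = copy.deepcopy(invoice)
    for item in cleaned.get("line_items", []):
        for field in pii.intersection(item):
            item[field] = REDACTED
    return cleaned


raw_invoice = {
    "invoice_number": "INV-9034",
    "line_items": [
        {"description": "Advisory services", "amount_cents": 125000, "consultant_name": "J. Example"},
        {"description": "Filing fee", "amount_cents": 4000},
    ],
}
safe_invoice = redact_line_item_pii(raw_invoice, ["consultant_name", "matter_party"])
```

The financial structure (descriptions, amounts, totals) passes through untouched, which is all the matcher and approver need.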

The MCP alternative. The same extraction capability shown here as a @function_tool can be exposed as an MCP server consumed across multiple agent frameworks, which is useful for organisations standardising on one extraction service across an OpenAI Agents SDK pipeline, a Claude Agent SDK pipeline, and an internal LangChain workflow. The trade-off is operational: an MCP server is another service to deploy, monitor, and version, and the cross-framework benefit only materialises if there are genuinely multiple consumers. For a single AP team building a single pipeline on a single SDK, the @function_tool shape is shorter, deploys with the agent code, and adds no separate service that can fail on its own. For a platform team supporting multiple AI consumers of the same extraction service, exposing it as an MCP server is the more reusable choice.

Versioning and freshness. The Agents SDK is still moving. Pin openai-agents and @openai/agents to specific versions in production, watch the SDK changelog for breaking changes around guardrail and handoff APIs, and retest the pipeline against new SDK versions in a staging environment before upgrading. The pieces most likely to shift are the guardrail decorator signature and the session storage format; everything else has been relatively stable across the recent versions.

When OpenAI Agents SDK Isn't the Right Choice

The OpenAI Agents SDK is a strong default for an AP team building on OpenAI. It is not the right default for every team, and the question "is this the right framework" deserves a straight answer rather than a soft one.

The Anthropic Claude Agent SDK with Skills is a better fit for teams already on Anthropic infrastructure, teams that want long-running file-system-aware agents that work over a project directory rather than discrete tool calls, and teams whose AP workflow benefits from Skills as a primitive: composable, reusable agent capabilities that load on demand rather than being declared up front in the Agent constructor. Skills are designed for the case where an AP team has a growing library of specialised capabilities (extractors, validators, posting adapters) that different runs draw from selectively, which is a pattern the OpenAI Agents SDK's flat tools=[...] list expresses less naturally. The trade-off is giving up the parts of the broader OpenAI ecosystem that the Anthropic stack does not match feature for feature: the Responses API, file_search, the Realtime API, and the native OpenAI tracing surface. For teams already running production OpenAI infrastructure, the migration cost outweighs the Skills benefit; for teams building greenfield AP on Anthropic, the equivalent build with the Claude Agent SDK and Skills is the closer fit.

LangGraph is the right choice for teams that want explicit graph-based workflow definitions, deterministic step ordering, and human-in-the-loop checkpoints as first-class graph nodes rather than LLM-driven routing decisions. AP workflows with hard regulatory requirements — banking-sector AP, healthcare-sector AP, public-sector procurement — where every invoice must demonstrably traverse exactly the same set of named steps in exactly the same order, often land here. The graph definition is its own auditable artifact; an external reviewer can read the graph and verify the control flow without reading any LLM traces at all. The trade-off is more orchestration code than the Agents SDK's terse handoffs=[...] parameter and a steeper learning curve. The LangGraph AP workflow with human-in-the-loop approval walks the deterministic-graph alternative end to end.

When to stay on the OpenAI Agents SDK: teams already operating on OpenAI infrastructure and billing; teams that want the terse multi-agent primitives without writing graph definitions; teams that benefit from the built-in trace dashboard rather than wiring observability separately; teams whose AP pipeline benefits from LLM-driven routing decisions — skipping validation for trusted repeat vendors, escalating ambiguous extraction straight to human review without the validator, routing certain invoice categories to specialised approver agents — rather than rigid step sequences. For most mid-market AP teams shipping their first production agent pipeline, those conditions hold and the SDK is the right answer.

The practical recommendation is the same one most cross-stack decisions reduce to: build the first version of the pipeline on whichever SDK the team already operates inside. Cross-framework migrations cost more than the marginal feature differences are worth in the early AP-automation phase, and the things that actually break in production (idempotency on side-effect tools, secrets handling on the ERP boundary, human-approval UX, batch resumption, audit-trail storage) are framework-independent. Pick the ecosystem first, then the framework, then ship.
