Data extraction from unstructured text, including invoices, receipts, medical records, and contracts, is the bread and butter of structured output. The schema defines what “correctly extracted” looks like: which fields must be present, what types they have, and what ranges are valid. This turns extraction from a fuzzy NLP task into a well-defined contract: the output either conforms to the schema or it doesn’t. The tighter your schema constraints, the less post-processing you need. A format: "date" constraint on the invoice date means you get "2026-02-12" instead of "Feb 12, 2026" or "12/02/2026". A pattern constraint on currency codes means you get "USD" instead of "US Dollars".

Goal

Extract invoice data from OCR text into a normalized, storage-ready record with validated types and bounded arrays.

Schema contract

{
  "type": "object",
  "properties": {
    "invoice_id": { "type": "string", "minLength": 1, "maxLength": 40 },
    "vendor": { "type": "string", "minLength": 1, "maxLength": 120 },
    "invoice_date": { "type": "string", "format": "date" },
    "currency": { "type": "string", "pattern": "^[A-Z]{3}$" },
    "total": { "type": "number", "minimum": 0 },
    "line_items": {
      "type": "array",
      "minItems": 1,
      "maxItems": 100,
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string", "minLength": 1, "maxLength": 200 },
          "quantity": { "type": "number", "minimum": 0 },
          "unit_price": { "type": "number", "minimum": 0 },
          "line_total": { "type": "number", "minimum": 0 }
        },
        "required": ["description", "quantity", "unit_price", "line_total"],
        "additionalProperties": false
      }
    }
  },
  "required": ["invoice_id", "vendor", "invoice_date", "currency", "total", "line_items"],
  "additionalProperties": false
}
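
To make the contract concrete, here is a minimal sketch of the same constraints written as plain Python checks. It uses only the standard library and is not a replacement for a real JSON Schema validator; the function names (check_invoice, is_iso_date, and so on) are hypothetical and exist only for illustration.

```python
import re
from datetime import date

TOP_LEVEL_KEYS = {"invoice_id", "vendor", "invoice_date", "currency", "total", "line_items"}
LINE_ITEM_KEYS = {"description", "quantity", "unit_price", "line_total"}


def is_non_negative_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool) and value >= 0


def is_bounded_string(value, max_length):
    # Mirrors minLength: 1 together with the given maxLength.
    return isinstance(value, str) and 1 <= len(value) <= max_length


def is_iso_date(value):
    # Accept only the YYYY-MM-DD shape, then let fromisoformat reject impossible dates.
    if not isinstance(value, str) or not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_invoice(record):
    """Return a list of violating field names; an empty list means the record conforms."""
    # required plus additionalProperties: false means the key set must match exactly.
    if not isinstance(record, dict) or set(record) != TOP_LEVEL_KEYS:
        return ["top-level keys must match the schema exactly"]
    errors = []
    if not is_bounded_string(record["invoice_id"], 40):
        errors.append("invoice_id")
    if not is_bounded_string(record["vendor"], 120):
        errors.append("vendor")
    if not is_iso_date(record["invoice_date"]):
        errors.append("invoice_date")
    currency = record["currency"]
    if not isinstance(currency, str) or not re.fullmatch(r"[A-Z]{3}", currency):
        errors.append("currency")
    if not is_non_negative_number(record["total"]):
        errors.append("total")
    items = record["line_items"]
    if not isinstance(items, list) or not 1 <= len(items) <= 100:
        errors.append("line_items")
        return errors
    for index, item in enumerate(items):
        if not isinstance(item, dict) or set(item) != LINE_ITEM_KEYS:
            errors.append(f"line_items[{index}]")
            continue
        if not is_bounded_string(item["description"], 200):
            errors.append(f"line_items[{index}].description")
        for key in ("quantity", "unit_price", "line_total"):
            if not is_non_negative_number(item[key]):
                errors.append(f"line_items[{index}].{key}")
    return errors
```

A record with "Feb 12, 2026" as the date and "US Dollars" as the currency would come back with both fields listed, which is exactly the post-processing burden the schema is meant to remove.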

Example input

Invoice INV-22019
Vendor: Northwind Supplies
Date: 2026-02-12
Currency: USD
Items:
- Battery Pack x2 @ 39.50 = 79.00
- Cable Kit x1 @ 15.00 = 15.00
Total: 94.00

Example output

{
  "invoice_id": "INV-22019",
  "vendor": "Northwind Supplies",
  "invoice_date": "2026-02-12",
  "currency": "USD",
  "total": 94,
  "line_items": [
    {
      "description": "Battery Pack",
      "quantity": 2,
      "unit_price": 39.5,
      "line_total": 79
    },
    {
      "description": "Cable Kit",
      "quantity": 1,
      "unit_price": 15,
      "line_total": 15
    }
  ]
}
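
The schema checks each field in isolation; it cannot express that line_total should equal quantity times unit_price, or that total should match the sum of the line totals. If you need that guarantee, a small cross-field check after validation is one option. The sketch below is illustrative only: arithmetic_errors and the 0.01 tolerance are assumptions, and it treats total as the plain sum of line totals, which will not hold for invoices that carry tax, shipping, or discounts.

```python
from decimal import Decimal


def to_decimal(value):
    # Going through str() avoids binary float artifacts when comparing money amounts.
    return Decimal(str(value))


def arithmetic_errors(invoice, tolerance=Decimal("0.01")):
    """Return descriptions of amounts that do not add up within the tolerance."""
    errors = []
    running_total = Decimal("0")
    for index, item in enumerate(invoice["line_items"]):
        expected = to_decimal(item["quantity"]) * to_decimal(item["unit_price"])
        actual = to_decimal(item["line_total"])
        if abs(expected - actual) > tolerance:
            errors.append(f"line_items[{index}]: expected {expected}, got {actual}")
        running_total += actual
    # Assumption: total is the plain sum of line totals (no tax or discount lines).
    if abs(running_total - to_decimal(invoice["total"])) > tolerance:
        errors.append(f"total: expected {running_total}, got {invoice['total']}")
    return errors
```

For the example output above, this returns an empty list: 2 × 39.50 = 79.00, 1 × 15.00 = 15.00, and 79.00 + 15.00 = 94.00.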

Implementation tips

  • Narrow fields to business needs. Don’t add a catch-all "raw_text" field. Each field should map to a column in your database or a field in your downstream API. If you don’t need it, don’t extract it.
  • Bound arrays to realistic limits. maxItems: 100 on line items is generous but prevents runaway generation on malformed OCR input. Without it, a noisy scan could produce thousands of phantom line items.
  • Use format and pattern for normalization. format: "date" on invoice_date gives you ISO 8601 calendar dates (YYYY-MM-DD) regardless of how the source text formats them. pattern: "^[A-Z]{3}$" on currency gives you three-letter codes, not spelled-out currency names.
  • Consider per-field confidence. For high-stakes extraction (financial documents, medical records), add a numeric confidence field next to each extracted value. Because the schema sets additionalProperties to false, each confidence field must be declared in the schema as well. This lets your application flag low-confidence extractions for human review rather than trusting everything equally.
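
As a rough illustration of the per-field confidence tip, the sketch below routes low-confidence fields to human review. Everything about the shape is an assumption made for this example: the "_confidence" suffix convention, scores in the range 0 to 1, the 0.8 threshold, and the function name fields_needing_review. Scores reported by a model are not guaranteed to be calibrated, so treat the threshold as something to tune against reviewed samples.

```python
CONFIDENCE_SUFFIX = "_confidence"  # hypothetical naming convention for companion fields


def fields_needing_review(record, threshold=0.8):
    """Return the names of fields whose companion confidence score is below the threshold."""
    flagged = []
    for key, score in record.items():
        if key.endswith(CONFIDENCE_SUFFIX) and score < threshold:
            # Strip the suffix to recover the name of the extracted field.
            flagged.append(key[: -len(CONFIDENCE_SUFFIX)])
    return sorted(flagged)


# Hypothetical extraction result with companion confidence scores.
extracted = {
    "vendor": "Northwind Supplies",
    "vendor_confidence": 0.97,
    "currency": "USD",
    "currency_confidence": 0.62,
    "total": 94,
    "total_confidence": 0.55,
}
review_queue = fields_needing_review(extracted)
```

Here review_queue contains "currency" and "total", while "vendor" passes through without review.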