Data Extraction - dottxt docs

Data extraction from unstructured text (invoices, receipts, medical records, contracts) is the bread and butter of structured output. The schema defines what “correctly extracted” looks like: which fields must be present, what types they have, and what ranges are valid. This turns extraction from a fuzzy NLP task into a well-defined contract: the output either conforms to the schema or it doesn’t. The tighter your schema constraints, the less post-processing you need. A format: "date" constraint on the invoice date means you get “2026-02-12” instead of “Feb 12, 2026” or “12/02/2026”. A pattern constraint on currency codes means you get “USD” instead of “US Dollars”.
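
To make the “conforms or doesn’t” idea concrete, here is a minimal sketch of the two constraints mentioned in this paragraph, written with only the Python standard library. The helper names is_iso_date and is_currency_code are hypothetical, and this is an after-the-fact check for illustration, not a description of how the API enforces the schema.

```python
import re
from datetime import date

# Shapes taken from the constraints discussed above.
ISO_DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
CURRENCY_RE = re.compile(r"[A-Z]{3}")


def is_iso_date(value: str) -> bool:
    """True if value is a real calendar date written as YYYY-MM-DD."""
    # The regex rejects other layouts; fromisoformat rejects impossible dates.
    if not ISO_DATE_RE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_currency_code(value: str) -> bool:
    """True if value is exactly three uppercase ASCII letters."""
    return CURRENCY_RE.fullmatch(value) is not None


for candidate in ["2026-02-12", "Feb 12, 2026", "12/02/2026"]:
    print(candidate, is_iso_date(candidate))
for candidate in ["USD", "US Dollars"]:
    print(candidate, is_currency_code(candidate))
```

Only “2026-02-12” and “USD” pass; every other candidate is rejected outright rather than passed along for cleanup.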

Goal

Extract invoice data from OCR text into a normalized, storage-ready record with validated types and bounded arrays.

Schema contract

{
  "type": "object",
  "properties": {
    "invoice_id": { "type": "string", "minLength": 1, "maxLength": 40 },
    "vendor": { "type": "string", "minLength": 1, "maxLength": 120 },
    "invoice_date": { "type": "string", "format": "date" },
    "currency": { "type": "string", "pattern": "^[A-Z]{3}$" },
    "total": { "type": "number", "minimum": 0 },
    "line_items": {
      "type": "array",
      "minItems": 1,
      "maxItems": 100,
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string", "minLength": 1, "maxLength": 200 },
          "quantity": { "type": "number", "minimum": 0 },
          "unit_price": { "type": "number", "minimum": 0 },
          "line_total": { "type": "number", "minimum": 0 }
        },
        "required": ["description", "quantity", "unit_price", "line_total"],
        "additionalProperties": false
      }
    }
  },
  "required": ["invoice_id", "vendor", "invoice_date", "currency", "total", "line_items"],
  "additionalProperties": false
}

from pydantic import BaseModel, ConfigDict, Field
from datetime import date

class LineItem(BaseModel):
    model_config = ConfigDict(extra="forbid")
    description: str = Field(..., min_length=1, max_length=200)
    quantity: float = Field(..., ge=0)
    unit_price: float = Field(..., ge=0)
    line_total: float = Field(..., ge=0)

class InvoiceRecord(BaseModel):
    model_config = ConfigDict(extra="forbid")
    invoice_id: str = Field(..., min_length=1, max_length=40)
    vendor: str = Field(..., min_length=1, max_length=120)
    invoice_date: date
    currency: str = Field(..., pattern=r"^[A-Z]{3}$")
    total: float = Field(..., ge=0)
    line_items: list[LineItem] = Field(..., min_length=1, max_length=100)

import { z } from "zod";

const invoiceRecordSchema = z.object({
  invoice_id: z.string().min(1).max(40),
  vendor: z.string().min(1).max(120),
  invoice_date: z.iso.date(),
  currency: z.string().regex(/^[A-Z]{3}$/),
  total: z.number().min(0),
  line_items: z.array(z.object({
    description: z.string().min(1).max(200),
    quantity: z.number().min(0),
    unit_price: z.number().min(0),
    line_total: z.number().min(0),
  }).strict()).min(1).max(100),
}).strict();

curl https://api.dottxt.ai/v1/chat/completions \
  -H "Authorization: Bearer $DOTTXT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{ "role": "user", "content": "Invoice INV-22019\nVendor: Northwind Supplies\nDate: 2026-02-12\nCurrency: USD\nItems:\n- Battery Pack x2 @ 39.50 = 79.00\n- Cable Kit x1 @ 15.00 = 15.00\nTotal: 94.00" }],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "invoice_record",
        "schema": {
          "type": "object",
          "properties": {
            "invoice_id": { "type": "string", "minLength": 1, "maxLength": 40 },
            "vendor": { "type": "string", "minLength": 1, "maxLength": 120 },
            "invoice_date": { "type": "string", "format": "date" },
            "currency": { "type": "string", "pattern": "^[A-Z]{3}$" },
            "total": { "type": "number", "minimum": 0 },
            "line_items": {
              "type": "array",
              "minItems": 1,
              "maxItems": 100,
              "items": {
                "type": "object",
                "properties": {
                  "description": { "type": "string", "minLength": 1, "maxLength": 200 },
                  "quantity": { "type": "number", "minimum": 0 },
                  "unit_price": { "type": "number", "minimum": 0 },
                  "line_total": { "type": "number", "minimum": 0 }
                },
                "required": ["description", "quantity", "unit_price", "line_total"],
                "additionalProperties": false
              }
            }
          },
          "required": ["invoice_id", "vendor", "invoice_date", "currency", "total", "line_items"],
          "additionalProperties": false
        }
      }
    }
  }'
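
For callers who prefer Python to curl, the same request can be assembled with the standard library. This is a sketch under stated assumptions: the endpoint, model name, and response_format shape are copied from the curl call, while the names build_payload and extract_invoice are hypothetical, and the response parsing assumes an OpenAI-style body (choices[0].message.content), which this page does not confirm.

```python
import json
import os
import urllib.request

API_URL = "https://api.dottxt.ai/v1/chat/completions"  # endpoint from the curl example

NON_NEGATIVE = {"type": "number", "minimum": 0}

LINE_ITEM_SCHEMA = {
    "type": "object",
    "properties": {
        "description": {"type": "string", "minLength": 1, "maxLength": 200},
        "quantity": NON_NEGATIVE,
        "unit_price": NON_NEGATIVE,
        "line_total": NON_NEGATIVE,
    },
    "required": ["description", "quantity", "unit_price", "line_total"],
    "additionalProperties": False,
}

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string", "minLength": 1, "maxLength": 40},
        "vendor": {"type": "string", "minLength": 1, "maxLength": 120},
        "invoice_date": {"type": "string", "format": "date"},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "total": NON_NEGATIVE,
        "line_items": {
            "type": "array",
            "minItems": 1,
            "maxItems": 100,
            "items": LINE_ITEM_SCHEMA,
        },
    },
    "required": ["invoice_id", "vendor", "invoice_date", "currency", "total", "line_items"],
    "additionalProperties": False,
}


def build_payload(ocr_text: str, model: str = "openai/gpt-oss-20b") -> dict:
    """Build the JSON body: the OCR text as the user message plus the schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": ocr_text}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "invoice_record", "schema": INVOICE_SCHEMA},
        },
    }


def extract_invoice(ocr_text: str) -> dict:
    """Send the request and parse the returned record (performs a network call).

    Assumption: the response body is OpenAI-style; adjust if the API differs.
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(ocr_text)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['DOTTXT_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return json.loads(body["choices"][0]["message"]["content"])
```

Keeping build_payload separate from the network call makes the request body easy to inspect or log before anything is sent.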

Example input

Invoice INV-22019
Vendor: Northwind Supplies
Date: 2026-02-12
Currency: USD
Items:
- Battery Pack x2 @ 39.50 = 79.00
- Cable Kit x1 @ 15.00 = 15.00
Total: 94.00

Example output

{
  "invoice_id": "INV-22019",
  "vendor": "Northwind Supplies",
  "invoice_date": "2026-02-12",
  "currency": "USD",
  "total": 94,
  "line_items": [
    {
      "description": "Battery Pack",
      "quantity": 2,
      "unit_price": 39.5,
      "line_total": 79
    },
    {
      "description": "Cable Kit",
      "quantity": 1,
      "unit_price": 15,
      "line_total": 15
    }
  ]
}
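
The schema guarantees types and ranges, but it cannot express arithmetic relationships between fields. One possible follow-up check, sketched here with the standard library, confirms that each line_total equals quantity times unit_price and that the line totals add up to total. The function name check_totals and the one-cent tolerance are assumptions made for illustration.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")  # assumed rounding tolerance: one minor currency unit


def to_decimal(value) -> Decimal:
    """Convert a JSON number via str() to avoid binary float artifacts."""
    return Decimal(str(value))


def check_totals(record: dict) -> list[str]:
    """Return readable arithmetic mismatches; an empty list means consistent."""
    problems = []
    running_sum = Decimal("0")
    for index, item in enumerate(record["line_items"]):
        expected = to_decimal(item["quantity"]) * to_decimal(item["unit_price"])
        actual = to_decimal(item["line_total"])
        if abs(expected - actual) > TOLERANCE:
            problems.append(f"line_items[{index}]: expected {expected}, got {actual}")
        running_sum += actual
    total = to_decimal(record["total"])
    if abs(running_sum - total) > TOLERANCE:
        problems.append(f"total: line items sum to {running_sum}, got {total}")
    return problems


# The example output from this page.
record = {
    "invoice_id": "INV-22019",
    "vendor": "Northwind Supplies",
    "invoice_date": "2026-02-12",
    "currency": "USD",
    "total": 94,
    "line_items": [
        {"description": "Battery Pack", "quantity": 2, "unit_price": 39.5, "line_total": 79},
        {"description": "Cable Kit", "quantity": 1, "unit_price": 15, "line_total": 15},
    ],
}

print(check_totals(record))  # prints []
```

Real invoices often add tax, shipping, or discounts. In that case the sum check would need those amounts as schema fields before it could be applied.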

Implementation tips

Narrow fields to business needs. Don’t add a catch-all "raw_text" field. Each field should map to a column in your database or a field in your downstream API. If you don’t need it, don’t extract it.
Bound arrays to realistic limits. maxItems: 100 on line items is generous but prevents runaway generation on malformed OCR input. Without it, a noisy scan could produce thousands of phantom line items.
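Use format and pattern for normalization. format: "date" on invoice_date gives you ISO 8601 calendar dates (YYYY-MM-DD) regardless of how the source text formats them. pattern: "^[A-Z]{3}$" on currency gives you three-letter uppercase codes, not spelled-out currency names. The pattern constrains shape only: it does not check that the code is a real ISO 4217 currency, so use an enum or validate against a code list if that matters.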
Consider per-field confidence. For high-stakes extraction (financial documents, medical records), add a confidence number field next to each extracted value. This lets your application flag low-confidence extractions for human review rather than trusting everything equally.
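
One way to act on the confidence tip is to wrap each extracted value in a small object that carries its own score, then route low scores to a reviewer. The sketch below is illustrative only: the value/confidence wrapper layout, the 0 to 1 range, the 0.8 threshold, and the sample scores are all assumptions, not part of the schema contract on this page.

```python
REVIEW_THRESHOLD = 0.8  # assumed cutoff; tune it against labelled samples


def with_confidence(field_schema: dict) -> dict:
    """Wrap a field schema as {"value": ..., "confidence": 0..1}."""
    return {
        "type": "object",
        "properties": {
            "value": field_schema,
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["value", "confidence"],
        "additionalProperties": False,
    }


def fields_needing_review(record: dict, threshold: float = REVIEW_THRESHOLD) -> list[str]:
    """Names of top-level fields whose confidence falls below the threshold."""
    return [
        name
        for name, field in record.items()
        if isinstance(field, dict) and field.get("confidence", 1.0) < threshold
    ]


# Schema fragment for a "total" field that carries a confidence score.
total_schema = with_confidence({"type": "number", "minimum": 0})

# Made-up scores, for illustration only.
extracted = {
    "total": {"value": 94, "confidence": 0.97},
    "invoice_date": {"value": "2026-02-12", "confidence": 0.55},
}

print(fields_needing_review(extracted))  # prints ['invoice_date']
```

A model-reported score is best treated as a rough triage signal. Checking it against a hand-labelled sample before relying on any particular threshold is a reasonable precaution.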

Related docs

Optional fields: make fields the model can omit when the source text doesn’t contain them
Optional vs Null: choose between “field absent” and “field present but null”
String bounds: control length, format, and regex on extracted strings
Bounded arrays: set min/max item counts on repeated structures
Object reference | String reference
