D12 · OPERATIONS

Document data extraction

Reads structured and semi-structured documents that arrive in unstructured formats — invoices, purchase orders, bills of lading, claim forms, contracts, scanned paperwork — and extracts the fields the business cares about into structured records. Routes each extracted record through validation (does it match what we expected, do the totals add up, is the supplier known), then delivers to the system of record. High-confidence extractions land directly; low-confidence ones route to a human reviewer with the source document and the proposed values shown side by side. The pattern absorbs the data-entry work that currently consumes hours of skilled time every day in operations-heavy SMBs.

WHERE THIS FITS
BUSINESS SHAPES
B2B services · Product company · Professional services
VOLUME THRESHOLD
Below 200 documents processed a month, the payback rarely justifies the build. Patterns of this shape reliably pay back at 2,000+.
FITS BEST
Manufacturing, logistics, accounting firms, healthcare admin. One of the most reliable patterns in the catalog.
PAYBACK · 3-5 mo
BUILD · Low-Medium
VALUE · $50k-$200k
WHEN · >100 docs/wk
FAILURE MODE TO DESIGN AROUND
Unusual layouts break extraction → the exception queue must be first-class, not an afterthought.
REQUIREMENTS · 6 REQUIRED, 1 OPTIONAL

Requirements describe capabilities the pattern needs in your environment, not the vendors you must buy. Any system that fills a requirement satisfies it — that’s what makes the catalog portable across the long tail of SMB tooling.

  1. document_intake_channel
    REQUIRED · READ · event

    Where documents arrive for processing. Most clients have several channels feeding the same workflow.

    DATA SHAPE
    Document files in common formats (PDF, image, sometimes Word or HTML email) with arrival metadata: source, timestamp, sender if known.
    COMMONLY FILLED BY
    • email inbox where suppliers send invoices
    • shared drive folder where scanned documents land
    • upload portal for customers or partners
    • vendor portal that pushes documents via API
    • OCR feed from a multi-function printer
  2. extraction_schema_definitions
    REQUIRED · READ · corpus

    What fields to extract per document type. Different document types have different schemas; the pattern needs explicit definitions, not guessing.

    DATA SHAPE
    Per-document-type field definitions: name, type, required-or-optional, validation rules, example values.
    COMMONLY FILLED BY
    • configuration maintained per document type by the engagement team
    • small admin UI for the operations team to manage schemas
    • templates derived from sample documents during build phase
  3. reference_data_for_validation
    REQUIRED · READ · request

    Known-good data the extracted values get checked against: supplier list, product catalog, customer roster, GL codes.

    DATA SHAPE
    Reference tables with the canonical entities and their identifiers. Lets the pattern check that 'Acme Corp' on an invoice matches the supplier record and isn't a new supplier.
    COMMONLY FILLED BY
    • supplier master in the ERP or accounting system
    • product catalog with SKUs
    • customer database keyed by name and tax ID
    • chart of accounts
  4. structured_output_destination
    REQUIRED · WRITE · event

    Where the extracted, validated records land. Usually the system of record the documents would have been keyed into by hand.

    DATA SHAPE
    Structured records matching the destination system's schema, with a reference back to the source document.
    COMMONLY FILLED BY
    • bill or invoice created in the accounting system
    • purchase order record in the ERP
    • claim record in the operations platform
    • structured entry in the workflow management tool
  5. human_review_queue
    REQUIRED · READ + WRITE · request

    Where low-confidence extractions go for human verification. Reviewer sees the source document and the proposed values side by side, with the uncertain fields highlighted.

    DATA SHAPE
    Document + extracted fields + per-field confidence + reviewer actions (confirm, edit, reject).
    COMMONLY FILLED BY
    • review UI built for the engagement showing document and fields side by side
    • queue inside the accounting or ERP system
    • dedicated app for the operations reviewers
  6. exception_workflow
    REQUIRED · WRITE · event

    Where documents that fundamentally can't be processed go: unknown supplier, malformed document, suspected duplicate, suspected fraud.

    DATA SHAPE
    Document with exception classification, evidence, and routing suggestion.
    COMMONLY FILLED BY
    • exception queue in the accounting system
    • ticket created for finance operations
    • dedicated investigation folder with notifications
  7. source_document_archive
    RECOMMENDED · WRITE · corpus

    Long-term storage of original documents, linked from the structured records for audit and reference.

    DATA SHAPE
    Original files preserved with metadata: extraction date, links to created records, retention schedule.
    IF MISSING
    The compliance and audit trail is weaker; in regulated industries the archive is effectively required. Strongly recommended for any client with audit obligations.
    COMMONLY FILLED BY
    • document management system with retention policies
    • archived folder in the file store with structured naming
    • attachment field on the destination record
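
As an illustration of how extraction_schema_definitions and reference_data_for_validation might fit together, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `FieldDef` class, the `INVOICE_SCHEMA` fields, the `SUPPLIERS` lookup, the `line_totals` key, and the one-cent tolerance are hypothetical and not part of the catalog. A real build would use whatever schema store and supplier master the client already has.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldDef:
    """One field in a per-document-type schema (hypothetical shape)."""
    name: str
    type: type
    required: bool = True
    rule: Optional[Callable[[Any], bool]] = None  # optional validation rule
    example: Any = None


# Hypothetical schema for a single document type.
INVOICE_SCHEMA = [
    FieldDef("supplier_name", str, example="Acme Corp"),
    FieldDef("invoice_number", str, example="INV-1001"),
    FieldDef("total", float, rule=lambda v: v >= 0, example=125.50),
    FieldDef("po_number", str, required=False, example="PO-77"),
]

# Hypothetical reference data: normalized supplier name -> supplier ID.
SUPPLIERS = {"acme corp": "SUP-001"}


def validate(record: dict, schema: list, suppliers: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for fdef in schema:
        value = record.get(fdef.name)
        if value is None:
            if fdef.required:
                problems.append(f"missing required field: {fdef.name}")
            continue
        if not isinstance(value, fdef.type):
            problems.append(f"wrong type for {fdef.name}")
            continue
        if fdef.rule is not None and not fdef.rule(value):
            problems.append(f"rule failed for {fdef.name}")

    # Reference check: is the supplier already in the master list?
    supplier = record.get("supplier_name")
    if isinstance(supplier, str) and supplier.strip().lower() not in suppliers:
        problems.append(f"unknown supplier: {supplier}")

    # Arithmetic check: do line items add up to the stated total?
    lines = record.get("line_totals")
    total = record.get("total")
    if lines is not None and isinstance(total, float):
        if abs(sum(lines) - total) > 0.005:  # assumed tolerance
            problems.append("line items do not sum to total")
    return problems
```

Calling `validate(record, INVOICE_SCHEMA, SUPPLIERS)` on an extracted record returns the list of failures that the later routing step can act on. An unknown supplier appears as an ordinary problem string here; whether it becomes a review item or an exception is a routing decision made elsewhere.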
RUNTIME FLOW · 9 STEPS
  1. 01
    A new document arrives on one of the intake channels
    document_intake_channel
  2. 02
    Classify document type and select the matching extraction schema
    extraction_schema_definitions
  3. 03
    Extract field values from the document, scoring each field's extraction confidence
  4. 04
    Validate extracted values against reference data (does this supplier exist, are line items recognized, do totals match)
    reference_data_for_validation
  5. 05
    Classify the result: high-confidence + valid → auto-process; low-confidence or validation failures → review; fundamental problems → exception
    DECISION Three branches based on confidence and validation outcome.
  6. 06
    For auto-process: write structured record to destination and archive source
    structured_output_destination · source_document_archive
  7. 07
    For review: queue with source and proposed values for human verification
    human_review_queue
  8. 08
    For exception: route to exception workflow with classification and evidence
    exception_workflow
  9. 09
    On reviewer confirmation, write the record and feed corrections back for schema and confidence tuning
    structured_output_destination · source_document_archive
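
The three-way decision in step 05 can be sketched as a small routing function. This is a sketch under stated assumptions, not the pattern's actual logic: the `Route` enum, the `route` and `uncertain_fields` helpers, and the 0.95 threshold are hypothetical, and a real deployment would tune the threshold per document type from reviewer feedback.

```python
from enum import Enum


class Route(Enum):
    AUTO_PROCESS = "auto_process"
    REVIEW = "review"
    EXCEPTION = "exception"


# Assumed cut-off; not specified by the catalog.
AUTO_THRESHOLD = 0.95


def route(confidences: dict, validation_failures: list,
          fundamental_problems: list,
          threshold: float = AUTO_THRESHOLD) -> Route:
    """Pick one of the three branches for an extracted document.

    confidences: field name -> extraction confidence in [0, 1]
    validation_failures: problems a reviewer could fix (e.g. totals mismatch)
    fundamental_problems: reasons the document cannot be processed at all
    """
    if fundamental_problems:
        return Route.EXCEPTION
    if validation_failures:
        return Route.REVIEW
    # Auto-process only when every field clears the threshold.
    if not confidences or min(confidences.values()) < threshold:
        return Route.REVIEW
    return Route.AUTO_PROCESS


def uncertain_fields(confidences: dict,
                     threshold: float = AUTO_THRESHOLD) -> list:
    """Fields to highlight for the reviewer, in alphabetical order."""
    return sorted(name for name, c in confidences.items() if c < threshold)
```

Checking fundamental problems first means a suspected duplicate or malformed document never reaches the review queue, even if its fields were extracted with high confidence. Using the minimum field confidence is a deliberately conservative choice for the sketch: one uncertain field is enough to send the document to review.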
EMISSIONS · 3

Structured outputs this pattern produces. Other patterns and client systems can subscribe to them, which is how the catalog composes over time.

  • extraction_quality_signal

    Per-document-type accuracy: auto-process rate, reviewer override rate, exception rate. The main metric for tuning.

    CONSUMED BY
    • pattern quality dashboards
    • schema refinement workflows
    • monthly operations review
  • supplier_or_customer_emergence_signal

    New entities appearing in documents that don't match reference data, surfaced for master-data maintenance.

    CONSUMED BY
    • procurement
    • B2 CRM hygiene if live
    • master data governance
  • anomaly_signal

    Documents flagged as suspicious — duplicates, unusual amounts, mismatched line items.

    CONSUMED BY
    • fraud and audit workflows
    • finance leadership review
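
To make extraction_quality_signal concrete, the sketch below computes the three rates per document type from a list of processed outcomes. The `Outcome` tuple, the function name, and the exact rate definitions are assumptions for illustration: here auto-process and exception rates are taken over all documents of a type, while the reviewer override rate is taken over reviewed documents only. The catalog does not specify these denominators.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Outcome(NamedTuple):
    """One processed document (hypothetical record shape)."""
    doc_type: str
    route: str                # "auto_process", "review", or "exception"
    overridden: bool = False  # reviewer edited or rejected a proposed value


def extraction_quality_signal(outcomes: Iterable[Outcome]) -> dict:
    """Return per-document-type rates as {doc_type: {metric: value}}."""
    by_type = defaultdict(list)
    for outcome in outcomes:
        by_type[outcome.doc_type].append(outcome)

    signal = {}
    for doc_type, items in by_type.items():
        total = len(items)
        reviewed = [o for o in items if o.route == "review"]
        overrides = sum(o.overridden for o in reviewed)
        signal[doc_type] = {
            "auto_process_rate":
                sum(o.route == "auto_process" for o in items) / total,
            # Share of reviewed documents where the reviewer changed something.
            "reviewer_override_rate":
                overrides / len(reviewed) if reviewed else 0.0,
            "exception_rate":
                sum(o.route == "exception" for o in items) / total,
        }
    return signal
```

A falling auto-process rate or a rising override rate for one document type is the kind of change that would prompt a schema or threshold review for that type.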