Document Extraction Pipeline

Last updated: 2026-04-15 16:44

Extract structured data from invoices, receipts, contracts, and purchase orders, validate the output automatically, and flag anything that needs human review.

Before You Start

You need the following to run this workflow:

Claude Code (desktop app, CLI, or IDE extension). Claude reads documents and extracts fields in Step 3, then validates the results in Step 4.
Documents in a readable format: PDF, image (PNG/JPG), or scanned files. Claude Code can read PDFs and images directly. For very poor-quality scans, run OCR first (see tips in Step 2).
A spreadsheet app — Google Sheets, Excel, or CSV editor for the final output.
20-30 minutes for a batch of 10-15 documents. Process documents in batches of the same type (e.g. all invoices together, all contracts together) for the most consistent results.

Accuracy expectations: Standard invoices with clean formatting typically see 94-97% field-level accuracy. Messy scans, handwritten notes, and unusual layouts will have lower accuracy; the workflow flags these for human review rather than guessing. That tradeoff is intentional: roughly 95% of the manual work disappears, and the remaining 5% or so is flagged with context so a human can resolve it quickly.
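To make "field-level accuracy" concrete, here is a minimal sketch of how you might measure it on a batch you have already checked by hand. The function name and the dictionary-per-document layout are assumptions for illustration, not part of the workflow itself.

```python
def field_level_accuracy(extracted, verified):
    """Share of fields whose extracted value matches the human-verified value.

    Both arguments are lists of dicts, one dict per document, in the same order.
    Only fields present in the verified record are counted.
    """
    total = 0
    correct = 0
    for got, want in zip(extracted, verified):
        for field, expected in want.items():
            total += 1
            if got.get(field) == expected:
                correct += 1
    return correct / total if total else 0.0


# Hypothetical two-document batch: one of four fields was extracted incorrectly.
extracted = [
    {"invoice_number": "INV-1", "total": "108.25"},
    {"invoice_number": "INV-2", "total": "19.90"},
]
verified = [
    {"invoice_number": "INV-1", "total": "108.25"},
    {"invoice_number": "INV-2", "total": "19.99"},
]
print(field_level_accuracy(extracted, verified))  # 0.75
```

Counting per field rather than per document is what lets a batch report high accuracy even when several documents each contain one wrong value.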

Quick Reference

Step	Who	What you do	What you get
1	You	Choose document type, define extraction schema	Reusable schema template
2	You	Gather 10-15 documents in a folder	Batch ready for processing
3	Claude	Read each document, extract fields	Raw extraction table
4	Claude	Validate math, completeness, and format	CLEAN and FLAGGED record sets
5	You	Resolve flagged items against originals	All records verified
6	You + Claude	Export to CSV, import to your system	Structured data in your workflow
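The Step 1 schema and the Step 4 checks (math, completeness, format) can be sketched in a few lines of Python. This is an illustrative sketch only: the field names, the YYYY-MM-DD date format, and the rule that subtotal plus tax equals total are assumptions about a typical invoice, and in the workflow itself Claude performs these checks.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical invoice schema: field name -> whether the field is required.
INVOICE_SCHEMA = {
    "invoice_number": True,
    "invoice_date": True,  # assumed format: YYYY-MM-DD
    "vendor_name": True,
    "subtotal": True,
    "tax": True,
    "total": True,
    "po_number": False,
}


def validate_record(record, schema=INVOICE_SCHEMA):
    """Return (status, issues), where status is 'CLEAN' or 'FLAGGED'."""
    issues = []

    # Completeness: every required field must be present and non-empty.
    for field, required in schema.items():
        value = record.get(field)
        if required and (value is None or str(value).strip() == ""):
            issues.append(f"missing required field: {field}")

    # Format: the invoice date must parse as YYYY-MM-DD.
    date_text = record.get("invoice_date")
    if date_text:
        try:
            datetime.strptime(str(date_text), "%Y-%m-%d")
        except ValueError:
            issues.append(f"invoice_date is not YYYY-MM-DD: {date_text!r}")

    # Math: subtotal + tax must equal total (Decimal avoids float rounding).
    try:
        subtotal, tax, total = (
            Decimal(str(record[key])) for key in ("subtotal", "tax", "total")
        )
        if subtotal + tax != total:
            issues.append(f"subtotal + tax = {subtotal + tax}, but total = {total}")
    except (KeyError, InvalidOperation):
        issues.append("amounts missing or not numeric; math check skipped")

    return ("CLEAN" if not issues else "FLAGGED"), issues


record = {
    "invoice_number": "INV-1",
    "invoice_date": "2026-04-01",
    "vendor_name": "Acme",
    "subtotal": "100.00",
    "tax": "8.25",
    "total": "108.25",
}
print(validate_record(record))
print(validate_record(dict(record, total="110.00")))
```

Each issue carries a short explanation, which matches the idea of flagging records "with context" so the Step 5 reviewer knows what to look for in the original document.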

Going Further

Once you have run this workflow on a few batches:

Build schema variants. Create separate schema files for each document type you process regularly — invoices, purchase orders, expense receipts, contracts. Reuse them across batches without redefining fields each time.
Track accuracy over time. Note how many documents come back CLEAN vs FLAGGED per batch. If a specific vendor's invoices consistently flag, add vendor-specific extraction hints to your schema (e.g. "This vendor puts the invoice number in the footer").
Scale with confidence. Once your schemas are calibrated and your CLEAN rate is above 90%, you can increase batch sizes or process more frequently. The share that needs review (around 5% on clean documents) will stay roughly constant: it reflects the nature of messy real-world documents, not a limitation you can eliminate.
Automate the pipeline. Schedule this as a recurring Claude Code task that watches a folder for new documents, extracts and validates automatically, and drops the CSV into a shared drive. Human review only triggers when FLAGGED items appear.
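The "Track accuracy over time" idea above can be as simple as a per-batch CLEAN rate plus a list of vendors whose documents flag often. The sketch below assumes each result is a (vendor, status) pair; the function names and the thresholds (at least 3 documents, at least half of them flagged) are hypothetical defaults you would tune.

```python
from collections import Counter


def clean_rate(statuses):
    """Fraction of records marked CLEAN in one batch (0.0 for an empty batch)."""
    if not statuses:
        return 0.0
    return sum(1 for status in statuses if status == "CLEAN") / len(statuses)


def frequently_flagged_vendors(results, min_docs=3, flag_threshold=0.5):
    """Vendors with at least min_docs documents whose flag rate meets the threshold."""
    totals = Counter()
    flagged = Counter()
    for vendor, status in results:
        totals[vendor] += 1
        if status == "FLAGGED":
            flagged[vendor] += 1
    return sorted(
        vendor
        for vendor, count in totals.items()
        if count >= min_docs and flagged[vendor] / count >= flag_threshold
    )


# Hypothetical batch results: (vendor, status) per document.
batch = [
    ("Acme", "CLEAN"), ("Acme", "CLEAN"), ("Acme", "CLEAN"),
    ("Globex", "FLAGGED"), ("Globex", "FLAGGED"), ("Globex", "CLEAN"),
]
print(clean_rate([status for _, status in batch]))
print(frequently_flagged_vendors(batch))  # ['Globex']
```

Vendors returned by the second function are candidates for the vendor-specific extraction hints described above.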
