Document Intelligence
Document Intelligence lets you extract structured data from unstructured documents -- PDFs, scanned images, invoices, receipts, forms, and reports. Datafi processes your documents through a pipeline that combines OCR, GPT-4 Vision, and configurable extraction schemas to produce clean, structured JSON output that you can query, store, or feed into downstream workflows.
How It Works
Stage 1: Upload and Preprocessing
You upload a document through the Datafi UI, API, or an agent tool. The pipeline detects the file format and applies appropriate preprocessing:
- PDFs -- Pages are rendered to images for visual processing. Multi-page PDFs are processed page by page.
- Scanned images -- Preprocessing includes deskewing, contrast enhancement, and resolution normalization.
- Native digital documents -- Text is extracted directly where possible, with visual processing as a fallback.
Stage 2: GPT-4 Vision Extraction
Preprocessed document pages are sent to GPT-4 Vision along with an extraction prompt. The prompt can be:
- General purpose -- Extract all visible text and structure (tables, headings, lists).
- Schema-guided -- Extract specific fields defined by an extraction schema you configure.
- Template-based -- Match the document against a known template (invoice, receipt, form) and extract mapped fields.
Stage 3: Schema Mapping and Validation
The raw extraction output is mapped to your defined schema. The pipeline validates data types, checks for required fields, and flags any values that do not match expected patterns. Validation failures are reported in the output so you can review them.
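The checks described above can be illustrated with a short Python sketch. It is not Datafi's implementation: the function name `validate_extraction`, the warning wording, and the fixed `YYYY-MM-DD` date check are all assumptions made for illustration.

```python
from datetime import datetime


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_date(value):
    # Assumes the YYYY-MM-DD format used in the schema example.
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Hypothetical mapping from schema type names to Python checks.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": _is_int,
    "decimal": _is_number,
    "date": _is_date,
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
}


def validate_extraction(schema, data):
    """Return a list of warning strings; an empty list means nothing was flagged."""
    warnings = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in data or data[name] is None:
            if field.get("required", False):
                warnings.append(f"{name}: required field is missing")
            continue
        check = TYPE_CHECKS.get(field["type"])
        if check is not None and not check(data[name]):
            warnings.append(f"{name}: expected {field['type']}")
    return warnings


schema = {
    "fields": [
        {"name": "vendor_name", "type": "string", "required": True},
        {"name": "invoice_date", "type": "date", "required": True},
        {"name": "total_amount", "type": "decimal", "required": True},
    ]
}
raw = {"vendor_name": "Acme Industrial Supplies", "invoice_date": "15/09/2025"}
print(validate_extraction(schema, raw))
# ['invoice_date: expected date', 'total_amount: required field is missing']
```

Failures are collected as warnings rather than raised, which mirrors the behavior described above: they are reported in the output for review instead of aborting the extraction.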
Stage 4: Structured JSON Output
The final output is a structured JSON document containing the extracted fields, their values, confidence scores, and any validation warnings.
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Single and multi-page. Scanned and native digital. |
| TIFF | .tiff, .tif | Single and multi-frame. Common for scanned documents. |
| PNG | .png | Lossless image format. Ideal for screenshots and diagrams. |
| BMP | .bmp | Uncompressed bitmap. Supported for legacy compatibility. |
| JPEG | .jpg, .jpeg | Lossy compressed. Common for photos of documents. |
Maximum file size is 20 MB per document. Multi-page PDFs are limited to 50 pages per extraction request. For larger documents, split them into batches before uploading.
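As a rough illustration of these limits, the following Python sketch checks a file size and plans page ranges for a document that exceeds 50 pages. The helper names are hypothetical, and treating 20 MB as 20 × 1024 × 1024 bytes is an assumption. Splitting the PDF itself requires a PDF library and is not shown.

```python
MAX_FILE_BYTES = 20 * 1024 * 1024  # 20 MB, assuming binary megabytes
MAX_PAGES_PER_REQUEST = 50


def within_size_limit(size_bytes):
    """True when a file of this size can be uploaded as a single document."""
    return size_bytes <= MAX_FILE_BYTES


def page_batches(total_pages, limit=MAX_PAGES_PER_REQUEST):
    """Return inclusive, 1-based (first_page, last_page) ranges of at most `limit` pages."""
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]


print(page_batches(120))  # [(1, 50), (51, 100), (101, 120)]
```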
Extraction Schemas
An extraction schema defines the fields you want to pull from a document. You define the schema as a JSON structure, and the pipeline uses it to guide GPT-4 Vision's extraction and validate the output.
{
  "schema_name": "invoice_extraction",
  "fields": [
    {
      "name": "vendor_name",
      "type": "string",
      "required": true,
      "description": "Name of the vendor or supplier"
    },
    {
      "name": "invoice_number",
      "type": "string",
      "required": true,
      "description": "Unique invoice identifier"
    },
    {
      "name": "invoice_date",
      "type": "date",
      "required": true,
      "format": "YYYY-MM-DD"
    },
    {
      "name": "line_items",
      "type": "array",
      "items": {
        "description": "string",
        "quantity": "integer",
        "unit_price": "decimal",
        "total": "decimal"
      }
    },
    {
      "name": "total_amount",
      "type": "decimal",
      "required": true
    }
  ]
}
Supported Field Types
| Type | Description | Example |
|---|---|---|
| string | Free-form text | "Acme Corp" |
| integer | Whole numbers | 42 |
| decimal | Floating-point numbers | 1299.99 |
| date | Date values with configurable format | "2025-09-15" |
| boolean | True/false values | true |
| array | Repeating groups (e.g., line items) | [{...}, {...}] |
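To make the scalar types concrete, here is a minimal Python sketch of how raw extracted text might be converted to a typed value. The `coerce` function, the accepted boolean spellings, and the handling of `,` as a thousands separator are assumptions for illustration, not documented pipeline behavior. Only the `YYYY`, `MM`, and `DD` format tokens are translated.

```python
from datetime import datetime
from decimal import Decimal


def parse_date(text, fmt="YYYY-MM-DD"):
    # Translate schema-style tokens into strptime directives.
    # Only YYYY, MM, and DD are handled in this sketch.
    directive = fmt.replace("YYYY", "%Y").replace("MM", "%m").replace("DD", "%d")
    return datetime.strptime(text.strip(), directive).date()


def parse_boolean(text):
    lowered = text.strip().lower()
    if lowered in ("true", "yes"):
        return True
    if lowered in ("false", "no"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


def coerce(text, field_type, fmt="YYYY-MM-DD"):
    """Convert raw extracted text to a Python value for a scalar field type."""
    if field_type == "string":
        return text.strip()
    if field_type == "integer":
        return int(text.replace(",", ""))  # assumes "," is a thousands separator
    if field_type == "decimal":
        return Decimal(text.replace(",", "").strip())
    if field_type == "date":
        return parse_date(text, fmt)
    if field_type == "boolean":
        return parse_boolean(text)
    raise ValueError(f"unsupported field type: {field_type}")


print(coerce("1,299.99", "decimal"))  # 1299.99
print(coerce("15/09/2025", "date", "DD/MM/YYYY"))  # 2025-09-15
```

`Decimal` is used here to avoid binary rounding on currency amounts; a `float` would match the table's description equally well. For an `array` field, the same conversion would be applied to each element according to its item types.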
Example Output
For an uploaded invoice processed with the schema above, the pipeline returns:
{
  "extraction_id": "ext_a1b2c3d4",
  "document": "invoice_2025_0042.pdf",
  "schema": "invoice_extraction",
  "status": "complete",
  "confidence": 0.94,
  "data": {
    "vendor_name": "Acme Industrial Supplies",
    "invoice_number": "INV-2025-0042",
    "invoice_date": "2025-09-15",
    "line_items": [
      {
        "description": "Steel bolts M8x50",
        "quantity": 500,
        "unit_price": 0.45,
        "total": 225.00
      },
      {
        "description": "Rubber gaskets 25mm",
        "quantity": 200,
        "unit_price": 1.20,
        "total": 240.00
      }
    ],
    "total_amount": 465.00
  },
  "warnings": []
}
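The figures in this example are internally consistent: 500 × 0.45 = 225.00, 200 × 1.20 = 240.00, and 225.00 + 240.00 = 465.00. A downstream consumer might verify that arithmetic before trusting the data. The sketch below is one way to do so; `check_invoice_totals` is a hypothetical helper, not part of the pipeline.

```python
from decimal import Decimal


def check_invoice_totals(data):
    """Return a list of arithmetic inconsistencies found in extracted invoice data."""
    problems = []
    running_total = Decimal("0")
    for index, item in enumerate(data["line_items"]):
        # Go through str() so that 0.45 becomes Decimal("0.45"), not a binary approximation.
        quantity = Decimal(str(item["quantity"]))
        unit_price = Decimal(str(item["unit_price"]))
        line_total = Decimal(str(item["total"]))
        if quantity * unit_price != line_total:
            problems.append(
                f"line_items[{index}]: {quantity} x {unit_price} != {line_total}"
            )
        running_total += line_total
    declared = Decimal(str(data["total_amount"]))
    if running_total != declared:
        problems.append(f"total_amount: line items sum to {running_total}, not {declared}")
    return problems


data = {
    "line_items": [
        {"description": "Steel bolts M8x50", "quantity": 500, "unit_price": 0.45, "total": 225.00},
        {"description": "Rubber gaskets 25mm", "quantity": 200, "unit_price": 1.20, "total": 240.00},
    ],
    "total_amount": 465.00,
}
print(check_invoice_totals(data))  # []
```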
Using Document Intelligence
From the UI
- Navigate to AI > Document Intelligence.
- Select or create an extraction schema.
- Upload your document (drag and drop or file picker).
- Review the extracted data in the results panel.
- Export as JSON, or send to a Data View for further analysis.
From an Agent
Agents can use the vision_extraction tool to process documents as part of an automated workflow:
tools:
  - name: vision_extraction
    config:
      schema: invoice_extraction
      source: ${input.document_url}
      output_format: json
From the API
Submit documents programmatically using the Document Intelligence API endpoint:
curl -X POST https://api.datafi.io/v1/documents/extract \
-H "Authorization: Bearer $TOKEN" \
-F "file=@invoice.pdf" \
-F "schema=invoice_extraction"
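The same request can be sketched in Python using only the standard library. The endpoint and the `schema` field come from the curl example. The multipart field name for the file (`file` here) and the helper name `build_extract_request` are assumptions and should be checked against the API reference. The sketch builds the request without sending it.

```python
import urllib.request
import uuid

API_URL = "https://api.datafi.io/v1/documents/extract"  # endpoint from the curl example


def build_extract_request(token, filename, file_bytes, schema, file_field="file"):
    """Build, but do not send, a multipart/form-data POST request.

    file_field is an assumed form field name; confirm it against the API reference.
    """
    boundary = uuid.uuid4().hex
    schema_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="schema"\r\n\r\n'
        f"{schema}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = schema_part + file_header + file_bytes + closing

    request = urllib.request.Request(API_URL, data=body, method="POST")
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return request


request = build_extract_request(
    "example-token", "invoice.pdf", b"%PDF-1.7 placeholder bytes", "invoice_extraction"
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request)` would send it. The response body is presumably JSON in the shape shown under Example Output, though that is not confirmed here.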
Batch Processing
For high-volume document processing, you can submit multiple documents in a batch:
- Upload files to a designated folder or object storage path.
- Create a batch extraction job referencing the folder and schema.
- Monitor progress in AI > Document Intelligence > Jobs.
- Download results as a combined JSON array or individual files.
You can build an agent workflow that watches an FTP folder or email inbox for new documents, processes them through Document Intelligence, and writes the extracted data to a database table automatically. See Workflow Builder for details.
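Before creating a batch job, it can help to screen the folder for files the pipeline will not accept. The following Python sketch applies the extensions and size limit listed under Supported Formats; `select_batch_files` is a hypothetical helper, and treating 20 MB as 20 × 1024 × 1024 bytes is an assumption.

```python
import tempfile
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".tiff", ".tif", ".png", ".bmp", ".jpg", ".jpeg"}
MAX_FILE_BYTES = 20 * 1024 * 1024  # 20 MB, assuming binary megabytes


def select_batch_files(folder):
    """Split the files in `folder` into accepted names and (name, reason) rejections."""
    accepted, rejected = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            rejected.append((path.name, "unsupported format"))
        elif path.stat().st_size > MAX_FILE_BYTES:
            rejected.append((path.name, "larger than 20 MB"))
        else:
            accepted.append(path.name)
    return accepted, rejected


# Demonstration with throwaway files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.PNG", "notes.docx"):
        (Path(tmp) / name).write_bytes(b"x")
    print(select_batch_files(tmp))
```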
Accuracy and Confidence Scores
Every extraction includes a confidence score between 0 and 1. Scores above 0.9 typically indicate high-quality extraction. For lower-confidence results, review the flagged fields and consider:
- Improving document scan quality (higher DPI, better lighting).
- Adding more specific field descriptions to your extraction schema.
- Using template-based extraction for standardized document formats.
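One way to act on these scores is to route results automatically, sending anything at or below the 0.9 mark, or anything carrying warnings, to manual review. The Python sketch below assumes results shaped like the Example Output; the `triage` helper, the two extra extraction IDs, and the warning string format are hypothetical.

```python
REVIEW_THRESHOLD = 0.9  # taken from the guidance above; tune it for your documents


def needs_review(result, threshold=REVIEW_THRESHOLD):
    """True when confidence is not above the threshold or any warnings are present."""
    return result["confidence"] <= threshold or len(result["warnings"]) > 0


def triage(results):
    """Return (accepted_ids, review_ids) for a list of extraction results."""
    accepted, review = [], []
    for result in results:
        target = review if needs_review(result) else accepted
        target.append(result["extraction_id"])
    return accepted, review


results = [
    {"extraction_id": "ext_a1b2c3d4", "confidence": 0.94, "warnings": []},
    {"extraction_id": "ext_example_low", "confidence": 0.71, "warnings": []},
    {
        "extraction_id": "ext_example_warn",
        "confidence": 0.97,
        "warnings": ["total_amount: required field is missing"],
    },
]
print(triage(results))
# (['ext_a1b2c3d4'], ['ext_example_low', 'ext_example_warn'])
```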
Next Steps
- Agent Catalog -- Browse agents that use document intelligence.
- Agent Builder -- Build custom agents with the vision_extraction tool.
- Workflow Builder -- Orchestrate document processing pipelines.