
Document Intelligence

Document Intelligence lets you extract structured data from unstructured documents -- PDFs, scanned images, invoices, receipts, forms, and reports. Datafi processes your documents through a pipeline that combines OCR, GPT-4 Vision, and configurable extraction schemas to produce clean, structured JSON output that you can query, store, or feed into downstream workflows.


How It Works

Stage 1: Upload and Preprocessing

You upload a document through the Datafi UI, API, or an agent tool. The pipeline detects the file format and applies appropriate preprocessing:

  • PDFs -- Pages are rendered to images for visual processing. Multi-page PDFs are processed page by page.
  • Scanned images -- Preprocessing includes deskewing, contrast enhancement, and resolution normalization.
  • Native digital documents -- Text is extracted directly where possible, with visual processing as a fallback.
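
As a rough illustration of the format detection step, the sketch below routes a file by its extension. The route names (`render_pages`, `enhance_image`) and the function `preprocessing_route` are hypothetical, not Datafi identifiers, and a real pipeline would also need to inspect a PDF's content to tell scanned from native digital pages.

```python
from pathlib import Path

# Image extensions from the Supported Formats table on this page.
IMAGE_EXTENSIONS = {".tiff", ".tif", ".png", ".bmp", ".jpg", ".jpeg"}


def preprocessing_route(filename: str) -> str:
    """Return a hypothetical preprocessing route name for a file."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # PDFs: pages are rendered to images, one page at a time.
        return "render_pages"
    if ext in IMAGE_EXTENSIONS:
        # Scanned images: deskew, contrast enhancement, resolution normalization.
        return "enhance_image"
    raise ValueError(f"unsupported format: {ext or filename}")
```

Matching on the lowercased suffix keeps `Invoice.PDF` and `invoice.pdf` on the same route.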

Stage 2: GPT-4 Vision Extraction

Preprocessed document pages are sent to GPT-4 Vision along with an extraction prompt. The prompt can be:

  • General purpose -- Extract all visible text and structure (tables, headings, lists).
  • Schema-guided -- Extract specific fields defined by an extraction schema you configure.
  • Template-based -- Match the document against a known template (invoice, receipt, form) and extract mapped fields.

Stage 3: Schema Mapping and Validation

The raw extraction output is mapped to your defined schema. The pipeline validates data types, checks for required fields, and flags any values that do not match expected patterns. Validation failures are reported in the output so you can review them.
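
The three checks described above (data types, required fields, expected patterns) can be sketched as follows. This is a minimal client-side approximation, not Datafi's implementation: the `pattern` key is a hypothetical addition that does not appear in the documented schema, and the warning strings are invented for illustration.

```python
import re
from datetime import date


def _is_iso_date(value) -> bool:
    """True when value is an ISO 8601 date string such as 2025-09-15."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "decimal": _is_number,
    "date": _is_iso_date,
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
}


def validate(data: dict, fields: list[dict]) -> list[str]:
    """Return readable warnings; an empty list means every check passed."""
    warnings = []
    for field in fields:
        name, value = field["name"], data.get(field["name"])
        if value is None:
            if field.get("required", False):
                warnings.append(f"{name}: required field is missing")
            continue
        if not TYPE_CHECKS[field["type"]](value):
            warnings.append(f"{name}: expected {field['type']}")
            continue
        pattern = field.get("pattern")  # hypothetical key, not in the documented schema
        if pattern and isinstance(value, str) and not re.fullmatch(pattern, value):
            warnings.append(f"{name}: does not match pattern {pattern}")
    return warnings
```

Collecting warnings instead of raising on the first failure mirrors the behavior described above, where failures are reported in the output for review.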

Stage 4: Structured JSON Output

The final output is a structured JSON document containing the extracted fields, their values, confidence scores, and any validation warnings.


Supported Formats

| Format | Extension    | Notes                                                      |
|--------|--------------|------------------------------------------------------------|
| PDF    | .pdf         | Single and multi-page. Scanned and native digital.         |
| TIFF   | .tiff, .tif  | Single and multi-frame. Common for scanned documents.      |
| PNG    | .png         | Lossless image format. Ideal for screenshots and diagrams. |
| BMP    | .bmp         | Uncompressed bitmap. Supported for legacy compatibility.   |
| JPEG   | .jpg, .jpeg  | Lossy compressed. Common for photos of documents.          |

File Size Limits

The maximum file size is 20 MB per document. Multi-page PDFs are limited to 50 pages per extraction request. Split larger documents into batches of 50 pages or fewer before uploading.
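
To make the limits concrete, here is a small sketch that checks a file size against the 20 MB cap and computes page ranges of at most 50 pages. The function names are illustrative, and treating 1 MB as 1024 * 1024 bytes is an assumption; the page does not say which definition of megabyte applies.

```python
MAX_FILE_BYTES = 20 * 1024 * 1024  # 20 MB, assuming 1 MB = 1024 * 1024 bytes
MAX_PAGES_PER_REQUEST = 50


def within_size_limit(size_bytes: int) -> bool:
    """True when a file of this size fits under the per-document cap."""
    return size_bytes <= MAX_FILE_BYTES


def page_batches(total_pages: int, limit: int = MAX_PAGES_PER_REQUEST) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges of at most `limit` pages."""
    return [
        (first, min(first + limit - 1, total_pages))
        for first in range(1, total_pages + 1, limit)
    ]
```

For a 120-page PDF, `page_batches(120)` yields `[(1, 50), (51, 100), (101, 120)]`, which are the ranges to split out with the PDF tool of your choice before uploading.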


Extraction Schemas

An extraction schema defines the fields you want to pull from a document. You define the schema as a JSON structure, and the pipeline uses it to guide GPT-4 Vision's extraction and validate the output.

{
  "schema_name": "invoice_extraction",
  "fields": [
    {
      "name": "vendor_name",
      "type": "string",
      "required": true,
      "description": "Name of the vendor or supplier"
    },
    {
      "name": "invoice_number",
      "type": "string",
      "required": true,
      "description": "Unique invoice identifier"
    },
    {
      "name": "invoice_date",
      "type": "date",
      "required": true,
      "format": "YYYY-MM-DD"
    },
    {
      "name": "line_items",
      "type": "array",
      "items": {
        "description": "string",
        "quantity": "integer",
        "unit_price": "decimal",
        "total": "decimal"
      }
    },
    {
      "name": "total_amount",
      "type": "decimal",
      "required": true
    }
  ]
}

Supported Field Types

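| Type    | Description                          | Example          |
|---------|--------------------------------------|------------------|
| string  | Free-form text                       | "Acme Corp"      |
| integer | Whole numbers                        | 42               |
| decimal | Floating-point numbers               | 1299.99          |
| date    | Date values with configurable format | "2025-09-15"     |
| boolean | True/false values                    | true             |
| array   | Repeating groups (e.g., line items)  | [{...}, {...}]   |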
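
When consuming extracted values in your own code, it can help to map each field type to a native type. The sketch below shows one possible mapping in Python. The `CONVERTERS` table and `coerce` function are illustrative, and the choice of `Decimal` for `decimal` fields is a design decision made here to avoid binary floating-point rounding in currency amounts, not something the pipeline prescribes.

```python
from datetime import date
from decimal import Decimal


def _to_bool(value):
    # Accept only real booleans; strings such as "false" would be ambiguous.
    if isinstance(value, bool):
        return value
    raise ValueError(f"not a boolean: {value!r}")


CONVERTERS = {
    "string": str,
    "integer": int,
    "decimal": lambda v: Decimal(str(v)),  # go through str() to keep the printed digits
    "date": date.fromisoformat,            # expects YYYY-MM-DD strings
    "boolean": _to_bool,
    "array": list,
}


def coerce(value, field_type: str):
    """Convert an extracted value to the Python type for its schema field type."""
    converter = CONVERTERS.get(field_type)
    if converter is None:
        raise ValueError(f"unknown field type: {field_type}")
    return converter(value)
```

For example, `coerce("2025-09-15", "date")` returns a `datetime.date`, and `coerce(1299.99, "decimal")` returns `Decimal("1299.99")`.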

Example Output

For an uploaded invoice processed with the schema above, the pipeline returns:

{
  "extraction_id": "ext_a1b2c3d4",
  "document": "invoice_2025_0042.pdf",
  "schema": "invoice_extraction",
  "status": "complete",
  "confidence": 0.94,
  "data": {
    "vendor_name": "Acme Industrial Supplies",
    "invoice_number": "INV-2025-0042",
    "invoice_date": "2025-09-15",
    "line_items": [
      {
        "description": "Steel bolts M8x50",
        "quantity": 500,
        "unit_price": 0.45,
        "total": 225.00
      },
      {
        "description": "Rubber gaskets 25mm",
        "quantity": 200,
        "unit_price": 1.20,
        "total": 240.00
      }
    ],
    "total_amount": 465.00
  },
  "warnings": []
}
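
A downstream consumer may want to confirm that the extracted numbers agree with each other. The sketch below is an optional client-side check, not a documented pipeline feature: it verifies that each line item's `quantity * unit_price` equals its `total`, and that the line totals sum to `total_amount`. The function name `check_totals` is illustrative. Numbers are parsed as `Decimal` so that values like 0.45 compare exactly.

```python
import json
from decimal import Decimal


def check_totals(result_json: str) -> list[str]:
    """Return a list of arithmetic inconsistencies found in an extraction result."""
    # Parse every JSON number as Decimal to avoid binary float rounding.
    result = json.loads(result_json, parse_float=Decimal, parse_int=Decimal)
    data = result["data"]
    problems = []
    running_total = Decimal(0)
    for index, item in enumerate(data["line_items"]):
        expected = item["quantity"] * item["unit_price"]
        if expected != item["total"]:
            problems.append(f"line_items[{index}]: expected {expected}, got {item['total']}")
        running_total += item["total"]
    if running_total != data["total_amount"]:
        problems.append(f"total_amount: expected {running_total}, got {data['total_amount']}")
    return problems
```

For the example output above, the checks pass: 500 × 0.45 = 225.00, 200 × 1.20 = 240.00, and 225.00 + 240.00 = 465.00.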

Using Document Intelligence

From the UI

  1. Navigate to AI > Document Intelligence.
  2. Select or create an extraction schema.
  3. Upload your document (drag and drop or file picker).
  4. Review the extracted data in the results panel.
  5. Export as JSON, or send to a Data View for further analysis.

From an Agent

Agents can use the vision_extraction tool to process documents as part of an automated workflow:

tools:
  - name: vision_extraction
    config:
      schema: invoice_extraction
      source: ${input.document_url}
      output_format: json

From the API

Submit documents programmatically using the Document Intelligence API endpoint:

curl -X POST https://api.datafi.io/v1/documents/extract \
  -H "Authorization: Bearer $TOKEN" \
  -F "[email protected]" \
  -F "schema=invoice_extraction"
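
The same request can be assembled from Python using only the standard library. This is a sketch under assumptions: the endpoint, bearer token header, and `schema` form field come from the curl example, but the upload field name `file` is a guess (exposed as the `file_field` parameter so it can be changed), and `build_extract_request` is an illustrative helper, not part of any Datafi SDK.

```python
import urllib.request
import uuid

API_URL = "https://api.datafi.io/v1/documents/extract"  # endpoint from the curl example


def build_extract_request(file_bytes: bytes, filename: str, schema: str,
                          token: str, file_field: str = "file") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data extraction request.

    `file_field` is an assumed form field name for the uploaded document.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="schema"\r\n\r\n'
        f"{schema}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=head + file_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send the request, pass the result to `urllib.request.urlopen` and decode the response body as JSON. The response is presumably shaped like the example output shown earlier, though that is not confirmed for this endpoint.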

Batch Processing

For high-volume document processing, you can submit multiple documents in a batch:

  1. Upload files to a designated folder or object storage path.
  2. Create a batch extraction job referencing the folder and schema.
  3. Monitor progress in AI > Document Intelligence > Jobs.
  4. Download results as a combined JSON array or individual files.

Agent-Driven Batches

You can build an agent workflow that watches an FTP folder or email inbox for new documents, processes them through Document Intelligence, and writes the extracted data to a database table automatically. See Workflow Builder for details.


Accuracy and Confidence Scores

Every extraction includes a confidence score between 0 and 1. Scores above 0.9 typically indicate high-quality extraction. For lower-confidence results, review the flagged fields and consider:

  • Improving document scan quality (higher DPI, better lighting).
  • Adding more specific field descriptions to your extraction schema.
  • Using template-based extraction for standardized document formats.
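
One way to act on confidence scores is to route results automatically. The sketch below sends a result to manual review when its confidence is at or below 0.9 or when it carries validation warnings. The threshold follows the guidance above, while the `needs_review` function and the review policy itself are illustrative assumptions.

```python
REVIEW_THRESHOLD = 0.9  # scores above this typically indicate high-quality extraction


def needs_review(result: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """True when an extraction result should be checked by a person."""
    low_confidence = result["confidence"] <= threshold
    has_warnings = bool(result.get("warnings"))
    return low_confidence or has_warnings
```

With the example output shown earlier (confidence 0.94, no warnings), `needs_review` returns `False`. Tune the threshold to your tolerance for errors in the target workflow.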

Next Steps