Document Intelligence

Document Intelligence extracts structured data from unstructured documents such as invoices, receipts, forms, and reports, supplied as PDFs or scanned images. Datafi runs each document through a pipeline that combines OCR, GPT-4 Vision, and configurable extraction schemas, and returns structured JSON that you can query, store, or feed into downstream workflows.

How It Works

Stage 1: Upload and Preprocessing

You upload a document through the Datafi UI, API, or an agent tool. The pipeline detects the file format and applies appropriate preprocessing:

PDFs -- Pages are rendered to images for visual processing. Multi-page PDFs are processed page by page.
Scanned images -- Preprocessing includes deskewing, contrast enhancement, and resolution normalization.
Native digital documents -- Text is extracted directly where possible, with visual processing as a fallback.
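
As a rough illustration of the format detection described above, the following Python sketch picks a preprocessing route from the file extension. The route names (`render_pages`, `extract_text`, `enhance_image`) and the `has_text_layer` flag are hypothetical labels for this example, not identifiers from the Datafi pipeline.

```python
from pathlib import PurePath

# Image extensions from the Supported Formats table.
IMAGE_EXTENSIONS = {".tiff", ".tif", ".png", ".bmp", ".jpg", ".jpeg"}


def preprocessing_route(filename, has_text_layer=False):
    """Return a hypothetical route name for a document.

    has_text_layer is an assumed flag marking a native digital PDF
    whose text can be read directly.
    """
    ext = PurePath(filename).suffix.lower()
    if ext == ".pdf":
        # Native PDFs: direct text extraction; otherwise render pages to images.
        return "extract_text" if has_text_layer else "render_pages"
    if ext in IMAGE_EXTENSIONS:
        # Scanned images: deskew, enhance contrast, normalize resolution.
        return "enhance_image"
    raise ValueError(f"unsupported format: {ext or filename}")
```

A real implementation would more likely inspect the file contents than trust the extension; the extension check keeps the sketch short.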

Stage 2: GPT-4 Vision Extraction

Preprocessed document pages are sent to GPT-4 Vision along with an extraction prompt. The prompt can be:

General purpose -- Extract all visible text and structure (tables, headings, lists).
Schema-guided -- Extract specific fields defined by an extraction schema you configure.
Template-based -- Match the document against a known template (invoice, receipt, form) and extract mapped fields.

Stage 3: Schema Mapping and Validation

The raw extraction output is mapped to your defined schema. The pipeline validates data types, checks for required fields, and flags any values that do not match expected patterns. Validation failures are reported in the output so you can review them.
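
The checks in this stage can be approximated with a short Python sketch. It is not the pipeline's actual validator: the function names and warning strings are made up for illustration, and it assumes dates use the `YYYY-MM-DD` format only.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def _matches_type(value, field_type):
    """Return True if value is acceptable for the declared field type."""
    if field_type == "string":
        return isinstance(value, str)
    if field_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if field_type == "decimal":
        if isinstance(value, bool):
            return False
        try:
            Decimal(str(value))
            return True
        except InvalidOperation:
            return False
    if field_type == "date":
        try:
            datetime.strptime(value, "%Y-%m-%d")  # assumed format
            return True
        except (TypeError, ValueError):
            return False
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "array":
        return isinstance(value, list)
    return False


def validate(data, schema):
    """Return a list of warning strings; an empty list means no problems."""
    warnings = []
    for field in schema["fields"]:
        name = field["name"]
        if data.get(name) is None:
            if field.get("required", False):
                warnings.append(f"{name}: required field is missing")
            continue
        if not _matches_type(data[name], field["type"]):
            warnings.append(f"{name}: expected {field['type']}")
    return warnings
```

Returning warnings instead of raising an exception mirrors the behavior described above, where failures are reported in the output for review.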

Stage 4: Structured JSON Output

The final output is a structured JSON document containing the extracted fields, their values, confidence scores, and any validation warnings.

Supported Formats

Format	Extension	Notes
PDF	`.pdf`	Single and multi-page. Scanned and native digital.
TIFF	`.tiff`, `.tif`	Single and multi-frame. Common for scanned documents.
PNG	`.png`	Lossless image format. Ideal for screenshots and diagrams.
BMP	`.bmp`	Uncompressed bitmap. Supported for legacy compatibility.
JPEG	`.jpg`, `.jpeg`	Lossy compressed. Common for photos of documents.

File Size Limits

Maximum file size is 20 MB per document. Multi-page PDFs are limited to 50 pages per extraction request. For larger documents, split them into batches before uploading.
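
One way to plan the split for a long PDF is to compute page ranges of at most 50 pages each. The Python helper below is a minimal sketch; `page_batches` is a hypothetical name, and the actual splitting of the file is left to whatever PDF tool you use.

```python
MAX_PAGES_PER_REQUEST = 50  # page limit stated above


def page_batches(total_pages, batch_size=MAX_PAGES_PER_REQUEST):
    """Return inclusive, 1-based (first_page, last_page) ranges."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]
```

For a 120-page document, `page_batches(120)` returns `[(1, 50), (51, 100), (101, 120)]`, which corresponds to three extraction requests.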

Extraction Schemas

An extraction schema defines the fields you want to pull from a document. You define the schema as a JSON structure, and the pipeline uses it to guide GPT-4 Vision's extraction and validate the output.

{
  "schema_name": "invoice_extraction",
  "fields": [
    {
      "name": "vendor_name",
      "type": "string",
      "required": true,
      "description": "Name of the vendor or supplier"
    },
    {
      "name": "invoice_number",
      "type": "string",
      "required": true,
      "description": "Unique invoice identifier"
    },
    {
      "name": "invoice_date",
      "type": "date",
      "required": true,
      "format": "YYYY-MM-DD"
    },
    {
      "name": "line_items",
      "type": "array",
      "items": {
        "description": "string",
        "quantity": "integer",
        "unit_price": "decimal",
        "total": "decimal"
      }
    },
    {
      "name": "total_amount",
      "type": "decimal",
      "required": true
    }
  ]
}

Supported Field Types

Type	Description	Example
`string`	Free-form text	`"Acme Corp"`
`integer`	Whole numbers	`42`
`decimal`	Floating-point numbers	`1299.99`
`date`	Date values with configurable format	`"2025-09-15"`
`boolean`	True/false values	`true`
`array`	Repeating groups (e.g., line items)	`[{...}, {...}]`

Example Output

For an uploaded invoice processed with the schema above, the pipeline returns:

{
  "extraction_id": "ext_a1b2c3d4",
  "document": "invoice_2025_0042.pdf",
  "schema": "invoice_extraction",
  "status": "complete",
  "confidence": 0.94,
  "data": {
    "vendor_name": "Acme Industrial Supplies",
    "invoice_number": "INV-2025-0042",
    "invoice_date": "2025-09-15",
    "line_items": [
      {
        "description": "Steel bolts M8x50",
        "quantity": 500,
        "unit_price": 0.45,
        "total": 225.00
      },
      {
        "description": "Rubber gaskets 25mm",
        "quantity": 200,
        "unit_price": 1.20,
        "total": 240.00
      }
    ],
    "total_amount": 465.00
  },
  "warnings": []
}
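
Because the output is plain JSON, downstream code can run its own sanity checks on it. The Python sketch below verifies the arithmetic in the `data` object of the example: each line item's quantity times unit price should equal its total, and the line totals should sum to `total_amount`. The function name and message wording are illustrative, and this check is separate from the pipeline's built-in validation.

```python
from decimal import Decimal


def check_totals(data):
    """Return a list of arithmetic inconsistencies found in an invoice."""
    problems = []
    running = Decimal("0")
    for i, item in enumerate(data.get("line_items", [])):
        # Convert through str() so floats such as 0.45 become exact decimals.
        expected = Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
        actual = Decimal(str(item["total"]))
        if expected != actual:
            problems.append(f"line_items[{i}]: expected {expected}, found {actual}")
        running += actual
    declared = Decimal(str(data["total_amount"]))
    if running != declared:
        problems.append(f"total_amount: line items sum to {running}, found {declared}")
    return problems
```

Applied to the example above, both line items (500 x 0.45 = 225.00 and 200 x 1.20 = 240.00) and the 465.00 total are consistent, so the function returns an empty list.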

Using Document Intelligence

From the UI

Navigate to AI > Document Intelligence.
Select or create an extraction schema.
Upload your document (drag and drop or file picker).
Review the extracted data in the results panel.
Export as JSON, or send to a Data View for further analysis.

From an Agent

Agents can use the `vision_extraction` tool to process documents as part of an automated workflow:

tools:
  - name: vision_extraction
    config:
      schema: invoice_extraction
      source: ${input.document_url}
      output_format: json

From the API

Submit documents programmatically using the Document Intelligence API endpoint:

curl -X POST https://api.datafi.io/v1/documents/extract \
  -H "Authorization: Bearer $TOKEN" \
  -F "[email protected]" \
  -F "schema=invoice_extraction"
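
The same request can be assembled in Python with only the standard library. This is a sketch under assumptions: the endpoint, bearer token header, and `schema` field come from the curl example, but the name of the file form field (`file` here) is a guess, and the function only builds the request without sending it.

```python
import urllib.request
import uuid

API_URL = "https://api.datafi.io/v1/documents/extract"  # endpoint from the curl example


def build_extract_request(filename, content, schema, token, file_field="file"):
    """Build (but do not send) a multipart/form-data extraction request.

    file_field is an assumed form field name, not a confirmed part of the API.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="schema"\r\n\r\n'
        f"{schema}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=head + content + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and parse the response body with `json.load`, assuming the endpoint returns JSON shaped like the Example Output section.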

Batch Processing

For high-volume document processing, you can submit multiple documents in a batch:

Upload files to a designated folder or object storage path.
Create a batch extraction job referencing the folder and schema.
Monitor progress in AI > Document Intelligence > Jobs.
Download results as a combined JSON array or individual files.
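
Before creating a batch job, it can help to screen the folder for files the pipeline will not accept. The Python sketch below filters a list of (name, size) pairs against the supported extensions and the 20 MB limit. It assumes 1 MB means 1,048,576 bytes, which the limits above do not specify, and `eligible_documents` is a hypothetical helper rather than part of the Datafi API.

```python
from pathlib import PurePath

# Extensions from the Supported Formats table.
SUPPORTED_EXTENSIONS = {".pdf", ".tiff", ".tif", ".png", ".bmp", ".jpg", ".jpeg"}
MAX_FILE_BYTES = 20 * 1024 * 1024  # assumes binary megabytes


def eligible_documents(files):
    """Split (name, size_in_bytes) pairs into accepted names and rejections."""
    accepted, rejected = [], []
    for name, size in files:
        if PurePath(name).suffix.lower() not in SUPPORTED_EXTENSIONS:
            rejected.append((name, "unsupported format"))
        elif size > MAX_FILE_BYTES:
            rejected.append((name, "larger than 20 MB"))
        else:
            accepted.append(name)
    return accepted, rejected
```

For a local folder, the input list can be built with `[(p.name, p.stat().st_size) for p in pathlib.Path(folder).iterdir() if p.is_file()]`.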

Agent-Driven Batches

You can build an agent workflow that watches an FTP folder or email inbox for new documents, processes them through Document Intelligence, and writes the extracted data to a database table automatically. See Workflow Builder for details.

Accuracy and Confidence Scores

Every extraction includes a confidence score between 0 and 1. Scores above 0.9 typically indicate high-quality extraction. For lower-confidence results, review the flagged fields and consider:

Improving document scan quality (higher DPI, better lighting).
Adding more specific field descriptions to your extraction schema.
Using template-based extraction for standardized document formats.
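
In an automated workflow, the 0.9 guideline can serve as a simple routing rule. The Python sketch below sends results with a confidence above the threshold and no warnings straight through, and queues everything else for manual review. The threshold default and the function name are illustrative; tune the cutoff to your own documents.

```python
REVIEW_THRESHOLD = 0.9  # guideline value from the text above


def triage(results, threshold=REVIEW_THRESHOLD):
    """Split extraction results into auto-accepted and needs-review IDs."""
    accepted, needs_review = [], []
    for result in results:
        confident = result["confidence"] > threshold
        clean = not result.get("warnings")
        if confident and clean:
            accepted.append(result["extraction_id"])
        else:
            needs_review.append(result["extraction_id"])
    return accepted, needs_review
```

Each element of `results` is expected to look like the Example Output section, with `extraction_id`, `confidence`, and `warnings` keys.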

Next Steps

Agent Catalog -- Browse agents that use document intelligence.
Agent Builder -- Build custom agents with the `vision_extraction` tool.
Workflow Builder -- Orchestrate document processing pipelines.
