Platform Architecture

Datafi is built on a three-service architecture designed for security, performance, and flexibility. Each service has a distinct responsibility, and together they enable federated query execution across any number of data sources without centralizing your data.

Architecture Diagram

The Three Services

Coordinator

The Coordinator is the central orchestration hub of the Datafi platform. It handles every aspect of request processing except direct database connectivity. When you send a query, the Coordinator validates your identity, enforces access policies, parses and optimizes the query, routes sub-queries to the appropriate Edge nodes, and aggregates results before returning them to you.

Key responsibilities:

JWT validation -- Authenticates every incoming request.
ABAC enforcement -- Applies attribute-based access control policies to determine what data you can see.
Query parsing and optimization -- Breaks complex federated queries into sub-queries targeted at specific data sources.
Routing -- Dispatches sub-queries to the correct Edge nodes over gRPC (TLS).
Result aggregation -- Merges results from multiple Edge nodes using Apache Arrow for high-performance columnar processing.
AI/ML orchestration -- Coordinates agent workflows and model interactions.

Design Principle

The Coordinator never connects directly to your databases. All data access flows through Edge nodes, preserving network isolation and minimizing your attack surface.
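The order of these responsibilities can be sketched in a few lines of Python. This is a minimal illustration, not Datafi's implementation: `SubQuery`, `is_allowed`, `handle_request`, the policy shape, and the callables standing in for Edge nodes are all hypothetical. JWT validation is assumed to have already produced the user's attributes, and plain lists stand in for Apache Arrow batches.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Tuple  # one result row; plain tuples stand in for Arrow batches


@dataclass(frozen=True)
class SubQuery:
    source: str  # logical data source name (hypothetical)
    sql: str     # statement compiled for that source


def is_allowed(user_attrs: Dict[str, str], policy: Dict[str, str]) -> bool:
    """ABAC check: every attribute the policy requires must match the user's."""
    return all(user_attrs.get(key) == value for key, value in policy.items())


def handle_request(
    user_attrs: Dict[str, str],
    sub_queries: List[SubQuery],
    policies: Dict[str, Dict[str, str]],
    edges: Dict[str, Callable[[str], List[Row]]],
) -> List[Row]:
    # 1. Enforce policy for every source before anything is dispatched.
    #    A source with no policy is denied by default.
    for sq in sub_queries:
        policy = policies.get(sq.source)
        if policy is None or not is_allowed(user_attrs, policy):
            raise PermissionError(f"access to {sq.source!r} denied")

    # 2. Route each sub-query to the Edge stand-in that serves its source,
    # 3. then merge the partial results in sub-query order.
    merged: List[Row] = []
    for sq in sub_queries:
        merged.extend(edges[sq.source](sq.sql))
    return merged
```

Note that the sketch never opens a database connection itself. It only calls the Edge stand-ins, which mirrors the isolation described above.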

Edge

Edge nodes provide secure, minimal database connectivity. Each Edge node is deployed close to your data sources -- inside your VPC, your data center, or wherever your databases reside. The Edge exposes only three RPC methods, keeping the attack surface as small as possible.

RPC Method	Purpose
`Query`	Execute a compiled SQL statement against a connected database and return results.
`GetSchema`	Retrieve table and column metadata from a connected database.
`Ping`	Health check to verify the Edge node and its database connections are operational.
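To make the shape of this three-method surface concrete, here is a small sketch that mimics it with an in-memory SQLite database. The class name `SketchEdge`, the Python method names, and the return types are assumptions for illustration only. The real Edge speaks gRPC and its message formats are not described here.

```python
import sqlite3
from typing import Dict, List, Tuple


class SketchEdge:
    """Hypothetical stand-in for an Edge node, backed by in-memory SQLite."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")

    def query(self, sql: str, params: tuple = ()) -> List[tuple]:
        """Query: execute an already-compiled SQL statement and return rows."""
        return self._conn.execute(sql, params).fetchall()

    def get_schema(self) -> Dict[str, List[Tuple[str, str]]]:
        """GetSchema: map each table name to its (column, declared type) pairs."""
        tables = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        schema = {}
        for (name,) in tables:
            # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
            columns = self._conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [(col[1], col[2]) for col in columns]
        return schema

    def ping(self) -> bool:
        """Ping: report whether the database connection still answers."""
        try:
            self._conn.execute("SELECT 1")
            return True
        except sqlite3.Error:
            return False
```

The sketch holds no policy checks or query planning, which reflects the split of duties: authorization and optimization happen in the Coordinator before a statement ever reaches an Edge node.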

Key characteristics:

Minimal surface area -- Only three RPC methods and no business logic.
Deployed at the data -- Runs inside your network perimeter, adjacent to your databases.
TLS-secured gRPC -- All communication between the Coordinator and Edge nodes is encrypted in transit.
Stateless -- Edge nodes do not cache query results or store user data.

Client Library

The Client Library is a WebAssembly-based SDK that runs directly in the browser. It provides a GraphQL interface for building data applications with near-native performance.

Key characteristics:

WebAssembly -- Compiled to Wasm for near-native execution speed in the browser.
GraphQL access -- You interact with your data through a familiar GraphQL API.
Lightweight -- No server-side rendering required; the library runs entirely client-side.
Cross-platform -- Works in any modern browser that supports WebAssembly.
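The library's own API is not shown on this page, so the following is only a rough sketch of what a GraphQL request authenticated with a JWT bearer token looks like on the wire, written in Python for illustration. The `/graphql` path, the `orders` field, and the helper name `build_graphql_request` are hypothetical, and it is an assumption that the Coordinator accepts GraphQL over its HTTP port. The request is built but never sent.

```python
import json
import urllib.request
from typing import Optional


def build_graphql_request(
    base_url: str,
    token: str,
    query: str,
    variables: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a JSON-encoded GraphQL POST request."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/graphql",  # hypothetical endpoint path
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # JWT bearer token
        },
        method="POST",
    )


# Example: a query for a hypothetical `orders` field.
request = build_graphql_request(
    "http://localhost:8000", "example-token", "{ orders { id region } }"
)
```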

Service Communication

Source	Destination	Protocol	Port	Authentication
Client Library	Coordinator	gRPC-Web	8001	JWT Bearer Token
Client Library	Coordinator	HTTP	8000	JWT Bearer Token
Browser / External	Coordinator	gRPC	50051	JWT Bearer Token
Browser / External	Coordinator	MCP	8002	JWT Bearer Token
Coordinator	Edge	gRPC (TLS)	50051	Mutual TLS
Edge	Database	Native Driver	Varies	Database-specific
Edge	Health Check	HTTP	80	None
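When planning firewall rules, it can help to treat the table above as data. The sketch below encodes each row and derives which ports a given service must accept traffic on. The `Link` type and the `inbound_ports` helper are illustrative names, not part of the platform. The "Varies" port is recorded as `None`.

```python
from typing import List, NamedTuple, Optional


class Link(NamedTuple):
    source: str
    destination: str
    protocol: str
    port: Optional[int]  # None where the table says "Varies"
    auth: str


# One entry per row of the Service Communication table.
LINKS = [
    Link("Client Library", "Coordinator", "gRPC-Web", 8001, "JWT Bearer Token"),
    Link("Client Library", "Coordinator", "HTTP", 8000, "JWT Bearer Token"),
    Link("Browser / External", "Coordinator", "gRPC", 50051, "JWT Bearer Token"),
    Link("Browser / External", "Coordinator", "MCP", 8002, "JWT Bearer Token"),
    Link("Coordinator", "Edge", "gRPC (TLS)", 50051, "Mutual TLS"),
    Link("Edge", "Database", "Native Driver", None, "Database-specific"),
    Link("Edge", "Health Check", "HTTP", 80, "None"),
]


def inbound_ports(destination: str) -> List[int]:
    """Return the sorted, de-duplicated fixed ports a destination listens on."""
    return sorted(
        {link.port for link in LINKS
         if link.destination == destination and link.port is not None}
    )


def unauthenticated_links() -> List[Link]:
    """Return the links that the table lists with no authentication."""
    return [link for link in LINKS if link.auth == "None"]
```

For example, `inbound_ports("Coordinator")` yields the four Coordinator ports from the table, and `unauthenticated_links()` surfaces the health-check row as the only link without authentication.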

Deployment Models

Datafi supports multiple deployment models to match your infrastructure and compliance requirements.

Model	Coordinator	Edge	Data Sources	Best For
SaaS	Datafi-hosted	Datafi-hosted	Cloud databases	Teams that want zero infrastructure management.
Private Cloud	Your cloud account	Your cloud account	Your cloud databases	Organizations requiring full control over the compute layer.
On-Premises	Your data center	Your data center	On-premises databases	Regulated industries with strict data residency requirements.
Hybrid	Datafi-hosted	Your network	Mixed	Organizations that want managed orchestration with data that never leaves their perimeter.

Hybrid Is the Most Common

Most enterprise customers deploy the Hybrid model. The Coordinator runs in Datafi's managed cloud, while Edge nodes run inside the customer's network. This means your data never leaves your infrastructure, but you still benefit from managed orchestration, updates, and monitoring.

Design Principles

Data never moves -- Queries go to the data, not the other way around. Edge nodes execute queries where the data lives.
Minimal trust boundaries -- Each service has only the permissions it needs. The Coordinator never connects to your databases directly, and Edge nodes never make authorization decisions.
Protocol efficiency -- The platform uses gRPC with Protocol Buffers for internal communication, Apache Arrow for columnar result aggregation, and PRQL as the intermediate query representation.
Horizontal scalability -- You can deploy as many Edge nodes as you need, each serving a different set of data sources. The Coordinator handles routing and aggregation transparently.
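The routing side of horizontal scalability can be sketched as a lookup from data source to Edge node, followed by grouping. The names below (`plan_dispatch`, the `edge-eu` and `edge-us` identifiers, and the source names) are hypothetical and only illustrate the idea that each Edge node serves its own set of data sources.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SubQuery = Tuple[str, str]  # (data source name, compiled statement)


def plan_dispatch(
    sub_queries: List[SubQuery],
    source_to_edge: Dict[str, str],
) -> Dict[str, List[SubQuery]]:
    """Group sub-queries by the Edge node that serves their data source."""
    plan: Dict[str, List[SubQuery]] = defaultdict(list)
    for source, sql in sub_queries:
        try:
            edge = source_to_edge[source]
        except KeyError:
            raise LookupError(f"no Edge node registered for {source!r}") from None
        plan[edge].append((source, sql))
    return dict(plan)


# Two Edge nodes, each serving a different set of sources.
registry = {"sales": "edge-eu", "billing": "edge-eu", "crm": "edge-us"}
plan = plan_dispatch(
    [("sales", "SELECT 1"), ("crm", "SELECT 2"), ("billing", "SELECT 3")],
    registry,
)
```

In this model, adding capacity means registering another Edge node and its sources in the lookup. The grouping logic does not change.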

Next Steps

Key Concepts -- Learn the terminology used throughout the platform.
Request Lifecycle -- Understand what happens when you execute a query.
Supported Data Sources -- See the full list of databases and connectors.
Multi-Protocol APIs -- Explore the available API protocols and their use cases.
