Platform Architecture
Datafi is built on a three-service architecture designed for security, performance, and flexibility. Each service has a distinct responsibility, and together they enable federated query execution across any number of data sources without centralizing your data.
Architecture Diagram
The Three Services
Coordinator
The Coordinator is the central orchestration hub of the Datafi platform. It handles every aspect of request processing except direct database connectivity. When you send a query, the Coordinator validates your identity, enforces access policies, parses and optimizes the query, routes sub-queries to the appropriate Edge nodes, and aggregates results before returning them to you.
Key responsibilities:
- JWT validation -- Authenticates every incoming request.
- ABAC enforcement -- Applies attribute-based access control policies to determine what data you can see.
- Query parsing and optimization -- Breaks complex federated queries into sub-queries targeted at specific data sources.
- Routing -- Dispatches sub-queries to the correct Edge nodes over gRPC (TLS).
- Result aggregation -- Merges results from multiple Edge nodes using Apache Arrow for high-performance columnar processing.
- AI/ML orchestration -- Coordinates agent workflows and model interactions.
The Coordinator never connects directly to your databases. All data access flows through Edge nodes, preserving network isolation and minimizing your attack surface.
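To make the ABAC enforcement step concrete, here is a minimal sketch of an attribute-based column filter. It is not Datafi's policy engine: the `Policy` shape, the `visible_columns` helper, and the attribute names are all hypothetical, chosen only to show how user attributes can decide which columns a request may see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical rule: `column` is visible only if the user's `attribute` equals `required_value`."""
    column: str
    attribute: str
    required_value: str


def visible_columns(user_attrs, requested, policies):
    """Return the requested columns that the user's attributes permit."""
    allowed = []
    for column in requested:
        rules = [p for p in policies if p.column == column]
        # Assumption for this sketch: a column with no rule is unrestricted.
        if all(user_attrs.get(p.attribute) == p.required_value for p in rules):
            allowed.append(column)
    return allowed


policies = [Policy(column="salary", attribute="department", required_value="hr")]
analyst = {"department": "sales"}
hr_user = {"department": "hr"}

print(visible_columns(analyst, ["name", "salary"], policies))  # ['name']
print(visible_columns(hr_user, ["name", "salary"], policies))  # ['name', 'salary']
```

In a real deployment the attributes would presumably come from the validated JWT, and the filter would run before any sub-query is dispatched to an Edge node.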
Edge
Edge nodes provide secure, minimal database connectivity. Each Edge node is deployed close to your data sources -- inside your VPC, your data center, or wherever your databases reside. The Edge exposes only three RPC methods, keeping the attack surface as small as possible.
| RPC Method | Purpose |
|---|---|
| Query | Execute a compiled SQL statement against a connected database and return results. |
| GetSchema | Retrieve table and column metadata from a connected database. |
| Ping | Health check to verify the Edge node and its database connections are operational. |
Key characteristics:
- Minimal surface area -- Only three RPC methods and no business logic.
- Deployed at the data -- Runs inside your network perimeter, adjacent to your databases.
- TLS-secured gRPC -- All communication between the Coordinator and Edge nodes is encrypted in transit.
- Stateless -- Edge nodes do not cache query results or store user data.
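The three-method surface can be illustrated with a toy, in-process stand-in. This is a sketch under stated assumptions, not the Edge implementation: the `EdgeNode` class and its method names are hypothetical, SQLite stands in for whatever database the Edge connects to, and the gRPC transport is omitted entirely.

```python
import sqlite3


class EdgeNode:
    """Toy stand-in for an Edge node exposing only Query, GetSchema, and Ping."""

    def __init__(self, connection):
        self._conn = connection

    def query(self, sql, params=()):
        """Execute an already-compiled SQL statement and return all rows."""
        return self._conn.execute(sql, params).fetchall()

    def get_schema(self):
        """Return {table: [column, ...]} for every table in the database."""
        tables = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        schema = {}
        for (table,) in tables:
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            columns = self._conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema

    def ping(self):
        """Health check: True when the database answers a trivial query."""
        try:
            self._conn.execute("SELECT 1").fetchone()
            return True
        except sqlite3.Error:
            return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders (total) VALUES (19.5), (5.25)")

edge = EdgeNode(conn)
print(edge.ping())        # True
print(edge.get_schema())  # {'orders': ['id', 'total']}
print(edge.query("SELECT id, total FROM orders ORDER BY id"))
```

The class holds no cache and keeps no per-user state, which mirrors the stateless property listed above: every call goes straight to the database connection.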
Client Library
The Client Library is a WebAssembly-based SDK that runs directly in the browser. It provides a GraphQL interface for building data applications with near-native performance.
Key characteristics:
- WebAssembly -- Compiled to Wasm for near-native execution speed in the browser.
- GraphQL access -- You interact with your data through a familiar GraphQL API.
- Lightweight -- No server-side rendering required; the library runs entirely client-side.
- Cross-platform -- Works in any modern browser that supports WebAssembly.
Service Communication
| Source | Destination | Protocol | Port | Authentication |
|---|---|---|---|---|
| Client Library | Coordinator | gRPC-Web | 8001 | JWT Bearer Token |
| Client Library | Coordinator | HTTP | 8000 | JWT Bearer Token |
| Browser / External | Coordinator | gRPC | 50051 | JWT Bearer Token |
| Browser / External | Coordinator | MCP | 8002 | JWT Bearer Token |
| Coordinator | Edge | gRPC (TLS) | 50051 | Mutual TLS |
| Edge | Database | Native Driver | Varies | Database-specific |
| Edge | Health Check | HTTP | 80 | None |
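As a rough illustration of the HTTP row in the table, the sketch below builds (but does not send) a request carrying a JWT Bearer token to port 8000. The host name, the `/graphql` path, and the JSON payload shape are assumptions made for the example; the table specifies only the protocol, port, and authentication scheme.

```python
import json
import urllib.request

# Hypothetical host and path; only port 8000 and Bearer auth come from the table.
COORDINATOR_URL = "http://coordinator.example.com:8000/graphql"


def build_request(token, query):
    """Build a POST request with a JWT Bearer token; nothing is sent here."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        COORDINATOR_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("<your-jwt>", "{ __typename }")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

Passing the result to `urllib.request.urlopen` would actually send it; that step is left out because it needs a reachable Coordinator and a valid token.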
Deployment Models
Datafi supports multiple deployment models to match your infrastructure and compliance requirements.
| Model | Coordinator | Edge | Data Sources | Best For |
|---|---|---|---|---|
| SaaS | Datafi-hosted | Datafi-hosted | Cloud databases | Teams that want zero infrastructure management. |
| Private Cloud | Your cloud account | Your cloud account | Your cloud databases | Organizations requiring full control over the compute layer. |
| On-Premises | Your data center | Your data center | On-premises databases | Regulated industries with strict data residency requirements. |
| Hybrid | Datafi-hosted | Your network | Mixed | Organizations that want managed orchestration with data that never leaves their perimeter. |
Most enterprise customers deploy the Hybrid model. The Coordinator runs in Datafi's managed cloud, while Edge nodes run inside the customer's network. This means your data never leaves your infrastructure, but you still benefit from managed orchestration, updates, and monitoring.
Design Principles
- Data never moves -- Queries go to the data, not the other way around. Edge nodes execute queries where the data lives.
- Minimal trust boundaries -- Each service has the minimum permissions it needs. The Coordinator never connects to your databases; it sees only the query results that Edge nodes return. Edge nodes never make authorization decisions.
- Protocol efficiency -- Internal communication uses gRPC with Protocol Buffers, result aggregation uses Apache Arrow's columnar format, and PRQL serves as the intermediate query representation.
- Horizontal scalability -- You can deploy as many Edge nodes as you need, each serving a different set of data sources. The Coordinator handles routing and aggregation transparently.
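The routing and aggregation described in these principles can be sketched as follows. Everything here is illustrative: the `ROUTES` table, the Edge and data source names, and the `plan` and `aggregate` helpers are hypothetical, and plain Python lists stand in for the Apache Arrow batches the Coordinator actually merges.

```python
from collections import defaultdict

# Hypothetical routing table: data source name -> Edge node that serves it.
ROUTES = {"orders_db": "edge-us", "billing_db": "edge-us", "crm_db": "edge-eu"}


def plan(sub_queries):
    """Group (source, sql) sub-queries by the Edge node that owns each source."""
    batches = defaultdict(list)
    for source, sql in sub_queries:
        batches[ROUTES[source]].append((source, sql))
    return dict(batches)


def aggregate(partials):
    """Concatenate per-Edge row lists into one result, in sorted Edge-name order."""
    merged = []
    for edge_name in sorted(partials):
        merged.extend(partials[edge_name])
    return merged


batches = plan([
    ("orders_db", "SELECT id FROM orders"),
    ("crm_db", "SELECT id FROM accounts"),
    ("billing_db", "SELECT id FROM invoices"),
])
print(batches)

# Stand-in for the rows each Edge node would return for its batch.
partials = {"edge-us": [(1,), (2,)], "edge-eu": [(3,)]}
print(aggregate(partials))  # [(3,), (1,), (2,)]
```

Adding another Edge node in this sketch means adding entries to `ROUTES`; the planning and merging code does not change, which is the property the horizontal scalability principle describes.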
Next Steps
- Key Concepts -- Learn the terminology used throughout the platform.
- Request Lifecycle -- Understand what happens when you execute a query.
- Supported Data Sources -- See the full list of databases and connectors.
- Multi-Protocol APIs -- Explore the available API protocols and their use cases.