
Platform Architecture

Datafi is built on a three-service architecture designed for security, performance, and flexibility. Each service has a distinct responsibility, and together they enable federated query execution across any number of data sources without centralizing your data.

Architecture Diagram

The Three Services

Coordinator

The Coordinator is the central orchestration hub of the Datafi platform. It handles every aspect of request processing except direct database connectivity. When you send a query, the Coordinator validates your identity, enforces access policies, parses and optimizes the query, routes sub-queries to the appropriate Edge nodes, and aggregates results before returning them to you.

Key responsibilities:

  • JWT validation -- Authenticates every incoming request.
  • ABAC enforcement -- Applies attribute-based access control policies to determine what data you can see.
  • Query parsing and optimization -- Breaks complex federated queries into sub-queries targeted at specific data sources.
  • Routing -- Dispatches sub-queries to the correct Edge nodes over gRPC (TLS).
  • Result aggregation -- Merges results from multiple Edge nodes using Apache Arrow for high-performance columnar processing.
  • AI/ML orchestration -- Coordinates agent workflows and model interactions.

Design Principle

The Coordinator never connects directly to your databases. All data access flows through Edge nodes, preserving network isolation and minimizing your attack surface.
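To make the ABAC enforcement step more concrete, here is a minimal sketch of a deny-by-default attribute check. It is not Datafi's policy engine: the `Policy` class, the `is_allowed` function, and every attribute name (`department`, `role`, `classification`, `owner_department`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Sequence

Attrs = Mapping[str, Any]


@dataclass(frozen=True)
class Policy:
    """Hypothetical ABAC rule: grants `action` when `condition` holds."""
    action: str
    condition: Callable[[Attrs, Attrs], bool]


def is_allowed(policies: Sequence[Policy], action: str, user: Attrs, resource: Attrs) -> bool:
    """Deny by default; allow only if some policy for this action matches."""
    return any(p.action == action and p.condition(user, resource) for p in policies)


# Illustrative policies (assumed, not taken from the Datafi documentation).
POLICIES = [
    # Users may read data owned by their own department.
    Policy("read", lambda u, r: u.get("department") == r.get("owner_department")),
    # Auditors may read anything that is not classified as restricted.
    Policy("read", lambda u, r: u.get("role") == "auditor"
           and r.get("classification") != "restricted"),
]

analyst = {"role": "analyst", "department": "finance"}
auditor = {"role": "auditor", "department": "compliance"}
ledger = {"owner_department": "finance", "classification": "internal"}
payroll = {"owner_department": "hr", "classification": "restricted"}

print(is_allowed(POLICIES, "read", analyst, ledger))   # True
print(is_allowed(POLICIES, "read", analyst, payroll))  # False
print(is_allowed(POLICIES, "read", auditor, ledger))   # True
print(is_allowed(POLICIES, "read", auditor, payroll))  # False
```

The decision depends only on attributes of the user and the resource, which is what distinguishes ABAC from a fixed role-to-permission mapping.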

Edge

Edge nodes provide secure, minimal database connectivity. Each Edge node is deployed close to your data sources -- inside your VPC, your data center, or wherever your databases reside. The Edge exposes only three RPC methods, keeping the attack surface as small as possible.

| RPC Method | Purpose |
|---|---|
| Query | Execute a compiled SQL statement against a connected database and return results. |
| GetSchema | Retrieve table and column metadata from a connected database. |
| Ping | Health check to verify the Edge node and its database connections are operational. |

Key characteristics:

  • Minimal surface area -- Only three RPC methods and no business logic.
  • Deployed at the data -- Runs inside your network perimeter, adjacent to your databases.
  • TLS-secured gRPC -- All communication between the Coordinator and Edge nodes is encrypted in transit.
  • Stateless -- Edge nodes do not cache query results or store user data.
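The three-method surface can be sketched as a small class. This is an in-process stand-in, not the real Edge service: it uses Python's built-in `sqlite3` instead of gRPC and a production database driver, and the `EdgeNode` class, its method signatures, and the column-oriented return shape are all assumptions made for illustration.

```python
import sqlite3
from typing import Any


class EdgeNode:
    """Hypothetical stand-in for an Edge node: Query, GetSchema, Ping only."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection

    def query(self, sql: str, params: tuple[Any, ...] = ()) -> dict[str, list[Any]]:
        """Execute a SQL statement and return the results column by column."""
        cursor = self._conn.execute(sql, params)
        names = [d[0] for d in cursor.description or []]
        rows = cursor.fetchall()
        return {name: [row[i] for row in rows] for i, name in enumerate(names)}

    def get_schema(self) -> dict[str, list[str]]:
        """Return {table: [column, ...]} for every table in the database."""
        tables = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        schema: dict[str, list[str]] = {}
        for (table,) in tables:
            columns = self._conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]  # col[1] is the column name
        return schema

    def ping(self) -> bool:
        """Health check: True if the database answers a trivial query."""
        try:
            self._conn.execute("SELECT 1").fetchone()
            return True
        except sqlite3.Error:
            return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("emea", 120.0), ("apac", 75.5)],
)

edge = EdgeNode(conn)
print(edge.ping())        # True
print(edge.get_schema())  # {'orders': ['id', 'region', 'amount']}
print(edge.query("SELECT region, amount FROM orders ORDER BY id"))
# {'region': ['emea', 'apac'], 'amount': [120.0, 75.5]}
```

The class holds no state beyond the connection and makes no authorization decisions, mirroring the stateless, logic-free role described above.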

Client Library

The Client Library is a WebAssembly-based SDK that runs directly in the browser. It provides a GraphQL interface for building data applications with near-native performance.

Key characteristics:

  • WebAssembly -- Compiled to Wasm for near-native execution speed in the browser.
  • GraphQL access -- You interact with your data through a familiar GraphQL API.
  • Lightweight -- No server-side rendering required; the library runs entirely client-side.
  • Cross-platform -- Works in any modern browser that supports WebAssembly.

Service Communication

| Source | Destination | Protocol | Port | Authentication |
|---|---|---|---|---|
| Client Library | Coordinator | gRPC-Web | 8001 | JWT Bearer Token |
| Client Library | Coordinator | HTTP | 8000 | JWT Bearer Token |
| Browser / External | Coordinator | gRPC | 50051 | JWT Bearer Token |
| Browser / External | Coordinator | MCP | 8002 | JWT Bearer Token |
| Coordinator | Edge | gRPC (TLS) | 50051 | Mutual TLS |
| Edge | Database | Native Driver | Varies | Database-specific |
| Edge | Health Check | HTTP | 80 | None |
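As an illustration of the HTTP row above, the sketch below builds (but does not send) a request to the Coordinator on port 8000 with a JWT bearer token. Only the port and the bearer-token scheme come from the table; the `/graphql` path, the JSON body shape, the host name, and the `build_graphql_request` helper are assumptions, so check the actual API reference before relying on any of them.

```python
import json
import urllib.request
from typing import Any, Optional

COORDINATOR_HTTP_PORT = 8000  # HTTP port listed in the table above


def build_graphql_request(
    host: str,
    token: str,
    query: str,
    variables: Optional[dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build a POST request carrying a GraphQL query and a JWT bearer token."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return urllib.request.Request(
        # Hypothetical endpoint path; the source does not document one.
        url=f"http://{host}:{COORDINATOR_HTTP_PORT}/graphql",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder host, token, and query, for illustration only.
request = build_graphql_request(
    host="coordinator.example.internal",
    token="example-jwt",
    query="{ orders { region amount } }",
)
print(request.get_method())                 # POST
print(request.full_url)                     # http://coordinator.example.internal:8000/graphql
print(request.get_header("Authorization"))  # Bearer example-jwt
```

Passing the request to `urllib.request.urlopen` would send it; that step is omitted because it needs a reachable Coordinator and a valid token.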

Deployment Models

Datafi supports multiple deployment models to match your infrastructure and compliance requirements.

| Model | Coordinator | Edge | Data Sources | Best For |
|---|---|---|---|---|
| SaaS | Datafi-hosted | Datafi-hosted | Cloud databases | Teams that want zero infrastructure management. |
| Private Cloud | Your cloud account | Your cloud account | Your cloud databases | Organizations requiring full control over the compute layer. |
| On-Premises | Your data center | Your data center | On-premises databases | Regulated industries with strict data residency requirements. |
| Hybrid | Datafi-hosted | Your network | Mixed | Organizations that want managed orchestration with data that never leaves their perimeter. |

Hybrid Is the Most Common

Most enterprise customers deploy the Hybrid model. The Coordinator runs in Datafi's managed cloud, while Edge nodes run inside the customer's network. This means your data never leaves your infrastructure, but you still benefit from managed orchestration, updates, and monitoring.

Design Principles

  1. Data never moves -- Queries go to the data, not the other way around. Edge nodes execute queries where the data lives.
  2. Minimal trust boundaries -- Each service has the minimum permissions it needs. The Coordinator never connects to a database directly; it only handles the query results that Edge nodes return. Edge nodes never make authorization decisions.
  3. Protocol efficiency -- Internal communication uses gRPC with Protocol Buffers, result aggregation uses the Apache Arrow columnar format, and PRQL serves as the intermediate query representation.
  4. Horizontal scalability -- You can deploy as many Edge nodes as you need, each serving a different set of data sources. The Coordinator handles routing and aggregation transparently.
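The routing and aggregation described in these principles can be sketched as a parallel fan-out followed by a column-wise merge. This is a simplified model under stated assumptions: plain dictionaries of lists stand in for Apache Arrow record batches, Python callables stand in for gRPC calls to Edge nodes, and the names `fan_out`, `concat_columns`, and the example sub-queries are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

Columns = dict[str, list[Any]]     # column name -> values (stand-in for an Arrow batch)
EdgeQuery = Callable[[str], Columns]  # stand-in for an Edge node's Query RPC


def concat_columns(batches: list[Columns]) -> Columns:
    """Append batches that share the same column names, column by column."""
    if not batches:
        return {}
    names = list(batches[0])
    merged: Columns = {name: [] for name in names}
    for batch in batches:
        if list(batch) != names:
            raise ValueError("all batches must share the same columns")
        for name in names:
            merged[name].extend(batch[name])
    return merged


def fan_out(sub_queries: dict[str, str], edges: dict[str, EdgeQuery]) -> Columns:
    """Send each sub-query to its Edge in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(sub_queries))) as pool:
        # Futures are collected in sub-query order, so the merge is deterministic.
        futures = [pool.submit(edges[name], sql) for name, sql in sub_queries.items()]
        return concat_columns([future.result() for future in futures])


# Two fake Edge nodes, each fronting a different data source.
edges: dict[str, EdgeQuery] = {
    "emea": lambda sql: {"region": ["emea"], "total": [120.0]},
    "apac": lambda sql: {"region": ["apac"], "total": [75.5]},
}
sub_queries = {
    "emea": "SELECT region, SUM(amount) AS total FROM orders GROUP BY region",
    "apac": "SELECT region, SUM(amount) AS total FROM orders GROUP BY region",
}

result = fan_out(sub_queries, edges)
print(result)  # {'region': ['emea', 'apac'], 'total': [120.0, 75.5]}
```

Adding another data source in this model means adding one more entry to `edges`; the merge step is unchanged, which is the property the horizontal-scalability principle relies on.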

Next Steps