Edge API
The Datafi edge server exposes a deliberately minimal API surface. It runs close to your data sources and handles only three operations: schema retrieval, query execution, and health checking. All governance, authentication, and catalog logic remain on the coordinator -- the edge server is a lean execution layer.
Design Philosophy
The edge API follows a minimal surface area design. By limiting the edge server to three RPCs, Datafi reduces the attack surface of the component that has direct access to your databases.
| Characteristic | Detail |
|---|---|
| Total RPCs | 3 |
| Authentication | mTLS (coordinator to edge); JWT not required on edge directly |
| Authorization | Enforced by coordinator before forwarding to edge |
| Protocol | gRPC (port 50051) and HTTP (port 8000) |
| Response format | Apache Arrow (gRPC), JSON (HTTP) |
RPCs
GetSchema
Retrieves the schema of a connected data source, including tables, columns, data types, and constraints.
Use case: The coordinator calls GetSchema to populate its catalog with metadata about the data source connected to this edge server.
service EdgeService {
  rpc GetSchema(GetSchemaRequest) returns (GetSchemaResponse);
}

message GetSchemaRequest {
  string connection_id = 1;
  bool refresh = 2; // Force refresh from the data source
}

message GetSchemaResponse {
  repeated TableSchema tables = 1;
  string connection_id = 2;
  string last_refreshed = 3; // ISO 8601 timestamp
}

message TableSchema {
  string name = 1;
  string schema_name = 2;
  repeated ColumnSchema columns = 3;
  int64 estimated_row_count = 4;
}

message ColumnSchema {
  string name = 1;
  string data_type = 2;
  bool nullable = 3;
  bool is_primary_key = 4;
  string default_value = 5;
}
HTTP equivalent:
curl https://edge.internal:8000/v1/schema/conn_abc123 \
  -H "Authorization: Bearer $EDGE_TOKEN"
Example response:
{
  "tables": [
    {
      "name": "customers",
      "schema_name": "public",
      "columns": [
        {"name": "id", "data_type": "integer", "nullable": false, "is_primary_key": true},
        {"name": "name", "data_type": "varchar(255)", "nullable": false},
        {"name": "email", "data_type": "varchar(255)", "nullable": true},
        {"name": "region", "data_type": "varchar(50)", "nullable": true},
        {"name": "created_at", "data_type": "timestamp", "nullable": false}
      ],
      "estimated_row_count": 125000
    }
  ],
  "connection_id": "conn_abc123",
  "last_refreshed": "2025-01-15T10:30:00Z"
}
Schema results are cached by the coordinator with a default TTL of 1 hour. Set refresh: true to bypass the cache and fetch the latest schema directly from the data source.
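The caching behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not Datafi's implementation: the `SchemaCache` class, its `fetch` callback, and the injectable `clock` are hypothetical names introduced here. The 1-hour default TTL and the `refresh` flag come from the text.

```python
import time
from typing import Any, Callable, Dict, Tuple


class SchemaCache:
    """Hypothetical coordinator-side schema cache (illustration only)."""

    def __init__(
        self,
        fetch: Callable[[str], Dict[str, Any]],
        ttl_seconds: float = 3600.0,  # default TTL of 1 hour, per the docs
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # assumed to call GetSchema on the edge server
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Dict[str, Any]]] = {}

    def get_schema(self, connection_id: str, refresh: bool = False) -> Dict[str, Any]:
        now = self._clock()
        entry = self._entries.get(connection_id)
        # Serve from cache only if not forced to refresh and the entry is fresh.
        if not refresh and entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        schema = self._fetch(connection_id)
        self._entries[connection_id] = (now, schema)
        return schema
```

Passing `refresh=True` always triggers a new fetch and resets the entry's age, which mirrors the documented bypass semantics.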
Query
Executes a SQL query against the connected data source and returns results.
Use case: The coordinator forwards authorized, validated queries to the edge server for execution against the data source.
service EdgeService {
  rpc Query(QueryRequest) returns (QueryResponse);
}

message QueryRequest {
  string sql = 1;
  string connection_id = 2;
  map<string, string> parameters = 3;
  int32 timeout_seconds = 4;
}

message QueryResponse {
  bytes arrow_record_batch = 1; // Serialized Apache Arrow RecordBatch
  QueryMetadata metadata = 2;
}

message QueryMetadata {
  int64 row_count = 1;
  int64 execution_time_ms = 2;
  string query_id = 3;
  bool from_cache = 4;
  repeated string columns = 5;
}
HTTP equivalent:
curl -X POST https://edge.internal:8000/v1/query \
  -H "Authorization: Bearer $EDGE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT name, email, region FROM customers WHERE region = $1 LIMIT 100",
    "connection_id": "conn_abc123",
    "parameters": {"$1": "us-west"},
    "timeout_seconds": 30
  }'
Example response (HTTP/JSON):
{
  "data": [
    {"name": "Alice", "email": "alice@example.com", "region": "us-west"},
    {"name": "Carol", "email": "carol@example.com", "region": "us-west"}
  ],
  "metadata": {
    "row_count": 2,
    "execution_time_ms": 45,
    "query_id": "qry_edge_001",
    "from_cache": false,
    "columns": ["name", "email", "region"]
  }
}
Query Parameterization
All queries executed on the edge server use parameterized statements to prevent SQL injection. User-supplied values are passed separately from the SQL text.
{
  "sql": "SELECT * FROM orders WHERE customer_id = $1 AND status = $2",
  "parameters": {
    "$1": "cust_123",
    "$2": "completed"
  }
}
The edge server rejects queries that contain unparameterized user input embedded directly in the SQL string. Always use the parameters field.
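A client can build the request body so that values never touch the SQL text. The following Python sketch shows one way to do that. The helper name `build_query_payload` is hypothetical, and the placeholder check is a simple regex heuristic, not the validation the edge server itself performs. Values are converted to strings because `parameters` is declared as `map<string, string>`.

```python
import re
from typing import Any, Dict, Sequence

# Matches positional placeholders such as $1, $2. This is a heuristic:
# it does not parse SQL, so a "$1" inside a string literal also matches.
_PLACEHOLDER = re.compile(r"\$(\d+)")


def build_query_payload(
    sql: str,
    values: Sequence[Any],
    connection_id: str,
    timeout_seconds: int = 30,
) -> Dict[str, Any]:
    """Hypothetical helper: pair positional values with $n placeholders."""
    referenced = {int(n) for n in _PLACEHOLDER.findall(sql)}
    expected = set(range(1, len(values) + 1))
    if referenced != expected:
        raise ValueError(
            f"SQL references placeholders {sorted(referenced)} "
            f"but {len(values)} value(s) were supplied"
        )
    return {
        "sql": sql,
        "connection_id": connection_id,
        # Keys follow the "$1", "$2" convention shown in the examples.
        "parameters": {f"${i}": str(v) for i, v in enumerate(values, start=1)},
        "timeout_seconds": timeout_seconds,
    }
```

Failing fast on a placeholder/value mismatch catches the most common mistake on the client before the request reaches the edge server.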
Query Timeout
If a query exceeds the specified timeout_seconds, the edge server cancels the query on the data source and returns a timeout error.
{
  "error": {
    "code": "QUERY_TIMEOUT",
    "message": "Query exceeded the 30 second timeout.",
    "query_id": "qry_edge_001"
  }
}
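On the caller's side, the timeout error can be turned into a typed exception so it is handled separately from other failures. The sketch below assumes the HTTP/JSON response shapes shown in this section. The exception classes and the `rows_or_raise` helper are hypothetical, and `QUERY_TIMEOUT` is the only error code documented here, so every other code falls through to a generic error.

```python
from typing import Any, Dict, List


class EdgeQueryError(Exception):
    """Generic edge error (hypothetical; other codes are not documented here)."""


class QueryTimeoutError(EdgeQueryError):
    """Raised when the edge server reports QUERY_TIMEOUT."""

    def __init__(self, message: str, query_id: str) -> None:
        super().__init__(message)
        self.query_id = query_id


def rows_or_raise(body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return result rows from a decoded JSON body, or raise on an error body."""
    error = body.get("error")
    if error is None:
        return body.get("data", [])
    if error.get("code") == "QUERY_TIMEOUT":
        raise QueryTimeoutError(error.get("message", ""), error.get("query_id", ""))
    raise EdgeQueryError(error.get("message", "unknown edge error"))
```

Keeping `query_id` on the exception lets the caller correlate the failure with logs on the edge server or coordinator.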
Ping
A simple health check RPC that confirms the edge server is running and responsive.
service EdgeService {
  rpc Ping(PingRequest) returns (PingResponse);
}

message PingRequest {}

message PingResponse {
  string status = 1;  // "SERVING"
  string version = 2; // "1.12.0"
  string uptime = 3;  // "3d 14h 22m"
}
HTTP equivalent:
curl https://edge.internal:8000/health
Response:
{
  "status": "SERVING",
  "version": "1.12.0",
  "uptime": "3d 14h 22m"
}
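A monitoring script might interpret this response as follows. This Python sketch assumes that `"SERVING"` is the healthy status and that `uptime` always follows the `"3d 14h 22m"` pattern from the example. The source does not specify the uptime format beyond that one sample, and both function names are hypothetical.

```python
import re
from datetime import timedelta
from typing import Any, Dict

# Assumed format: optional days, hours, and minutes, e.g. "3d 14h 22m".
_UPTIME = re.compile(r"^(?:(\d+)d)?\s*(?:(\d+)h)?\s*(?:(\d+)m)?$")


def is_healthy(body: Dict[str, Any]) -> bool:
    """Treat the server as healthy only when status is exactly "SERVING"."""
    return body.get("status") == "SERVING"


def parse_uptime(text: str) -> timedelta:
    """Convert an uptime string such as "3d 14h 22m" into a timedelta."""
    match = _UPTIME.match(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized uptime format: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)
```

A probe could combine the two, for example by alerting when `is_healthy` is false or when the parsed uptime resets unexpectedly between checks.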
Security Model
The edge server does not authenticate or authorize end users itself. These responsibilities belong to the coordinator, which enforces them before forwarding a request to the edge.
The edge server trusts requests from the coordinator because the mTLS channel ensures that only the legitimate coordinator can communicate with it.
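To make the trust model concrete, the sketch below shows how a generic Python server could require client certificates using the standard `ssl` module. It illustrates mutual TLS in general, not how the Datafi edge server is configured: the function name, the file-path parameters, and the TLS 1.2 minimum are all assumptions.

```python
import ssl
from typing import Optional


def build_edge_tls_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    coordinator_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Hypothetical server-side TLS context that demands a client certificate."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy
    # CERT_REQUIRED makes the handshake fail unless the peer (here, the
    # coordinator) presents a certificate signed by a trusted CA.
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The server's own identity, presented to the coordinator.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if coordinator_ca_file:
        # The CA that issued the coordinator's client certificate.
        context.load_verify_locations(cafile=coordinator_ca_file)
    return context
```

With a context like this, a connection from any client lacking a certificate issued by the trusted CA is rejected during the TLS handshake, before any RPC is processed.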
Best Practices
- Deploy edge servers close to your data sources. Network latency between the edge server and the database directly impacts query performance.
- Use private networking. Place edge servers and data sources in the same VPC or subnet to avoid public internet traversal.
- Monitor edge health. Use the Ping RPC or /health endpoint in your orchestrator's liveness and readiness probes.
- Set appropriate query timeouts. Prevent long-running queries from consuming edge server resources indefinitely.