
Request Lifecycle

Every query in Datafi follows a deterministic, six-stage lifecycle. This page walks you through each stage, from the moment your request arrives at the Coordinator to the moment aggregated results are returned to your application.

Lifecycle Overview

Stage 1: Authentication

Every request must include a valid JSON Web Token (JWT). The Coordinator validates the token before any other processing occurs.

What happens:

  1. You send a request to the Coordinator over gRPC-Web (:8001), HTTP (:8000), or gRPC (:50051). The JWT is included as a bearer token in the request metadata or authorization header.
  2. The Coordinator validates the token signature against the tenant's configured identity provider.
  3. The Coordinator checks the token's expiration, issuer, audience, and required claims.
  4. If validation fails, the request is rejected immediately with an authentication error.

What you need to know:

  • Tokens are issued by your identity provider (e.g., Auth0, Okta, Azure AD, or Datafi's built-in provider).
  • The Coordinator does not store tokens. Validation is stateless and performed on every request.
  • Token refresh is handled by the Client Library automatically.
warning

Expired or malformed tokens result in an immediate rejection. No downstream processing occurs. Ensure your identity provider is configured with appropriate token lifetimes.
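To make the validation step concrete, here is a minimal Go sketch of the kind of stateless check Stage 1 describes, using the golang-jwt library. The issuer, audience, and key lookup are illustrative assumptions, not Datafi's actual configuration.

package auth

import (
	"crypto/rsa"
	"fmt"

	"github.com/golang-jwt/jwt/v5"
)

// validateToken checks signature, expiration, issuer, and audience on every
// request. Nothing is cached or stored between calls, mirroring the stateless
// validation described above. Issuer and audience values are assumptions.
func validateToken(raw string, idpPublicKey *rsa.PublicKey) (jwt.MapClaims, error) {
	token, err := jwt.Parse(raw,
		func(t *jwt.Token) (interface{}, error) { return idpPublicKey, nil },
		jwt.WithValidMethods([]string{"RS256"}),    // reject unexpected signing algorithms
		jwt.WithIssuer("https://idp.example.com/"), // tenant's identity provider (assumed)
		jwt.WithAudience("datafi-coordinator"),     // expected audience (assumed)
		jwt.WithExpirationRequired(),               // exp claim must be present and valid
	)
	if err != nil {
		return nil, fmt.Errorf("authentication error: %w", err)
	}
	claims, ok := token.Claims.(jwt.MapClaims)
	if !ok || !token.Valid {
		return nil, fmt.Errorf("authentication error: invalid token")
	}
	return claims, nil
}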

Stage 2: Authorization

After authentication, the Coordinator evaluates Attribute-Based Access Control (ABAC) policies to determine what you are allowed to do.

What happens:

  1. The ABAC engine loads all policies attached to the requested resource (dataset, data view, or workspace).
  2. Each policy's rules are evaluated against attributes from four dimensions:
    • User attributes -- Role, department, group memberships, custom claims from the JWT.
    • Resource attributes -- Dataset name, sensitivity classification, data source type.
    • Action attributes -- Read, write, schema discovery, export.
    • Environment attributes -- Time of day, IP address, device type.
  3. The engine produces one of three outcomes:
    • Allow -- The request proceeds unchanged.
    • Deny -- The request is rejected with an authorization error.
    • Filter -- The request proceeds, but row-level or column-level restrictions are injected into the query.

What you need to know:

  • ABAC policies are evaluated entirely within the Coordinator. Edge nodes never make authorization decisions.
  • Row-level filters are appended as WHERE clauses during query planning.
  • Column-level restrictions remove columns from the projection or replace values with masked output.
  • Policy evaluation results are not cached between requests. Each request is evaluated independently.
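The decision model can be pictured with a small sketch. The types and policy shape below are illustrative assumptions, not Datafi's internal implementation; they show how a single evaluation can yield Allow, Deny, or Filter, with restrictions carried forward into query planning.

package abac

// Effect is the outcome of evaluating the policies attached to a resource.
type Effect int

const (
	Allow Effect = iota
	Deny
	Filter
)

// Decision carries any restrictions that query planning must inject.
type Decision struct {
	Effect      Effect
	RowFilters  []string // predicates appended as WHERE clauses during planning
	MaskColumns []string // columns to drop or mask in the projection
}

// Attributes groups the four dimensions evaluated by the engine.
type Attributes struct {
	User, Resource, Action, Environment map[string]string
}

// Evaluate walks every policy attached to the resource. A single deny wins;
// otherwise row- and column-level restrictions from matching policies are merged.
func Evaluate(policies []func(Attributes) Decision, attrs Attributes) Decision {
	result := Decision{Effect: Allow}
	for _, policy := range policies {
		d := policy(attrs)
		switch d.Effect {
		case Deny:
			return Decision{Effect: Deny}
		case Filter:
			result.Effect = Filter
			result.RowFilters = append(result.RowFilters, d.RowFilters...)
			result.MaskColumns = append(result.MaskColumns, d.MaskColumns...)
		}
	}
	return result
}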

Stage 3: Query Planning

The Query Planner breaks your request into sub-queries, each targeted at a specific data source.

What happens:

  1. The planner parses your incoming query (GraphQL from the Client Library, or a direct API call).
  2. It identifies which datasets are involved and which Edge nodes host those datasets.
  3. For federated queries that span multiple data sources, the planner determines:
    • Which operations can be pushed down to individual databases (filters, aggregations, sorting).
    • Which operations must be performed during result aggregation (cross-source joins, unions).
  4. The planner produces a sub-query plan -- a directed acyclic graph of operations.

What you need to know:

  • Push-down optimization is critical for performance. The planner pushes as much computation as possible to the source databases, reducing the volume of data transferred over the network.
  • If your query touches only a single data source, the plan contains a single sub-query and no aggregation step.
  • The planner injects any row-level or column-level filters from the authorization stage into the appropriate sub-queries.
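One way to picture the planner's output is the hypothetical structure below. The field names are assumptions for illustration only; they capture the split between work pushed down to each source and work deferred to the aggregation stage.

package plan

// SubQuery is the unit of work dispatched to a single Edge node in Stage 5.
type SubQuery struct {
	EdgeNode   string   // Edge node hosting the dataset
	Dataset    string
	PushedDown []string // filters, aggregations, and sorts executed at the source
	ABACRules  []string // row-level predicates injected from the authorization stage
	Columns    []string // projection after column-level restrictions
}

// Plan is the DAG produced by the planner.
type Plan struct {
	SubQueries []SubQuery // executed in parallel in Stage 5
	Aggregate  []string   // cross-source joins/unions performed in Stage 6; empty for single-source queries
}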

Stage 4: SQL Compilation (PRQL)

Each sub-query in the plan is compiled from the internal representation into database-specific SQL using PRQL (Pipelined Relational Query Language) as the intermediate representation.

What happens:

  1. The sub-query is expressed as a PRQL pipeline. PRQL is more composable than raw SQL and supports a consistent syntax regardless of the target database.
  2. The PRQL compiler translates the pipeline into the SQL dialect required by the target database (e.g., Snowflake SQL, T-SQL for MSSQL, BigQuery Standard SQL).
  3. Database-specific adjustments are applied during compilation, such as vendor-specific function names, quoting conventions, and syntax variations.

Why PRQL?

| Concern | SQL | PRQL |
| --- | --- | --- |
| Composability | Subqueries, CTEs, nested expressions | Linear pipeline of transformations |
| Dialect handling | Manual per-database syntax | Single syntax, compiled to any dialect |
| Policy injection | String concatenation or query rewriting | Pipeline stages inserted programmatically |
| Readability | Can become deeply nested | Always reads top to bottom |

Example:

from employees
filter department == "engineering"
filter start_date > @2024-01-01
select {employee_id, name, title, salary}
sort {-salary}
take 50

This PRQL compiles to the appropriate SQL for whichever database hosts the employees dataset -- whether that is PostgreSQL, Snowflake, MSSQL, or any other supported source.
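For a rough sense of the dialect handling, the constants below show approximately what that pipeline could compile to for two targets. This is an illustration only; actual compiler output differs in quoting, aliasing, and formatting.

package compile

// Illustration of dialect differences: the same pipeline rendered for
// PostgreSQL (LIMIT) and MSSQL (TOP). Not literal compiler output.
const postgresSQL = `
SELECT employee_id, name, title, salary
FROM employees
WHERE department = 'engineering' AND start_date > DATE '2024-01-01'
ORDER BY salary DESC
LIMIT 50`

const mssqlSQL = `
SELECT TOP 50 employee_id, name, title, salary
FROM employees
WHERE department = 'engineering' AND start_date > '2024-01-01'
ORDER BY salary DESC`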

info

You do not write PRQL directly. The Coordinator generates PRQL internally from your GraphQL queries or API calls. PRQL is an implementation detail of the compilation pipeline.

Stage 5: Parallel Execution

The compiled SQL statements are dispatched to Edge nodes for execution against the target databases.

What happens:

  1. The Coordinator opens gRPC (TLS) streams to the relevant Edge nodes.
  2. Each Edge node receives a compiled SQL statement through the Query RPC method.
  3. The Edge node executes the SQL against its connected database using the appropriate native driver or ODBC connection.
  4. Results are serialized into Apache Arrow columnar format and streamed back to the Coordinator.

What you need to know:

  • Sub-queries targeting different Edge nodes execute in parallel. A federated query that touches three databases does not take three times as long -- the queries run concurrently.
  • Each Edge node manages its own connection pool to its databases.
  • If an Edge node is unreachable or a database query fails, the Coordinator reports the error for that sub-query without blocking results from other sub-queries (when possible).
  • The maximum message size is 1 GB. The default timeout is 5 minutes, configurable per request.
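The fan-out pattern can be sketched as follows. EdgeClient and the surrounding types are hypothetical stand-ins, not Datafi's gRPC client; the point is that sub-queries run concurrently and a failure on one node is recorded without discarding results from the others.

package execute

import (
	"context"
	"sync"
	"time"
)

// EdgeClient is a hypothetical interface standing in for the real gRPC client.
type EdgeClient interface {
	Query(ctx context.Context, sql string) (ArrowBatch, error)
}

// ArrowBatch is a placeholder for an Arrow-encoded result stream.
type ArrowBatch []byte

type SubQueryResult struct {
	Node  string
	Batch ArrowBatch
	Err   error // a per-sub-query error does not block other results
}

// fanOut dispatches each compiled SQL statement to its Edge node concurrently
// and collects results and errors as they arrive.
func fanOut(ctx context.Context, clients map[string]EdgeClient, sqlByNode map[string]string) []SubQueryResult {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute) // default timeout; configurable per request
	defer cancel()

	results := make([]SubQueryResult, 0, len(sqlByNode))
	var mu sync.Mutex
	var wg sync.WaitGroup
	for node, sql := range sqlByNode {
		wg.Add(1)
		go func(node, sql string) {
			defer wg.Done()
			batch, err := clients[node].Query(ctx, sql) // executes in parallel across Edge nodes
			mu.Lock()
			results = append(results, SubQueryResult{Node: node, Batch: batch, Err: err})
			mu.Unlock()
		}(node, sql)
	}
	wg.Wait()
	return results
}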

Stage 6: Result Aggregation

The Coordinator merges results from all Edge nodes into a single, unified response.

What happens:

  1. As Arrow-formatted result sets arrive from Edge nodes, the Result Aggregator processes them.
  2. If the query plan includes cross-source operations (joins, unions, deduplication), they are performed on the aggregated data.
  3. Final sorting, pagination, and formatting are applied.
  4. The unified result set is returned to the client.

What you need to know:

  • Apache Arrow enables high-performance, zero-copy columnar processing during aggregation.
  • The Coordinator performs only the operations that could not be pushed down to the source databases. Most filtering and aggregation happens at the Edge.
  • When using gRPC or gRPC-Web, results are streamed back to the client as they become available.
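Conceptually, the aggregation step looks like the sketch below. The row type is an illustrative stand-in; the real aggregator operates on Apache Arrow record batches rather than Go slices of structs.

package aggregate

import "sort"

// Row stands in for a record arriving from an Edge node.
type Row struct {
	Key    string         // value used for the final cross-source ordering
	Values map[string]any // remaining columns
}

// Merge unions per-node results, applies a final sort that no single source
// could produce on its own, and paginates before returning to the client.
func Merge(perNode [][]Row, offset, limit int) []Row {
	var merged []Row
	for _, rows := range perNode {
		merged = append(merged, rows...)
	}
	sort.SliceStable(merged, func(i, j int) bool { return merged[i].Key < merged[j].Key })
	if offset > len(merged) {
		offset = len(merged)
	}
	end := offset + limit
	if end > len(merged) {
		end = len(merged)
	}
	return merged[offset:end]
}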

End-to-End Flow Summary

| Stage | Component | Input | Output |
| --- | --- | --- | --- |
| 1. Authentication | Coordinator | JWT | Validated identity |
| 2. Authorization | ABAC Engine | Identity + resource + action | Allow / Deny / Filter conditions |
| 3. Query Planning | Query Planner | Parsed query + filters | Sub-query plan (DAG) |
| 4. SQL Compilation | PRQL Compiler | Sub-query plan | Database-specific SQL statements |
| 5. Parallel Execution | Edge Nodes | SQL statements | Apache Arrow result sets |
| 6. Result Aggregation | Result Aggregator | Multiple Arrow result sets | Unified response |

Error Handling

Errors can occur at any stage of the lifecycle. The Coordinator returns structured error responses that identify the stage where the failure occurred.

| Stage | Common Errors |
| --- | --- |
| Authentication | Expired token, invalid signature, missing claims |
| Authorization | Insufficient permissions, policy denial |
| Query Planning | Unknown dataset, invalid query syntax |
| SQL Compilation | Unsupported operation for target dialect |
| Parallel Execution | Edge node unreachable, database timeout, connection failure |
| Result Aggregation | Memory limit exceeded, incompatible schemas for join |
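A structured error of this kind might carry fields like the following. The field names are assumptions for illustration, not Datafi's documented response schema.

package errors

// StageError is a hypothetical shape for a structured error response that
// identifies the lifecycle stage where the failure occurred.
type StageError struct {
	Stage   string `json:"stage"`          // e.g. "authorization", "parallel_execution" (assumed values)
	Code    string `json:"code"`           // machine-readable error code
	Message string `json:"message"`        // human-readable description
	Node    string `json:"node,omitempty"` // Edge node, for execution-stage failures
}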

Next Steps