# Operations Guide
This guide covers day-to-day operational concerns for running Datafi in production, including health checks, shutdown behavior, update strategies, monitoring, and caching.
## Health Monitoring
Datafi exposes health endpoints on both the coordinator and edge server for liveness and readiness checks.
### HTTP Health Check

```bash
# Liveness check
curl http://localhost:8000/health
```

Response (healthy):

```json
{
  "status": "healthy",
  "version": "1.12.0",
  "uptime": "3d 14h 22m"
}
```

Response (unhealthy):

```json
{
  "status": "unhealthy",
  "version": "1.12.0",
  "checks": {
    "database": "timeout",
    "cache": "healthy"
  }
}
```
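Deploy and smoke-test scripts can gate on the `status` field of this payload. A minimal POSIX-shell sketch; here a sample healthy response stands in for the live `curl -s http://localhost:8000/health` call:

```shell
# Gate a script on the health payload. In practice, populate RESPONSE with:
#   RESPONSE=$(curl -s http://localhost:8000/health)
RESPONSE='{"status":"healthy","version":"1.12.0","uptime":"3d 14h 22m"}'

case "$RESPONSE" in
  *'"status": "healthy"'*|*'"status":"healthy"'*)
    echo "OK" ;;
  *)
    echo "NOT HEALTHY" >&2
    exit 1 ;;
esac
```

A JSON-aware tool such as `jq` is more robust than string matching if it is available in your deploy environment.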
### gRPC Health Check

The gRPC `Ping` RPC serves as a health check for gRPC clients.

```bash
# Using grpcurl
grpcurl -plaintext localhost:50051 datafi.edge.v1.EdgeService/Ping
```

Response:

```json
{
  "status": "SERVING"
}
```
### Health Check Configuration

| Parameter | Default | Description |
|---|---|---|
| Health endpoint path | `/health` | HTTP GET endpoint |
| gRPC health RPC | `Ping` | gRPC liveness check |
| Check interval (K8s) | 30s | How often the orchestrator polls |
| Timeout | 5s | Max wait for a health response |
| Failure threshold | 3 | Consecutive failures before restart |
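The defaults above map directly onto Kubernetes probe settings. A sketch for the pod spec, assuming the container serves HTTP on port 8000 (adjust the port to your configuration):

```yaml
# Illustrative probe configuration matching the defaults above.
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30      # check interval
  timeoutSeconds: 5      # max wait for a response
  failureThreshold: 3    # consecutive failures before restart
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
```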
## Graceful Shutdown

When a Datafi container receives a termination signal (`SIGTERM`), it initiates a graceful shutdown sequence.
### Shutdown Behavior

1. Stop accepting new connections. The server immediately stops accepting new TCP connections.
2. Drain in-flight requests. All currently executing requests are allowed to complete within the grace period.
3. Flush buffers. Any cached data or pending log entries are flushed.
4. Close connections. Database connections, Redis connections, and mTLS channels are closed cleanly.
5. Exit. The process exits with code 0.
The default grace period is 30 seconds. If in-flight requests do not complete within this window, they are terminated.
Set your orchestrator's `terminationGracePeriodSeconds` to at least 30 seconds to allow Datafi to drain requests before forced termination.
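The sequence above can be sketched as a POSIX-shell signal handler. This is illustrative only, not Datafi's actual implementation:

```shell
#!/bin/sh
# Illustrative sketch of the graceful shutdown sequence; not Datafi source code.
graceful_shutdown() {
  echo "stopped accepting new connections"
  echo "draining in-flight requests (up to the 30s grace period)"
  echo "flushing buffers and pending log entries"
  echo "closing database, Redis, and mTLS connections"
  exit 0  # clean exit reported to the orchestrator
}
trap graceful_shutdown TERM  # invoked when SIGTERM arrives

# A real server's accept/request loop would run here.
```

With plain Docker, `docker stop --time=30 datafi-coordinator` produces the same lifecycle: SIGTERM first, then SIGKILL if the process has not exited within 30 seconds.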
## Rolling Updates
Use rolling updates to deploy new versions of Datafi without downtime.
### Kubernetes Rolling Update Strategy

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 30
```
### Update Procedure

1. Update the image tag in your deployment manifest.
2. Apply the manifest. Kubernetes creates a new pod with the updated image.
3. Readiness probe passes. The new pod begins accepting traffic only after its `/health` endpoint returns healthy.
4. Old pod drains. The old pod receives `SIGTERM` and gracefully shuts down.
5. Repeat. The process continues until all pods are updated.
```bash
# Update the image
kubectl set image deployment/datafi-coordinator \
  coordinator=datafi/coordinator:1.13.0

# Monitor the rollout
kubectl rollout status deployment/datafi-coordinator
```
### Rollback

If a deployment introduces issues, roll back immediately:

```bash
kubectl rollout undo deployment/datafi-coordinator
```
## Monitoring Recommendations
Track the following metrics to ensure healthy operation.
| Metric | Target | Alert Threshold | Description |
|---|---|---|---|
| Query latency (p99) | < 5s | > 5s | 99th percentile query execution time |
| Query latency (p50) | < 500ms | > 2s | Median query execution time |
| Error rate | < 1% | > 1% | Percentage of requests returning 5xx |
| Active connections | Varies | > 80% of limit | Current open connections to data sources |
| Cache hit rate | > 70% | < 50% | Percentage of queries served from cache |
| Memory usage | < 80% | > 85% | Container memory utilization |
| CPU usage | < 70% | > 80% | Container CPU utilization |
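As a lightweight complement to a full monitoring stack, the failure-threshold logic from the health-check table can be scripted directly against the `/health` endpoint. A sketch in shell, using the 5s timeout and threshold of 3 (the URL is the local default; the polling interval is shortened here for illustration):

```shell
# Poll the health endpoint three times; flag the instance after three
# consecutive failures, mirroring the orchestrator's failure threshold.
FAILS=0
for attempt in 1 2 3; do
  if curl -sf --max-time 5 http://localhost:8000/health > /dev/null 2>&1; then
    FAILS=0   # any success resets the consecutive-failure count
  else
    FAILS=$((FAILS + 1))
  fi
done
if [ "$FAILS" -ge 3 ]; then
  echo "unhealthy: $FAILS consecutive failures"
fi
```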
### OpenTelemetry Integration
Datafi exports traces and metrics via OpenTelemetry. Configure the collector endpoint using environment variables.
```bash
docker run -d \
  --name datafi-coordinator \
  -e MODE=coordinator \
  -e OTEL_ENDPOINT=https://otel-collector.example.com:4317 \
  -e OTEL_TOKEN=your-otel-token \
  datafi/coordinator:latest
```
Exported telemetry includes:
- Traces -- end-to-end request traces through authentication, authorization, query execution, and response delivery.
- Metrics -- query latency histograms, error counters, cache hit/miss ratios, connection pool utilization.
- Logs -- structured JSON logs with request IDs, tenant IDs, and operation details.
## Caching Configuration
Datafi caches frequently accessed data to reduce latency and load on data sources.
| Cache Layer | Default TTL | Scope | Description |
|---|---|---|---|
| Catalog cache | 1 hour | Per tenant | Schema metadata, table definitions, column types |
| Query result cache | 1 hour | Per user + query | Results of previously executed queries |
| JWKS cache | On-demand | Global | Identity provider signing keys |
### Configuring Cache TTLs

```yaml
caching:
  enabled: true
  catalog:
    ttl: 3600            # 1 hour in seconds
  query_results:
    ttl: 3600            # 1 hour in seconds
    max_size: 100mb      # Maximum cache size per tenant
  jwks:
    strategy: on_demand  # Fetched when an unknown kid is encountered
```
### Cache Invalidation
| Event | Cache Invalidated |
|---|---|
| Schema change detected | Catalog cache for affected tables |
| Policy update | Query result cache for affected datasets |
| Manual flush | All caches for the tenant |
| TTL expiration | Specific cache entry |
To manually flush a tenant's cache:
```bash
# Via the admin API
curl -X POST https://api.datafi.io/admin/cache/flush \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tenant_id": "tenant_001"}'
```
### Disabling Cache
If you need real-time data with no caching:
```bash
docker run -d \
  --name datafi-edge \
  -e MODE=edge \
  -e CACHE_ENABLED=false \
  datafi/edge:latest
```
Disabling caching increases load on your data sources and may increase query latency. Only disable caching when real-time freshness is a strict requirement.
## Log Management
Datafi emits structured JSON logs to stdout. Direct these to your centralized logging platform.
```json
{
  "level": "info",
  "timestamp": "2025-01-15T10:30:00Z",
  "request_id": "req_7f8a9b2c",
  "tenant_id": "tenant_001",
  "user_id": "user_abc123",
  "operation": "query.execute",
  "duration_ms": 142,
  "status": "success"
}
```
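Because each entry is a single JSON object, basic triage works with plain text tools even before logs reach a centralized platform. A sketch using compact sample lines that mirror the fields above; in production you would pipe `docker logs datafi-edge` (or your log shipper's stream) into the same filter:

```shell
# Count error-level entries in a one-object-per-line JSON log stream.
ERRORS=$(printf '%s\n' \
  '{"level":"info","operation":"query.execute","duration_ms":142,"status":"success"}' \
  '{"level":"error","operation":"query.execute","duration_ms":5000,"status":"failed"}' \
  | grep -c '"level":"error"')
echo "$ERRORS error entry(ies) found"
```

For anything beyond counting (grouping by `tenant_id`, latency percentiles), a JSON-aware tool or your logging platform's query language is the better fit.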
### Log Levels

| Level | Use |
|---|---|
| `debug` | Detailed diagnostic information; use only during troubleshooting |
| `info` | Normal operational events (default) |
| `warn` | Potentially harmful situations that do not cause failures |
| `error` | Errors that prevent a specific request from completing |
## Best Practices
- Always use readiness probes. Prevent traffic from reaching containers that are not yet ready to serve requests.
- Set resource limits. Define CPU and memory limits to prevent a single container from consuming excessive resources.
- Monitor the p99 latency. A rising p99 often indicates resource contention or slow data sources.
- Keep caching enabled in production. The performance benefit of caching almost always outweighs the slight data staleness.
- Export telemetry to a centralized platform. Distributed tracing across coordinator and edge servers is essential for diagnosing performance issues.