Operations Guide

This guide covers day-to-day operational concerns for running Datafi in production, including health checks, shutdown behavior, update strategies, monitoring, and caching.

Health Monitoring

Datafi exposes health endpoints on both the coordinator and edge server for liveness and readiness checks.

HTTP Health Check

# Liveness check
curl http://localhost:8000/health

Response (healthy):

{
  "status": "healthy",
  "version": "1.12.0",
  "uptime": "3d 14h 22m"
}

Response (unhealthy):

{
  "status": "unhealthy",
  "version": "1.12.0",
  "checks": {
    "database": "timeout",
    "cache": "healthy"
  }
}

gRPC Health Check

The gRPC Ping RPC serves as a health check for gRPC clients.

# Using grpcurl
grpcurl -plaintext localhost:50051 datafi.edge.v1.EdgeService/Ping

Response:

{
  "status": "SERVING"
}

Health Check Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| Health endpoint path | /health | HTTP GET endpoint |
| gRPC health RPC | Ping | gRPC liveness check |
| Check interval (K8s) | 30s | How often the orchestrator polls |
| Timeout | 5s | Max wait for a health response |
| Failure threshold | 3 | Consecutive failures before restart |
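
In Kubernetes, these defaults translate directly into liveness and readiness probes. A minimal sketch, assuming the HTTP API is served on port 8000 as in the curl example above:

# Illustrative probe settings matching the defaults in the table above.
# Assumes the HTTP health endpoint is exposed on port 8000.
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30       # check interval
  timeoutSeconds: 5       # max wait for a response
  failureThreshold: 3     # consecutive failures before restart
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3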

Graceful Shutdown

When a Datafi container receives a termination signal (SIGTERM), it initiates a graceful shutdown sequence.

Shutdown Behavior

  1. Stop accepting new connections. The server immediately stops accepting new TCP connections.
  2. Drain in-flight requests. All currently executing requests are allowed to complete within the grace period.
  3. Flush buffers. Any cached data or pending log entries are flushed.
  4. Close connections. Database connections, Redis connections, and mTLS channels are closed cleanly.
  5. Exit. The process exits with code 0.

The default grace period is 30 seconds. If in-flight requests do not complete within this window, they are terminated.

tip

Set your orchestrator's terminationGracePeriodSeconds to at least 30 seconds to allow Datafi to drain requests before forced termination.
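
A related, Datafi-agnostic Kubernetes pattern is a short preStop delay, which gives the Service time to stop routing new requests to the pod before SIGTERM starts the shutdown sequence. The delay value below is illustrative:

# Optional preStop hook: pause briefly before SIGTERM so routing
# updates propagate; the 5-second value is an assumption to tune.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]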

Rolling Updates

Use rolling updates to deploy new versions of Datafi without downtime.

Kubernetes Rolling Update Strategy

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 30

Update Procedure

  1. Update the image tag in your deployment manifest.
  2. Apply the manifest. Kubernetes creates a new pod with the updated image.
  3. Readiness probe passes. The new pod begins accepting traffic only after its /health endpoint returns healthy.
  4. Old pod drains. The old pod receives SIGTERM and gracefully shuts down.
  5. Repeat. The process continues until all pods are updated.

# Update the image
kubectl set image deployment/datafi-coordinator \
  coordinator=datafi/coordinator:1.13.0

# Monitor the rollout
kubectl rollout status deployment/datafi-coordinator

Rollback

If a deployment introduces issues, roll back immediately:

kubectl rollout undo deployment/datafi-coordinator
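
If you need to return to a release older than the immediately previous one, standard kubectl rollout tooling covers that as well (the revision number below is illustrative):

# List recorded revisions for the deployment
kubectl rollout history deployment/datafi-coordinator

# Roll back to a specific revision
kubectl rollout undo deployment/datafi-coordinator --to-revision=3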

Monitoring Recommendations

Track the following metrics to ensure healthy operation.

| Metric | Target | Alert Threshold | Description |
| --- | --- | --- | --- |
| Query latency (p99) | < 5s | > 5s | 99th percentile query execution time |
| Query latency (p50) | < 500ms | > 2s | Median query execution time |
| Error rate | < 1% | > 1% | Percentage of requests returning 5xx |
| Active connections | Varies | > 80% of limit | Current open connections to data sources |
| Cache hit rate | > 70% | < 50% | Percentage of queries served from cache |
| Memory usage | < 80% | > 85% | Container memory utilization |
| CPU usage | < 70% | > 80% | Container CPU utilization |
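
If you scrape these metrics into Prometheus, the thresholds above map naturally onto alerting rules. The sketch below assumes a histogram metric named datafi_query_duration_seconds; substitute the metric names your deployment actually exports:

# Sketch of a Prometheus alerting rule for the p99 latency target.
# The metric name datafi_query_duration_seconds_bucket is a placeholder.
groups:
  - name: datafi-slo
    rules:
      - alert: DatafiQueryLatencyP99High
        expr: |
          histogram_quantile(0.99,
            sum(rate(datafi_query_duration_seconds_bucket[5m])) by (le)) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Datafi p99 query latency has exceeded 5s for 10 minutes"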

OpenTelemetry Integration

Datafi exports traces and metrics via OpenTelemetry. Configure the collector endpoint using environment variables.

docker run -d \
  --name datafi-coordinator \
  -e MODE=coordinator \
  -e OTEL_ENDPOINT=https://otel-collector.example.com:4317 \
  -e OTEL_TOKEN=your-otel-token \
  datafi/coordinator:latest

Exported telemetry includes:

  • Traces -- end-to-end request traces through authentication, authorization, query execution, and response delivery.
  • Metrics -- query latency histograms, error counters, cache hit/miss ratios, connection pool utilization.
  • Logs -- structured JSON logs with request IDs, tenant IDs, and operation details.
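
On the receiving side, a minimal OpenTelemetry Collector configuration that accepts this OTLP traffic over gRPC on port 4317 might look like the sketch below; the debug exporter is a stand-in for your real tracing and metrics backend:

# Minimal OpenTelemetry Collector sketch: receive OTLP/gRPC on 4317
# and hand spans and metrics to an exporter of your choice.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug: {}          # placeholder; replace with your backend exporter

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]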

Caching Configuration

Datafi caches frequently accessed data to reduce latency and load on data sources.

| Cache Layer | Default TTL | Scope | Description |
| --- | --- | --- | --- |
| Catalog cache | 1 hour | Per tenant | Schema metadata, table definitions, column types |
| Query result cache | 1 hour | Per user + query | Results of previously executed queries |
| JWKS cache | On-demand | Global | Identity provider signing keys |

Configuring Cache TTLs

caching:
  enabled: true
  catalog:
    ttl: 3600            # 1 hour in seconds
  query_results:
    ttl: 3600            # 1 hour in seconds
    max_size: 100mb      # Maximum cache size per tenant
  jwks:
    strategy: on_demand  # Fetched when an unknown kid is encountered

Cache Invalidation

| Event | Cache Invalidated |
| --- | --- |
| Schema change detected | Catalog cache for affected tables |
| Policy update | Query result cache for affected datasets |
| Manual flush | All caches for the tenant |
| TTL expiration | Specific cache entry |

To manually flush a tenant's cache:

# Via the admin API
curl -X POST https://api.datafi.io/admin/cache/flush \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"tenant_id": "tenant_001"}'

Disabling Cache

If you need real-time data with no caching:

docker run -d \
  --name datafi-edge \
  -e MODE=edge \
  -e CACHE_ENABLED=false \
  datafi/edge:latest

warning

Disabling caching increases load on your data sources and may increase query latency. Only disable caching when real-time freshness is a strict requirement.

Log Management

Datafi emits structured JSON logs to stdout. Direct these to your centralized logging platform.

{
  "level": "info",
  "timestamp": "2025-01-15T10:30:00Z",
  "request_id": "req_7f8a9b2c",
  "tenant_id": "tenant_001",
  "user_id": "user_abc123",
  "operation": "query.execute",
  "duration_ms": 142,
  "status": "success"
}
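
Because every log line is a single JSON object on stdout, standard tooling can slice the stream even before it reaches a logging platform; for example, with kubectl and jq (the label selector is an assumption about how your pods are labeled):

# Tail coordinator logs and keep only errors for one tenant.
kubectl logs -f -l app=datafi-coordinator --tail=100 \
  | jq -c 'select(.level == "error" and .tenant_id == "tenant_001")'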

Log Levels

| Level | Use |
| --- | --- |
| debug | Detailed diagnostic information; use only during troubleshooting |
| info | Normal operational events (default) |
| warn | Potentially harmful situations that do not cause failures |
| error | Errors that prevent a specific request from completing |

Best Practices

  1. Always use readiness probes. Prevent traffic from reaching containers that are not yet ready to serve requests.
  2. Set resource limits. Define CPU and memory limits to prevent a single container from consuming excessive resources; see the sketch after this list.
  3. Monitor the p99 latency. A rising p99 often indicates resource contention or slow data sources.
  4. Keep caching enabled in production. The performance benefit of caching almost always outweighs the slight data staleness.
  5. Export telemetry to a centralized platform. Distributed tracing across coordinator and edge servers is essential for diagnosing performance issues.
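
For practice 2, a sketch of container resource settings; the values are illustrative starting points, not Datafi sizing guidance:

# Example requests/limits for a Datafi container; tune to your workload.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "4Gi"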