# Operations Guide
This guide covers day-to-day operational concerns for running Datafi in production, including health checks, shutdown behavior, update strategies, monitoring, and caching.
## Health Monitoring
Datafi exposes health endpoints on both the coordinator and edge server for liveness and readiness checks.
### HTTP Health Check

```bash
# Liveness check
curl http://localhost:8000/health
```

Response (healthy):

```json
{
  "status": "healthy",
  "version": "1.12.0",
  "uptime": "3d 14h 22m"
}
```

Response (unhealthy):

```json
{
  "status": "unhealthy",
  "version": "1.12.0",
  "checks": {
    "database": "timeout",
    "cache": "healthy"
  }
}
```
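Deploy and smoke-test scripts can gate on the `status` field of this payload. A minimal POSIX-shell sketch; here a sample healthy response stands in for the live `curl -s http://localhost:8000/health` call:

```shell
# Gate a script on the health payload. In practice, populate RESPONSE with:
#   RESPONSE=$(curl -s http://localhost:8000/health)
RESPONSE='{"status":"healthy","version":"1.12.0","uptime":"3d 14h 22m"}'

case "$RESPONSE" in
  *'"status": "healthy"'*|*'"status":"healthy"'*)
    echo "OK" ;;
  *)
    echo "NOT HEALTHY" >&2
    exit 1 ;;
esac
```

A JSON-aware tool such as `jq` is more robust than string matching if it is available in your deploy environment.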
### gRPC Health Check

The gRPC `Ping` RPC serves as a health check for gRPC clients.

```bash
# Using grpcurl
grpcurl -plaintext localhost:50051 datafi.edge.v1.EdgeService/Ping
```

Response:

```json
{
  "status": "SERVING"
}
```
### Health Check Configuration

| Parameter | Default | Description |
|---|---|---|
| Health endpoint path | `/health` | HTTP GET endpoint |
| gRPC health RPC | `Ping` | gRPC liveness check |
| Check interval (K8s) | 30s | How often the orchestrator polls |
| Timeout | 5s | Max wait for a health response |
| Failure threshold | 3 | Consecutive failures before restart |
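The defaults above map directly onto Kubernetes probe settings. A sketch for the pod spec, assuming the container serves HTTP on port 8000 (adjust the port to your configuration):

```yaml
# Illustrative probe configuration matching the defaults above.
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30      # check interval
  timeoutSeconds: 5      # max wait for a response
  failureThreshold: 3    # consecutive failures before restart
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
```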
## Graceful Shutdown

When a Datafi container receives a termination signal (`SIGTERM`), it initiates a graceful shutdown sequence.
### Shutdown Behavior

1. Stop accepting new connections. The server immediately stops accepting new TCP connections.
2. Drain in-flight requests. All currently executing requests are allowed to complete within the grace period.
3. Flush buffers. Any cached data or pending log entries are flushed.
4. Close connections. Database connections, Redis connections, and mTLS channels are closed cleanly.
5. Exit. The process exits with code 0.
The default grace period is 30 seconds. If in-flight requests do not complete within this window, they are terminated.
Set your orchestrator's `terminationGracePeriodSeconds` to at least 30 seconds to allow Datafi to drain requests before forced termination.
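The sequence above can be sketched as a POSIX-shell signal handler. This is illustrative only, not Datafi's actual implementation:

```shell
#!/bin/sh
# Illustrative sketch of the graceful shutdown sequence; not Datafi source code.
graceful_shutdown() {
  echo "stopped accepting new connections"
  echo "draining in-flight requests (up to the 30s grace period)"
  echo "flushing buffers and pending log entries"
  echo "closing database, Redis, and mTLS connections"
  exit 0  # clean exit reported to the orchestrator
}
trap graceful_shutdown TERM  # invoked when SIGTERM arrives

# A real server's accept/request loop would run here.
```

With plain Docker, `docker stop --time=30 datafi-coordinator` produces the same lifecycle: SIGTERM first, then SIGKILL if the process has not exited within 30 seconds.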
## Rolling Updates
Use rolling updates to deploy new versions of Datafi without downtime.
### Kubernetes Rolling Update Strategy

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 30
```
### Update Procedure

1. Update the image tag in your deployment manifest.
2. Apply the manifest. Kubernetes creates a new pod with the updated image.
3. Readiness probe passes. The new pod begins accepting traffic only after its `/health` endpoint returns healthy.
4. Old pod drains. The old pod receives `SIGTERM` and gracefully shuts down.
5. Repeat. The process continues until all pods are updated.
```bash
# Update the image
kubectl set image deployment/datafi-coordinator \
  coordinator=datafi/coordinator:1.13.0

# Monitor the rollout
kubectl rollout status deployment/datafi-coordinator
```
### Rollback

If a deployment introduces issues, roll back immediately:

```bash
kubectl rollout undo deployment/datafi-coordinator
```
## Monitoring Recommendations
Track the following metrics to ensure healthy operation.
| Metric | Target | Alert Threshold | Description |
|---|---|---|---|
| Query latency (p99) | < 5s | > 5s | 99th percentile query execution time |
| Query latency (p50) | < 500ms | > 2s | Median query execution time |
| Error rate | < 1% | > 1% | Percentage of requests returning 5xx |
| Active connections | Varies | > 80% of limit | Current open connections to data sources |
| Cache hit rate | > 70% | < 50% | Percentage of queries served from cache |
| Memory usage | < 80% | > 85% | Container memory utilization |
| CPU usage | < 70% | > 80% | Container CPU utilization |
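As a lightweight complement to a full monitoring stack, the failure-threshold logic from the health-check table can be scripted directly against the `/health` endpoint. A sketch in shell, using the 5s timeout and threshold of 3 (the URL is the local default; the polling interval is shortened here for illustration):

```shell
# Poll the health endpoint three times; flag the instance after three
# consecutive failures, mirroring the orchestrator's failure threshold.
FAILS=0
for attempt in 1 2 3; do
  if curl -sf --max-time 5 http://localhost:8000/health > /dev/null 2>&1; then
    FAILS=0   # any success resets the consecutive-failure count
  else
    FAILS=$((FAILS + 1))
  fi
done
if [ "$FAILS" -ge 3 ]; then
  echo "unhealthy: $FAILS consecutive failures"
fi
```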
### OpenTelemetry Integration
Datafi exports traces and metrics via OpenTelemetry. Configure the collector endpoint using environment variables.
```bash
docker run -d \
  --name datafi-coordinator \
  -e MODE=coordinator \
  -e OTEL_ENDPOINT=https://otel-collector.example.com:4317 \
  -e OTEL_TOKEN=your-otel-token \
  datafi/coordinator:latest
```
Exported telemetry includes:
- Traces -- end-to-end request traces through authentication, authorization, query execution, and response delivery.
- Metrics -- query latency histograms, error counters, cache hit/miss ratios, connection pool utilization.
- Logs -- structured JSON logs with request IDs, tenant IDs, and operation details.
## Caching Configuration
Datafi caches frequently accessed data to reduce latency and load on data sources.
| Cache Layer | Default TTL | Scope | Description |
|---|---|---|---|
| Catalog cache | 1 hour | Per tenant | Schema metadata, table definitions, column types |
| Query result cache | 1 hour | Per user + query | Results of previously executed queries |
| JWKS cache | On-demand | Global | Identity provider signing keys |
### Configuring Cache TTLs

```yaml
caching:
  enabled: true
  catalog:
    ttl: 3600            # 1 hour in seconds
  query_results:
    ttl: 3600            # 1 hour in seconds
    max_size: 100mb      # Maximum cache size per tenant
  jwks:
    strategy: on_demand  # Fetched when an unknown kid is encountered
```
### Cache Invalidation
| Event | Cache Invalidated |
|---|---|
| Schema change detected | Catalog cache for affected tables |
| Policy update | Query result cache for affected datasets |
| Manual flush | All caches for the tenant |
| TTL expiration | Specific cache entry |
To manually flush a tenant's cache:
```bash
# Via the admin API
curl -X POST https://api.datafi.io/admin/cache/flush \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tenant_id": "tenant_001"}'
```
### Disabling Cache
If you need real-time data with no caching:
```bash
docker run -d \
  --name datafi-edge \
  -e MODE=edge \
  -e CACHE_ENABLED=false \
  datafi/edge:latest
```
Disabling caching increases load on your data sources and may increase query latency. Only disable caching when real-time freshness is a strict requirement.
## Log Management
Datafi emits structured JSON logs to stdout. Direct these to your centralized logging platform.
```json
{
  "level": "info",
  "timestamp": "2025-01-15T10:30:00Z",
  "request_id": "req_7f8a9b2c",
  "tenant_id": "tenant_001",
  "user_id": "user_abc123",
  "operation": "query.execute",
  "duration_ms": 142,
  "status": "success"
}
```
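Because each entry is a single JSON object, basic triage works with plain text tools even before logs reach a centralized platform. A sketch using compact sample lines that mirror the fields above; in production you would pipe `docker logs datafi-edge` (or your log shipper's stream) into the same filter:

```shell
# Count error-level entries in a one-object-per-line JSON log stream.
ERRORS=$(printf '%s\n' \
  '{"level":"info","operation":"query.execute","duration_ms":142,"status":"success"}' \
  '{"level":"error","operation":"query.execute","duration_ms":5000,"status":"failed"}' \
  | grep -c '"level":"error"')
echo "$ERRORS error entry(ies) found"
```

For anything beyond counting (grouping by `tenant_id`, latency percentiles), a JSON-aware tool or your logging platform's query language is the better fit.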
### Log Levels

| Level | Use |
|---|---|
| `debug` | Detailed diagnostic information; use only during troubleshooting |
| `info` | Normal operational events (default) |
| `warn` | Potentially harmful situations that do not cause failures |
| `error` | Errors that prevent a specific request from completing |
## Best Practices
- Always use readiness probes. Prevent traffic from reaching containers that are not yet ready to serve requests.
- Set resource limits. Define CPU and memory limits to prevent a single container from consuming excessive resources.
- Monitor the p99 latency. A rising p99 often indicates resource contention or slow data sources.
- Keep caching enabled in production. The performance benefit of caching almost always outweighs the slight data staleness.
- Export telemetry to a centralized platform. Distributed tracing across coordinator and edge servers is essential for diagnosing performance issues.