Monitoring Your Data Platform: Dashboards, Alerts, and SLAs
Building data pipelines is only half the job. Keeping them healthy is the other half. Without good monitoring, you will find out about problems from angry users instead of timely alerts.
What to monitor
- Ingestion – Kafka lag, dropped messages, and input error rates (see the lag-check sketch after this list).
- Processing – Job runtimes, failure counts, and retry rates.
- Storage – Disk usage, replica health, and query latency.
- Serving – API error rates, dashboard load times, and 95th percentile latency.
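To make the ingestion bullet concrete, here is a minimal sketch of a consumer-lag check using the kafka-python client: lag is the log end offset minus the group's committed offset, per partition. The broker address, topic, and consumer group names are hypothetical placeholders, not part of this article.

```python
# Minimal consumer-lag check, assuming the kafka-python package.
from kafka import KafkaConsumer, TopicPartition

def consumer_lag(bootstrap_servers: str, topic: str, group_id: str) -> dict:
    """Return per-partition lag: log end offset minus committed offset."""
    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
        enable_auto_commit=False,
    )
    partitions = consumer.partitions_for_topic(topic) or set()
    tps = [TopicPartition(topic, p) for p in partitions]
    end_offsets = consumer.end_offsets(tps)        # latest offset per partition
    lag = {}
    for tp in tps:
        committed = consumer.committed(tp) or 0    # None if the group never committed
        lag[tp.partition] = end_offsets[tp] - committed
    consumer.close()
    return lag

if __name__ == "__main__":
    # Hypothetical cluster, topic, and group names; replace with your own.
    print(consumer_lag("localhost:9092", "events", "etl-loader"))
```

A check like this typically runs on a schedule and pushes the per-partition numbers into whatever metrics system backs your dashboards and alerts.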
Dashboards
Dashboards give you a real-time view of the system. Common boards include:
- Platform overview – High-level health for leaders and on-call engineers.
- Pipeline-specific – Metrics for critical ETL or streaming jobs.
- Database health – Query performance and resource usage for ClickHouse or your warehouse (a sample panel query follows this list).
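As one way to feed a database-health panel, the sketch below pulls 95th percentile query latency and error counts from ClickHouse's built-in system.query_log table using the clickhouse-driver package. The hostname is a placeholder, and the query assumes query logging is enabled (it is by default).

```python
# Minimal database-health panel query, assuming the clickhouse-driver package.
from clickhouse_driver import Client

client = Client(host="clickhouse.internal")  # hypothetical host

# p95 query latency and error count over the last hour, from system.query_log.
rows = client.execute(
    """
    SELECT
        quantileIf(0.95)(query_duration_ms, type = 'QueryFinish') AS p95_ms,
        countIf(type = 'ExceptionWhileProcessing')                AS errors
    FROM system.query_log
    WHERE event_time > now() - INTERVAL 1 HOUR
    """
)
p95_ms, errors = rows[0]
print(f"p95 latency: {p95_ms} ms, errors: {errors}")
```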
Alerts and on-call
Good alerts are actionable and rare. Focus on:
- Data freshness outside agreed thresholds.
- Job failures that affect critical tables.
- Query latency or error rates affecting dashboards and APIs.
Define clear runbooks so on-call engineers know how to respond quickly.
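As a minimal sketch of the freshness alert above, the snippet below compares the newest timestamp in a critical table against a threshold and posts to a webhook when the data is stale. The table, column, webhook URL, and runbook link are hypothetical, and the check assumes a standard DB-API cursor and UTC timestamps.

```python
# Minimal data-freshness alert, assuming a DB-API cursor and the requests package.
from datetime import datetime, timedelta, timezone

import requests

FRESHNESS_THRESHOLD = timedelta(minutes=10)        # matches the SLO example below
WEBHOOK_URL = "https://alerts.example.com/hook"    # hypothetical endpoint

def check_freshness(cursor) -> None:
    # Last update time of a critical table (hypothetical table and column,
    # assumed to be returned as a timezone-aware UTC timestamp).
    cursor.execute("SELECT max(updated_at) FROM core.orders")
    (last_update,) = cursor.fetchone()
    lag = datetime.now(timezone.utc) - last_update
    if lag > FRESHNESS_THRESHOLD:
        requests.post(
            WEBHOOK_URL,
            json={
                "alert": "core.orders is stale",
                "lag_minutes": round(lag.total_seconds() / 60, 1),
                "runbook": "https://wiki.example.com/runbooks/orders-freshness",
            },
            timeout=10,
        )
```

Including the runbook link in the alert payload is a simple way to make sure the on-call engineer lands on the right response steps immediately.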
SLAs and SLOs
Service Level Agreements (SLAs) and Objectives (SLOs) make expectations explicit:
- “Core dashboards are updated within 10 minutes, 99% of the time.”
- “Data warehouse is available for queries 99.9% of the time.”
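To see how an objective like the first one translates into a number you can track, here is a minimal sketch that computes attainment from periodic freshness measurements. The sample values are made up for illustration; in practice they would come from whatever freshness check you run on a schedule.

```python
# Minimal SLO attainment calculation for "fresh within 10 minutes, 99% of the time".
from datetime import timedelta

OBJECTIVE = 0.99
TARGET_LAG = timedelta(minutes=10)

def slo_attainment(lag_samples: list[timedelta]) -> float:
    """Fraction of samples in which the data was fresh enough."""
    within = sum(1 for lag in lag_samples if lag <= TARGET_LAG)
    return within / len(lag_samples)

# Hypothetical observed lags, one per check interval.
samples = [timedelta(minutes=m) for m in (3, 5, 12, 4, 6, 8, 2, 7, 9, 5)]
attained = slo_attainment(samples)
print(f"SLO attainment: {attained:.1%} (objective: {OBJECTIVE:.0%})")
print("SLO met" if attained >= OBJECTIVE else "SLO missed")
```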
When you treat your data platform like a product with its own uptime and quality guarantees, monitoring stops being an afterthought and becomes a core part of how you work.