Phase 2: Observability
Making the pipeline legible: Prometheus scraping by pod annotation, a provisioned Grafana dashboard, Loki and Alloy for logs, and consumer lag as a first-class metric. Plus why end-to-end latency has to be instrumented in the app, not inferred.
Third build post in the data-platform series. Phase 1 moved data. Phase 2 makes that movement visible, because every change after this is only as trustworthy as your ability to see what it did.
Run it
Apply everything through this phase (cumulative), then open the live dev UI:
make phase-2
make tilt PHASE=phase-2-observability
(First run on a clean machine: bring up the cluster first, see Phase 0.)
What’s added in this phase?
- Prometheus, scraping by pod annotation (no operator, no CRDs),
- a provisioned Grafana dashboard, baked in as a ConfigMap,
- Loki and Grafana Alloy for log aggregation,
- kafka-lag-exporter, so consumer lag is a real metric,
- end-to-end latency instrumented inside the Spark stream processor.
Phase 2: Observability
Adds metrics, dashboards, and logs: Prometheus, Grafana, Loki, Alloy, and the consumer-lag exporter. Everything else (in muted) was already there: one cluster, growing a layer at a time.
Scraping without an operator
Prometheus here is the bare server, not the operator
(overlays/phase-2-observability/prometheus.yaml).
It discovers targets by pod annotation. An annotation is a free-form key: value you attach to a
pod’s metadata; unlike a label it is not used for selection, it just carries information for tools to
read. Prometheus watches the Kubernetes API for pods and keeps only the ones whose annotation says
to scrape them. That keeps the platform light and the mechanism obvious:
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: "true"
Turning on scraping for a workload is then just three annotations on its pod, added as a phase-2
patch over the phase-1 app
(instrumentation-patch.yaml):
prometheus.io/scrape: "true"
prometheus.io/port: "9000"
prometheus.io/path: /metrics
Latency you cannot infer
Throughput and consumer lag come for free from the broker (Redpanda) and the exporter (kafka-lag-exporter). End-to-end latency does not. “How long from when an event happened to when we stored it” can only be measured by the code that sees both timestamps, so the stream processor records it as a histogram per batch:
E2E_LATENCY = Histogram("stream_end_to_end_latency_seconds", ...)
def observe(rows):
now = datetime.now(UTC)
for r in rows:
E2E_LATENCY.observe((now - r["event_time"]).total_seconds())
Grafana then renders p50 and p99 from the histogram buckets. This is the difference between a metric you compute and one you wish you had: latency had to be designed in.
Consumer lag is the health signal
For a streaming platform, consumer lag (how far behind the Redpanda log the Spark stream-processor
is) is the single most important number. If lag is flat, the platform is keeping up; if it climbs,
something is wrong upstream or the processor is too slow. kafka-lag-exporter polls the
stream-processor’s consumer-group offsets against the end of the log and exports
kafka_consumergroup_group_lag, which becomes a panel and, in phase 4, an alert and the basis for
backpressure.
The consumer is stalled or too slow. Lag is piling up. Speed it back to "healthy" and watch it drain (this is what backpressure and a scaled consumer buy you in phase 4).
Logs, searchable
Alloy runs as a DaemonSet, discovers pods through the Kubernetes API, tails their logs, and ships
them to Loki with namespace, pod, and app labels attached. Grafana queries Loki as a second
datasource, so a spike on a metric panel and the log lines behind it are one click apart.
To actually read them, two UIs, both port-forwarded by Tilt:
- Grafana, http://localhost:3000. Go to Explore, pick the Loki
datasource, and query logs by label, for example
{app="producer"}or{namespace="platform"} |= "error". That is LogQL: select a log stream by its labels, then optionally filter the lines (|=is “contains”). The Prometheus metrics are available here too: in Explore via the Prometheus datasource, and pre-built in the “Platform overview” dashboard. - Prometheus, http://localhost:9090. Use the Graph tab to run raw
PromQL (try
upto see which targets are scraped, orkafka_consumergroup_group_lag), and Status → Targets to confirm the scrape is working. Its Alerts tab is empty on purpose, no alert rules ship until phase 4.
What you see
The Grafana “Platform overview” dashboard comes up already provisioned, no click-ops
(grafana.yaml):
producer throughput by tenant, end-to-end latency p50/p99, consumer lag, and processed rows per
second. A verification script
(scripts/demo-phase-2.sh)
asserts the three required signals and the dashboard all exist:
bash scripts/demo-phase-2.sh # asserts the signals + dashboard exist
What’s in the overlay
Phase 2 is overlays/phase-2-observability/, which bases on phase 1 and adds the metrics, logs, lag, and dashboards.
prometheus.yaml: metrics
prometheus.yaml is Prometheus (no operator): a ConfigMap scrape config that discovers pods by the prometheus.io/scrape annotation and statically scrapes Redpanda and the lag exporter, plus the ClusterRole it needs to list pods for discovery.
loki.yaml: logs
loki.yaml runs Loki in single-binary mode with filesystem storage. It stores the logs Alloy ships.
alloy.yaml: log shipping
alloy.yaml is a Grafana Alloy DaemonSet that discovers pods through the Kubernetes API, tails their logs, and pushes them to Loki with namespace/pod/app labels (plus the RBAC it needs).
kafka-lag-exporter.yaml: consumer lag
kafka-lag-exporter.yaml polls consumer-group offsets against the log end and exports kafka_consumergroup_group_lag, the lag metric.
grafana.yaml: dashboards
grafana.yaml is Grafana with provisioned datasources (Prometheus and Loki) and a baked-in platform-overview dashboard, so it comes up working with no click-ops.
instrumentation-patch.yaml: wire the app in
instrumentation-patch.yaml is a strategic-merge patch over the phase-1 app that adds the prometheus.io/scrape annotations to producer, stream-processor, and api, and the stream-processor’s metrics port.
Done when
Every later change is legible on a dashboard, and lag is a visible metric. From here on, when a phase changes the platform’s behaviour, you can watch it happen rather than guess.