Phase 4: Resilience
Making the pipeline survive the things a real quarter would throw at it: poison messages routed to a dead-letter topic, backpressure on the consumer, a killed processor resuming from its checkpoint, and an SLO with an error budget you can watch hold. Plus a log-DLQ vs queue-DLQ contrast.
Fifth build post in the data-platform series. Everything so far assumed the happy path. Phase 4 breaks things on purpose and makes the platform survive. You cannot wait a quarter to find out whether it holds, so you inject the failures in a few seconds and watch the error budget absorb them.
SLO, SLI, and error budget
Three terms run through this phase, so quickly:
- An SLI (service level indicator) is a measured number for how well the service is doing right now. Ours is a success ratio: of the events the processor handled, what fraction were processed successfully rather than dead-lettered.
- An SLO (service level objective) is the target you hold that SLI to. Ours is 99%: at least 99 of every 100 events should succeed.
- The error budget is what the objective leaves room for,
100% - 99% = 1%. It is the failure you are allowed before you have missed the SLO: a budget you spend, not a line you must never touch.
In this system the SLI is computed by Prometheus over a rolling 5-minute window, from two counters
the processor already exports, stream_events_processed_total and stream_dlq_messages_total:
success ratio = processed / (processed + dead-lettered), recorded as sli:success:ratio5m. The SLO
target is 0.99, the error budget is the remaining 1%, and an alert fires when the ratio drops
below target. The “error budget you can watch” section below builds all of that out.
Run it
Apply everything through this phase (cumulative), then open the live dev UI:
make phase-4
make tilt PHASE=phase-4-resilience
(First run on a clean machine: bring up the cluster first, see Phase 0.)
What’s added in this phase?
- a dead-letter topic for poison messages,
- backpressure on the stream processor,
- failure recovery from the streaming checkpoint,
- an SLO with recording rules, an error-budget dashboard, and alerts,
- an SQS-compatible queue to contrast queue-DLQ with log-DLQ.
Phase 4: Resilience
Adds Alertmanager, and an SQS-compatible queue (ElasticMQ) to contrast a queue DLQ with the log DLQ. Everything else (in muted) was already there: one cluster, growing a layer at a time.
Nothing is silently dropped
In phase 1 the processor quietly dropped events it could not parse. That is exactly the bug that
corrupts a platform invisibly. So the producer now emits a small fraction of malformed messages,
and the processor (job.py)
splits each batch into valid events and poison, and routes the poison to a dead-letter topic
instead of dropping it:
def process_batch(batch_df, batch_id):
valid = batch_df.where(col("e.event_id").isNotNull()).select("e.*")
invalid = batch_df.where(col("e.event_id").isNull()).select("raw")
sink_dlq(invalid) # poison -> events.dlq, counted as a metric
... # valid -> the sinks
A poison message is now a thing you can inspect and replay, and a stream_dlq_messages_total
counter drives a Dead-lettered messages/s panel on the Grafana “Platform overview” dashboard.
The failure is visible, not lost.
Every message is accounted for: valid events reach the sinks, and a poison message lands on the dead-letter topic where it can be inspected and replayed, instead of vanishing.
Backpressure
If lag spikes (a burst of events, or the processor falling behind), pulling the entire backlog
into one giant batch is how a streaming job falls over. Structured Streaming caps how many offsets
a trigger reads, set via a phase-4 patch
(app-resilience-patch.yaml):
env:
- { name: MAX_OFFSETS_PER_TRIGGER, value: "500" }
The backlog drains steadily instead of in one oversized gulp. Lag climbs, then recovers, and you watch both on the lag panel.
Survive a kill
Spark Structured Streaming records its progress, the Kafka offsets it has committed and its
in-flight batch state, in a checkpoint directory. In this platform that directory is a
PersistentVolumeClaim
rather than the pod’s ephemeral emptyDir, so it outlives any single pod.
That is what makes killing the processor survivable: a fresh pod mounts the same checkpoint and
resumes from the last committed offset, with no gap and no double-processing.
One script (scripts/demo-phase-4.sh)
exercises all four mechanisms in a single run, and the kill is its centerpiece:
bash scripts/demo-phase-4.sh
# == 1. Poison is dead-lettered, not dropped ==
# events.dlq messages so far: 412
# == 2. Killed processor resumes from its checkpoint (no data loss) ==
# events before: 234114 after: 234493 (climbed, no gap, no dupes)
# == 3. The SLO stays within its error budget ==
# success ratio (SLI): 0.995 (target 0.99)
# == 4. Queue contrast: the broker redrives after maxReceiveCount ==
# messages redriven to orders-dlq: 1
Step 2 is the kill. The script counts the events stored so far, deletes the running processor pod
(kubectl delete pod -l app.kubernetes.io/name=stream-processor), waits for it to catch back up, and
counts again. The deletion is the only deliberate act: from there Kubernetes does the recovery on its
own, scheduling a replacement that mounts the same checkpoint PersistentVolumeClaim and resumes from
the last committed offset. In a real outage this happens automatically, with no commands run by a
human. The two counts bracket the kill and prove it worked: after has climbed past before, and
because the processor restarted exactly at its checkpoint, the stored events are continuous, nothing
lost while it was down and nothing processed twice.
The processor is keeping pace; only the newest offset is still in flight.
What if Redpanda or Postgres crashes?
The same idea covers the rest of the pipeline, because the durable state lives on disk, not in a pod:
- Redpanda (the log) runs as a StatefulSet with its own PersistentVolumeClaim, so a restarted broker comes back with every event still on disk. Producers and the stream processor reconnect and carry on, and the processor resumes from its checkpoint exactly as above. (This is a single local broker, so it buys availability across restarts, not durability against disk loss; a real cluster replicates partitions across brokers for that.)
- Postgres and ClickHouse are also StatefulSets on PersistentVolumeClaims, and they are projections of the log, so even a total wipe is recoverable by replaying the log back into them.
The pattern repeats throughout: keep the truth on a persistent volume, and make everything else a restartable, replayable reader of it.
An error budget you can watch
That success-ratio SLI is recorded as a Prometheus rule
(prometheus-rules.yaml):
- record: sli:success:ratio5m
expr: |
sum(rate(stream_events_processed_total[5m]))
/ (
sum(rate(stream_events_processed_total[5m]))
+ sum(rate(stream_dlq_messages_total[5m]))
+ 1e-9
)
A 99% target leaves a 1% error budget. An “error budget remaining” panel shows how much slack is
left, and an ErrorBudgetBurn alert fires through Alertmanager if the ratio drops below target.
Because the producer’s poison rate is a small fraction (around 0.5%, well under the 1% budget), the
success ratio sits above target and the budget holds with room to spare. Killing the processor does
not move it: recovery loses no events, so the ratio is unaffected. That is exactly the point, a
survived failure should not spend error budget. (If you push the poison rate above 1%, the budget
drains to 0% and the alert fires, which is the failure this panel is built to catch.)
To watch it: the error-budget panel is on the
SLO and error-budget dashboard
in Grafana, and the ErrorBudgetBurn alert’s state (inactive, then pending, then firing) is on
Prometheus’s alerts page at http://localhost:9090/alerts.
Log-DLQ vs queue-DLQ
The phase also stands up ElasticMQ
(elasticmq.yaml),
an SQS-compatible queue, purely to feel a semantic contrast.
Both a log and a queue can dead-letter, but they decide differently:
- Log (Kafka/Redpanda): the consumer decides a message is poison and writes it to a separate topic. The original stays on the log and can be replayed.
- Queue (SQS): the broker redrives a message to the DLQ automatically after it fails to be
processed N times (a
maxReceiveCountredrive policy). The consumer never explicitly routes it.
The log's consumer inspects each message and writes poison to a dead-letter topic itself, leaving the original on the log to replay.
The demo sends a poison message to the SQS queue, fails to process it three times, and watches the broker move it to the dead-letter queue on its own. Since ElasticMQ speaks the SQS API, the demo drives it with the standard SQS client, the AWS CLI, pointed at the in-cluster endpoint. Same idea, opposite owner.
What’s in the overlay
Phase 4 is overlays/phase-4-resilience/, which bases on phase 3 and adds the dead-letter path, the SLO machinery, and the queue contrast.
dlq-topic-init.yaml: the dead-letter topic
dlq-topic-init.yaml is a Job that creates the events.dlq topic.
app-resilience-patch.yaml: make failure visible
app-resilience-patch.yaml sets the producer to emit a small fraction of poison messages, and gives the stream-processor backpressure (maxOffsetsPerTrigger) and the DLQ topic name.
prometheus-rules.yaml: the SLO
prometheus-rules.yaml holds recording rules for the success-ratio SLI and alerting rules for error-budget burn, high consumer lag, and a down processor.
prometheus-patch.yaml: load the rules
prometheus-patch.yaml patches the phase-2 Prometheus to load those rules and point at Alertmanager, and mounts the rules ConfigMap.
alertmanager.yaml: alerting
alertmanager.yaml is Alertmanager, which receives the alerts. The receiver is a no-op locally; the point is the alerting path exists and fires.
grafana-slo-patch.yaml: the SLO dashboard
grafana-slo-patch.yaml adds an SLO / error-budget dashboard alongside the platform overview.
elasticmq.yaml: the queue contrast
elasticmq.yaml is ElasticMQ, an SQS-compatible queue preconfigured with a main queue, a dead-letter queue, and a redrive policy, purely to contrast a queue DLQ with the log DLQ.
Done when
Injected failures are visibly survived and the SLO stays within budget. Poison diverts, lag recovers, a killed processor resumes, the queue redrives, and the error-budget panel dips but does not bottom out.