Phase 4: Resilience - Cahid Arda Öz

Fifth build post in the data-platform series. Everything so far assumed the happy path. Phase 4 breaks things on purpose and makes the platform survive. You cannot wait a quarter to find out whether it holds, so you inject the failures in a few seconds and watch the error budget absorb them.

SLO, SLI, and error budget

Three terms run through this phase, so quickly:

An SLI (service level indicator) is a measured number for how well the service is doing right now. Ours is a success ratio: of the events the processor handled, what fraction were processed successfully rather than dead-lettered.
An SLO (service level objective) is the target you hold that SLI to. Ours is 99%: at least 99 of every 100 events should succeed.
The error budget is what the objective leaves room for, 100% - 99% = 1%. It is the failure you are allowed before you have missed the SLO: a budget you spend, not a line you must never touch.

In this system the SLI is computed by Prometheus over a rolling 5-minute window, from two counters the processor already exports, stream_events_processed_total and stream_dlq_messages_total: success ratio = processed / (processed + dead-lettered), recorded as sli:success:ratio5m. The SLO target is 0.99, the error budget is the remaining 1%, and an alert fires when the ratio drops below target. The “error budget you can watch” section below builds all of that out.

Run it

Apply everything through this phase (cumulative), then open the live dev UI:

make phase-4
make tilt PHASE=phase-4-resilience

(First run on a clean machine: bring up the cluster first, see Phase 0.)

What’s added in this phase?

a dead-letter topic for poison messages,
backpressure on the stream processor,
failure recovery from the streaming checkpoint,
an SLO with recording rules, an error-budget dashboard, and alerts,
an SQS-compatible queue to contrast queue-DLQ with log-DLQ.

The cluster at phase 4: Resiliencecumulative · scrub to replay the growth

phase

k3d cluster: data-platform2 namespaces

platform

placeholderredpandapostgresproducerstream-processorapiclickhouseminioiceberg-restelasticmq

observability

prometheusgrafanalokialloykafka-lag-exporteralertmanager

Phase 4: Resilience

Adds Alertmanager, and an SQS-compatible queue (ElasticMQ) to contrast a queue DLQ with the log DLQ. Everything else (in muted) was already there: one cluster, growing a layer at a time.

Nothing is silently dropped

In phase 1 the processor quietly dropped events it could not parse. That is exactly the bug that corrupts a platform invisibly. So the producer now emits a small fraction of malformed messages, and the processor (job.py) splits each batch into valid events and poison, and routes the poison to a dead-letter topic instead of dropping it:

def process_batch(batch_df, batch_id):
    valid   = batch_df.where(col("e.event_id").isNotNull()).select("e.*")
    invalid = batch_df.where(col("e.event_id").isNull()).select("raw")
    sink_dlq(invalid)     # poison -> events.dlq, counted as a metric
    ...                   # valid -> the sinks

A poison message is now a thing you can inspect and replay, and a stream_dlq_messages_total counter drives a Dead-lettered messages/s panel on the Grafana “Platform overview” dashboard. The failure is visible, not lost.

Poison messages are dead-lettered, not droppedparse · split · route

to sinks 0dead-lettered 0

Every message is accounted for: valid events reach the sinks, and a poison message lands on the dead-letter topic where it can be inspected and replayed, instead of vanishing.

Backpressure

If lag spikes (a burst of events, or the processor falling behind), pulling the entire backlog into one giant batch is how a streaming job falls over. Structured Streaming caps how many offsets a trigger reads, set via a phase-4 patch (app-resilience-patch.yaml):

env:
  - { name: MAX_OFFSETS_PER_TRIGGER, value: "500" }

The backlog drains steadily instead of in one oversized gulp. Lag climbs, then recovers, and you watch both on the lag panel.

Survive a kill

Spark Structured Streaming records its progress, the Kafka offsets it has committed and its in-flight batch state, in a checkpoint directory. In this platform that directory is a PersistentVolumeClaim rather than the pod’s ephemeral emptyDir, so it outlives any single pod. That is what makes killing the processor survivable: a fresh pod mounts the same checkpoint and resumes from the last committed offset, with no gap and no double-processing.

One script (scripts/demo-phase-4.sh) exercises all four mechanisms in a single run, and the kill is its centerpiece:

bash scripts/demo-phase-4.sh
# == 1. Poison is dead-lettered, not dropped ==
#    events.dlq messages so far: 412
# == 2. Killed processor resumes from its checkpoint (no data loss) ==
#    events before: 234114  after: 234493  (climbed, no gap, no dupes)
# == 3. The SLO stays within its error budget ==
#    success ratio (SLI): 0.995  (target 0.99)
# == 4. Queue contrast: the broker redrives after maxReceiveCount ==
#    messages redriven to orders-dlq: 1

Step 2 is the kill. The script counts the events stored so far, deletes the running processor pod (kubectl delete pod -l app.kubernetes.io/name=stream-processor), waits for it to catch back up, and counts again. The deletion is the only deliberate act: from there Kubernetes does the recovery on its own, scheduling a replacement that mounts the same checkpoint PersistentVolumeClaim and resumes from the last committed offset. In a real outage this happens automatically, with no commands run by a human. The two counts bracket the kill and prove it worked: after has climbed past before, and because the processor restarted exactly at its checkpoint, the stored events are continuous, nothing lost while it was down and nothing processed twice.

Killed and recovered from the checkpointredpanda appends · processor commits offsets

The processor is keeping pace; only the newest offset is still in flight.

What if Redpanda or Postgres crashes?

The same idea covers the rest of the pipeline, because the durable state lives on disk, not in a pod:

Redpanda (the log) runs as a StatefulSet with its own PersistentVolumeClaim, so a restarted broker comes back with every event still on disk. Producers and the stream processor reconnect and carry on, and the processor resumes from its checkpoint exactly as above. (This is a single local broker, so it buys availability across restarts, not durability against disk loss; a real cluster replicates partitions across brokers for that.)
Postgres and ClickHouse are also StatefulSets on PersistentVolumeClaims, and they are projections of the log, so even a total wipe is recoverable by replaying the log back into them.

The pattern repeats throughout: keep the truth on a persistent volume, and make everything else a restartable, replayable reader of it.

An error budget you can watch

That success-ratio SLI is recorded as a Prometheus rule (prometheus-rules.yaml):

- record: sli:success:ratio5m
  expr: |
    sum(rate(stream_events_processed_total[5m]))
    / (
        sum(rate(stream_events_processed_total[5m]))
        + sum(rate(stream_dlq_messages_total[5m]))
        + 1e-9
      )

A 99% target leaves a 1% error budget. An “error budget remaining” panel shows how much slack is left, and an ErrorBudgetBurn alert fires through Alertmanager if the ratio drops below target. Because the producer’s poison rate is a small fraction (around 0.5%, well under the 1% budget), the success ratio sits above target and the budget holds with room to spare. Killing the processor does not move it: recovery loses no events, so the ratio is unaffected. That is exactly the point, a survived failure should not spend error budget. (If you push the poison rate above 1%, the budget drains to 0% and the alert fires, which is the failure this panel is built to catch.)

To watch it: the error-budget panel is on the SLO and error-budget dashboard in Grafana, and the ErrorBudgetBurn alert’s state (inactive, then pending, then firing) is on Prometheus’s alerts page at http://localhost:9090/alerts.

Log-DLQ vs queue-DLQ

The phase also stands up ElasticMQ (elasticmq.yaml), an SQS-compatible queue, purely to feel a semantic contrast. Both a log and a queue can dead-letter, but they decide differently:

Log (Kafka/Redpanda): the consumer decides a message is poison and writes it to a separate topic. The original stays on the log and can be replayed.
Queue (SQS): the broker redrives a message to the DLQ automatically after it fails to be processed N times (a maxReceiveCount redrive policy). The consumer never explicitly routes it.

Two ways to dead-letterthe consumer decides

The log's consumer inspects each message and writes poison to a dead-letter topic itself, leaving the original on the log to replay.

The demo sends a poison message to the SQS queue, fails to process it three times, and watches the broker move it to the dead-letter queue on its own. Since ElasticMQ speaks the SQS API, the demo drives it with the standard SQS client, the AWS CLI, pointed at the in-cluster endpoint. Same idea, opposite owner.

What’s in the overlay

Phase 4 is overlays/phase-4-resilience/, which bases on phase 3 and adds the dead-letter path, the SLO machinery, and the queue contrast.

dlq-topic-init.yaml: the dead-letter topic

dlq-topic-init.yaml is a Job that creates the events.dlq topic.

app-resilience-patch.yaml: make failure visible

app-resilience-patch.yaml sets the producer to emit a small fraction of poison messages, and gives the stream-processor backpressure (maxOffsetsPerTrigger) and the DLQ topic name.

prometheus-rules.yaml: the SLO

prometheus-rules.yaml holds recording rules for the success-ratio SLI and alerting rules for error-budget burn, high consumer lag, and a down processor.

prometheus-patch.yaml: load the rules

prometheus-patch.yaml patches the phase-2 Prometheus to load those rules and point at Alertmanager, and mounts the rules ConfigMap.

alertmanager.yaml: alerting

alertmanager.yaml is Alertmanager, which receives the alerts. The receiver is a no-op locally; the point is the alerting path exists and fires.

grafana-slo-patch.yaml: the SLO dashboard

grafana-slo-patch.yaml adds an SLO / error-budget dashboard alongside the platform overview.

elasticmq.yaml: the queue contrast

elasticmq.yaml is ElasticMQ, an SQS-compatible queue preconfigured with a main queue, a dead-letter queue, and a redrive policy, purely to contrast a queue DLQ with the log DLQ.

Done when

Injected failures are visibly survived and the SLO stays within budget. Poison diverts, lag recovers, a killed processor resumes, the queue redrives, and the error-budget panel dips but does not bottom out.