Phase 1: First Data in Motion
The end-to-end spine: a synthetic event producer to a Redpanda topic, a Spark Structured Streaming job that upserts current order state into Postgres, and a live orders view served over it.
Second build post in the data-platform series. Phase 0 left a green but empty cluster. Phase 1 puts the first event through the whole spine: generate, ingest, process, store, serve.
Run it
Apply everything through this phase (cumulative), then open the live dev UI:
make phase-1
make tilt PHASE=phase-1-ingest
(First run on a clean machine: bring up the cluster first, see Phase 0.)
What’s added in this phase?
1/4 · A producer builds one order event for a tenant.
point lookups by key: current order status
- a synthetic event producer,
- a Redpanda topic as the log,
- a Spark Structured Streaming job writing to Postgres,
- a small API serving a live orders view.
Phase 1: Ingest
Adds the data spine: a producer, the Redpanda log, the Spark stream-processor, Postgres, and the API. Everything else (in muted) was already there: one cluster, growing a layer at a time.
Redpanda and Kafka
Apache Kafka is the original distributed event log, and over the years its client protocol (the
“Kafka API”) became the de facto standard that a whole ecosystem of tools and client libraries
speaks. Redpanda is a different implementation of that same protocol:
it is wire-compatible with Kafka, so anything that talks to Kafka talks to Redpanda unchanged. In
this project that is why the producer uses a Kafka client library (confluent-kafka), the config
key is KAFKA_BOOTSTRAP, and Spark reads the stream with its Kafka connector, all pointed at
Redpanda. Nothing in the application code is Redpanda-specific; you could swap in real Kafka (for
example via the Strimzi operator) by changing only the broker address.
So why Redpanda here rather than Kafka itself? It is a single C++ binary with no JVM and no separate coordination service (it has Raft built in), which makes it dramatically lighter to run on a laptop, the binding constraint for this whole project. The trade-off worth knowing: Redpanda is source-available under the BSL licence rather than Apache 2, which does not affect local learning but would matter for some production uses. Throughout the series, read “Kafka” (the protocol, the topic, the consumer group) and “Redpanda” (the thing running) as the same log.
The producer
The producer (app/producer/producer.py)
emits order-lifecycle events for each tenant, keyed by tenant so a tenant’s events keep their order
on the log. It exposes a Prometheus counter from the start, even though nothing scrapes it until
phase 2.
def make_event(tenant: str) -> dict:
return {
"event_id": str(uuid.uuid4()),
"tenant": tenant,
"event_type": random.choices(EVENT_TYPES, weights=[5, 4, 3, 1])[0],
"order_id": f"ord-{random.randint(1, 100_000)}",
"amount_cents": random.randint(500, 50_000),
"event_time": datetime.now(UTC).isoformat(),
}
The stream processor
The stream processor is a Spark Structured Streaming
job (job.py).
Structured Streaming does not handle events one by one; it works in micro-batches. On a fixed
trigger (here every 5 seconds) it asks the topic for the records that have arrived since the offset
it last read, hands them to your code as one small table (a Spark DataFrame), and records the new
offset only once that batch has been written successfully. So its input is “the new events on the
topic since last time,” and its unit of work is one small batch.
For each batch the job parses the raw JSON into typed rows (event_id, tenant, event_type,
amount_cents, event_time, and so on) and writes them to two tables: it appends every event to
an append-only events table (the full history), and it upserts the latest state per order into
current_orders (the live view the API serves).
The one fiddly bit is “latest event per order wins,” since a single batch can contain a tenant’s
events out of order. That reducer is pulled into a small pure function
(transforms.py)
and unit tested on its own
(tests/test_transforms.py),
following the project’s rule: test the transformation logic you wrote, not the infrastructure around it.
The upsert itself guards against stale writes, so a late event never overwrites a newer status:
INSERT INTO current_orders
(order_id, tenant, status, amount_cents, currency, updated_at)
VALUES %s
ON CONFLICT (order_id) DO UPDATE SET
status = EXCLUDED.status,
amount_cents = EXCLUDED.amount_cents,
updated_at = EXCLUDED.updated_at
WHERE EXCLUDED.updated_at >= current_orders.updated_at
The serving layer
The serving layer is a small FastAPI app
(app/api/main.py) that reads
current_orders from Postgres and exposes a few endpoints. Tilt port-forwards it to
http://localhost:8000:
GET /is a minimal HTML page that polls/ordersonce a second and renders the live orders table. It is functional, not designed, which is the explicit bar for dashboards in this project.GET /ordersreturns the current order state as JSON (optionally filtered by tenant).GET /statsreturns a per-tenant rollup: order count and revenue.GET /metricsexposes Prometheus metrics (nothing scrapes them until phase 2).
What you see
Events flow on the topic, rows land in Postgres, and the live orders page updates once a second as
order statuses move from order_placed to order_paid to order_shipped.
There is no Grafana yet (that is phase 2), so you watch this one through the API page, the Tilt UI, and a few peeks at the stores:
-
The live orders page. Tilt port-forwards the API to
localhost:8000: open http://localhost:8000 for the auto-refreshing table,/statsfor a per-tenant rollup,/ordersfor the raw JSON. -
The Tilt UI at
localhost:10350: click any resource (producer, stream-processor, redpanda, postgres) to tail its logs. -
What the producer emits, read straight off the topic:
kubectl -n platform exec -it statefulset/redpanda -- \ rpk topic consume events --num 5 -
What landed in Postgres:
kubectl -n platform exec -it statefulset/postgres -- \ psql -U platform -d platform -c "SELECT count(*) FROM events;" \ -c "SELECT tenant, count(*) FROM current_orders GROUP BY tenant;" -
What the stream is doing, one log line per 5-second micro-batch:
kubectl -n platform logs -f deploy/stream-processor
Keep the orders page open and re-run the count(*) query a few times: the page ticks and the count
climbs. That is producer to Redpanda to Spark to Postgres to API, end to end.
What’s in the overlay
Phase 1 is overlays/phase-1-ingest/, which bases on phase 0 and adds the spine’s infrastructure and workloads.
redpanda.yaml: the event log
redpanda.yaml is a single-node Redpanda StatefulSet in dev-container mode, with a headless service exposing the Kafka port (9092) and the admin/metrics port (9644). One broker is plenty locally; the headless service gives it stable DNS for clients.
postgres.yaml: the row store
postgres.yaml is a Postgres StatefulSet with a PersistentVolumeClaim, its user and password read from the OpenTofu-generated postgres-credentials Secret. It holds current_orders and the append-only events table.
topic-init.yaml: create the events topic
topic-init.yaml is a one-shot Job that waits for the broker, then creates the events topic with three partitions. It is idempotent, so re-running it does nothing.
app.yaml: producer, stream-processor, api
app.yaml holds the three application Deployments: the producer, the Spark stream-processor (its streaming checkpoint on a PVC so a killed pod resumes), and the api. Config comes from platform-config, database credentials from the Secret, and enableServiceLinks is off so Kubernetes’ injected service env vars do not clobber the app’s own POSTGRES_PORT.
kustomization.yaml: ties it together
kustomization.yaml lists phase 0 as its base plus the files above, so applying it brings up everything through phase 1.
Done when
Producing events results in rows appearing in Postgres, visible live. The spine is the thing every later phase decorates: phase 2 makes it observable, phase 3 tees it into more stores, phase 4 makes it survive failure.