Building a Local Data Platform on Kubernetes

I wanted hands-on experience running a modern, cloud-style data stack on Kubernetes, and a project where I could lean on what I know about data engineering while learning the platform side around it. So I built one: a data platform that ingests event streams, processes them, stores each dataset in the format that fits its access pattern, and exposes dashboards. It is local only, a single k3d cluster on a laptop with zero paid services, but standing the whole thing up by hand made the moving parts concrete in a way reading never did. The code is at github.com/CahidArda/local-data-platform; this post is the map for the series that follows.

I built it with a lot of help from AI, and I wrote these articles the same way: by asking hundreds of questions about the parts that were new to me or that I wanted to understand more deeply, then turning the answers into the explanations you are reading.

The series

Read it in order. The first two posts lay the conceptual groundwork (the architecture and the storage trade-offs); the rest build the platform one phase at a time.

What I wanted to learn

The aim was hands-on Kubernetes experience and a way to put my data-engineering background to work end to end, while filling in the tools, terms, and trade-offs that are easier to feel than to read about:

engineering batch and streaming pipelines that are observable, resilient, and efficient,
the trade-offs between transactional (OLTP), analytical (OLAP), and streaming processing, and between row and column storage,
operating a Kubernetes-native, cloud-style data stack end to end,
clean Git, GitOps, and CI/CD workflows around all of it.

I also wanted to learn it by building, with AI as the helper. The loop went like this:

create the repo and a series of articles, each explaining one concept at a time,
read the series and the code, stand the setup up, and try to make it work,
wherever something did not make sense, ask AI about it until it did,
fold the answer back into the article or the code, then go around again.

So these posts are not a tutorial written from authority. They are the trail of that loop: the explanations here are the ones that finally made each concept click for me.

The scenario everything hangs off

One scenario keeps the layers reinforcing each other instead of being disconnected toys.

Three internal product groups (retail, marketplace, and subscriptions) each emit a stream of synthetic order events into a shared platform. The platform ingests those events, processes them (stream and batch), stores each dataset where it belongs, and serves dashboards over all of it.
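To make the scenario concrete, here is a minimal sketch of what generating those synthetic order events could look like. The field names (event_id, product_group, order_total_cents, emitted_at) and the helper functions are hypothetical, chosen for illustration; they are not the schema or code from the repo.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# The three product groups named in the scenario.
PRODUCT_GROUPS = ("retail", "marketplace", "subscriptions")


def make_order_event(rng: random.Random) -> dict:
    """Build one synthetic order event. Every field name here is an assumption."""
    return {
        # Derive the UUID from the seeded RNG so runs are reproducible.
        "event_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "product_group": rng.choice(PRODUCT_GROUPS),
        "order_total_cents": rng.randint(100, 50_000),
        "emitted_at": datetime.now(timezone.utc).isoformat(),
    }


def generate_events(count: int, seed: int = 0) -> list[dict]:
    """Return `count` events from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [make_order_event(rng) for _ in range(count)]


if __name__ == "__main__":
    # Print a few events as JSON lines, the shape a producer might send.
    for event in generate_events(3):
        print(json.dumps(event))
```

Seeding the generator keeps the IDs, groups, and totals repeatable between runs, which helps when comparing what different storage layers did with the same input. Only the timestamp changes.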

That single scenario is enough to make every hard trade-off show up on its own: row versus column storage, a log versus a queue, OLTP versus OLAP, and what it costs to keep all of it observable.
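The row-versus-column trade-off can be felt even in plain Python. The sketch below holds the same three made-up orders in both layouts. It only illustrates the intuition and says nothing about how Postgres or ClickHouse store data internally; the records and function names are invented for the example.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "product_group": "retail", "total_cents": 1200},
    {"order_id": 2, "product_group": "marketplace", "total_cents": 560},
    {"order_id": 3, "product_group": "retail", "total_cents": 990},
]


def to_columns(records: list[dict]) -> dict[str, list]:
    """Pivot row-shaped records into one list per field (column layout)."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


columns = to_columns(rows)


def fetch_order(records: list[dict], order_id: int) -> dict:
    """OLTP-style access: one whole record, which the row layout hands over directly."""
    return next(r for r in records if r["order_id"] == order_id)


def total_revenue(cols: dict[str, list]) -> int:
    """OLAP-style access: scan a single field and ignore every other column."""
    return sum(cols["total_cents"])


if __name__ == "__main__":
    print(fetch_order(rows, 2))
    print(total_revenue(columns))
```

Fetching one order touches a single row and gets every field at once. The revenue total reads only the total_cents list and never looks at the other columns, which is the access pattern a column store is built around.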

Built in phases you can diff

The platform grows in eight phases. Each one adds exactly one layer onto the same cluster, and each is a Git tag, so git diff phase-2 phase-3 shows precisely what a step introduced, both application code and infrastructure, in one diff. That is the whole point: describe a change, apply it, see what changed.

The stack is deliberately all open-source and local: k3d for Kubernetes, Redpanda for the event log, Spark Structured Streaming for processing, Postgres / ClickHouse / Iceberg-on-MinIO for the three storage shapes, Prometheus, Grafana, and Loki for observability, Dagster for orchestration, OPA for policy. Each was chosen for what it does best.