Skip to content

Cahid Arda Öz

DX - Software Engineer at Upstash

Istanbul, Turkey

← Building a Local Data Platform on Kubernetes

Phase 3: Row vs Column, Made Physical

Teeing the same stream into ClickHouse (column/OLAP) and Iceberg-on-MinIO (lakehouse) next to Postgres (row/OLTP), then running the identical analytical query on the row store and the column store to feel the difference on measured numbers.

Blog Essays, opinions, and how-tos.

Fourth build post in the data-platform series. The storage-shapes post argued why one dataset wants more than one storage layout. Phase 3 builds it and turns the argument into numbers you measure yourself.

Run it

Apply everything through this phase (cumulative), then open the live dev UI:

make phase-3
make tilt PHASE=phase-3-columnar

(First run on a clean machine: bring up the cluster first, see Phase 0.)

What’s added in this phase?

The cluster at phase 3: Columnarcumulative · scrub to replay the growth
phase
k3d cluster: data-platform2 namespaces
platform
placeholderredpandapostgresproducerstream-processorapiclickhouseminioiceberg-rest
observability
prometheusgrafanalokialloykafka-lag-exporter

Phase 3: Columnar

Adds the column store and lakehouse: ClickHouse, MinIO, and the Iceberg REST catalog. Everything else (in muted) was already there: one cluster, growing a layer at a time.

The log is the source of truth; these stores read from it

Redpanda is still the source of truth: every event is appended to the log in order and can be replayed. But a log is not where you read data from. It is append-ordered bytes tuned for streaming and replay, not for “revenue per tenant” or “fetch order ord-8123”, and keeping it forever is expensive. So the platform derives purpose-built stores from the log, each a materialised view you could rebuild by replaying it:

  • ClickHouse, the column store. An OLAP database that keeps each column stored together and compressed, so analytical scans (aggregate a couple of columns over millions of rows) are fast and small. It backs the dashboards and the analytics endpoint.
  • Iceberg on MinIO, the lakehouse. MinIO is S3-compatible object storage: cheap commodity disk, not a database. Iceberg turns the files in it into a real table with a schema, ACID commits, and time travel. This is where the full history lives once it is too big or too cold for the log’s bounded retention, queryable later by batch engines without paying warehouse prices to store it.
  • iceberg-rest, the catalog. An Iceberg table is just data files plus metadata files in the bucket; something has to track which files make up the table’s current snapshot and schema. The REST catalog is that authority: writers and readers ask it for the table’s current state and it points them at the files in MinIO. Without it, the bucket is just loose files nobody can read as a table.

So Redpanda holds the recent, replayable log; Iceberg holds the durable, queryable history of the same events; ClickHouse and Postgres are the fast read paths. One stream, fanned out to the shape each question needs.

The tee is a config flag, not a rewrite

The stream processor (job.py) was written from phase 1 to dispatch each batch to a set of sinks chosen by an environment variable. So adding two stores is mostly switching them on, plus the sink code:

def process_batch(batch_df, batch_id):
    rows = [r.asDict() for r in batch_df.collect()]
    if "postgres" in SINKS:   sink_postgres(rows)
    if "clickhouse" in SINKS: sink_clickhouse(rows)   # column store
    if "iceberg" in SINKS:    sink_iceberg(batch_df)  # lakehouse

The platform half of the phase is a patch (app-sinks-patch.yaml) that flips SINKS and wires the new credentials:

env:
  - { name: SINKS, value: "postgres,clickhouse,iceberg" }

This is the payoff of the log being the source of truth: ClickHouse and Iceberg are new readers of the same events, not a change to the producer. The Iceberg write goes through a REST catalog with S3FileIO pointed at MinIO, which is how a real lakehouse is run rather than a toy file path.

The measurement

The goal is to feel the row-versus-column difference from numbers you measure. The demo script (scripts/demo-phase-3.sh) runs two queries on the same events data, one suited to each store, and prints query time and table size (each store in its own SQL dialect):

bash scripts/demo-phase-3.sh

The same split shows up in production: the API serves analytics from the column store at localhost:8000/analytics/revenue and operational lookups from the row store at localhost:8000/orders (app/api/main.py).

Why is Iceberg not in the comparison?

Iceberg is the lakehouse, not an interactive query store competing with ClickHouse on latency: cheap, durable history on object storage (MinIO) with ACID and time travel, read later by batch engines like Spark or Trino. So the demo benchmarks only the two SQL stores; the Iceberg sink’s job is simply to land the same events durably, which you can see in the MinIO warehouse bucket and the REST catalog.

Measurement 1: an analytical scan (the column store wins)

Revenue per tenant per hour, the same query on both:

SELECT tenant,
       date_trunc('hour', event_time) AS hour,
       sum(amount_cents),
       count(*)
FROM events
GROUP BY tenant, hour;

This is an analytical scan: it reads a few columns across every row and aggregates them, the column store’s strong suit.

# == Postgres (row / OLTP) ==
#    query Time:            42.655 ms
#    table size on disk:    49.720 MB
# == ClickHouse (column / OLAP) ==
#    query Time:             5.000 ms
#    table size on disk:     3.450 MB

ClickHouse answers it faster and stores the same events far smaller, because it reads only the columns the query touches and compresses each column hard, while the row store drags every column off disk and discards most of them. (“Table size on disk” is the whole table’s footprint at rest, the same rows stored each way, not the bytes this query read.)

Measurement 2: a point lookup (the row store wins)

Flip to the row store’s home turf, fetching one whole row by its key:

SELECT * FROM events WHERE event_id = '...';

event_id is the Postgres table’s PRIMARY KEY, so Postgres seeks straight to that row through its B-tree index. ClickHouse stores the same rows ordered by (tenant, event_time), so it has no index on event_id and has to scan to find it.

# == Postgres (row / OLTP) ==
#    query Time:             0.803 ms
# == ClickHouse (column / OLAP) ==
#    query Time:             4.000 ms

Now the result inverts: the row store answers in well under a millisecond, the column store scans. This is the whole reason to keep both, the same events sitting in a row store for “fetch this one record” and a column store for “aggregate across all of them”.

Could ClickHouse just order by event_id too?

It could, and then the lookup would use its sparse primary index instead of a full scan. But you would not do it, for two reasons:

  • It would wreck the column store’s actual job. event_id is a random, unique value. Columnar compression depends on similar values sitting next to each other (so they encode as runs, deltas, or dictionaries); ordering the table by a random key scatters every column and collapses compression, and the analytical scans the column store exists for get slower because related rows are no longer near each other. Ordering by (tenant, event_time) keeps each tenant’s events together in time, which compresses hard and keeps the scans fast.
  • A column store still is not a point-lookup engine. ClickHouse’s primary index is sparse (one mark per ~8192-row granule), not a row-level B-tree. Even sorted on event_id it would read and decompress a whole granule and reassemble the row from every column file, heavier than Postgres seeking to a single contiguous row.

So the comparison is fair in the way that matters: each store is laid out for the workload it is meant to serve. The lesson is not “ClickHouse cannot index this”, it is that the layout that makes scans fast is the opposite of the one that makes point lookups fast, so you keep both.

What’s in the overlay

Phase 3 is overlays/phase-3-columnar/, which bases on phase 2 and adds the column store and the lakehouse.

clickhouse.yaml: the column store

clickhouse.yaml is ClickHouse as a StatefulSet (a MergeTree-backed column store), credentials from the clickhouse-credentials Secret.

minio.yaml: object storage

minio.yaml is MinIO (S3-compatible) plus a one-shot Job that creates the warehouse bucket. It backs the Iceberg lakehouse.

iceberg-rest.yaml: the table catalog

iceberg-rest.yaml is an Iceberg REST catalog that tracks table metadata and points the warehouse at MinIO via S3FileIO. It is the catalog the stream-processor writes through.

app-sinks-patch.yaml: turn the sinks on

app-sinks-patch.yaml flips the stream-processor’s SINKS to postgres,clickhouse,iceberg and wires the new credentials, and enables the api’s ClickHouse analytics endpoint. The sink code already existed, gated by SINKS.

Done when

You can state the latency and size difference between Postgres and ClickHouse for the same query, from measurements you took. At that point row-versus-column has stopped being a sentence you read and become a thing you felt.