
ClickHouse for Query Speedup


But Why?#

Analytics queries on schemas larger than 50GB were consistently breaching 60 seconds. Analysts were waiting on dashboards. Decisions were slowing down. This is the story of how we evaluated five databases, picked ClickHouse, migrated to it, and what we learned the hard way on the other side.


The Problem#

We were running analytics on top of our existing database. As data crossed the tens-of-terabytes mark, query latency became a serious problem. Analysts were waiting 60+ seconds for dashboard queries on schemas over 50GB. That wasn't just a UX problem — it was slowing down actual business decisions.

The instinct is to reach for query optimisation — better indexes, query rewrites, caching layers. We went down that path. The honest conclusion was that we weren't going to optimise our way out of this. The architecture wasn't built for this workload. We needed to move to one that was.


The Evaluation#

We shortlisted five options: Druid, Pinot, BigQuery/Snowflake/Redshift, Spark + Parquet, and ClickHouse. We evaluated against six criteria: query latency on ad-hoc queries, data freshness, ingestion throughput, operational complexity, storage efficiency, and cost predictability.

Here's how each one fell out.


Druid — Eliminated on Flexibility and Operational Complexity#

Druid was originally built at Metamarkets around 2011 for internet advertising analytics. The problem it was solving was specific: billions of ad impression events per day, with analysts querying the same fixed dimensions — campaign, country, device, hour — via dashboards refreshing every few seconds.

For that use case, Druid's answer is brilliant: pre-aggregate the data at ingestion time. Instead of storing 10 billion raw rows, Druid collapses them into ~50 million aggregated rows grouped by known dimensions. Queries become instant because you're scanning 200x less data. This feature is called Rollup.
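
In SQL terms, Rollup behaves roughly like this aggregation applied at ingestion time. The table and column names here are hypothetical, purely to illustrate the idea:

-- 10 billion raw impression rows collapse into one row per dimension combination
SELECT
    campaign,
    country,
    device,
    toStartOfHour(event_time) AS hour,
    count()   AS impressions,
    sum(cost) AS total_cost
FROM raw_impressions
GROUP BY campaign, country, device, hour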

And Rollup is exactly why Druid didn't work for us.

Product analytics is not advertising analytics. PMs come in at 10pm with a new dimension nobody thought of — "can we slice this by referral source?" or "what about users who signed up in Q3 specifically?" Those dimensions weren't pre-aggregated. Druid either can't answer them, or you re-ingest the entire dataset with the new dimension added. That's operationally unacceptable.

Technically Druid can store raw data with Rollup disabled, but the architecture fights you at every layer when you do — the segment structure, the query planner, and the ingestion pipeline are all optimised around the assumption that data has already been aggregated.

Then there's the operational footprint. A production Druid cluster requires:

Coordinator    → manages which Historical nodes hold which segments
Overlord       → manages ingestion task assignment
MiddleManager  → runs the actual ingestion worker tasks
Historical     → serves queries on immutable segment data
Broker         → routes queries across Historical nodes
Router         → optional API gateway
ZooKeeper      → coordination layer for all of the above
Deep storage   → S3 or HDFS for segment backup and handoff

Eight moving parts. All needing monitoring, independent scaling, and operational expertise. Without a dedicated Druid team — which most companies don't have — this is a maintenance burden that compounds over time. We'd essentially be running a distributed systems cluster inside our analytics system.

Druid is a phenomenal tool if your query patterns are fixed and predictable. But product analytics is inherently ad-hoc. You can't pre-define every dimension a stakeholder will ask for.


Pinot — Eliminated on the Same Fundamental Constraints#

Apache Pinot was created at LinkedIn and optimised for user-facing, latency-sensitive queries at very high QPS — think "show a user their LinkedIn profile analytics in real time" with thousands of users hitting the system simultaneously.

The same pre-aggregation trade-off applies. Pinot is extremely fast on known, pre-indexed dimensions. It struggles with arbitrary ad-hoc queries on raw data. The indexing model is more sophisticated than Druid's — inverted indexes, sorted indexes, range indexes, text indexes per column — but you still have to define your access patterns upfront to get the most out of it.

Our bottleneck was scan speed on arbitrary queries from analysts who could ask anything. Pinot's strengths — high QPS on indexed dimensions, sub-millisecond latency for pre-defined queries — weren't the bottleneck we needed to solve. Wrong tool for our shape of problem.


BigQuery / Snowflake / Redshift — Eliminated on Latency Floor and Cost Model#

These are world-class tools. They just weren't built for our workload.

BigQuery, Snowflake, and Redshift are designed for data warehouse workloads — ETL pipelines, scheduled reports, batch processing, quarterly business reviews. They're optimised for correctness, scalability, and ease of use. Not for sub-second interactive queries on raw data.

The fundamental architecture separates compute from storage. When a query runs, it spins up compute resources, reads data from remote storage (S3 in most cases), processes it, and returns results. Even heavily optimised queries on large tables have a cold-start overhead — compute spinning up, data travelling across a network, results materialising. In practice, this means a latency floor of 3–10 seconds even for well-optimised queries.

For an analyst refreshing a dashboard or iterating on a query in real time, that wait compounds into a frustrating experience. It's the difference between a tool that feels responsive and one that feels like you're submitting a job.

The cost model was also a problem. BigQuery charges per TB scanned. Snowflake charges by compute credits consumed per second. At our query volume — many analysts, many dashboard refreshes, frequent ad-hoc queries — costs become unpredictable and hard to budget. A single runaway query scanning your entire dataset can generate a surprise bill. Self-hosted ClickHouse on fixed compute gives you flat-rate infrastructure costs.

For batch analytics — overnight ETL jobs, data science workloads, monthly reports — BigQuery and Snowflake are excellent. That just wasn't our primary use case.


Spark + Parquet — Wrong Category Entirely#

Spark is a batch data processing engine. When you run a Spark query, it spins up distributed workers, reads Parquet files from S3 or HDFS, processes them in parallel, and writes results somewhere. Query execution time is measured in minutes, not seconds.

This is the right tool for building pipelines into an analytics store. It is not the right tool for being the analytics store itself. An analyst running a query and waiting 3 minutes for results isn't doing interactive analytics — they're submitting batch jobs.

Hive has the same problem. It was built to bring SQL syntax to Hadoop MapReduce. The underlying execution model is still batch-oriented.

Both were eliminated in the first pass — not because they're bad tools, but because they're solving a completely different problem. We actually use Spark upstream for data transformation before ingestion into ClickHouse. Ruling it out as an OLAP store was the easiest decision in the evaluation.


Why ClickHouse Won#

| Database | Core Assumption | Why It Lost |
| --- | --- | --- |
| Druid | Query patterns are known upfront; pre-aggregate at ingestion | Ad-hoc flexibility impossible; 8-node operational complexity |
| Pinot | High QPS on pre-indexed, user-facing queries | Wrong latency profile; same pre-definition constraint |
| BigQuery / Snowflake | Batch analytics; compute-storage separation | Latency floor 3–10s; unpredictable per-query cost |
| Spark + Parquet | Batch data transformation pipeline | Minutes-level latency; wrong category entirely |
| ClickHouse | One wide table; raw data; analytical scans | Matched our exact workload assumptions |

ClickHouse won on three dimensions: sub-second latency on raw unaggregated data, excellent compression (we saw ~7x on our event data), and an architecture simple enough to self-host without a dedicated team. We set up ingestion directly from Kafka.
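
A minimal sketch of that ingestion path: a Kafka engine table feeding a MergeTree table through a materialized view. Table names, columns, and broker settings are illustrative, not our production config.

CREATE TABLE events_queue
(
    user_id    UInt64,
    event_type String,
    ts         DateTime
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse-events',
         kafka_format      = 'JSONEachRow';

CREATE TABLE events
(
    user_id    UInt64,
    event_type String,
    ts         DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (event_type, user_id, ts);

-- The materialized view continuously drains the queue into the MergeTree table
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT user_id, event_type, ts FROM events_queue;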

Post-migration, p99 dashboard query time dropped from 60+ seconds to 2–3 seconds on schemas that were previously the bottleneck. That's a 20–30x improvement — not from tuning the old system, but from moving to an architecture that was built for this workload.


Why ClickHouse Is Actually Fast#

The performance isn't magic. Three things stack on each other.

Layer 1: Columnar Storage — Less Data Read From Disk#

In a row store, a query touching 2 columns still reads all 20 columns off disk for every row — you get everything whether you need it or not.

In ClickHouse, data is stored column by column. A query reads only the columns it needs. Everything else stays on disk untouched.

On a 50GB schema, a row-oriented store reads the full 50GB even when the query logically needs 2 of 20 columns, roughly 5GB. ClickHouse reads only what the query touches.
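
For example, a query like this against a hypothetical 20-column orders table opens only the country and revenue column files; the other 18 columns never leave disk:

SELECT country, sum(revenue) AS total_revenue
FROM orders
GROUP BY country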

Layer 2: Compression — Columns Are Physically Smaller#

Similar values stored together compress dramatically. Countries repeat (IN, IN, IN, US, US...). Dates cluster. Numbers have patterns. ClickHouse typically achieves 5–10x compression on real-world event data.

The bottleneck in analytics is disk I/O, not CPU. Compression directly attacks that bottleneck:

Without compression: 50GB ÷ 3GB/s = ~17 seconds minimum I/O time
With 5x compression: 10GB ÷ 3GB/s = ~3.3 seconds

CPUs decompress faster than SSDs can read — so decompression adds near-zero overhead. You get smaller data on disk with negligible cost to read it back.
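
You can inspect this on your own tables: ClickHouse exposes per-column compressed and uncompressed sizes in its system tables. A query along these lines (table name illustrative) shows the ratio per column:

SELECT
    name,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
FROM system.columns
WHERE table = 'orders'
GROUP BY name
ORDER BY sum(data_compressed_bytes) DESC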

Layer 3: Vectorized Execution — Faster CPU Processing#

Normal (scalar) execution:

ADD revenue[0]   → one instruction
ADD revenue[1]   → one instruction
ADD revenue[2]   → one instruction

Vectorized execution (SIMD):

ADD [r0, r1, r2, r3, r4, r5, r6, r7]  → one instruction, eight values

SIMD — Single Instruction, Multiple Data — has been available on every modern CPU since the late 1990s. ClickHouse's query engine is specifically written to batch values and use these instructions. Most databases aren't.

Columnar storage and compression mean far less data reaches the CPU. Vectorized execution means the data that does reach it gets processed faster. Each is a multiplier. Together they explain the 20–30x improvement on our workloads.


What ClickHouse Is Bad At — Joins#

This is where we got burned, and it's worth being direct about.

ClickHouse's entire architecture is optimised for one large table, scanned fast. Joins require matching rows across tables, which means loading the smaller table entirely into RAM as a hash map, then scanning the large table and looking up each row. This breaks down when:

  1. Both tables are large — one must fit in RAM. If it doesn't, performance collapses.
  2. Distributed cluster — joins across shards require network shuffling, which is expensive and poorly optimised.

Postgres has sophisticated join strategies — hash join, merge join, nested loop — with a query planner that picks the right one. ClickHouse's planner is brilliant at single-table aggregations. Multi-table joins are an afterthought.

What ClickHouse wants instead is denormalization. Instead of joining orders + users at query time, copy the user fields directly into the orders table at ingestion time:

orders: order_id, user_id, revenue, country, age, signup_date  ← denormalized

Storage goes up. Query complexity goes to zero.
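
As DDL, that wide table looks something like this. The column types are assumptions for illustration, not our actual schema:

CREATE TABLE orders
(
    order_id    UInt64,
    user_id     UInt64,
    revenue     Decimal(18, 2),
    country     LowCardinality(String),  -- copied from users at ingestion time
    age         UInt8,                   -- copied from users at ingestion time
    signup_date Date                     -- copied from users at ingestion time
)
ENGINE = MergeTree
ORDER BY (country, user_id, order_id);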

What we underestimated was how often analysts would want to join ClickHouse data with our operational database. We had to go back and denormalize several schemas after the migration — that was painful and took longer than the migration itself. This is something we'd have thought much harder about upfront.

If your analytics require heavy joins across multiple large tables, or your data model is highly normalised and you can't afford to denormalise, ClickHouse will fight you constantly. It wants one wide table. Design your schemas around that or pick a different tool.


MergeTree vs ReplacingMergeTree — Understanding ClickHouse Writes#

ClickHouse never updates in place. Every write is an append. Data lands in small immutable chunks called parts, and a background process periodically merges small parts into larger ones. This is MergeTree — and it's why ClickHouse is fast at writes: tight packing and compression make in-place updates prohibitively expensive at the architecture level.

MergeTree stores all inserts with no deduplication ever. If you insert the same row twice, you have two rows. This is exactly right for immutable data: click events, log lines, transactions — facts that happened and never change.

ReplacingMergeTree adds one behaviour: during a background merge, if two rows share the same sorting key (ORDER BY), only the latest one is kept. You specify a version column (usually a timestamp); the highest value wins. This is how ClickHouse simulates updates: you insert a new version of the row instead of updating the old one.

INSERT: user_id=123, status='trial',   ts=2024-01-01
INSERT: user_id=123, status='paid',    ts=2024-01-15
INSERT: user_id=123, status='churned', ts=2024-03-01

AFTER MERGE: user_id=123, status='churned', ts=2024-03-01  ✅
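
The table definition behind that behaviour, as a sketch (column types assumed): the version column is passed to the engine, and rows deduplicate on the ORDER BY key.

CREATE TABLE users
(
    user_id UInt64,
    status  LowCardinality(String),
    ts      DateTime  -- version column: highest value survives the merge
)
ENGINE = ReplacingMergeTree(ts)
ORDER BY user_id;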

The critical catch: the merge happens in the background on ClickHouse's schedule, not yours. Between inserts and merge completion, all versions coexist. Query during that window and you get duplicates.

Two defences against this:

FINAL keyword — simple but slower:

SELECT * FROM users FINAL WHERE user_id = 123

Forces deduplication at query time. Correct, but adds overhead.

argMax pattern — what production teams actually rely on:

SELECT user_id, argMax(status, ts)
FROM users
WHERE user_id = 123
GROUP BY user_id

Manually picks the value corresponding to the latest timestamp. No dependency on merge state whatsoever.

| | MergeTree | ReplacingMergeTree |
| --- | --- | --- |
| Duplicates | Keeps all rows | Removes on background merge |
| Use case | Immutable event logs | Stateful records (users, orders, subscriptions) |
| Risk | None | Duplicates between merges |
| Defence | n/a | FINAL keyword or argMax pattern |

The Mutation Incident#

Several months post-migration, we noticed query performance gradually degrading. Not sharply — gradually, invisibly. Queries weren't getting slower in isolation. The disk was just constantly busy.

The root cause: certain workflows were doing row-level updates:

ALTER TABLE orders UPDATE status = 'paid' WHERE order_id = 123

Completely normal in Postgres. Quietly catastrophic in ClickHouse.

In ClickHouse, every UPDATE creates a mutation — a background job that reads every affected data part, rewrites the entire part with the change applied, and replaces the old one. Over 4–5 months, those updates accumulated into a backlog of pending mutations constantly rewriting data on disk, competing with queries for the same disk I/O.

A deep dive into ClickHouse's system tables revealed the mutation queue. Hundreds of pending mutations, queued up silently. We refactored the update workflows to use ReplacingMergeTree — insert new versions of rows instead of mutating existing ones. Performance recovered.
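
For reference, that queue lives in system.mutations. A query along these lines surfaces anything still pending:

SELECT database, table, mutation_id, command, create_time, parts_to_do
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time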

The lesson: every database has a mental model for writes, and you have to design your application around that model, not against it. ClickHouse's model is append-only. Fighting that assumption is expensive, silent, and cumulative.


Partition Keys — The Right Way to Think About Them#

Without partitioning, a query filtering on January 2024 still scans 3 years of data. ClickHouse has to read everything to find what's relevant.

Partitioning physically separates data on disk by partition key value:

/orders/202401/   ← all January 2024 data
/orders/202402/   ← all February 2024 data
/orders/202403/   ← all March 2024 data

Parts from different partitions never merge with each other — it's a hard physical boundary. When a query has a WHERE clause that filters on the partition key, ClickHouse uses partition pruning: it looks at which partitions could possibly match and never opens the others. Zero I/O on pruned partitions.

Query touches 1 month of 36 months → reads ~3% of data → 97% I/O eliminated before the query even starts
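
You can check that pruning actually kicks in: recent ClickHouse versions can show which parts and granules the planner selected for a query. A sketch, assuming the orders table from earlier:

EXPLAIN indexes = 1
SELECT count()
FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01'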

Three rules for choosing a partition key:

Rule 1: Partition on your most common filter. If 80% of queries filter by time, partition by month. The key only helps if queries actually filter on it.

Rule 2: Don't over-partition. Monthly gives you 36 partitions over 3 years — manageable. Daily gives you ~1,000. Hourly gives you ~26,000. Too many partitions means too many small parts, merge efficiency collapses, and queries slow down. Each partition should contain hundreds of millions of rows minimum. If a partition would have only millions of rows, go coarser.

Rule 3: Low cardinality only.

toYYYYMM(date)   → 12–36 values      ✅
country          → ~200 values        ✅
user_id          → millions of values ❌
order_id         → unique per row     ❌
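
One way to sanity-check Rule 2 on a live table is to look at rows and size per active partition in system.parts (table name illustrative):

SELECT
    partition,
    sum(rows) AS row_count,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'orders' AND active
GROUP BY partition
ORDER BY partition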

The partition key and the primary key (ORDER BY) are different and complementary:

| | Partition Key | Primary Key (ORDER BY) |
| --- | --- | --- |
| Level | Directory / filesystem | Within a part |
| Skips | Entire partitions | Granules within a part |
| Granularity | Coarse | Fine |
| Analogy | Which floor of the library | Which shelf on that floor |

Use both together:

PARTITION BY toYYYYMM(created_at)   -- skip entire months
ORDER BY (country, user_id)          -- fine-grained index within a month

We partitioned by month on our primary event tables. Analysts almost always queried within a time range, so partition pruning eliminated the majority of I/O before the query even started. We considered daily partitioning for finer granularity but monthly was the right balance between pruning effectiveness and operational manageability.


The 2500 Table Limit — What the ClickHouse Team Told Us and Why#

The ClickHouse team advised staying under ~2500 tables per instance. This isn't an arbitrary number — it has three distinct technical root causes.

In Postgres, a table is essentially a logical entry in a catalog. In ClickHouse, every table is a living operational entity with ongoing background processes. Having 10,000 tables isn't 10,000 rows in a catalog — it's 10,000 things the server is actively managing simultaneously.

Root Cause 1: Startup time grows with part count. On every server restart, ClickHouse reads the metadata of every part of every table to rebuild its in-memory state before it can serve queries. More tables → more parts → restarts take minutes instead of seconds. In production, a server that takes 8 minutes to come back after a crash is a schema design problem.

Root Cause 2: Background merge thread pool gets overwhelmed. ClickHouse has a shared background merge thread pool — default 16 threads — responsible for merging small parts into larger ones across all tables on the instance. With 2500+ tables all competing for those 16 threads:

16 threads ÷ 2500 tables = each table gets merge attention extremely rarely

Parts accumulate faster than they get merged. Queries slow down because they're scanning hundreds of small unmerged parts instead of a few large clean ones. Silent, gradual degradation — exactly like the mutation incident, but caused by table count instead of update patterns.
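
The symptom is visible in system.parts: tables accumulating hundreds of small active parts are losing the merge race. A query like this flags the worst offenders:

SELECT table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY table
ORDER BY active_parts DESC
LIMIT 10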

Root Cause 3: Replicated tables overwhelm ZooKeeper. Every replicated table has a corresponding path in ZooKeeper (or ClickHouse Keeper). Every write, every schema change, every replication operation goes through it. Too many replicated tables → ZooKeeper node count explodes → jute.maxbuffer limits get hit → schema changes start failing → replication coordination breaks down. This is why the limit is stricter for replicated tables (a few hundred) than non-replicated ones (a few thousand).

Safe zone guidelines:

| Table Type | Safe Limit | Hard Constraint |
| --- | --- | --- |
| Non-replicated MergeTree | ~2,000–5,000 | Startup time and merge pressure |
| ReplicatedMergeTree | A few hundred | ZooKeeper coordination overhead |
| Tables with real-time inserts | Dozens | Merge pool contention is immediate |

We designed our schema to stay well within that boundary by using partitions for logical separation instead of creating separate tables.


What We'd Do Differently#

Denormalization strategy upfront. We underestimated how much analysts would want to join ClickHouse data with our operational database. Thinking hard about denormalization before migration rather than retrofitting it post-launch would have saved significant pain.

Audit update flows before migration. We should have walked through every workflow touching the tables we were migrating and explicitly asked: does this workflow do updates? ClickHouse's mutation model is fundamentally different from Postgres's. Catching that before go-live would have spared us the 4–5 months of gradual degradation we hit post-launch.


The Generalizable Principle#

Every database is built on a set of assumptions about how data will be written and read. Postgres assumes normalized data, frequent updates, mixed read/write workloads. ClickHouse assumes append-heavy, analytical, one wide table. The migration taught us that picking the right tool isn't about features — it's about matching the database's mental model to your actual access patterns. When those align, you get 20–30x improvements without a single line of query optimisation. When they don't align, no amount of tuning saves you.
