Running ClickHouse in Production: Ingestion, Merges & Cost at 2B Rows
When we started building tracio.ai's analytics tier, we needed a database that could handle our specific workload: ingest 50,000 device identification events per second, store 2+ billion rows, and answer analytical queries in under one second. We evaluated PostgreSQL (too slow for aggregations at this scale), Elasticsearch (too expensive for time-series analytics), and ClickHouse. ClickHouse won decisively.
Why ClickHouse
ClickHouse is a column-oriented OLAP database designed for real-time analytics. Its key advantage for our workload is that it reads only the columns needed for each query. When a fraud analyst asks "show me the fraud rate by country for the last 7 days," ClickHouse reads only the country, timestamp, and risk_score columns — ignoring the other 40+ columns in the event table. On a 2B row table, this reduces I/O by 95%.
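The dashboard query might look like the following sketch; the table and column names (`events`, `risk_score`, and the 0.8 fraud threshold) are illustrative, not the production schema:

```sql
-- Only three of the table's 40+ columns are read; ClickHouse
-- never touches the column files the query does not reference.
SELECT
    country,
    countIf(risk_score >= 0.8) / count() AS fraud_rate  -- threshold is an assumption
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY country
ORDER BY fraud_rate DESC;
```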
ClickHouse also compresses data extremely well. Our 2B row events table occupies 340 GB on disk — about 170 bytes per row compressed, versus 1.2 KB per row uncompressed. The compression ratio of 7:1 means more data fits in memory, which directly translates to faster queries.
Schema Design
Our primary table stores one row per identification event:
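A sketch of the schema, showing only the columns discussed in this post (the production table has 40+ columns; the exact types and the monthly partitioning scheme are assumptions):

```sql
CREATE TABLE events
(
    workspace_id    UInt64,
    timestamp       DateTime,
    visitor_hash    UInt64,
    country         LowCardinality(String),
    device_type     LowCardinality(String),
    browser_family  LowCardinality(String),
    os_family       LowCardinality(String),
    risk_score      Float32,
    fraud_detected  UInt8
    -- ... remaining event columns elided
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)  -- partitioning scheme is an assumption
ORDER BY (workspace_id, toDate(timestamp), visitor_hash)
TTL timestamp + INTERVAL 90 DAY;
```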
The table uses the MergeTree engine, ordered by (workspace_id, toDate(timestamp), visitor_hash). This ordering is critical — it means queries filtered by workspace and date range read minimal data. The visitor_hash column enables fast lookups by visitor ID without a secondary index.
We chose LowCardinality(String) for country, device_type, browser_family, and os_family because these columns have fewer than 10,000 distinct values. ClickHouse stores LowCardinality columns as dictionary-encoded integers, reducing storage by 80% compared to plain strings and speeding up GROUP BY operations.
Sharding Strategy
We shard the events table across 6 nodes using a hash of workspace_id. This ensures that all events for a given customer are on the same shard, which means most queries (filtered by workspace_id) hit a single shard. Cross-shard queries are needed only for internal analytics.
Each shard has 2 replicas for high availability. Replication uses ClickHouse's built-in ReplicatedMergeTree engine with ZooKeeper coordination. Failover is automatic — if a node goes down, queries are routed to its replica with no client-side changes.
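In outline, the topology combines a replicated local table on each node with a Distributed table that routes by workspace. The cluster name (`tracio`), database (`default`), and ZooKeeper path are placeholders:

```sql
-- On each node: the replicated local table ({shard} and {replica}
-- are filled in from per-node macros).
CREATE TABLE events_local ON CLUSTER tracio
(
    workspace_id UInt64,
    timestamp    DateTime,
    visitor_hash UInt64
    -- ... remaining columns elided
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (workspace_id, toDate(timestamp), visitor_hash);

-- The Distributed table hashes workspace_id, so one customer's
-- events always land on (and are queried from) a single shard.
CREATE TABLE events ON CLUSTER tracio AS events_local
ENGINE = Distributed(tracio, default, events_local, cityHash64(workspace_id));
```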
Ingestion Pipeline
Events flow from our Kafka topic into ClickHouse through a custom Go service that batches inserts. We insert in batches of 10,000 rows every 500ms — this balances ingestion latency (sub-second) with insert efficiency (ClickHouse performs best with large batches).
The ingestion service handles back-pressure gracefully. If ClickHouse is slow to accept inserts (during merges or heavy query load), the service buffers up to 1 million events in memory and applies back-pressure to the Kafka consumer. In 18 months of production, we have never lost an event.
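The core batching policy can be sketched in Go as follows. This is a minimal illustration of the 10,000-row / 500 ms flush rule, not the production service; the type and function names are invented for this example, and Kafka consumption, retries, and the in-memory back-pressure buffer are omitted:

```go
package main

import (
	"fmt"
	"time"
)

// Batcher accumulates rows and flushes when either the size or the
// time threshold is hit, whichever comes first.
type Batcher struct {
	batchSize  int
	flushEvery time.Duration
	buf        []string
	lastFlush  time.Time
	flush      func(batch []string) // e.g. issues one INSERT to ClickHouse
}

func NewBatcher(size int, every time.Duration, flush func([]string)) *Batcher {
	return &Batcher{batchSize: size, flushEvery: every, lastFlush: time.Now(), flush: flush}
}

// Add appends one event; a full batch or an elapsed interval triggers a flush.
func (b *Batcher) Add(event string) {
	b.buf = append(b.buf, event)
	if len(b.buf) >= b.batchSize || time.Since(b.lastFlush) >= b.flushEvery {
		b.Flush()
	}
}

// Flush sends any pending rows as one batch and resets the timer.
func (b *Batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil
	b.lastFlush = time.Now()
}

// runDemo feeds 25,000 events through a 10,000-row batcher and
// records the size of every flushed batch.
func runDemo() []int {
	var sizes []int
	b := NewBatcher(10000, 500*time.Millisecond, func(batch []string) {
		sizes = append(sizes, len(batch))
	})
	for i := 0; i < 25000; i++ {
		b.Add("event")
	}
	b.Flush() // drain the tail on shutdown
	return sizes
}

func main() {
	fmt.Println(runDemo()) // [10000 10000 5000]
}
```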
Query Optimization
Materialized Views
For common dashboard queries, we use materialized views that pre-aggregate data. Our fraud rate dashboard, for example, reads from a materialized view that aggregates fraud_detected counts by workspace, country, and hour. The view reduces the data scanned for this query from 2B rows to 5M rows.
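A sketch of such a view; the view name, engine choice, and column names are illustrative:

```sql
-- Rolls events up to (workspace, country, hour) as rows are inserted;
-- the dashboard queries this view instead of the raw 2B-row table.
CREATE MATERIALIZED VIEW fraud_by_country_hourly
ENGINE = SummingMergeTree
ORDER BY (workspace_id, country, hour)
AS SELECT
    workspace_id,
    country,
    toStartOfHour(timestamp) AS hour,
    countIf(fraud_detected)  AS fraud_events,
    count()                  AS total_events
FROM events
GROUP BY workspace_id, country, hour;
```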
Projection Ordering
ClickHouse projections let us define alternative sort orders for a table without maintaining a separate table: each data part stores a re-sorted copy, trading some extra storage for faster lookups. We added a projection ordered by (workspace_id, visitor_hash, timestamp) for visitor timeline queries. Without the projection, these queries scanned entire date ranges. With it, they read only the blocks containing the target visitor.
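The projection is defined in two steps; the projection name is illustrative:

```sql
-- Declare the alternative sort order for all future parts ...
ALTER TABLE events ADD PROJECTION visitor_timeline
(
    SELECT * ORDER BY workspace_id, visitor_hash, timestamp
);

-- ... and rewrite existing parts so historical data is covered too.
ALTER TABLE events MATERIALIZE PROJECTION visitor_timeline;
```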
Approximate Functions
For dashboard queries where exact counts are not critical, we use ClickHouse's approximate functions: uniqCombined for distinct counts (2% error margin, 10x faster than uniqExact) and quantileTDigest for percentile calculations. The fraud analytics dashboard uses approximate functions exclusively, which keeps all dashboard queries under 200ms.
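For example, a daily-rollup query combining both functions might look like this sketch (column names as above):

```sql
SELECT
    toDate(timestamp)                 AS day,
    uniqCombined(visitor_hash)        AS unique_visitors,  -- approximate distinct count
    quantileTDigest(0.95)(risk_score) AS p95_risk          -- approximate percentile
FROM events
WHERE timestamp >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
```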
Performance Numbers
Here are representative query benchmarks on our 2B row production cluster:
Fraud rate by country, last 7 days: 120ms
Visitor timeline (50 events): 8ms
Unique visitors per day, last 30 days: 340ms
Risk score distribution, last 24 hours: 95ms
Top 100 devices by event count, last 30 days: 210ms
These numbers include network round-trip from our application servers to the ClickHouse cluster. Pure query execution time is typically 30-50% lower.
Operational Lessons
Lesson 1: Monitor merge lag
ClickHouse's MergeTree engine continuously merges small data parts into larger ones. If merges fall behind (due to high insert rate or disk I/O contention), query performance degrades because queries must scan more parts. We monitor the parts count per partition and alert when it exceeds 300.
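The check reads from the built-in system.parts table; the 300-part threshold is our alerting level from above:

```sql
-- Active part count per partition; any row returned triggers an alert.
SELECT database, table, partition, count() AS parts
FROM system.parts
WHERE active AND table = 'events'
GROUP BY database, table, partition
HAVING parts > 300;
```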
Lesson 2: Avoid large ALTER TABLE operations
Adding a column to a 2B row table in ClickHouse is instant (it is metadata-only). But changing a column type requires rewriting all data parts — a process that took 6 hours on our cluster. We now treat the schema as append-only: new columns are added freely, but type changes go through a migration table.
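The contrast between the two operations (the new column name is a made-up example):

```sql
-- Instant: metadata-only change, no parts rewritten.
ALTER TABLE events ADD COLUMN ip_risk Float32 DEFAULT 0;

-- Not instant: rewrites every data part (~6 hours at 2B rows).
ALTER TABLE events MODIFY COLUMN risk_score Float64;
```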
Lesson 3: TTL with caution
ClickHouse supports automatic data expiration via TTL. We set a 90-day TTL on our events table. The gotcha: TTL deletion happens during merges, which means deleted data may persist for hours or days after the TTL expires. For compliance-critical deletion, we run explicit ALTER TABLE DELETE queries on a schedule.
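The scheduled deletion is a one-line mutation:

```sql
-- Runs on a schedule; unlike TTL, this mutation rewrites the affected
-- parts promptly instead of waiting for a background merge.
ALTER TABLE events DELETE WHERE timestamp < now() - INTERVAL 90 DAY;
```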
Cost
Our 6-node ClickHouse cluster (each node: 32 vCPU, 128 GB RAM, 2 TB NVMe) costs approximately $8,400/month on bare metal hosting. This stores 2B rows with 90-day retention and handles 50K inserts/second plus 200 concurrent dashboard queries. The cost per event stored is $0.0000042 — orders of magnitude cheaper than comparable analytics on managed cloud databases.