ClickHouse for Device Analytics at Billion-Row Scale
When we launched tracio.ai, PostgreSQL handled everything: visitor storage, event logging, analytics queries, and dashboard rendering. It worked beautifully until we crossed 100 million events. At that point, our analytics queries — "show me unique visitors by country over the last 30 days" or "aggregate bot detection rates by hour" — started taking 30+ seconds. Dashboard pages timed out. Our monitoring showed PostgreSQL spending most of its time on sequential scans across wide rows.
Why Column-Oriented Storage Matters
The fundamental issue is that analytical queries touch a small number of columns across many rows, while PostgreSQL stores data row-by-row. A query like "count distinct visitorId where botDetected = true and timestamp > 30 days ago" needs only three columns, but PostgreSQL reads the entire row — including the full signal payload, IP metadata, and 24 smart signal results — for every matching record.
ClickHouse stores data column by column. That same query reads only the three relevant columns, skipping everything else. With compression (LZ4 by default, ZSTD for cold data), each column compresses independently and achieves 5-15x compression ratios because similar values are stored adjacently.
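The effect is easy to see in miniature. This toy model (synthetic data, illustrative column names) compares how many bytes the query above must read under row-oriented versus column-oriented layout, and why adjacent similar values compress so well:

```python
import json
import zlib

# 10,000 synthetic events, each with a wide payload the analytics query ignores.
events = [
    {
        "visitor_id": f"v{i % 500}",
        "bot_detected": i % 7 == 0,
        "ts": 1_700_000_000 + i,
        "payload": {"signals": [i % 3] * 24, "ip_meta": "203.0.113.0/24"},
    }
    for i in range(10_000)
]

# Row store: the query deserializes every full row, payload included.
row_bytes = sum(len(json.dumps(e).encode()) for e in events)

# Column store: only the three columns the query touches are read.
cols = {
    "visitor_id": [e["visitor_id"] for e in events],
    "bot_detected": [e["bot_detected"] for e in events],
    "ts": [e["ts"] for e in events],
}
col_bytes = sum(len(json.dumps(v).encode()) for v in cols.values())
print(f"row read: {row_bytes:,} B, column read: {col_bytes:,} B")

# Similar values stored adjacently also compress far better than mixed rows.
raw = json.dumps(cols["visitor_id"]).encode()
print(f"visitor_id column compresses {len(raw) / len(zlib.compress(raw)):.1f}x")
```

The exact ratios depend on the data, but the shape of the result is the point: the columnar read touches a fraction of the bytes, and the repetitive `visitor_id` column compresses much harder than interleaved row data would.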
Our Schema Design
We designed our ClickHouse schema around the MergeTree engine family. The primary table uses ReplacingMergeTree with the visitor hash as the sorting key. This means queries that filter by visitor ID hit a single sorted range — effectively an index seek.
The partition key is based on month, which means old data can be dropped efficiently (entire partitions are removed, not individual rows). We retain 12 months of detailed event data and 36 months of aggregated summaries.
ClickHouse's primary index is sparse by design: it stores one mark per granule (8,192 rows by default), so even with billions of rows the index stays small enough to live in memory. For high-cardinality columns outside the sorting key, such as requestId, data-skipping indexes (for example, a bloom filter) let queries prune granules that cannot match.
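The design above can be sketched as DDL plus the retention arithmetic. Column names, codecs, and the exact sorting key here are illustrative assumptions, not our production schema:

```python
from datetime import date

# Hedged sketch of the events table; ReplacingMergeTree collapses rows with an
# identical sorting key on merge, so the key includes request_id to keep
# distinct events apart while deduplicating retried inserts.
EVENTS_DDL = """
CREATE TABLE events
(
    visitor_hash UInt64,
    request_id   String,
    ts           DateTime,
    country      LowCardinality(String),
    bot_detected UInt8,
    payload      String CODEC(ZSTD(3))    -- cold, wide signal payload
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(ts)                 -- monthly partitions: retention = cheap drops
ORDER BY (visitor_hash, ts, request_id)   -- visitor filters hit one sorted range
"""

def month_partition(d: date) -> str:
    """Partition ID as produced by toYYYYMM(ts)."""
    return f"{d.year:04d}{d.month:02d}"

def is_expired(partition: str, today: date, retain_months: int = 12) -> bool:
    """True if a monthly partition falls outside the retention window,
    i.e. it can be dropped whole instead of deleting individual rows."""
    months = today.year * 12 + (today.month - 1) - retain_months
    cutoff = f"{months // 12:04d}{months % 12 + 1:02d}"
    return partition < cutoff

print(is_expired(month_partition(date(2023, 1, 1)), today=date(2024, 3, 1)))  # → True
```

Because the partition ID is just the month, retention is a metadata operation: any partition older than the cutoff is dropped in one shot.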
Sharding Strategy
We run a 6-node ClickHouse cluster with 2 shards and 3 replicas. Data is distributed using the visitor hash modulo 2, which ensures that all events for a given visitor land on the same shard. This is critical for queries like "show all events for visitor X" — they hit a single shard instead of scattering across the cluster.
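The routing rule itself is tiny. A minimal sketch (function and constant names are illustrative) using a stable hash, since Python's built-in `hash()` is randomized per process and every writer must agree on the mapping:

```python
import zlib

NUM_SHARDS = 2

def shard_for(visitor_id: str) -> int:
    """All events for one visitor map to one shard, so per-visitor
    queries never scatter across the cluster."""
    # crc32 is a stand-in for whatever stable hash the ingest tier uses.
    return zlib.crc32(visitor_id.encode()) % NUM_SHARDS

# The same visitor always routes to the same shard:
print(shard_for("visitor-abc123") == shard_for("visitor-abc123"))  # → True
```

The tradeoff is the usual one for hash sharding: single-visitor queries are fast, but resharding to a different shard count means rehashing and moving data.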
Replication uses ClickHouse's built-in ReplicatedMergeTree engine backed by ClickHouse Keeper (a ZooKeeper-compatible coordination service). Each shard has 3 replicas for redundancy. Reads are load-balanced across replicas, and writes go to any replica and are automatically propagated.
Query Performance Results
The improvement was dramatic. Here are real numbers from our production cluster processing 3.2 billion events:
- Unique visitors, last 30 days: 47 s (PostgreSQL) → 380 ms (ClickHouse)
- Bot detection rate by hour: 22 s → 120 ms
- Top 10 countries by visitor count: 35 s → 95 ms
- Signal distribution histogram: 55 s → 210 ms
These numbers include network round-trip from our API servers to the ClickHouse cluster. The ClickHouse query execution time is typically under 50ms.
Materialized Views for Real-Time Dashboards
For our customer-facing dashboard, we use ClickHouse's materialized views to pre-aggregate common queries. When an event is inserted into the raw events table, materialized views automatically update hourly, daily, and monthly rollup tables. Dashboard queries hit these rollup tables instead of scanning raw events.
This pattern gives us sub-10ms dashboard responses regardless of the underlying data volume. The tradeoff is storage: rollup tables add approximately 5% overhead. For our use case, this is an excellent deal.
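In miniature, the insert-time rollup looks like this. It is a toy in-memory model with illustrative names; real ClickHouse materialized views run server-side, typically feeding an AggregatingMergeTree table with `uniqState()` rather than Python sets:

```python
from collections import defaultdict

raw_events = []                      # stand-in for the raw events table
hourly_visitors = defaultdict(set)   # stand-in for the hourly rollup table

def insert_event(hour: int, visitor_id: str) -> None:
    raw_events.append((hour, visitor_id))
    hourly_visitors[hour].add(visitor_id)  # the "materialized view" step

def unique_visitors(hour: int) -> int:
    return len(hourly_visitors[hour])      # dashboard read: no raw scan

for h, v in [(0, "a"), (0, "b"), (0, "a"), (1, "c")]:
    insert_event(h, v)
print(unique_visitors(0), unique_visitors(1))  # → 2 1
```

The shape is the same at any scale: aggregation cost is paid once at insert time, and dashboard reads stay proportional to the number of buckets, not the number of events.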
Lessons Learned
ClickHouse is not a general-purpose database. It excels at analytical workloads but struggles with point lookups, updates, and transactions. We still use PostgreSQL for account management, API key storage, and configuration. Redis handles the hot path: visitor ID lookups and real-time bot scoring.
The key insight is that different access patterns need different storage engines. Our architecture uses three data stores — Redis for sub-millisecond lookups, PostgreSQL for transactional data, and ClickHouse for analytics — each optimized for its specific workload. The complexity is worth it when your analytics tier needs to handle billions of rows with sub-second response times.