Building a Real-Time Fraud Analytics Pipeline
Processing 50,000 fingerprint events per second, enriching each with Smart Signals, and returning a risk score in under 10 milliseconds requires a carefully designed streaming architecture. This article walks through our pipeline from ingestion to decision.
Ingestion Layer
Events arrive as HTTPS POST requests from our JavaScript agent running in visitors' browsers. Each event contains the encrypted signal payload — typically 8-12KB of compressed data covering 1,000+ browser signals. Our edge servers terminate TLS, validate the request signature, and forward the payload to the processing pipeline.
We use a multi-region deployment where edge servers are colocated with our customers' CDN nodes. This keeps the network round-trip under 20ms for 95% of requests globally. The edge servers are stateless Go services running behind a load balancer, scaling horizontally based on request volume.
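The request signature check the edge performs can be sketched as an HMAC verification. This is a minimal illustration, not our production scheme — the function names are ours for this example, and the real check also covers timestamps and key rotation:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signPayload produces a hex HMAC-SHA256 signature over the payload,
// standing in for what the agent attaches to each request.
func signPayload(payload, secret []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	return hex.EncodeToString(mac.Sum(nil))
}

// validateSignature is the edge-side check: recompute the MAC and compare
// in constant time before forwarding the payload to the pipeline.
func validateSignature(payload []byte, signatureHex string, secret []byte) bool {
	got, err := hex.DecodeString(signatureHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	return hmac.Equal(mac.Sum(nil), got)
}

func main() {
	secret := []byte("demo-secret")
	payload := []byte(`{"signals":"..."}`)
	sig := signPayload(payload, secret)
	fmt.Println(validateSignature(payload, sig, secret))        // true
	fmt.Println(validateSignature(payload, "deadbeef", secret)) // false
}
```

Because the check is a pure function of the payload and a shared secret, it runs entirely at the edge with no pipeline round-trip, which matters for the latency budget discussed later.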
Signal Extraction
The first processing stage decrypts and parses the signal payload. Each signal is extracted, validated, and typed. Canvas hashes are verified against known impossible values (which indicate canvas blocking or spoofing). WebGL parameters are cross-validated for consistency. Navigator properties are checked against known valid combinations.
This stage also performs signal normalization. User agent strings are parsed into structured components (browser, version, OS, device). Screen dimensions are normalized to account for DPI scaling. Timezone offsets are validated against the IP geolocation data.
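As one example of normalization, DPI scaling can make the same physical display report different CSS-pixel dimensions. A minimal sketch of the correction (the function name and clamping behavior here are illustrative):

```go
package main

import "fmt"

// normalizedScreen converts reported CSS-pixel dimensions to physical pixels
// using the device pixel ratio, so a 1440x900 viewport at 2x DPR normalizes
// to the same value as a native 2880x1800 report. The real stage also
// validates against implausible values.
func normalizedScreen(width, height int, dpr float64) (int, int) {
	if dpr <= 0 {
		dpr = 1 // guard against missing or spoofed ratios
	}
	return int(float64(width) * dpr), int(float64(height) * dpr)
}

func main() {
	w, h := normalizedScreen(1440, 900, 2.0)
	fmt.Println(w, h) // 2880 1800
}
```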
Smart Signals Enrichment
The extracted signals are then enriched with Smart Signals analysis — our server-side intelligence layer. This includes incognito detection (comparing signal patterns against known private browsing signatures), VPN detection (cross-referencing IP data with timezone and locale signals), browser tampering detection (identifying inconsistencies that indicate signal spoofing), and virtual machine detection (recognizing hardware profiles associated with VMware, VirtualBox, and cloud VMs).
Each smart signal is computed independently and produces both a boolean result and a confidence score. The enrichment stage adds 24 additional signals to each event, providing a comprehensive threat assessment that goes beyond what client-side collection alone can achieve.
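The boolean-plus-confidence shape can be sketched with a toy VPN check. The heuristic below (timezone offset vs. IP-derived offset) is deliberately simplified and the confidence values are made up for illustration — real detection cross-references many more inputs:

```go
package main

import "fmt"

// SmartSignal pairs a boolean verdict with a confidence score, matching the
// shape each enrichment check produces.
type SmartSignal struct {
	Detected   bool
	Confidence float64
}

// detectVPN is a toy heuristic: if the browser's reported timezone offset
// disagrees with the offset implied by IP geolocation, raise a VPN signal.
func detectVPN(browserTZOffsetMin, ipTZOffsetMin int) SmartSignal {
	if browserTZOffsetMin != ipTZOffsetMin {
		return SmartSignal{Detected: true, Confidence: 0.8}
	}
	return SmartSignal{Detected: false, Confidence: 0.95}
}

func main() {
	// Browser says UTC-5 (-300 min), IP geolocates to UTC+1 (60 min).
	fmt.Println(detectVPN(-300, 60)) // {true 0.8}
}
```

Because each check is independent like this, the 24 smart signals can be computed in parallel within the enrichment stage.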
Risk Scoring Engine
The enriched event is passed to our risk scoring engine — a gradient-boosted decision tree model trained on millions of labeled events. The model considers all 1,000+ raw signals, 24 smart signals, and several derived features: velocity metrics (how many events from this device in the last 5 minutes, 1 hour, and 24 hours), historical behavior patterns, and network reputation scores.
The model outputs a risk score between 0 and 100, along with the top contributing factors. A score of 85, for example, might be accompanied by factors like "VPN detected," "incognito mode," and "high velocity — 47 events in 5 minutes." This explainability is critical for fraud analysts who need to understand why a particular event was flagged.
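The "top contributing factors" output can be sketched as ranking per-feature contributions, the kind of additive attributions tree-ensemble explainers produce. The factor names and values below are invented for illustration; the real attributions come from the trained model:

```go
package main

import (
	"fmt"
	"sort"
)

// Contribution is a feature's additive effect on the model's raw score.
type Contribution struct {
	Factor string
	Value  float64
}

// topFactors sorts contributions largest-first and returns the n biggest,
// which is how a score gets its human-readable "why".
func topFactors(contribs []Contribution, n int) []string {
	sort.Slice(contribs, func(i, j int) bool {
		return contribs[i].Value > contribs[j].Value
	})
	if n > len(contribs) {
		n = len(contribs)
	}
	out := make([]string, 0, n)
	for _, c := range contribs[:n] {
		out = append(out, c.Factor)
	}
	return out
}

func main() {
	contribs := []Contribution{
		{"VPN detected", 18.2},
		{"incognito mode", 9.1},
		{"high velocity", 22.5},
		{"consistent canvas hash", -4.0}, // negative: pushes score down
	}
	fmt.Println(topFactors(contribs, 3)) // [high velocity VPN detected incognito mode]
}
```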
Storage and Query Layer
All events are persisted to ClickHouse — a columnar database optimized for analytical queries over large datasets. ClickHouse handles our write volume (50K events/second) without breaking a sweat, and its columnar storage enables sub-second analytical queries over billions of rows.
We use a multi-tier retention strategy. Hot data (last 7 days) is stored on NVMe SSDs for sub-100ms query response. Warm data (7-90 days) is on standard SSDs. Cold data (90+ days) is compressed and moved to object storage, queryable but with higher latency.
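The tier boundaries above reduce to a simple age-based mapping, sketched here for clarity:

```go
package main

import "fmt"

// storageTier maps an event's age to the retention tier described above,
// mirroring the 7-day and 90-day boundaries in the text.
func storageTier(ageDays int) string {
	switch {
	case ageDays <= 7:
		return "hot" // NVMe SSD, sub-100ms queries
	case ageDays <= 90:
		return "warm" // standard SSD
	default:
		return "cold" // compressed, object storage, higher latency
	}
}

func main() {
	fmt.Println(storageTier(3), storageTier(30), storageTier(365)) // hot warm cold
}
```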
Kafka as the Backbone
Apache Kafka ties the pipeline together. Every stage reads from and writes to Kafka topics. The ingestion layer writes raw events. The signal extraction stage reads raw events and writes extracted events. The Smart Signals enrichment stage reads extracted events and writes enriched events. The risk scoring engine reads enriched events and writes scored events.
This architecture provides several advantages: stages can be scaled independently, failures in one stage do not affect others, and we can replay events through any stage for debugging or reprocessing. Kafka's consumer groups enable parallel processing within each stage, and its exactly-once semantics ensure that no event is processed twice or lost.
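Every stage follows the same consume-transform-produce shape. The sketch below shows that loop against minimal in-memory stand-ins so it runs without a broker — the `Consumer`/`Producer` interfaces are our abstraction for this example, not any particular Kafka client's API, and offset commits, batching, and error handling are elided:

```go
package main

import "fmt"

// Consumer and Producer are minimal stand-ins for a Kafka client.
type Consumer interface {
	Poll() (msg []byte, ok bool)
}
type Producer interface {
	Send(topic string, msg []byte)
}

// runStage is the loop every pipeline stage shares: read from the input
// topic, transform, write to the output topic.
func runStage(in Consumer, out Producer, outTopic string, transform func([]byte) []byte) {
	for {
		msg, ok := in.Poll()
		if !ok {
			return
		}
		out.Send(outTopic, transform(msg))
	}
}

// In-memory implementations so the sketch runs standalone.
type sliceConsumer struct{ msgs [][]byte }

func (c *sliceConsumer) Poll() ([]byte, bool) {
	if len(c.msgs) == 0 {
		return nil, false
	}
	m := c.msgs[0]
	c.msgs = c.msgs[1:]
	return m, true
}

type captureProducer struct{ sent [][]byte }

func (p *captureProducer) Send(topic string, msg []byte) {
	p.sent = append(p.sent, msg)
}

func main() {
	in := &sliceConsumer{msgs: [][]byte{[]byte("raw-1"), []byte("raw-2")}}
	out := &captureProducer{}
	// A stage is just a pure transform between two topics.
	runStage(in, out, "events.extracted", func(m []byte) []byte {
		return append([]byte("extracted:"), m...)
	})
	for _, m := range out.sent {
		fmt.Println(string(m))
	}
}
```

Because the stage logic is a pure transform between topics, replaying events for debugging is just re-running the same loop over an earlier offset.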
Latency Budget
Our end-to-end latency target is 10ms from the moment the enriched signal payload arrives at the processing pipeline to the moment the risk score is returned. Here is how the budget breaks down: signal extraction takes 1-2ms, Smart Signals enrichment takes 3-4ms, risk scoring takes 2-3ms, and serialization and response take 1-2ms. The Kafka hop between stages adds less than 1ms in our co-located deployment.
Meeting this budget consistently at 50K events/second requires careful optimization at every stage. We use pre-allocated memory pools, zero-copy serialization, and batched ClickHouse writes. The risk scoring model is compiled to native code using ONNX Runtime, eliminating Python interpreter overhead.
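The pre-allocated memory pools can be sketched with Go's standard `sync.Pool`, reusing payload buffers across events so the hot path avoids per-request allocation (the 16KB capacity here is illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable payload buffers. New is only called when the
// pool is empty, so steady-state traffic recycles existing buffers.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 16<<10) },
}

// process borrows a buffer, works on a copy of the payload, and returns the
// buffer to the pool with its length reset but capacity retained.
func process(payload []byte) int {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf[:0]) // reset length, keep capacity for reuse
	buf = append(buf, payload...)
	return len(buf)
}

func main() {
	fmt.Println(process([]byte("event"))) // 5
}
```

`sync.Pool` is GC-aware (idle buffers are eventually reclaimed), which makes it a common fit for this kind of bursty, high-throughput path.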
Mark spent two weeks profiling the pipeline before finding the bottleneck in our distributed lookup layer — a single mutex was serializing lookups across all goroutines. After switching to a sharded lock design, p99 dropped from 48ms to 9ms. Sometimes the fix is embarrassingly simple once you find it.
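A sharded lock design like the one that fixed our p99 can be sketched as a map split across independently locked shards, so lookups for different keys rarely contend on the same mutex. The shard count and key type below are illustrative; the right count depends on core count and key skew:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numShards = 64

// shard pairs one lock with one slice of the keyspace.
type shard struct {
	mu sync.RWMutex
	m  map[string]int
}

// shardedMap replaces a single global mutex with one lock per shard.
type shardedMap struct {
	shards [numShards]*shard
}

func newShardedMap() *shardedMap {
	s := &shardedMap{}
	for i := range s.shards {
		s.shards[i] = &shard{m: make(map[string]int)}
	}
	return s
}

// pick hashes the key to a shard; FNV-1a is cheap and spreads keys evenly.
func (s *shardedMap) pick(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s.shards[h.Sum32()%numShards]
}

func (s *shardedMap) Get(key string) (int, bool) {
	sh := s.pick(key)
	sh.mu.RLock()
	defer sh.mu.RUnlock()
	v, ok := sh.m[key]
	return v, ok
}

func (s *shardedMap) Set(key string, v int) {
	sh := s.pick(key)
	sh.mu.Lock()
	sh.m[key] = v
	sh.mu.Unlock()
}

func main() {
	m := newShardedMap()
	m.Set("device-123", 47)
	v, ok := m.Get("device-123")
	fmt.Println(v, ok) // 47 true
}
```

With 64 shards, goroutines only contend when two lookups hash to the same shard, which is why the change collapsed the tail latency so dramatically.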