The Math Behind Cross-Session Device Matching
Device fingerprinting faces a fundamental challenge: signals change. Browsers update, users modify settings, fonts are installed and removed. A strict comparison would treat every changed signal as a new device, destroying identification accuracy. Our cross-session matching system solves this by quantifying similarity rather than requiring exact equality, combining set-overlap metrics, probabilistic indexing, and learned embeddings.
The Problem: Signal Drift
Consider a device that was fingerprinted yesterday and returns today after a browser update. The user agent string has changed. Two new CSS features are now supported. A WebGL extension was added. The canvas rendering remains identical (same GPU, same driver). The audio fingerprint is identical. The WebGL parameters are identical except for the new extension.
With exact matching, this device would not be recognized — the combined fingerprint has changed. But intuitively, we know this is the same device. The hardware signals are identical, and the software changes are consistent with a browser update. Our cross-session matching system formalizes this intuition.
Set-Based Comparison for Feature Signals
Many browser signals are naturally represented as sets: the set of supported CSS features, the set of available fonts, the set of WebGL extensions. For these signals, we measure overlap with the Jaccard index: for two sets A and B, the similarity is |A ∩ B| / |A ∪ B|, the ratio of shared elements to total distinct elements.
A device with 45 CSS features yesterday and 47 today (with 44 in common) scores 44 / 48 ≈ 0.92, high enough to indicate the same device after a browser update. A completely different device might share only 30 CSS features, giving a much lower similarity score. The threshold between "same device" and "different device" is learned from labeled data.
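The comparison above can be sketched in a few lines of Python. The integer feature IDs are stand-ins for this example; real signals are feature-name strings.

```python
def jaccard_similarity(a, b):
    """Ratio of shared elements to total distinct elements: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # two empty feature sets are trivially identical
    return len(a & b) / len(a | b)

# Worked example from the text: 45 features yesterday, 47 today, 44 shared.
yesterday = set(range(45))              # stand-in feature IDs 0..44
today = set(range(44)) | {45, 46, 47}   # keeps 0..43, drops 44, adds 3 new
print(jaccard_similarity(yesterday, today))  # 44 / 48 ≈ 0.917
```

Note that the denominator is the size of the union (48), not the size of either set, so adding features and removing features both pull the score down symmetrically.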
Efficient Candidate Generation
Computing similarity between every pair of devices would be prohibitively expensive at scale. Our candidate generation system uses locality-sensitive hashing, an indexing technique that maps similar items to the same lookup bucket with high probability, allowing us to find potential matches in near-constant time per query.
This bucketing eliminates 99.9% of comparisons in the candidate generation phase, keeping the system efficient even with billions of device profiles.
AI-Powered Analysis for Complex Signals
Some signals do not decompose neatly into sets. Canvas fingerprints, audio processing output, and WebGL parameter vectors are complex data where set comparison does not apply. For these signals, we use a learned embedding model that maps device profiles into a vector space where similar devices are close together.
The AI model captures non-obvious relationships between signals. For example, it learns that a change in the WebGL renderer string from one GPU model to a slightly upgraded version of the same model represents a GPU upgrade on the same machine, while a change to a completely different GPU vendor represents a different device entirely.
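In that learned representation, scoring reduces to a distance computation. The sketch below assumes the embedding vectors already exist; the three-dimensional values are invented for illustration (real embeddings are much higher-dimensional), and cosine similarity is one common choice of distance.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings (illustrative values only): the same machine after a
# minor GPU driver/model change vs. a profile from a different GPU vendor.
same_machine_upgraded_gpu = ([0.8, 0.1, 0.55], [0.82, 0.12, 0.52])
different_vendor_gpu = ([0.8, 0.1, 0.55], [-0.3, 0.9, 0.1])

print(cosine_similarity(*same_machine_upgraded_gpu))  # close to 1.0
print(cosine_similarity(*different_vendor_gpu))       # much lower
```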
Combining the Techniques
Our production system uses multiple techniques in a cascade. First, efficient candidate generation identifies potential matches. Second, set-based comparison provides a precise measure of overlap for feature-based signals. Third, AI-powered analysis scores the similarity of hardware-dependent signals. The final confidence score is a weighted combination of all methods, with weights tuned on labeled data.
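The final combination step can be sketched as a weighted average, assuming per-signal scores in [0, 1]. The signal names and weights below are illustrative only; as noted above, the real weights are tuned on labeled data.

```python
# Illustrative weights; real values are tuned on labeled match/non-match pairs.
WEIGHTS = {
    "css_features": 0.25,        # set-based (Jaccard) scores
    "fonts": 0.15,
    "webgl_extensions": 0.20,
    "hardware_embedding": 0.40,  # embedding similarity for canvas/audio/WebGL
}

def combined_confidence(scores):
    """Weighted average of per-signal similarity scores.

    Normalizing by the weights actually present keeps the result in [0, 1]
    even when a signal is missing for this visit.
    """
    total_weight = sum(WEIGHTS[name] for name in scores)
    return sum(WEIGHTS[name] * s for name, s in scores.items()) / total_weight

confidence = combined_confidence({
    "css_features": 0.92,        # 44/48 overlap after the browser update
    "fonts": 1.00,
    "webgl_extensions": 0.95,
    "hardware_embedding": 0.99,
})
print(round(confidence, 3))  # 0.966
```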
This cascade architecture is both accurate and efficient. The total matching time for a returning visitor is under 5ms on average.