Over the last few years I’ve read countless op-eds from founders and C-level execs that talk about cross-device as the next (or current) big thing in marketing and advertising. They may be right, but as a data scientist, I find that the term misrepresents where we're really trying to go. If the “Holy Grail of marketing” is indeed achieving cross-device, cross-channel, cross-everything attribution – then merely starting from a particular device and walking a few steps across a messy graph to find other possibly related devices can only be the beginning of the Last Crusade. Let me explain.
A consumer’s complete digital footprint looks – to all who lack deterministic data from log-ins and who dare to respect privacy – like a fuzzy set of devices, IDs, ad requests, first-party identifiers, and third-party data segments, all dynamically popping in and out of existence as device churn and usage fluctuate. These data points are all passively and incompletely observed, and are associated with one another only with varying degrees of probability. It can be a fuzzy, messy picture. The real-life person underlying this data, however, rarely churns. And that's the treasure that marketers really seek – a stable handle into a consumer's identity. Devices are just the means to that end.
Many marketers are in an all-too-familiar camp where cross-device linkages are dissolving beneath their feet due to this churn. Therefore it behooves any good digital marketer to seek the stable digital identities underlying a roiling sea of digital soup – where the cookies, IDs, audience segments, and CRM data points are constantly evolving – by adopting more intelligent, user-centric models and systems that can see through this data churn. The question digital marketers seek to understand is: Can we make digital identity as stable as physical identity?
Another issue is that the physical boundaries between individual consumers do not carry over straightforwardly into the digital realm. For people who are truly only related by a chance encounter in a coffee shop, it is quite easy to separate their devices based on even a little observational data. But dividing lines are much more difficult to draw between devices that are more closely related (e.g., pairs of devices owned by distinct family members). Typical cross-device solutions use device graphs to find the dividing lines: devices that are possibly related to a given device can be found by starting a walk from that device's node in the graph to that device's strongly connected neighbors (very likely owned by the same user), then to those neighbors' neighbors (less so), and so on. This kind of walk is a practical, though crude, solution to the problem.
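To make that walk concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration rather than any vendor's actual method: the device names, the edge scores, the idea of multiplying scores along a path, and the `max_hops` and `min_score` cutoffs are all hypothetical.

```python
from collections import deque

def related_devices(graph, start, max_hops=2, min_score=0.5):
    """Breadth-first walk from `start` over a scored device graph.

    `graph` maps each device to {neighbor: association score in [0, 1]}.
    A path's score is the product of its edge scores (an assumed rule).
    For simplicity, each device keeps the score of the first path that
    reaches it, which is not necessarily its best-scoring path.
    """
    score_of = {start: 1.0}
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not walk past the hop limit
        for neighbor, edge_score in graph.get(node, {}).items():
            path_score = score_of[node] * edge_score
            if neighbor not in score_of and path_score >= min_score:
                score_of[neighbor] = path_score
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del score_of[start]  # report only the other devices
    return score_of

# Hypothetical graph: two people (a and b) who share a household TV.
graph = {
    "phone_a": {"tablet_a": 0.9, "tv_home": 0.6},
    "tablet_a": {"phone_a": 0.9, "laptop_b": 0.4},
    "tv_home": {"phone_a": 0.6, "phone_b": 0.7},
    "laptop_b": {"tablet_a": 0.4, "phone_b": 0.8},
    "phone_b": {"tv_home": 0.7, "laptop_b": 0.8},
}

one_hop = related_devices(graph, "phone_a", max_hops=1, min_score=0.4)
two_hops = related_devices(graph, "phone_a", max_hops=2, min_score=0.4)
```

In this toy graph, the one-hop walk returns only the tablet and the TV, while the two-hop walk also reaches `phone_b` by way of the TV. That second result is the crudeness in miniature: the walk has no notion of where one person ends and the next begins.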
To do better, the name of the game is getting scientific with how you treat the gray area in between the two extremes. Which device connections (edges) emanating from a specific portion of a graph of devices are crossing spousal or parental boundaries? Shared devices, such as connected TVs, offer another wrinkle, as they may be thought of as belonging to a household more than to any one particular user in that household. For savvy marketers looking to direct ad spend or do attribution at the individual level, rather than the household level, an effective cross-device identity solution must be able to convincingly resolve the difference between the two.
While the advanced machine-learning algorithms that are currently brought to bear on scoring device-to-device associations can be incredibly precise given enough observations, the majority of identifiers in any massive probabilistic identity graph will not be observed with the same level of completeness. As an industry, we usually cannot afford to aggressively shrink scale. That’s why we ditched deterministic-only as a realistic strategy in the first place. Unfortunately, this has led to a situation where many of the typical graphs that are shipped throughout the industry are noisy, highly interconnected, and, for many of their linkages, downright inaccurate. Without further processing, a graph of devices is too raw for difficult tasks like attribution.
That further processing is also extremely difficult to do at scale, as the ordinary task of community detection (i.e., “clustering,” or picking small sets of devices that likely belong to one user out of a noisy graph) is notoriously fraught with quality and instability issues on massive input graphs that contain tens of billions of nodes (device identifiers).
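The instability is easy to see even in a toy setting. The sketch below is a deliberately naive stand-in for community detection, not a production algorithm: it keeps only edges at or above a score threshold and groups devices by connected components using union-find. The device names, scores, and threshold values are all hypothetical.

```python
def cluster_devices(edges, threshold=0.8):
    """Group devices into connected components over edges scoring >= threshold.

    `edges` is an iterable of (device_a, device_b, score) tuples.
    Returns a sorted list of sorted device lists, one per cluster.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        root_a, root_b = find(a), find(b)  # registers both devices
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical scored edges: two people linked only through a shared TV.
edges = [
    ("phone_a", "tablet_a", 0.9),
    ("phone_a", "tv_home", 0.6),
    ("tv_home", "phone_b", 0.7),
    ("phone_b", "laptop_b", 0.85),
]

strict = cluster_devices(edges, threshold=0.8)
loose = cluster_devices(edges, threshold=0.5)
```

With the strict threshold the two people come out as separate clusters and the TV stands alone; with the loose one, all five devices collapse into a single "user." A small change in one parameter rewrites the answer, and that is on five nodes rather than tens of billions.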
The quest to construct these users at a very high level of rigor has driven data scientists across the industry to new frontiers of research in graph algorithms and machine learning on graphs. Data scientists have found it possible to deliver on this vision of user identity that faithfully represents the underlying reality with the kind of quality that is required of even the most demanding of measurement tasks.
For those who are brave enough to try to build out a cross-device capability, and to get it right, the task will run far beyond co-occurrence (devices seen together) and breadth-first search (devices also seen with those devices seen together). For those choosing the more expeditious route, get ready for more sophisticated buyers who are starting to understand this.
My advice to the brands, agencies, and enterprises that are evaluating graphs is to ask some probing questions to gauge the vendors’ understanding of the consumers behind the devices. If a vendor gives a variant of “find the devices seen together, then you have users; find other devices associated with that first group, then you have households,” you may have chosen… poorly. If, on the other hand, your vendor understands that devices are not people – then you may have found the “X” that marks the spot, and the Holy Grail is imminent.