The Art of Data Fusion
Magic or Illusion?

Wands at the ready.

You will not sit through a data strategy webinar without hearing the term ‘data integration’, and with good reason. Your people aside, it is your most important source of competitive advantage.

Why is data integration so critical?

  • Your competitors have the same data sources
  • Your competitors use the same analytics tools
  • Data collection is limited in scope by data privacy
  • Data is less credible in isolation

Within customer insights, competitive advantage is found when your data sources “talk to each other”, allowing you to “connect the dots” and “fill in the gaps.” These throwaway stock phrases gloss over the fact that “bringing data together” is hard. Very hard. In most cases, impossible.

For some, it means more PowerPoint slides, a jigsaw of disparate sources, with a narrative relating one graph to the next. For others, it means plotting different data sources on the same timeline to spot trends. For the lucky few, it is the hope that enough customers are common across databases.

To me, it means something else entirely.

The illusion

To understand data fusion, first meet its cousin. Imputation is used to mend partial and inconsistent data, filling in missing data points with a series of calculated guesses. It draws on the relationships found in complete records to create the illusion that the new data points are real.

If Ted didn’t share his income, we can still predict it using other details, such as age and occupation.

We’re not going to be correct all the time, but you can imagine how we could be reasonably accurate at an overall level. And because we can also run the imputation over complete records, we enjoy the comfort of knowing how accurate we are, by comparing our imputed values against the real ones.

Unfortunately, as each field must be imputed according to its own set of rules, we upset the inter-relationships among all the imputed data points. If we need to impute a little, we’re on very solid ground, but if we need to impute a lot, we quickly lose consistency, and the illusion dissipates.

In other words, one cannot make independent guesses to build a coherent picture. A good analogy is ‘Paint by Numbers’: Imagine most of the numbers were missing and we didn’t always know which bit of the painting we were working on at any time. Each adjacent colour choice might be a reasonably good guess, but as a whole, the painting won’t look like anything.
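To make the contrast concrete before moving on, here is a minimal sketch of regression-based imputation. Everything in it is illustrative: the toy records, the column names, and the choice of a plain linear model.

```python
# A minimal sketch of imputation: predict a missing field from complete records.
# Toy data and model choice are illustrative assumptions, not a recipe.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age":        [25, 32, 40, 51, 25],
    "occupation": ["guitarist", "teacher", "engineer", "manager", "guitarist"],
    "income":     [25_000, 30_000, 48_000, 60_000, None],  # Ted kept his to himself
})

X = pd.get_dummies(df[["age", "occupation"]])   # encode occupation as dummy columns
known = df["income"].notna()

model = LinearRegression().fit(X[known], df.loc[known, "income"])
df.loc[~known, "income"] = model.predict(X[~known])  # the calculated guess for Ted

# Caveat from above: each field imputed this way obeys its own rule, so impute
# many fields and the inter-relationships between them start to drift apart.
print(df)
```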

Data fusion approaches the problem differently, by matching each incomplete record with a complete one. We can then borrow information from the complete record to fill in everything that is missing. Our goal is not an accurate prediction, but a close enough match.

If we don't know Ted’s income but have some data to suggest Bill has lots in common with Ted, we can approximate Ted's income using Bill’s. In the example below, Bill and Ted both enjoy excellent adventures, and love Rock 'n Roll. We'll stick Bill and Ted's data together, as if Ted is answering in place of Bill, or vice versa.

A single match now mends all the missing data points in one fell swoop, and so we retain perfect consistency across them. We’re also putting all our effort into making one calculated guess, rather than hundreds.

Bill and Ted may be slightly different, but again, you can imagine how we can be reasonably accurate at an overall level.

Donors

| Name  | Adventures | Music        | Income |
|-------|------------|--------------|--------|
| Bill  | Excellent! | Rock 'n Roll | £25k   |
| Rufus | Excellent! | Rock 'n Roll | £30k   |
| John  | Bogus      | Smooth Jazz  | £50k   |

Recipients

| Name | Adventures | Music        | Age  | Occupation |
|------|------------|--------------|------|------------|
| Ted  | Excellent! | Rock 'n Roll | 25   | Guitarist  |
| Grim | Bogus      | Smooth Jazz  | 4000 | Reaper     |

Synthetic

| Name      | Adventures | Music        | Income | Age  | Occupation |
|-----------|------------|--------------|--------|------|------------|
| Bill/Ted  | Excellent! | Rock 'n Roll | £25k   | 25   | Guitarist  |
| John/Grim | Bogus      | Smooth Jazz  | £50k   | 4000 | Reaper     |
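For a feel of the mechanics, here is a minimal hot-deck sketch that reproduces the tables above. The exact-agreement distance on the hooks is a simplification chosen for brevity; real fusions use far richer distance functions.

```python
# A minimal sketch of data fusion as hot-deck matching: each recipient borrows
# the missing field from its nearest donor on the hook variables.
import pandas as pd

donors = pd.DataFrame({
    "name":       ["Bill", "Rufus", "John"],
    "adventures": ["Excellent!", "Excellent!", "Bogus"],
    "music":      ["Rock 'n Roll", "Rock 'n Roll", "Smooth Jazz"],
    "income":     ["£25k", "£30k", "£50k"],
})
recipients = pd.DataFrame({
    "name":       ["Ted", "Grim"],
    "adventures": ["Excellent!", "Bogus"],
    "music":      ["Rock 'n Roll", "Smooth Jazz"],
    "age":        [25, 4000],
    "occupation": ["Guitarist", "Reaper"],
})

hooks = ["adventures", "music"]

def best_donor(recipient):
    # Distance = number of hooks on which donor and recipient disagree.
    distances = (donors[hooks] != recipient[hooks]).sum(axis=1)
    return donors.loc[distances.idxmin()]

rows = []
for _, r in recipients.iterrows():
    d = best_donor(r)
    merged = r.to_dict()                        # B and C come from the recipient
    merged["income"] = d["income"]              # A is borrowed from the donor
    merged["name"] = f"{d['name']}/{r['name']}"
    rows.append(merged)

print(pd.DataFrame(rows))  # Bill/Ted and John/Grim, as in the table above
```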

Immediately, it becomes obvious when fusion is most valuable: not when we are missing a few data points, but when we are missing all of them!

The Magic

All fusions share these basic principles:

  • The first dataset contains the donors (holding variables A and B); the other contains the recipients (holding B and C).
  • If we want to analyse A by C, then the relationship between A and B, and between B and C, must both be strong.
  • B is common to both datasets and represents a set of variables called ‘hooks’; they are the keystone. If they are weak, the whole thing comes crashing down. Using these relationships correctly is where the work goes in (the sketch after this list shows the layout in miniature).
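```python
# The variable geography of a fusion: A lives only with donors, C only with
# recipients, and B (the hooks) is the bridge everything else hangs from.
# The variable names are simply the ones from the Bill and Ted example.
donor_vars     = {"A": ["income"], "B": ["adventures", "music"]}
recipient_vars = {"C": ["age", "occupation"], "B": ["adventures", "music"]}

hooks = set(donor_vars["B"]) & set(recipient_vars["B"])
assert hooks, "No common variables: nothing to match on, no fusion possible."
# To analyse A by C, both links in the chain (A with B, and B with C) must be
# strong; weak hooks mean the fused A-by-C analysis is built on sand.
```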

Usually, the smaller dataset will contain the recipients, giving each recipient the best possible choice of donors. The greater the ratio of donors to recipients, the better. Like many things in life, it all boils down to having a better set of options. No amount of mathematical hocus-pocus can make up for that.

Once merged, the synthetic dataset behaves nearly as well as if each record were the same individual. This only works at an overall level, and only because the rules by which donors are selected are consistent across all recipients. That said, it does not behave exactly as a single-source dataset would. No matter how similar, Bill is not Ted.

Data fusion comes in different guises: some dead simple, others leagues beyond my understanding. Some take months of work; others are set up to run 'on the fly'. They differ in how they go about matching and representing the original data. To get a feel for how this might be done, let’s take a closer look at the mechanics...

There are three magical ingredients to a fusion:

The first magical ingredient

At this point, you already have a good idea how fusion works.

Step 1: There is a calculation to work out how close each potential match is.
Step 2: Individuals are then merged with those who are most similar.

But there is also a middle step. Even if the datasets were the same size, we can’t simply match 1:1. What happens if a donor is the best match for the first recipient and we go ahead with the match-up… only to find that a second recipient, who could have matched almost as well with that donor, now has no good options left to choose from? That’s right: it’s an optimisation problem! And if we treat it as one, a great proportion of recipients will not get their closest match. It’s not a zero-sum game; everyone can potentially lose. Sometimes, however, the best way to solve a problem is not to. We could just use some donors more than once. Yes, there’s the sleight of hand.
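Here is a minimal sketch of that middle step on an invented distance matrix, contrasting a greedy 1:1 match with an optimal assignment. SciPy's Hungarian-algorithm solver stands in for whatever a production fusion would actually use.

```python
# Greedy 1:1 matching can strand later recipients with poor donors; an optimal
# assignment minimises the total distance across everyone. Numbers are invented.
import numpy as np
from scipy.optimize import linear_sum_assignment

# distance[i, j] = how dissimilar recipient i is to donor j
distance = np.array([
    [1.0, 2.0, 9.0],
    [1.1, 8.0, 9.0],
    [7.0, 1.5, 2.0],
])

# Greedy: each recipient, in order, grabs the closest donor still available.
taken, greedy = set(), []
for i in range(distance.shape[0]):
    j = min((c for c in range(distance.shape[1]) if c not in taken),
            key=lambda c: distance[i, c])
    taken.add(j)
    greedy.append((i, j))

rows, cols = linear_sum_assignment(distance)   # optimal 1:1 assignment

print("greedy :", greedy, "total =", sum(distance[i, j] for i, j in greedy))
print("optimal:", list(zip(rows, cols)), "total =", distance[rows, cols].sum())
# The sleight of hand in the text goes further still: relax the 1:1 constraint
# altogether and let some donors serve more than one recipient.
```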

Some fusion approaches benefit from reusing donors, although this too comes with its own drawbacks, as some individuals are more ‘normal’ than others. These ‘normal’ donors are middle-aged, earn a median income, have mainstream opinions, drive a Ford to Tesco to buy PG Tips, then go home to watch BBC 1 while planning a European package holiday (insert your own punchline here).

Donors like these will be good matches for many recipients, in different ways. So should we choose to reuse donors, we’ll find ourselves reusing Average Joe donors many times over. That’s fine to a point, but we don’t want to end up with a dataset that is all Joes. More technically stated, when we use a donor more than once, we reduce the effective sample size and erode the statistical reliability of our final analyses. Worse still, the tendency creates a ‘regression to the mean’ effect, flattening out differences which could otherwise be insightful.

So, we might want to place a hand on the scales, including slightly poorer matches where reasonable, particularly those that the Average Joes would otherwise inch out of the running. But no matter how we strike the balance between tightness of match and representation, we still end up using some individuals more than others. We are always going to land up with a skewed, unrepresentative dataset, and we know that when we lose representation, our estimates will be wrong.

For example, if 70% of the customers in our donor pool are loyal, but in our fused database only 40% are, the first question we’ll be fielding is, “Where have all our sales gone?” Tricky.
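One simple way to place that hand on the scales is to penalise each donor's distance by how often they have already been used, so Average Joe cannot win every contest. A minimal sketch; the penalty weight is an invented tuning knob, not a standard value.

```python
# Usage-penalised matching: a slightly poorer match beats an over-used donor.
import numpy as np

def match_with_penalty(distance, penalty=0.5):
    """distance[i, j]: recipient i vs donor j. Returns one donor per recipient."""
    usage = np.zeros(distance.shape[1])
    matches = []
    for i in range(distance.shape[0]):
        effective = distance[i] + penalty * usage
        j = int(np.argmin(effective))
        usage[j] += 1
        matches.append(j)
    return matches

# Donor 0 is everyone's favourite, but the penalty shares the load around.
distance = np.array([[1.0, 1.2],
                     [1.0, 1.2],
                     [1.0, 1.2]])
print(match_with_penalty(distance))  # [0, 1, 0] rather than [0, 0, 0]
```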

If we want to represent our donors properly, we need to solve what statisticians call a “transportation problem”.

That’s right - we’ll need more magic. Lots.


The second magical ingredient

So far, we haven’t created a synthetic dataset so much as borrowed records from one dataset and stuck them onto another. We’ve tried to find the best matches among a representative spread of donors. However, because of this nasty transportation problem, we could, for instance, be selecting too many younger customers, or too many rural ones, and ignoring the rest.

To solve it, we could try structuring the task differently and introduce more matching rules:

  • let's treat both datasets as donors, and draw on both to fill up a synthetic dataset
  • first, we'll split each individual up into a hundred little individuals, or make a hundred individuals from each (same maths)
  • then we'll control the rate at which individuals are selected from each dataset.

Donors

| Name | Adventures | Music        | Income | Weight |
|------|------------|--------------|--------|--------|
| Bill | Excellent! | Rock 'n Roll | £25k   | 2      |
| John | Bogus      | Smooth Jazz  | £50k   | 4      |

Donors

| Name  | Adventures | Music        | Age  | Occupation     | Weight |
|-------|------------|--------------|------|----------------|--------|
| Ted   | Excellent! | Rock 'n Roll | 25   | Guitarist      | 1      |
| Rufus | Excellent! | Rock 'n Roll | 55   | Time traveller | 2      |
| Grim  | Bogus      | Smooth Jazz  | 4000 | Reaper         | 3      |

Synthetic

| Name       | Income | Age  | Occupation     | Weight |
|------------|--------|------|----------------|--------|
| Bill/Ted   | £25k   | 25   | Guitarist      | 1      |
| Bill/Rufus | £25k   | 55   | Time traveller | 1      |
| John/Rufus | £50k   | 55   | Time traveller | 1      |
| John/Grim  | £50k   | 4000 | Reaper         | 3      |

We now have a dataset which matches the measures found in both of the originals. In the example above, some respondents are used more than once: Bill, John and Rufus each appear in two synthetic rows, while Ted and the Grim Reaper each appear in one. Notice that the sum of the weights across each individual’s synthetic rows matches the weight they carried in their original dataset, so everyone retains their original weighted representation.

Here the transportation problem was solved by matching several individuals from the one dataset with several from the other! Neat, huh? Another way would be to consider the weights when deciding on the matching rules, usually called a constrained fusion. Again, there are several viable approaches.
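Here is a minimal sketch of the weight-splitting route. Each record is expanded into weight-many unit copies, and a minimum-cost one-to-one assignment is found between the two expanded sides. The pairwise distances are invented, but the weights are the ones from the tables above, and the output reproduces the synthetic table.

```python
# Solving the transportation problem by splitting records into unit "little
# individuals" and optimally assigning the units of one dataset to the other.
import numpy as np
from scipy.optimize import linear_sum_assignment

names_1, weights_1 = ["Bill", "John"], [2, 4]
names_2, weights_2 = ["Ted", "Rufus", "Grim"], [1, 2, 3]

# Illustrative distances (rows: dataset 1, cols: dataset 2).
distance = np.array([[0.0, 0.5, 9.0],    # Bill vs Ted, Rufus, Grim
                     [8.0, 4.0, 0.0]])   # John vs Ted, Rufus, Grim

# Expand every individual into weight-many unit copies (total weight matches).
units_1 = [i for i, w in enumerate(weights_1) for _ in range(w)]
units_2 = [j for j, w in enumerate(weights_2) for _ in range(w)]
cost = np.array([[distance[i, j] for j in units_2] for i in units_1])

rows, cols = linear_sum_assignment(cost)

pairs = {}
for r, c in zip(rows, cols):
    key = (names_1[units_1[r]], names_2[units_2[c]])
    pairs[key] = pairs.get(key, 0) + 1   # synthetic row weight = units matched
print(pairs)  # by construction, each name's total equals their original weight
```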

Thus far, I’ve been careful not to draw attention to 'the rabbit'. John and Rufus are a bad match, and this introduces error into the data. Without better options for individuals like John and Rufus, we can only hope to minimise the error caused by poor matches. We can do so by being purposeful in how we build in additional hooks. Some may be more important to match correctly than others.

Fusion does not lend itself to a best practice approach. There is no correct way, and much depends on how the data will be used. It is perhaps the last bit of data science that appears more like a profession than a trade.

The third magical ingredient

Data fusion was nursed into commercial research in the late 80s by two indefatigable giants, Roland Soong and Steve Wilcox. For a while at least, it was widely misunderstood, haphazard, and untested, requiring the tight guard of a like-minded cabal of stats heroes so it could establish itself. It remained a niche field, becoming a sort of hand-me-down. I am grateful to Martin van Staveren, who taught me. It felt like a rite of passage.

As data science becomes automated and commodified, more decisions are taken by algorithms and fewer are left to the analyst. In stark contrast, fusion calls for a hell of a lot of decisions, demanding an instinctive grasp of both the data and the technique. On the one hand is simple distance mathematics, determining how the task is best structured and which algorithms will handle it better; on the other, a nebulous, unwinnable game of push and pull. Just as there is no gold-standard approach, there is no mathematically defined optimal solution. Martin spent selfless hours at a time, month in and month out, over many years, developing my intuition for it. It is not a recipe I could have copied and pasted, and I couldn’t have picked it up from a course or textbook (although there are good ones available).

Underneath prominent fusion products, such as audience currencies, sit smaller single-source datasets, which are used to monitor the health of the fusion. They provide validation and direction to the larger piece. Most fusion projects, however, don’t have this luxury. In its place are a handful of sketchy metrics, with nothing to say, “Thumbs up! Good job.” This leaves any analyst in an uncomfortable position, open to fair criticism.

It’s an absolute stinker when you think about it: it’s impossible to demonstrate that a fusion worked. If there were a way to do so, there would be no need for the fusion in the first place. A catch-22. Rather, like a segmentation study, a successful fusion is something we can only aim for. It starts with everyone on the same page, carefully and patiently planning, nailing down the use case and managing expectations.

Perspective

I remember long evenings at the MRG (Media Research Group); the room would temporarily split into three factions: Fusers argued its practicality, while Non-fusers detested its lack of accountability. The third group would hang their heads, trying to find refuge in cheap white wine. When the F-word was mentioned, the air would shift, and the sniping would begin within minutes. I would sit as still as I could, safely ensconced amongst the former, trying to memorise the moves of the titans as they renewed old skirmishes. I did not appreciate then what I do now: the voices were reasoned and sincere; the contention was only whether an analytical technique was fit for purpose and when it could be applied soundly. The quality of the technique wasn’t decided by whether a client would buy it or not. I pine to be back in that room.

Those meetings always concluded in the same way I must conclude here.

The question is not whether fusion works; it is easy to prove that it works in principle. Just chop up your own data and try to stick it back together again. When you do, though, you’ll also notice that, even with the same donors and recipients, you will be unable to reconstitute all the complex relationships the dataset contained.
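If you want to see both halves of that claim in one place, here is a fold-over sketch on simulated data, with a crude nearest-age match standing in for a real fusion engine. The part of the income-by-TV relationship that flows through the hook survives; the part carried by an unobserved trait does not, so the recovered correlation comes back visibly attenuated.

```python
# Fold-over test: split one single-source dataset, fuse it back together on the
# hook, then compare recovered relationships against the originals. Simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
age = rng.uniform(18, 70, n)
thrift = rng.normal(0, 1, n)                 # unobserved trait; not a hook
income = 1_000 * age + 5_000 * thrift + rng.normal(0, 8_000, n)
tv_hours = 40 - 0.3 * age - 2 * thrift + rng.normal(0, 5, n)

df = pd.DataFrame({"age": age, "income": income, "tv_hours": tv_hours})
donors = df.sample(frac=0.5, random_state=1)   # we keep their age + income
recipients = df.drop(donors.index)             # we keep their age + tv_hours

# Fuse on the single hook (age): each recipient borrows income from the donor
# with the nearest age (approximated here with a sorted search).
order = donors.sort_values("age").reset_index(drop=True)
pos = np.searchsorted(order["age"].to_numpy(), recipients["age"].to_numpy())
pos = np.clip(pos, 0, len(order) - 1)
fused = recipients[["age", "tv_hours"]].copy()
fused["income"] = order["income"].to_numpy()[pos]

print("true corr :", round(df["income"].corr(df["tv_hours"]), 3))
print("fused corr:", round(fused["income"].corr(fused["tv_hours"]), 3))
# Only the relationship carried by the hook is reconstituted; the rest is lost.
```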

Fusion is now an established and maturing technique. It has stood the test of time and evolved under the most intense scrutiny; there is a lot riding on those top-flight fusions built upon audience currencies. I know of people who have been grilled so hard on the details that they have left exhausted and in tears, though, I hasten to add, with dignity intact. Because of its miraculous promise, fusion has never been allowed an easy path. Yet because it lives on, despite never receiving the benefit of the doubt, I know of no other technique more deserving of the faith it calls for.

We’re not in the business of achieving perfection. Instead, we’re asked to conjure up competitive advantage through the clever use of data. If this isn't it, I don't know what is.

The next time you are asked about data integration, and you simply refuse to regurgitate the platitudes of a webinar, know that you too have a better set of options. Be no longer afraid of this magic but hold your wand up high. Stand on the shoulders of giants, for you have a mighty spell to cast.


Looking for more?

Baker, K., Harris, P. & O’Brien, J. (1989). Data fusion: an appraisal and experimental evaluation. Journal of the Market Research Society, 31(2).

Montigny, M. & Soong, R. (2003). Does fusion-on-the-fly really fly? Western Academy of Management Conference.

Montigny, M. & Soong, R. (2004). No free lunch in data fusion/integration. Western Academy of Management Conference.

Santini, G. (1986). Méthodes de fusion : nouvelles réflexions, nouvelles expériences, nouveaux enseignements. Les médias, expériences et recherches, Séminaire de l’IREP.

Sharot, T. (2007). The design and precision of data-fusion studies. International Journal of Market Research, 49(4), 449-470.