I’d like to get a little technical on this post. I try to keep my posts business-friendly, but sometimes there’s importance in detail. If none of this post makes any sense to you, I wrote a sort of primer on how matching works in many data quality tools, which you can get here.
Matching Algorithms
I’d like to get a little technical on this post. I try to keep my posts business-friendly, but sometimes there’s importance in detail. If none of this post makes any sense to you, I wrote a sort of primer on how matching works in many data quality tools, which you can get here.
Matching Algorithms
When you use a data quality tool, you’re often using matching algorithms and rules to make decisions on whether records match or not. You might be using deterministic algorithms like Jaro, SoundEx and Metaphones. You might also be using probabilistic matching algorithms.
In many tools, you can set the rules to be tight where the software uses tougher criteria to determine a match, or loose where the software is not so particular. Tight and loose matches are important because you may have strict rules for putting records together, like customers of a bank, or not so strict rules, like when you’re putting together a customer list for marketing purposes.
What to do with Matches
Once data has been processed through the matcher, there are several possible outcomes. Between any two given records, the matcher may find:
- No relationship
- Match – the matcher found a definite match based on the criteria given
- Suspect – the matcher thinks it found a match but is not confident. The results should be manually reviewed.
It’s that last category that the tough one. Mitigating the suspect matches is the most time-consuming follow-up task after the matching is complete. Envision a million record database where you have 20,000 suspect matches. That’s still going to take you some time to review.
Some of the newer (and cooler) tools offer strategies for dealing with suspect matches. The tools will present the suspect matches in a graphical user interface and allow users to pick which relationships are accurate and which are not. For example, Talend now offers a data stewardship console that lets you pick and choose records and attributes that will make up a best of breed record.
The goal, of course, is to not have suspect matches, so tuning the matches and limiting the suspect matches is the ultimate. The newest tools will make this easy. Some of the legacy tools make this hard.
Match mitigation is perhaps one of the most often overlooked processes of data quality. Don’t overlook it in your planning and processes.