Understanding your data usage is essential to improving its quality, and therefore, you must perform data analysis on a regular basis.
A data profiling tool can help you by automating some of the grunt work needed to begin your data analysis, such as generating levels of statistical summaries supported by drill-down details, including data value frequency distributions (like the ones shown to the left).
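To make the idea of a data value frequency distribution concrete, here is a minimal Python sketch of the kind of field-level profile such a tool generates; the sample values are hypothetical:

```python
from collections import Counter

# Hypothetical sample of a single field's values, for illustration only.
country_codes = ["US", "US", "GB", "us", "DE", "US", "", "GB", "XX", "US"]

# A basic profile: the value frequency distribution for one field.
frequencies = Counter(country_codes)
total = len(country_codes)

for value, count in frequencies.most_common():
    label = value if value else "(blank)"
    print(f"{label!r}: {count} ({count / total:.0%})")
```

Even this toy profile surfaces the questions that matter: the inconsistent casing ("US" vs. "us"), the blank value, and the suspicious "XX" code are the drill-down details that prompt further analysis.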
However, a common mistake is to hyper-focus on the data values.
Narrowing your focus to the values of individual fields becomes a mistake when it causes you to lose sight of the wider context of the data, which can lead to other errors, such as mistaking validity for accuracy.
Understanding data usage is about analyzing its most important context—how your data is being used to make business decisions.
“Begin with the decision in mind”
In his excellent recent blog post It’s time to industrialize analytics, James Taylor wrote that “organizations need to be much more focused on directing analysts towards business problems.” Although Taylor was writing about how, in advanced analytics (e.g., data mining, predictive analytics), “there is a tendency to let analysts explore the data, see what can be discovered,” I think this tendency is applicable to all data analysis, including less advanced analytics like data profiling and data quality assessments.
Please don’t misunderstand: Taylor and I are not saying that there is no value in data exploration, because, without question, it can lead to meaningful discoveries. And I continue to advocate that the goal of data profiling is not to find answers, but instead, to discover the right questions.
However, as Taylor explained, it is because “the only results that matter are business results” that data analysis should always “begin with the decision in mind. Find the decisions that are going to make a difference to business results—to the metrics that drive the organization. Then ask the analysts to look into those decisions and see what they might be able to predict that would help make better decisions.”
Once again, although Taylor is discussing predictive analytics, this cogent advice should guide all of your data analysis.
The Real Data Value is Business Insight
Returning to data quality assessments, which create and monitor metrics based on the summary statistics provided by data profiling tools (like the ones shown in the mockup to the left): elevating these low-level technical metrics to the level of business relevance will often establish their correlation with business performance, but it will not establish metrics that drive, or should drive, the organization.
Although built from the bottom-up by using, for the most part, the data value frequency distributions, these metrics lose sight of the top-down fact that business insight is where the real data value lies.
To be clear, data quality metrics such as completeness, validity, accuracy, and uniqueness, which are just a few common examples, should definitely be created and monitored. Unfortunately, no single straightforward metric called Business Insight exists.
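To show what those common metrics actually measure, here is a hedged sketch computing three of them over a hypothetical set of customer records (the field names, the simplistic email pattern, and the records themselves are all illustrative assumptions):

```python
import re

# Hypothetical customer records; field names are illustrative.
records = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "not-an-email"},
    {"id": 4, "email": "ann@example.com"},
]

emails = [r["email"] for r in records]
total = len(emails)

# Completeness: share of values that are not blank.
completeness = sum(1 for e in emails if e) / total

# Validity: share of values matching a (deliberately simplistic) email pattern.
pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
validity = sum(1 for e in emails if pattern.match(e)) / total

# Uniqueness: share of distinct values among the non-blank ones.
non_blank = [e for e in emails if e]
uniqueness = len(set(non_blank)) / len(non_blank)

print(f"completeness={completeness:.0%} validity={validity:.0%} uniqueness={uniqueness:.0%}")
```

Note that accuracy is missing from the sketch: unlike the other three, it cannot be computed from the data alone, because it requires comparison against a trusted external source, which is exactly why validity is so often mistaken for it.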
But let’s pretend that my other mockup metrics were real—50% of the data is inaccurate and there is an 11% duplicate rate.
Oh, no! The organization must be teetering on the edge of oblivion, right? Well, 50% accuracy does sound really bad, basically like your data’s accuracy is no better than flipping a coin. However, which data is inaccurate, and far more important, is the inaccurate data actually being used to make a business decision?
As for the duplicate rate, I am often surprised by the visceral reaction it can trigger, such as: “how can we possibly claim to truly understand who our most valuable customers are if we have an 11% duplicate rate?”
So, would reducing your duplicate rate to only 1% automatically result in better customer insight? Or would it simply mean that the data matching criteria were too conservative (e.g., requiring an exact match on all “critical” data fields), preventing you from discovering how many duplicate customers you actually have? (Or maybe the 11% indicates the matching criteria were too aggressive.)
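The dependence of the duplicate rate on the matching criteria can be sketched in a few lines of Python; the customer records, normalization rules, and match keys below are all hypothetical assumptions, not a real matching algorithm:

```python
# Illustrative sketch: the measured duplicate rate depends as much on the
# matching criteria as on the data itself.
customers = [
    ("Jim Harris", "123 Main St"),
    ("Jim Harris", "123 Main Street"),  # same person, different abbreviation
    ("J. Harris", "123 Main St"),       # same person, abbreviated name
    ("Jane Harris", "456 Oak Ave"),     # different person
]

def exact_duplicates(rows):
    # Conservative criteria: exact match on every field.
    seen, dupes = set(), 0
    for row in rows:
        if row in seen:
            dupes += 1
        seen.add(row)
    return dupes

def normalized_duplicates(rows):
    # More aggressive criteria: normalize abbreviations, match on surname
    # plus address. Note this would also wrongly merge two different people
    # with the same surname at the same address.
    def normalize(name, address):
        addr = address.lower().replace("street", "st").replace("avenue", "ave")
        surname = name.split()[-1].lower()
        return (surname, addr)
    seen, dupes = set(), 0
    for row in rows:
        key = normalize(*row)
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes

print(exact_duplicates(customers))       # → 0
print(normalized_duplicates(customers))  # → 2
```

The same four records yield a 0% duplicate rate under the conservative criteria and a 50% rate under the aggressive ones, which is precisely why the number alone, without the business context behind it, tells you very little.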
My point is that accuracy and duplicate rates are just numbers—what determines if they are a good number or a bad number?
The fundamental question that every data quality metric you create must answer is: How does this provide business insight?
If a data quality metric (or any other data metric) cannot answer this question, then it is meaningless. Meaningful metrics always represent business insight because they were created by beginning with the business decisions in mind. Otherwise, your metrics could provide the comforting, but false, impression that all is well, or they could raise red flags that are really red herrings.
Instead of beginning data analysis with the business decisions in mind, many organizations begin with only the data in mind, which results in creating and monitoring data quality metrics that provide little, if any, business insight and decision support.
Although analyzing your data values is important, you must always remember that the real data value is business insight.