Yet, minimal attention is given to the “unglamorous” side of predictive analytics: the data itself. A business problem is often posed, and a single slide is devoted to the sources of data used to develop the solution. No attention is given to the rigor of examining these data sources in order to arrive at an optimal data environment for developing the predictive analytics solution. We refer to this rigor as the “Data Audit Process”. Yet it is this process, and the discipline of being a “data grunt”, that provides the backbone for building predictive analytics solutions. But what does the data audit entail?
Upon commencement of any predictive analytics solution, the data requirements and data sources are defined. The practitioner writes a data extract document, and the data is then delivered. If twenty data files or tables are requested, a separate data audit is done on each file or table, ultimately resulting in twenty data audit reports. But what does a data audit report contain? Three different types of output are created, which together yield a detailed level of insight into the data and how it can be used in an analytical solution.
The first output is a report depicting a random sample of 100 records from the file. This output simply provides a picture of what the actual table or file looks like. From this sample, the practitioner can begin to better understand the composition of certain fields based on the values and outcomes reported in them.
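To make this concrete, here is a minimal sketch of how such a sample might be pulled using pandas; the file name and the fixed random seed are illustrative assumptions, not part of any particular toolset.

```python
import pandas as pd

# Load one of the delivered extracts; the file name is illustrative.
df = pd.read_csv("customer_extract.csv")

# Draw a random sample of 100 records (guarding against smaller files);
# a fixed random_state keeps the sample reproducible across runs.
sample = df.sample(n=min(100, len(df)), random_state=42)
print(sample.to_string())
```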
The second output is a data diagnostics report. This output looks at each field within a given file. The report outputs the field format, the number of missing values, and the number of unique values for each field in the file. Along with these diagnostics, the report also outputs the mean value and standard deviation for each numeric field. This output begins to reveal the utility of a variable in any predictive analytics solution. For example, variables with more than 90% of their values reported as missing will not be useful in any analytics exercise. Variables with only one unique outcome will likewise not be useful in any analytics solution.
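A sketch of such a diagnostics report, again assuming pandas and an illustrative file name, might look like the following. The 90% missing and single-value flags mirror the screening rules described above.

```python
import pandas as pd

df = pd.read_csv("customer_extract.csv")  # illustrative file name

# One diagnostics row per field: format, missing counts, unique counts,
# and mean/standard deviation (populated for numeric fields only).
diagnostics = pd.DataFrame({
    "format": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean().round(3),
    "n_unique": df.nunique(dropna=True),
    "mean": df.mean(numeric_only=True),
    "std": df.std(numeric_only=True),
})

# Flag fields unlikely to be useful in a predictive solution:
# more than 90% missing, or only a single unique outcome.
diagnostics["low_utility"] = (
    (diagnostics["pct_missing"] > 0.9) | (diagnostics["n_unique"] <= 1)
)
print(diagnostics)
```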
The third output is the set of frequency distribution reports, which are produced for each variable in the file. These reports provide a more detailed view of a field or variable by displaying how its outcomes or values are distributed. Besides yielding additional information on which variables will be useful in a future predictive analytics exercise, frequency reports also provide insights on how to derive new variables from the source variables.
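Such frequency reports could be generated with a loop like the sketch below; the cap of 50 outcomes per field is an assumption added purely to keep high-cardinality fields readable.

```python
import pandas as pd

df = pd.read_csv("customer_extract.csv")  # illustrative file name

# Print a frequency distribution for every field in the file.
for col in df.columns:
    freq = (
        df[col]
        .value_counts(dropna=False)  # keep missing values visible
        .head(50)                    # cap long distributions for readability
    )
    print(f"\n=== {col} ===")
    print(freq.to_string())
```

Running this over each delivered file, together with the two outputs above, yields one complete audit report per file or table requested.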
Although these outputs themselves are not revolutionary, this discipline of “data” investigation represents the initial process in any analytics exercise. It is this initial process that provides the framework for creating the all-important analytical file which will be used to develop the predictive model. Without this framework, building a model is akin to trying to read without knowing the alphabet. In the next blog, I will discuss what we need to consider in creating a robust analytical file once this framework is established.