I like the CRISP-DM process model for data mining, teach from it, and use it on my projects. I routinely commend it to practitioners and managers as an aid during any data mining project. However, while I generally follow its sequence, I don't always. Data mining often requires more creativity and "art" in reworking the data than we would like. It would be very nice if we could create a checklist and just run through it on every project! Unfortunately, data doesn't always cooperate in this way, so we need to adapt to the specific data problems in order to prepare the data well.
For example, on a current financial risk project I am working on, the customer is building data for predictive analytics for the first time. The customer is data savvy but new to predictive analytics, so we've had to iterate several times on how the data is pulled and rolled up out of the database. In particular, the target variable has had to be cleaned up because of historic coding anomalies.
One primary question to resolve for this project is the all-too-common debate over the right level of aggregation: do we use transactional data, even though some customers have many transactions and some have few, or do we roll the data up to the customer level to build customer risk models? (A transaction-based model will score each transaction for risk, whereas a customer-based model will score, daily, the risk associated with each customer given the new transactions that have been added.) There are advantages and disadvantages to both, but in this case we are building a customer-centric risk model for reasons that make sense in this particular business context.
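To make the roll-up option concrete, here is a minimal sketch of aggregating transaction rows into one row per customer. It is not the code from this project: the record layout (customer id, amount, flag) and the feature names are hypothetical, chosen only to show how customers with many transactions and customers with few both end up as a single row.

```python
from collections import defaultdict

# Hypothetical transaction records: (customer_id, amount, flagged)
transactions = [
    ("c1", 120.0, 0),
    ("c1", 80.0, 1),
    ("c1", 40.0, 0),
    ("c2", 500.0, 0),
]

def roll_up(transactions):
    """Aggregate transaction rows into one feature row per customer."""
    acc = defaultdict(
        lambda: {"n_txn": 0, "total_amount": 0.0, "max_amount": 0.0, "n_flagged": 0}
    )
    for customer_id, amount, flagged in transactions:
        row = acc[customer_id]
        row["n_txn"] += 1
        row["total_amount"] += amount
        row["max_amount"] = max(row["max_amount"], amount)
        row["n_flagged"] += flagged
    # Derived feature: average transaction size per customer
    for row in acc.values():
        row["mean_amount"] = row["total_amount"] / row["n_txn"]
    return dict(acc)

customer_features = roll_up(transactions)
```

Re-running the roll-up each day over the transactions seen so far is one simple way to get the daily customer-level scoring table described above; a transaction-based model would instead score each input row as it is.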
Back to the CRISP-DM process and why it can be advantageous to deviate from it. In this project, we jumped from Business Understanding and the beginnings of Data Understanding straight to Modeling. In this case I would call it "modeling" (small 'm'), because we weren't building models to predict risk but rather to understand the target variable better. We were not sure how clean the data was to begin with, especially the definition of the target variable, because no one had ever looked at the data in aggregate before, only on a customer-by-customer basis. By building models and seeing some fields that predict the target variable "too well", we have been able to identify historic data inconsistencies and miscoding.
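One simple way to spot fields that predict the target "too well" is to score each field on its own and flag any that separate the classes almost perfectly. The sketch below is an illustration only, not the method used on this project: the field names, the toy rows, and the 0.95 cutoff are all assumptions, and it uses a plain pairwise AUC on numeric fields with a binary target.

```python
def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one row of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def suspicious_fields(rows, target, threshold=0.95):
    """Return fields whose single-field AUC (in either direction) meets the threshold."""
    labels = [r[target] for r in rows]
    flagged = {}
    for field in rows[0]:
        if field == target:
            continue
        a = auc([r[field] for r in rows], labels)
        strength = max(a, 1.0 - a)  # a strongly inverse field is just as suspect
        if strength >= threshold:
            flagged[field] = strength
    return flagged

# Hypothetical customer-level rows; "status_code" stands in for a miscoded field
rows = [
    {"balance": 100, "status_code": 0, "high_risk": 0},
    {"balance": 900, "status_code": 0, "high_risk": 0},
    {"balance": 300, "status_code": 9, "high_risk": 1},
    {"balance": 700, "status_code": 9, "high_risk": 1},
    {"balance": 500, "status_code": 0, "high_risk": 0},
    {"balance": 400, "status_code": 9, "high_risk": 1},
]

flagged = suspicious_fields(rows, "high_risk")
```

A flagged field is only a prompt to go ask how it was populated; whether it reflects miscoding, a field filled in after the outcome was known, or a legitimately strong predictor is a judgment for someone who knows the data.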
Now that we have the target variable better defined, I'm going back to the data understanding and data prep stages to complete those stages properly, and this is changing how the data will be prepped in addition to modifying the definition of the target variable. It's also much more enjoyable to build models than do data prep, so for me this was a "win-win" anyway!
Other Posts by Dean Abbott
A guest says:
I can't agree with you enough on this! As more and more BI practitioners learn Predictive Analytics and Data Mining techniques, the processes we use need to adapt to their needs rather than raising the barriers by making them conform to ours. I always instruct users to leverage the parts of the process that make sense for them rather than trying to learn everything before they start.
Independent of any toolset, the processes we use need to be flexible enough that new practitioners can apply and leverage Predictive Analytics in new and creative ways.
Dean Abbott says:
Wayne:
Great to hear from you! SEMMA is an excellent methodology as well, and overlaps significantly with CRISP-DM. SEMMA, as you know, is more geared for the analytics part of data mining, whereas CRISP-DM is more project oriented; CRISP-DM starts with Business Understanding where SEMMA starts with the data.
I will definitely read through the link you provided.
D.
Wayne Thompson says:
Very useful post. I totally agree that defining the target variable correctly, along with a good candidate set of predictor variables and close consideration of the prediction window, is of extreme importance. One of my co-workers at SAS provides some really good tips and examples on how to prepare analytical modeling (training) tables. See http://www.sascommunity.org/wiki/Data_Preparation_for_Analytics . CRISP-DM is good common-sense practice. We also designed SAS Enterprise Miner around an iterative SEMMA (Sample, Explore, Modify, Model, Assess) model development process, with sampling being optional and more commonly used for oversampling rare target events. Thanks again.
A guest says:
Very useful points, Dean. I agree that assembling a representative training table, along with the specification of the target and the predictors, is critical to developing meaningful models. The timing element is of critical importance. One of my coworkers at SAS has some very good information on preparing analytical data marts here http://www.sascommunity.org/wiki/Data_Preparation_for_Analytics that some may be interested in.
CRISP-DM is good common-sense data analysis. At SAS we also propose a Sample, Explore, Modify, Model, Assess (SEMMA) model development process, with sampling being optional (more commonly used for oversampling rare target events). The components are very iterative; exploration, for example, may be applied after model assessment to do things like residual diagnostics and generalization. Anyway, great post, and looking forward to your hands-on workshop at Predictive Analytics World in San Fran this March 2011.