In my experience of customer-focused data mining projects, over 80% of the time is spent preparing and transforming the customer data into a usable format. Often the data is transformed to a ‘single row per customer’ or similar summarised format, and many columns (aka variables or fields) are created to act as inputs into predictive or clustering models. Such data transformation can also be referred to as ETL (extract, transform, load), although my work is usually done in SQL within the data warehouse, so it is really just the ‘T’ bit.
Granted, a lot of the ETL you perform will be data and industry specific, so I’ve tried to keep things very simple. I hope the example below, transforming transactional data into a useful customer-centric format, is generic enough to apply widely. Feedback and open discussion might broaden my habits.
Strangely, many ‘data mining’ books almost completely avoid the topic of data processing and data transformation. Those that do mention it often simply refer to feature selection algorithms or to applying a log function to rescale numeric data for use as predictive algorithm inputs. Some mention the various types of means you could create (arithmetic, harmonic, winsorised, etc.), or measures of dispersion (range, variance, standard deviation, etc.).
There seems to be a glaring gap! I’m specifically referring to data processing steps that are separate from the mandatory or statistical requirements of the modelling algorithm. In my experience, relatively simple data processing steps can yield significantly better results than tweaking algorithm parameters. Some of these steps are likely to be industry or data specific, but I’m guessing many are widely useful. They don’t necessarily have to be statistical in nature.
So (to put my money where my mouth is) I’ve started by illustrating a very simple data transformation that I expect is common. On a public SPSS Clementine forum I’ve attached a small demo data file (which I created, and which is entirely fictitious) and an SPSS Clementine stream file that processes it (only useful for users of SPSS Clementine).
Clementine Stream and text data files
my post to a Clementine user forum
I’m hoping that my peers might exchange similar ideas (hint!). A lot of this ETL stuff may be basic, but it’s rarely what data miners talk about, and it’s exactly what I would find useful. This is just the start of a series on ETL steps you could perform.
I’ve also added a poll for feedback on whether this is helpful, too basic, etc.
– Tim
Example data processing steps
a) Creation of additional dummy columns
Where the data has a single category column that contains one of several values (in this example voice calls, SMS calls, data calls, etc.), we can use a CASE statement to create a new column for each category value. We can use 0 or 1 as an indicator of whether the category value occurs in a specific row, or we can carry over the value of a numeric field (for example call count or duration, if the data is already partly summarised). A new column is created for each category value; a SQL sketch follows the example tables below.
For example;
customer | category | score |
bill | food | 10 |
bill | drink | 20 |
ben | food | 15 |
bill | drink | 25 |
ben | drink | 20 |
Can be changed to;
customer | category | score | food_ind | drink_ind |
bill | food | 10 | 1 | 0 |
bill | drink | 20 | 0 | 1 |
ben | food | 15 | 1 | 0 |
bill | drink | 25 | 0 | 1 |
ben | drink | 20 | 0 | 1 |
Or even;
customer | category | score | food_score | drink_score |
bill | food | 10 | 10 | 0 |
bill | drink | 20 | 0 | 20 |
ben | food | 15 | 15 | 0 |
bill | drink | 25 | 0 | 25 |
ben | drink | 20 | 0 | 20 |
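To make the idea concrete, here is a rough SQL sketch of the kind of CASE statements I mean. The table name (raw_transactions) and the category values are just placeholders taken from the fictitious example above, so adapt them to your own data.

SELECT
    customer,
    category,
    score,
    -- 0/1 indicator columns, one per category value
    CASE WHEN category = 'food'  THEN 1 ELSE 0 END AS food_ind,
    CASE WHEN category = 'drink' THEN 1 ELSE 0 END AS drink_ind,
    -- or carry over the numeric value instead of a flag
    CASE WHEN category = 'food'  THEN score ELSE 0 END AS food_score,
    CASE WHEN category = 'drink' THEN score ELSE 0 END AS drink_score
FROM raw_transactions;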
b) Summarisation
Aggregate the data so that we have only one row per customer (or whatever your ‘unique identifier’ is) and sum or average the dummy and/or raw columns.
So we could change the previous step to something like this (a SQL sketch follows the table);
customer | food_score | drink_score |
bill | 10 | 45 |
ben | 15 | 20 |
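And a matching SQL sketch for the summarisation step, again with placeholder names. It folds the CASE logic from step a) inside the aggregates, so both steps can be done in a single pass:

SELECT
    customer,
    -- sum the per-category scores so we end up with one row per customer
    SUM(CASE WHEN category = 'food'  THEN score ELSE 0 END) AS food_score,
    SUM(CASE WHEN category = 'drink' THEN score ELSE 0 END) AS drink_score
FROM raw_transactions
GROUP BY customer;

You could equally use AVG, MAX or COUNT in place of SUM, depending on what the downstream model needs.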