In some situations, the data miner has to sample the dataset before applying any algorithm, usually because there is too much data to mine. In such a case, one possible technique is random sampling. If the classes are uniformly distributed, random sampling can be used before supervised learning.
But what about association rule mining? If you use random sampling before an association rule algorithm, you may end up finding no rules at all. The reason is that association rule mining analyses the data as transactions. The idea is to find recurring patterns in a set of transactions, and the rows that make up a transaction are usually stored contiguously. Here is an example:
Transaction ID / product
112 / bread
112 / butter
112 / jam
113 / cheese
113 / bread
...
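The issue with random sampling is that it does not take this contiguous sequence of rows into account: rows are picked independently, so the items of a single transaction can be split, with only some of them ending up in the sample. In the case of association rules, one should take a contiguous subset of the data in order to get meaningful rules.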
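To make the point concrete, here is a minimal Python sketch. It assumes the data is stored as one row per (transaction ID, product) pair, as in the example above. The rows for transactions 114 and 115 and the helper names group_by_transaction and complete_transactions are made up for illustration. It compares a row-level random sample with a contiguous subset by counting how many transactions keep all of their items.

```python
import random
from collections import defaultdict

# Toy data in the "transaction ID / product" layout used above.
# Transactions 114 and 115 are invented for this illustration.
rows = [
    (112, "bread"), (112, "butter"), (112, "jam"),
    (113, "cheese"), (113, "bread"),
    (114, "bread"), (114, "butter"),
    (115, "jam"), (115, "bread"), (115, "butter"),
]

def group_by_transaction(rows):
    """Rebuild each transaction's item set from (transaction ID, product) rows."""
    baskets = defaultdict(set)
    for tid, product in rows:
        baskets[tid].add(product)
    return dict(baskets)

def complete_transactions(sample, full):
    """Return the IDs of transactions whose items all appear in the sample."""
    sampled = group_by_transaction(sample)
    original = group_by_transaction(full)
    return sorted(tid for tid, items in sampled.items() if items == original[tid])

# Row-level random sampling: each row is drawn independently of its transaction,
# so the result varies with the seed and transactions are often left incomplete.
random.seed(0)
random_sample = random.sample(rows, 5)
print(complete_transactions(random_sample, rows))

# Contiguous subset: the first five rows keep transactions 112 and 113 whole.
contiguous_sample = rows[:5]
print(complete_transactions(contiguous_sample, rows))  # [112, 113]
```

Note that even a contiguous subset can cut a transaction in half at its boundary (taking only the first four rows would leave transaction 113 incomplete), so it is worth choosing the cut between two transactions.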
Do you have any other examples where random sampling can’t be used? Other issues with association rule mining? Feel free to comment on this post.