When I first came across data mining and machine learning in 1997, I had no idea what kinds of applications this field could have. As time passes, the knowledge available to a data/text miner is becoming more and more of a serious business… actually, a very serious one.
Not long ago I saw a presentation in which a map of emotions from the web was created in real time by aggregating specific keywords from blogs and forum posts. Twistori is an example of such an application. Now, let’s take this idea one step further.
Twitter is a “social messaging utility” in which users describe what they are doing, feeling or thinking right now. Users can even send “tweets” by SMS. The format of these messages is ideal for text mining: short phrases that summarize what a user wants to say are a text miner’s paradise.
It is logical to assume that text mining and information extraction techniques will become more important as more data is generated. It is only a matter of time until the next “killer app” in the mould of Facebook, YouTube and Twitter appears. Data/text miners will be able to identify common “thought clusters” of people.
Now, consider the following example: by visiting this link you will get a list of people who have written the phrase “I don’t want to…” in their tweets.
Once this textual information is captured, preprocessed and then run through cluster analysis, we could end up with the following clusters of “I don’t want-ers”:
– The cluster of users that do not want to work again/tomorrow/today (18.5%)
– The cluster of users that do not want to go to sleep (6%)
– The cluster of users that do not want to hurt someone (4.2%)
What is also interesting is the ability to quantify each cluster’s share of the total number of tweets. As the example above shows, the most frequently occurring thought comes from people who do not feel like working.
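The capture, preprocess and quantify steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample tweets are made up, and grouping by the first meaningful word after the phrase is a crude stand-in for real cluster analysis. The names `TWEETS`, `FILLER_WORDS`, `continuation`, `cluster_label` and `cluster_shares` are hypothetical, and the shares are computed over the tweets that contain the phrase.

```python
import re
from collections import Counter

# Hypothetical sample tweets; real input would come from a Twitter search.
TWEETS = [
    "I don't want to work tomorrow",
    "Ugh, I don't want to work today!",
    "i don't want to go to sleep yet",
    "I don’t want to hurt anyone",
    "I don't want to work again",
    "Nothing to see here",
]

# Matches the phrase with a straight or curly apostrophe, in any letter case.
PHRASE = re.compile(r"i don['’]t want to\s+(.*)", re.IGNORECASE)

# Assumed filler words that say little about what the user does not want.
FILLER_WORDS = {"go", "to", "be", "have"}


def continuation(tweet):
    """Return the lowercased words that follow the phrase, or None if absent."""
    match = PHRASE.search(tweet)
    if match is None:
        return None
    return re.findall(r"[a-z']+", match.group(1).lower())


def cluster_label(words):
    """Crude stand-in for clustering: the first word that is not filler."""
    for word in words:
        if word not in FILLER_WORDS:
            return word
    return None


def cluster_shares(tweets):
    """Return (label, share of matching tweets) pairs, largest cluster first."""
    labels = []
    for tweet in tweets:
        words = continuation(tweet)
        if words is None:
            continue  # the tweet does not contain the phrase
        label = cluster_label(words)
        if label is not None:
            labels.append(label)
    total = len(labels)
    return [(label, count / total) for label, count in Counter(labels).most_common()]


for label, share in cluster_shares(TWEETS):
    print(f"{label}: {share:.1%}")
```

A real pipeline would replace `cluster_label` with a proper clustering step over the preprocessed text, but the final counting of shares would look much the same.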
In the same way, one could perform this type of analysis for:
“I believe…”
“I wish I…”
“I want to buy…”
Essentially, what we are talking about is the extraction of the values, hopes and beliefs of hundreds of thousands (or even millions) of users, ranked in descending order of frequency. Once a first run has been performed and the clusters extracted, one could repeat the process every month and watch how those clusters trend over time. It would also be interesting to see how these thought clusters change after specific world events.
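The monthly re-run can be sketched as follows, assuming each tweet has already been assigned a cluster label and a month. The observations, the month keys `M1` and `M2`, and the functions `monthly_shares` and `share_change` are all hypothetical and serve only to show the bookkeeping.

```python
from collections import Counter, defaultdict

# Hypothetical (month, cluster label) pairs, one per already-clustered tweet.
OBSERVATIONS = [
    ("M1", "work"), ("M1", "work"), ("M1", "sleep"), ("M1", "hurt"),
    ("M2", "work"), ("M2", "sleep"), ("M2", "sleep"), ("M2", "sleep"),
]


def monthly_shares(observations):
    """Map each month to a dict of {cluster label: share of that month's tweets}."""
    per_month = defaultdict(Counter)
    for month, label in observations:
        per_month[month][label] += 1
    shares = {}
    for month, counts in per_month.items():
        total = sum(counts.values())
        shares[month] = {label: count / total for label, count in counts.items()}
    return shares


def share_change(shares, label, earlier, later):
    """Change in one cluster's share between two months (absent counts as 0.0)."""
    return shares[later].get(label, 0.0) - shares[earlier].get(label, 0.0)


shares = monthly_shares(OBSERVATIONS)
for label in ("work", "sleep", "hurt"):
    print(label, f"{share_change(shares, label, 'M1', 'M2'):+.0%}")
```

Comparing the months on either side of a specific event would use the same `share_change` call with the relevant pair of month keys.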
For some people, such as marketers and social researchers, this information is invaluable, provided that the results are accurate enough. Others might feel that such an analysis is bad practice. Of course, there are companies that already capture brand sentiment across the web: Crimson Hexagon and Twitrattr are just two examples.
This post is the first in a series discussing the application of analytics to capture the thoughts that exist on the Web right now. We will go through ways in which one could explore this information; more specifically, we will look at:
- How clustering can group people’s values, beliefs and emotions.
- Why Ontologies and Natural Language Processing are needed for better results.
- How classification analysis might reveal the common characteristics of various ‘categories’ of users.