My goal was not to classify tweets as positive or negative but to extract the general “buzz” about the product by means of cluster analysis. After extracting the tweets that contain the word “kindle”, I removed irrelevant information (such as tinyurl links) using regular expressions.
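A minimal sketch of that filtering and cleaning step, written in Python for illustration. The sample tweets, the URL pattern and the `clean_tweet` helper are assumptions of mine, not the exact expressions used in the original analysis.

```python
import re

# Hypothetical sample tweets; the original data set is not shown in the post.
tweets = [
    "Just got my Kindle 2! http://tinyurl.com/abc123",
    "Lunch was great today",
    "Is the kindle really worth it? http://tinyurl.com/xyz789 hmm",
]

# Assumed pattern: anything starting with http:// or https:// up to the next space.
URL_PATTERN = re.compile(r"https?://\S+")

def clean_tweet(text):
    """Remove links, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

# Keep only tweets that mention "kindle" (case-insensitive), then clean them.
kindle_tweets = [clean_tweet(t) for t in tweets if "kindle" in t.lower()]
print(kindle_tweets)
```

A substring test like this also matches “kindle2”, which seems desirable here since both spellings refer to the same product.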
Next, it was time to understand the data, and a good way to do this is to look at word frequencies using TextStat. Here is what I came up with:
At the top of the word frequency list are the usual suspects: “I”, “and”, “to”, but also “kindle”, “kindle2” and “amazon”, which was expected. Now, let’s look at some of the words that do not occur frequently:
One fact here requires attention: text miners use stop-word lists to remove the most frequent words, but they often also remove words that occur infrequently. The table above shows that one infrequent word is “disappointed”, and if we had chosen to omit words in a specific frequency range (such as fewer than 3 occurrences) we could lose this important information. So caution is needed.
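To make the caution concrete, here is a small Python sketch that counts word frequencies and then prunes rare words while protecting a hand-picked list. It stands in for the TextStat step and is not that tool's output. The sample tweets, the `MIN_COUNT` threshold and the `KEEP_ANYWAY` set are assumptions chosen for illustration.

```python
from collections import Counter
import re

# Hypothetical cleaned tweets.
tweets = [
    "I love my kindle and I read a lot",
    "the kindle is great and I love it",
    "disappointed with my kindle",
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Word frequencies over the whole collection.
counts = Counter(word for t in tweets for word in tokenize(t))

# Assumed cut-off: words seen fewer than 3 times are candidates for removal.
MIN_COUNT = 3
rare_words = sorted(w for w, c in counts.items() if c < MIN_COUNT)

# Rare but informative words we decide to keep after inspecting the list.
KEEP_ANYWAY = {"disappointed"}
dropped = [w for w in rare_words if w not in KEEP_ANYWAY]

print(counts.most_common(3))
print("would be dropped:", dropped)
```

Inspecting `rare_words` before discarding anything is the point: a blind frequency cut-off would have removed “disappointed” along with the noise.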
After running the analysis, I came up with 20 different clusters of similar “thinking”. Note that we are interested not only in what those clusters are but also, more importantly, in the proportion of cases that each cluster contains (see previous post). Some examples of the clusters found are:
1) A cluster of users that are questioning the usefulness of the product
2) Excited users
3) Users that are happy about the text-to-speech feature of the product
4) The text-to-speech feature and potential copyright issues
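The post does not say which clustering algorithm produced these groups, so the following is only a toy sketch of the idea, assuming a simple single-pass (“leader”) clustering on Jaccard word overlap. The tweets, the stop-word list and the 0.3 threshold are all invented for illustration. It ends by computing the share of tweets in each cluster, which is the proportion mentioned above.

```python
from collections import Counter
import re

# Hypothetical cleaned tweets.
tweets = [
    "kindle text to speech is great",
    "love the kindle text to speech",
    "is the kindle really useful",
    "how useful is the kindle really",
    "text to speech on kindle is great",
]

# Assumed stop words; "kindle" is included because every tweet contains it.
STOP_WORDS = {"the", "is", "to", "on", "kindle"}

def tokens(text):
    """Return the set of non-stop-word tokens in a tweet."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_cluster(docs, threshold=0.3):
    """Assign each doc to the first cluster leader it resembles, else start a new cluster."""
    leaders, labels = [], []
    for doc in docs:
        toks = tokens(doc)
        for idx, leader in enumerate(leaders):
            if jaccard(toks, leader) >= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(toks)
            labels.append(len(leaders) - 1)
    return labels

labels = leader_cluster(tweets)

# Proportion of tweets that falls into each cluster.
sizes = Counter(labels)
proportions = {cluster: n / len(labels) for cluster, n in sizes.items()}
print(labels, proportions)
```

On real data a TF-IDF representation with a standard algorithm such as k-means would be a more usual choice. The leader approach is used here only because it fits in a few lines of standard-library code.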
Twitter is a great source for sentiment extraction, but one problem is that people re-tweet the same news (“The new Kindle 2 is out”) or tweet similar information from various tech news websites.
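One possible way to reduce this repetition, sketched in Python under my own assumptions rather than taken from the original analysis, is to normalize each tweet (strip a leading “RT @user:” marker, links, case and punctuation) and keep only the first tweet for each normalized form. The sample tweets and helper names are hypothetical.

```python
import re

# Hypothetical tweets, including a re-tweet and a copy with a link appended.
tweets = [
    "The new Kindle 2 is out",
    "RT @gadgetfan: The new Kindle 2 is out",
    "the new kindle 2 is out! http://tinyurl.com/abc123",
    "Kindle 2 battery life is impressive",
]

def normalize(text):
    """Reduce a tweet to a comparison key: no RT prefix, links, case or punctuation."""
    text = re.sub(r"^\s*RT\s+@\w+:?\s*", "", text, flags=re.IGNORECASE)
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())

def drop_duplicates(items):
    """Keep the first tweet seen for each normalized key, preserving order."""
    seen = set()
    unique = []
    for item in items:
        key = normalize(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

unique_tweets = drop_duplicates(tweets)
print(unique_tweets)
```

This only catches copies that are identical after normalization. Differently worded tweets about the same news story would still need a similarity measure to be grouped.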