
First we have to define the problem and decide what information to consider. Every tweet has an author, a piece of content, and a specific day and time of posting. More specifically, for every tweet we can collect usage and content data such as:
- Day of post
- Time of post
- Minutes elapsed since the tweet was posted
- Author of the tweet (Twitter username)
- Number of followers of the author
- Subject of the post
- Whether the tweet asks a question
- Whether the tweet contains hashtags
- Whether the tweet contains a "Please Re-Tweet" directive (or variants)
- Whether a user is mentioned
- The text of the tweet itself
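The yes/no fields in the list above can be derived from the tweet text alone. The following is a minimal sketch of one way to do that; the regular expressions, the function name `content_flags` and the field names are assumptions made for illustration, not the rules used for this exercise.

```python
import re

# Hypothetical patterns: the post does not say how each flag was derived.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
# Matches "please RT", "pls retweet", "Please Re-Tweet" and similar variants.
PLEASE_RT_RE = re.compile(r"\b(?:please|pls|plz)\s+(?:re-?tweet|rt)\b", re.IGNORECASE)


def content_flags(text):
    """Return the four yes/no content fields for one tweet text."""
    return {
        "has_question": "?" in text,
        "has_hashtag": bool(HASHTAG_RE.search(text)),
        "has_please_rt": bool(PLEASE_RT_RE.search(text)),
        "has_mention": bool(MENTION_RE.search(text)),
    }


# Made-up tweet text, for illustration only.
flags = content_flags("Please RT: who is going to #PyCon? @alice")
```

A bare question mark is a crude test for a question, and the directive pattern covers only a few spellings, so both would probably need tuning on real tweets.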
For this data and text mining exercise (keeping in mind that the tweets were sampled from one website and not from Twitter itself), let's define what a viral tweet is. After collecting approximately 8000 tweets from dailyrt.com, the median number of Re-tweets was found to be 17. Here we assume that a tweet is viral if it exceeds 30 Re-tweets (this particular assumption also makes the classification task much easier).
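In code, this definition of a viral tweet amounts to a simple threshold on the Re-tweet count. Below is a small sketch; the names `VIRAL_THRESHOLD` and `is_viral` are made up here, and the counts are invented sample values rather than the dailyrt.com data.

```python
from statistics import median

VIRAL_THRESHOLD = 30  # from the post: more than 30 Re-tweets counts as viral


def is_viral(retweet_count, threshold=VIRAL_THRESHOLD):
    """Label a tweet as viral when its Re-tweet count exceeds the threshold."""
    return retweet_count > threshold


# Made-up Re-tweet counts, for illustration only (not the collected sample).
sample_counts = [3, 12, 17, 25, 30, 31, 140]
sample_median = median(sample_counts)
labels = [is_viral(count) for count in sample_counts]
```

Note that a tweet with exactly 30 Re-tweets is not labeled viral here, since the definition says "exceeds".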
As discussed above, usage data tell us nothing about the content of a tweet. They tell us the name of the author, the number of his or her followers, when the tweet was posted and how many minutes have elapsed since then. Can this information alone predict whether a tweet will become viral? A data mining model (built without the elapsed time as an input field) predicted whether a tweet would become viral with an overall accuracy of 75.03% and, perhaps as expected, showed that the most important factor in making a tweet viral is its author. Running a process called Feature Selection confirms just that.
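The post does not say which Feature Selection method was used. Purely as an illustration of the idea, the sketch below ranks categorical fields by information gain, one common criterion. The rows are invented toy data, not results from this exercise, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, field, target="viral"):
    """Entropy of the target minus its expected entropy after splitting on `field`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Made-up rows, for illustration only; they are not the collected data.
rows = [
    {"author": "a", "day": "Mon", "viral": True},
    {"author": "a", "day": "Tue", "viral": True},
    {"author": "b", "day": "Mon", "viral": False},
    {"author": "b", "day": "Tue", "viral": False},
]
# Fields sorted from most to least informative about the target.
ranking = sorted(["author", "day"], key=lambda f: information_gain(rows, f), reverse=True)
```

Numeric fields such as the number of followers would have to be binned first. Plain information gain also tends to favor fields with many distinct values, such as usernames, so a ranking like this should be read with some care.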
But what we have seen so far tells us only one side of the story, the data mining side. With text mining we can see the importance of both words and authors. To do that, the author is appended to the end of each tweet, so the author essentially becomes part of the tweet text. Feature Selection is then applied to this combined text.
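The author-appending step described above can be sketched as follows. The `author_` prefix, the whitespace tokenizer and the function names are assumptions made for this example; the post only states that the author is appended to the tweet text.

```python
from collections import Counter


def with_author_token(text, author):
    """Append the author to the tweet text so it is treated like any other word."""
    # The "author_" prefix is an assumption that keeps usernames apart from ordinary words.
    return f"{text} author_{author.lower()}"


def tokenize(text):
    """Very rough tokenizer: lowercase the text and split on whitespace."""
    return text.lower().split()


# Made-up tweets and usernames, for illustration only.
tweets = [("Big news today", "alice"), ("More big news", "alice"), ("Lunch time", "bob")]
term_counts = Counter(
    token
    for text, author in tweets
    for token in tokenize(with_author_token(text, author))
)
```

Once authors appear as tokens, any term-level scoring treats them exactly like words, which is what makes it possible to compare the importance of authors and words on the same scale.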
The more difficult, but also more interesting, task is to predict a viral tweet that has an impact because of its content rather than its author. For that, the methodology of data collection and analysis differs significantly.
In the next post we will see a model for predicting viral tweets in action: we will submit several tweets together with their authors, and the model will tell us the probability that each submitted tweet becomes viral.






