Sentiment analysis isn’t perfect, and anyone who has tried to do it with social media data will confirm that. The nuances of language, including sarcasm, emoticons, slang, spelling errors, creative grammar, and more, mean that 100% accuracy is simply unattainable. But in market research, we aren’t looking for 100% accuracy, or even 90% accuracy. We know those kinds of numbers are unrealistic. What we do expect, however, is to see that social media data has some relationship with real-world data. And that is what we investigated here.
This project began with finding a third-party source of fuel prices, and we turned to Gasbuddy to give us average monthly US gas prices. Because we estimated the data points by carefully eyeballing a chart on the screen, the Gasbuddy numbers aren’t accurate to the last decimal place. But if you compare our Gasbuddy chart with the official chart, you’ll see that the trend is accurate. This is our criterion dataset.
The second dataset came from Conversition’s Evolisten database. We collected hundreds of thousands of verbatims from thousands of websites, all of which in some way referenced fuel or gas prices or costs. Twitter, Facebook, YouTube, Flickr: any type of website where people felt like sharing their opinions about gas prices was our target. After cleaning out the spam, we measured the sentiment of the remaining opinions. Then, we reversed the sentiment scale so that higher numbers mean more negative sentiment. For example, a score of 5 (very positive) was changed to 1 (very negative), and a score of 1 was changed to 5.
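For readers who want to see the score reversal spelled out, here is a minimal sketch. It assumes a 1-to-5 integer scale, as in the example above; the function name `invert_score` is ours for illustration and is not part of Evolisten.

```python
def invert_score(score: int, low: int = 1, high: int = 5) -> int:
    """Reverse a sentiment score on a low..high scale (5 -> 1, 1 -> 5)."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the {low}-{high} scale")
    # Mirroring around the midpoint: on a 1-5 scale this is simply 6 - score.
    return low + high - score


# Hypothetical scores, only to show the mapping.
examples = [invert_score(s) for s in (1, 2, 3, 4, 5)]
```

The midpoint (3 on a 1-to-5 scale) stays where it is, and everything else flips to the opposite side.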
What you see in this chart is a correlation of 0.65 between gas prices and the inverted sentiment scores. In other words, as the price of gas increases, sentiment decreases.
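If you want to try this kind of comparison yourself, a rough sketch of a Pearson correlation follows. We are assuming Pearson's r here since the post does not name the coefficient, and the numbers are made up for illustration; they are not the Gasbuddy or Evolisten data. The helper name `pearson` is also ours.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / sqrt(var_x * var_y)


# Made-up monthly values, NOT the real data from this project.
prices = [2.60, 2.75, 2.90, 3.10, 3.40]      # average gas price per month
raw_sentiment = [3.4, 3.3, 3.1, 3.2, 2.8]    # average sentiment, 1-5 scale
inverted = [6 - s for s in raw_sentiment]    # reversed scale, as described above

r_inverted = pearson(prices, inverted)       # positive: both rise together
r_raw = pearson(prices, raw_sentiment)       # same size, opposite sign
```

Reversing the scale only flips the sign of the correlation, which is why a positive number against inverted scores reads as "prices up, sentiment down."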
It just makes me think… what if everyone tweeted and messaged that the price of gas was really low? Could we turn this correlation into causation? It’s worth a try!