One of the most interesting application areas for data/text mining and information extraction is politics. I have started collecting information from various blogs, websites and forums and applying information extraction and data/text mining techniques to extract potentially useful knowledge in this area. By combining different pieces of information, one can identify trends that may hint at what lies ahead.
The latest developments in Greece are more or less known to most people who read international news. The situation is difficult, and what citizens write on blogs and forums can reveal the sentiment of Greek web users. For example:
- Which are the most frequently occurring words?
- Which are the most frequently occurring thoughts?
- What do citizens say Greek politicians have to change?
To answer these questions I have started collecting information from the top 120 Greek blogs, the OpenGov website (a state-run website where Greek citizens express their opinions) and a couple more Greek sites with economic content. For blogs and forums, a Java program scans for new information every 20 minutes.
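The core of such a collector is telling new posts apart from ones already stored. The following is only a minimal sketch of that step, written in Python for brevity rather than in the Java of the actual program. The post structure (a dict with an `id` and a `text`), the function name `select_new_posts` and the sample data are all assumptions made for illustration.

```python
from typing import Iterable

# The 20-minute interval mentioned above; a real collector would sleep
# this long between polling rounds.
POLL_INTERVAL_SECONDS = 20 * 60


def select_new_posts(fetched: Iterable[dict], seen_ids: set) -> list:
    """Return the posts whose 'id' was not seen before, and remember them."""
    fresh = []
    for post in fetched:
        if post["id"] not in seen_ids:
            seen_ids.add(post["id"])
            fresh.append(post)
    return fresh


# Two simulated polling rounds (hypothetical data, no network access).
seen = set()
round_one = [{"id": "a1", "text": "first post"}, {"id": "a2", "text": "second post"}]
round_two = [{"id": "a2", "text": "second post"}, {"id": "a3", "text": "third post"}]

first = select_new_posts(round_one, seen)
second = select_new_posts(round_two, seen)
print([p["id"] for p in first], [p["id"] for p in second])
```

Only the posts returned by each round would be passed on for analysis, so a post that stays on a blog's front page for hours is processed once.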
This information is then sent to an annotation engine, which analyzes the textual content. Once the text is analyzed we can, for example, produce a keyword vector that we can later use to understand what citizens are saying on the Web. We can then answer many interesting questions, such as:
- With which words is Mr George Papandreou (PM of Greece) associated?
- When a text contains very negative words (such as swearing), what other words are found in the same text?
- What does keyword trending tell us? (For example, we may identify an increasing number of swear words in citizen posts.)
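To make the idea of a keyword vector concrete, here is a rough sketch that counts the words in each text and then tallies which words appear alongside a chosen target word. It is not the annotation engine described above. The tiny stopword list, the function names and the English sample sentences are invented for illustration, and a real pipeline for Greek text would need proper tokenization and stemming.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {"the", "a", "is", "and", "of", "to", "for", "are"}


def keyword_vector(text: str) -> Counter:
    """Map each non-stopword token to its frequency in the text."""
    tokens = re.findall(r"\w+", text.lower())  # \w also matches Greek letters
    return Counter(t for t in tokens if t not in STOPWORDS)


def cooccurring_words(texts, target: str) -> Counter:
    """Count in how many texts each word appears together with `target`."""
    counts = Counter()
    for text in texts:
        vector = keyword_vector(text)
        if target in vector:
            counts.update(word for word in vector if word != target)
    return counts


# Invented sample posts, not real citizen comments.
posts = [
    "Taxes are too high and the government is slow",
    "The government promised reform",
    "Reform of taxes is needed",
]

print(keyword_vector(posts[0]))
print(cooccurring_words(posts, "government").most_common(3))
```

The same co-occurrence count answers the first two questions above: the target can be a politician's name or a swear word. Running it per day or per week would give a simple view of keyword trends.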
First let’s see some examples from the OpenGov website, where thousands of citizens have expressed their opinions on the tax policy of the Greek state. The following chart shows a number of pairwise correlations between the words written in these comments:
Under the red rectangle appear two words (dikigoros, iatros), which in Greek mean “Lawyer” and “Medical Doctor” respectively. This essentially tells us that these two professions are frequently mentioned together in citizen discussions. A closer look at these messages reveals that professionals in these two sectors are said to avoid taxes by not issuing receipts.
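This post does not say which correlation measure the chart uses. As one possible illustration, the sketch below computes the phi coefficient, which is Pearson correlation applied to whether each word is present or absent in a comment. The choice of measure, the function name and the toy comments are assumptions.

```python
import math


def phi_coefficient(docs, word_a: str, word_b: str) -> float:
    """Correlation between the presence of two words across documents.

    Each document is a set of tokens. Returns a value in [-1, 1],
    or 0.0 when one of the words occurs in every or no document.
    """
    n11 = n10 = n01 = n00 = 0
    for tokens in docs:
        has_a, has_b = word_a in tokens, word_b in tokens
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denominator == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denominator


# Toy comments as token sets (invented; transliterated Greek words).
comments = [
    {"dikigoros", "iatros", "apodeixi"},
    {"dikigoros", "iatros"},
    {"foros"},
    {"foros", "misthos"},
]

print(phi_coefficient(comments, "dikigoros", "iatros"))
print(phi_coefficient(comments, "dikigoros", "foros"))
```

In this toy data the two professions always appear together, so their coefficient is at the maximum. Real comments would give values somewhere in between, and the strongest pairs are the ones worth reading more closely.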
Next we could use association rule learning to look for some more potentially interesting rules:
The highlighted rule, although it has low support, could prove interesting: a subset of citizens are requesting that freelancers and the self-employed be more closely monitored for tax fraud.
Apart from rule learning, it is interesting to identify the proportion of the total dataset for which each rule holds (its support). That also gives us a sense of the relative prominence of different ideas and thoughts in the minds of citizens.
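The two standard quantities behind an association rule are easy to compute by hand. The sketch below uses the usual definitions of support and confidence. It does not reproduce the tool used for the rules above, and the keyword sets are invented for illustration.

```python
def support(transactions, itemset) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Each "transaction" is the keyword set of one comment (invented data).
comments = [
    {"freelancers", "tax", "audit"},
    {"freelancers", "audit"},
    {"tax", "salary"},
    {"freelancers", "tax"},
]

rule_support = support(comments, {"freelancers", "audit"})
rule_confidence = confidence(comments, {"freelancers"}, {"audit"})
print(rule_support, rule_confidence)
```

A rule with low support but high confidence is the kind of case highlighted above. Few comments mention the topic, but those that do tend to say the same thing.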
In the next post: what does the Voice of the Citizen tell us in blogs and forums?