Big Data, Text Analytics and Privacy
Today, as part of the Next Gen Market Research Guru series, I’m interviewing Jeff Jonas. Jeff is an IBM Distinguished Engineer and Chief Scientist, IBM Entity Analytics. For a quick background on Jeff, check out Tuesday’s post, where I shared an interesting video highlighting some of Jeff’s ideas on Big Open Data.
Tom: You’re an IBM Distinguished Engineer and Chief Scientist, IBM Entity Analytics. Can you tell us what Entity Analytics is and what it is you do?
Jeff: Entity Analytics is the remnants of my Systems Research and Development (SRD) company. I founded SRD in the early ’80s, and IBM acquired it in January of 2005. My work focuses on people, organizations, things, places – call these “entities” – things that can be uniquely counted. The analytics part refers to making sense of entity data. Detecting that the employee a retailer hired had previously been arrested for stealing from the same retailer would be an example of entity analytics. My work involves doing this in real time, so organizations can avoid making such blunders in the first place.
Tom: You’ve said we’re going to get ads that are so tailored for us that we’re going to say “I love you!”, yet at the same time I’ve noticed people complaining about services like Rapleaf and ReTargeter that allow marketers to advertise to people based on where they’ve been on the web. What do you make of this? Will people become more comfortable with marketers knowing everything about them?
Jeff: My Facebook ads say things like “Are you a triathlete, 47 years old and want abs like this?” Let me tell you, that is a well-crafted ad for me. I didn’t click on it, though, as I do not want to encourage them. Where it gets more problematic is when consumers are not opting in. Add to this the collection and use of data that would surprise consumers (e.g., their web activity) and eyebrows really start getting raised. I prefer opt-in models.
Tom: You’ve talked about “Spear Phishing”, a darker side of data mining. Can you tell us briefly how you define Spear Phishing and why it’s inevitable?
Jeff: Phishing is a method some criminals use to acquire sensitive information such as usernames, passwords, credit card information, etc. Most folks see such emails from time to time. They stand out because they are so lame. Now imagine an email that is so personalized that you hand over such information to the criminal – and to this day you don’t realize you have been duped. Now that is spear phishing. As targeted ad engines get better (“Are you a triathlete, 47 years old and want abs like this?”), mass spear phishing may follow suit, whereby machines deliver malicious emails so tailored that over half the people respond. Millions of people.
Tom: How do you think we will reconcile protection from Spear Phishing with legitimate use of data mining and social media analytics such as marketing and marketing research?
Jeff: Criminals try to take advantage of technology the same way as legitimate businesses do. And law enforcement uses many of the same technologies to catch the bad guys too. Cops and robbers. Not sure how to change this cycle.
Tom: Can you describe what a more legitimate use of ‘Spear Phishing’ might look like, or can these techniques only be used for bad?
Jeff: Well, being that Phishing involves trying to acquire sensitive information by masquerading as a trustworthy entity — I cannot think of a legitimate purpose for Phishing.
Tom: What do you think about various privacy measures such as “Do-Not-Track” etc. currently being discussed in government?
Jeff: I would prefer to see opt-in models rather than opt-out models. In any case, I also think location privacy is going to become a very important debate. Geospatial movement data is something I refer to as “Analytic Super Food.” Knowing how someone moves day in and day out can be used for some pretty remarkable predictions, e.g., what street corner you will be at next Thursday at 5:57pm. I blogged about this in more detail here: Your Movements Speak for Themselves: Space-Time Travel Data is Analytic Super-Food!
Tom: You talk a lot about “information in context”. What kind of data is this intended for? You talk about how it can be used in Spear Phishing, but what are some of the good uses? Is it better for finding the needle in a haystack (like data mining for terrorists), or analysis in aggregate (like marketing research)?
Jeff: If your Facebook page has a map with a pin in it showing what country you are going to next, that information would sharpen one’s message – whether that be a travel recommendation and link generated by a spear phisher or a bona fide hotel ad. My definition of context is “better understanding something by taking into account the things around it.” More context improves understanding and estimation, something both needle-in-a-haystack searches and overall market statistics benefit from.
Tom: How, if at all, can the idea of data in context be used to better leverage social media listening in order to understand customer needs better?
Jeff: If observational data related to social media listening includes such things as title, duration of listening, and so on … then this additional contextual data, when added to other data, would likely increase understanding.
Tom: I understand what you’re saying in theory, that computers need to look at the context and larger picture rather than a single puzzle piece, but is this really possible? Where do human analysts need to enter the picture?
Jeff: If a computer is determining which ad to deliver to a user on a web page, whether they are using accumulated context or not, this happens at a rate that prevents any human review. Now contrast this with analytics used by a taxing authority that selects companies for a tax audit. Such a system produces a suggestion, and then humans take a look at the details. There are many scenarios where the analytics should not be on the trigger, so to speak.
Tom: How does info in context relate specifically to text analytics, if at all?
Jeff: Hidden in text are entities – names of people, places, things, dates, times, consumer sentiment towards products, social circles, and more. When this data is commingled with other data, a clearer picture may emerge. In short, if information trapped in text can add to one’s understanding … then it is valuable.
Tom: What is your opinion on the current state of text analytics and natural language processing?
Jeff: Entity extraction algorithms, software that can pull entities and concepts out of text, have a long ways to go to get anywhere close to human quality. Last year I would have told you there have not been any major technical gains in this field in years. But, that may have changed this year. The IBM research system called Watson, the one that played Jeopardy! and won, is possibly the biggest breakthrough in the entity extraction and classification field in decades.
Tom: You’ve talked about big data and how analysis actually becomes faster the more data you have. For those of us who have crashed various analytics programs with larger data sets this is hard to understand. Can you explain a bit how this can be possible? And is this more theory than actual testing you have done?
Jeff: Back in 2006 I saw this happen. It was an accident actually. Imagine that – a system that gets more accurate and faster with more data. Sounds wild at first, but actually it is something quite simple. Why are the last few pieces of a jigsaw puzzle as easy as the first few? You have more data (puzzle pieces) in front of you than ever before. I have some more details about this here: Puzzling: How Observations Are Accumulated Into Context and here: The Fast Last Puzzle Piece.
Tom: Getting back to Privacy and Identity theft, what do you think should happen? Will we all have a number on our foreheads soon? How do you think our concept of privacy will change?
Jeff: Well, first I would say that our concept of privacy has already changed radically over the last decade. And more to come I am sure. Anyway, I don’t think it will be a number on our forehead. No, it will be a number though … your cell phone number … attached to your right or left ear. Knowing where you are and when will be used to take a real bite out of identity theft. By the same token, it is going to get harder and harder to have secrets – the trend being more transparency whether you like it or not.
But truthfully, most people seem to like it, as there seems to be more upside than downside. On that note, check out this very cool YouTube video entitled “Hans Rosling’s 200 Countries, 200 Years, 4 Minutes” that brings to life just how much better the world is getting.
Tom: Yes, I do like that video and area of research. One final question today: what magazines/websites do you use to stay current and get ideas?
Jeff: I don’t read. But when I do … it is Wired Magazine.
Tom: Love the honest answer. Wired’s one of my favorites as well!
Thank you, Jeff, for sharing your ideas with NGMR. You’re truly a Next Gen Researcher!