Anderson Analytics’ Tom H. C. Anderson speaks with Seth Grimes about text mining.
If you know of text analytics, you know of Seth Grimes; he is a true Text Analytics Guru. While I may know everything there is to know about text analytics in market research, Seth’s knowledge is far broader, encompassing all of Business Intelligence (BI).
TomHCA: Welcome Seth, so happy to finally get to interview you for the blog. I very much enjoy your text analytics conferences and events!
Seth Grimes: Thanks Tom, and I’ll start by saying that I’ve learned a lot from your writing and presentations [Anderson Analytics], for instance about “triangulation” methodologies for NGMR.
TomHCA: Thank You Seth, Likewise! As most of NGMR’s readers are market researchers, can you tell us a bit about how you define BI, and also of course your definition of Text Analytics?
Seth Grimes: Business Intelligence is a confluence of information, analysis software, and business processes that transform data into insights that support better business decision making.
Most of the BI world — especially software heavyweights including IBM/SPSS, SAS, SAP/Business Objects, Microsoft — has defined BI as analysis of sales, marketing, customer transactions, and other data from operational systems. But now they’re all seeing how limiting this view is, that customers need and want to bring social media, enterprise feedback, online news, and other “unstructured” sources into enterprise BI initiatives.
This “unstructured data challenge” is why I got into Text Analytics, and it’s the starting point for my latest conference, Smart Content, covering content analytics. It’s slated for October 19 in New York. We have some great speakers so do check it out.
In sum, getting back to the question: It wouldn’t be inapt to define Text Analytics as Business Intelligence focused on text.
TomHCA: Text Analytics has been around, well, I guess you could say since just after WWII with the first crypto/translation related efforts. Given this, are you surprised we’re not farther along than we are now?
Seth Grimes: Yes, text analytics has been around for a long time. IBM researcher Hans Peter Luhn published seminal papers in the ’50s that actually defined BI as knowledge extraction from text, but it’s obvious in retrospect why BI became something else: analysis of sales, financial, and marketing data and the like. That data is “low hanging fruit”: easy to analyze, containing A LOT of business value.
Contrast automated text analysis. In the words of expert systems pioneer Edward Feigenbaum, “Reading from text in general is a hard problem because it involves all of common sense knowledge.” Further, the link between information captured in text and business challenges is frequently not direct.
Net result: Business focused on the easier analysis need, on data in databases, but now we’ve begun to see the true potential of automated text analytics and we finally have the tools to do the job well.
TomHCA: When I started Anderson Analytics in 2005 with the aim of bringing text analytics to market research, it seemed no one in my field had heard of it. Then later in 2007 Nielsen (BuzzMetrics) and TNS (Cymfony) got into it a little for the sake of social media. Now on the other hand, perhaps especially because of Twitter, it seems to be one of the hottest buzz words around! Is it just me or does there seem to be explosive growth in just the past 2-3 years? Surely it isn’t all due to social media monitoring. How have you seen the use of Text Analytics evolve more recently?
Seth Grimes: Yes, you were out ahead, and I think it’s in 2005 that we first met, at the first Text Analytics Summit. When I started looking at text analytics in 2002 or so — check out, for instance, The Word on Text Mining from 2003 — really only folks in life sciences and intelligence were using the technology.
Now there are solutions for every industry and business function that can benefit, and that means every organization that’s online or communicates electronically in any way. The growth in awareness and uptake does come from user-generated content (social platforms and also e-mail and messaging), and also because publishing, marketing, advertising, and customer support have shifted their primary focus to online and other electronic channels.
Further, there are solutions that range from traditional installed software to online, as-a-service offerings, both free-standing text analytics for those who want it and, most importantly, built into line-of-business applications where the user doesn’t even know she’s doing text analytics.
TomHCA: OK, How about Natural Language Programming (NLP)? It seems to me, based on the many vendors we have worked with and investigated, that everyone claims their software is using some state of the art NLP algorithm. And of course it’s usually completely black box. It seems there is ‘A LOT’ of hype here. What are your thoughts about where we really are and where firms ‘claim’ to be. Is there a gap? What should customers look for?
Seth Grimes: Natural Language Processing: We could get into a deep discussion about statistical approaches versus the use of lexicons and grammar rules, and also machine learning. The science is published for anyone who wants to learn it, but most folks in business don’t want to, nor do they need to.
Business wants solutions that “just work,” and they can have them.
Fortunately, solutions are testable: How well do they “just work” for your own business problems, whether in market research or competitive intelligence or customer service? Sure, there’ll likely be a gap, wide if you choose the wrong solution, bridgeable if you choose well. There’s no one-size-fits-all set of selection criteria. I make this point over and over again to consulting clients, and also that if you create the right short list you’ll be most of the way there, to a solution that “just works” (for you).
TomHCA: Yes, makes sense… Taking sentiment as an example, a lot of fuss is made about how accurate this is, yet mostly it seems sentiment is off by +/- 20-30%. What are your thoughts about where software vendors say they are and where we actually are in this regard? Also, does it really matter? I mean, as long as it’s consistently off, differences can be measured, right?
Seth Grimes: Untrained tools can be 50% accurate in sentiment classification, or untrained they can top 80% if they were designed for the business problem at hand. Train them, and you can beat 90%, which is as good as the agreement you’ll likely get out of two humans. But this is a red herring: The argument is a distraction.
The simple fact is that computers are faster (and yes, more consistent) than humans. Computers handle huge volumes of information, working 24/7, very often allowing you to tap information sources that would have been inaccessible ten years ago.
So the simple answer, for now, is to take a hybrid approach that combines human knowledge and judgment with machine power. You’ll get better results than with either humans or machines alone.
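One way to picture the hybrid approach is confidence-based routing: the machine labels the comments it is sure about and hands the rest to a person. The sketch below is only an illustration under stated assumptions. The word lists, the `score` and `route` functions, and the 0.6 threshold are all hypothetical, and a real tool would use a trained model rather than word counts.

```python
# Hypothetical word lists; a real system would use a trained model or a fuller lexicon.
POSITIVE = {"great", "love", "helpful", "clean", "friendly"}
NEGATIVE = {"rude", "dirty", "slow", "broken", "terrible"}


def score(text):
    """Return (label, confidence) based on simple lexicon counts."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    if total == 0:
        return "unknown", 0.0   # no sentiment words found
    if pos == neg:
        return "mixed", 0.0     # evenly split, so no confidence either way
    label = "positive" if pos > neg else "negative"
    return label, abs(pos - neg) / total


def route(comments, threshold=0.6):
    """Split comments into machine-labeled items and a human review queue."""
    auto, review = [], []
    for text in comments:
        label, confidence = score(text)
        if confidence >= threshold:
            auto.append((text, label))
        else:
            review.append(text)
    return auto, review
```

Raising the threshold sends more comments to people and fewer to the machine, which is the trade-off between volume and judgment that the hybrid approach is meant to balance.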
TomHCA: Yes, that’s what we have found, and I like that “it’s a distraction”, I may borrow that.
So what industries do you feel have used Text Analytics in creative ways, can you give some examples?
Seth Grimes: We all use text analytics. Here’s an example: Type “map massachusetts” into Google or Bing. You’ll see, first up, a map of Massachusetts. That’s because the data scientists have studied searches and they understand that a searcher who asks a search engine “map ” probably wants a map rather than a list of documents containing those words. And they did some “named entity recognition” that sees “massachusetts” as a geographic area. This is text analytics, and it’s creative, and most important, it delivers very broad business value.
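The search example can be sketched in a few lines. This is a toy illustration, not how Google or Bing actually work: the `PLACES` gazetteer, the `INTENT_WORDS` table, and the `interpret` function are hypothetical names invented here, and production systems learn these patterns from search logs rather than from hand-built lists.

```python
# Hypothetical gazetteer and intent keywords, for illustration only.
PLACES = {"massachusetts", "new york", "washington dc"}
INTENT_WORDS = {"map": "show_map", "weather": "show_weather"}


def interpret(query):
    """Return (intent, place) for a short search query."""
    tokens = query.lower().split()
    text = " ".join(tokens)

    # Intent: the first keyword that signals what the searcher wants;
    # otherwise fall back to an ordinary document search.
    intent = next(
        (INTENT_WORDS[t] for t in tokens if t in INTENT_WORDS),
        "document_search",
    )

    # Named entity recognition, reduced to a gazetteer lookup on whole words.
    # Longer names are tried first so multi-word places are matched in full.
    place = None
    for name in sorted(PLACES, key=len, reverse=True):
        if f" {name} " in f" {text} ":
            place = name
            break
    return intent, place
```

With these assumed tables, `interpret("map massachusetts")` yields the intent `show_map` and the place `massachusetts`, while a query with no intent keyword or known place falls through to a plain document search.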
Other examples? I love one from Gaylord Hotels, which used software from Clarabridge, a vendor that focuses on customer experience management (CEM). Here’s a case-study quotation:
“Automated analysis of survey comments showed that customer experience was measurably enhanced when bell services staff accompanied lost guests to their destinations within a resort, as opposed to merely pointing them in the direction they needed to go.”
Creative is great, but there are much more compelling reasons to try automated text analytics. I remember a presentation by an EDS staffer back in 2005, who said that his company cut processing time for large-scale employee surveys from 5 staff-days to half a day. (That was using Megaputer’s PolyAnalyst software.) That kind of ROI is pretty convincing.
TomHCA: Yes, Hospitality industry certainly is rich with VOC data, and we’ve done a lot of interesting work there as well with firms such as Starwood Hotels and Flyertalk for instance. But, how about the other side of the coin, are there any specific industries that you feel are behind the curve considering the potential ROI of text analytics for them?
Seth Grimes: There’s been across-the-board uptake, sometimes more enthusiastic, sometimes less. To me, the real behind-the-curve issue involves users who handle text in isolation. I’m thinking in particular of Social Media Analytics (SMA) (which relies on text analytics). I’m getting tired of people who think the business goal of social-media use is to gain followers, friends, and connections, that success is measured in social-media mentions and “retweets.”
That attitude is silly. Social ROI is properly measured in the ability to drive business outcomes, and that means sales and cost reductions.
Social followers have no value unless they contribute to the corporate bottom line. The only correct way to measure social ROI is to link mentions to transactions: product and service sales, resolution of customer issues, etc. Linkage entails bridging social media with enterprise operational systems. Text analytics enables semantic integration. If you’re not working toward integration, toward data fusion, that’s when you’re behind the curve.
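Linking mentions to transactions amounts to a join between social data and operational data. Here is a minimal sketch, assuming each mention has already been resolved to a customer ID, which is the hard part that text analytics and identity matching would have to supply. The records, field names, and the `revenue_by_mentioner` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: social mentions already resolved to a customer ID,
# and sales transactions exported from an operational system.
mentions = [
    {"customer_id": "c1", "text": "Loving the new resort package"},
    {"customer_id": "c2", "text": "Thinking about booking"},
    {"customer_id": "c3", "text": "Great spa"},
]
transactions = [
    {"customer_id": "c1", "amount": 450.0},
    {"customer_id": "c1", "amount": 120.0},
    {"customer_id": "c3", "amount": 80.0},
    {"customer_id": "c9", "amount": 300.0},
]


def revenue_by_mentioner(mentions, transactions):
    """Total transaction revenue for each customer who mentioned the brand."""
    totals = defaultdict(float)
    for t in transactions:
        totals[t["customer_id"]] += t["amount"]
    # Customers who mentioned the brand but never bought show up with 0.0.
    return {m["customer_id"]: totals.get(m["customer_id"], 0.0) for m in mentions}
```

In this made-up data, customer `c2` posts but never buys, which is exactly the kind of follower whose value a mention count alone would overstate.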
TomHCA: Interesting and challenging. You’re down in Washington DC, lots of Pentagon, NSA, CIA, FBI contract work. Some of the government stuff I’ve seen in the past has been pretty darn low tech. I’ve often had the feeling that what we’ve been using in market research has been more powerful. I realize this probably isn’t what most people would think given what we see on the tube and Hollywood screen with RAPTOR listening into every phone conversation and email. So what’s the truth here in your opinion. Is government further ahead as I’m sure they’d like everyone to believe, or is this false?
Seth Grimes: The government is ahead and behind. The government is early to recognize, cultivate, and adopt new technologies (think of work at DARPA and funded by the CIA’s venture arm, In-Q-Tel), but the government remains plagued by insularity, mismanagement, territoriality, and political meddling when it comes to procurement, information sharing, technology scale-out, and executing on intelligence.
I’ll add that I’d absolutely love to work with government agencies on text analysis and semantic challenges, but as an independent, I can’t afford to work the procurement bureaucracy. It’s a shame.
TomHCA: Yes, working with procurement sucks, especially academic and government. How about other industries? Pharma or Finance for instance. I know the Finance industry was quite an early adopter. Can you speak at all to how effective predictive models using text analytics have been in predicting stock price, for instance?
Seth Grimes: Yup, pharma. My buddy Breck Baldwin of Alias-I thinks it’ll be just a few years before a Nobel Prize award for physiology or medicine will have involved the use of text analytics — mining scientific and clinical literature — for drug discovery or related goals.
The modeling problem in finance is trickier. People have been looking for “systems” for a long time. It’s not irrelevant that the early development of statistics was linked to gambling or that “Monte Carlo” methods, named for a casino locale, are a key simulation technique. Gambling and finance are kissing cousins.
Now we understand that news can move markets, and we see the possibility, via text analytics, of automating the extraction of information from news that can be incorporated into models. The trick is extracting the right signals, quickly, and linking them to all the rest of the market data that’s out there in ways that can reliably inform trading strategies.
Does it work? Got me. But there are certainly folks out there who are trying. Check out, for instance, Thomson Reuters News Analytics.
TomHCA: For others who want to get their hands real dirty, which computer languages have you found to be better/worse for handling text analytics? And how about free/academic resources for sentiment and/or NLP?
Seth Grimes: There are lots of ways to do text analytics, and not all of them require getting deep into the technology. You can find business-focused solutions that address business needs and problems, for instance, for survey or qualitative research or social CRM (Customer Relationship Management). But you’re right, users who want a high-performing solution may have to build (or extend) it themselves or work with a services provider that can handle that technology.
Do-it-yourselfers can try traditional, installed software. There are many choices, including open source tools such as GATE, RapidMiner and modules for programming languages such as Python and Java services.
Or check out as-a-service semantic tagging, accessed via a Web application programming interface. Examples are Thomson Reuters’ Calais and Evri, which focus on entities and terms; AlchemyAPI, which adds in concepts and topics; topic-focused TextWise; Open Amplify for relationships and intent signals; and Lexascope from Lexalytics and the Clarabridge API for sentiment.
There are other options: Text-analysis solutions from companies including IBM, SAS, SAP, Attensity, TEMIS, Open Text, SRA, and others; search-focused technology from Autonomy, Endeca, Exalead, and Open Text; and a myriad of “listening platforms” that focus on social media. If you don’t mind a plug: Advising users on solutions and strategy, and vendors on product and market positioning, is a large part of my consulting practice. Also, folks who want to learn more will have a great opportunity at the up-coming Smart Content content analytics conference, October 19 in New York.
TomHCA: Thanks Seth, it certainly continues to be an interesting time for us.
Seth Grimes: Tom, thanks for the opportunity to do a bit of market education.
Text analytics and semantics can and should be part of Next Generation Market Research initiatives, so I was glad to have a chance to explain how.
TomHCA: Always a pleasure Seth
@TomHCAnderson
Managing Partner
Anderson Analytics, LLC
[More on Seth – Seth Grimes is an analytics visionary: A consultant, writer, and industry analyst working in text analytics, business intelligence, data analysis and visualization, and information strategy as applied to information-age challenges. Seth founded consultancy Alta Plana in 1997 and is a long-time contributing editor at TechWeb’s IntelligentEnterprise.com, a channel expert at TechTarget’s BeyeNETWORK, and founding chair of the Smart Content: The Content Analytics Conference, the Text Analytics Summit, and Sentiment Analysis Symposium.]