When you were a child, and the grownups asked what you wanted to do when you grew up, you said, “I want to do text analytics on Big Data!” Right?

Well, maybe not. Maybe you wanted to be…  a rock and roll star. Or President of the United States.  Or an astronaut! Those are jobs to dream about, glamorous jobs.


Nobody rises to fame alone. Behind the rock star, there are roadies, security staff, movers, makeup artists, costumers, caterers and drivers. There are a hundred or more unsung workers who are not famous individually, but still a necessary part of the glamour machine. A Presidential campaign depends on thousands of staffers and volunteers. And a space mission involves much more than a few astronauts.

Want to go into space and get back alive?  You’ll need a team. A big team. Data collection, management and analysis requirements for a space mission demand a substantial staff.  This is sophisticated stuff!  Just think of the monitoring you’ll need:

Astronauts’ physical condition and medical information

Geodesy (spacecraft location) and gravitational fields

Meteorology – cloud cover and radiation balance

Atmospheric physics

Air density from drag and non-gravitation forces

Ionospheric physics

Magnetic fields

Cosmic rays and trapped radiation

Electromagnetic radiation (UV, X-ray and gamma)

Interplanetary medium


You’ll need all this data and have a lot of sophisticated calculations to make, real time. The consequence of failure? Death.


And here you were worried about taking on text analytics with Big Data.

Computing resources

NASA has a lot of resources, of course, but so do you. Care to guess how much computing power a space mission requires? Let’s compare it to the familiar devices we use for everyday missions. An iPhone 4S has two core memory chips, each with 512 MB, so about a GB altogether. How many iPhones’ worth of core computing power does it take to get out of Earth’s atmosphere and back alive?

Would you believe… less than one? How about much, much less than one? It’s been done with… 300 kB. That’s “k,” for “kilo-”: not tera, not giga, not even mega. That was the core memory of the IBM 709, which powered the Mercury program, America’s first manned space missions. (I recently had the pleasure of discussing this with one of the many unsung yet remarkable people who made space flight happen, Lucy Simon Rakov, a programmer on the Mercury project. She and fellow Mercury programmer Patricia Palombo recently received the National Center for Women & Information Technology’s Pioneer Awards, and they certainly are computing pioneers!)

Three hundred kB? How is that possible? By today’s standards, they didn’t have much computing power, so they used brain power. This remarkable resource is still available today! So let’s use it to make the most of our data, Big or little.
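Using the article’s own figures (two 512 MB chips versus 300 kB of core memory), the gap is easy to quantify:

```python
# Back-of-the-envelope comparison using the figures cited above.
MEGA = 1_000_000
KILO = 1_000

iphone_bytes = 2 * 512 * MEGA   # two 512 MB memory chips in an iPhone 4S
mercury_bytes = 300 * KILO      # IBM 709 core memory that flew Mercury

ratio = iphone_bytes / mercury_bytes
print(f"One iPhone 4S holds about {ratio:,.0f}x the memory that flew Mercury")
```

Thousands of Mercury-era memories fit inside one phone, yet the missions still came home safely.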

Big Data

There’s no official definition of Big Data. If it’s big enough to pose a challenge for you, it’s Big. Big Data is often characterized by the three Vs – Volume, Velocity and Variety. Volume simply means large quantity. Velocity means that data is being created, or collected, rapidly. And Variety implies many forms of data – numbers, categories, audio, video and others. Text is the largest component of most business data sources, and the type of Big Data that generates the most interest today.

Text itself can exhibit a lot of variety. It comes from many sources – social media, help inquiries, email, texts, surveys and many others. It can appear in any of hundreds of languages, as well as their variants. Text can be formal, as in news articles; informal, as in everyday email and posts on social media platforms like Facebook; or ridiculously informal, as on Twitter! It may be cryptic because of the specialized language of a particular industry or workplace, or due to deliberate encoding to disguise the real message, as in communications about illegal activity.

You may think, “Hey, we’re only looking at everyday talk, and we’re the US, so English is enough. All that stuff about different languages doesn’t apply to us. We can get by without that.”

What would you think if you heard someone say, “African-Americans aren’t important to US business. We can just ignore them.” You’d probably think they were real fools. That isn’t a respectful attitude, and it’s not good for business either, since 12% of the US population is African-American.

If you’re one of those who is quietly ignoring non-English text, please wake up and smell the coffee. Over 20% of Americans speak a language other than English at home. Thirteen percent of Americans are Latino. Between 2000 and 2010, more than half the population increase in the US was among Latinos. You wouldn’t take for granted the 1 in 8 Americans who are African-American, so why do so many business people neglect our even larger Latino population? Our fastest-growing ethnic group is Asians, so expect to deal with much more variety in text analytics very soon.

If you’re still not convinced that understanding American consumers means dealing with multiple languages, then consider that the competitive landscape is global. You may be reluctant to deal with multiple languages, but your competition is not.

Text analytics

A market study released last year revealed a shocking truth about text analytics: most people who attempt it do not achieve positive return on investment. (For more details, read Alta Plana’s “Text/Content Analytics 2011: User Perspectives on Solutions and Providers.”) That’s a shame, and it’s not necessary. The reason so many businesses don’t get positive returns from text analytics is simply that they never started with a clear plan to get there. How does Big Data fit into the picture? We need even better plans! To achieve positive returns, you need to start with clearly defined goals that tie directly to the bottom line – increase revenue or decrease costs – and work backwards to define the steps to reaching those goals.

Simplify processes and reduce your costs along the way. Ask yourself – what data do you really need? Can you select just the most relevant data and focus on that to reduce costs? Can you sample? How about starting with a small pilot to test your ideas before biting off the big budget?
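As a sketch of the sampling idea, here’s a minimal way to draw a reproducible random sample for a pilot before committing to the full corpus. The corpus and the 1% fraction are made up for illustration:

```python
import random

def pilot_sample(records, fraction=0.01, seed=42):
    """Draw a small random sample of records for a pilot study."""
    rng = random.Random(seed)  # fixed seed so the pilot is reproducible
    return [r for r in records if rng.random() < fraction]

# Pretend corpus: 100,000 documents. A 1% pilot tests the idea on
# roughly 1,000 documents before the big budget gets committed.
corpus = [f"document {i}" for i in range(100_000)]
pilot = pilot_sample(corpus)
print(f"Pilot size: {len(pilot)} of {len(corpus)} documents")
```

If the pilot shows positive returns, scale up; if not, you’ve spent 1% of the budget finding out.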

Let me give you an example. I work in text analytics full-time. Some of my clients have a lot of data. Think government applications, social media from all over the world, looking for suspicious activity – lots of data. One such client approached me about an application involving so much text that just testing was going to involve several powerful computers running full tilt. I suggested that he use a selective cross-lingual process, one which worked with the text in its native language, and only on the text that was relevant to the topic of interest.

Although he seemed to appreciate the logic of my suggestions and the quality benefits of avoiding translation, he just didn’t want to deal with a new approach. He asked to just translate everything and analyze later – as many people do. But I felt strongly that he’d be spending more and getting weaker results. So, I gave him two quotes. One for translating everything first and analyzing later – his way, and one for the cross-lingual approach that I recommended. When he saw that his own plan was going to cost over a million dollars more, he quickly became very open minded about exploring a new approach.
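To make the selective cross-lingual idea concrete, here is a minimal, hypothetical sketch: screen each document in its native language against per-language terms of interest, so only the relevant fraction ever moves on to expensive analysis or translation. The term lists and documents below are invented for illustration; a real system would use proper native-language text analytics rather than simple keyword matching.

```python
# Hypothetical per-language terms of interest. In practice these would
# come from native-language linguistic resources, not a hand-made list.
RELEVANT_TERMS = {
    "en": {"shipment", "transfer"},
    "es": {"envío", "transferencia"},
}

def is_relevant(text, lang):
    """Keep a document only if it mentions a term of interest
    in its own language -- no translation step required."""
    terms = RELEVANT_TERMS.get(lang, set())
    return bool(set(text.lower().split()) & terms)

docs = [
    ("the shipment arrives tuesday", "en"),
    ("nos vemos mañana", "es"),
    ("confirmando la transferencia", "es"),
]
relevant = [(text, lang) for text, lang in docs if is_relevant(text, lang)]
print(f"{len(relevant)} of {len(docs)} documents pass the native-language filter")
```

Only the documents that pass the filter incur downstream cost, which is where the million-dollar difference comes from.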

Unless you have a money tree, you too should examine the ways that you deal with Big Data, and look for good opportunities to simplify processes and reduce costs.

The bottom line

We’re using way too much brute force in Big Data analysis today. (Seriously, if I had a dollar for every man who has bragged to me about the size of his Hadoop cluster… )

Take a lesson from NASA. We’ll get better results by using less computing brawn, and more human brains!  Start every analytics project with a meaningful plan, because planning and brainpower are the keys to ROI.