Improving the accuracy of search is of utmost importance to companies like Google and Yahoo, and one of the best ways to do this is to incorporate text analytics (AKA text mining) into the back end.
Let’s take a typical enterprise search engine and break down the steps that go into an actual search. First, a database of unstructured content is fed into a pipeline, where it is converted into a structured document. That document is then fed into an index, and when a person queries the index, results appear.
Text analytics takes place within the pipeline, before the content is indexed. It analyses the content and extracts meaningful metadata, such as the entities being discussed, sentiment, and themes.
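To make the pipeline concrete, here is a minimal sketch in Python of the flow described above: raw text goes in, a structured document with extracted metadata comes out, and that document is added to an index. The word lists, the function names (`enrich`, `build_index`) and the sample sentences are assumptions made up for illustration. A real pipeline would use trained models rather than lookups, and would extract themes as well, which this sketch leaves out.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicons; a real pipeline would use trained models.
KNOWN_ENTITIES = {"google", "yahoo"}
POSITIVE = {"great", "improved", "powerful"}
NEGATIVE = {"poor", "slow", "broken"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def enrich(doc_id, text):
    """Turn raw text into a structured document with extracted metadata."""
    tokens = tokenize(text)
    # Crude sentiment: positive word count minus negative word count.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {
        "id": doc_id,
        "text": text,
        "tokens": tokens,
        "entities": sorted(set(tokens) & KNOWN_ENTITIES),
        "sentiment": sentiment,
    }


def build_index(docs):
    """Build an inverted index mapping each token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for token in doc["tokens"]:
            index[token].add(doc["id"])
    return index


# Made-up sample content standing in for the unstructured database.
raw = {
    1: "Google improved its search results.",
    2: "Yahoo search felt slow and broken.",
}
docs = [enrich(doc_id, text) for doc_id, text in raw.items()]
index = build_index(docs)
```

The point of the sketch is the ordering: the metadata is attached to each document before indexing, so it is available to every query that follows.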
The information gained from the text mining process can then be used to create a more efficient search. A common tool for this purpose is faceted search. If you have ever narrowed results with an advanced search option, such as filtering by category, date, or author, you have most likely used a form of faceted search. It is particularly useful because it enables cross-referencing across all of that metadata.
Faceted search engines come in a variety of complexities and flavours. Major retail websites use rudimentary faceted search to narrow down the categories in which you are searching, while databases of academic or legal documents may offer a more complex set of cross-referencing tools.
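As a rough illustration of how extracted metadata supports faceted search, the sketch below filters a handful of documents by facet values and counts how many documents fall under each value. The documents, field names, and helper functions (`facet_filter`, `facet_counts`) are hypothetical, and production search engines implement facets inside the index rather than by scanning a list.

```python
from collections import Counter

# Hypothetical enriched documents, as a text analytics pipeline might emit them.
DOCS = [
    {"id": 1, "entities": ["google"], "sentiment": "positive"},
    {"id": 2, "entities": ["yahoo"], "sentiment": "negative"},
    {"id": 3, "entities": ["google", "yahoo"], "sentiment": "negative"},
]


def matches(doc, field, wanted):
    """Check whether a document carries the wanted value for one facet."""
    value = doc[field]
    return wanted in value if isinstance(value, list) else value == wanted


def facet_filter(docs, **facets):
    """Keep only the documents that match every requested facet value."""
    return [d for d in docs if all(matches(d, f, v) for f, v in facets.items())]


def facet_counts(docs, field):
    """Count how many documents fall under each value of a facet."""
    counts = Counter()
    for doc in docs:
        value = doc[field]
        counts.update(value if isinstance(value, list) else [value])
    return counts


# Cross-reference two facets: documents that mention Google with negative sentiment.
hits = facet_filter(DOCS, entities="google", sentiment="negative")
sentiment_counts = facet_counts(DOCS, "sentiment")
```

Here `hits` contains only the third document, and `sentiment_counts` is the kind of tally a search interface might show next to each filter option.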
Text analytics is also crucial for word sense disambiguation: the process of determining which meaning of a word with multiple definitions is being used in a sentence.
In a typical string-based search engine, a search for a term with multiple definitions will yield results for every possible use of the word. With text mining, the context of the rest of the sentence or phrase in which the word appears is used to determine what the word refers to. When that knowledge is applied to search, it improves the relevance of the results.
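One simple way to picture context-based disambiguation is a simplified version of the Lesk approach: pick the sense whose definition shares the most words with the surrounding sentence. The sketch below is illustrative only. The two-sense inventory for "bank", the stopword list, and the `disambiguate` function are assumptions made up for this example, and it is not a claim about how any particular search engine does this.

```python
import re

# Hypothetical mini sense inventory; real systems draw on much larger lexical resources.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river edge": "sloping land beside a river or other body of water",
    }
}

# Small illustrative stopword list so that filler words do not count as overlap.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "on", "or", "that", "the", "to"}


def content_words(text):
    """Return the set of lowercase words in the text, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def disambiguate(word, sentence):
    """Pick the sense whose definition overlaps most with the sentence context."""
    context = content_words(sentence) - {word}
    senses = SENSES[word]
    # Ties resolve to the first sense listed in the inventory.
    return max(senses, key=lambda s: len(context & content_words(senses[s])))


river_sense = disambiguate("bank", "She sat on the bank of the river and watched the water")
money_sense = disambiguate("bank", "He went to the bank to deposit money")
```

In the first sentence the words "river" and "water" overlap with the second definition, and in the second sentence "money" overlaps with the first. A search engine that stores the chosen sense as metadata could then return only documents about the intended meaning.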
More than anything, text mining’s power in search is that it allows you to ask more general questions like “who’s hot and who’s not?” and “is there any breaking news I need to know?” and get results that actually answer those questions.
All in all, the ability to add context and extract metadata from unstructured content before it is indexed makes search engines a far more powerful tool.