Text Analytics & Search

Text analytics processes and outputs can be difficult to imagine when discussed rather than shown. Just how does one put text analytics into practice and how are the results surfaced?

Managing information and knowledge should be treated holistically by architecting an information ecosystem without silos. For many information applications, the end goal is findability. Findability may be simply finding information to work with or, by extension, finding insights within content or bodies of content. The use of the search interface for locating and understanding all types of content makes it a natural place to surface the results of text analytics work.

Baselining Search through Search Log Analysis

One place to start with text analytics in the search ecosystem is with search log analysis. Search logs are a record of every search conducted. While search logs may include several data points, such as date, user, click-throughs, and the number of times a search phrase was used, the most useful data is the search phrase itself. In a typical search log output, the search phrases are ordered by the number of times each phrase was searched, but they can usually be sorted alphabetically so that similar concepts appear together.

The problem is, we don’t all search in the same way, nor do we search alphabetically. Take, for example, an Intranet search in which a user is looking for benefit information. The search phrases “human resource” and “human resources” fall neatly together in the H’s, as do “benefits” and “benefits information” in the B’s. “HR” is at the beginning of the H’s, while “2018 benefits enrollment” sorts at the beginning of the list with other numbers. Oops, someone made a typo, and there’s a search for “zhuman resources” down there in the Z’s. Y? Who knows, but there’s probably something related in the Y’s, too. Someone misspelled “benifits,” but that’s ok because we’re still looking through the B’s with only 20,000 more lines of search phrases to analyze for this month’s search logs. You should be done just in time for next month’s search logs.
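As a rough illustration of why alphabetical sorting falls short, the sketch below maps misspelled or variant phrases onto a short list of known terms using approximate string matching from Python's standard library. The function name, the list of known terms, and the 0.8 similarity cutoff are assumptions chosen for this example, not settings from any particular search tool.

```python
from difflib import get_close_matches

def group_variants(phrases, known_terms, cutoff=0.8):
    """Map each search phrase to its closest known term, or None if nothing is close."""
    mapping = {}
    for phrase in phrases:
        # get_close_matches returns the best candidates at or above the cutoff
        matches = get_close_matches(phrase.lower(), known_terms, n=1, cutoff=cutoff)
        mapping[phrase] = matches[0] if matches else None
    return mapping

# Hypothetical sample taken from the benefits example above
phrases = ["human resource", "human resources", "zhuman resources",
           "benifits", "benefits", "HR"]
known = ["human resources", "benefits"]

mapping = group_variants(phrases, known)
for phrase, term in mapping.items():
    print(f"{phrase!r} -> {term!r}")
```

Typos such as "benifits" and "zhuman resources" land on the right term, but "HR" does not, because an acronym shares almost no characters with its expansion. Acronyms generally need an explicit synonym list rather than string similarity.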

Related concepts, whether directly or indirectly related, are spread across the full set of search phrases for any given time period, which makes them overwhelming to analyze by hand. While internal search on an Intranet may run to several thousand searches per month, onsite search and external search engines may reach millions of searches. When I checked this morning, there were, by one estimate, nearly 2.5 million searches on Google today. Who wants to manually scan those search logs? While there are many good search analysis tools which are included out of the box or can be easily integrated with a search engine, deep concept analysis is usually lacking.

What to do? One approach is to output the search logs for a given period (usually available as a CSV file), remove all the information except for the search phrase column, and enter this information into a text analytics tool. Utilizing a variety of statistical word and phrase clustering techniques, the analyst can get a view of terms clustered together. These can be by words or phrases, in whole or in part, on roots or full terms.
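The steps above can be sketched in a few lines of Python. This is a minimal sketch, not a substitute for a real text analytics tool: it assumes the export has a column named `search_phrase` (a hypothetical name), and it stands in for statistical clustering with a deliberately crude rule that groups phrases by shared root words.

```python
import csv
import io
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in"}

def crude_root(word):
    """Very rough stand-in for a stemmer: lowercase and drop a trailing 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def cluster_by_root(csv_file, phrase_column="search_phrase"):
    """Group search phrases under every root word they contain."""
    clusters = defaultdict(set)
    for row in csv.DictReader(csv_file):
        # Keep only the search phrase; date, user, and other columns are ignored
        phrase = row[phrase_column].strip()
        for token in phrase.split():
            root = crude_root(token)
            if root not in STOPWORDS and not root.isdigit():
                clusters[root].add(phrase)
    return clusters

# Hypothetical export; a real log would be opened with open("searchlog.csv", newline="")
sample = io.StringIO(
    "date,user,search_phrase\n"
    "2018-10-01,u1,benefits\n"
    "2018-10-01,u2,benefits information\n"
    "2018-10-02,u3,2018 benefits enrollment\n"
    "2018-10-02,u4,human resources\n"
    "2018-10-03,u5,human resource\n"
)
clusters = cluster_by_root(sample)
for root, phrases in sorted(clusters.items()):
    print(root, sorted(phrases))
```

Even this simple grouping pulls "2018 benefits enrollment" next to "benefits", which an alphabetical sort would never do. A dedicated tool would replace `crude_root` with proper stemming and add statistical phrase clustering.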

While you may believe that your top 10 searches are the most important, seeing concepts clustered across the entire range of search phrases may paint a very different picture of what people are searching for. The marketing concept of the long tail (see my blog) applies here: the less frequently searched concepts, taken together, may actually add up to far more significant areas of search improvement and content delivery.
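To make the long tail argument concrete, the following sketch compares the share of search volume held by the top phrases with the share held by everything else. The counts are invented for illustration, and `head_and_tail_share` is a hypothetical helper name.

```python
from collections import Counter

def head_and_tail_share(counts, top_n=10):
    """Return (head_share, tail_share) of total search volume for the top_n phrases."""
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    head = sum(n for _, n in counts.most_common(top_n))
    return head / total, (total - head) / total

# Invented data: two popular phrases and fifty rarely searched ones
counts = Counter({"benefits": 40, "payroll": 30})
counts.update({f"rare phrase {i}": 2 for i in range(50)})

head, tail = head_and_tail_share(counts, top_n=2)
print(f"top 2 phrases: {head:.0%} of searches, everything else: {tail:.0%}")
```

In this made-up log, the two most popular phrases account for less than half of all searches, so looking only at the top of the list would miss most of what people are asking for.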

Many tools will also allow you to take those results and output them as a visual representation. Suddenly, you can literally show executives a picture of search, which can help drive search improvement, website, or internal document process projects.

Taxonomy Building

Well, that was a lot of work. What do you do with the search analytics results? Lots of things! The number of actionable search insights is only limited by imagination. Oh, and budget. Yeah, and resources. And difficult technical integrations. But let’s think positively.

Internally or on an external site search, you can perform a gap analysis between the enterprise taxonomy terminology you are using to tag content and what people are searching for. You do have an enterprise taxonomy, right? You are using it to tag content with controlled metadata, right? Right? If not, here’s your start. You just analyzed your search logs and now have some starting concepts for building a tagging taxonomy. Whether starting from scratch or maintaining an existing taxonomy, search logs are a rich source for preferred terms, synonyms, and acronyms.
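A gap analysis of this kind can be sketched as a lookup of each search phrase against the taxonomy's preferred terms and synonyms. The taxonomy structure here (preferred term mapped to a set of synonyms), the sample counts, and the function name are all assumptions made for the example; a real taxonomy tool would have its own export format.

```python
def taxonomy_gap(search_counts, taxonomy):
    """Split search volume into phrases covered by the taxonomy and phrases that are not.

    taxonomy maps each preferred term to a set of synonyms or acronyms.
    """
    # Build a case-insensitive lookup from any known label to its preferred term
    known = {}
    for preferred, synonyms in taxonomy.items():
        known[preferred.lower()] = preferred
        for synonym in synonyms:
            known[synonym.lower()] = preferred

    covered, uncovered = {}, {}
    for phrase, count in search_counts.items():
        preferred = known.get(phrase.lower())
        if preferred is not None:
            covered[preferred] = covered.get(preferred, 0) + count
        else:
            uncovered[phrase] = count
    return covered, uncovered

# Hypothetical taxonomy and monthly search counts
taxonomy = {
    "Human Resources": {"HR", "human resource"},
    "Benefits": {"benefits information"},
}
search_counts = {"hr": 120, "human resources": 80, "benefits": 200,
                 "open enrollment": 95, "pto": 60}

covered, uncovered = taxonomy_gap(search_counts, taxonomy)
print("covered:", covered)
print("candidates for new terms or synonyms:", uncovered)
```

The uncovered phrases, ranked by volume, are a starting list of candidate preferred terms, synonyms, and acronyms to review.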

Of course, it’s much easier to find related patterns when the searches are within a narrower scope of information. Large web search companies field millions of unique searches across a practically unlimited range of topics. Within an organization or on an externally facing website, the search volume may still be very large, but will at least fall within known parameters relating to your business.

Search Navigation & Scopes

Within the search application itself, identifying terminology may inform how the faceted navigation is presented. Whether powered by the enterprise taxonomy or not, search facets provide guidance and filters for users to narrow searches to the content they are seeking. Likewise, frequent search concepts can indicate which sites should be presented at the top of search results as a best bet or preferred result. On a larger scale, clustering search phrases may naturally group searches and topics into large buckets. These buckets may match with a body of content which can be a search vertical. So, for instance, a huge number of benefits searches may warrant a scoped search in which only human resources documents and sites are included, narrowing the breadth of the search to more likely candidates.

Externally, search phrase volume may warrant changes in the site navigation or in the way content is tagged for external findability. SEO programs can benefit from a deep search log analysis through text analytics in order to optimize landing pages.

Content Gap Analysis

Looking at search phrases which returned no results is one way to identify content which needs to be created or needs to be easier to find in search. Likewise, analyzing clusters of search phrases can assist in conducting a content gap analysis.
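As a minimal sketch of the first step, the code below counts the phrases that returned no results. It assumes each log row carries `search_phrase` and `result_count` fields; both names are hypothetical, and many search engines report zero-result queries in their own format.

```python
from collections import Counter

def zero_result_phrases(log_rows):
    """Count phrases whose searches returned no results, most frequent first."""
    counts = Counter()
    for row in log_rows:
        if int(row["result_count"]) == 0:
            # Normalize case and whitespace so trivial variants are counted together
            counts[row["search_phrase"].strip().lower()] += 1
    return counts.most_common()

# Hypothetical log rows, e.g. as produced by csv.DictReader
log_rows = [
    {"search_phrase": "parental leave", "result_count": "0"},
    {"search_phrase": "Parental Leave", "result_count": "0"},
    {"search_phrase": "benefits", "result_count": "42"},
    {"search_phrase": "tuition reimbursement", "result_count": "0"},
]

gaps = zero_result_phrases(log_rows)
print(gaps)
```

Each phrase on the resulting list still needs a human decision: the content may be missing, or it may exist under different wording and simply be hard to find.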

Too frequently, there is a mismatch between what users type into search and what terms are used in the content or used to tag the content. Modifying a tagging taxonomy is one way to address the gap. Another cause of the gap is having the wrong content, outdated content, or no content at all. Using search phrase analysis results as a map for what important content needs to be found allows an organization to conduct an audit and address the gap.

Inline Indexing

Within an academic context, the inline indexing and annotation of concepts within a document can aid in locating particular sections and the topics included. Seeing concepts highlighted in context, as one may find in search result snippets, can help users quickly navigate to the appropriate section, especially if highlighted concepts are distinguished in various ways, such as colors, highlight box shapes, or other visual indicators.

More robust text analytics tools can also parse and highlight a variety of information, including managed taxonomy terms, parts of speech, and named or unnamed entities such as organizations, people, and locations. For many businesses, highlighting parts of speech may not be very useful, but the identification of named or unnamed entities can help in identifying new concepts which are not yet being managed in a corporate taxonomy.
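The simplest form of inline highlighting, marking managed taxonomy terms where they occur in text, can be sketched with a regular expression. The `[[` and `]]` markers and the function name are placeholders invented for this example; a real interface would emit styled markup instead, and entity recognition would require a proper text analytics library.

```python
import re

def highlight_terms(text, terms, open_mark="[[", close_mark="]]"):
    """Wrap each occurrence of a taxonomy term in markers, ignoring case."""
    if not terms:
        return text
    # Longest terms first so "human resources" wins over "resources"
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{open_mark}{m.group(0)}{close_mark}", text)

# Hypothetical taxonomy terms and document text
terms = ["human resources", "benefits", "resources"]
text = "Contact Human Resources about benefits enrollment."

highlighted = highlight_terms(text, terms)
print(highlighted)
```

Different marker pairs per term type (taxonomy term, person, organization) would be one way to support the varied visual indicators mentioned earlier.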

Live Streams

Most of the search applications I’ve mentioned deal with relatively slow-moving information and analysis. However, it is possible to run text analytics applications over fast-moving data to extract insights.

For example, setting up predefined searches for social media mentions of companies or products, news feeds, or other text-based information sources can provide a body of content which can then be analyzed. The analyses can run from sentiment analysis to identifying areas of product improvement as users report positively or negatively through online reviews.
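As a toy illustration of analysis over a stream, the sketch below scores each incoming mention against small positive and negative word lists and keeps a running total. The word lists, function names, and sample mentions are all invented for this example; production sentiment analysis relies on far richer models than word counting.

```python
import re

# Tiny illustrative lexicons; real tools use much larger, weighted vocabularies
POSITIVE = {"love", "great", "excellent", "reliable"}
NEGATIVE = {"hate", "broken", "slow", "terrible"}

def score_mention(text):
    """Return positive word count minus negative word count for one mention."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def running_sentiment(mentions):
    """Yield (mention, score, running_total) as mentions arrive from any iterable."""
    total = 0
    for mention in mentions:
        score = score_mention(mention)
        total += score
        yield mention, score, total

# Hypothetical feed; in practice this could be a generator reading from a news or social API
feed = [
    "I love the new app, great update",
    "Checkout is slow and the search is broken",
    "Support was excellent",
]

results = list(running_sentiment(feed))
for mention, score, total in results:
    print(f"{score:+d} (running {total:+d}): {mention}")
```

Because `running_sentiment` is a generator, it processes one mention at a time and never needs the whole feed in memory, which suits fast-moving sources.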

One challenge organizations may face is getting their hands on the data to include in such an application. While some companies may get direct feedback on products or services on their website, other companies need to purchase this kind of data from third-party sources or vendors who may run their own analytic services on the content before the company can perform its own analysis. These vendors may not even offer direct access to the content, preferring to offer only the results of an analysis. While challenging, getting the right data for internal text analysis can result in more specific and tailored analysis and the freedom to use the results for many purposes.

Using the Results to Drive Projects

What all of these potential applications have in common is a statistical, supportable, and presentable view of search for the people who fund information projects. Telling an executive that people are searching for things and not finding them is not as compelling as showing an executive a visual chart of what people are searching for and how this matches, or doesn’t match, the way content is tagged or organized. Search phrases in textual language are subjective. Using text analytics to present this information in a more objective fashion can help drive all kinds of information projects.

Search is the end game for most of these projects, using all of the insights from text analysis to build information applications to find content. Showing the value of text analytics in search is a win for building powerful applications.