The Knowns

In my last blog, Exploring with Text Analytics, I discussed the knowns and unknowns which can be identified and explored through text analytics. In this blog, I’m going to highlight some use cases which involve established values and how text analytics utilizes what is already known.

Vocabulary Building

Enterprise vocabularies are typically built to suit the content used by the organization. In the process of creating or maintaining a vocabulary, terms are added or modified over time as new concepts are discovered or needs are identified. While text analytics can be viewed as an exploratory tool for content, analyzing known content for terminology to include in enterprise vocabularies is a way of reinforcing established concepts.

Deliberately selecting a small body of known training content to match to existing vocabulary terms is a way of ensuring the taxonomy is still providing adequate coverage for existing content. By the same token, known content can be used to identify and extract unknown concepts to build out the taxonomy in areas which are undeveloped. Either method is a way to build and maintain the enterprise taxonomy for use in categorizing content.
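To make the two directions concrete, here is a minimal sketch in Python. It assumes plain-text documents and simple lowercase word matching; the vocabulary, the sample documents, and the function names are all made up for illustration and are not drawn from any particular text analytics product. The first function checks how well existing terms cover the training content, and the second surfaces recurring two-word phrases which are not yet in the vocabulary.

```python
import re
from collections import Counter

# Hypothetical vocabulary and training content, for illustration only.
VOCABULARY = {"text analytics", "taxonomy", "metadata"}
DOCUMENTS = [
    "Text analytics can apply metadata to content.",
    "A taxonomy supports entity extraction and metadata.",
    "Entity extraction feeds the taxonomy.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def phrases(tokens, max_words=2):
    """Return the set of phrases of 1..max_words consecutive words."""
    found = set()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return found


def coverage(vocabulary, documents):
    """Count how many documents mention each existing vocabulary term."""
    counts = {term: 0 for term in vocabulary}
    for doc in documents:
        doc_phrases = phrases(tokenize(doc))
        for term in vocabulary:
            if term in doc_phrases:
                counts[term] += 1
    return counts


def candidate_terms(vocabulary, documents, min_docs=2):
    """List two-word phrases seen in at least min_docs documents
    which are not already vocabulary terms."""
    doc_freq = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        bigrams = {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}
        doc_freq.update(bigrams)
    return sorted(
        phrase for phrase, count in doc_freq.items()
        if count >= min_docs and phrase not in vocabulary
    )
```

With the sample data, `coverage(VOCABULARY, DOCUMENTS)["taxonomy"]` is 2 and `candidate_terms(VOCABULARY, DOCUMENTS)` returns `["entity extraction"]`, a phrase a taxonomist might review for inclusion. Real content would need stop-word handling and stemming, which this sketch leaves out.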


One of the most common enterprise use cases for text analytics processes is the auto-categorization of content. While using controlled vocabularies offers benefits in the application of consistent metadata to content for identification and retrieval, the barrier to taxonomy adoption has often been the labor-intensive work of building and maintaining taxonomies and applying their values to content. Tools may be getting better at the automatic creation of taxonomies, but few, if any, generate a taxonomy in a complete and usable state. What is automated, however, is the application of taxonomy values as metadata to content.

Auto-categorization works best with known vocabularies and known content. For example, a news publisher may write and publish news articles which need to be discoverable on a publicly accessible website. Onsite search or web search applications index content and make it discoverable. Most work from the content itself and use various methods to rank the page for returned results. The most notable, of course, is Google’s PageRank. Although modern search engines are far more sophisticated than simply matching keywords, embedded meta tags which describe the document as a whole are best supplied from a common vocabulary so they are consistent on all content across the site and even between sites. The rapid velocity of news story generation, publication, and sharing requires meta tags to be applied more quickly than is practical by hand. Thus, using a text analytics tool to identify and match concepts appearing in the content to controlled vocabulary concepts speeds the application of metadata. As content changes between versions and over periods of time, the tagging taxonomy is also continually updated to cover new concepts. Likewise, the rules-based categorization in the text analytics tool continues to evolve so concepts which are directly or even indirectly found in text can be tagged.
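As a rough illustration of rules-based tagging against a controlled vocabulary, the Python sketch below maps each preferred label to a few phrases which signal it, then tags an article with every label whose phrases appear. The labels, phrases, and function names are assumptions made up for this example; a commercial tool would use far richer rules than a plural-tolerant phrase match.

```python
import re

# Hypothetical controlled vocabulary: preferred label -> signal phrases.
VOCABULARY = {
    "Elections": ["election", "ballot", "polling place"],
    "Public Health": ["vaccine", "outbreak", "public health"],
    "Transportation": ["transit", "highway", "rail line"],
}


def compile_rules(vocabulary):
    """Build one case-insensitive pattern per label.

    Each pattern matches any of the label's phrases as whole words,
    allowing a simple trailing "s" for plurals.
    """
    rules = {}
    for label, signal_phrases in vocabulary.items():
        alternation = "|".join(re.escape(p) for p in signal_phrases)
        rules[label] = re.compile(rf"\b(?:{alternation})s?\b", re.IGNORECASE)
    return rules


def categorize(text, rules, min_hits=1):
    """Return the sorted labels whose pattern matches at least min_hits times."""
    tags = []
    for label, pattern in rules.items():
        if len(pattern.findall(text)) >= min_hits:
            tags.append(label)
    return sorted(tags)


RULES = compile_rules(VOCABULARY)
ARTICLE = (
    "County officials opened polling places early as ballots arrived; "
    "a new rail line eased the commute."
)
TAGS = categorize(ARTICLE, RULES)
```

Here `TAGS` is `["Elections", "Transportation"]`. Because the tags always come from the preferred labels rather than from the wording of the article, every story about voting gets the same "Elections" value no matter which phrase triggered it.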


Similar to auto-categorization, analyzing content for keywords, particularly as they appear in a given field, can be used to route documents in workflows. For example, emails coming into a common address can be scanned, particularly the text in a field like “subject,” and routed to the correct recipient based on pre-identified concepts. Again, a process like this is optimal when there are pre-defined and known categories into which all content must be routed. In this case, having items automatically filed into an “other” category is not particularly useful unless it makes clear that additional categories need to be identified and added.
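A subject-line router of this kind can be sketched in a few lines of Python. The queue names and keywords below are invented for illustration, and the first matching rule wins, which is an assumption on my part rather than how any specific product resolves conflicts.

```python
import re

# Hypothetical routing rules: (queue name, pattern for subject keywords).
# Rules are checked in order and the first match wins.
ROUTES = [
    ("billing", re.compile(r"\b(?:invoice|refund|payment)\b", re.IGNORECASE)),
    ("support", re.compile(r"\b(?:error|outage|password)\b", re.IGNORECASE)),
    ("sales", re.compile(r"\b(?:quote|pricing|demo)\b", re.IGNORECASE)),
]


def route(subject, routes=ROUTES, fallback="other"):
    """Return the queue for an email subject, or the fallback if no rule matches."""
    for queue, pattern in routes:
        if pattern.search(subject):
            return queue
    return fallback


def fallback_share(subjects, fallback="other"):
    """Fraction of subjects which no rule could route."""
    if not subjects:
        return 0.0
    missed = sum(1 for s in subjects if route(s) == fallback)
    return missed / len(subjects)
```

For example, `route("Refund for invoice 1182")` returns `"billing"` and `route("Lunch on Friday?")` returns `"other"`. Watching `fallback_share` over time is one simple way to notice when the “other” bucket is growing and new categories may be needed.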

Rules-Based Pattern Matching

Typically, entity recognition is done with large lists, vocabularies, or other dictionaries. As with auto-categorization, the concept in the text either matches what is in one of these sources or it doesn’t. However, there are patterns which are fairly common but are not practical to include in lists because of the nearly infinite possible options.

For example, two or three capitalized words in a row are often a proper name: the name of a person, a title and a name, or the name of a known object, such as the Brooklyn Bridge or the Golden Gate Bridge. While other natural language processing techniques may identify these concepts in text, specific rules can help to group items which likely fit a category. Further, having users perform quality control on the results and verify whether the identified concepts are accurate can train the classification to become more accurate over time.
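The capitalized-words rule can be approximated with a regular expression. The Python sketch below is only a candidate generator under simplifying assumptions: it recognizes a short, made-up list of titles and treats any word with an initial capital followed by lowercase letters as a possible name part.

```python
import re

# Assumed, deliberately short list of titles for illustration.
TITLE = r"(?:Dr|Mr|Mrs|Ms|Prof)\.?"
# A "name word": one capital letter followed by lowercase letters.
WORD = r"[A-Z][a-z]+"

# An optional title, then two or three name words in a row.
CANDIDATE = re.compile(rf"\b(?:{TITLE}\s+)?{WORD}(?:\s+{WORD}){{1,2}}\b")


def candidate_names(text):
    """Return runs of two or three capitalized words, with any leading title."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]


SAMPLE = (
    "Crews repainted the Golden Gate Bridge while "
    "Dr. Maria Lopez inspected the Brooklyn Bridge."
)
FOUND = candidate_names(SAMPLE)
```

`FOUND` is `["Golden Gate Bridge", "Dr. Maria Lopez", "Brooklyn Bridge"]`. A rule this loose also picks up capitalized words at the start of a sentence, so its output is best treated as a list of candidates for a reviewer to confirm or reject.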

Similarly, you may have content you know includes Social Security numbers even if you don’t know the exact values. You can write rules matching the patterns ###-##-####, #########, or any other variants which might be Social Security numbers. The same can be said for other generally formatted information, such as phone numbers and addresses, and specifically formatted information, such as a company’s part numbering scheme, contract labeling, or customer IDs. If a pattern is regular and does not conflict with other identified concepts, matching values can be pulled from text relatively easily.
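Here is a minimal Python sketch of such a pattern rule. It matches nine digits grouped three, two, and four, with an optional hyphen or space between groups, and it refuses matches which are part of a longer run of digits. The function names are my own, and the rule checks only the shape of the value, not whether the number was ever issued.

```python
import re

# Nine digits as 3-2-4, with an optional hyphen or space between groups.
# The lookarounds reject matches embedded in a longer digit sequence.
SSN = re.compile(r"(?<!\d)(\d{3})[- ]?(\d{2})[- ]?(\d{4})(?!\d)")


def find_ssn_candidates(text):
    """Return every substring shaped like a Social Security number."""
    return [match.group(0) for match in SSN.finditer(text)]


def mask_ssn_candidates(text):
    """Replace each candidate with a masked form which keeps the last four digits."""
    return SSN.sub(lambda match: f"***-**-{match.group(3)}", text)


SAMPLE = "SSN 123-45-6789 on file; alt 987654321; phone 555-867-5309."
CANDIDATES = find_ssn_candidates(SAMPLE)
MASKED = mask_ssn_candidates(SAMPLE)
```

`CANDIDATES` is `["123-45-6789", "987654321"]`, and the phone number is left alone because its digits are grouped three, three, and four. Any other nine-digit value in the content, such as an order number, would also match, which is the kind of conflict with other identified concepts mentioned above.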

These are just a few ways using text analytics tools can speed the identification and application of known concepts to content. In my next blog, I’ll dive into the unknowns which, used with known concepts, create powerful applications for information discovery.