One of the primary motivations for performing text analytics is to determine the aboutness of a document or a set of documents (a corpus). While we are no doubt familiar with applying a subject or topic to content, such as the way we organize content on the Internet or apply metadata in a content management system, aboutness takes into account both the topic of a written text and its intent, that is, the relationship between the writing and the topics it addresses. While it is relatively easy to determine the topic or topics of content from the keywords appearing in it, intentionality is usually a human interpretation or abstraction of a text.
Keywords
For example, manually indexing a book and automatically building a search index both primarily rely on keywords and concepts identified and extracted from the content. The user then hopes to match the concept in his or her head to the indexed content, typically by typing a text phrase into a search box to retrieve the information. For more esoteric or abstract content, however, the keywords may not be directly expressed in the text. The Modern Language Association's (MLA) International Bibliography, for instance, manually indexes humanities research papers using controlled vocabulary terms applied by subject matter experts. Although the terms serve the same purpose as keywords (retrieval), the concepts are often a level of abstraction above a keyword in that they convey additional meaning. Furthermore, research articles written in the humanities are often not as straightforward as, say, scientific literature, so a human hand is required to capture the overall aboutness of the work. While generally more accurate (despite studies showing that human indexing is prone to variance between indexers, and even for one indexer over time), human indexing is intensive and time-consuming work.
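To make the keyword-matching idea concrete, here is a minimal sketch of an inverted index, the data structure behind most search indexes: each term maps to the set of documents containing it, and a query retrieves documents by intersecting those sets. The documents, stopword list, and query below are hypothetical placeholders, not any particular system's implementation.

```python
# Minimal inverted-index sketch: map each term to the documents
# containing it, then answer a query by intersecting posting sets.
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}

def tokenize(text):
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: return documents containing every query term.
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    "doc1": "Indexing the humanities with controlled vocabulary terms",
    "doc2": "Keyword extraction and search index construction",
}
index = build_index(docs)
print(search(index, "controlled vocabulary"))  # {'doc1'}
```

Note that this sketch retrieves only on literal keyword matches, which is exactly why the abstract concepts an MLA indexer applies cannot be captured this way.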
Text analytics can help speed the processes involved in capturing the aboutness of text. While indexers may have nightmares of text analytics and artificial intelligence beasts living in their closets, the truth is that text analytics involves tools and methodologies that save time and aid people in understanding texts, especially at the very large scales that have become the information norm in our day. Not only is information generated at a very large scale, it is also generated in very short time cycles. While humanities journals can allow a lag between publishing and retrieval and remain relevant, information such as news and user reviews needs to be categorized and analyzed in as little time as possible. See, for instance, the 2016 Presidential Debate Analysis conducted by The New York Times, which used live-feed fact checking and post-debate discussion and analysis to interpret the debates. A linguistic analysis conducted by HuffPost, meanwhile, drew some very interesting conclusions using text analytics. Although that analysis focused on the relatively simple task of counting word frequencies, the automated processing let the analysts reach several conclusions quickly. In addition, thematic identification and categorization helped speed the overall summarization of the debates' aboutness.
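Word frequency counting is one of the simplest text analytics techniques, and it is easy to reproduce. The sketch below is not HuffPost's actual pipeline; it just shows the general approach of normalizing a transcript, dropping stopwords, and ranking the remaining terms. The transcript string and stopword list are placeholders.

```python
# Word-frequency sketch: normalize, filter stopwords, rank terms.
import re
from collections import Counter

STOPWORDS = {"the", "a", "and", "to", "of", "we", "will", "i", "is", "it"}

def top_terms(transcript, n=10):
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

transcript = "We will build jobs. Jobs and trade, trade and jobs."
print(top_terms(transcript, 3))
# Possible output: [('jobs', 3), ('trade', 2), ('build', 1)]
```

The counting itself is trivial; the analytical value comes from the interpretation layered on top of the ranked terms, which is where the human analysts come back in.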
While a typical organization may start with no metadata, or poor metadata, applied to its content and then move to controlled metadata applied both manually and automatically, few organizations make the leap to document summarization and aboutness. Using a combination of controlled vocabularies, open Linked Data sources, and text analytics tools that analyze the language, the presence of known and unknown entities, and the relationships between those concepts, an organization can move from burdensome manual tasks to automated concept identification. Further, the ability to identify concepts and their relationships opens the door to extracting more meaningful aboutness from documents, creating a more semantically rich description of information.
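As a rough illustration of automated concept identification, the sketch below uses spaCy's pretrained named entity recognizer to pull candidate concepts from a sentence. It assumes spaCy and its en_core_web_sm model are installed (pip install spacy, then python -m spacy download en_core_web_sm); linking the extracted entities to a controlled vocabulary or a Linked Data source would be a separate, organization-specific step.

```python
# Named entity recognition sketch using spaCy's pretrained model.
# Each recognized entity is a candidate concept for describing aboutness.
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("The Modern Language Association indexes humanities research "
        "published across the United States and Europe.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
# Possible output:
#   The Modern Language Association ORG
#   the United States GPE
#   Europe LOC
```

Mapping each extracted entity to a controlled vocabulary term, and recording the relationships between those terms, is what turns a flat keyword list into the semantically richer description of aboutness discussed above.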
I’ll explore more text analytics applications in upcoming blogs.