Organizations have information, and lots of it. Some of that information is structured data stored in databases, comprising metadata as well as textual and numerical content. Even more of it is unstructured content in the form of images and text. It’s commonly agreed that we are producing more and more information and that the bulk of it is unstructured. It’s also commonly accepted that there are insights to be found in that information.
Insights can be gleaned from structured data. Analyzing, crunching, and processing numbers in rows and columns is a relatively easy task. You may run the risk of having the wrong number, a number in the wrong spot, or the wrong formula processing those numbers, but if you have everything correct, the result will likely be a straightforward answer. The answer will probably be unambiguous and not subject to multiple interpretations.
Gaining insights from processing textual language is more difficult. In text, the problems and exceptions are many. What does a word mean? How many forms and versions of that word exist in one language or in multiple languages? Was it spelled correctly? Was it deliberately spelled incorrectly? Was it used in an expression, a strange turn of phrase, or in sarcasm? When you mention a subject in one sentence, is there a pronoun in the next sentence referring back to that subject? Are there acronyms or abbreviations? Is the full form available in the same text? And so on.
The simplest and most common approach to analyzing text is to count words, often as part of the “bag of words” approach, treating everything in a textual source as equal and counting the number of exact matches to a given term. You’ll see this frequently in word cloud generators. Matching on exact terms is why a word cloud may show several forms of the same concept, along with unimportant yet frequently mentioned concepts, in large font. At a very broad level, there is some value or insight to be gleaned from such an approach, but the uses are limited.
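To make the limitation concrete, here is a minimal sketch of exact-match word counting in Python. The function name, the simple tokenizing pattern, and the sample sentence are all assumptions made for illustration; no particular word cloud generator is known to work exactly this way.

```python
import re
from collections import Counter


def count_exact_terms(text):
    """Count tokens by exact, case-sensitive match.

    Hypothetical helper for illustration: no stemming, no lowercasing,
    and no stop-word removal, so every surface form is its own term.
    """
    # Assumed tokenizer: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[A-Za-z']+", text)
    return Counter(tokens)


# Made-up sample text containing several forms of the same concept.
sample = (
    "The analyst analyzed the report. Analysts analyze reports, "
    "and the analysis of the report shows the limits of the approach."
)

counts = count_exact_terms(sample)

# Print the most frequent terms, as a word cloud would size them.
for term, n in counts.most_common(5):
    print(term, n)
```

In this sample the most frequent term is "the", an unimportant word, while "analyst", "Analysts", "analyze", "analyzed", and "analysis" are each counted separately even though they point to the same concept. That is the behavior described above: exact matching favors filler words and splits one idea across many entries.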
Text analytics is the field that deals with these language problems and creates tools and processes for handling text, or unstructured information, to extract meaning and value. Often what an organization is looking for is contextual understanding and the ability to target specific language or patterns within text. In addition to the specific language problems mentioned above, there seems to be an industry problem: tools are either too difficult for the average user, or they are simplified versions that hide too much functionality to expand beyond common, templated use cases.
In this series of text analytics blogs, I’ll tackle issues within the field of text analytics and discuss best practices.