Text Analytics Forum
This year’s Text Analytics Forum was one of five conferences at KMWorld 2018 in Washington, D.C. Over the course of two days, a wide variety of presentations and panels detailed use-case approaches and best practices in text analytics. I’ve tried to summarize some of the tools, techniques, and trends I heard about at the conference. You can read more from the source in the presentations posted here.
No Single Answer
My biggest takeaway from the conference is that there is no single best answer to using text analytics in the enterprise. While text analytics has been around for decades, only now, with the rise of big data, server capacity, and machine learning, are we seeing more interest and adoption within the enterprise. Tools and techniques will need to be selected and developed for each organization and each use case. It may be possible to template some tools and processes, but the unique demands of unstructured (or, rather, semi-structured) text will necessitate clever approaches to each situation.
The biggest gap in the presentations I saw was the absence of proposals on how to get text analytics adopted in the enterprise, where this function should live, and who should perform it. The lack of guidance may simply be a matter of maturity: while text analytics is well established as a discipline, formal propositions for widespread use in the enterprise are not as common as they have become for knowledge management and taxonomy. At both KMWorld and Taxonomy Boot Camp, there were multiple presentations on gaining traction for projects and winning adoption from end users. In contrast, when text analytics in the enterprise was discussed, the work was localized and performed by a mix of consultants, IT staff, or business users who learned text analytics as part of their role. I suspect we’ll see more presentations on getting projects started, driving adoption, and hiring for dedicated text analytics roles in the future.
New Techniques
Despite text analytics being a mature field, several novel approaches were presented. For example, one presentation focused on using stop words (which are typically ignored) as contextual indicators, rather than the usual focus on nouns, verbs, and adjectives. Because pronouns, prepositional phrases, and functional adverbs occur so frequently and express relationships between other words, their patterns of use can indicate things like intent and mood.
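To make the stop word idea a little more concrete, here is a minimal sketch of my own, not the presenter's method. It counts how much of a text falls into a few function-word groups. The word lists and group names are hypothetical and hand-picked for illustration; a real analysis would need much fuller lists and a way to interpret the resulting profile.

```python
import re
from collections import Counter

# Hypothetical, hand-picked function-word groups (assumption: a real
# lexicon would be far larger and chosen for the domain).
FUNCTION_WORDS = {
    "first_person": {"i", "me", "my", "we", "us", "our"},
    "second_person": {"you", "your"},
    "prepositions": {"in", "on", "at", "to", "from", "with", "about"},
    "hedges": {"maybe", "perhaps", "possibly", "somewhat"},
}


def function_word_profile(text):
    """Return the share of tokens that fall in each function-word group."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for group, words in FUNCTION_WORDS.items():
            if token in words:
                counts[group] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {group: counts[group] / total for group in FUNCTION_WORDS}


profile = function_word_profile("I think we should maybe talk to you about it")
print(profile)
```

The point is only that the words usually thrown away can be counted as features in their own right; how those features map to intent or mood is the hard part and is not shown here.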
Another presenter suggested a bottom-up approach to identifying concepts: use rules to chunk the text into phrases, then match those phrases to existing taxonomies and gazetteers instead of relying on other established NLP techniques. The main difference between this suggestion and existing techniques was that the phrases generated from the text were matched to taxonomies without requiring an exact match.
A third method started with a large set of keywords and then built up context around those standalone terms. The presenters used externally ranked web pages to verify and contextualize concepts, essentially crowdsourcing the context applied to each concept.
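As a rough illustration of the chunk-then-match idea, the sketch below is my own simplification, not the presenter's implementation. It chunks text by breaking on stop words and punctuation (a stand-in for whatever chunking rules were actually used) and then matches each phrase to a taxonomy with the standard library's `difflib.get_close_matches`, so a near miss still counts. The stop word list, the taxonomy terms, and the 0.8 cutoff are all assumptions.

```python
import re
from difflib import get_close_matches

# Hypothetical stop words and taxonomy terms, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "in", "to", "is", "are", "our", "with", "on"}
TAXONOMY = ["knowledge management", "machine learning", "text analytics", "taxonomy"]


def chunk_phrases(text):
    """Split text into candidate phrases, breaking on stop words and punctuation."""
    phrases, current = [], []
    for token in re.findall(r"[a-z]+|[^\sa-z]", text.lower()):
        if token.isalpha() and token not in STOP_WORDS:
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def match_to_taxonomy(phrases, cutoff=0.8):
    """Map each phrase to its closest taxonomy term, if any is similar enough."""
    matches = {}
    for phrase in phrases:
        close = get_close_matches(phrase, TAXONOMY, n=1, cutoff=cutoff)
        if close:
            matches[phrase] = close[0]
    return matches


phrases = chunk_phrases("The text analytic of the enterprise")
print(match_to_taxonomy(phrases))
```

Character-level similarity is a crude choice: it catches singular/plural and spelling variants but not synonyms, so a production approach would likely need something richer.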
Manual Tagging
Manual tagging isn’t dead. As much as we look forward to more and more automation in the enterprise, manual tagging is still relevant. Whether done at small scale on training documents by an individual or team, or at large scale by onshore or offshore labor forces, manual tagging will remain one of the surest ways to provide well-tagged, “gold standard” document sets for auto-categorization and machine learning training.
One driver for this is subject matter expertise. For each domain, machine learning requires examples to learn from before it can be applied to new material. While content in general domains may be easier to tag and provide to machine learning models, more specific areas of expertise will require human experts to define tags, apply them, and check the tags applied to sample content.
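One practical use of a gold standard set is checking an auto-tagger against it. The sketch below is my own illustration, with hypothetical document IDs and tags. It computes micro-averaged precision and recall by comparing predicted tags to the expert-applied tags for each document.

```python
def precision_recall(gold, predicted):
    """Micro-averaged precision and recall of predicted tags against gold tags.

    Both arguments map a document ID to a set of tags. Documents that
    appear only in `predicted` are ignored, since there is nothing to
    score them against.
    """
    tp = fp = fn = 0
    for doc_id, gold_tags in gold.items():
        pred_tags = predicted.get(doc_id, set())
        tp += len(gold_tags & pred_tags)   # tags the tagger got right
        fp += len(pred_tags - gold_tags)   # tags the tagger invented
        fn += len(gold_tags - pred_tags)   # tags the tagger missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical expert tags and auto-tagger output.
gold = {"doc1": {"contracts", "liability"}, "doc2": {"patents"}}
predicted = {"doc1": {"contracts", "finance"}, "doc2": {"patents"}}
print(precision_recall(gold, predicted))
```

In this toy data the tagger finds two of the three expert tags and adds one wrong tag, so both precision and recall come out to two thirds.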
Machine Learning & Hybrid Models
While machine learning is gaining ground in its ability to perform well in the general enterprise, there is a lot of work involved in defining the problem, selecting the right tools, and creating efficient processes. A theme repeated throughout the conference was the hybrid approach.
For example, the general thought at the conference was that taxonomies and ontologies will continue to be used to provide controlled values for content tagging. While some speakers suggested not using taxonomies, the trade-off is that training sets need much more data to compensate.
Similarly, rules and machine learning will work best when used together, with people kept in the loop. Both manual employee effort and automation through machine learning will be essential to usable outcomes.
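One way to picture a hybrid pipeline is as a series of fallbacks. The sketch below is my own, not something shown at the conference. It tries hand-written keyword rules first, then a machine learning classifier, and routes anything the model is unsure about to a person. The `classify` argument is a hypothetical callable returning a tag and a confidence score, and the 0.8 threshold is an arbitrary assumption.

```python
def hybrid_tag(text, rules, classify, threshold=0.8):
    """Tag text with rules first, then a model, then fall back to human review.

    rules:    dict mapping a lowercase keyword to a taxonomy tag
    classify: hypothetical callable, text -> (tag, confidence between 0 and 1)
    Returns (tag, source), where source is "rule", "model", or "human_review".
    """
    lowered = text.lower()
    for keyword, tag in rules.items():
        if keyword in lowered:
            return tag, "rule"
    tag, confidence = classify(text)
    if confidence >= threshold:
        return tag, "model"
    # Low confidence: leave untagged and queue for a human tagger.
    return None, "human_review"


# Stand-in for a trained model (assumption: always returns a weak guess).
def toy_classifier(text):
    return "general", 0.55


rules = {"invoice": "Finance", "nda": "Legal"}
print(hybrid_tag("Please sign the NDA", rules, toy_classifier))
print(hybrid_tag("Lunch menu for Friday", rules, toy_classifier))
```

The human-reviewed items can then be added to the training set, which is where the manual and automated efforts reinforce each other.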
Text Analytics Applications
There were also some common themes in the applications that text analytics may power. As with taxonomies, the results of analyzing unstructured text will feed knowledge graphs. Linking both controlled and extracted entities to outside bodies of information, and surfacing those links to employees or on public-facing websites, lets users see a broader array of related information.
Similarly, using text analytics applications to power semi-structured text-based business intelligence dashboards will become more common as enterprises integrate machine learning techniques into the content creation and analysis workflow. As with taxonomies, search can also be improved by using text analytics to better understand the content being indexed and presenting this in conjunction with structured data.
Overall, there was a wide variety of tools, techniques, and philosophies supporting the idea that text analytics is a growing area of interest and application within the enterprise.