Another powerful use of enterprise taxonomies is autocategorization, or the automation of content description. With autocategorization, concepts from enterprise taxonomies or ontologies can be applied to enterprise content at scale. Computer algorithms “read” and analyze documents to match content to specific taxonomy concepts or ontology elements. The specific technology may vary, but autocategorization of text-based content generally relies on natural language processing (NLP) techniques or tuned queries to large language models (LLMs).[1] These techniques form text analytics services that power machine annotation of content.
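As a rough illustration of the simplest NLP approach, the sketch below matches document text against a small hypothetical taxonomy by whole-word label lookup; production text analytics services layer on lemmatization, disambiguation, and statistical or LLM-based models. All concept names and labels here are invented for illustration.

```python
import re

# Hypothetical mini-taxonomy: preferred label -> alternative labels (synonyms).
TAXONOMY = {
    "Mergers and Acquisitions": ["merger", "acquisition", "M&A"],
    "Regulatory Compliance": ["compliance", "regulation", "regulatory filing"],
}

def match_concepts(text: str) -> set:
    """Return the taxonomy concepts whose labels appear in the text."""
    matched = set()
    for concept, alt_labels in TAXONOMY.items():
        for label in [concept, *alt_labels]:
            # Case-insensitive, whole-word match on each label.
            if re.search(rf"\b{re.escape(label)}\b", text, re.IGNORECASE):
                matched.add(concept)
                break
    return matched

print(match_concepts("The board approved the acquisition after a compliance review."))
# -> {'Mergers and Acquisitions', 'Regulatory Compliance'}
```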
This machine annotation can serve multiple use cases:
- Tagging: Identifying the many taxonomy concepts and named entities mentioned in a document.
- Classification: Identifying the few concepts and named entities that best describe the aboutness of a whole document.
- Extraction: Identifying the new concepts and named entities that are found in the full text of a document but not in the taxonomy.
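To make these distinctions concrete, a machine annotation result covering all three use cases might be shaped like the following sketch; the structure and field names are illustrative, not drawn from any particular product.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationResult:
    # Tagging: every taxonomy concept or named entity mentioned in the document.
    tags: list = field(default_factory=list)
    # Classification: the few concepts that best describe the document's aboutness.
    classifications: list = field(default_factory=list)
    # Extraction: candidate terms found in the text but absent from the taxonomy.
    extracted_candidates: list = field(default_factory=list)

result = AnnotationResult(
    tags=["Mergers and Acquisitions", "Regulatory Compliance", "Quarterly Earnings"],
    classifications=["Mergers and Acquisitions"],
    extracted_candidates=["shareholder rights plan"],
)
```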
The quality of an autocategorization service can be measured in many ways, but every metric answers the same essential question: “How close is the machine labeling to human understanding?” Answering this question requires benchmark data: human “gold standard” labels. The outputs of the autocategorization service are compared against these benchmarks to count true positives (the machine and human annotations agree), false positives (the machine added an annotation that does not match the human benchmark), and false negatives (the machine failed to find an annotation provided in the human benchmark). The counts and ratios of true positives, false positives, and false negatives can be summarized in a number of standard metrics:
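The most common of these metrics, precision, recall, and the F1 score, follow directly from the three counts. A minimal sketch with hypothetical numbers:

```python
def precision(tp: int, fp: int) -> float:
    """Share of machine annotations that agree with the human benchmark."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Share of human benchmark annotations that the machine found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives.
print(round(precision(80, 20), 3), round(recall(80, 10), 3), round(f1(80, 20, 10), 3))
# -> 0.8 0.889 0.842
```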
[1] Autocategorization of non-text content may rely upon computer vision techniques, audio signal processing, or other machine learning models. For this section, we will focus on autocategorization of text-based enterprise content (e.g., documents, transcripts).