Another powerful use of enterprise taxonomies is autocategorization, or the automation of content description. With autocategorization, concepts from enterprise taxonomies or ontologies can be applied to enterprise content at scale. Computer algorithms “read” and analyze documents to match content to specific taxonomy concepts or ontology elements. The specific technology may vary, but autocategorization of text-based content generally relies on natural language processing (NLP) techniques or tuned queries to large language models (LLMs).[1] These techniques form text analytics services that power machine annotation of content.
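As a rough illustration of the simplest NLP approach, the sketch below matches document text against a small hypothetical taxonomy by whole-word label lookup; production text analytics services layer on lemmatization, disambiguation, and statistical or LLM-based models. All concept names and labels here are invented for illustration.

```python
import re

# Hypothetical mini-taxonomy: preferred label -> alternative labels (synonyms).
TAXONOMY = {
    "Mergers and Acquisitions": ["merger", "acquisition", "M&A"],
    "Regulatory Compliance": ["compliance", "regulation", "regulatory filing"],
}

def match_concepts(text: str) -> set:
    """Return the taxonomy concepts whose labels appear in the text."""
    matched = set()
    for concept, alt_labels in TAXONOMY.items():
        for label in [concept, *alt_labels]:
            # Case-insensitive, whole-word match on each label.
            if re.search(rf"\b{re.escape(label)}\b", text, re.IGNORECASE):
                matched.add(concept)
                break
    return matched

print(match_concepts("The board approved the acquisition after a compliance review."))
# -> {'Mergers and Acquisitions', 'Regulatory Compliance'}
```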
This machine annotation can serve multiple use cases:
- Tagging: Identifying the many taxonomy concepts and named entities mentioned in a document.
- Classification: Identifying the few concepts and named entities that best describe the aboutness of a whole document.
- Extraction: Identifying the new concepts and named entities that are found in the full text of a document but not in the taxonomy.
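To make these distinctions concrete, a machine annotation result covering all three use cases might be shaped like the following sketch; the structure and field names are illustrative, not drawn from any particular product.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationResult:
    # Tagging: every taxonomy concept or named entity mentioned in the document.
    tags: list = field(default_factory=list)
    # Classification: the few concepts that best describe the document's aboutness.
    classifications: list = field(default_factory=list)
    # Extraction: candidate terms found in the text but absent from the taxonomy.
    extracted_candidates: list = field(default_factory=list)

result = AnnotationResult(
    tags=["Mergers and Acquisitions", "Regulatory Compliance", "Quarterly Earnings"],
    classifications=["Mergers and Acquisitions"],
    extracted_candidates=["shareholder rights plan"],
)
```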
The quality of an autocategorization service can be measured in many ways, but every metric answers the same essential question: “How close is the machine labeling to human understanding?” Answering this question requires benchmark data: human “gold standard” labels. The outputs of the autocategorization service are compared against these benchmarks to count true positives (the machine and human annotations agree), false positives (the machine added an annotation that does not match the human benchmark), and false negatives (the machine failed to find an annotation provided in the human benchmark). The counts and ratios of true positives, false positives, and false negatives can be summarized in a number of standard metrics:
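The most common of these metrics, precision, recall, and the F1 score, follow directly from the three counts. A minimal sketch with hypothetical numbers:

```python
def precision(tp: int, fp: int) -> float:
    """Share of machine annotations that agree with the human benchmark."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Share of human benchmark annotations that the machine found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives.
print(round(precision(80, 20), 3), round(recall(80, 10), 3), round(f1(80, 20, 10), 3))
# -> 0.8 0.889 0.842
```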
[1] Autocategorization of non-text content may rely upon computer vision techniques, audio signal processing, or other machine learning models. For this section, we will focus on autocategorization of text-based enterprise content (e.g., documents, transcripts).