This article was originally published in the MESA M + E Journal Spring 2024.
Abstract: Many organizations invest significant resources in tagging their content, with varying results for accuracy and comprehensiveness. Machine tagging offers both promise and peril: scaling tagging with algorithms is valuable only if machine performance can exceed human tagging. Extending document tagging into a content-aware knowledge graph through algorithmic categorization (using extensible analytics services, including LLMs), powered by enterprise taxonomies and curated by a human-in-the-loop, offers a step-change for the functionality of enterprise content: similarity indexing, recommendations, data audits, and data insights accessible to non-technical users.
Many organizations invest significant resources in tagging their enterprise content, with varying results for metadata accuracy and comprehensiveness. This content metadata then powers content discovery through browse, search, or recommendations.
But are today’s workflows doing enough to surface relevant content? With a different approach, can you put your content metadata to greater use?
Traditionally, tagging approaches rely either on human tagging or machine tagging with little partnership between the two approaches. Marrying the power of information science with data science and placing a human-in-the-loop to manage the process can power better outcomes.
Further, this same process can cultivate a content-aware knowledge graph, which can power more refined and relevant content discovery of enterprise content.
This article explores these themes further. Where is there opportunity for your organization to take content metadata to the next level?
The limits of human tagging
What challenges exist for human tagging today? Most enterprise workflows face challenges for accuracy, completeness, and scalability.
These challenges arise because many common workflows rely on human readers who apply concepts from a controlled vocabulary (e.g., an enterprise taxonomy) to identify the topics described within a piece of content. Often, these controlled vocabularies will describe branded products and services, industries, and subjects. The human reader may rely on tagging governance or business rules (e.g., apply one tag from an industry taxonomy, no more than three tags from a product taxonomy, etc.).
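To make such governance rules concrete, the following is a minimal sketch in Python of how tag counts against per-taxonomy limits might be validated. The taxonomy names, rule limits, and tags are invented for illustration; a real workflow would draw these from the enterprise taxonomy and its governance documentation.

```python
# Hypothetical illustration: validating tagging governance rules.
# Taxonomy names, limits, and tags below are invented for this example.

GOVERNANCE_RULES = {
    "industry": {"min": 1, "max": 1},   # exactly one industry tag
    "product":  {"min": 0, "max": 3},   # no more than three product tags
}

def validate_tags(tags_by_taxonomy: dict[str, set[str]]) -> list[str]:
    """Return a list of governance violations for one document's tags."""
    violations = []
    for taxonomy, rule in GOVERNANCE_RULES.items():
        count = len(tags_by_taxonomy.get(taxonomy, set()))
        if count < rule["min"]:
            violations.append(f"{taxonomy}: expected at least {rule['min']} tag(s), found {count}")
        if count > rule["max"]:
            violations.append(f"{taxonomy}: expected at most {rule['max']} tag(s), found {count}")
    return violations

print(validate_tags({"industry": {"Media & Entertainment"},
                     "product": {"Streaming", "Archive", "Localization", "Ad Sales"}}))
# -> ['product: expected at most 3 tag(s), found 4']
```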
Usually, these human readers are either content experts (e.g., the authors of the content) or a team member responsible for content publishing (i.e., an owner of a process but not an expert in the subject of the document).
In the first “expert reader” case, the resulting content metadata is likely to be accurate but incomplete. An expert knows the content well and is unlikely to apply an incorrect tag, but they are often not incentivized to tag the content completely. Their core job is not tagging content and they have better things to do with their time than apply every relevant tag to a document. They are likely to apply the most important tags and then move on.
In the second “process owner” case, the reader is more likely to apply a wider range of tags, as this is a core part of their workflow and responsibilities. The content metadata, however, may still be incomplete, especially for longer documents, which require “skimming” by the reader. Further, the tags applied are more likely to be incorrect: the process owner is not a content expert and may be prone to errors and misunderstanding.
Ultimately, many organizations invest a lot of time and human resources in tagging enterprise content, but the quality of the resulting content metadata is questionable, and the cost tends to be high.
The risks and cost of machine tagging
Machine tagging of content offers both promise and peril; scaling tagging with algorithms is valuable for most organizations only if machine performance can match or exceed human-generated content metadata.
Relative to human readers, tagging algorithms are more likely to generate many tags, but these may be more “noise” than “signal”; some type of intervention or training is generally required to tune machine tagging algorithms for specific enterprise content.
Even large language models (LLMs), which seem to perform so exceptionally, are prone to highly convincing hallucinations. Further, at this point in their evolution, LLMs trained on large, publicly available corpora can perform variably on specific proprietary content.
Whichever model or algorithm is selected for machine tagging, using machine techniques for enterprise content tagging generally requires skilled data science or engineering resources to train and retrain models or to query data to understand model performance. This adds either expense or a process bottleneck, as content teams must rely on technical teams to support their use or improvement of machine-generated content metadata.
Marrying information science and data science
There is a blended path, however, that can harness the power of machine scale to apply curated controlled vocabularies for creation of high-quality content metadata.
There are multiple ways to achieve this, but any approach that achieves transparency and explainability will have the benefit of enabling content experts (e.g., content strategists, taxonomists, content authors, marketing managers) to manage the process with limited engineering and data science support. This expands the user community available to drive a process of iterative improvement that can ultimately exceed the performance of human tagging or machine tagging alone.
The core elements of a human-in-the-loop approach include:
- Connecting enterprise taxonomies to text analytics services that apply the controlled vocabularies at scale as document annotations.
- Enriching enterprise taxonomy properties to support machine tagging (e.g., expanding alternative labels to increase concept recognition, or introducing context rules such as ‘must match’ or ‘must not match’ to filter inaccurate concept matches and thereby decrease noise in the content metadata); a sketch of this appears after this list.
- Displaying outputs of machine tagging in a transparent and explainable interface.
- Iteratively improving machine tagging through review of explainable document annotations and enrichment of the taxonomy in a continuous feedback loop.
- Capturing quality metrics to track improvement: false negatives, false positives, recall, precision, F1 score.
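To show how these elements might fit together, here is a minimal, hypothetical sketch in Python: a tiny taxonomy with alternative labels and a ‘must not match’ context rule is applied to document text, and the resulting machine tags are scored against a human-reviewed gold set using precision, recall, and F1. Concept names, labels, and rules are invented for the example; a production workflow would use an enterprise taxonomy management and text analytics platform rather than hand-rolled matching.

```python
# Minimal, hypothetical sketch: taxonomy-driven tagging plus quality metrics.
# Concepts, labels, and rules below are invented for illustration.

TAXONOMY = {
    "Cloud Storage": {"alt_labels": ["object storage", "blob storage"],
                      "must_not_match": ["cold storage locker"]},
    "Media Archive": {"alt_labels": ["content archive", "asset archive"],
                      "must_not_match": []},
}

def tag_document(text: str) -> set[str]:
    """Apply concepts whose preferred or alternative labels appear in the text,
    skipping any match suppressed by a 'must not match' context rule."""
    text_lower = text.lower()
    tags = set()
    for concept, props in TAXONOMY.items():
        labels = [concept.lower()] + [label.lower() for label in props["alt_labels"]]
        if any(label in text_lower for label in labels):
            if not any(phrase in text_lower for phrase in props["must_not_match"]):
                tags.add(concept)
    return tags

def score(machine: set[str], gold: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 of machine tags against a human-reviewed gold set."""
    tp = len(machine & gold)   # true positives
    fp = len(machine - gold)   # false positives
    fn = len(gold - machine)   # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

doc = "We migrated the asset archive to object storage last quarter."
machine_tags = tag_document(doc)                     # {'Cloud Storage', 'Media Archive'}
print(score(machine_tags, gold={"Media Archive"}))   # recall 1.0, precision 0.5
```

In a human-in-the-loop process, the reviewer would inspect the annotated matches, adjust alternative labels or context rules in the taxonomy, and re-run the scoring to confirm that precision and recall improve.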
Relative to human-only tagging, this approach can reduce cost while improving quality of the content metadata.
Relative to machine-only techniques, this approach has many benefits:
- Rapid deployment with no extensive model training required
- Faster implementation and rapid iterative improvement
- Extended user community (no coding or scripting skills required)
- Transparency and confidence in data
Cultivating a content-aware knowledge graph
The value of content metadata is increased further when content tags are stored in a graph database. In this way, content tagging cultivates a content-aware knowledge graph, which can support further content insight.
With a graph database storing content metadata, any of the following relationships can be stored, queried, analyzed, and visualized:
- Which documents or content sets are most similar to one another?
- How are specific concepts trending over time, language, or geography?
- What concepts tend to co-occur in the same document?
- Which content types are low performing in their representation of concepts in controlled vocabularies?
In this way, the content-aware knowledge graph can power data insight, better content recommendation systems, content and compliance audits, and many other use cases that are inadequately served through many standard tagging workflows today.
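As a simple illustration of the kind of question such a graph can answer, the sketch below uses plain Python and an invented document-to-concept edge list standing in for the graph database, and counts which concepts co-occur in the same documents. In practice, these questions would typically be asked directly of the knowledge graph with a graph query language.

```python
# Hypothetical sketch: concept co-occurrence over document-to-concept edges.
# The edge list below is invented; in practice this would be a query against
# the graph database that stores the content metadata.
from collections import Counter
from itertools import combinations

edges = [
    ("doc-1", "Cloud Storage"), ("doc-1", "Media Archive"),
    ("doc-2", "Cloud Storage"), ("doc-2", "Localization"),
    ("doc-3", "Cloud Storage"), ("doc-3", "Media Archive"), ("doc-3", "Localization"),
]

# Group concepts per document, then count co-occurring concept pairs.
concepts_by_doc: dict[str, set[str]] = {}
for doc_id, concept in edges:
    concepts_by_doc.setdefault(doc_id, set()).add(concept)

pair_counts = Counter()
for concepts in concepts_by_doc.values():
    for pair in combinations(sorted(concepts), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
# [(('Cloud Storage', 'Media Archive'), 2), (('Cloud Storage', 'Localization'), 2),
#  (('Localization', 'Media Archive'), 1)]
```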
LLMs and your knowledge graph
The approach described above can also support LLM-driven analysis of enterprise content. This is not ‘either/or’ but ‘both/and’. Pairing an LLM with the approach outlined above can drive the following benefits:
- Boosted LLM inputs to generate higher quality outputs
- Retrieval augmented generation (RAG) to leverage custom data
- Improved interpretability of LLM outputs
- Added embeddings for similarity search (illustrated in the sketch after this list)
- Extraction of novel concepts (i.e., identification of concepts not already represented in enterprise controlled vocabularies)
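As a minimal illustration of embedding-based similarity search, the sketch below ranks documents by cosine similarity against a query vector. The vectors here are invented stand-ins; in practice the embeddings would be produced by an embedding model and stored alongside the knowledge graph.

```python
# Hypothetical sketch: embedding-based similarity search with cosine similarity.
# Document ids and vector values below are invented for illustration.
import numpy as np

doc_embeddings = {
    "doc-1": np.array([0.9, 0.1, 0.0]),
    "doc-2": np.array([0.8, 0.2, 0.1]),
    "doc-3": np.array([0.0, 0.9, 0.4]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query: np.ndarray, k: int = 2) -> list[tuple[str, float]]:
    """Rank stored documents by cosine similarity to the query embedding."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in doc_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

print(most_similar(np.array([0.85, 0.15, 0.05])))
# doc-1 and doc-2 rank highest; doc-3 is least similar to the query.
```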
Conclusion
Most enterprises today allocate significant resources to tagging proprietary content, but this is not always time and money well spent. A smarter investment is to balance the human and machine approaches and cultivate a content-aware knowledge graph that can power similarity indexing, recommendations, data audits, and data insights accessible to, and driven by, non-technical users.
Many organizations would benefit from migrating to a hybrid approach to content metadata creation – harnessing algorithms to scale human-curated controlled vocabularies. The partnership of information science and data science can achieve precision and recall scores that exceed those of siloed approaches.