What Are Training Sets?
Text analytics and auto-categorization software doesn’t simply work out of the box. It requires attention and training. It is possible to install a taxonomy-based auto-categorizer and run it against content without any training. Unless your taxonomy is already aligned with the content you are trying to classify, however, the automatic application of tagging will generally provide poor results.
Document classifiers require training on selected documents: model examples of the focus topic, along with negative examples drawn from other topics. To really work toward accurate categorization, these training documents should be specific to your taxonomies and content. I’ll describe some guidelines and methods for creating training sets of documents specific to categorizing content using existing vocabularies.
Where Do I Get Training Sets?
The content you currently want to classify is perhaps the best place to look for documents you want to use as models for expected categorization. For example, if your classification target is internal or external web pages, finding web pages about the specific topics you want to improve in classification is a good start. The drawback to this is the amount of time and work it will take to identify and prepare these documents for use in your auto-categorization tool. I’ll expand on this in a minute.
A reasonable alternative for getting started more quickly is to use predefined training sets on general subjects. Topic classification document sets such as Reuters or 20 Newsgroups provide a large body of pre-classified documents for training your auto-categorization system. These document sets are general and may or may not reflect the nature of your content, so you may need to map your existing taxonomies to the values tagged on the training sets to get the most from them. They can, however, be a rich source if you are using keyword and entity recognition and extraction to build a vocabulary from scratch or to develop one further.
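To make the mapping step concrete, here is a minimal sketch in Python. The label names on the left are real 20 Newsgroups categories, but the taxonomy concepts on the right and the function name map_training_set are hypothetical, chosen only for illustration. The sketch assumes the training set is already loaded as (text, label) pairs.

```python
# Hypothetical mapping from predefined training-set labels to concepts
# in an in-house taxonomy. The concept names are illustrative only.
LABEL_TO_CONCEPT = {
    "sci.med": "Health",
    "rec.autos": "Automotive",
    "rec.motorcycles": "Automotive",
    "comp.graphics": "Information Technology",
}


def map_training_set(documents, label_to_concept):
    """Re-tag pre-classified documents with taxonomy concepts.

    documents: iterable of (text, label) pairs.
    Returns (mapped, unmapped): mapped holds (text, concept) pairs,
    unmapped holds the original pairs whose label has no mapping yet.
    """
    mapped, unmapped = [], []
    for text, label in documents:
        concept = label_to_concept.get(label)
        if concept is None:
            unmapped.append((text, label))
        else:
            mapped.append((text, concept))
    return mapped, unmapped


sample = [
    ("New brake pads for a sedan.", "rec.autos"),
    ("Side effects of the new medication.", "sci.med"),
    ("Orbital mechanics question.", "sci.space"),
]
mapped, unmapped = map_training_set(sample, LABEL_TO_CONCEPT)
```

Keeping the unmapped documents in a separate list makes it easy to see which predefined labels still have no home in your taxonomy, or which may not be relevant to your content at all.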
What Makes a Good Training Document?
Good training documents are typically text-heavy, avoid a lot of specific formatting, and are already tagged to the categories you want automatically applied to future similar documents.
Using the web page example, although HTML is regularly formatted, text in sections like navigation, banner advertisements, or external links may skew the auto-categorization. Preparing the documents by removing or otherwise blocking sections which can trip up an auto-categorization system will provide cleaner results. Of course, cleaning the training documents will not solve the problem when the same formatting turns up in new content unless the system supports document section recognition, but an accurate training set is a first step until the formatting issues can also be addressed.
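As a rough illustration of this kind of preparation, the sketch below uses Python's standard html.parser module to keep only the text outside a few sections. It assumes the unwanted sections are marked with the tags listed in SKIP_TAGS. That set and the names BodyTextExtractor and extract_body_text are my own choices, not part of any particular auto-categorization tool. Advertisements wrapped in generic div elements, or pages with unclosed tags, would need additional handling.

```python
from html.parser import HTMLParser

# Assumed set of tags whose contents should not reach the categorizer.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class BodyTextExtractor(HTMLParser):
    """Collect text that is not nested inside any tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped sections we are inside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when outside every skipped section.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_body_text(html):
    """Return the page text with navigation-style sections removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Hearing loss affects the ears.</p>"
    "<footer>Contact us</footer></body></html>"
)
print(extract_body_text(page))  # prints: Hearing loss affects the ears.
```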
The best training documents are those which are specific to your vocabulary and your organization. As above, using general, predefined training sets may provide a lot of substance, but they may not reflect the topics of interest to your organization.
A more accurate, but admittedly more time-consuming, method is to create a small document set of even as few as ten documents for each concept in a vocabulary. Not every concept will need a training set, but having documents representative of each ideal categorization of a concept will provide a much higher degree of accuracy. A gold standard set of documents modeling how the concept should be identified or disambiguated can help design highly accurate rules ensuring the next time a similar document is encountered, it will be categorized correctly.
Tag those documents manually with the taxonomy values you’d like to see applied automatically, then run them through the categorizer. Comparing the two sets of tags shows whether each concept will categorize correctly in the future. This method requires an iterative process of running categorization trials and then modifying concept rules so they more accurately represent the document content. Once the rules are firing as expected on the training documents, move on to other concepts until the vocabulary is ready to categorize documents in production.
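One trial-and-refine cycle of this kind can be sketched as follows. This is a deliberately simplified stand-in for a real auto-categorizer: it assumes each concept's rule is just a set of keywords, and the concept names, documents, and function names (categorize, evaluate) are all hypothetical.

```python
import re


def categorize(text, rules):
    """Return the concepts whose rule keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {concept for concept, keywords in rules.items() if words & keywords}


def evaluate(gold_docs, rules):
    """Compare categorizer output with manual tags, concept by concept.

    gold_docs: list of (text, expected_concepts) pairs.
    Returns {concept: (precision, recall)}; a value is None when its
    denominator is zero.
    """
    results = {}
    for concept in rules:
        tp = fp = fn = 0
        for text, expected in gold_docs:
            predicted = concept in categorize(text, rules)
            wanted = concept in expected
            if predicted and wanted:
                tp += 1
            elif predicted:
                fp += 1
            elif wanted:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        results[concept] = (precision, recall)
    return results


# A tiny hand-tagged gold standard (hypothetical documents and concepts).
gold = [
    ("The auditory exam measures hearing loss.", {"Audiology"}),
    ("The court set a hearing date for the trial.", {"Litigation"}),
    ("Quarterly sales rose.", set()),
]

# First trial: "hearing" alone is too broad a rule for Audiology.
rules_v1 = {"Audiology": {"hearing", "auditory"}, "Litigation": {"court", "trial"}}
trial_1 = evaluate(gold, rules_v1)

# Second trial: the rule is tightened after reviewing the false positive.
rules_v2 = {"Audiology": {"auditory"}, "Litigation": {"court", "trial"}}
trial_2 = evaluate(gold, rules_v2)
```

In the first trial the Audiology rule also fires on the court document, so its precision is 0.5. After the rule is tightened, both concepts score 1.0 on this small set. A real gold standard would need enough documents per concept for those numbers to mean something.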
Training Documents for Disambiguating Terms
Document sets for disambiguating terms should include documents illustrating the various uses of the term in order to specify rules indicating which concept is which. Providing document sets which include one or more instances of a term used in various ways (either in one document or across several documents) can help a text analyst to create rules for each concept to be disambiguated and find positive and negative contextual examples the system can use in future categorization.
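For instance, a concept like hearing can have rules specifying positive or negative context words such as ears, auditory, court, and criminal. For the auditory sense, ears and auditory are positive context and court and criminal are negative; for the legal sense, the reverse holds. The appearance of these context keywords indicates which use of the concept is meant in the text.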
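The hearing example can be sketched as a small scoring rule. This is an assumption-laden illustration, not how any specific product implements disambiguation: the sense labels, the SENSE_RULES structure, and the disambiguate function are hypothetical, and the whole document is treated as the context window.

```python
import re

# Hypothetical rules: context words that support or contradict each sense.
SENSE_RULES = {
    "Hearing (auditory)": {
        "positive": {"ears", "auditory"},
        "negative": {"court", "criminal"},
    },
    "Hearing (legal)": {
        "positive": {"court", "criminal"},
        "negative": {"ears", "auditory"},
    },
}


def disambiguate(text, term, sense_rules):
    """Pick the sense of `term` best supported by context words.

    Returns None if the term is absent, or if no sense has a positive
    score that is strictly higher than every other sense's score.
    """
    words = re.findall(r"[a-z]+", text.lower())
    if term not in words:
        return None
    vocab = set(words)
    scores = {
        sense: len(vocab & rule["positive"]) - len(vocab & rule["negative"])
        for sense, rule in sense_rules.items()
    }
    best = max(scores, key=scores.get)
    top = scores[best]
    is_tie = list(scores.values()).count(top) > 1
    if top <= 0 or is_tie:
        return None  # not enough context to decide
    return best


legal = disambiguate(
    "The judge scheduled a hearing in criminal court.", "hearing", SENSE_RULES
)
auditory = disambiguate(
    "Loud noise damages the ears and impairs hearing.", "hearing", SENSE_RULES
)
unclear = disambiguate("We held a hearing.", "hearing", SENSE_RULES)
```

Returning None for the undecided case, rather than guessing, gives a text analyst a list of documents where the context rules need more keywords.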
Many organizations have quite a bit of topical clarity because the terminology they work with is specific to their industry. That said, it can be very surprising what common terms used in everyday writing will coincide and conflict with industry-specific terminology. To improve automatic categorization, these terms should be disambiguated.
Creating good sets of training documents will help to improve taxonomy-based auto-categorization results and provide an example base from which to work to continue improving document categorization.