What Is a Homograph?
What is a synonym for thesaurus? You may have heard this question floating around as a social media meme or at a conference for taxonomists. Interestingly, there is a synonym for the synonym half of a thesaurus: synonymicon. According to Merriam-Webster, this may actually be an accurate alternative term since one of their definitions of a thesaurus is “a book of words and their synonyms”. However, I usually think of a thesaurus as including both synonyms and antonyms, so a synonymicon isn’t an exact synonym despite being a great word.
The problem of synonymy is similar to the open problem of word sense disambiguation. It is relatively easy for a human mind to distinguish between the meanings of identical or similar words. For instance, if someone tells you they play the bass, you know by pronunciation that this is a musical instrument, not a fish. Words like this, which share a spelling but differ in both pronunciation and meaning, are known as heteronyms.
If someone writes you an email and tells you they play the bass, you know from context this is a musical instrument, not a fish. If someone writes to you and tells you they hit a fly at the game last night, you know by experience they are probably talking about baseball, even though they could very well be telling you they swatted an insect while at a game. Words that have the same spelling but different meanings are called homographs. The term can cover a broader category of similarity, but here we use it only in this sense. We will also not dive into the nuances of whether terms with the same spelling and different meanings are truly different if they share a common origin. For this discussion, it’s not relevant.
Since I love synonymy and word sense disambiguation, it’s probably no surprise that I also love puns. The more groaning and eye-rolling involved in the telling, the higher the status I will assign to the pun. Some people find this kind of humor punishing. Text analytics software doesn’t understand the double meanings inherent in puns, and, without help, it can’t tell apart words with the same spelling but different meanings. Straightforward meaning is challenging enough, but language riddled with double meanings, implications, misspellings, and slang makes word sense disambiguation incredibly challenging. What can we do to begin to address the problems of word sense disambiguation, especially homographs?
Taxonomies as Disambiguation Sources
One of the simplest methods to begin disambiguating terms is to use a taxonomy to control the different versions of a term. There are linguistic resources, such as WordNet or BabelNet, which offer pre-existing terms and context to use as a standard to define and disambiguate terms. Such resources are fine for general use and also an easy way to scale up to many concepts without much time and effort, but they are not tailored to the specific use cases or domains of your organization. While they may provide a starting point, chances are there will need to be modifications and specificity added to reflect your specific needs.
Within an organization, an enterprise-specific taxonomy can act as a pointed source for term disambiguation. Modifying an existing taxonomy or building from the ground up can be time-consuming and include a lot of overhead in governance and ongoing maintenance, but the specificity can improve accuracy in both auto-categorization and search retrieval.
Taxonomies offer several functions aiding in word sense disambiguation:
- The ability to maintain homographs, their parenthetical disambiguators (qualifiers in parentheses), definitions, and other associated metadata,
- Context based on parent and child hierarchical relationships,
- Context based on other types of relationships, such as related term, used for, or customized relationships, and
- Linguistic features, such as capitalization.
Let’s expand on each function.
Maintaining Homographs
A hierarchical, navigable taxonomy is a simple way to create homographs and maintain them as a model of organizational knowledge. A homograph can be created in its best contextual location and include parenthetical disambiguators, such as Fruit > apple (fruit) and Companies > Apple (company). Each term can include a definition or scope note and, depending on the taxonomy management tool, can also include other pieces of metadata which may help to distinguish terms with the same or nearly the same label.
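As a rough illustration, homographs with their qualifiers can be modeled as simple records. This is a minimal sketch, not the data model of any particular taxonomy management tool: the `Term` class, its fields, the `find_homographs` helper, and the scope notes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A hypothetical taxonomy term record (field names are assumptions)."""
    label: str            # surface label, e.g. "apple"
    qualifier: str        # parenthetical disambiguator, e.g. "fruit"
    path: tuple           # hierarchy above the term, e.g. ("Fruit",)
    scope_note: str = ""  # optional definition or scope note

    @property
    def display(self):
        # Render the term the way it is written in the text above.
        return f"{' > '.join(self.path)} > {self.label} ({self.qualifier})"


def find_homographs(terms):
    """Group terms by case-folded label; keep labels with more than one sense."""
    groups = {}
    for term in terms:
        groups.setdefault(term.label.casefold(), []).append(term)
    return {label: senses for label, senses in groups.items() if len(senses) > 1}


# Illustrative data only; the scope notes are placeholders.
TERMS = [
    Term("apple", "fruit", ("Fruit",), "The edible fruit."),
    Term("Apple", "company", ("Companies",), "The technology company."),
    Term("pear", "fruit", ("Fruit",)),
]

homographs = find_homographs(TERMS)
for label, senses in homographs.items():
    print(label, "->", [term.display for term in senses])
```

Grouping on a case-folded label is a deliberate choice here: it surfaces apple and Apple as one ambiguous label while each record keeps its own capitalization, qualifier, and location.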
By virtue of putting a term in its one best location (or perhaps many best locations if using polyhierarchy), the parent and child hierarchy can offer context. For example, the term Celestial Bodies > Planets > Gas Giant Planets > Saturn (planet) can be disambiguated from Automobiles > Automobile Brands > Saturn (automobile) through their hierarchies and not just their parenthetical qualifiers.
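One simple way to use the hierarchy as evidence is to count how many words from a term's ancestor labels also appear in the text. The sketch below assumes that approach; the function names and the deliberately crude plural stripping are hypothetical, and a production system would likely use proper stemming or lemmatization.

```python
import re

# Ancestor paths copied from the example above.
SENSES = {
    "Saturn (planet)": ["Celestial Bodies", "Planets", "Gas Giant Planets"],
    "Saturn (automobile)": ["Automobiles", "Automobile Brands"],
}


def tokens(text):
    """Lowercase alphabetic tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def crude_stem(word):
    # Very rough normalization (an assumption) so "planets" matches "planet".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def score_by_hierarchy(text, senses=SENSES):
    """Count words shared between the text and each sense's ancestor labels."""
    doc = {crude_stem(w) for w in tokens(text)}
    scores = {}
    for sense, path in senses.items():
        context = {crude_stem(w) for node in path for w in tokens(node)}
        scores[sense] = len(doc & context)
    return scores


def best_sense(text, senses=SENSES):
    """Return the single top-scoring sense, or None if there is no clear winner."""
    scores = score_by_hierarchy(text, senses)
    top = max(scores.values())
    winners = [sense for sense, value in scores.items() if value == top]
    return winners[0] if top > 0 and len(winners) == 1 else None


print(best_sense("Saturn is a gas giant and the sixth planet from the Sun."))
print(best_sense("My old Saturn was a reliable automobile."))
print(best_sense("I read about Saturn yesterday."))
```

Returning `None` when no hierarchy word matches, or when two senses tie, leaves the ambiguous cases for other evidence or for human review rather than forcing a guess.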
Similarly, standard taxonomic or thesaural relationships can add context to the term. Adding related terms from the same or other vocabularies as context words builds a network of semantically related concepts, providing evidence to distinguish one term from another. Most taxonomies also include synonyms as use/used for terms so equivalent terms can be redirected to the single preferred term. All of these contextual terms can be turned into auto-categorization rules to aid in classification and disambiguation.
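To make the idea of turning relationships into rules concrete, here is a minimal sketch with two rules: a used-for variant redirects straight to its preferred term, and otherwise related terms found in the text act as votes. The taxonomy records, the related-term lists, and the `categorize` function are invented for illustration and do not reflect any specific auto-categorization product.

```python
import re

# Hypothetical taxonomy records for two senses of "bass".
TAXONOMY = {
    "bass (instrument)": {
        "used_for": ["bass guitar", "double bass"],
        "related": ["band", "amplifier", "strings", "play"],
    },
    "bass (fish)": {
        "used_for": ["largemouth bass", "sea bass"],
        "related": ["fishing", "lake", "lure", "catch"],
    },
}


def words(text):
    """Lowercase alphabetic tokens in order."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(text, ambiguous_label="bass", taxonomy=TAXONOMY):
    """Return the preferred term for the text, or None if evidence is lacking."""
    normalized = " ".join(words(text))
    # Rule 1: a used-for variant redirects directly to its preferred term.
    for preferred, record in taxonomy.items():
        for variant in record["used_for"]:
            if re.search(rf"\b{re.escape(variant)}\b", normalized):
                return preferred
    # Rule 2: the bare label needs supporting related terms in the text.
    doc = set(words(text))
    if ambiguous_label not in doc:
        return None
    scores = {p: len(doc & set(r["related"])) for p, r in taxonomy.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    return best if best_score > runner_up_score else None


print(categorize("She plays double bass in the orchestra."))
print(categorize("We went fishing at the lake and I caught a bass."))
print(categorize("I saw a bass."))
```

Exact word matching keeps the sketch short, which is why "caught" does not count as evidence for "catch"; real rules would usually normalize word forms first.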
In addition, something as relatively simple as respecting capitalization can offer more information to disambiguate terms which are otherwise exactly the same in text. For instance, a very ambiguous term like BE A (an acronym for Beacon Explorer A) could be confused with the verb phrase be a if there is no part-of-speech recognition. Matching the term only in its capitalized form distinguishes the two.
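The effect of respecting capitalization can be seen with Python's standard `re` module. This is only a sketch: the sample sentence and the `count_matches` helper are made up for the example.

```python
import re

# Case-sensitive by default: matches only the capitalized form.
CASE_SENSITIVE = re.compile(r"\bBE A\b")
# Ignoring case also matches the ordinary verb phrase "be a".
CASE_INSENSITIVE = re.compile(r"\bBE A\b", re.IGNORECASE)


def count_matches(pattern, text):
    """Number of non-overlapping matches of a compiled pattern in the text."""
    return len(pattern.findall(text))


sample = "The BE A entry may be a good example."
print(count_matches(CASE_SENSITIVE, sample))    # 1
print(count_matches(CASE_INSENSITIVE, sample))  # 2
```

Capitalization alone is not a complete answer, since text written entirely in capitals would still match, so it works best combined with the other contextual evidence.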
Building a Corpus
In conjunction with a taxonomy of preferred terms, their relationships, and network of context keywords, a document set used for training helps to build a corpus of the terms annotated in context. Auto-categorizing a small set of selected topical documents against one or more terms provides a test of the accuracy of the tagging. Human involvement in the process allows users to select successful and unsuccessful instances of term categorization, building a database of sample contexts at the sentence, paragraph, or document level to use when categorizing future document sets.
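A reviewed context corpus can be sketched as a list of accepted and rejected examples per term, with word counts from each side used as evidence for new text. The `ContextCorpus` class, its method names, and the scoring formula below are assumptions chosen for simplicity, not an established algorithm.

```python
import re
from collections import Counter


def words(text):
    """Lowercase alphabetic tokens in order."""
    return re.findall(r"[a-z]+", text.lower())


class ContextCorpus:
    """Hypothetical store of human-reviewed contexts for taxonomy terms."""

    def __init__(self):
        self.examples = []  # (term, sentence, accepted) tuples

    def review(self, term, sentence, accepted):
        """Record a reviewer's decision on one auto-categorized sentence."""
        self.examples.append((term, sentence, accepted))

    def evidence(self, term):
        """Word counts from accepted and rejected contexts for a term."""
        positive, negative = Counter(), Counter()
        for example_term, sentence, accepted in self.examples:
            if example_term == term:
                (positive if accepted else negative).update(words(sentence))
        return positive, negative

    def score(self, term, sentence):
        """Positive minus negative counts, summed over the distinct words."""
        positive, negative = self.evidence(term)
        return sum(positive[w] - negative[w] for w in set(words(sentence)))


corpus = ContextCorpus()
corpus.review("fly (baseball)", "He hit a fly to left field at the game.", True)
corpus.review("fly (baseball)", "A fly landed on my sandwich at the game.", False)

print(corpus.score("fly (baseball)", "She hit a fly to center field"))  # 3
print(corpus.score("fly (baseball)", "A fly landed on the table"))      # -2
```

Words that appear in both accepted and rejected contexts, such as "game" here, cancel out, so only the words that actually separate the two senses move the score.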
Unlike public web pages, which can amass huge quantities of interaction information in order to build a profile of the page’s content and use, internal organizational content requires human tagging and review in order to build a profile of positive and negative contexts around a concept.
A combination of taxonomies and a context corpus provides a simple method of disambiguation without advanced knowledge of text analytics and machine learning.