Special Characters
I grew up in central Michigan. When I moved to the Southeast, I learned the word “character” had a very special meaning when applied to people. Describing someone as a character typically meant the person was eccentric, odd, or particularly flamboyant. Calling someone a character wasn’t necessarily an insult but wasn’t necessarily a compliment, either. When you hear, “He’s such a character, bless his heart,” chances are this is a nice way of not being nice.
This Southern usage of “character” reminds me of what we typically refer to as “special characters” in text. Special characters are special and sometimes eccentric, odd, or particularly flamboyant. There are thousands of special characters, which can include anything in written English that is not one of the 26 letters of the modern Roman alphabet or standard punctuation. Simple and more common examples include loan words from languages which use diacritic marks, such as café, piñata, and, arguably, Hawaiʻi or Shi’ite.
In many cases, we can leave out diacritical marks without consequence. We can tell the words apart from context in a sentence like, “The exposé sought to expose the corruption in the system.” Machines, on the other hand, only have the diacritic mark as an indicator of difference between the two otherwise identical terms. In this example, the acute accent is what marks exposé and expose as two different words with two different pronunciations. What do we do with special characters in text analytics processes?
Removing Special Characters
Commonly, special characters and numbers are removed from text during preprocessing. Preprocessing attempts to regularize text by tokenizing the text, removing stopwords, stemming, removing extra spaces, normalizing capitalization, removing punctuation, or any other process which makes the text easier to analyze. While some of this processing is useful or even essential, it does have an impact on the original nature of the text.
There are good arguments for removing special characters during preprocessing. For example, if several versions of the same term appear in the text, such as uber, Uber, über, and Über, chances are good we will want to treat all versions of the term as the same, so normalizing capitalization and removing special characters makes sense. Likewise, in messy and inconsistent text, removing special characters may help to focus on the words used rather than dealing with variants.
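The uber/Über case can be sketched in a few lines of Python. This is only an illustration using the standard library's unicodedata module; the helper name fold is made up for the example, and a real preprocessing pipeline may normalize differently.

```python
import unicodedata

def fold(term: str) -> str:
    """Hypothetical helper: strip diacritical marks and normalize case."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", term)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lowercase intended for comparisons.
    return stripped.casefold()

# \u00fc is ü and \u00dc is Ü, written as escapes to avoid encoding surprises.
variants = ["uber", "Uber", "\u00fcber", "\u00dcber"]
folded = {fold(v) for v in variants}
print(folded)  # all four variants collapse to a single form
```

The same folding makes exposé and expose identical, which is exactly the tradeoff: variants collapse, but so do words that differ only by a diacritic.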
If an apostrophe is used in a word such as Shi’ite, removing the character and closing the space retains the word as a single token. On the other hand, if the apostrophe is treated as punctuation instead of part of the word, the output text may stand as Shi ite, forming two separate words which are not matched to vocabularies or recognized as a single entity. Therefore, removing special characters which may confuse other text analysis processes, and closing the gap they leave behind, can eliminate such problems.
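The difference between the two treatments of the apostrophe is easy to see in a small sketch. The function names and the set of apostrophe-like characters below are assumptions chosen for illustration, not a complete list and not any particular tool's behavior.

```python
import re

# Assumed set of apostrophe-like characters: ASCII apostrophe,
# right single quotation mark (U+2019), and the okina (U+02BB).
APOSTROPHE = re.compile("['\u2019\u02bb]")

def split_on_apostrophe(text: str) -> list[str]:
    """Treat the apostrophe as punctuation: replace it with a space."""
    return APOSTROPHE.sub(" ", text).split()

def close_up_apostrophe(text: str) -> list[str]:
    """Remove the apostrophe and close the gap, keeping one token."""
    return APOSTROPHE.sub("", text).split()

word = "Shi\u2019ite"  # Shi’ite, with a typographic apostrophe
print(split_on_apostrophe(word))  # two fragments
print(close_up_apostrophe(word))  # one token
```

Only the second result can still be matched against a vocabulary entry for the closed-up spelling.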
Acknowledging Special Characters
The other course of action is to include and acknowledge special characters. Concepts with special characters can be managed as part of controlled vocabularies or can be treated specifically in text analytics rules. Many taxonomy management systems allow for Unicode characters and, when used in conjunction with text analytics tools, have the option to match the term exactly as written. Likewise, standalone text analytics systems will also have the option to match the text exactly, including recognizing case sensitivity and special characters.
There are many advantages to recognizing special characters. Even in an environment in which English is the primary language, foreign-language terms or characters may need to be handled by the system and recognized in text. Similarly, proper names for companies or even individuals can include mixed capitalization or special characters. In some instances, the capitalization or special character may be the only way to distinguish between the proper name and the general name. When Prince became the Artist Formerly Known as Prince, his only designation was a symbol. While it’s possible to recognize the symbol by the supporting words surrounding it in text, recognizing it when it stands alone requires the ability to handle special characters.
Special characters can be recognized as part of managed concepts or can be dealt with in pattern matching rules for specialized tasks. For instance, analyzing social media content may require recognizing existing terms from a taxonomy, extracting previously unknown concepts, or differentiating between the text and hashtagged concepts. Removing the hashtags will treat the terms like any other word in the text. A pattern matching rule, however, will be able to match and recognize any concept beginning with a hashtag, allowing for the analysis of the body text separately from the hashtags.
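A pattern matching rule of this kind might look like the following sketch. The pattern, the function name split_hashtags, and the sample post are all invented for illustration; real platforms define hashtags with more detailed rules than a simple run of word characters.

```python
import re

# Assumed rule: a hashtag is "#" followed by one or more word characters.
# In Python 3, \w is Unicode-aware for str patterns, so accented letters match.
HASHTAG = re.compile(r"#\w+")

def split_hashtags(post: str) -> tuple[str, list[str]]:
    """Return the body text with hashtags removed, plus the hashtags found."""
    tags = HASHTAG.findall(post)
    # Replace each hashtag with a space, then collapse the extra whitespace.
    body = " ".join(HASHTAG.sub(" ", post).split())
    return body, tags

# \u00e9 is é, written as an escape so the letter is a single code point.
post = "Loved the new caf\u00e9 downtown #coffee #Caf\u00e9Life"
body, tags = split_hashtags(post)
print(body)
print(tags)
```

Keeping the two outputs separate lets the body text go through ordinary preprocessing while the hashtags are counted or matched against a taxonomy on their own.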
While special characters, in all their eccentricity, may cause many text analytics headaches, the choice whether to remove and ignore or include and manage can prove to be a valuable tool in analyzing text.