Bias in the Machine
Let’s say you upload a photo of yourself and the software automatically tags your face as a gorilla. Your first reaction might be amusement. Then again, you might find it deeply offensive. Or you might think nothing of it, attribute it to a glitch in the machine, and re-categorize your photos correctly. This example is based on a true story that made the rounds after someone’s pictures were incorrectly tagged by Google Photos. Google quickly apologized but, to date, has simply removed the category rather than addressing the real issue: why did the software categorize people with dark skin as gorillas?
What happened? Is Google designing racist software? Is automatic image categorization horribly inaccurate by nature? Did a software engineering team not do its job?
While Google doesn’t make its algorithms readily available, the mis-categorization was likely due to statistically biased, ambiguous, or unrepresentative training data. In this case, there may have been too few examples of people with a variety of skin tones training the “people” category, or an overabundance of ape and monkey sample images. The end result is a machine learning model that readily recognizes a light-skinned individual as a person but cannot reliably distinguish the human-like faces of apes and monkeys from human faces with darker skin tones.
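To make the mechanism concrete, here is a minimal sketch in Python of how skewed label counts translate into skewed predictions. The labels and numbers are entirely invented for illustration; this is not Google’s data or pipeline, just a toy view of class priors in an imbalanced training set.

```python
from collections import Counter

# Hypothetical label counts in an image-tagging training set.
# The labels and numbers are invented for illustration only.
training_labels = Counter({
    "person (light skin tone)": 40_000,
    "gorilla": 12_000,
    "chimpanzee": 9_500,
    "person (dark skin tone)": 800,   # severely under-represented
})

total = sum(training_labels.values())

# A classifier that is unsure about a face tends to fall back on whatever
# it has seen most often among visually similar training examples. With
# priors like these, under-represented people risk being pushed into a
# better-represented, superficially similar category.
for label, count in training_labels.most_common():
    print(f"{label:<28} prior = {count / total:.3f}")
```

Even a well-designed model has little chance of behaving fairly when the examples it learns from are this lopsided.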
While humans can easily distinguish people of various skin tones from animals, a model trained on skewed data may not. You have perhaps seen the same type of mis-categorization when “faces” are detected in photos that contain no faces at all, only patterns with the characteristics of a face. These examples don’t indicate that the computer itself is prejudiced; they reflect human bias, usually unintentional, baked into the data and the process.
Bias in the Content Cycle
Believe it or not, bias can creep into your organization’s content cycle even when your content seems relatively innocuous. Ideally, a virtuous circle uses content to build a taxonomy, uses that taxonomy to tag content with controlled values, and retrieves content in search based on those tags and text keywords. In addition, organizations are starting to rely more on machine learning, which should be trained on good sample content tagged with taxonomy values. The virtuous circle can easily turn into a vicious circle in a variety of ways.
One source of bias in the cycle is the content itself. Content is usually generated by people, and people have biases. The writer’s particular language, the viewpoints expressed, and the amount of content generated about one subject rather than another all feed bias into the cycle. The content may be heavily skewed toward one area and minimal in another, presenting an inaccurate and slanted view of the subject matter. Intentionally or not, content may also carry political baggage, expressing a viewpoint shared only by a minority in the organization or supported by inaccurate assumptions. The content may also be outdated, expressing concepts in terms that are no longer acceptable or accurate. If you are modeling the current world on old information, there is bound to be a disconnect.
Bias in the Taxonomy
Another point of failure in the virtuous circle of content creation and retrieval is the taxonomy itself. The way a taxonomy is built affects its coverage and can reveal a glaring bias: since taxonomies are often built from content, they naturally reflect that content. All of the issues raised above about content apply here: bias in, bias out.
For example, I have built taxonomies for very large organizations, section by section, either as content became available or as groups within the organization were onboarded. In this scenario, what should be a general enterprise taxonomy may have a lot of depth and coverage in the financial area but none at all for product development. As a result, automatic or manual tagging restricted to terms in the taxonomy will tag some documents extremely well while other content is tagged poorly or not at all. If the search engine relies on this metadata, you may find a wealth of information on one subject and nothing on another. Many end users will accept the results, however incomplete, and assume there is nothing to find.
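To see how uneven coverage ripples through tagging and search, here is a toy sketch. The taxonomy terms and documents are invented, and the “tagger” is just substring matching, far simpler than any real system, but the failure mode is the same.

```python
# Toy illustration of uneven taxonomy coverage. All terms, documents,
# and the naive substring "tagger" are invented for this sketch.

taxonomy = {
    # Deep coverage for finance...
    "finance": ["budget", "revenue", "forecast", "audit", "invoice"],
    # ...and nothing yet for product development.
    "product development": [],
}

documents = {
    "q3-report": "revenue forecast and audit notes for the q3 budget",
    "design-review": "prototype usability findings and roadmap decisions",
}

def tag(text):
    """Tag a document using only controlled values from the taxonomy."""
    return [term
            for terms in taxonomy.values()
            for term in terms
            if term in text]

for doc_id, text in documents.items():
    print(doc_id, "->", tag(text) or "NO TAGS")

# q3-report     -> ['budget', 'revenue', 'forecast', 'audit']
# design-review -> NO TAGS
```

A search engine that relies on these tags will surface the finance document readily and never return the design review, and most users will conclude the design review simply doesn’t exist.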
Even when the taxonomy maintains a layer of separation from the content, the taxonomy builder, as a central point for metadata creation, brings biases of his or her own. His or her perception of how information should be organized, and the concepts chosen to organize it, may be shaped by the principles of information organization themselves.
Bias in the Text Analytics
There are many layers of potential bias in text mining and analytics, rooted in the complexity of language and the tools used to analyze it. For example, the natural language processing techniques used may mean the difference between picking up a two-word phrase and capturing the longer three-word phrase in which it appeared, which may represent the concept far better.
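As a hedged illustration, here is what that difference looks like with a simple n-gram counter; the phrase and sentence are invented, and real extraction pipelines are considerably more sophisticated.

```python
import re
from collections import Counter

# Invented example text; real pipelines work over whole corpora.
text = "Our supply chain risk assessment covers supply chain disruptions."
tokens = re.findall(r"[a-z]+", text.lower())

def ngram_counts(tokens, n):
    """Count contiguous n-word phrases."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(bigrams[("supply", "chain")])           # 2
print(trigrams[("supply", "chain", "risk")])  # 1

# An extractor limited to two-word phrases surfaces "supply chain" but never
# the more specific concept "supply chain risk" that the text is really about.
```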
Similarly, finding terms and phrases by raw frequency, without any adjustment or weighting, may present a skewed notion of what the content is actually about. A term may appear many times in a document, but that doesn’t necessarily make it the most important takeaway.
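One common adjustment is to weight term frequency by how widely a term appears across the collection (the familiar tf-idf idea). The toy corpus below is invented, but it shows how a raw count and a weighted score can disagree about what matters.

```python
import math

# Invented toy corpus for illustration.
docs = [
    "the report discusses the budget and the audit process",
    "the meeting covered the budget in detail",
    "the prototype failed the usability test",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens):
    """Term frequency weighted by (log) inverse document frequency."""
    tf = doc_tokens.count(term)
    df = sum(1 for toks in tokenized if term in toks)
    return tf * math.log(len(tokenized) / df)

doc0 = tokenized[0]
for term in ("the", "budget", "audit"):
    print(f"{term:<8} raw count = {doc0.count(term)}   "
          f"tf-idf = {tf_idf(term, doc0):.2f}")

# "the" wins on raw frequency but drops to 0.0 once document frequency is
# considered; "audit", rarer across the corpus, rises in weight.
```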
The frequency or importance of concepts can be further diluted when they are captured and made part of a taxonomy. If one concept appears 1,000 times in 10 documents and another appears 100 times in those same 10 documents, both can become a single entry in the taxonomy. This leveling effect, while creating a kind of equality among concepts in the taxonomy, doesn’t carry their relative importance back to the content when the taxonomy is applied as metadata.
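In code, the leveling looks something like this; the two counts echo the numbers above, and the concept names are invented.

```python
# The leveling effect in miniature. The counts mirror the example above;
# the concept names are invented.
corpus_frequency = {
    "risk management": 1000,  # occurrences across 10 documents
    "data retention": 100,    # occurrences across the same 10 documents
}

# Both concepts collapse into single, equal entries in the taxonomy...
taxonomy = set(corpus_frequency)

# ...so when the taxonomy is applied back as metadata, a document either
# gets the tag or it doesn't; the 10x difference in prominence disappears.
for concept in sorted(taxonomy):
    print(concept, "-> tag weight = 1")
```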
Depending on what content is used for concept extraction, the nature of that content has a huge impact on which concepts are identified, how often they appear, and how important they are to the overall collection. Including thousands of short texts with simple vocabulary alongside lengthy academic publications in the same extraction set will probably produce a lot of variance.
Bias in the Algorithms
When sample content is tagged with metadata from a biased taxonomy, which was in turn based on biased concept extraction from biased content, it’s easy to see all that can go wrong. As more organizations turn to machine learning algorithms to extract meaning and insights from very large bodies of content, the quality of the training data tagged with taxonomy values becomes critical. Just as garbage in produces garbage out, bias in produces bias out.
In the example cited at the beginning of this blog, for instance, a taxonomy may be based on content with many concepts covering the animal world but only one for “people”. When that taxonomy is then applied to many images of animals and only a skewed sampling of what constitutes “people”, the result is an algorithm that overgeneralizes images of apes and monkeys and recognizes “people” far too narrowly. The software appears racist, when in fact it is only tagging based on what it knows. The end product doesn’t deserve all the blame; a whole chain of bias and ethics failures came before it.
Bias in the Real World
Because most examples of “racist” or otherwise biased software are offensive but carry few real-world consequences, it’s easy to shrug off the results as something that will improve with time. However, as these algorithms are increasingly investigated for use in crime and terror prevention, identification of high-risk financial borrowers, or identification of surroundings by an autonomous vehicle, the real-world stakes become high indeed.
As with the concern over “fake news”, extracting insights from biased input can result in those biases being transferred to actionable, real-world applications. It’s our job as information professionals to identify bias in the content cycle and work to improve the outcomes.