Comparing A to B

In recent discussions about taxonomy versioning, several of our clients cited A/B testing as one of their use cases for creating taxonomy versions. In addition to creating point-in-time versions of their taxonomies for archival purposes, they also want a dated snapshot to serve as a point of reference for ongoing changes.

“A/B testing is a way to compare two versions of a single variable, typically by testing a subject’s response to variant A against variant B, and determining which of the two variants is more effective” (Wikipedia). Being able to compare one version of a taxonomy to another allows us to determine performance over time as the taxonomy changes. 

Let’s dig into this a little more. 

A/B Testing in Taxonomy Construction

A/B testing is often used in taxonomy construction. In this scenario, comparing Taxonomy A to Taxonomy B isn’t really about seeing changes in context or viewing audit logs listing all changes to a concept or scheme. Instead, the taxonomy end users consider two versions of a taxonomy structure and provide feedback about whether Taxonomy A or Taxonomy B will work better to help them navigate to concepts (or, more specifically, concepts representing products or content). On a retail website, for example, users may be asked to view top-level categories and determine where they would most likely find what they are looking for. Thus, you might have something like this:

Taxonomy A

    Outdoor Furniture

Taxonomy B

    Lawn & Garden
        Outdoor Furniture

Users, who cannot see the full structure until they navigate through it, would be asked where they would expect to find “Outdoor Furniture”. A/B testing in this regard is about design, naming, and context for competitive advantage. The user should be able to find “Outdoor Furniture” quickly through navigation and search without leaving the site (and going to a competitor). For the retailer, making this single change could potentially drive a large amount of revenue. Magnified across an entire navigational structure, a retailer stands to gain or lose a great deal of traffic, and eventual sales, depending on how well-structured and thoroughly tested the taxonomy is.
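This kind of study is often run as a “tree test”: each participant sees one variant and the team tallies how many located the target concept. A minimal sketch of that tally, using entirely hypothetical counts, might look like:

```python
def findability_rate(found: int, total: int) -> float:
    """Share of tested users who located the target concept in a variant."""
    return found / total

# Hypothetical first-click results for "Outdoor Furniture" in each variant.
# (found, total) pairs are illustrative, not real data.
results = {"A": (163, 250), "B": (214, 250)}

for variant, (found, total) in results.items():
    print(f"Taxonomy {variant}: {findability_rate(found, total):.0%} found it")
```

The variant with the higher findability rate is the stronger candidate, provided the difference is large enough relative to the sample size.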

A/B Testing for Performance Analytics

A/B testing can also be used to benchmark performance analytics against measurable key performance indicators (KPIs). Using A/B testing in this manner creates baselines for continuous improvement and quantifiable metrics justifying the ROI of the taxonomy and its associated personnel and software costs.

For example, we’ve already mentioned retail as a case for navigable or searchable taxonomies. Imagine using real KPIs, such as time spent on a page or whether the shopper purchased an item, to determine if one version of the taxonomy or another performed better. These taxonomies could be point-in-time snapshots compared at two different time periods or even two versions of the taxonomy rolled out simultaneously in a randomized, live testing environment in production. 

Another real-world example is how long it takes to resolve a customer service call center issue. This may be the time the call center representative spent resolving the issue or the time the user spent self-resolving. In fact, the end user and the internal customer service rep may not even be working from the same taxonomy since external and internal language is dependent on the use case and level of familiarity. Either taxonomy version, or both, could be tested against prior versions to establish which navigational structure or concept names were more effective.

A/B testing for taxonomy performance in machine learning may involve evaluating the performance of a taxonomy as a whole, including all of its concepts and differences between versions. For example, if a machine learning model consumes Taxonomy A and offers a recommendation to a user, will another machine learning model using Taxonomy B make the same recommendation or a different one? How will each of these recommendations fare based on click-throughs or final product purchases?
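One simple first step in that comparison is measuring how much the two models’ recommendation lists actually agree before looking at downstream click-throughs. A small sketch, with invented product names standing in for real model output, might use Jaccard overlap:

```python
def recommendation_overlap(recs_a: list[str], recs_b: list[str]) -> float:
    """Jaccard overlap between the recommendation sets from two models."""
    a, b = set(recs_a), set(recs_b)
    return len(a & b) / len(a | b)

# Illustrative output from models consuming Taxonomy A vs. Taxonomy B
recs_a = ["patio set", "umbrella", "fire pit"]
recs_b = ["patio set", "hammock", "fire pit"]
print(f"overlap = {recommendation_overlap(recs_a, recs_b):.2f}")
```

A low overlap tells you the taxonomy change is actually steering the model differently; the click-through and purchase KPIs then tell you whether that steering helped.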

Machine learning models are complex, and it will take reviewing the model output from two taxonomy versions and the resulting proximity to the desired outcome to understand if the cumulative changes had a positive or negative impact. It will also rely on transparency in the model to understand which of many changes were responsible for the change in results.

A/B Taxonomy Testing Challenges

While taxonomy testing at the top level, or focused on only a few differences, may work with live users, gathering feedback on multiple changes across multiple taxonomy versions is difficult. The more complexity in the taxonomies, including relationships within and between taxonomies, relationships between taxonomies and “things” (content or products), and the number of changes made in any given period, the harder it becomes to understand the changes made over time and what impact they had.

As taxonomies are continuously changing, it’s important to be able to take point-in-time snapshots of the taxonomy in use and measure specific changes against well-defined time windows and their associated analytics. For example, from August 1-7, Taxonomy A was in use and from September 1-7, Taxonomy B was in use. For each of those windows, you must have a snapshot of the taxonomy, the resulting website metrics, and verification that no changes were made to the taxonomy during the window of testing. Bringing all of these components together to run an effective A/B test takes careful planning and coordination.
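One lightweight way to verify that a taxonomy stayed frozen during a test window is to fingerprint snapshots taken at the start and end of the window. The sketch below assumes the taxonomy can be serialized to JSON; the dict structure shown is purely illustrative:

```python
import hashlib
import json

def taxonomy_fingerprint(taxonomy: dict) -> str:
    """Stable hash of a taxonomy snapshot; any edit changes the digest."""
    canonical = json.dumps(taxonomy, sort_keys=True)  # canonical ordering
    return hashlib.sha256(canonical.encode()).hexdigest()

# Snapshots taken at the start and end of a test window (illustrative data)
snapshot_start = {"Lawn & Garden": ["Outdoor Furniture", "Grills"]}
snapshot_end = {"Lawn & Garden": ["Outdoor Furniture", "Grills"]}

if taxonomy_fingerprint(snapshot_start) == taxonomy_fingerprint(snapshot_end):
    print("Taxonomy unchanged during the window: metrics are comparable")
else:
    print("Taxonomy changed mid-window: not a clean A/B comparison")
```

Storing a digest alongside each dated snapshot also gives you a quick way to confirm, long after the fact, exactly which version was live when a given set of metrics was collected.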

When conducting A/B testing on taxonomies, stay focused on specific areas of development or improvement so the results point to a clear course of action. Know exactly which areas of the taxonomy you’d like to address, and be sure you are comparing like items: testing changes to an appliances navigation against changes to the furniture navigation won’t provide clear answers. The changes must be known and must be comparable.

Similarly, attempting to compare two taxonomies as they continue to change does not provide clear point-in-time references. Make sure there is a taxonomy freeze for the same period of time on two separate occasions to establish two definite versions at work for a measurable time.

While coordinating taxonomy freezes, versioning, and external metrics tools can be challenging, A/B testing against established KPIs provides immense benefits for the taxonomy program and the organization as a whole.