Semantic Relationships
I wrote a blog post, “Building Healthy Relationships (in Ontologies),” last year, and I’m returning to that content to explore the development of semantic relationships in more detail. In that post I covered some guiding principles of relationship development; here I’d like to expand on them with some proposed methodologies.
Identifying the Things
I believe the first step in modeling controlled vocabularies and their ontology frameworks is knowing what you have. What are the things? Speculating on what you might have or could have in the future is much more difficult than assessing what you have today. That’s one reason standard ontology frameworks often need modification when an organization adopts them. While they are built by experts in the covered domain, a speculative ontology framework can’t possibly take every actual scenario into consideration.
Identifying the things in the organization can be a task taken up internally or with the assistance of consultants. A knowledge audit of the information landscape should take into account three basic pillars you’ve probably heard of: people, process, and technology. However, these three pillars leave out some other items which need to be audited and which can best be identified with the basic question words: who, what, when, where, why, and how.
Let’s look at some example questions and merge these with the three pillars of people, process, and technology.
Who is using information (content, data, etc.)?
This is the people question. We typically think about people in terms of proper names, and we need to ask ourselves if specifically identified people are important in our semantic domain modeling. Proper names for people are important to human resources and may be required for assigning responsibility for content. In some fields, a proper name is important, such as who authored a research paper. Whether proper names are part of the semantic domain really depends on what domain you are in; proper names of famous authors, living and dead, may be managed in an authority file while proper names of employees may be managed in a human resources database and not managed as part of the semantic model.
Moving up one level, we may find that roles or groups of roles are better suited for inclusion in our modeling. For instance, content authors or database administrators may be systems roles which dictate permissions to controlled vocabularies, concept attributes (properties), or what content can be accessed, viewed, or edited in a content management system. Mapping these roles to systems, content, and controlled vocabularies may be more valuable to the semantic model and more stable as specific employees may move in and out of roles or the organization.
What are they using this information for?
The “what” question may hit on people, process, and technology. For instance, a person or role identified by the “who” question may be creating content for use by another person. Once this is identified, the question then moves to “what process” and “what technology”: in other words, who is creating content during (or for) which process and in what systems? Whether you start with the person, process, or technology, the line of questioning quickly leads to identification of the other pillars.
When do they need this information?
The “when” is typically a point in the process. For instance, an invoice is generated by a person or a system at the point in a process after a service or milestone has been completed. The “when” may be an action that a person performs, the point in time at which specific information is needed, or a scheduled task in a system.
It is notoriously difficult to model processes in hierarchical models such as taxonomies. Processes change and steps can be skipped, repeated, or modified. Additionally, many processes may share the same step or steps and the hierarchical model has to account for the same or similar steps in different processes and at different hierarchical levels. In general, the process is part of the technology landscape: an action is performed by the person or technology at a point in time not determined by the hierarchical levels in a controlled vocabulary. That said, processes can indeed be modeled in graph databases with associated trigger events. Whether these events are considered part of the semantic model is debatable.
Where is this information?
In information landscapes, the “where” is typically going to be an electronic repository of some kind. These include content, document, digital asset, and records management systems on premises or in the cloud; file shares (the Z: drive or something similar); communication platforms (email or instant messaging); and hard drives on laptops, among many other possibilities. And let’s not forget that physical items can be counted among our information resources, and these also have a location: records management storage facility, warehouse, parking lot, office, or any other physical space.
Why is this information important?
The “why” question covers a lot of ground in our information landscape, but it’s often about a process, even if that process is abstracted to include concepts like company ethics or goals. For example, some information may be covered under a records retention schedule, and the “why” has to do with local, national, and international regulations. Or the “why” may be the need for quickly and easily accessible sales information for someone in the field: the customer is asking about a specific make and model of a vehicle, and the salesperson needs to know where it is, whether it is available, and the listed price.
The “why” is also pertinent to the “who”. Information critically important to a manufacturing engineer is probably not directly important to the sales team. Although working toward the same goal, manufacturing and selling automobiles, certain individuals and teams need particular information for their roles and processes.
How is this information created (managed, accessed, disposed of, etc.)?
For many information workers, the “how” question is centered on access and retrieval of information. Most likely this is through search, but it could be information which is pushed to users based on trigger events or time. For example, information about benefits sign-up is probably relevant to all employees in an organization at a certain time and can be pushed to them in a general information campaign.
The “how” question may also involve information architecture decisions about which systems hold which information needed to assemble content. Going back to our invoice example above, an invoice requires customer information like name and address; product information such as a textual description, identifying number, and price; and descriptive metadata included in and describing the invoice itself. This information is likely stored in multiple platforms dedicated to managing particular types of information, such as a customer relationship management system, a product information management system, and a centralized taxonomy management system. These systems all provide information which comes together as a piece of content and its metadata.
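As a rough illustration of that assembly, here is a minimal Python sketch. Everything in it is hypothetical: the three dictionaries only stand in for a customer relationship management system, a product information management system, and a taxonomy management system, and the identifiers, field names, and records are invented for the example.

```python
# Hypothetical stand-ins for three separate systems; all records are invented.
CRM = {"cust-1": {"name": "Example Customer", "address": "1 Example Street"}}
PIM = {
    "prod-1": {"description": "Example part A", "price_cents": 2500},
    "prod-2": {"description": "Example part B", "price_cents": 1050},
}
TAXONOMY = {"content-type/invoice": "Invoice"}


def assemble_invoice(customer_id, product_ids):
    """Pull customer, product, and descriptive metadata into one invoice record."""
    lines = [{"product_id": pid, **PIM[pid]} for pid in product_ids]
    return {
        # Descriptive metadata comes from the controlled vocabulary.
        "metadata": {"content_type": TAXONOMY["content-type/invoice"]},
        # Customer details come from the CRM stand-in.
        "customer": CRM[customer_id],
        # Line items come from the product information stand-in.
        "lines": lines,
        "total_cents": sum(line["price_cents"] for line in lines),
    }


invoice = assemble_invoice("cust-1", ["prod-1", "prod-2"])
print(invoice["metadata"]["content_type"], invoice["total_cents"])
```

The point is only that no single system owns the invoice; each contributes the piece it is dedicated to managing.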
Relating the Things
I’ve covered a lot and haven’t even started talking about relationships and how to relate all these things you’ve discovered to each other. The basic question words can be applied again here to track down and relate content.
Let’s take our invoice example: Who needs the invoice, for what process, when do they need it, where is the information located, why is it important, and how is the information provided?
We can answer these questions using simple subject-verb-object sentences built from the things we’ve already identified. The things are nouns, which are called subjects or objects in the world of semantic standards. We can connect these nouns with verbs, which are called predicates or relationships. We are then beginning to model the concepts, properties, and relationships needed in an ontology model.
So:
Sales Team has content type Invoice
Sales (process) has content type Invoice
Sales (process) has content type Invoice (at date, time, or triggering event)
Invoice has system Customer Relationship Management System
Invoice has system Product Information System
Invoice has system Taxonomy Management System
Sales (process) has record Invoice
Sales (process) has system Invoicing System
Sales Team has system Invoicing System
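The statements above can be sketched in code as plain (subject, predicate, object) tuples. This is a minimal illustration in Python, not a recommendation for any particular triple store; the helper names `objects_of` and `distinct_predicates` are made up for the example, and the statement qualified by a date, time, or triggering event is left out because a bare triple has no slot for that qualifier.

```python
# Each statement is a (subject, predicate, object) tuple taken from the list above.
TRIPLES = [
    ("Sales Team", "has content type", "Invoice"),
    ("Sales (process)", "has content type", "Invoice"),
    ("Invoice", "has system", "Customer Relationship Management System"),
    ("Invoice", "has system", "Product Information System"),
    ("Invoice", "has system", "Taxonomy Management System"),
    ("Sales (process)", "has record", "Invoice"),
    ("Sales (process)", "has system", "Invoicing System"),
    ("Sales Team", "has system", "Invoicing System"),
]


def objects_of(subject, predicate, triples=TRIPLES):
    """Return every object linked to `subject` by `predicate`, in list order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def distinct_predicates(triples=TRIPLES):
    """Return the sorted set of relationship names actually in use."""
    return sorted({p for _, p, _ in triples})


print(objects_of("Invoice", "has system"))
print(distinct_predicates())
```

Listing the distinct predicates is a quick way to see how few relationship names are carrying all of these statements.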
There are many possible unidirectional relationships, but I’ve narrowed them down to three more general relationships (has content type, has system, and has record) which can be reused in a variety of contexts. And this brings me to some general principles of designing and managing relationships: reduce, reuse, recycle.
Reduce
In general, I think it’s good practice to reduce the number of relationships you identify and use to connect content. Reducing the number of relationships makes management and application easier across systems and helps to ensure you are using the same semantic relationship everywhere rather than nearly synonymous variants under different names.
For example, do we need the relationship has Sales (process) content type to distinguish what one group or process needs versus another? I doubt it. Every group and process across your organization will need content and information to perform their jobs. If content types are identified as part of a taxonomy dedicated to electronic and physical assets, the use of hierarchical and associative relationships will determine the context of something like an “Invoice”. There is no real need to be more specific when the subjects, predicates, and objects will do the work for you. In one organization I worked in, employees wanted to specify types of meeting minutes: sales meeting minutes, engineering meeting minutes, financial meeting minutes, etc. But why? Are meeting minutes inherently different based on who is having the meeting or what the meeting is covering? Not really, so the more general “meeting minutes” can be used with a relationship to a specific team and/or topic. There is no need to pre-coordinate concepts in the subject, predicate, or object when they can be split into more elemental concepts and related to each other with simple, reusable relationships.
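To make the meeting minutes example concrete, here is a small Python sketch of the split approach: one general content type plus simple relationships to a team or topic. The document identifiers, the has team relationship name, and the `find_documents` helper are all hypothetical and exist only for illustration.

```python
# One general content type, related to teams and topics by reusable relationships.
# Document ids and the "has team" relationship name are invented for this sketch.
DOCUMENTS = {
    "doc-001": {("has content type", "Meeting Minutes"), ("has team", "Sales")},
    "doc-002": {("has content type", "Meeting Minutes"), ("has team", "Engineering")},
    "doc-003": {("has content type", "Invoice"), ("has team", "Sales")},
}


def find_documents(documents, required):
    """Return sorted ids of documents carrying every (predicate, object) pair in `required`."""
    return sorted(
        doc_id for doc_id, statements in documents.items() if required <= statements
    )


# "Sales meeting minutes" is recovered by combining two simple statements.
sales_minutes = find_documents(
    DOCUMENTS, {("has content type", "Meeting Minutes"), ("has team", "Sales")}
)
# All meeting minutes, regardless of team, need only the general concept.
all_minutes = find_documents(DOCUMENTS, {("has content type", "Meeting Minutes")})
print(sales_minutes, all_minutes)
```

Nothing called “Sales Meeting Minutes” has to exist as its own concept; the combination is assembled at query time from the elemental pieces.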
Reducing the number of relationships is probably going to save you a lot of governance work, but you don’t need to be stingy. Related to is a useful associative relationship, and every possible relationship could be rolled up into it to reduce the number of relationships used. However, related to is vague and not semantically descriptive. Knowing how a subject and object are related is far more useful.
Reuse
Part of the work of reducing the number of relationships is reuse. As shown above, a single relationship like has topic can be used for a variety of content types without the need for specifying things like has finance topic, has engineering topic, or has information technology topic. Rather, these topics are contextualized by their location in the Topics taxonomy providing tagging terminology. Financial concepts are in taxonomy branches with other financial topics. It is this context which makes it clear which topics are being applied to content.
Where reuse can fall short, however, is in the design of reciprocal relationships. Many taxonomy management systems do not support the reuse of the same one-way relationship in multiple inverse relationships. For example, it is possible to have something like Character has film Film and Actor has film Film, but the single has film relationship cannot have both a has character AND has actor as a reciprocal. Each of these must be set up as single, one-way relationships.
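The constraint can be modeled in a few lines of Python. This is only a sketch of the behavior described above, assuming a tool that stores exactly one reciprocal per relationship; the `InverseRegistry` class is hypothetical and does not represent any specific taxonomy management product.

```python
class InverseRegistry:
    """Toy model of a tool that allows one reciprocal per relationship."""

    def __init__(self):
        self._inverse = {}

    def register(self, forward, inverse):
        # Check both directions before changing anything.
        for name, other in ((forward, inverse), (inverse, forward)):
            existing = self._inverse.get(name)
            if existing is not None and existing != other:
                raise ValueError(f"{name!r} already has reciprocal {existing!r}")
        self._inverse[forward] = inverse
        self._inverse[inverse] = forward

    def inverse_of(self, name):
        """Return the registered reciprocal, or None for a one-way relationship."""
        return self._inverse.get(name)


registry = InverseRegistry()
registry.register("has film", "has character")

try:
    # A second reciprocal for the same "has film" relationship is rejected.
    registry.register("has film", "has actor")
    conflict = None
except ValueError as error:
    conflict = str(error)

print(conflict)
```

Under that assumption, has film can either be paired with one reciprocal or stay a one-way relationship that both Character and Actor reuse, but it cannot be paired with both has character and has actor.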
Reducing and reusing relationships requires planning and governance. While it’s not possible to predict every use case for your ontology model, especially as domains change or expand over time with events like mergers or new markets, careful planning of relationship policies will help to mitigate future problems.
Recycle
Finally, consider recycling relationships. Recycling is slightly different from reusing. Relationships defined in system schemas often disappear when the system is sunset or removed. Salvage these relationships before the system disappears and recycle them to connect content relocated from that system into a new location.
Another possibility for recycling relationships is to apply them in new applications. While controlled vocabularies and their systems should not be a solution searching for a problem, they can often be used in applications beyond what they were originally developed or purchased to do. Repurposing existing controlled vocabularies and their semantic relationships for new uses often means extending existing semantic structures rather than building from the ground up, much like recycling existing material into a new product.
Which Things do We Include?
One of the most common controlled vocabulary and ontology modeling challenges is deciding which of your many things and verbs should be represented in which way. For most things, identifying them as subjects and objects will be the best modeling choice. These subjects and objects will have a name recorded as a descriptor or preferred label. When the thing is a less common or less desirable name for a thing you already have, it is a synonym, connected by a use/used for relationship or by an alternative label field.
What if a thing describes another thing within your domain? Then we have modeling choices to make. It could still be represented as concepts connected by relationships, such as Shirt has color Blue. However, this descriptive information could be an attribute of a concept. In this case, the concept “Shirt” would have a metadata field called “Color” which could be populated by a dropdown list of colors or by using a free text field.
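The two modeling choices for the shirt example can be put side by side in a short Python sketch. The `Concept` class, the `set_color` helper, and the list of allowed colors are hypothetical; they stand in for whatever concept record and dropdown list a given tool provides.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    pref_label: str
    attributes: dict = field(default_factory=dict)


# Option 1: the color is its own concept, connected by a relationship.
shirt_triples = [("Shirt", "has color", "Blue")]

# Option 2: the color is an attribute value on the "Shirt" concept,
# restricted to a dropdown-style list (the list itself is invented here).
ALLOWED_COLORS = {"Blue", "Green", "Red"}


def set_color(concept, color):
    """Populate the Color field, rejecting values outside the allowed list."""
    if color not in ALLOWED_COLORS:
        raise ValueError(f"{color!r} is not in the allowed color list")
    concept.attributes["Color"] = color


shirt = Concept("Shirt")
set_color(shirt, "Blue")
print(shirt_triples[0], shirt.attributes)
```

In the first option “Blue” can carry its own labels and relationships; in the second it is just a value, which is simpler to manage but cannot be related to anything else.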
I mentioned briefly above that things like proper names may have many use cases in which they are not managed as part of a controlled vocabulary. There are other obvious concepts which are better left unmanaged in a controlled vocabulary: dates, addresses, some product information, rapidly changing data values, etc. There are always exceptions. Managing “09/11” as a concept is not the same as managing that information as a date. Similarly, “1600 Pennsylvania Avenue” is a valid synonym for the concept “The White House”.
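The synonym case can be sketched as a simple label lookup in Python, using the White House example. The dictionary layout and the `build_label_index` and `resolve` helpers are assumptions made for the illustration, and matching on case-folded text is just one possible design choice.

```python
# Preferred label mapped to its alternative labels (synonyms).
CONCEPTS = {
    "The White House": ["1600 Pennsylvania Avenue"],
}


def build_label_index(concepts):
    """Map every preferred and alternative label to its preferred label."""
    index = {}
    for pref_label, alt_labels in concepts.items():
        for label in (pref_label, *alt_labels):
            index[label.casefold()] = pref_label
    return index


def resolve(label, index):
    """Return the preferred label for any known label, or None if unmanaged."""
    return index.get(label.casefold())


INDEX = build_label_index(CONCEPTS)
print(resolve("1600 pennsylvania avenue", INDEX))
print(resolve("10 Downing Street", INDEX))
```

An address that is not managed as a label simply resolves to nothing, which matches the idea that most addresses stay outside the controlled vocabulary.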
Deciding what not to manage can be as important as deciding what to manage. Not everything in your semantic model is part of a controlled vocabulary, but data and content can be connected to those controlled vocabularies by relationships mapped out in an ontological domain model.