4 million terms, global users, standard quality measures? No problem!
Establishing quality assurance in an enterprise terminology database
As with any database, the usefulness of a terminology database, or termbase, is directly related to the information it contains. The old adage “garbage in, garbage out” still holds true.
I have written earlier about the question of what terms should go into a termbase (see my earlier blog on this topic). The question I would like to mull over in this piece is how to ensure that the correct information is entered in the right place, without errors.
The Terminology@SAP team manages the SAP termbase, SAPterm (available publicly at www.sapterm.com). Our termbase contains over 150,000 concept entries with 4 million terms. On top of that, it’s available across 40+ languages. Thousands of users located in all corners of the world maintain, view and use this content every day. Information from countless subject areas and industries is brought together in a single termbase.
Quality control of the termbase content is an ongoing effort. It’s an activity that never ends, like mowing your lawn, shovelling snow or picking up after your kids. The big question, the solution we are striving for, is how to bake quality control into the system itself. How can we ensure only quality information goes into our termbase?
One area of a termbase that lends itself particularly well to this type of quality control is the choice and use of data categories.
If you have a moderately complex termbase, you are likely to have database fields (or data categories) of several types. System-generated data categories are filled in automatically by the tool. Closed data categories are limited to a set of possible values, or predefined pick lists. Open data categories are text fields in which the user is free to add information. Each data category type offers opportunities for quality assurance within a termbase.
System-generated data categories are an easy way to ensure that data is entered correctly. Users should never be required to enter their names or dates manually when creating or updating term entries. The latter can be particularly important if your termbase is used by colleagues around the globe: does 12.02.11 mean “11 February 2012”, “12 February 2011”, or “2 December 2011”? By always having the system generate this type of information, you avoid problems with inconsistency in how date formats or user names (first name + last name or last name + first name) are entered.
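To illustrate the idea, here is a minimal sketch of an entry record whose audit fields are generated by the system rather than typed by the user. The field names and the ISO 8601 date format are my own illustrative choices, not SAPterm's actual data model:

```python
import os
from dataclasses import dataclass, field
from datetime import datetime, timezone

def _utc_now() -> str:
    # Unambiguous ISO 8601 timestamp in UTC, e.g. "2024-05-01T09:30:00Z";
    # no reader can misparse it as day-first or month-first.
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

@dataclass
class TermEntry:
    term: str
    # System-generated fields: the user never types these, so the date
    # format and user-name format stay consistent across the termbase.
    created_by: str = field(default_factory=lambda: os.environ.get("USER", "unknown"))
    created_at: str = field(default_factory=_utc_now)

entry = TermEntry("ledger")
```

Because the defaults are factories on the record itself, there is simply no input path through which an ambiguous date like 12.02.11 could enter the database.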
The use of predefined pick lists is invaluable for ensuring that information is entered consistently. By limiting the users’ choices to a predefined set of possible values, we can ensure reliable searching and filtering on key data categories. Some common examples include part of speech, gender, and term status. However, any information type that can be distilled down to a predefined set of possible values should be implemented as a pick list whenever possible. For example, geographical usage (country or locale), project identification, term type and subject field are all data categories that can be implemented as closed data categories.
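A closed data category can be modelled as an enumeration, so that any value outside the pick list is rejected at entry time. The specific values below are generic examples, not SAPterm's actual pick lists:

```python
from enum import Enum

class PartOfSpeech(Enum):
    NOUN = "noun"
    VERB = "verb"
    ADJECTIVE = "adjective"

class TermStatus(Enum):
    PREFERRED = "preferred"
    ADMITTED = "admitted"
    DEPRECATED = "deprecated"

def validate_pick_list(value: str, allowed: type[Enum]) -> Enum:
    """Accept only values from the predefined pick list."""
    try:
        return allowed(value)
    except ValueError:
        raise ValueError(
            f"{value!r} is not a permitted {allowed.__name__} value; "
            f"choose one of {[m.value for m in allowed]}"
        )

validate_pick_list("noun", PartOfSpeech)    # accepted
# validate_pick_list("nuon", PartOfSpeech)  # rejected: typo caught at entry time
```

A typo such as "nuon" can never silently pollute the part-of-speech field, which is what makes filtering on these categories reliable.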
Any termbase of some complexity is likely to contain at least a couple of open data categories, such as the term itself, definition, context, examples or usage notes. What these have in common is that the information is entered as free text and the content is not predictable. How then can you ensure consistent, quality information in these types of data categories?
The first line of defence is clearly defined specifications of the information that is allowed in each of these data categories. How are terms entered into the termbase (capitalization, canonical form, use of parentheses)? What is the permitted format for definitions? What information can be included in usage notes?
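Parts of such a specification can also be checked mechanically at entry time. The rules below (no stray whitespace, no parentheses in the term field, lowercase canonical form unless the term is an acronym) are hypothetical examples of entry specifications, not SAP's actual rules:

```python
def check_term_format(term: str) -> list[str]:
    """Flag deviations from (hypothetical) term-entry specifications."""
    issues = []
    if term != term.strip():
        issues.append("leading or trailing whitespace")
    if "(" in term or ")" in term:
        issues.append("parentheses are not allowed in the term field")
    # Flag initial capitals for review: under these example rules, terms
    # are entered in lowercase canonical form unless they are acronyms
    # (all caps) or proper nouns (which a reviewer must confirm).
    if term[:1].isupper() and not term.isupper():
        issues.append("initial capital: confirm proper noun or re-enter in lowercase")
    return issues

check_term_format("ledger")               # clean: no issues
check_term_format("ledger (accounting)")  # flagged: parentheses in term field
```

Checks like these do not replace the written specification; they simply make it harder to violate it unnoticed.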
The second step is training and support for the users who are tasked with entering data into the termbase based on these specifications. This in turn is closely followed by the requirement that each entry be reviewed by a second person before being released.
Another important quality tool is the availability of spelling and grammar checking of all text fields. If your termbase is multilingual, these tools should be available in all languages.
In addition to making sure that information is free from spelling mistakes, the ability to automatically flag potential problems in text fields is extremely useful. It is beneficial to be able to search and flag open data categories for:
- Double or single space after full stop;
- Hard line breaks (makes things complicated on export and conversion);
- Capitalization of terms (acronyms, proper nouns, common nouns);
- Consistency between part of speech and definition (definitions of nouns begin with an article, verbs with an infinitive marker);
- Definitions, contexts or examples that contain numbers only;
- Definitions, contexts or examples with extremely short or long texts.
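Several of the checks above can be automated with simple pattern matching. The sketch below illustrates the idea; the length thresholds are arbitrary, and the part-of-speech heuristic only works for English definitions:

```python
import re

def flag_text_field(text: str, min_len: int = 10, max_len: int = 500) -> list[str]:
    """Flag suspicious free-text content; thresholds are illustrative."""
    flags = []
    if re.search(r"\.\s{2,}", text):
        flags.append("double space after full stop")
    if "\n" in text or "\r" in text:
        flags.append("hard line break")  # complicates export and conversion
    if re.fullmatch(r"[\d\s.,]+", text):
        flags.append("numbers only")
    if len(text) < min_len:
        flags.append("very short text")
    elif len(text) > max_len:
        flags.append("very long text")
    return flags

def pos_definition_mismatch(pos: str, definition: str) -> bool:
    """English-only heuristic: noun definitions should start with an
    article, verb definitions with the infinitive marker 'to'."""
    words = definition.lower().split()
    first = words[0] if words else ""
    if pos == "noun":
        return first not in {"a", "an", "the"}
    if pos == "verb":
        return first != "to"
    return False

flag_text_field("42")  # flagged as numbers-only and very short
```

In practice, checks like these would run on save or in a batch report, flagging entries for human review rather than rejecting them outright.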
Can we make the grass mow itself, the snow fall in neat piles, or make the dirty socks always hit the laundry basket when fired from across the room? Maybe not, but building quality checks into your data categories can help ensure better quality of information from the moment it is entered into your termbase.
I’d like to hear from others how they handle quality assurance and data verification in their termbases. What mechanisms do you have in place to manage quality in your termbase? To what extent do you automate the quality checks in your terminology management solution? What are some areas where you would like to add or improve quality checks or measures?