OF QUALITY AND CABBAGE: THE LAYERED APPROACH TO BUILDING LANGUAGE QUALITY METRICS

 

Leonid Glazychev, Ph.D., CEO, Logrus IT

Creating translation quality metrics that are applicable IRL

In a series of previous articles, I introduced the 3D Quality Triangle methodology for measuring translation quality, which combines holistic evaluation factors with traditional, atomistic ones, and suggested an approach for quickly and economically measuring quality on low-budget or community projects. I also covered the idea of using the best existing published atomistic issue catalogue, developed by MQM, as the basis for an atomistic quality metric, outlined some limitations of the MQM issue hierarchy, and discussed the optimal way to build metrics for measuring holistic factors.

Now that the methodology is in place, and we have clearly defined holistic quality scales to use in holistic metrics, the only missing pieces are a general approach to creating atomistic quality metrics and a complete, ready parts bin to select from when creating quality metrics in real life (IRL).

In case you missed it: atomistic quality measurements concentrate on unit-level, traditional issues, such as missing tags, improper grammar, corrupted formatting, etc. A typical atomistic quality metric is concerned with quantifying the findings. We log and count issues of all types, assign weights to these issues, summarize results and compare them against expectations.

As already mentioned, any atomistic metric comprises three major components:

  • The error typology (issue catalogue) represents the hierarchy and spectrum of all potential issues with translated or localized content
  • The severity scale comprises the full list of issue severities and weights associated with these severities. The more serious the issue is (depending on its adverse effect on overall quality), the higher its severity and relative weight
  • Evaluation criteria. These criteria set expectations for the total, combined weight of all issues discovered per 1,000 words. In its simplest form the evaluation is reduced to a pass/fail criterion: the translation passes if the combined relative weight of all found issues does not exceed a certain threshold
  • I strongly advise adding a fourth component to each metric, i.e. default severities/weights assigned to each issue type. This minimizes the natural variance created by reviewers assigning varying severities to otherwise similar errors (a minimal sketch of all four components follows this list)
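To make these components concrete, here is a minimal sketch of how all four of them could be represented together. The category names, the "minor"/"major" weights and the threshold are illustrative placeholders rather than actual MQM+ or Logrus IT values; only the medium-severity weight of 2 and the suggested showstopper weight of 1000 come from the examples later in this article.

# A minimal sketch of an atomistic metric with all four components.
# Category names, the "minor"/"major" weights and the threshold below are
# illustrative placeholders; "medium" = 2 and "showstopper" = 1000 follow
# the examples given later in this article.
atomistic_metric = {
    "error_typology": ["Terminology", "Locale convention", "Technical issues"],
    "severity_scale": {"minor": 1, "medium": 2, "major": 5, "showstopper": 1000},
    "evaluation_criteria": {"max_weight_per_1000_words": 20},  # pass/fail threshold
    "default_severities": {"Terminology": "medium", "Technical issues": "major"},
}

def passes(total_issue_weight, word_count, metric=atomistic_metric):
    # Pass/fail check: combined issue weight per 1,000 words vs. the threshold.
    weight_per_1000 = total_issue_weight * 1000 / word_count
    return weight_per_1000 <= metric["evaluation_criteria"]["max_weight_per_1000_words"]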

Both the severity scale and evaluation criteria are relatively simple, so the only major obstacle that stands in the way of creating a practical and solid atomistic quality metric is the selection or creation of a viable error typology (issue catalogue).

I am an avid admirer of the MQM issue framework, which is also at the foundation of the ASTM work item WK46396. At the same time, the MQM issue framework as is has several problems serious enough to make it almost unusable for practical purposes when it comes to creating viable quality metrics applicable to real-life projects. Luckily, it is not that hard to rectify all of these problems.

In the subsequent sections, I discuss the metric “wish list”, MQM applicability problems, and ways to resolve them. I also suggest the Cabbage Head approach as a practical way to build nested, upward-compatible atomistic issue catalogues of varying complexity suitable for all potential uses, from a super-quick review to an in-depth evaluation of translated content. I provide an overview of the resulting, modified MQM+ issue framework, fully compatible with the Cabbage Head structure, and a ready “parts bin” for creating atomistic quality metrics.

This parts bin already has all the components you need to create real-life metrics of any depth/complexity. I am also presenting a detailed overview of the two simplest, most usable metrics based on this approach. One of them can be used for quick reviews (including crowdsourced ones), and the other represents a more balanced compromise between usability and the level of detail. The latter metric is suitable for most other cases.

 

Where MQM and DQF overshot, did not go far enough or got it wrong

Unmanageable error typology

During each quality review, each atomistic issue encountered needs to be classified. Given that the MQM issue framework contains approximately 190 different categories and subcategories, each reviewer needs to know all 190+ definitions by heart to avoid mislabeling issues, which is a challenge in itself. The multi-layered hierarchy (the MQM catalogue includes up to four subcategory levels) also complicates finding each particular subcategory (one needs to know the path), not to mention the time spent on assigning the right one.

Automating preliminary issue categorization in such a bulky structure is also rather challenging, not to mention that this initial categorization most often needs to be verified by a human…

Example

Let’s assume there is a trivial date format error in a translation from US English into a European language (the month and day were not rearranged appropriately). First, you need to locate the category in the hierarchy, and then figure out that at least two apply: Accuracy – Mistranslation – Date/time or Locale Convention – Date format. It’s just too much searching and, in many cases, too many decisions or clicks…

To summarize, the full MQM catalogue is very detailed and well documented, but contains too many line items, is complex and tedious to navigate, and is, overall, overkill when it comes to practical applications.

Under most circumstances, nobody needs such a detailed classification of each issue. In reality, people use a number of much smaller, compact error typologies. Regrettably, there are too many different flavors, the definitions are not as polished, and none of these typologies is compatible with the full MQM structure. Each such typology was usually built for a particular, narrow purpose and cannot serve as a universal basis.

Confusing and misleading issue hierarchy

Below, I use the MQM as an example, as it is published and accessible to all, and DQF uses the MQM error typology anyway.

For the issue catalogue to be both usable and viable, its hierarchical structure needs to be logical and must also correlate with workflow steps involved in both translation and quality assurance processes.

  • It is best to separate issues that can only be revealed or confirmed by a native speaker (or a person who knows the target language) from purely technical issues that can be found by engineers in a centralized manner or by automation.
  • The same applies to separating checks related to translation and subsequent formatting. (Formatting is often not required at all or is simply unavailable during translation and review stages and cannot be checked/verified).
  • Common logic requires us to place all closely related issue categories together. For instance, I would expect all categories dealing with locale issues to be grouped in the same place.

Regrettably, MQM is lagging behind when it comes to the logic of its hierarchical structure. For example:

  • Sorting errors belong together with other locale-related issues, but in the original MQM issue catalogue Sorting is located under Fluency, not under Locale Convention
  • The high-level Fluency category is extremely eclectic and combines everything, from sorting issues to grammar and consistency issues to broken links (a purely technical item)

Flexibility instead of Scalability

Multiple existing quality frameworks, including both MQM and DQF, talk about the need for flexibility, but dig deeper and you discover that by flexibility they simply mean using one and the same error typology while changing the weights (severities) assigned to particular issues. For instance, we can ignore grammar errors in one metric and treat them with more respect in another.

A truly useful and universal quality model should be not simply flexible, but also scalable. The previous section has already emphasized usability-related drawbacks of the bulky MQM structure. True scalability means that depending on the case (budget, speed, area, etc.) we can apply issue catalogues of different complexity. Complexity in this context reflects the overall number of hierarchical levels, categories and subcategories. The goal is to avoid both extremes, i.e. to neither oversimplify the classification, nor increase QA effort and catalogue size without necessity.

 

The “cabbage head approach” to error typology

In my view, fixing all of the problems outlined above and creating a clean, logical and scalable basis for building viable quality metrics is a relatively simple endeavor. We simply need to keep the good part of MQM (a comprehensive, well-documented error typology that is soon to become an ASTM standard) and fix its hierarchical structure to make it solid, logical, easy to navigate and, most importantly, scalable.

Here’s the idea. It’s easiest to achieve scalability by applying the “cabbage head approach”: we simply rearrange the hierarchy in a way that allows us to peel off unnecessary outermost “leaves” (hierarchical levels) as needed, leaving smaller “heads”. Each smaller cabbage head with its outer leaves peeled off still represents a fully functional, but more compact error typology. The model provides scalability and at the same time guarantees full upward compatibility.

Removing a certain number of lower hierarchical levels reduces the level of detail but does not impair functionality or usability. We just select the best-suited issue catalogue size for each project line, type of material, budget or timing restrictions.
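As an illustration of the peeling idea, here is a small sketch: the same nested catalogue is simply truncated to the desired number of hierarchical levels, so every smaller catalogue remains a strict subset of the full one. The catalogue fragment below contains only a few sample categories for illustration, not the full MQM+ hierarchy.

# "Peeling" a nested issue catalogue: truncate the same typology to the
# desired number of hierarchical levels. Only a small fragment of the
# catalogue is shown here for illustration.
full_typology = {
    "Locale convention": {"Date format": {}, "Sorting": {}},
    "Technical issues": {"Markup/tags": {"Missing tag": {}, "Corrupted tag": {}}},
}

def peel(typology, levels):
    # Keep only the topmost `levels` hierarchical levels of the catalogue.
    if levels <= 1:
        return {category: {} for category in typology}
    return {category: peel(children, levels - 1)
            for category, children in typology.items()}

flat_typology = peel(full_typology, 1)      # top-level categories only
moderate_typology = peel(full_typology, 2)  # top two levels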

Going back to the original MQM error typology, there is actually not so much to fix.

  1. First, we fix the structure (issue hierarchy) to make it fully scalable and compatible with the cabbage head approach, i.e. ready for peeling off outermost levels while preserving full functionality. All issue categories in the new typology are grouped to reflect the underlying logic of both the translation and quality assurance processes.
  2. Then we introduce minimal changes to the issue categories themselves, adding some missing issues, removing redundant or duplicate ones, renaming misleading categories and fixing some definitions and comments.

Added categories include partially translated segments (which are neither fully translated nor left untranslated), errors related to violations of special guidelines or instructions provided by the client, etc.

Changes are summarized in the table below.

 


The final structure, which can be called MQM+, comprises four distinct detail levels (layers), which is sufficient for all practical QA purposes and applications. A viable error typology cannot go more than three levels deep; it’s simply impractical to go further.

 

Atomistic Error Typology Cabbage Head chart

 

 

  • The full cabbage head represents the unabridged version of the MQM+ error typology.

It comprises 4 hierarchical levels (3 in most cases), 9 major upper-level categories + 3 optional ones, 63 second-level categories and ~120 lower-level subcategories.

Do we actually need a super-fine structure in each case, with the full catalogue applied in its entirety? It is highly improbable that anyone will start analyzing respective numbers of issues at the third hierarchical level, for example, differentiating between the following two types of typographical errors: Punctuation and Unpaired quote marks or brackets.

For most people and applications classifying the issue as Typography is sufficient. Further differentiation is insignificant for almost any practical purpose except for academic research.

  • Peeling off the two outermost hierarchical levels (#3 and #4), we get the Moderate Detail error typology.

Its structure comprises the top two levels of the full MQM+ catalogue, i.e. 9 major categories + 3 optional ones, and 63 second-level categories.

  • Peeling off one more hierarchical level brings us to the Optimal error typology.

This one has a flat hierarchy and represents the topmost level of the full MQM+ catalogue. Its 9 major and 3 optional categories are clearly defined, including examples and areas covered. (This top level has changed most, because it is exactly the place where structural changes were required).

Nine major categories include: Adequacy, Intelligibility, Language issues, Terminology, Locale convention, Technical issues, Market compliance (Verity in the original MQM), Style, and Design.

More detailed definitions and examples are provided in this Table.

Three optional categories include: Internal Inconsistency, Internationalization (typically caused not by localization, but by source code and/or content problems), and Other (reserved for any requirements or violations going beyond the full MQM+ typology).

Internationalization is optional because it deals not with localization or translation quality or errors, but rather with issues originating in the original software or document design (such as blocking postal codes that do not follow the traditional US 5+4 pattern). It’s not useless, but it needs to be addressed by the client before localization, during an Internationalization Audit, to prevent potential issues. Marking issues as Internationalization-related during translation quality checks doesn’t make much sense, because translators typically cannot fix them.

Similarly, I strongly advise conducting a Market Compliance Audit on the client end prior to localization, because most issues falling under this category cannot be easily revealed by translators or third-party reviewers. You can find more details on this division in an earlier article dedicated to this topic.

Despite its apparent simplicity, this 9+3, flat error typology is sufficient for most real-life applications; finer levels only pose interest in special cases.

  • Stripping the cabbage head down to its core represents the ultimate level of simplicity, where all categories are reduced to a single category-that-embraces-it-all.

In this case we apply a quantitative, sentiment-based scale similar to the scales used for purely holistic factors. On a 0 to 9 scale, 9 means impeccable quality, while 0 indicates either an overwhelming number of various errors (spelling, grammar, broken tags, wrong formatting, etc.) that together create the impression of a quality disaster, or the presence of ultra-serious, showstopper-level errors, each of which makes the content disastrous or unpublishable.

This effort-frugal approach works perfectly for community and crowdsourcing projects with limited funds and/or resources, and also for cases with severe timing limitations. You can find more details and particular recommendations in my earlier article dedicated to this ultimate case.

The Cabbage Head approach eliminates compatibility issues and erases ideological differences between super-quick, community evaluations of atomistic quality and full-scale evaluations based on the full, multi-level MQM+ error typology. Both cases represent opposing ends of the same, scalable atomistic error typology family that covers the full spectrum of our potential needs and comprises four downward-compatible issue catalogues of varying size and complexity. The approach is the same, and the only difference is represented by the level of detail you select.

 

Building optimal 3D, hybrid quality metrics in real life

Now you can create your own 3D, hybrid quality metrics from scratch in minutes using ready building blocks provided in this article. Any 3D quality metric comprises the following components:

  • Three tolerance thresholds that define expectations
  • Two holistic scales for measuring holistic quality
  • An atomistic error typology
  • A severity scale for assigning weights to atomistic-level issues

Just follow the steps below to add/select all of these components.

1.    Set tolerance (PASS) levels for all three 3D quality factors

 

 

Tolerance levels for each quality factor (3 in total) are based on expectations, including:

  • Time
  • Budget
  • Subject area
  • Visibility

For holistic quality factors (Adequacy and Intelligibility) expectations can vary between 0 and 9 (see details in holistic quality scale definitions provided earlier). Atomistic quality is best measured on a 0 – 9 (for the simplest evaluations) or a 0 – 100% scale (see details in subsequent sections). Higher thresholds mean higher expectations in all cases.
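A minimal sketch of such a set of thresholds is shown below; the numbers are purely illustrative placeholders and should be replaced with your own expectations based on time, budget, subject area and visibility.

# Tolerance (PASS) thresholds for the three 3D quality factors.
# The values are illustrative placeholders, not recommended defaults.
tolerance = {
    "adequacy": 7,          # holistic, 0-9 scale
    "intelligibility": 7,   # holistic, 0-9 scale
    "atomistic": 0.70,      # normalized atomistic rating, 0-100%
}

def meets_expectations(scores, thresholds=tolerance):
    # Every measured factor must reach or exceed its threshold to PASS.
    return all(scores[factor] >= limit for factor, limit in thresholds.items())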

2.     Add the provided Logrus IT Holistic Quality Scales for holistic quality evaluations

Logrus IT Holistic Quality Scales make it possible to assess both Intelligibility and Adequacy of each contiguous piece of content with the highest level of objectivity.

3.     Select the required level of detail (error typology) for atomistic quality measurement

Use the Atomistic Error Typology Cabbage Head chart to make your choice. All subsets are backward compatible with the complete MQM+ catalogue and with each other. Below please find some recommendations.

The simplest, singular one-category-embraces-it-all, holistic evaluation

  • Makes the quickest, most economic metrics
  • Does not include categorizing or counting atomistic-level issues at all

Best uses:

  • Community and/or crowdsourced projects
  • Tight budgets or restrictive deadlines

The Optimal level works for most applications

  • Contains 9-12 primary categories, including Locale convention, Terminology, etc. Each is clearly defined, including examples and areas covered
  • It is a good compromise between the zero-detail holistic approach above and full-fledged metrics

This simple, flat error typology structure still makes it possible to adequately present the spectrum, number and severity of atomistic issues encountered. It reveals all major issues and high-level problems and at the same time allows us not to spend excessive time or money on minor things.

More detailed error typologies

Apply these only to the most critical materials with high visibility.

4.     Add the selected Logrus IT Atomistic Quality Error Typology

Open the complete Logrus IT MQM+ Atomistic Quality Error Typology and select only the columns required (Levels 1 through 4).

You will not need to modify the error typology under most circumstances. Still, you can add categories specific to your needs and/or eliminate categories that are not relevant. (Redundant categories can either be removed from the catalogue or assigned zero weights).

5.      Use the Logrus IT Severity Scale for Atomistic Issues provided below

You can use either the full Logrus IT Severity Scale for Atomistic Issues, which includes preferential and minor issues, or concentrate only on higher severities (the short list).

You will not need to modify the severity scale under most circumstances.

Your metric is ready – you are good to go!

Do not forget to adjust tolerance levels based on the results obtained and weak spots revealed.

 

Applying the 3D, hybrid metric you built

Below please find guidelines for quality evaluation using the metric you created or selected.

1.      Evaluate the general Adequacy and Intelligibility of each contiguous text piece. These evaluations typically take less time compared to more in-depth atomistic evaluations.

It is recommended to select the pieces to be reviewed randomly. These pieces need to be big enough to be perceived as contiguous content that makes sense as a whole (sections, groups of paragraphs, small documents or similar segments of audio/video materials), and at the same time small enough that multiple independent evaluations can be produced. The practical recommendation is to select pieces between approximately 150 and 3,000 words.

Ask the reviewers to specify the location of each fragment and assign numeric values for both Adequacy and Intelligibility ratings using the scales described above.

Calculate the average Adequacy and Intelligibility scores across all separate pieces reviewed. If the results fall below the tolerance thresholds you selected for the metric, the review can be stopped, and the content can be returned for rework.
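A short sketch of this step follows, assuming the illustrative thresholds from step 1; the piece locations and scores are hypothetical examples.

# Average the per-piece holistic scores and stop early if either average
# falls below its tolerance threshold (all values here are hypothetical).
pieces = [
    {"location": "Section 2", "adequacy": 8, "intelligibility": 7},
    {"location": "Section 5", "adequacy": 6, "intelligibility": 7},
]

def holistic_averages(reviewed_pieces):
    count = len(reviewed_pieces)
    return {
        "adequacy": sum(p["adequacy"] for p in reviewed_pieces) / count,
        "intelligibility": sum(p["intelligibility"] for p in reviewed_pieces) / count,
    }

averages = holistic_averages(pieces)
if averages["adequacy"] < 7 or averages["intelligibility"] < 7:
    print("Below expectations - return the content for rework")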

2.      Run the Atomistic quality review of the content using the created/selected error typology and severity scale.

Prior to starting manual reviews, make sure that all reviewers are well familiar with the typology and severity scale, so that they categorize issues and assign weights properly. Assign or specify default severities for each type of issue to minimize variance in evaluations by different reviewers.

For each encountered issue, instruct the reviewers to select the appropriate error category from the typology used in the metric and assign appropriate severity using the severity scale. Sometimes they need to specify both the primary category and one or more subcategories.

In case the severity differs from the default value, the reviewer must provide an explanation for selecting a different severity (e.g. the issue is minor, but its location is extremely conspicuous). Recommendations and preferential issues have zero weights and should not affect the overall rating.

It is not sufficient to simply mark the issue and categorize it. For each issue reviewers also need to specify location, source segment, the translation that needs to be changed, the correction they are suggesting, and the reason for change.

Remember that people who do not speak the target language may need to review the results. Remind the reviewers to enter all comments in the requested language (most often English), except places where examples in the target language are essential for understanding or illustrating the problem.

If one and the same issue is encountered more than once (i.e. it is repetitive), it does not necessarily have to be listed each time. If the reviewer does list the same error more than once, it needs to be marked as a duplicate. In most cases it makes sense to count all duplicates as a single error, because they can all be eliminated by adding a single instruction for the translator.
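To keep the logged information consistent across reviewers, each issue record could look roughly like the sketch below; the field names and all values are hypothetical examples (reusing the date format case discussed earlier), not prescribed names.

# A hypothetical record for a single logged issue with the fields described
# above; field names and values are illustrative only.
issue = {
    "category": "Locale convention",
    "subcategory": "Date format",
    "severity": "medium",              # default severity for this issue type
    "severity_override_reason": None,  # required only when deviating from the default
    "location": "string 1043",
    "source": "Release date: 05/04/2025",
    "target_text": "Date de sortie : 05/04/2025",
    "suggested_correction": "Date de sortie : 04/05/2025",
    "reason_for_change": "Month and day were not rearranged for the target locale",
    "duplicate_of": None,              # reference to the first occurrence, if repeated
    "comment_language": "English",     # so that non-target-language readers can follow
}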

IMPORTANT.  Showstoppers need to be counted and described under any circumstances, because they affect overall perceived quality in a major way. In all “regular” error typologies this is done in a natural way, i.e. a special Showstopper severity level is assigned to all such issues. This severity level automatically makes the overall score unacceptable. At the same time, the simplest error typology contains a single category, issues are not counted, and weights are not assigned. To close this gap, for the single-category (singular) error typology only we not only evaluate Adequacy, Intelligibility and Atomistic quality, but we also count and describe showstoppers.

3.      Calculate the Atomistic quality rating.

At this moment, you have a complete registry of atomistic quality issues, including the category, subcategories and severity (weight) for each of them. The raw error rating simply equals the sum of all weights assigned to discovered issues divided by the total number of reviewed words: R = (ΣWi) / V, where Wi represents the weight of each discovered issue, and V is the volume in words. A zero raw rating reflects zero logged issues. The higher this rating, the worse the quality.

To make the atomistic error rating more user-friendly we need to normalize it:

  • The rating should fall within a predefined range, like 0 to 9 or 0 to 100%
  • Higher ratings need to reflect higher quality, like elsewhere

I can suggest a simple formula for normalizing the rating that we use at Logrus IT:

A = 1 – (R * 20), where A is the adjusted error rating.

In the ideal case A = 1 (100%); if there are so many errors in the content that A goes below zero, just assign a zero rating (A = 0).

EXAMPLE.  We review a 1,000-word text.

  • We find a single medium-severity issue (weight = 2, see the Severity scale). In this case R = 0.002, and A = 0.96 (96%), which is generally a very good result
  • In case the text contains 10 medium-severity issues (one per 100 words), A = 0.6 (60%), which sounds more borderline.
  • If we find a single showstopper instead, A immediately drops below zero, and we assign a zero adjusted error rating. To make sure a single showstopper error still renders the content unacceptable even for large volumes, you can assign a higher weight to it, like 1000.
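The calculation can be expressed in a few lines; this sketch simply reproduces the example above, assuming a medium-severity weight of 2 and a showstopper weight of 1000 as suggested.

# Raw and adjusted atomistic ratings: R = (sum of Wi) / V, A = 1 - R * 20,
# clamped at zero. The calls below reproduce the 1,000-word example above.
def adjusted_rating(issue_weights, word_count):
    raw = sum(issue_weights) / word_count
    return max(1 - raw * 20, 0.0)

print(adjusted_rating([2], 1000))        # ~0.96 -> 96%, one medium-severity issue
print(adjusted_rating([2] * 10, 1000))   # ~0.60 -> 60%, ten medium-severity issues
print(adjusted_rating([1000], 1000))     # 0.0   -> a single showstopper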

Reasonable quality expectations for adjusted atomistic quality for web content and documentation are above 65-70%, which is equivalent to less than one medium-severity error (or two minor errors) per 100 words. For firmware, software and other highly exposed content this threshold needs to be raised.

 

Summary

This article completes the 3D, hybrid approach to measuring translation quality series. It introduced:

  • The Cabbage Head approach to building truly scalable, backward compatible atomistic quality metrics covering the full spectrum of human needs, from the simplest to the most sophisticated ones
  • Required modifications to the MQM error typology that transform it into a fully scalable MQM+ typology, and the actual complete MQM+ issue catalogue that can be used for your metrics
  • Detailed step-by-step instructions on building your own translation quality metrics from scratch, including advice on setting tolerance thresholds and calculating atomistic quality ratings

The article also provided you with a full set of building blocks to create your own customized, scalable 3D hybrid translation quality metrics in no time, including:

  • The Logrus IT Holistic Quality Scales for Adequacy and Intelligibility
  • The complete, scalable Logrus IT MQM+ Atomistic Quality Error Typology with its nested levels of detail
  • The Logrus IT Severity Scale for Atomistic Issues
  • Recommendations on setting tolerance thresholds and calculating quality ratings

Everyone is welcome to use the material presented in this article for their own purposes, including business ones. The only requirement is to always provide a clear and direct reference to the author, along with links to my work and my company, Logrus IT, when describing or specifying the origin of the methodology and/or metric building blocks.

 
