While topics like quality issue classification or particular quality metrics have attracted plenty of attention for years, the methodology of Language Quality Assurance (LQA) has received significantly less focus. In other words, much has been said about concrete issues and metrics and very little about how to build these metrics: what is essential, which misconceptions are typical, and which errors we commonly make.
This paper addresses the abovementioned cornerstone question and discusses general principles of building LQA metrics. It also explains why language quality measurement cannot be reduced to a single rating assigned to the text (however convenient it appears on paper) without significant distortion, proposing, instead, the Quality Triangle methodology.
The methodology relies on three independent quality apexes: Holistic Readability, Holistic Adequacy, and Atomistic Quality.
General Considerations for Building an LQA Model
When talking about general principles of building LQA models, it makes sense to start with a list of essential expectations that the methodology needs to address. In my opinion, this list includes at least the following four line items:
- Reflecting the perception and priorities of the target audience.
  - Concentrating on factors producing the strongest impression and
  - Separating global (holistic) and local issues, understanding that the former are typically more important and play a bigger role
- Universal applicability
  - Covering the whole spectrum of potential uses, subject areas, and materials
    - From slightly post-edited MT to ultra-polished manual translations
  - Common approach
    - Same approach to knowledge bases and marketing leaflets
    - Only adjusting acceptance criteria/thresholds based on expectations
This one sounds simple, but is not always easily accepted in practice. The truth is, we are all humans and, irrespective of what exactly we are looking at, be it a restaurant menu or drug usage guidelines, we are making our first judgment about text quality using exactly the same criteria. We do not need a different approach or a completely new metric for each subject area or type of content. In reality, the only thing that requires adjustment is tolerance level. We are ready to accept a barely comprehensible menu translation, but expect perfect clarity and lack of ambiguity in the medical area. In technical terms, this means that we are still measuring the same thing, i.e. readability/clarity, but with different expectations, and this approach applies to all other criteria.
- Viability of the methodology
  - Clear, not overly complicated
  - Process-oriented, i.e. reasonably economic and applicable in the real world
- Flexibility of approach
  - Concentrating on methodology rather than particular cases/uses
  - Criteria/issue classification is not an inalienable part of the methodology, but rather an add-on component. It can be taken from elsewhere, for instance:
    - Based on MQM or other public source
    - Based on legacy criteria used/provided by the client
  - Weights assigned to particular issues are expected to vary within a wide range depending on the goals set, subject matter, type of material, etc. Certain issues might simply prove irrelevant for the job or area of focus, which results in zero weights being assigned to these issues.
What’s Most Important and Where to Start in Real Life
Quality Cornerstones: Holistic Adequacy and Readability
The idea that the two major criteria for any translation, irrespective of its origin, target audience, brand impact, etc., are holistic Adequacy and Readability (Fluency) goes back to the 1960s and the venerable ALPAC report, Computers in Translation and Linguistics (1966). Expecting translated content to be readable and to convey the meaning of the original adequately is an absolutely essential requirement, independent of the subject area, type of material, or anything else, for that matter. One simply doesn’t need to go through various criteria, sophisticated technical details or error counts if the text is either unreadable (incomprehensible) or inadequate (inaccurate).
While it is hard to dispute the universal nature of readability and adequacy, expectations as such might vary dramatically for both of these factors. These expectations are reflected in an answer to a different question: HOW readable or adequate the translated content should be.
Defining Readability and Adequacy
Without pretending to present something that is carved in stone or perfect, I would like to define both fundamental notions of readability and adequacy based on their intuitive meaning, and keep these definitions as concise and simple as possible. We will need them to better define the model.
Readability of translation defines how easy it is to read and understand the target text (sentence, string, etc.). In other words, readability measures how clear, unambiguous and well-written the target text is. Zero readability means that the text is unreadable or incomprehensible, i.e. it represents a senseless sequence of words. Perfect readability means that you can easily read the text without stumbling over words or phrases, the meaning of the text is absolutely clear and unambiguous, and no additional pondering is required to get to this meaning.
Readability goes far beyond translation quality as such, and formally applies to any text, including monolingual text in any language.
Important! The mere fact that you can easily read and comprehend the text does not mean that its meaning is correct. That is why we also need to measure adequacy.
Adequacy of translation defines how closely the translation follows the meaning of and the message conveyed by the source text (sentence, string, etc.). In other words, adequacy measures whether the translation process resulted in any discrepancies between source and target texts (except for intended ones), including plain translation mistakes, omissions or additions. Zero adequacy means that after translation the meaning of the source text was distorted beyond recognition. Perfect adequacy requires that both the meaning of and message conveyed by the source text were preserved in their entirety, without any deviations.
Adequacy only applies to a combination of source and target texts (units, strings, sentences, etc.), so you will need a bilingual resource to assess it. Analyzing the translation alone would reveal a certain number of potential issues, but they cannot be verified without access to the source, and it would miss altogether the cases where the translated text is smooth enough, but incorrect.
Important! It is worth noting that neither of the fundamental quality concepts described above comprises sub-concepts. Each is a standalone, “elementary” semi-objective quality factor.
Emphasis on Holistic Assessment
It is important to reiterate that we are talking about global content characteristics dealing with the perception of translated text (piece of software, website, leaflet, etc.) as a whole. That is, any potential reader/consumer is primarily interested in holistic readability and adequacy of the whole piece, and only then in readability or adequacy of particular sentences (units). The latter is, of course, important as such, but not at a high level. It’s one of the universal laws of nature: The whole is always more important to us than its constituents, and its properties can’t be fully revealed or described based on these parts alone.
There is a direct material analogy here: When I need a hammer, I am primarily concerned with whether the object in question resembles a hammer and can be used as one, not concentrating on various manufacturing imperfections.
Below, please find some more translation-related reasons:
- Natural human perception. A translated tourist guide article on local restaurants can serve as a good illustration of holistic adequacy and readability dominating everything else.
As long as the piece provides necessary information about local landmarks, it is much better than the complete lack thereof in a foreign country with an unfamiliar language, even if translations are imperfect and some pieces are incomprehensible. Particular mistranslations (like recommending the “cancer shakes”, meaning “crabmeat”, at a certain restaurant in St. Petersburg, Russia) would certainly be considered errors, but local ones, limited to a particular part of the text and not seriously affecting the perception of the document as a whole.
If the overall number of errors is not excessive, we would probably come up with a conclusion like this one: “The translation is useful and relatively acceptable, but there is a certain number of errors that need to be fixed to improve overall perception”. On the other hand, if one couldn’t make sense of the text as a whole, particular mishaps would no longer be as essential, because the high-level diagnosis would be quite different: “The translation is incomprehensible. It does not make sense to discuss individual errors because the whole piece requires complete retranslation from scratch.”
- Patchy, out-of-context translation of small standalone pieces is a fact of life in the modern world, and it’s becoming more and more ubiquitous. Separate translated segments might look perfect by themselves, but taken together they often create clumsy, contradictory, incongruous texts.
- The “cunning translator” phenomenon.
Anyone with experience in the industry has observed it multiple times. When translators do not completely understand the source text due to limited subject knowledge or some other problem (this, regrettably, is not an exception given the number of subject areas), they often try to make translated sentences (segments) as “round”, ambiguous and fluent as possible. Each particular sentence sounds nice and might pass the editing/proofing stage without corrections when dealt with on a sentence-by-sentence basis (that’s exactly what they are counting on). But taken together, the piece makes no sense whatsoever.
One important consequence: Quality assurance cannot be complete or accurate if there is no way of making holistic evaluations. There is a certain similarity here with trying to make a judgment about the object based on its atomic (molecular) structure alone: Even if it’s just iron, that doesn’t tell us anything about its state (liquid or solid) or the object’s shape (does it look like a hammer?). That’s exactly the reason why, for instance, simply going through software strings or sentences one by one is absolutely insufficient to make general conclusions about overall usability or quality of a software product or a web portal. On the other hand, looking at screen shots representing essential parts of the functionality would result in a much more accurate overall evaluation.
Semi-Objective Nature of Holistic Criteria, and How to Deal with It
Neither of the two major translation quality criteria is completely objective. In an ideal world, where we could hire a whole expert panel to assess each piece, there would still be an inevitable variation in ratings assigned to one and the same text by individual panel members. This would happen even if all of these people had similar backgrounds, underwent similar training, and used the same reference materials, guidelines and instructions. There is an unavoidable tint of subjectivity to these assessments.
At the same time, ratings produced by the panel would not be completely arbitrary either. In reality, they typically produce a normal opinion curve around the average rating for each document rather than white noise.
A couple of real-life examples are presented in the charts below. As many as 17 professional translators rated one and the same translated portal, assessing overall holistic readability and adequacy of translation. The horizontal axis represents rating values (on a scale between 0 and 10), and the vertical axis reflects the number of reviewers who came up with each particular rating. Larger numbers of reviewers produce more reliable statistical results that follow the normal distribution more closely.
Fig. 1. Readability and Adequacy Sample Charts
Even in this case, with relatively few evaluators (by statistical standards), standard deviations in ratings are reasonable (much smaller than average values), which serves as additional evidence that these holistic ratings are far from arbitrary. That is why I call both criteria (and associated grades) semi-objective, and we always need to remember that they are NOT very precise by design.
The question is, how can we deal with this lack of complete objectivity in a real-world scenario, when no reference translations are available, there is a single reviewer who can only look at a certain percentage of the overall content, and we still need to evaluate and grade translated texts?
The solution lies in evaluating each of the two major holistic criteria (readability and adequacy) separately on a PASS/FAIL basis.
Why? Look at the charts again! Your particular reviewer could either be good-natured and relatively tolerant, and would assign a near-perfect rating to the text (content), or turn out to be extremely fastidious or simply in bad humor on that day… The very same content could potentially get a rating anywhere under the dome of the Gaussian curve. And, in the real world, we do not even know up front how demanding this or that reviewer is…
All of the above considerations are true, but they do not make the whole situation hopeless. Being realistic and judging based on existing statistical data, we can more or less safely assume that extremely low or high ratings deviating very far from the median value are a rarity (bad luck). In the majority of cases where reviewers are qualified enough, unbiased and trained properly, actual holistic ratings for both readability and adequacy will not be completely uniform, but will rather be concentrated within a limited range around the average value.
Thus, the logical thing to do is establish an acceptance threshold that would correspond to the lower end of the statistical range described above. A rating below that threshold would mean that the translation is unsatisfactory for our needs.
That way, we can take into account the natural and unavoidable variance in semi-objective ratings. The range above the threshold would accommodate the majority of potential expert opinions (the bulk of the normal distribution curve). In other words, when setting expectations relatively high, for instance around 8 out of 10 on both scales, we should keep in mind the variance in ratings assigned by reviewers: A substantial share of reviewers might rate a text that “deserves” an 8 as a 7 or even a 6.
Set the threshold too high, and you run the risk of failing a significant share of good translations just because the reviewer in that particular case was too strict or didn’t get all instructions/guidelines. (The risk of accepting a number of so-so translations just because the reviewer was lax is always there, irrespective of the approach).
Fig. 2. Acceptance Threshold Illustration
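To make the threshold idea more tangible, here is a minimal Python sketch. It assumes that "the lower end of the statistical range" can be approximated as the panel mean minus k sample standard deviations. The function name `acceptance_threshold`, the choice k = 2 and the sample ratings are my illustrative assumptions, not values taken from the charts above.

```python
from statistics import mean, stdev


def acceptance_threshold(panel_ratings, k=2.0, scale_min=0.0):
    """Approximate the lower end of the expected rating range.

    Assumption: the range is modelled as mean - k * (sample standard
    deviation), clamped so it never drops below the bottom of the scale.
    """
    centre = mean(panel_ratings)
    spread = stdev(panel_ratings)
    return max(scale_min, centre - k * spread)


# Hypothetical panel ratings on a 0-10 scale (invented for illustration only).
ratings = [8, 7, 9, 8, 6, 8, 7, 9, 8, 8]
threshold = acceptance_threshold(ratings)

print(f"mean={mean(ratings):.2f} stdev={stdev(ratings):.2f} "
      f"threshold={threshold:.2f}")
```

A larger k makes the threshold more forgiving of strict reviewers at the price of letting more mediocre texts through, which is exactly the trade-off discussed above.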
One important and direct consequence of this approach is that the scale used for holistic translation ratings should be at least between 0 and 10, and by no means smaller. Otherwise, it will prove too narrow to accommodate the real-life rating variance (the Gaussian curve will simply not fit), and there will be no choice left for setting acceptance thresholds depending on the requirements, subject matter, specifics, etc.
Assuming that a 10-point scale is used (0 meaning unreadable/completely inadequate and 10 meaning a perfect text), we can consider the following typical scenarios:
- For a marketing text, one would expect to have acceptable grades between 8 and 10. We do allow some variance, but any reviewer needs to consider the content well-translated (even though it could be either an 8 or a perfect 10).
- For a knowledge base, our requirements would become considerably more moderate. It is normal to set the acceptance threshold at 5.
- In each scenario, the acceptance threshold is defined by the area, visibility of materials, time constraints, target audience, etc.
It is important to understand that each of these two major criteria should be evaluated separately:
- Accurate but hardly readable texts are as useless as fluent but inadequate ones.
- One can’t simply summarize or combine these two factors by any means – these are two independent “coordinates on a holistic quality plane”. Depending on the circumstances, one might easily have different and independent expectations for each of the holistic criteria. For instance, we might tolerate marginal readability, but still expect very high adequacy for sophisticated technical content targeted at experts only.
For the reasons stated above, it is simply impossible to combine readability and adequacy requirements into a single unified criterion or formula. Any such attempt results in significant quality assessment distortions.
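The distortion is easy to demonstrate with a small Python sketch. The class and function names, the thresholds and the ratings below are hypothetical and chosen purely for illustration; averaging is just one possible way of combining the two ratings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HolisticRatings:
    readability: float  # 0 = unreadable, 10 = perfect
    adequacy: float     # 0 = meaning lost, 10 = fully preserved


def passes_separately(r, min_readability, min_adequacy):
    """Each holistic criterion is checked against its own threshold."""
    return r.readability >= min_readability and r.adequacy >= min_adequacy


def passes_averaged(r, min_combined):
    """The single-number shortcut that the text argues against."""
    return (r.readability + r.adequacy) / 2 >= min_combined


# A fluent but inaccurate text (hypothetical ratings).
fluent_but_wrong = HolisticRatings(readability=9, adequacy=3)

print(passes_separately(fluent_but_wrong, 6, 6))  # False
print(passes_averaged(fluent_but_wrong, 6))       # True
```

The averaged score lets a seriously inaccurate text through simply because it reads well, while the two separate checks catch it.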
Closing the Quality Triangle: Atomistic Quality
First of all, let me say that the term “atomistic” might sound somewhat out of place, but, in this context, it serves as an opposite to holistic. It describes all quality issues encountered at and limited to the “atomic” level of the content, i.e. sentences, strings, translation units, etc. These issues are numerous and include such things as incorrect or inconsistent terminology, style guide deviations, incorrect formatting, broken tags or missing placeholders, etc. One can find an example of a comprehensive issue framework that includes both classification and definitions in the MQM document.
Continuing the material analogy, at the atomistic level we no longer consider the hammer’s shape or basic functionality, but rather concentrate on alloy structure and purity, handle quality, etc. This analysis complements holistic usability/quality evaluation and makes it possible to answer questions about the tool’s potential durability and internal structural defects, disposition towards rusting, and other such things.
Besides having an essentially local nature, atomistic quality issues, unlike holistic ones, are mostly objective, because cumulative atomistic quality ratings are expected to be very similar irrespective of the reviewer’s personality. A typo is still a typo, an error in country standards is still an error, and all reviewers will notice and classify all such issues in a similar way, given proper training and background. Everything depends on issue classification and the weighting system applied. The only potential sources of discrepancy are attention lapses on the part of the reviewer or minor differences in assessing readability or adequacy of particular translated sentences (units).
There is a price for achieving objectivity in atomistic quality evaluation. Atomistic LQA results will only be uniform if reviewers are professionals in the area and were specially trained for the job (leaving emotions aside and adhering to guidelines), issue classification is comprehensive and clear, and all ancillary materials are provided, including a complete and approved terminology glossary, style guides, special requirements, etc. As soon as any of these components is missing, evaluation objectivity starts to vanish. This is typically the case, for instance, when client representatives who know the language start reviewing translations without access to glossaries, guidelines and TMs…
For professional LQAs, issue categories can be based on MQM or another public source, on legacy client-sourced criteria, or on other criteria. The resulting atomistic quality rating for any text (content) is calculated using a simple and straightforward approach. For each issue category/type, the number of issues found is multiplied by the relative weight assigned to that issue type. The resulting values are summed across all issue types, and the sum is then divided by the number of words reviewed for normalization purposes: $Q_A = \frac{\sum_i N_i W_i}{V}$, where $Q_A$ is the atomistic quality rating, $N_i$ is the number of issues of type $i$ found in the text, $W_i$ is the relative weight assigned to this type of issue, and $V$ is the volume reviewed (typically in Kwords, i.e. thousands of words).
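The calculation above can be sketched in a few lines of Python. The function name, the issue types, the weights and the counts are hypothetical. Treating an issue type without a listed weight as having zero weight is my assumption, in line with the idea that irrelevant issues get zero weights.

```python
def atomistic_rating(issue_counts, weights, words_reviewed):
    """Weighted issue count per thousand words reviewed.

    issue_counts:   {issue type: number of issues found}   (N_i)
    weights:        {issue type: relative weight}          (W_i)
    words_reviewed: number of words in the reviewed sample
    """
    if words_reviewed <= 0:
        raise ValueError("words_reviewed must be positive")
    volume_kwords = words_reviewed / 1000  # V, in thousands of words
    weighted_sum = sum(
        count * weights.get(issue, 0.0)  # unlisted types count as zero weight
        for issue, count in issue_counts.items()
    )
    return weighted_sum / volume_kwords


# Hypothetical review of a 2,500-word sample.
counts = {"terminology": 4, "typo": 6, "locale_convention": 2, "style": 5}
weights = {"terminology": 3.0, "typo": 1.0, "locale_convention": 2.0, "style": 0.0}

rating = atomistic_rating(counts, weights, words_reviewed=2500)
print(round(rating, 2))  # (4*3 + 6*1 + 2*2 + 5*0) / 2.5 = 8.8
```

Note that this number is a weighted issue density, so in this sketch a lower value means better atomistic quality.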
The Fourth Apex: Showstopper Problems
In some cases, it makes sense to add a fourth quality dimension to the three described earlier. Showstopper problems are discovered at the atomistic level, but stand out due to their overall impact. This subcategory comprises issues that could result in dramatic distortions in the text meaning, serious factual errors or political incorrectness, use of pejorative text, etc. The issues as such can belong to any branch in the issue classification, but need to be treated separately, with utmost attention, because they could seriously and negatively affect overall user perception and/or result in incorrect user actions.
In technical terms, relative weights of showstopper problems are determined not by their category within the issue framework, but by their negative impact, and that’s exactly the reason why this additional quality criterion is required. For example, typos typically have a relatively low weight in any translation quality metric compared to other issues, such as country standards violations, tag corruption, etc. But a typo in a major headline on a news website, especially the one that can result in complete distortion of the meaning, would definitely need to be treated in a completely different way.
Under normal conditions, no showstopper errors are allowed in any content, irrespective of the volume. These need to be eliminated at the editing/reviewing stage before publication.
Building the Quality Assurance Metric
The universal methodology suggested in this paper makes it possible to build any number of metrics suitable for various subject areas, types of content and expectation levels with minimal adjustments.
The whole process of creating a metric looks like this:
1. Select the quality model type (Quality Triangle or Quality Square) depending on your needs.
Fig. 3. Quality Triangle and Square Model Illustration
2. Select issue classification/framework for measuring atomistic quality (for instance, MQM).
3. Select acceptance thresholds for the two holistic factors (adequacy and readability) and one for the atomistic quality, weights for each issue within the framework, and the scale for the ratings (such as 0 to 10, 0 to 100, etc.). The choice depends on multiple factors, including market segment, subject area, type of material, target audience, time limitations, brand impact, overall expectations, etc.
One can find multiple examples that are publicly available. Typically, all issues are divided into three to four severity categories, and the same weight is used for all issues belonging to a certain severity category. For example, a serious error can be two or three times more “important” than a minor one, a major error might be two times more important than a serious one, etc.
Each set of acceptance thresholds and issue weights represents a so-called “quality vector” that completely describes and defines our expectations in each case. You can create
- One quality vector for marketing content (extremely high expectations in all areas, limited number of potential issues),
- One for software (high expectations, bigger number of issues, including software-specific ones),
- One more for user assistance and web content (reasonable expectations, adding content-specific issues),
- An additional one for knowledge bases and user forums (very limited expectations), etc.
A limited number of quality vectors will easily cover the whole spectrum of translated materials. As long as the chosen model and issue classification are the same in each case (which is highly recommended), quality vectors will be the only components differentiating one metric from another.
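One possible way to represent quality vectors in code is sketched below. The `QualityVector` class and its field names are hypothetical. The holistic thresholds of 8 and 5 follow the marketing and knowledge base scenarios discussed earlier, while the atomistic thresholds are invented placeholders, and the severity weights only mirror the rough "two times more important" proportions mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class QualityVector:
    """Acceptance thresholds plus issue weights for one type of content."""
    name: str
    min_readability: float  # holistic threshold on the rating scale
    min_adequacy: float     # holistic threshold on the rating scale
    max_atomistic: float    # assumption: weighted issues per Kword, lower is better
    weights: Dict[str, float] = field(default_factory=dict)


# Assumption: one weight per severity category, shared by all vectors.
SEVERITY_WEIGHTS = {"minor": 1.0, "serious": 2.0, "major": 4.0}

VECTORS = {
    "marketing": QualityVector("marketing", 8, 8, 2.0, SEVERITY_WEIGHTS),
    "knowledge_base": QualityVector("knowledge_base", 5, 5, 10.0, SEVERITY_WEIGHTS),
}

for vector in VECTORS.values():
    print(vector.name, vector.min_readability, vector.min_adequacy,
          vector.max_atomistic)
```

The model and the issue classification stay the same here, so switching from one type of content to another only means picking a different entry from `VECTORS`.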
The methodology fully covers all types of translated content, including those produced using MT and/or MT + post-editing.
Applying the Created Language Quality Metric
Schematically, the LQA process based on the metric created will include the following:
- Selecting the quality vector prepared earlier as part of the metric and applicable for the selected type of content.
- Applying semi-objective holistic quality criteria on a pass/fail basis.
- The assessment as such is relatively quick and economic.
- Often scanning through the text is sufficient, especially so when quality is really low.
- Each text receives two separate quality ratings (readability and adequacy) that are compared to acceptance thresholds defined earlier as part of the quality vector.
- Failing texts are sent back for improvement or retranslation, saving the unnecessary effort of logging and counting atomistic-level errors.
Important. As mentioned earlier, these two quality ratings cannot be combined into a single integral factor because they deal with two content characteristics that are not only independent, but also not very precise by definition.
Important. One needs to be bilingual or have a bilingual expert ready just in case.
- Applying objective atomistic quality criteria. Only content that passes on both holistic accounts is further analyzed for technical imperfections, which removes the unnecessary workload of marking numerous errors in already disqualified texts.
- Each text gets the atomistic quality rating calculated based on the number of issues of each type found and their relative weights.
- The atomistic quality rating is compared to the acceptance threshold defined earlier as part of the quality vector.
- The complete list of issues can be very long and detailed, but not all of them are taken into account. Some issues are irrelevant within the context and have zero weight.
- Showstopper errors are counted separately if required. Their presence typically means that the text failed the QA and needs to be fixed before publishing. Alternatively, these errors can be counted as regular atomistic errors, but with a higher relative weight.
Important. Each QA result includes three independent quality ratings (or four if we also separate showstopper errors) that present a complete, 3D “lossless” quality picture. Semi-objective, holistic ratings cannot be combined with the atomistic one due to their incompatible nature. Any formula combining these ratings would produce a highly unstable result that depends too much on the reviewer’s personality due to the natural variance in holistic quality evaluations.
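The process described above can be summarized in a short Python sketch. All names in it (`run_lqa`, `LqaResult`, the threshold keys, the verdict strings) are hypothetical, and it assumes that the atomistic rating is a weighted issue density per Kword where lower is better. The `log_issues` callback stands for the detailed review and is only invoked when the text passes both holistic checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LqaResult:
    readability: float
    adequacy: float
    atomistic: Optional[float]   # None if the atomistic review was skipped
    showstoppers: Optional[int]  # None if the atomistic review was skipped
    verdict: str


def run_lqa(readability, adequacy, thresholds, weights, log_issues, words_reviewed):
    # Step 1: holistic pass/fail gate, each criterion checked on its own.
    if (readability < thresholds["readability"]
            or adequacy < thresholds["adequacy"]):
        return LqaResult(readability, adequacy, None, None, "fail: holistic")

    # Step 2: detailed issue logging, done only for texts that passed the gate.
    issue_counts, showstoppers = log_issues()
    weighted = sum(n * weights.get(issue, 0.0) for issue, n in issue_counts.items())
    atomistic = weighted / (words_reviewed / 1000)

    # Step 3: showstoppers fail the text regardless of the atomistic rating.
    if showstoppers > 0:
        verdict = "fail: showstopper"
    elif atomistic > thresholds["atomistic"]:
        verdict = "fail: atomistic"
    else:
        verdict = "pass"
    return LqaResult(readability, adequacy, atomistic, showstoppers, verdict)


# Hypothetical marketing-style expectations and one reviewed sample.
thresholds = {"readability": 8, "adequacy": 8, "atomistic": 5.0}
weights = {"minor": 1.0, "serious": 2.0, "major": 4.0}

result = run_lqa(9, 8, thresholds, weights,
                 log_issues=lambda: ({"minor": 3, "major": 1}, 0),
                 words_reviewed=2000)
print(result.verdict, result.atomistic)  # pass 3.5
```

The result keeps the ratings side by side instead of merging them, so the verdict can always be traced back to the criterion that caused it.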
In Part II, I will present a simplified quality metric built using the Quality Square methodology and also the accompanying process developed specifically for LQAs carried out through crowdsourcing. The whole approach is illustrated by the results of a real-life quality assurance project carried out using this metric and process.
Computers in Translation and Linguistics, a report by the Automatic Language Processing Advisory Committee (ALPAC), Publication 1416, National Academy of Sciences, National Research Council, Washington, D.C., 1966
Multidimensional Quality Metrics project (Primary contact: Dr. Aljoscha Burchardt, DFKI GmbH)
Multidimensional Quality Metrics (MQM) Definition (Editors: Arle Lommel, Aljoscha Burchardt, Hans Uszkoreit; Contributors: Kim Harris, Alan K. Melby, Attila Görög, Serge Gladkoff, Leonid Glazychev)
Multidimensional Quality Metrics (MQM) Issue Types (Editors: Arle Lommel, Aljoscha Burchardt, Attila Görög, Hans Uszkoreit, Alan K. Melby; Contributors: Serge Gladkoff, Leonid Glazychev, Kim Harris, Dale Schultz, Jean-François Vanreusel)