Sharing is Caring: Introducing a Universal Holistic Translation Quality Metric
By Leonid Glazychev, Logrus IT’s CEO
It’s been a while since I published a series of articles on language quality assurance (LQA). These articles introduced the universal, hybrid 3D Quality Triangle (Quality Square) methodology, which is based on three cornerstones: holistic adequacy, holistic intelligibility (also called readability in earlier papers), and atomistic quality. The articles discussed both the approach itself and its applications. These included building LQA metrics, carrying out “traditional” LQA for technical translations, running quick, holistic LQAs using crowdsourcing, and quality assurance for computer games and other creative content.
I am grateful for the interest in these publications, as well as for the questions and feedback that followed. Both the public and some of our clients seemed to share a common line of inquiry: What’s next? What sort of concrete metric lies beyond the conceptual approach (methodology)? How exactly do we go about measuring the holistic quality of all contiguous content (intelligibility and adequacy)?
These questions raise a fair point: Only a well-defined metric allows us to move from highly subjective quality evaluations to more predictable and uniform ones, thereby minimizing the effect of subjective variables like a reviewer’s unique taste and personality. There are, at present, no such holistic metrics available to the public in any capacity.
So, why not take our ready-to-use metric public? From a business perspective, making the metric accessible to everyone seemingly removes the ground from underneath the hard-working folks who developed the concept and metric in its entirety and who currently provide paid LQA services curated around this unique approach. After putting our heads together, the team at Logrus IT decided that the advantages of making the metric available to the community far outweigh the negatives.
Brief Quality Metric History (or the lack thereof)
Numerous publications that offer or describe quality frameworks concentrate on quality issue typology and/or tools for counting atomistic (string/segment-level) errors. At the same time, they usually pay little or no attention to presenting or discussing the principles of building quality metrics. This issue is typically left beyond their scope. For instance, the most detailed and well-crafted MQM framework only suggests a very basic, static metric (without recommending it) that has limited potential applications and raises a number of questions. The TAUS DQF error typology has been harmonized with MQM in the last couple of years, but publicly available materials do not discuss quality metrics, let alone the principles of creating them.
The situation has not improved with the increasing number of proprietary quality metrics developed by the industry’s big players over the years. Some of these metrics are rather elaborate, but still not publicly available, often protected by NDAs, and frequently heavily customized to the company’s needs.
At present, no publications address holistic quality measurements that are critical to human perception and indispensable for quickly evaluating the quality of large volumes of translated content.
At the same time, the need for a universal, flexible quality metric has grown steadily over the years with the multiplying volumes of translations and the global penetration of MT. It is our hope that this model will be able to replace, at least partially, the numerous uncoordinated attempts at re-inventing the wheel at a time when the industry’s expectations have significantly outgrown older models, including the venerable LISA quality model. Until now, we were facing a big void in the public domain.
This article represents a humble attempt at partially filling this void and presenting a simple, clear and fully defined holistic quality metric.
The metric is built around two cornerstones:
- The original Quality Triangle methodology
- Detailed scales making it possible to measure both holistic adequacy and intelligibility of translated materials
The Holistic Quality Metric – Basics
Let me start with a couple of non-rigorous, but otherwise clear and concise definitions that will refresh your memory:
- Holistic evaluations treat the content as a whole and are aimed at evaluating the general impression for each category
- Adequacy of translation measures how closely the translation adheres to the meaning and message conveyed by the source text
- Intelligibility of translation tells us how easily target content is read (watched, heard) and understood. (Intelligibility is a more universal term than readability, which was used earlier, because it also applies to multimedia content. If it sounds too long or too formal, you can also call it clarity.)
To learn more about holistic quality assessments and their importance, as well as the 3D nature of translation quality, I humbly direct you towards my article in Linguistics and Literature Studies Vol. 5(2) or my previously published insights on Building Language Quality Assurance Models (Part I and Part II).
A holistic quality metric is both simpler and more complex than an atomistic one.
Each atomistic quality metric needs to utilize a detailed error typology (which can be quite extensive) and a system of weights assigned to each issue category from this typology. A holistic metric only requires two things: a well-defined scale that is easy enough to use and a threshold that effectively segregates acceptable and sub-standard content.
Thresholds are fully dependent on content type, area, market, project, etc. Scales require a more intricate approach than atomistic issue definitions or weights.
Let’s say, for instance, that we use a 10-point scale (between 0 and 9) to evaluate both holistic intelligibility and adequacy. To simply state that the highest value on the scale denotes a perfect text and the lowest value corresponds to an abhorrent one would be insufficient. A more detailed scale that clearly qualifies the range of hues between “the best” and “the worst” is a necessity, especially given the inevitable subjectivity of holistic quality factors.
In order to develop a viable quality metric, each point on the scale between 0 and 9 must be clearly defined to ensure that the person reading (watching, listening to) and evaluating translated content can easily formulate a response to the following questions:
- What does an intelligibility or adequacy score of 7 mean for a particular text?
- How is a score of 7 different from a 6 or an 8?
The next section describes these scales in detail.
Scales for assessing primary holistic factors: Intelligibility and Adequacy
Both of the scales presented below provide a way to move from subjective, emotion-based holistic quality evaluations to evaluations based on a well-defined spectrum of reference points. These points serve as beacons, illuminating the previously uncharted waters between flawless and worthless content for reliable navigation.
Both scales were developed based on principles outlined within the venerable ALPAC publication. Currently, both scales are part of Logrus IT’s 3D, hybrid translation quality metric family.
Holistic Intelligibility (Clarity) Scale
Holistic Adequacy Scale
Applying the metric
This is a holistic metric. Its goal is to provide a quick evaluation of translation quality based on factors most important for general human perception, i.e. holistic intelligibility of translated content and its adequacy (accurately conveying the meaning and message of the source).
The metric is only applicable to reasonably sized, contiguous pieces of content like web pages, articles, sections or large paragraphs within a document, standalone documents, video clips, games, audio books, etc. It cannot be used for materials like software strings or voice prompts, where separate small segments are not related to each other and are not expected to produce a general, holistic impression on the user.
- Set expectations for both intelligibility (clarity) and adequacy of translation on both holistic scales
Important. All materials are evaluated using the same scales. Only tolerance levels are adjusted.
Select acceptability thresholds for both holistic intelligibility (clarity) and adequacy based on the type of content, its visibility and importance, legal considerations, etc. These expectations are typically higher (8 or more) for marketing materials, website home pages, video and audio content, etc. For ancillary and reference materials it makes sense to set the bar lower to avoid incurring extra cost where it’s not critical.
- Select pieces to be reviewed
These pieces need to be representative enough for holistic assessment, i.e. not smaller than several sentences (~50 words / 30 seconds or more). At the same time, they should not be so big as to run the risk of applying a single grade to content where translation quality is not uniform or consistent enough. (For instance, it’s recommended to analyze a typical document by paragraphs or sections, rather than in its entirety.)
If the volume of content is very large (like a huge website), doing a holistic QA on the full volume is too time-consuming. In such cases it is recommended to review a representative subset of randomly selected pieces spread across the full volume. The rule of thumb is to review at least 7-10% of the content (depending on the overall volume), but preferably no less than several thousand words (or no less than 30 minutes) in total.
When a website contains multimedia (like video clips, embedded presentations or audio samples), we need to review both the text and these multimedia materials, because this is exactly how regular visitors will consume the content.
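The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Piece record, the select_sample function and its defaults (a 10% share, a 3,000-word floor, a 50-word minimum piece size) are hypothetical names and values chosen for the example, not part of the metric itself. Timed multimedia content would need an analogous rule based on duration rather than word count.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """One contiguous piece of content (hypothetical structure)."""
    ref: str         # content reference, e.g. a URL or section title
    word_count: int  # size of the piece in words


def select_sample(pieces, fraction=0.10, min_words=3000,
                  min_piece_words=50, seed=None):
    """Randomly pick pieces until the sample covers the target word volume.

    fraction, min_words and min_piece_words are illustrative defaults,
    not values prescribed by the metric.
    """
    # Pieces shorter than a few sentences cannot produce a holistic impression.
    eligible = [p for p in pieces if p.word_count >= min_piece_words]
    total_words = sum(p.word_count for p in eligible)
    # Target: the larger of the percentage share and the absolute floor,
    # capped at the total volume available.
    target = min(total_words, max(int(total_words * fraction), min_words))

    rng = random.Random(seed)
    shuffled = eligible[:]
    rng.shuffle(shuffled)  # random order spreads the sample across the volume

    sample, covered = [], 0
    for piece in shuffled:
        if covered >= target:
            break
        sample.append(piece)
        covered += piece.word_count
    return sample


# Usage: a hypothetical website with 500 pages of 200 words each.
site = [Piece(f"page-{i}", 200) for i in range(500)]
sample = select_sample(site, seed=42)
print(len(sample), sum(p.word_count for p in sample))
```

Passing a fixed seed makes the selection reproducible, which helps when the same sample has to be reviewed again later or by a second reviewer.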
- Review each piece separately and evaluate it based on the holistic scales provided and the impression the material produces
Important. Be sure to provide time for reviewers to familiarize themselves with the whole concept of holistic reviews and both scales. This will reduce the margin of error caused by insufficient understanding or knowledge of the metric.
Important. Holistic reviews do not require detailed logging of all issues or errors encountered by the reviewer. This would be an unjustified waste of time. The reviewer needs to come up with two holistic ratings for each analyzed piece and provide a brief general comment. This comment needs to present the overall impression together with the most conspicuous and/or systemic errors provided as examples.
It is recommended to create a separate, standard QA form that would contain basic principles, both holistic scales and an area for entering review results, including content references, the area for general comments, and grades assigned by the reviewer to each piece. At Logrus IT, we have both an Excel template for QA reviews and a server-based portal that allows for reviews to be done online, including filling in review forms and analyzing results.
- Calculate averages across all pieces analyzed and produce final review results (pass/fail), if needed.
Averages are compared to preset tolerance levels for each factor (translation intelligibility or adequacy).
Multimedia content (videos, presentations, audio clips) is often localized or produced separately from content translation. It is recommended to grade it separately from translations. That way you will have two or more independent evaluations, one for translation, one for video clips, one for presentations, etc.
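The averaging and pass/fail step can be illustrated with a minimal Python sketch. The PieceReview record, the summarize function and the threshold values below are hypothetical and serve only as an example. The sketch also assumes that a content kind passes only when both averages reach their thresholds, and that a score equal to the threshold counts as acceptable. Neither rule is prescribed by the metric, so adjust them to your project.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PieceReview:
    """Result of reviewing one piece (hypothetical record layout)."""
    ref: str              # content reference
    kind: str             # e.g. "translation", "video", "presentation"
    intelligibility: int  # holistic intelligibility (clarity) grade, 0-9
    adequacy: int         # holistic adequacy grade, 0-9
    comment: str = ""     # brief general comment from the reviewer

    def __post_init__(self):
        # Both scales run from 0 to 9; reject anything outside that range.
        for name in ("intelligibility", "adequacy"):
            grade = getattr(self, name)
            if not 0 <= grade <= 9:
                raise ValueError(f"{name} must be between 0 and 9, got {grade}")


def summarize(reviews, thresholds):
    """Average both factors per content kind and compare with thresholds.

    thresholds maps kind -> (min_intelligibility, min_adequacy).
    """
    by_kind = {}
    for review in reviews:
        by_kind.setdefault(review.kind, []).append(review)

    results = {}
    for kind, items in by_kind.items():
        avg_int = mean(r.intelligibility for r in items)
        avg_adq = mean(r.adequacy for r in items)
        min_int, min_adq = thresholds[kind]
        results[kind] = {
            "intelligibility": avg_int,
            "adequacy": avg_adq,
            # Assumption: a kind passes only if both averages meet thresholds.
            "passed": avg_int >= min_int and avg_adq >= min_adq,
        }
    return results


# Usage with made-up grades and thresholds.
reviews = [
    PieceReview("home page", "translation", 8, 9),
    PieceReview("about page", "translation", 9, 8),
    PieceReview("intro clip", "video", 6, 8, "Subtitles are hard to follow"),
]
thresholds = {"translation": (8, 8), "video": (7, 7)}
results = summarize(reviews, thresholds)
print(results["translation"]["passed"])  # True
print(results["video"]["passed"])        # False
```

Grouping by content kind keeps the evaluation of translated text independent from that of separately produced multimedia, in line with the recommendation above.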
This article offers a complete, ready-to-use, simple, two-factor holistic translation quality metric. This metric makes it possible to quickly and reliably evaluate the overall quality of translated material as perceived by the user.
Based on the results, one can start an in-depth investigation of areas or projects where results fall below expectations, or save time and money by skipping more in-depth reviews for areas where quick evaluations produce positive results.
Real-life LQAs carried out with a more complex hybrid metric, which combines the holistic metric described above with an atomistic metric for assessing the technical quality of translations, have statistically confirmed the reliability and practicality of quick evaluations.
These advantages become even more pronounced when it comes to quickly assessing large volumes of translations done through crowdsourcing and/or for the public sector, where budgets are often very tight, and the deadline was yesterday.
In the second part of the article, which I hope to publish soon, I plan to introduce a couple of complete, ready-to-use 3D hybrid (holistic + atomistic) metrics that provide an optimal combination of usability, cost and level of detail. I am also planning to present an analysis of results obtained during a test project that utilized these 3D hybrid metrics. I will provide some insight on how all three quality “coordinates” produce a 3D picture, and how each of them contributes to the overall evaluation.