Leonid Glazychev, Ph.D., CEO, Logrus IT
Part 1: Introduction - Quality measurement landscape fragmentation - Automated and manual approaches used - What we truly need to measure
MT editing productivity vs. holistic QA scores (Intelligibility, Adequacy)
This correlation allows us to make reliable forecasts about expected MT editing speed. Moreover, holistic evaluations are quick and relatively reliable as long as we apply the universal Logrus IT Holistic Quality Metrics and Scales instead of relying on a reviewer’s “gut feeling”, which produces huge variance in results and general subjectivity. See details on holistic quality scales and metrics in my previous article, Sharing is caring: Introducing a Universal Holistic Translation Quality Metric.
MT editing productivity vs. atomistic QA scores
There is no correlation between MT editing speed and atomistic (traditional) LQA scores. Atomistic LQA scores for raw MT are abysmal in most cases.
N-gram-based, automated quality scores are unreliable and generally useless
These scores, no matter what particular method is used (BLEU, hLEPOR, METEOR, etc.), share a general design-level issue: they do not treat various correct translations of one and the same source equally. They are all highly biased towards texts used for MT development/training and towards edited versions of text produced by raw MT (compared to alternative targets).
Overall, n-gram-based, automated quality scores also require a reference target translation, which under normal (not laboratory) circumstances we do not have. To obtain one, we need to do one of the following. We can translate a statistically sound piece of content (several thousand words) from scratch to create the reference, which obviously requires more effort than carrying out a holistic quality review on a similar volume of raw MT. Alternatively, we can extract a reference piece from an existing TM, but as I indicated earlier (the 2018 Intento research), these TMs very often have rather questionable quality, which defeats the purpose and creates the risk of evaluating translations against an absolutely unreliable reference.
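To make the design-level bias concrete, here is a minimal Python sketch (my own illustration, not part of the Logrus IT methodology or of any specific metric implementation). It computes a rough, smoothed n-gram precision against a single reference, in the spirit of BLEU, and shows how an equally correct alternative translation is heavily penalized simply because it does not reuse the reference wording. The sentences are invented for the example.

```python
from collections import Counter

def ngram_overlap(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Average smoothed n-gram precision of `hypothesis` against a single
    `reference` -- a rough, BLEU-like score for illustration only."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())   # clipped n-gram matches
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append((overlap + 1) / (total + 1))      # add-one smoothing
    return sum(precisions) / len(precisions)

# One source sentence, two correct translations (invented examples).
reference   = "please do not switch off the device while the update is running"
candidate_a = "please do not switch off the device while the update is running"
candidate_b = "do not power the unit down until the update has finished"

print(ngram_overlap(candidate_a, reference))  # 1.0 -- identical to the reference
print(ngram_overlap(candidate_b, reference))  # far lower, although equally correct
```

Production metrics such as BLEU add brevity penalties and corpus-level aggregation, but the dependence on the particular reference wording remains the same.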
Editing distance is not a reliable measure of editing effort
Editing distance shows the difference (“distance”) between two texts. If one of them is an edited version of the other (raw MT and post-edited MT), editing distance will indeed show the mechanical difference between the two, and it will equal zero if no changes are introduced. But I would strongly advise against using editing distance as a measure of editing effort.
It is a well-known fact that translators tend to introduce as few changes to MT during editing as possible, and this conclusion is supported by the benchmarks presented earlier in this article. At the same time, editing distance is a purely mechanical measure that by no means reflects the level of change in either the meaning or the message of the content, or the intellectual effort required to introduce this change. Changes to the text may be minimal, but the impact these changes make can be overwhelming, and the time required to analyze the original translation, compare it to the source, and edit/fix it in the most economical way may be substantial.
Just imagine an absolutely realistic case where NMT omits a small but essential part of the sentence, or a “no”. The resulting translation may look smooth, but its meaning will either be far off or diametrically opposed to the source. It takes time to unearth the error (you need to make a full comparison between the translation and the source) and fix it, even if the fix itself is tiny.
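To illustrate, below is a small Python sketch (the sentences are invented; this is not data from the benchmarks above). A standard character-level Levenshtein distance registers only a handful of edits when the editor restores a dropped negation, even though the fix reverses the meaning of the sentence and spotting it requires a full comparison with the source.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic character-level edit distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

# Invented example: raw NMT output dropped the negation, the editor restored it.
raw_mt      = "The warranty covers damage caused by normal use."
post_edited = "The warranty covers no damage caused by normal use."

dist = levenshtein(raw_mt, post_edited)
print(dist, f"edits, i.e. {dist / len(post_edited):.1%} of the edited text")
# -> 3 edits (about 6% of the text), yet the meaning is reversed, and finding
#    the omission required reading the full translation against the source.
```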
First and foremost, let’s reiterate that the approach to making forecasts depends on the MT/crowdsourcing use case: the content is either published “as is” or expected to be edited by a human.
The underlying reason is that evaluation criteria in these two cases are different.
We start with the things common to both approaches.
Use the Logrus IT 3D, hybrid “Quality Triangle” model.
It’s best to test at least 3,000–5,000 words for each engine, translator, language pair, subject area and/or vertical market (in the case of crowdsourcing or “traditional” human translation). One and the same MT engine can perform quite differently under varying circumstances (language pair, subject area, etc.).
The evaluations we need to carry out depend on the use case/scenario, as explained above. So there are two versions of step 3 (it’s not a formatting mistake).
For contiguous content
For non-contiguous content
– AND –
While everything depends on your particular circumstances and fastidiousness, below please find some general recommendations on setting acceptance thresholds/tolerance levels.
Logrus IT holistic quality scales each include 10 distinct quality levels, from 0 to 9. I strongly recommend using them and only reinventing the wheel once you are already well familiar with the topic. If the resulting holistic quality is:
As a reminder, we count serious errors only. For good human translation we expect the normalized atomistic error rating to exceed 60%, which is equivalent to 1 medium-severity error per 100 words.
When do you need this forecast/evaluation?
Below please find a quick list of cases where this evaluation methodology can prove useful.
– OR –
The suggested method works, and forecasts are reliable because MT editing productivity is in fact correlated with holistic QA scores (Intelligibility, Adequacy).
Automation requires human-sourced, quality reference content, which is more expensive to produce than the evaluations suggested earlier. Using isolated parts of existing large TMs as reference content is risky, because translation quality in these TMs is very often well below what is expected of reference content.
There is no correlation between these scores and either MT editing speed or LQA results. All n-gram-based, automated quality scores are highly biased towards texts used for MT development/training and towards edited versions of text produced by raw MT (compared to alternative targets).
There are two more commonly applied methods that do not work when we need to predict or assess MT editing speed (evaluate editing effort).
The primary reason is that most atomic-level errors actually do not matter when it comes to editing speed, as they hardly affect it at all. Atomistic LQA scores are also highly sensitive to particular engine defects. For raw MT they are abysmal in most cases.
It is a purely mechanical indicator that by no means reflects the level of change in either the meaning or the message of the edited content, or the intellectual effort required to introduce this change. Changes to the text may be minimal, but the impact these changes have, and the time required to analyze the original translation, compare it to the source, and edit/fix it in the most economical way, may be substantial.
I want to extend special thanks to my dear friend and colleague, Fedor Bezrukov, who led the team that meticulously conducted all the experiments described in this article and staunchly accommodated all my change requests. Fedor also contributed to the MT topic in a number of other ways.
I welcome everyone to use the material presented in this article for their own purposes, including business. The only requirement is to always provide a clear and direct reference to the author, with links to my work and my company, Logrus IT, when describing or specifying the origin of the methodology, approaches, metrics, advice, and/or metric building blocks.
Thank you for reading!
Thoughts? Questions? Send me an email or contact me on LinkedIn:
www.linkedin.com/in/leonidglazychev/