
How to Efficiently & Affordably Evaluate MT and Crowdsourcing Translation Quality (Part 3)

 

Leonid Glazychev, Ph.D., CEO, Logrus IT

Part 1: Introduction - Quality measurement landscape fragmentation - Automated and manual approaches used - What we truly need to measure

Part 2: A brief ode to critical thinking - Remaining big questions – The experiment goals, approach, setup and details

 

Results and conclusions

Editing effect – Trustworthiness of results – Holistic vs. atomistic evaluation

Table 2 below presents LQA results for human translation as well as both raw and edited MT. We did a complete 3D, hybrid (holistic + atomistic) evaluation in each case.

As a brief reminder, holistic factors reflect high-level quality (at the level of the text as a whole, section or paragraph), while atomistic quality, true to its name, is measured at the atomic level, such as sentences or strings. All in all, we measured three independent factors in each case:

  • Holistic Adequacy is measured on a 0 – 9 scale and reflects:
      • How closely the translation follows the meaning of, and the message conveyed by, the source
      • Whether there are discrepancies between source and target texts (except for intended ones)
  • Holistic Intelligibility is measured on a 0 – 9 scale and shows:
      • How easy it is to read/listen to/view and understand the target content
      • How clear, unambiguous and well-presented the target content is
  • Atomistic quality covers:
      • Quality issues encountered at and limited to the “atomic” level of the content (sentences, strings, units, etc.)
      • Wrong terminology, broken tags, incorrect links, grammar or locale issues, double spaces, etc.

Altogether, holistic adequacy, holistic intelligibility and atomistic quality form the Quality Triangle presented in the picture below.
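For readers who prefer to see the structure in code, here is a minimal sketch (in Python, with illustrative field names of my own choosing) of a single evaluation record covering all three quality coordinates:

```python
from dataclasses import dataclass

@dataclass
class QualityTriangle:
    """One 3D, hybrid evaluation record; field names are illustrative.

    Holistic factors use the 0-9 scale described in this article;
    the atomistic score is a normalized percentage (1.0 == 100%).
    """
    holistic_adequacy: float         # 0 (disaster) .. 9 (perfect)
    holistic_intelligibility: float  # 0 .. 9
    atomistic_score: float           # normalized atomistic LQA score

# Purely illustrative values, not taken from the article's tables.
example = QualityTriangle(holistic_adequacy=8.0,
                          holistic_intelligibility=7.5,
                          atomistic_score=0.65)
```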

 

In each case, holistic quality is averaged across the five 1,000-word pieces that comprise the text. Atomistic quality is best represented by a normalized LQA score, which is calculated from the error rating: a weighted total of all atomic-level issues discovered in the translation.

You can find considerably more detail on holistic and atomistic quality and on creating your own 3D, hybrid quality metrics (including measurement scales, error typologies, issue severities, error ratings and normalized atomistic quality ratings) in my previous article, Of Quality and Cabbage: The Layered Approach to Building Language Quality Metrics.

 

Table 2. Manual quality evaluations (LQA) before and after editing

*Holistic quality is measured on a 0 – 9 scale, where 9 means perfect quality and 0 reflects a disaster.

**Atomistic quality ratings are normalized so that a perfect result (no issues) produces a rating of 100%, while a good atomistic LQA score for typical human translation is above 60% (error rating <= 0.2).
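The exact normalization belongs to the metric described in the article referenced above; purely as an illustration, a linear mapping consistent with the two anchor points quoted in these footnotes (no issues gives 100%, an error rating of 0.2 gives 60%) looks like this:

```python
def normalized_atomistic_score(error_rating: float) -> float:
    """Map a weighted error rating to a normalized atomistic LQA score.

    A linear sketch fitted to the anchor points quoted above
    (0.0 -> 100%, 0.2 -> 60%); the real metric may differ in detail.
    A returned value of 1.0 corresponds to 100%.
    """
    return 1.0 - 2.0 * error_rating

print(normalized_atomistic_score(0.0))   # 1.0  -> perfect, 100%
print(normalized_atomistic_score(0.15))  # 0.7  -> above the 60% "good" threshold
print(normalized_atomistic_score(1.0))   # -1.0 -> around -100%, deep in the red zone
```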


Let us look at the table and draw the first set of conclusions.

  1. First and foremost, we can conclude that we are indeed comparing apples to apples, i.e. the quality and productivity estimates that will follow are reliable. LQA scores for both human translation and edited MT (the columns with the green background on the left) are similar, consistent and realistic for all three quality “coordinates” (two holistic and one atomistic). In other words, the quality of edited MT as evaluated by humans is similar to the quality of purely human translation in all cases. If that were not the case, we would not be able to draw any productivity-related conclusions, because productivities can only be compared when the “end product” quality is similar under both scenarios.
  2. Editing has definitely improved translation quality, especially when it comes to atomistic LQA scores (see the red arrows). The holistic quality improvement is substantial only for Google SMT and very moderate for both NMT engines, simply because the holistic quality of raw NMT output is already good enough for this particular text.
  3. Raw SMT is unusable without post-editing; it only provides separate, partially reusable “building blocks”. (See raw SMT quality results in the two cells with the yellowish background at the bottom). This is by no means a revelation…
  4. Correlation between holistic and atomistic quality is limited. In other words, these “quality coordinates” are independent to a certain degree. Just look at the figures inside the blue frame on the right: human translation has a much better atomistic LQA score (it is much cleaner in terms of stupid errors) than raw MT from any of the engines. At the same time, holistic quality evaluations are similar for human translation and the raw output produced by both the Google and DeepL NMT engines.
  5. NMT provides better holistic results because it ensures what is known as better translation “smoothness”.
  6. At the same time, atomistic (segment-level) quality after NMT still leaves much to be desired and falls well below acceptable levels for human translation. Normalized atomistic LQA scores for all three MT engines are deep in the red zone (negative values around -100% or worse), while good human translations produce normalized atomistic quality ratings of 60% and higher.

An in-depth look at automated (n-gram-based) quality evaluations

Now that we are confident in the reliability of the results, let us take a closer look at the automated quality evaluations that we ran for three different combinations (see the sketch after the list):

  • Edited MT vs. raw MT
  • Raw MT vs. human translation
  • Edited MT vs. human translation
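Setting up these pairwise comparisons is straightforward to script. The sketch below uses sacreBLEU purely as a stand-in (the actual experiment used hLEPOR; any n-gram-based metric is wired up the same way), and the file names are hypothetical:

```python
# Sketch of running an n-gram-based metric over the three comparison scenarios.
# sacreBLEU stands in for hLEPOR here; file names are hypothetical.
import sacrebleu

def read_lines(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

raw_mt    = read_lines("raw_mt.txt")
edited_mt = read_lines("edited_mt.txt")
human     = read_lines("human_translation.txt")

scenarios = {
    "Edited MT vs. raw MT":            (edited_mt, raw_mt),
    "Raw MT vs. human translation":    (raw_mt, human),
    "Edited MT vs. human translation": (edited_mt, human),
}

for name, (hypotheses, references) in scenarios.items():
    score = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"{name}: {score.score:.1f}")
```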

Table 3 below presents LQA results produced by automation (hLEPOR scores) for all three scenarios listed above (columns on the left) along with the results of 3D, hybrid human LQA for both raw and edited MT, and reference human translation.

Table 3. Automated (n-gram-based) quality evaluations and their reliability


Let us look at the table and summarize the conclusions related to the validity and reliability of automated quality evaluations.

  1. Automated evaluations only produce reasonable results when edited MT is compared to raw MT from the same engine (see the first column on the left; the results shown in green exceed 70%).
  • Although raw output from both DeepL and Google NMT is good enough judging by the LQA scores (see the Holistic Intelligibility and Adequacy ratings in the brown frame on the right), the hLEPOR scores of these translations are inexplicably low when they are compared to human translation. If we judged translations by hLEPOR scores alone, both would have failed miserably, which clearly should not be the case.
  • Even more strikingly, edited MT output, which is definitely good judging by human LQA results, also fails if we look only at the hLEPOR score (gray frames).
  2. As expected, the correlations (or lack thereof) show that translators always try to apply minimal changes while editing MT. This fully explains the much better correlation between edited and raw MT compared to other, unrelated targets.
  3. This can only mean one thing: we need to completely discard automated methods as a reliable indicator of MT quality, because these methods are extremely sensitive to correlations between the MT output and the reference text. Results vary wildly when MT output is compared to a different, unrelated target. As they say, even a broken clock shows the correct time twice a day.

MT editing speed and its correlation with LQA results

As mentioned earlier, editing speed is the only indicator of MT quality that matters when we are considering or applying the scenario under which raw MT is edited to produce human-like quality. We do not care how MT looks or feels in its raw form; we are only interested in how much time it takes to bring it up to the standards of human-sourced content.
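As a back-of-the-envelope illustration of what post-editing productivity means here (the figures below are made up and are not taken from our experiment):

```python
def words_per_hour(word_count: int, minutes_spent: float) -> float:
    """Throughput in words per hour."""
    return word_count * 60.0 / minutes_spent

# Made-up figures for one 1,000-word piece.
translation_speed  = words_per_hour(1000, 240)  # translating from scratch
post_editing_speed = words_per_hour(1000, 110)  # editing raw MT to the same standard

print(f"Post-editing gain: {post_editing_speed / translation_speed:.1f}x")
```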

Table 4 below emphasizes the editing productivity angle and allows us to look for correlations between MT post-editing productivity (column on the left) and all three LQA ratings (holistic adequacy and intelligibility, and atomistic quality rating) for the raw output produced by each engine.

Table 4. MT post-editing productivity compared to all three LQA ratings for raw MT

*Important! While in our case (a relatively generic, contiguous text on IT security) MT post-editing productivity exceeds human translation speed by a factor of 2+ for the NMT engines, this particular result does not guarantee similar gains for other subject areas, language pairs, etc. In this research we concentrated not on the productivity increase as such, but rather on LQA-related correlations.


Let us look at the table and try to find correlations, if any, between raw MT editing speed and LQA evaluations of raw MT output.

  1. There is a clear correlation between holistic quality evaluations, such as overall adequacy and intelligibility (second column), and MT post-editing productivity. (See the figures within the blue frame). Higher holistic evaluations reflect better smoothness and overall adequacy of translations, and we can reliably expect faster editing.
  • The same general conclusion applies not only to MT, but also to any other material that needs additional polishing or editing, such as crowdsourced translations.
  • Editing speed depends on both holistic factors equally, because issues with either general adequacy or intelligibility (comprehension) require translators (editors) to spend more time reinterpreting and/or reconstructing sentences. In practice, I would advise using the average of the two holistic ratings as a measure of expected [post-]editing speed (see the sketch after this list).
  • Important! The existence of this correlation was actually verified on a bigger set of MT engines and language pairs. I did not present all of those results here to avoid the confusion excess figures would cause, since our primary goal was by no means MT engine comparison as such.
  2. While it might seem that there is some correlation between the atomistic error rating (right column) and MT post-editing speed (left column), there is none. Indeed, we have already noted the lack of any correlation between atomistic quality scores and holistic factors (emphasized by the comparison with human translation, for which holistic evaluations are similar but the atomistic error rating is far better). Moreover, when we look at more MT engines (see above), their error ratings vary significantly depending on the particular terminology, grammar and other errors each engine introduces. All such errors are very quick and easy to fix, and they do not affect editing speed nearly as much as holistic factors do.
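If you want to test this correlation on your own data, a minimal sketch along these lines (with hypothetical per-engine figures, using the average of the two holistic ratings as the predictor, as suggested above) would do:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical per-engine data: (adequacy, intelligibility, post-editing speed in words/hour).
engines = {
    "Engine A": (6.5, 6.0, 350),
    "Engine B": (7.8, 7.5, 520),
    "Engine C": (8.1, 7.9, 560),
}

avg_holistic = [(adequacy + intelligibility) / 2
                for adequacy, intelligibility, _ in engines.values()]
speeds = [speed for _, _, speed in engines.values()]

# Pearson correlation between the averaged holistic rating and editing speed.
print(f"Pearson r: {correlation(avg_holistic, speeds):.2f}")
```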

 

Part 4: Summarizing correlations – MT editing productivity vs. holistic QA scores (Intelligibility, Adequacy) - N-gram-based, automated quality scores are unreliable and generally useless - Editing distance is not a reliable measure of editing effort - How to make forecasts - Practical advice: How to set acceptance thresholds – When do you apply this approach – Executive summary

 
