
How to Efficiently & Affordably Evaluate MT and Crowdsourcing Translation Quality (Part 4)

 

Leonid Glazychev, Ph.D., CEO, Logrus IT

Part 1: Introduction - Quality measurement landscape fragmentation - Automated and manual approaches used - What we truly need to measure

Part 2: A brief ode to critical thinking - Remaining big questions – The experiment goals, approach, setup and details

Part 3: Editing effect – Trustworthiness of results – Holistic vs. atomistic evaluation – An in-depth look at automated (n-gram-based) quality evaluations – MT editing speed and its correlation with LQA results

 

Conclusions

Summarizing correlations

MT editing productivity vs. holistic QA scores (Intelligibility, Adequacy)

This correlation allows us to make reliable forecasts about expected MT editing speed. Moreover, holistic evaluations are quick and relatively reliable as long as we apply the universal Logrus IT Holistic Quality Metrics and Scales instead of relying on a reviewer’s “gut feeling”, which results in huge variance and general subjectivity. See details on holistic quality scales and metrics in my previous article, Sharing is caring: Introducing a Universal Holistic Translation Quality Metric.

  • Only hybrid, 3D LQA distinguishes between NMT and SMT.
  • “Smoother” translation and better adequacy result in:
    • Higher holistic scores
    • Higher editing speed

MT editing productivity vs. atomistic QA scores

There is no correlation between MT editing speed and atomistic (traditional) LQA scores. Atomistic LQA scores for raw MT are abysmal in most cases.

  • These scores are highly sensitive to engine defects.
  • Most atomic-level errors actually … do not matter when it comes to editing speed!

N-gram-based, automated quality scores are unreliable and generally useless

These scores, no matter what particular method is used (BLEU, hLEPOR, METEOR, etc.), have a general design-level issue. Namely, they do not treat different correct translations of the same source equally. They are all highly biased towards texts used for MT development/training and towards edited versions of raw MT output (compared to alternative targets).

Overall, n-gram-based, automated quality scores:

  • Demonstrate no general correlation with either LQA results or editing effort. Score variations depending on the reference (target) translation are too large.
  • Only make sense when comparing edited vs. raw MT produced by the same engine
  • Are actually more time-consuming than human holistic evaluations

Under normal (not laboratory) circumstances we do not have a reference target translation. To obtain one, we need to do one of the following. We can translate a statistically sound piece of content (several thousand words) from scratch to create the reference, which obviously requires more effort than carrying out a holistic quality review on a similar volume of raw MT. Alternatively, we can extract a reference translation from an existing TM, but as I indicated earlier (the 2018 Intento research), these TMs very often have rather questionable quality, which defeats the purpose and creates the risk of evaluating something against an absolutely unreliable reference.
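The reference dependence described in this section can be illustrated with a minimal sketch. It uses a simplified clipped bigram precision, which is only a building block of metrics like BLEU and not any of the metrics named above; the function names and all three sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(hypothesis, reference, n=2):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram counts only as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Hypothetical sentences: one raw MT output and two acceptable references.
mt_output = "the device must be switched off before cleaning"
post_edited = "the device must be switched off before you clean it"
independent = "turn the unit off prior to cleaning it"

score_vs_post_edit = ngram_precision(mt_output, post_edited)
score_vs_independent = ngram_precision(mt_output, independent)
```

In this toy case the same MT output scores high against its own post-edited version and zero against an independently written translation, although both references convey the same instruction.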

Editing distance is not a reliable measure of editing effort

Editing distance shows the difference (“distance”) between two texts. If one of them is an edited version of the other (raw MT and post-edited MT), editing distance will indeed show the mechanical difference between the two, and it will equal zero if no changes are introduced. But I would strongly advise against using editing distance as a measure of editing effort.

It is a well-known fact that translators tend to introduce as few changes to MT during editing as possible, and this conclusion is supported by benchmarks presented earlier in this article. At the same time, editing distance is a purely mechanical measure that by no means reflects the level of change in either the meaning or the message of the content, or the intellectual effort required to introduce this change. Changes to the text may be minimal, but the impact these changes make can be overwhelming, and the time required to analyze the original translation, compare it to the source, and edit/fix it in the most economical way may be substantial.

Just imagine a perfectly realistic case in which NMT omits a small but essential part of the sentence, or a “no”. The resulting translation may look smooth, but its meaning will either be far off or diametrically opposed to the source. It takes time to unearth the error (you need to make a full comparison between the translation and the source) and fix it, even if this fix is essentially tiny.
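A minimal sketch of the omitted-negation case, assuming a plain word-level Levenshtein distance (production tools may use character-level or normalized variants). The function name and both sentences are invented for illustration.

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance between two sentences."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distances from an empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# Hypothetical raw MT that dropped the negation, and its post-edited fix.
raw_mt = "Do expose the battery to direct sunlight"
post_edited = "Do not expose the battery to direct sunlight"

distance = word_edit_distance(raw_mt, post_edited)  # a single inserted word
```

The distance here is 1, yet the editor had to read the source, notice that the meaning was reversed, and restore it. The number says nothing about that effort or about the severity of the error.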

How to make forecasts

First and foremost, let’s reiterate that the approach to making forecasts depends on the MT/crowdsourcing use case:

  • Content is published “as is”
  • Content is expected to be edited by a human.

The underlying reason is that evaluation criteria in these two cases are different.

We start with the steps common to both approaches.

  1. Set quality acceptance thresholds relevant for the subject area, content visibility, exposure, and sensitivity, etc.

Use the Logrus IT 3D, hybrid “Quality Triangle” model.

  2. Select sample(s) to be checked.

It’s best to test at least 3,000–5,000 words for each engine, translator, language pair, subject area and/or vertical market (in case of crowdsourcing or “traditional” human translation). One and the same MT engine can perform quite differently under varying circumstances (language pairs, subject area, etc.).

The evaluations we need to carry out depend on the use case/scenario, as explained above. So there are two versions of step 3 (it’s not a formatting mistake).

  3. For content that is going to be edited by a human [at a later stage]:

For contiguous content

  • We only need to run a Holistic LQA and evaluate translation Adequacy and Intelligibility.
  • Then we compare results to acceptance thresholds set earlier (step 1).

For non-contiguous content

  • We are limited to atomic-level LQA only (done segment by segment), but it is sufficient to apply a simplified approach as follows.
  • We only check Adequacy and Intelligibility for each unit (sentence, segment, line, etc.).
  • We simply ignore all other types of issues.
  • We then calculate averages for both Adequacy and Intelligibility across all units checked.
  • Finally, we compare these values to acceptance thresholds set earlier (step 1).

  3. For content published “as is”:
  • We run a holistic LQA (if applicable, i.e. for contiguous content only). This is done at a global level, i.e. for each contiguous sample as a whole.
  • At this stage we evaluate both holistic translation Adequacy and Intelligibility.

– AND –

  • We then run a limited atomic-level LQA that concentrates on serious issues only, at the unit level. We only log two types of atomic-level errors in this case and ignore all others:
    • Adequacy and Intelligibility issues for each unit (sentence, segment, line, etc.).
    • High-severity atomic errors, such as typos in page or section headings, inappropriate language, including pejorative, inciting or politically incorrect speech, etc.
  • We then calculate averages for all logged errors across all units checked.
  • Finally, we compare these values to acceptance thresholds set earlier (step 1).
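The per-unit averaging and threshold comparison used in step 3 can be sketched as follows. This is an illustration under stated assumptions, not part of the Logrus IT tooling: the names `UnitScore`, `average_scores` and `meets_thresholds` are hypothetical, the sample scores are invented, and the thresholds passed in are placeholders for whatever you set in step 1.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class UnitScore:
    """Reviewer scores for one unit (sentence, segment, line, etc.)."""
    adequacy: int         # assumed to be on the 0-9 holistic scale
    intelligibility: int  # assumed to be on the 0-9 holistic scale


def average_scores(units):
    """Return (average adequacy, average intelligibility) across all units."""
    return (mean(u.adequacy for u in units),
            mean(u.intelligibility for u in units))


def meets_thresholds(units, min_adequacy, min_intelligibility):
    """Compare both averages to the acceptance thresholds set in step 1."""
    avg_adequacy, avg_intelligibility = average_scores(units)
    return (avg_adequacy >= min_adequacy
            and avg_intelligibility >= min_intelligibility)


# Invented sample of three reviewed units.
sample = [UnitScore(7, 8), UnitScore(5, 6), UnitScore(6, 7)]
avg_adequacy, avg_intelligibility = average_scores(sample)
passed = meets_thresholds(sample, min_adequacy=5, min_intelligibility=5)
```

Keeping the two dimensions separate, rather than merging them into one number, lets a sample fail on Adequacy alone even when it reads smoothly.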

Practical advice: How to set acceptance thresholds – When do you apply this approach

Setting acceptance thresholds

While everything depends on your particular circumstances and fastidiousness, here are some general recommendations on setting acceptance thresholds/tolerance levels.

  • Holistic adequacy and intelligibility.

Logrus IT holistic quality scales include 10 distinctive quality levels each, from 0 to 9. I strongly recommend using them, and reinventing the wheel only when you are already well familiar with the topic. If the resulting holistic quality is:

  • 6 or better, it is good for both scenarios (either post-editing the content or using it “as is”);
  • 5 or better, we can expect improvements in editing productivity;
  • 4 or less, it typically means that we should not place high hopes on pre-translated content. Most probably, it is not good enough for publishing “as is”, and post-editing is unlikely to be faster than translating from scratch using only TMs.

  • Atomistic evaluation (when the translation is used “as is”).

As a reminder, we count serious errors only. For good human translation we expect the normalized atomistic error rating to exceed 60%, which is equivalent to 1 medium-severity error per 100 words.

  • A reasonable expectation for MT in this scenario is above 0%, which is equivalent to:
    • 1 high-severity error + 1 minor error per 100 words, given that there are no egregious (critical-level) errors
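The holistic tolerance levels recommended earlier in this section (6 or better, 5 or better, 4 or less) can be encoded as a small lookup. This is only a sketch: the function name and the wording of the returned labels are my own shorthand for the recommendations above, and your own thresholds may differ.

```python
def interpret_holistic_score(score):
    """Map a 0-9 holistic quality score to the recommendation given above."""
    if not 0 <= score <= 9:
        raise ValueError("holistic scores are expected on a 0-9 scale")
    if score >= 6:
        return "good for post-editing and for publishing as is"
    if score >= 5:
        return "post-editing productivity gains expected"
    return "unlikely to beat translation from scratch"


# Example: interpret the averaged result of a holistic review.
recommendation = interpret_holistic_score(5.5)
```

The function accepts fractional values so that it can be applied to averages across several samples or reviewers.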

When do you need this forecast/evaluation

Here is a short list of cases in which this quick evaluation methodology can prove useful.

  • You use
    • Crowdsourcing
    • MT (with or without editing)
    • Ultra-cheap translation of questionable quality
  • You want to check the quality of existing materials
    • MT corpus
    • TMs
  • You are in the process of selecting the best MT:
    • Engine(s) for your particular needs or context or language pairs
    • Use model (“as is” or editing)
  • You care about the quality of your translations!

TL;DR

  1. The article presents a viable and economical way to make reliable MT (or crowdsourcing) quality forecasts (or evaluations) that indicate:
  • Potential savings/productivity increase for the post-editing scenario.
    • Holistic quality ratings of 5 or more usually mean that we expect to save money/effort by applying (MT + post-editing) compared to translation from scratch.
    • Expected productivity increase is proportional to these holistic quality ratings.

– OR –

  • The potential level of issues with content published “as is”.
    • Under this scenario only the most severe issues truly matter, such as unintelligible or inadequately translated segments, or critical atomic-level errors.

The suggested method works, and forecasts are reliable because MT editing productivity is in fact correlated with holistic QA scores (Intelligibility, Adequacy).

  2. The outlined approach is both quick and affordable.
  • Neither holistic nor limited atomistic LQA is time- or effort-consuming, and the volumes that need to be checked are limited.
  • This approach is actually faster and cheaper than automated evaluations.

Automation requires human-sourced quality reference content, which is more expensive to produce than the evaluations suggested earlier. Using isolated parts of existing large TMs as reference content is risky, because very often translation quality in these TMs is well below expectations for reference content.

  3. Contrary to common belief, n-gram-based, automated quality scores (BLEU, hLEPOR, METEOR, etc.) are useless under both scenarios, i.e. for predicting (or evaluating) MT editing speed, or assessing MT quality when used “as is”.

There is no correlation between these scores and either MT editing speed or LQA results. All n-gram-based, automated quality scores are highly biased towards texts used for MT development/training and edited versions of text produced by raw MT (compared to alternative targets).

There are two more commonly applied methods that do not work when we need to predict or assess MT editing speed (evaluate editing effort).

  4. Atomistic (traditional) LQA scores show no correlation with either MT editing speed or overall translation content quality.

The primary reason is that most atomic-level errors actually do not matter when it comes to editing speed, as they hardly affect it at all. Atomistic LQA scores are also highly sensitive to particular engine defects. For raw MT they are abysmal in most cases.

  5. Editing distance is not a reliable measure of MT editing effort.

It is a purely mechanical indicator that by no means reflects the level of change in either the meaning or the message of the edited content, or the intellectual effort required to introduce this change. Changes to the text may be minimal, but the impact these changes have, and the time required to analyze the original translation, compare it to the source, and edit/fix it in the most economical way, may be substantial.

I want to extend special thanks to my dear friend and colleague, Mr. @FedorBezrukov, who led the team that meticulously conducted all experiments described in this article and staunchly accommodated all my change requests. Fedor also contributed to the MT topic in a number of other ways.

I welcome everyone to use the material presented in this article for your own purposes, including business. The only requirement is to always provide a clear and direct reference to the author, links to my work and my company, Logrus IT when describing or specifying the origin of the methodology, approaches, metrics, advice and/or metric building blocks.

 

Thank you for reading!

Thoughts? Questions? Send me an email or contact me on LinkedIn:

leonidg@logrusit.com

www.linkedin.com/in/leonidglazychev/

 
