
How to Efficiently & Affordably Evaluate MT and Crowdsourcing Translation Quality (Part 1)

 

Leonid Glazychev, Ph.D., CEO, Logrus IT

 

Introduction

In a series of previous articles, I’ve introduced the overall quality evaluation methodology and the metric-building blocks behind it.

With the methodology and metric-building blocks in place, we can get into practical applications. Heavy reliance on machine translation (MT) and crowdsourcing in today’s translation industry has inevitably increased the share of terrible translations and moved quality-related concerns to the front burner. While in many cases publishers (or consumers) do not require or expect near-perfect quality, they still want translated or localized content to be intelligible, adequate, and generally usable/tolerable.

We need to be able to quickly and affordably assess the quality of translated content and make reliable quality forecasts. Such forecasts are vital for selecting the best MT engine for a particular project, deciding whether the crowdsourcing model is applicable, and more.

While the goal may seem obvious, and the challenge by no means new, existing solutions are quite diverse. Regrettably, our understanding of what exactly each of these approaches measures, and where each is applicable, remains limited. There is also little proof that the assessments we make are reliable.

In this article, I’ll present two major quality assessment scenarios, analyze existing translation quality measurement methods, and compare them directly based on the results of a carefully staged experiment. I will also propose concrete, viable, and economical quality measurement and forecasting recipes for each of the two use cases.

 

Landscape fragmentation

How we approach quality assessment depends on a number of factors, such as our background, field of work, and so on. When it comes to the translation and localization industry, the simplified breakdown goes something like this:

  1. Machine translation (MT) researchers, developers, and trainers
  • Tend to use n-gram-based metrics exclusively
  2. The translation and localization industry
  • Relies predominantly on traditional LQA (various metrics, typically limited to atomic-level evaluations)
  • At the same time, it semi-blindly trusts n-gram-based methods (BLEU, METEOR, hLEPOR, etc.)
    • Big and midsize LSPs often require discounts from freelancers or other suppliers based on n-gram scores. Sometimes editing distance is calculated instead (a minimal sketch follows this list).
  3. Enthusiasts, managers, lone wolves, etc.
  • Measure editing effort, i.e. the time required to edit a translation unit after MT or another translator.
    • This approach allows us to assess the efficiency gain, if any, when we first apply translation memory (TM) and then MT, compared to human translation that relies on TM alone. The same applies to comparing MT followed by crowdsourced editing against traditional human crowdsourcing.
    • Few CAT tools support this measurement directly. The only popular tools I know of that provide time measurement capabilities are MemoQ and MateCat. With other tools, we have to rely on independent human measurement. In practical terms, this means that each translator or editor diligently starts a timer and works without breaks or distractions until it becomes absolutely unbearable, at which point they stop the timer, do what they need to do, and repeat the whole process until the work is complete. This gives us the total time it took to complete a certain volume of translation or editing.
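To make the editing distance idea concrete, here is a minimal sketch in Python. It is not tied to any particular CAT tool, the segment texts are invented, and the character-level normalized edit distance shown here is only one possible proxy for editing effort:

# Sketch: normalized edit distance between raw MT output and its post-edited version.
# A value of 0.0 means the editor changed nothing; 1.0 means everything was rewritten.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

def normalized_edit_distance(mt_output: str, post_edited: str) -> float:
    longest = max(len(mt_output), len(post_edited)) or 1
    return levenshtein(mt_output, post_edited) / longest

# Hypothetical segment pair, for illustration only.
raw_mt = "The update will be install automatic after restart."
edited = "The update will be installed automatically after the restart."
print(f"Normalized edit distance: {normalized_edit_distance(raw_mt, edited):.2f}")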

Let us quickly discuss the advantages, disadvantages, and, most importantly, the concerns associated with each of the three quality measurement approaches: automated n-gram-based evaluation; traditional, human LQA; and editing effort.

Variations of n-gram-based methods, like BLEU, METEOR, hLEPOR, etc.

These automated methods are popular with MT developers and researchers. They have historically been used to compare MT engines or to evaluate engine improvements and/or training efficiency. The primary reasons for their popularity are cost and speed: evaluations are automated and relatively quick. Their biggest drawback is the need for a human-sourced reference (sample) translation.

There is a general (though not well substantiated) belief in the MT industry that n-gram-based methods are reliable and produce higher ratings for better-translated texts, which are expected to be closer to the original human-sourced reference. At the same time, there are a number of obvious, concept-level concerns related to all n-gram-based quality assessment methods.

The biggest concern: An alternative, absolutely correct translation that heavily uses synonyms will get a rather low score. Given that multiple absolutely correct translations are available for most sentences/units (except for the most trivial ones, like “Click OK”), n-gram-based methods will produce assessments that are completely off the mark in all cases when the translation is correct, but does not closely resemble the sample.

In theory, this deficiency can be overcome if multiple samples (3-4) are used, and we select the best score for each translation. In other words, if the translation is close to at least one of the references available, it’s good. We do not expect a limitless variety of options for each case. In reality, no one will ever have multiple reference translations available. Producing them would be costly and defeat the purpose.
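As an illustration, here is a minimal sketch of the single-reference problem and the multi-reference workaround. It assumes the Python sacrebleu package is installed; the hypothesis and reference segments are invented for illustration only:

# Sketch: BLEU against one reference vs. several alternative references (sacrebleu).
import sacrebleu

hypothesis = ["Press the OK button to confirm your choice."]

# A single reference penalizes a correct paraphrase; alternative references soften that.
single_reference = [
    ["Click OK to confirm your selection."],
]
multiple_references = [
    ["Click OK to confirm your selection."],
    ["Press OK to confirm your choice."],
    ["To confirm your choice, press the OK button."],
]

# The multi-reference score is typically noticeably higher for the same hypothesis.
print("BLEU, one reference:   ", sacrebleu.corpus_bleu(hypothesis, single_reference).score)
print("BLEU, three references:", sacrebleu.corpus_bleu(hypothesis, multiple_references).score)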

The other concern is the questionable quality of the reference translations themselves. They are most frequently taken from big, unverified public TMs. A particular piece of the TM is “isolated” and used as a reference, while the remaining bulk is used to train MT engines. Regrettably, the quality of such references is often rather low, so we end up using low-quality references to assess the quality of machine translation. This particular trap is very real, and you can find a good example in the Intento research mentioned earlier: a human reference translation extracted from a large English-German TM in the medical patents area scored 7th among 14 MT engines in a blind LQA comparison!

To summarize, general confidence in results obtained using n-gram-based translation quality assessment methods is low. Moreover, it is not entirely clear what type of quality assessment these methods produce, which I’ll discuss in detail in the next section.

LQA (Language Quality Assurance)

This is the approach of choice for evaluating the quality of final or pre-release materials. When it comes to MT quality, LQA produces accurate results (in terms of quantifying human sentiment) only if the final product (translation) uses raw MT.

LQA is done by a human and should always be based on a metric. The traditional, atomic approach (analyzing each translated unit/segment and logging all issues) is more time-consuming than automated evaluations. At the same time, it adequately highlights all issues with the translation and allows us to evaluate any translated content. No sample reference is required.

One big question is whether detailed, atomic-level LQAs reflect real human sentiment. Many technical issues, such as missing commas, double spaces, or even slightly incorrect terminology, contribute to the error rating but are often perceived as nothing more than a minor irritant. These issues do not always have a serious effect on human perception or comprehension of the translated content. A text with numerous minor issues may receive an objectively low atomic quality rating, but still read as completely acceptable to a human.
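To illustrate the effect, here is a deliberately simplified, hypothetical weighted-error calculation (not the specific metric discussed in the article referenced below). It shows how a handful of minor issues can drag an atomic score down even when the text still reads acceptably:

# Sketch: a toy atomic LQA score with assumed severity weights.
SEVERITY_WEIGHTS = {"critical": 10, "major": 5, "minor": 1}  # assumed, for illustration

def lqa_score(issues, word_count, max_score=100.0, per_words=1000):
    """Penalty points per 1,000 words, subtracted from a 100-point ceiling."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for severity in issues)
    return max_score - penalty * per_words / word_count

# Hypothetical log: 12 minor issues (double spaces, commas, small terminology slips)
# in a 600-word sample that a reader would still find perfectly usable.
logged_issues = ["minor"] * 12
print(f"Atomic LQA score: {lqa_score(logged_issues, word_count=600):.1f} / 100")
# 100 - 12 * 1000 / 600 = 80.0 -- a low-looking score for a perfectly readable text.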

This particular issue and the 3D, hybrid approach to quality is discussed in detail in a separate article, Reliably Measuring Something That Isn't Completely Objective: The Quality Triangle Approach to Translation Quality Assurance.

Editing effort

The editing time approach measures exactly what you think: the time required to edit the MT output (or crowdsourcing results) and bring the translation up to the human-level standard. If we also record the time required to translate the same material from scratch, we can easily calculate the direct efficiency gain achieved. This particular approach represents the most pragmatic, objective method of evaluating MT efficiency in cases where the output is later edited by a human.

Still, this method targets only a single scenario, one in which we expect to edit content obtained using less expensive alternatives. Its applicability beyond that scenario is questionable. Let’s also not forget that these measurements need to be compared against similar results for human-only translation, which still has to be performed.
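For illustration, here is a back-of-the-envelope sketch of the efficiency-gain calculation. The timing figures are invented; in practice they would come from a CAT tool or from translators timing their own work:

# Sketch: direct efficiency gain of MT post-editing over translating from scratch.
def efficiency_gain(edit_minutes: float, from_scratch_minutes: float) -> float:
    """Fraction of human translation time saved by post-editing MT instead."""
    return 1.0 - edit_minutes / from_scratch_minutes

# Hypothetical: the same 1,000 words took 95 minutes to post-edit
# versus 160 minutes to translate from scratch with TM alone.
gain = efficiency_gain(edit_minutes=95, from_scratch_minutes=160)
print(f"Efficiency gain from MT post-editing: {gain:.0%}")  # roughly 41%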

 

What do we need to measure?

When it comes to translation quality, this is the million-dollar question.

The correct answer depends on the content consumption scenario.

There are two basic types of content consumption:

  1. Translations, including MT or crowdsourced ones, are published “as is”

In this case, we need to measure human perception of the content as is, with all issues and errors.

Running a manual LQA is the most accurate method. I’ll discuss optimal metrics later in this article.

  2. Translations are edited by a human prior to publishing

In this case we need to measure productivity gain.

When MT is used as the first pre-translation step and edited by a human, we are interested less in human sentiment and more in the effort required to finalize the translation. For example, an ugly translation that can still be fixed quickly means a significant efficiency gain. In other words, the quality can be excellent from the narrowly focused “editability” vantage point, even though the translation as such might not pass even a lax LQA.

The most accurate approach is to measure editing effort.

Selection of the proper quality measurement method depends on the content consumption scenario. There is no evidence that the two methods above (LQA vs. editing effort) provide similar (or even correlated) results.

 

Part 2: A brief ode to critical thinking – Remaining big questions – The experiment goals, approach, setup and details

Part 3: Editing effort – Trustworthiness of results – Holistic vs. atomistic evaluation – An in-depth look at automated (n-gram-based) quality evaluations – MT editing speed and its correlation with LQA results

 
