How to Measure the Quality of Almost Anything

Leonid Glazychev, Logrus IT CEO

Quality evaluation is needed everywhere. Whether original or localized, content across platforms, genres, and industries can benefit from reliable, actionable quantitative assessment. How informative and appealing are your marketing materials? How effectively do client managers communicate? How intuitive is your UI? How well are goods presented on an online marketplace? Regrettably, no universal approach yet exists that spans all these scenarios. Can we create a modern astrolabe to help us successfully navigate the oceans of content and make reliable, educated judgments in each case?

This article is my attempt at creating a universally applicable Multidimensional Quality Approach to evaluate various quality aspects of content, both original and localized. This approach is applicable not only to uniform content, but also to complex (structured) objects, such as product listings, personalized ads, scientific articles, etc. The Multidimensional Quality Approach is highly customizable by definition; each metric can be created or fine-tuned to emphasize and measure only the factors that are important or relevant to the specific review goals, content subject matter area, structure, etc. The concept itself is the next step in developing the original hybrid holistic/atomistic Quality Triangle model first presented in 2014.

The First Small Step

When I first proposed the original hybrid holistic/atomistic Quality Triangle model to evaluate translated content, several clients inquired about a similar, universal, scalable and customizable approach for original content they or their users create.

Original content is added at an incredible pace worldwide; it is naturally diverse, and a significant share of it is never translated or localized.

When I began modifying the original 3D, hybrid evaluation model to accommodate original content, I found that the required adaptation was minimal, and the modified Quality Triangle approach works just as well for “traditional” content created by clients, such as documentation, marketing materials, etc.

Let us look at modifications required for all three quality factors, including the two holistic ones, and atomistic quality.

  • Holistic Intelligibility measures how easy content is to read, view, or listen to. High Intelligibility means that the original (or translated) content is clear, unambiguous, and well-presented.

    This factor is either used “as is”, similar to its prototype for translated content, or is replaced by Holistic Engagement. The latter is used when we need to understand how attractive/engaging, clear, and easy to understand the content/communication is.

  • Holistic Adequacy is not applicable to original content, as there is nothing to compare it against. It can be seamlessly replaced by Holistic Relevance or Informativeness, which indicates:
    • How well the content corresponds to user expectations for the product or service category,
    • How informative and structured the content is, and
    • How well it agrees with other descriptions or information, if available.

    In either case quality scales required for holistic measurements do not differ much from their prototypes designed for translated content.

  • The third quality factor, Atomistic Quality, can also be applied unchanged to original content.

It summarizes quality issues encountered at and limited to the “atomic” level, such as sentences, strings, lines, pieces, etc. Deviations/issues include broken tags, incorrect links, grammar, locale, or spelling errors, double spaces, formatting problems, etc.

Example

To illustrate how the modified evaluation model works, let us evaluate this article! We’ll use the same three quality factors listed above as follows:

  • Engagement indicates how appealing, clear, and easy to understand the article is, and how well I have presented the content.
  • Relevance/Informativeness measures how informative and structured the content is, and how well it corresponds to your expectations.
  • Atomistic quality will reflect the abundance and severity of missing or broken links, grammar, spelling, and formatting mistakes, poor image quality, etc.

The Simple Part – Direct Applications and “Traditional” QA

Minimal metric fine-tuning allows us to extend coverage to an extensive number of real-life cases. This approach is also highly scalable. For instance, the evaluation can be limited to holistic factors only, which makes it both faster and cheaper.

This approach works well for “traditional” evaluation scenarios dealing with:

  • Verbal/video communications, including educational or promotional videos, presentations, sales performance, customer interactions (at banks, retail stores/chains, etc.),
  • Original corporate materials, such as Web and marketing content, documentation, scripts, etc.

The goal of “traditional” quality assurance (QA) is to understand overall content quality and either learn from mistakes or fix issues/rework content. This includes:

  • Producing one or more quality ratings, holistic and/or atomistic;
  • Logging all issues. One or more issues are logged for each imperfect unit, and nothing for error-free units;
  • Evaluating overall content quality – OR – supplier performance (vs. expectations);
  • Comparing obtained quality ratings to predefined thresholds (getting a PASS or a FAIL for each factor).
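The last step above, comparing obtained ratings to predefined thresholds, can be sketched in a few lines. This is a minimal illustration only; the factor names and threshold values below are invented assumptions, not part of the methodology itself.

```python
# Hypothetical sketch: compare per-factor quality ratings to predefined
# thresholds and produce a PASS or FAIL for each factor independently.
# Factor names and threshold values are illustrative assumptions.

THRESHOLDS = {
    "intelligibility": 7.0,   # holistic factor, assumed 0-9 scale
    "adequacy": 7.0,          # holistic factor
    "atomistic": 8.0,         # atomistic quality rating
}

def pass_fail(ratings: dict[str, float]) -> dict[str, str]:
    """Return PASS/FAIL per quality factor; missing ratings count as 0."""
    return {
        factor: "PASS" if ratings.get(factor, 0.0) >= minimum else "FAIL"
        for factor, minimum in THRESHOLDS.items()
    }

result = pass_fail({"intelligibility": 8.2, "adequacy": 6.5, "atomistic": 9.0})
# One FAIL in any factor is visible on its own, rather than being
# averaged away into a single combined score.
```

Note that each factor gets its own verdict; this anticipates the per-category thresholds discussed later for the Quality Polygon.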

Digging Deeper – Where the Simple Evaluation Approach Does Not Work

The “traditional” evaluation approach described above works well for:

  • Uniform content like user assistance or documentation, software, etc., where evaluated objects are all similar, such as sentences, paragraphs or sections in a text, software strings, web pages, etc.
  • Traditional publishing scenarios involving professional authors/creators, a certain level of review and/or quality control prior to publishing, and fixing at least the most egregious errors.

But the world of content is considerably more multi-faceted, and plenty of real-life scenarios call for a model that is more flexible and diverse. Let us consider some of them.

  • Complex content, where evaluated objects have an internal structure and/or contain multiple different components, and numerous diverse requirements apply. This includes product listings (e-commerce storefronts), personalized ads, journal articles, use case scenarios (purchasing or returning a product), etc.

    Each such object has a structure (title, keywords or search words, abstract, selling points, in-depth description, use cases, graphics and/or videos, reviews or feedback, consecutive steps, etc.). Requirements for each component are often quite specific and typically unrelated to other components. There are also expectations for the object as a whole, like its structure, total size/volume, etc.

  • Huge volumes of unrelated content subject to continuous, quick changes. Typical examples include online marketplaces, social media, or personalized ads. This content is often produced by a multitude of authors, and there is no reason or timeline to fix things. Obsolete pieces are simply forgotten, removed, or replaced with new or updated ones independently, without any systemic approach.

A significant share of such content is created by non-professionals, and some of it is generated by AI or MT. Platform control over this content is cursory at best. It is typically limited to general guidelines and a set of red flags that initiate a removal procedure (or a dispute).

Defining Perceived Quality

It is not easy to produce a universal, high-level definition of perceived quality.

  • Objects we need to evaluate can be quite diverse, including ads (goods, services, …), social network posts, articles, verbal communications, screenshots, use case scenarios, videos, etc.
  • Quality expectations can be numerous and totally unrelated, especially for complex objects with an inner structure. For instance, requirements related to shipping or return policies for goods are not at all related to requirements for graphics and/or videos.
  • Quality criteria and measurement scales are often context-sensitive, and they may need to be customized for each area or vertical. For example, formal structure requirements are very typical when it comes to articles or product listings, but do not apply to screenshots.

I have come up with the following, relatively broad definition that works well for content and is even applicable to material objects.

The content needs to:

  • Attract your attention in the first place (Immediate Traction);
  • Produce an Excellent first impression when you interact with it (click the product, ad, or article link; start streaming a movie or a clip, talk to a person, etc.);
  • Keep your attention/interest long enough (Higher chances to sell), so that you read or watch to the end, purchase the product or service, reply to an ad, etc.
  • Present a picture of quality at both high (structural) and low (atomic) levels. Nothing should be missing or look or feel unprofessional, including overall organization and/or structure, design, formatting, tags, links, all other relevant technical areas (like video smoothness or clean audio tracks), grammar and spelling, terminology, etc.

The BIG QUESTION is whether it is possible to create a universal quality evaluation model that would produce genuinely useful, actionable results across content types. I posit the answer is affirmative. If this intrigues you, read on…


The QA model evolution – general principles

Let us first do a quick review of the general evaluation principles prior to presenting the new, universal model.

  • Approach the content as a consumer or general expert, not as a QA engineer, linguist, or any other specialist concentrating on a narrow scope.
  • Work with multiple diverse and unrelated factors in parallel. A generic product listing can have a perfect structure but fall short in language and/or visual areas.
  • Prepare to deal with holistic evaluations because most factors in play (content structure, informativeness, relevance or intelligibility, let alone new categories evolving for specific cases) simply cannot be evaluated analytically.
  • Prepare to develop well-defined, clear quality scales for each holistic category (or a whole set of these categories). Each score on this scale has a detailed definition which makes differentiating between GOOD, MODERATE, and BELOW MODERATE quality much easier (and considerably more objective). Without these scales, evaluations can easily become biased; their dependence on the personality and background of the reviewer immediately renders them useless.

IMPORTANT. Holistic evaluations do not mean that scores are arbitrary and/or highly subjective. I’ve covered this issue in detail in a previous article: Reliably Measuring Something That Isn't Completely Objective: The Quality Triangle Approach to Translation Quality Assurance.


  • Heavily customize the quality metric (quality categories, measurement scales and expectations) for each project or use case while sticking to the same general approach and metric structure. This is inevitable given the incredible diversity of available content.

Meet the Quality Polygon

The new, universal evaluation approach builds on the Quality Triangle foundation and takes the concept further.

  • For each project or subject matter area we create a highly customized, multifaceted metric that encompasses multiple holistic quality categories to help us evaluate various object properties. These properties can include such things as informativeness, intelligibility, structure, technical quality, quality of video and graphics, etc.
  • Each object (article, product listing, ad, webpage, document, paragraph or even string, etc.) is evaluated against ALL holistic categories included in the metric.

IMPORTANT. We need to treat each independent evaluation separately, and always look at the entire picture to make proper decisions, setting an acceptance threshold for each individual category. Perfect structure cannot compensate for poor image quality or lack of clarity in shipping or return policies. This is a fundamental concept, especially for complex objects. There is no way to combine a multitude of independent evaluations into a single “all-in-one” score without losing most of the information. Over-simplifying everything will mean that we throw the baby out with the bathwater.


We need to accept that object quality is multidimensional. The Quality Polygon approach is a universal solution.


IMPORTANT. Holistic evaluations do not prevent the inclusion of atomistic errors to the picture as they are encountered during evaluation. For example, we can evaluate product listings or ads against multiple holistic criteria. At the same time, for each object (product listing, ad, etc.) we also log all individual grammar, spelling, formatting, and other errors that would provide additional information about the object’s quality on the atomic level (Atomistic Quality rating). Good overall language quality does not automatically mean that there are no minor low-level issues such as dual spaces, improper capitalization, broken tags or links, etc.
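As a hedged sketch of this combination, an evaluation record could keep every holistic category score separate (no single combined score) while also logging atomistic issues encountered along the way. The field names, categories, and thresholds below are illustrative assumptions, not part of the formal model.

```python
from dataclasses import dataclass, field

@dataclass
class AtomisticIssue:
    location: str      # e.g. a sentence or string identifier
    issue_type: str    # e.g. "spelling", "broken tag", "double space"
    severity: str      # e.g. "minor", "major"

@dataclass
class PolygonEvaluation:
    object_id: str
    # category name -> holistic score (assumed 0..9 scale)
    holistic_scores: dict[str, int] = field(default_factory=dict)
    atomistic_issues: list[AtomisticIssue] = field(default_factory=list)

    def passes(self, thresholds: dict[str, int]) -> bool:
        # Every category must clear its own threshold; a perfect score
        # in one category cannot compensate for a failure in another.
        return all(
            self.holistic_scores.get(cat, 0) >= minimum
            for cat, minimum in thresholds.items()
        )

listing = PolygonEvaluation(
    object_id="SKU-1234",
    holistic_scores={"structure": 9, "informativeness": 8, "image_quality": 4},
)
listing.atomistic_issues.append(AtomisticIssue("title", "spelling", "minor"))
ok = listing.passes({"structure": 7, "informativeness": 7, "image_quality": 7})
# ok is False: excellent structure does not offset failing image quality.
```

The design choice matters: because scores never collapse into one number, a failing category remains visible, which is exactly the point of the Quality Polygon.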


Just as a reminder, the original Quality Triangle model combines two “fixed” holistic quality categories and an Atomistic Quality rating, as shown in Fig. 1 below.

Fig. 1. The Quality Triangle evaluation approach

The new multidimensional Quality Polygon model requires us to evaluate each object against multiple independent and unrelated quality criteria (categories). Compared to the original Quality Triangle model, the changes are as follows:

  • Holistic Quality Categories replace “traditional” error typologies
  • Holistic Quality Scales replace “traditional” severities
  • The Quality Triangle itself morphs into a Quality Polygon (see Fig. 2 below)

Fig. 2. The Quality Polygon evaluation approach

Applying the Quality Polygon model – Goal-oriented metrics

Alternative evaluation types

As mentioned earlier, the “traditional” quality approach generally answers a single question: How good/bad is the content/supplier performance? The Quality Polygon model requires goal-oriented customization that strongly depends on the evaluation type.

Three major types of evaluation are as follows:

  1. Decision-making/sorting evaluation. We go through the content and decide what to do with each object.
     • Goal – Approach – Result: Decision/Diagnosis (action suggestion) for EACH object.
     • Evaluation = Decision/Suggestion: Leave as is (everything fine), Proofread/edit, Fix graphics, Rework completely, etc.
  2. Evaluation for statistical analysis. We assess each object using multiple custom-selected categories, calculate stats for each category and/or score, and draw conclusions from the results.
     • Goal: Analyzing trends, revealing systemic issues, counting perfect (or terrible) units separately, etc.
     • Approach – Result: EACH object evaluated against multiple custom-selected criteria.
     • Evaluation categories: Informativeness, Relevance, Engagement, Intelligibility, Language, Structure, etc.
  3. Evaluation for AI training. In this scenario we evaluate content generated or improved by AI or MT and come up with actionable recommendations that engineers can use to improve or fine-tune engines. (AI will not read comments or fix errors.)
     • Goal – Result: Make the output useful to developers/engineers for engine improvement or fine-tuning.
     • Approach: Concentrate on limited areas and be very detailed and specific about issues. Feedback that is too generic is not actionable.
     • Evaluation categories: Word choice, Grammar, Spelling, Capitalization, Typography, etc.
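The two kinds of scales these evaluation types rely on — action recommendations for decision-making or AI-training reviews versus graded scores for statistical analysis — could be modelled as follows. The labels below are illustrative, not a fixed taxonomy.

```python
from enum import Enum

class Decision(Enum):
    """Decision-making/sorting evaluation: one recommendation per object."""
    LEAVE_AS_IS = "leave as is"
    PROOFREAD = "proofread / edit"
    FIX_GRAPHICS = "fix graphics"
    REWORK = "rework completely"

# Statistical evaluation instead uses graded scores,
# e.g. 0 (abhorrent) .. 9 (excellent).
QUALITY_SCORES = range(10)

def summarize(decisions: list[Decision]) -> dict[str, int]:
    """Count how many objects received each recommendation."""
    counts: dict[str, int] = {}
    for d in decisions:
        counts[d.value] = counts.get(d.value, 0) + 1
    return counts

batch = [Decision.LEAVE_AS_IS, Decision.PROOFREAD, Decision.LEAVE_AS_IS]
summary = summarize(batch)
```

Either scale plugs into the same per-category evaluation loop; only the value set changes with the evaluation goal.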

The resulting general Holistic Multidimensional Evaluation model is presented in Fig. 3 below. Holistic evaluation categories are selected based on the goal/evaluation type, content nature and subject matter area, etc. Quality scales comprise either a set of scores between abhorrent and excellent (evaluating quality for subsequent analysis) or a variety of action recommendations for each object (decision-making or improving AI).

Fig. 3. Holistic Multidimensional Evaluation model.

Practical applications of the Quality Polygon approach

This section is 100% based on actual projects that we have completed for multiple customers, with a wide variety of goals and approaches. It serves as a “storefront” that demonstrates how universal and flexible the Quality Polygon approach is, and what you can do with it. All client-specific details have been omitted for confidentiality purposes.

Simple case: Evaluating “traditional” original content

Goal: Evaluate original content created by humans and/or generated by AI.

Solution: Using 3 Holistic evaluation criteria combined with Atomic-level evaluation.

Holistic quality categories apply to reasonably sized pieces as a whole (titles/headings, sentences, strings, paragraphs, …)

  • Intelligibility. How easy to read, listen to, view, and understand the content is. Ideally, the content is unambiguous and well-presented.
  • Engagement. How appealing, clear, and easy to understand the content/communication is.
  • Relevance/Informativeness. How well the content corresponds to user expectations for a particular [product, service, etc.] category, how informative and structured the content is, and how well it agrees with other descriptions, if available.

Atomistic quality measures the abundance and severity of quality issues encountered at the “atomic” level throughout the content. These include more “local” issues, such as broken tags, incorrect links, grammar, locale, or spelling issues, double spaces, formatting, etc.

If the budget or allocated timeframe is insufficient, we can limit the evaluation to holistic factors only.

Applicability: The approach is applicable with minimal modifications to a whole spectrum of original content, including but not limited to:

  • Oral/video communications (sales pitches, speeches, customer interactions at banks, retail stores/chains, etc.)
  • Original materials, such as web content, marketing, technical documentation, scripts, etc.

Evaluating the work of the AI language improvement tool

Goal: Evaluate how substantially the tool improves original texts created by non-professionals. (AI is not just annoying CAPTCHAs…)

Solution:

  • Independently evaluate and then compare results for the original, human-edited, and AI-edited versions of all pieces.
  • Multidimensional, customized holistic evaluation only.
  • Evaluate two “traditional” holistic factors: Holistic Intelligibility + Holistic Adequacy (edited version vs. the original) for each piece.
  • Evaluate an extensive grid of specific language-related factors, such as Word choice, Grammar, Spelling, Capitalization, Typography, etc. This makes it possible to produce actionable results for the engineers to fine-tune the AI engine.
  • Holistic Quality levels used: Excellent – Good – Passable – etc.
  • Finally, we run cross-comparisons, analyze averages by quality category, etc.

Applicability: The approach is applicable with minimal structural modifications to a whole spectrum of projects targeted at improving AI output.

Blind MT evaluation of product listing headlines

Goal: Evaluate headlines in the target language independently, treating them like transcreated units. Source materials in English are provided for reference only; we are NOT assessing translation quality as such.

Solution:

  • Blind approach. Provided units represent a mix of items translated by humans or MT, and the reviewer does not know their origin.
  • All headlines in the target language are evaluated from the perspective of a reader/buyer who is a native speaker of the target language interested in similar products (but is NOT a linguist).
  • A quick Holistic-only review. We look at each translated product listing and assess two factors: Engagement (how attractive the headline is) and Relevance/informativeness (how well the headline corresponds to the actual product and product category). The latter is assessed by comparison with the original title and listing contents in English.

Applicability: The approach is applicable with minimal structural modifications to a whole spectrum of projects targeted at evaluating localized online store versions and MT output.

Assessing complete product listings (complex objects)

Goal: Assess complex objects (product listing) thoroughly, against a multitude of essential criteria, to obtain the overall multifaceted quality picture.

Solution:

  • Create a comprehensive set of essential quality categories for evaluating product listings. All in all, we created a metric comprised of 13 categories including Informativeness, Quality of art pieces and videos, Language, Appropriateness, Info on Shipping and Returns, etc.
  • Some quality categories had to be divided into multiple subcategories. For example, graphic evaluation comprises independently assessing graphic quality, relevance, and uniqueness…
  • Multiple evaluations for each article, one for each quality category/subcategory.
  • Evaluations vary from Excellent to Abhorrent (10 different quality scores between 0 and 9).
  • Each rating below Excellent is substantiated by a brief comment.
  • Stats are analyzed by product category/subcategory, quality category, score, etc.
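The final statistical step can be sketched as averaging scores per quality category across many evaluated listings. The category names and scores below are invented for illustration (the actual metric comprised 13 categories).

```python
from statistics import mean

# Illustrative data: three evaluated listings, each scored 0..9
# against a (truncated) set of quality categories.
evaluations = [
    {"informativeness": 8, "language": 9, "shipping_info": 5},
    {"informativeness": 6, "language": 7, "shipping_info": 9},
    {"informativeness": 7, "language": 8, "shipping_info": 7},
]

def averages_by_category(evals: list[dict[str, int]]) -> dict[str, float]:
    """Average each quality category independently across all objects."""
    categories = evals[0].keys()
    return {cat: round(mean(e[cat] for e in evals), 2) for cat in categories}

avg = averages_by_category(evaluations)
# Per-category averages reveal systemic weak spots (e.g. shipping info)
# that a single overall score would hide.
```

In practice the same grouping would also be run by product category, score band, and so on, as described above.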

Applicability: The approach is applicable with minimal structural modifications to a whole spectrum of projects targeted at evaluating content quality for complex objects, such as product listings (online marketplaces), personalized ads, articles, etc.

Summary

The Quality Polygon multidimensional holistic evaluation approach makes it possible to assess the quality of virtually any type of content, including original or translated content, written and oral communications, etc. However diverse the content, you can apply the same highly flexible and customizable model and obtain actionable results.

I strongly believe in open-source ideas and concepts, so this methodology is public and available for you to apply, including commercial applications. I would humbly ask that you remember to credit the original content to its rightful author when referencing it.

You will need to:

  • Formulate the evaluation goal, including a clear understanding of the type of objects you will be dealing with.
  • List all essential, independent quality factors that need to be assessed for each object within the given context. These will become Quality Categories. (Spoiler alert: It’s not as easy as it may seem… Each category needs to reflect a distinct perceived object quality in the eyes of the user or expert. Quite a few Quality Categories are required for complex objects, such as product listings. Too many Quality Categories create an unmanageable mess…)
  • Create one or more Quality Scales that will make it possible to uphold a certain level of objectivity and uniformity when each object is evaluated against each category. (You’ll need to clearly define each score and make sure that it is easy enough to distinguish between this score and the two adjacent ones.)
  • Set acceptance thresholds for each Quality Category (these depend on expectations).
  • You are ready to proceed with the evaluation!

The Catch

My fondness for sharing does not prevent me from keeping important details to myself and my company, Logrus IT. Details do matter! As already mentioned, to carry out a truly useful or actionable multidimensional holistic quality evaluation one needs to:

  • Create, select, and/or fine-tune quality criteria essential for the content in question.
  • Organize these criteria into a set of Quality Categories with clear and detailed definitions.
  • Define holistic evaluation or decision-making scales to easily differentiate between all shades of quality between Excellent and Abhorrent or cover all action recommendations, such as Leave as is, Rework completely, etc. Comprehensive, straightforward definitions of each score/recommendation are essential.
  • Write customized, well-structured instructions for reviewers that will define objectives, explain the metric and the process, establish expected productivity, etc.
  • Manage projects, including finding qualified resources, explaining how to use the metric, providing the full workflow and internal quality control, etc.
  • Automate the process, making it easy for reviewers to understand the metric, including Quality Category definitions and Quality Scale scores, access instructions and guidelines, and score each object against multiple criteria.
  • Process and present statistics and general conclusions or advice.

Luckily, Logrus IT already has all the know-how and automation required to help you. Our team is experienced in this approach, and we have already developed a fully automated, E2E Quality Portal solution.

  • We will create or fine-tune a set of metrics for your projects. This includes inventing or selecting essential Quality Categories specific to the evaluated content/objects and customizing Quality Scales.
  • Each metric will target its own context, such as evaluation of websites, marketing materials, instructions, use case scenarios, screenshots, Q&A databases, videos, online stores and marketplaces, social media and networks, personalized ads, written or oral communications, etc.
  • We already have a large set of ready-to-use solutions/building blocks for various industries and types of content and can adapt these blocks and solutions for new challenges quickly and efficiently.
  • We will provide a comprehensive automated workflow on the Logrus IT Quality Portal.

The portal supports creation and modification of custom multidimensional, hybrid quality metrics, custom quality categories and scales, error typologies, etc. It is simple to use and makes it possible to involve all parties, including the client, Logrus IT (PM, metric development, customizations), content reviewers (provided by the client or Logrus IT), and even content creators or translators, who can provide valuable feedback for arbitration purposes. You can find more information in the presentation: The Logrus IT Quality Portal.
