Text-to-Speech (TTS)

Text-to-Speech (TTS) is a technology that transforms text into audible speech. It is used in many fields, including voice assistants, navigation systems, e-learning courses, and video games.

The Advantages of Text-to-Speech

  • You don’t need to rent a recording studio, hire an audio director, or perform additional audio processing.
  • The cost isn’t based on the number of voices (i.e. actors).
  • Generating speech artificially takes less time than recording people in a studio: producing 50 hours of professionally processed speech takes about 40 hours of work (roughly 48 minutes per hour of audio).
  • Corrections are basically free and can be reinserted into synthesized speech much more easily than retakes recorded by an actor.
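
The time figures quoted in the list above (about 40 hours of work for 50 hours of finished speech) can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name is hypothetical and the inputs are the approximate figures from the text, not measured data.

```python
def production_minutes_per_audio_hour(production_hours: float, audio_hours: float) -> float:
    """Return the minutes of production work needed per hour of finished speech."""
    if audio_hours <= 0:
        raise ValueError("audio_hours must be positive")
    return production_hours / audio_hours * 60


# Approximate figures from the list above: 40 working hours for 50 hours of audio.
minutes = production_minutes_per_audio_hour(40, 50)
print(f"{minutes:.0f} minutes of work per hour of speech")
```

With these inputs the result is 48 minutes, which the text rounds to about fifty.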

Note: We don’t recommend using any form of TTS in big-budget games, videos, or ads with high requirements for sound quality, intonation, etc. Recording actors in a studio is still the best option for these projects.

Types of TTS

There are two types of TTS: standard TTS and neural TTS. Standard TTS is used when the voice quality doesn’t have to be high. In this case, incorrect word stress, unnatural intonation (or lack thereof), artificiality, and a certain “mechanical” quality are acceptable.

Neural TTS is a relatively new technology that transforms text into speech using a neural network. This makes it possible to achieve a more realistic imitation of a human voice: it sounds authentic, conveys simple emotions well, and avoids incorrect word stress. Only a professional can tell the difference between a voice generated by neural TTS and a human voice recorded in a studio.

Logrus IT has extensive experience working with TTS. We work with standard text-to-speech technology, as well as neural-network-based solutions.


Examples of Neural TTS

Audio samples: English (male, female); French (male, female).

Comparing the Efficiency of Standard TTS vs. Neural TTS

Neural TTS has a number of advantages over standard TTS. The most important is that sound engineers don’t have to manually configure stress for individual words, correct intonation, or add punctuation, because the neural network does all of this on its own. With this manual step gone, neural TTS produces finished speech more quickly.

Here’s an example. It takes about 200 hours to prepare 6,000 lines of text using standard TTS, but the same text can be transformed into speech in about 100 hours using neural TTS, which is half the time.
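
To make the comparison concrete, the sketch below works out the per-line effort and the overall saving from the example figures (200 hours versus 100 hours for 6,000 lines). The `Workflow` class and its field names are hypothetical helpers introduced only for this illustration, and the numbers are the approximate ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """Total effort for turning a fixed number of text lines into speech."""
    name: str
    total_hours: float
    lines: int

    @property
    def minutes_per_line(self) -> float:
        # Average working time spent on a single line of text.
        return self.total_hours * 60 / self.lines


# Approximate figures from the example in the text.
standard = Workflow("standard TTS", total_hours=200, lines=6000)
neural = Workflow("neural TTS", total_hours=100, lines=6000)

saved_hours = standard.total_hours - neural.total_hours
speedup = standard.total_hours / neural.total_hours

for workflow in (standard, neural):
    print(f"{workflow.name}: {workflow.minutes_per_line:.1f} minutes per line")
print(f"hours saved: {saved_hours:.0f}, speedup: {speedup:.1f}x")
```

On these figures, standard TTS averages about two minutes of work per line and neural TTS about one, which saves 100 hours over the whole project.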

If you still choose standard TTS, you’ll need to hire a highly qualified engineer with extensive knowledge of TTS technology to get good results. Multilingual projects require a second expert: a native speaker who advises the engineer. Ideally the engineer would also be a native speaker, but such a person is often hard to find, because engineers with TTS experience may be scarce or too expensive.

To improve how the speech sounds (by configuring realistic intonation and word stress), experts need to process every text manually.

Because of this, efficiency when using standard TTS technology is highly dependent on a number of factors, including:

  • the target language
  • the cost of hiring a native speaker who is also an expert
  • the voice selected
  • the uniformity of the text
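
To give a sense of what the manual processing described above can involve, here is a minimal sketch that marks up one sentence by hand. It assumes the engine accepts SSML (the W3C Speech Synthesis Markup Language); the text does not say which markup or tools are actually used, so the sentence, the IPA transcription, and the `build_ssml` helper are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Build a small SSML document with hand-set stress, a pause, and emphasis."""
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})
    speak.text = "Please "

    # Force the verb reading of "record" (stress on the second syllable).
    # The IPA string is an illustrative transcription, not taken from the text.
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": "rɪˈkɔːrd"})
    phoneme.text = "record"
    phoneme.tail = " your answer"

    # Insert a short pause where punctuation alone would not produce one.
    pause = ET.SubElement(speak, "break", {"time": "300ms"})
    pause.tail = " "

    # Ask the engine to stress the final word.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "now"
    emphasis.tail = "."

    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml()
print(ssml)
```

Every such tag has to be chosen and checked by a person, ideally a native speaker, which is why the effort grows with the amount and variety of the text.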

In the end, compared to standard TTS technology, neural TTS not only makes text-to-speech more cost-effective but also produces superior results.
