Speech Recognition and Text to Speech Technology

TMCnet - World's Largest Communications and Technology Community

October 30, 2006

Achieving Perfect TTS Intelligibility

By TMCnet Special Guest
Paolo Baggia, Leonardo Badino, Davide Bonardo, Paolo Massimino, Loquendo

(This paper gives an overview of the latest technological developments in TTS and demonstrates their flexibility and ease of use in a wide variety of application contexts.) Originally presented at the AVIOS Technology Symposium, SpeechTEK West 2006.
The key objective of TTS engines is expressive, life-like and intelligible speech. High-quality, expressive, multilingual Text-To-Speech technology provides new advanced features, including:
  • support of multiple languages;
  • rendering of text that mixes more than one language, using the same voice with TTS Mixed-Language Support;
  • achieving expressive TTS voices, i.e. synthetic voices with an emotional expression, just like humans;
  • customization of existing TTS voices – editing and adjusting timbre, rate and pitch to create new, tailored personas, in order to better fit customer needs;
  • complete control of audio sources (TTS Audio Mixer), to allow a high degree of synchronization between TTS speech and background music.
A second fundamental aspect is the role of a powerful development tool that is easy to use and offers a user-friendly introduction to the advanced features listed above.
Loquendo TTS is a flexible engine, efficient and platform-independent, based on multi-language external knowledge-bases. It performs Text-To-Speech conversion as a real-time “software-only” process. The number of channels that can be simultaneously served depends on CPU power and the chosen voice’s database size.
Various voices are available, consisting of labelled speech signals: the larger the speech database, the higher the voice quality. A compressed audio format is also available, supporting several bit rates so that disk space can be traded off against audio quality whenever needed.
Loquendo TTS is available both as a dynamic library (.DLL or .SO) for Windows, Linux and Solaris, and as a static library. It has been integrated into several embedded operating systems and is also compliant with Microsoft Speech SDK (SAPI) 4.0 and 5.0.

The Loquendo TTS engine supports W3C SSML 1.0 markup (to support VoiceXML-based speech applications), and accepts both ANSI and Unicode text formats. Flexibility is one of its important features: voice, language, audio format, user lexicon, etc., can be set at run-time, on a per-channel basis. APIs, SSML markup and control tags allow speech parameters such as speaking rate, pitch range and volume to be modified, and reading styles and pronunciation to be controlled (word-by-word, spelling, dates, phonetic input, etc.). In order to tailor speech output to the intended application, Loquendo TTS provides advanced user lexicons with context grammars and phonetic transcriptions for managing user exceptions, and exploits the corpus-based technique to allow domain-dependent acoustic add-ons to the base voices. Loquendo TTS is conceived as a multilingual system in which language-dependent knowledge is kept as separate as possible from the core algorithms. Together with its development tools, it can be viewed as a dynamic system, allowing incremental addition of new voices and languages.
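As a rough illustration of the SSML route, the following Python sketch builds a small SSML 1.0 document that sets speaking rate, pitch and volume and asks for a word to be spelled out. The element and attribute names (speak, prosody, say-as, interpret-as) come from the W3C SSML 1.0 specification; the helper name build_prompt, the sample text and the "characters" value for interpret-as are assumptions made for illustration, and no Loquendo API is called.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_prompt(text, spelled, lang="en-US", rate="slow", pitch="high", volume="loud"):
    """Return an SSML 1.0 string (hypothetical helper, not a Loquendo API).

    `text` is wrapped in a <prosody> element; `spelled` is wrapped in a
    <say-as> element that requests character-by-character reading.
    """
    # Serialize SSML elements without a namespace prefix.
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    prosody = ET.SubElement(
        speak,
        f"{{{SSML_NS}}}prosody",
        {"rate": rate, "pitch": pitch, "volume": volume},
    )
    prosody.text = text
    # "characters" is an assumed interpret-as value; engine support may vary.
    say_as = ET.SubElement(speak, f"{{{SSML_NS}}}say-as", {"interpret-as": "characters"})
    say_as.text = spelled
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_prompt("Welcome to the service.", "TTS"))
```

The resulting string could be passed to any SSML-aware engine; which prosody values and say-as types a given engine honours should be checked against its own documentation.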

Advanced Features of Loquendo TTS
Loquendo was the first player on the speech market to present a milestone in synthetic speech: Expressive TTS, and also to introduce the innovative features of Mixed-Language Support, and Audio Mixer integrated into the TTS engine.
1 - Expressive TTS
Loquendo was the first company to bring expressive synthetic speech to the market. Indeed, Loquendo TTS voices now come with a repertoire of "expressive cues", which enable TTS users to enliven their voice prompts. This is the first concrete attempt in the direction of expressive synthetic speech. Developing natural-sounding synthetic speech has been one of Loquendo's goals for several years; research has covered both language modeling aspects and techniques to make the TTS software efficient, without compromising the naturalness achieved through Unit Selection TTS.
Loquendo TTS now offers its customers the possibility of making their vocal messages even more lifelike and expressive. Just as in human conversation, expressive intention is conveyed through conventional formulas and interjections, which are pronounced with a natural and colourful intonation. As a result, the entire message becomes more expressive.
Loquendo TTS's repertoire of "expressive cues" contains conventional figures of speech, such as greetings and exclamations ("hello!", "oh no!", "I'm sorry!"), interjections ("Oh!", "Well!", "Hum", etc.) and paralinguistic events (e.g. breath, cough, laughter), which suggest expressive intention (to confirm, doubt, exclaim, thank, etc.). The same elements can appear in several variations to obtain the greatest degree of naturalness. Some phrases are available in different styles and intonations, from neutral to emphatic, from sad to amazed.
2 - Customized voices
Loquendo TTS provides a number of user controls allowing prosodic aspects of the output voice to be modified: varying its speaking rate and volume, raising or lowering the pitch. The timbre of the voice can also be modified, allowing new personas to be generated from a single base voice, e.g. a child’s voice from a female one, or an elderly voice, a cartoon-like character, etc.
3 - Phoneme Mapping for Mixed-Language Support
The Loquendo TTS system is designed around a multilingual modular architecture: a language-independent engine performs TTS conversion by applying language-specific functions and knowledge bases, available in separate dynamic libraries. This design principle allows switching between languages on the fly, and even mixing functions from different language libraries.
Such a flexible architecture makes it possible to realize the special feature of Foreign Pronunciation (Phoneme Mapping), which makes any Loquendo synthetic voice capable of reading foreign language words or sentences with a correct pronunciation, while keeping its native accent. This feature can be very useful in mixed-language documents, where changes occur frequently at sentence and phrase level, as in the reading of Internet content, e-mails, movie titles, proper names or addresses. The multilingual text is read without having to change voice at every language change.
Reading a multi-lingual text without changing voice can be achieved with two different approaches:
  1. Producing multilingual vocal databases created with bilingual or multilingual speakers capable of reading several languages with mother tongue quality.
  2. Applying the foreign language grapheme-to-phoneme transcription rules to the foreign text, and then mapping the transcribed phonemes onto those of the voice's native language in order to access its acoustic units.
Loquendo has applied both approaches, by building up some bilingual voices (e.g. the Castilian/Catalan voice) and by implementing the more general Foreign Pronunciation feature. While the first approach makes it possible to obtain "perfect pronunciation", it is however restricted to a small number of languages for a single voice, whereas the second one provides an approximate pronunciation, which is applicable to any foreign language. This doesn't mean that the first method is better than the second; on the contrary, in many real-life situations, the approximate approach is the most realistic, especially when the foreign text is embedded in a main language text. In such cases, a human speaker generally prefers to maintain his native-tongue phonological system. This choice is due to co-articulation, economy of effort and also to psychosocial factors, as adopting the correct pronunciation may be regarded as an undue sophistication and, as such, rejected in common usage.
The key point in the Loquendo TTS Foreign Pronunciation feature is the Phoneme Mapping algorithm. While the flexible Loquendo TTS architecture can provide phonetic transcriptions where each word is transcribed according to its language, a further step is required in order to obtain their pronunciation by a single-language voice. Phonemes that do not belong to the native phonological system of the voice must be replaced by the most similar sounds available in the voice acoustic database. To this end Loquendo TTS implements a quite general and language-independent algorithm intended to convert a string of foreign language phonemes (L2) into the closest native language phoneme string (L1).
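The idea of replacing a foreign phoneme with the most similar native one can be sketched as a nearest-neighbour search over articulatory features. The code below is a toy illustration only, not Loquendo's algorithm: the feature tuples, the tiny Italian-like inventory, the SAMPA-style symbols and the function names are all assumptions, and ties are simply resolved in favour of the earlier inventory entry.

```python
# Toy native (L1) inventory: symbol -> (class, feature, feature, feature).
# Order matters: on equal distance the earlier entry wins.
L1_INVENTORY = {
    "a": ("V", "open", "central", "unrounded"),
    "i": ("V", "close", "front", "unrounded"),
    "e": ("V", "mid", "front", "unrounded"),
    "o": ("V", "mid", "back", "rounded"),
    "u": ("V", "close", "back", "rounded"),
    "n": ("C", "voiced", "alveolar", "nasal"),
    "m": ("C", "voiced", "bilabial", "nasal"),
    "t": ("C", "voiceless", "dental", "stop"),
    "d": ("C", "voiced", "dental", "stop"),
    "k": ("C", "voiceless", "velar", "stop"),
    "g": ("C", "voiced", "velar", "stop"),
    "s": ("C", "voiceless", "alveolar", "fricative"),
    "f": ("C", "voiceless", "labiodental", "fricative"),
}

# Foreign (L2) phonemes that are missing from the L1 inventory.
L2_FEATURES = {
    "T": ("C", "voiceless", "dental", "fricative"),   # as in "thing"
    "I": ("V", "near-close", "front", "unrounded"),   # as in "thing"
    "N": ("C", "voiced", "velar", "nasal"),           # as in "thing"
}


def feature_distance(a, b):
    """Count the feature positions in which two phonemes differ."""
    return sum(x != y for x, y in zip(a, b))


def map_phonemes(l2_string, l2_features, l1_inventory):
    """Map each L2 phoneme to itself if native, else to the closest L1 phoneme."""
    mapped = []
    for phoneme in l2_string:
        if phoneme in l1_inventory:
            mapped.append(phoneme)
            continue
        target = l2_features[phoneme]
        # min() returns the first minimal candidate, i.e. inventory order breaks ties.
        closest = min(l1_inventory, key=lambda c: feature_distance(l1_inventory[c], target))
        mapped.append(closest)
    return mapped


if __name__ == "__main__":
    print(map_phonemes(["T", "I", "N", "k"], L2_FEATURES, L1_INVENTORY))
```

A production system would presumably use richer features, weights and context; the sketch only shows the shape of the L2-to-L1 conversion.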
The Foreign Pronunciation feature can be activated by explicitly marking the foreign text portion with a proprietary control tag. Alternatively, it can be activated by the Language Guesser facility, which automatically tries to detect the language of the text.
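How an automatic language guess might work can be shown with a very small heuristic: count how many common function words of each language occur in the text and pick the language with the most hits. This is only a sketch under stated assumptions; the word lists, language codes and the function name guess_language are illustrative and say nothing about how the Loquendo Language Guesser is actually implemented.

```python
import re

# Tiny, hand-picked function-word lists (assumed for illustration only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "at", "on"},
    "it": {"il", "la", "e", "è", "di", "che", "per", "un", "al"},
    "fr": {"le", "la", "et", "est", "de", "que", "pour", "un", "les"},
}


def guess_language(text):
    """Return the language code with the most function-word hits, or None if there are none."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("the movie is on at the cinema"))
    print(guess_language("il film è al cinema di sera"))
```

Short fragments such as proper names or movie titles often contain no function words at all, which is one reason an explicit control tag remains useful alongside automatic detection.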
4 - Audio Mixer & Stereo Capabilities
Loquendo Audio Mixer allows the definition of extremely high quality TTS prompts. The Audio Mixer is integrated inside the TTS engine; in this way, using simple control tags embedded in the input text, users can play audio files and music synchronized with the text pronounced by synthetic voices. Commands such as Mix, Play, Stop, Pause, Resume, Loop and Fade give users complete control of the audio sources. Thanks to explicit tags in the text, synchronization between audio files and speech is made easier, even if the text is modified in the future, without the cumbersome overhead of re-recording a set of prompts.
Every sound effect is treated as an independent track, with independent timeline, volume and sample rate. No more offline audio re-sampling is required: the sample rate frequency of the audio sources is automatically converted according to the frequency of the synthetic voice in use.
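The two steps described here, converting a track to the voice's sample rate and then adding it to the speech signal, can be sketched in a few lines of Python. The sketch assumes mono audio held as lists of floats in the range -1.0 to 1.0 and uses plain linear interpolation without anti-alias filtering; the function names and the gain and offset parameters are hypothetical and do not reflect the engine's internals.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (no anti-alias filter)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def mix(voice, track, track_gain=0.5, offset=0):
    """Add `track`, scaled by `track_gain`, into `voice` starting at sample `offset`."""
    length = max(len(voice), offset + len(track))
    out = [0.0] * length
    for i, sample in enumerate(voice):
        out[i] += sample
    for i, sample in enumerate(track):
        out[offset + i] += track_gain * sample
    # Hard-clip to the valid range so the sum cannot overflow.
    return [max(-1.0, min(1.0, sample)) for sample in out]


if __name__ == "__main__":
    voice_16k = [0.5] * 8                     # stand-in for synthesized speech at 16 kHz
    music_8k = [0.0, 1.0, 0.0, -1.0]          # stand-in for a music file at 8 kHz
    music_16k = resample_linear(music_8k, 8000, 16000)
    print(mix(voice_16k, music_16k, track_gain=0.5))
```

Because each track carries its own gain and start offset, the same music file can be reused under different prompts without editing the audio itself.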
Another new feature is the use of stereo audio: the user can define the desired balancing via control tags, and so cause a voice to move from left to right and from right to left.
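Moving a voice across the stereo field amounts to varying the left and right gains over time. The sketch below uses a constant-power pan law, a common choice in audio mixing that keeps perceived loudness roughly steady; it is an assumption for illustration, as the source does not state which pan law the engine applies, and the function names are hypothetical.

```python
import math


def pan_gains(pan):
    """Return (left_gain, right_gain) for pan in [-1.0, 1.0] using a constant-power law."""
    angle = (pan + 1.0) * math.pi / 4.0       # -1 -> 0, 0 -> pi/4, +1 -> pi/2
    return math.cos(angle), math.sin(angle)


def sweep_left_to_right(mono):
    """Turn a mono signal into stereo frames that move from full left to full right."""
    n = len(mono)
    frames = []
    for i, sample in enumerate(mono):
        pan = -1.0 + 2.0 * i / (n - 1) if n > 1 else 0.0
        left, right = pan_gains(pan)
        frames.append((sample * left, sample * right))
    return frames


if __name__ == "__main__":
    for left, right in sweep_left_to_right([1.0] * 5):
        print(f"L={left:.3f}  R={right:.3f}")
```

Reversing the sweep, or holding a fixed pan value, would give the right-to-left movement or a static balance respectively.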
Authoring TTS
Input text for TTS, or SSML markup, can be authored by different means but, as more powerful features were introduced, it became clear that a development tool would greatly simplify their extensive use. This section introduces the Loquendo TTS Director development tool and the recently released Lexicon Editor module.
Loquendo TTS Director is a multi-platform Java development tool, which supports users in designing effective prompts for their applications. This unique approach provides an intuitive, user-friendly environment in which to create lifelike, expressive, synthetic speech material. Combined with Loquendo’s unique expressive TTS technology, the new approach opens up a world of possibilities for every user.
Figure 2 shows a snapshot of the TTS Director graphical interface. Text is written in the edit box and interactively refined through a "listen & edit" procedure, which allows fine-tuning for even better TTS performance (Play/Stop buttons).
Figure 2 – Snapshot of TTS Director graphical interface
Prompt designers can effortlessly select the TTS voice and reading mode, set acoustic and prosodic parameters (i.e. sampling frequency and coding, pitch, speaking rate and volume), and save their edited prompts in both text and audio formats. With Loquendo TTS Director, the control tags can be typed directly in the text. The easy-to-use interface for prompt editing suggests a repertoire of "expressive cues" available for each voice. The list is structured according to intuitive linguistic categories, so that appropriate formulas can be rapidly and easily found.
Here are a few examples: the ControlTags menu provides a structured access to the available Loquendo TTS control tags, see Figure 3. The control tags are grouped according to their categories, so it is easy to select the one required. The categories are: Audio, Bookmarks, Language, Pronunciation, Prosody, Reading Modes and Voice. The selected tag is automatically inserted in the edit box, at the caret position which indicates the point at which text or graphics are to be inserted. If the control tag needs further data from the user, a wizard appears, or it is marked by a brightly coloured text in the edit box, asking for the details required. E.g.: \voice=<insert a valid voice name>
Figure 3 – Snapshot of Control Tags menu
The Effects menu is a guide to the advanced features of Expressive Cues and Plug-in lexicons, see Figure 4. If the selected voice is provided with special add-ons, this menu allows the selection of the desired effect. Using such formulas can make vocal messages lifelike and expressive. The linguistic formulas are listed in the SpeechActs submenu, according to intuitive linguistic categories. The paralinguistic events are accessible from the Extras submenu. The selected expression is directly inserted in the edit box. Each SpeechActs and Extras element is played when the mouse pointer passes over the loudspeaker icon, thus providing faster access to the effect required. The Plugin submenu allows the plug-in lexicons available for the current voice (e.g. SMS, e-mail) to be activated or deactivated.
Figure 4 – Snapshot of the Effects menu
Loquendo TTS can manage two kinds of language-dependent lexicon files for exception handling:
  1. Plug-in lexicons
  2. User lexicons
Plug-in lexicons are provided together with the Language Library to improve the ability of Loquendo TTS to read specific kinds of texts (e.g. SMS, e-mails) that may contain idiosyncratic words, abbreviations, emoticons, and so on. The available plug-in lexicons can be activated from the Effects menu (see Section 4.1), by a control tag inserted in the input text, or by the <lexicon> element in SSML. User lexicons are optional and are provided by the user. They should contain user exceptions and transcription rules.
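The effect of exception lexicons on the input text can be pictured as a lookup pass applied before synthesis. The sketch below is a deliberately simple model with assumed data: the lexicon contents, the whitespace tokenization, the rule that earlier lexicons take precedence, and the function name apply_lexicons are all illustrative and do not describe the Loquendo lexicon file format or its lookup order.

```python
# Assumed example entries: written form -> replacement text to be read aloud.
SMS_PLUGIN = {"c": "see", "u": "you", "l8r": "later", ":-)": "smiley"}
USER_LEXICON = {"l8r": "later on"}


def apply_lexicons(text, lexicons):
    """Replace each whitespace-separated token found in a lexicon.

    `lexicons` is a list of dicts; the first lexicon containing a token wins,
    and tokens found in no lexicon are passed through unchanged.
    """
    out = []
    for token in text.split():
        for lexicon in lexicons:
            if token in lexicon:
                out.append(lexicon[token])
                break
        else:
            out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    # Plug-in lexicon only, then user lexicon placed ahead of the plug-in.
    print(apply_lexicons("c u l8r :-)", [SMS_PLUGIN]))
    print(apply_lexicons("c u l8r :-)", [USER_LEXICON, SMS_PLUGIN]))
```

Activating or deactivating a plug-in lexicon then corresponds to adding it to, or removing it from, the list of active lexicons.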
Loquendo TTS User Pronunciation Lexicons can be created and modified by means of a Lexicon Editor, a graphical interface that helps in suggesting and listening to the pronunciation. This application can be used as a stand-alone tool, or can be activated by means of the Tools menu of the TTS Director.
When opening a lexicon file, the contents of the file are listed in the editor, as shown in Figure 5:
Figure 5 – Lexicon Editor graphical interface
The L and P icons stand for literal transcription and phonetic transcription, respectively. Double-click a lexicon entry in the list to edit it through the lexicon dialog box depicted in Figure 6.
Figure 6 – Editing a lexicon entry
The supported pronunciation alphabets[1] are: the Loquendo proprietary alphabet, the International Phonetic Alphabet (IPA), and SAMPA, including some SAMPA dialects used in car navigation systems.

[1] All these formats are supported by Loquendo TTS. The Lexicon Editor supports the proprietary alphabet only.
