Fine-tuning speech recognition and synthesis with phonemes

Last reviewed: 7/8/2022

HOW Article ID: H072221

The information in this article applies to:

  • LexiconKit 8

Summary

Manage word pronunciations composed of phonemes with LexiconKit to enhance speech recognition accuracy and speech synthesis clarity.

More Information

A lexicon is a collection of word pronunciations that a speech recognition engine (i.e., recognizer) uses to improve recognition accuracy and a speech synthesis engine (i.e., synthesizer) uses to enhance the quality of its pronunciation.

Lexicons play an important role in the accuracy of speech recognition. A recognizer uses lexicons in the process of recognizing speech: they consist of the words that the recognizer understands and returns as recognized speech. Because it is impractical for a recognizer to maintain every possible word and context in its spoken language, you can improve recognition accuracy by extending its lexicon.

Lexicons also play an important role in the quality of text-to-speech playback. A synthesizer uses lexicons to obtain the pronunciation information associated with a word and to generate the appropriate speech sounds for it. For example, with a lexicon you can ensure that "record" is pronounced correctly both as a noun and as a verb.

Generating, Editing, Speaking, and Persisting Pronunciations

The LexiconKit management class is designed to give you flexibility and to minimize the programming needed to manage lexicon word pronunciations. Lexicon word pronunciations are composed of phonemes, the basic units of sound. Phonemes are written in a phonetic alphabet. Speech engine vendors define their own phoneme and pronunciation formats (i.e., alphabets) and may also support the International Phonetic Alphabet (IPA).

For example, the following table illustrates the differences in default pronunciations for the word tomato across speech engines.

IPA        Cepstral Swift      Microsoft SAPI 5   Microsoft Universal Phone Set (UPS)
təme͡ito    t ah0 m ey1 t ow0   t ax m ey t ow     T AX M EI T O

LexiconKit handles the complexities of dealing with these differences for your application.

Pronunciations may be declared in markup embedded in text.


<!-- This is an example of the IPA alphabet in W3C SSML -->
<phoneme alphabet="ipa" ph="təme͡ito">tomato</phoneme>

<!-- This is an example of the SAPI alphabet in W3C SSML -->
<phoneme alphabet="x-microsoft-sapi" ph="t ax m ey t ow">tomato</phoneme>

<!-- This is an example of the UPS alphabet in W3C SSML -->
<phoneme alphabet="x-microsoft-ups" ph="T AX M EI T O">tomato</phoneme>

<!-- This is an example of the SAPI alphabet in SAPI 5 XML markup -->
<PRON SYM="t ax m ey t ow">W</PRON>
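
When an application assembles this markup at runtime, building the element through an XML API instead of string concatenation keeps quotes, ampersands, and IPA symbols correctly escaped. The following is a minimal sketch that uses only System.Xml.Linq from the .NET class library. It does not use LexiconKit, and the SsmlPhonemeSketch class and BuildPhoneme method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class SsmlPhonemeSketch
{
    // Hypothetical helper (not part of LexiconKit): builds a W3C SSML
    // <phoneme> fragment for one word. The element is created without a
    // namespace, matching the fragments shown above.
    public static XElement BuildPhoneme(string word, string phonemes, string alphabet)
    {
        return new XElement("phoneme",
            new XAttribute("alphabet", alphabet),
            new XAttribute("ph", phonemes),
            word);
    }

    public static void Main()
    {
        // Writes the element as XML text.
        Console.WriteLine(BuildPhoneme("tomato", "t ax m ey t ow", "x-microsoft-sapi"));
    }
}
```

The alphabet name is passed through unchanged, so the caller remains responsible for choosing an alphabet that the target engine supports.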

Create, edit, and test pronunciations in the Developer Workbench with SAPI and W3C SSML.

SSML Editing: Edit Acapela TTS Tag, SAPI 5, and W3C Speech Synthesis Markup Language (SSML) faster with built-in intelliprompt that suggests valid markup syntax.

Pronunciations may be declared collectively in a lexicon document.


<?xml version="1.0" ?>
<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
            http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
         alphabet="x-microsoft-sapi"
         xml:lang="en-US"
         version="1.0">
    <!-- This is an example of phonemes in a W3C .pls lexicon -->
    <lexeme>
        <grapheme>tomato</grapheme>
        <phoneme>t ax m ey t ow</phoneme>
    </lexeme>
</lexicon>
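
A lexicon document like the one above can also be produced programmatically. The following is a minimal sketch that uses only System.Xml.Linq; it does not use LexiconKit, and the PlsLexiconSketch class and Build method are hypothetical names introduced for illustration. It emits the required version, alphabet, and xml:lang attributes and one lexeme per word, and omits the optional schema location.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PlsLexiconSketch
{
    // Namespace of the W3C Pronunciation Lexicon Specification (PLS) 1.0.
    static readonly XNamespace Pls = "http://www.w3.org/2005/01/pronunciation-lexicon";

    // Hypothetical helper (not part of LexiconKit): builds a PLS document
    // from word/phoneme pairs, trimming stray whitespace from both.
    public static XDocument Build(
        IEnumerable<KeyValuePair<string, string>> pronunciations,
        string alphabet,
        string language)
    {
        var root = new XElement(Pls + "lexicon",
            new XAttribute("version", "1.0"),
            new XAttribute("alphabet", alphabet),
            new XAttribute(XNamespace.Xml + "lang", language),
            pronunciations.Select(p =>
                new XElement(Pls + "lexeme",
                    new XElement(Pls + "grapheme", p.Key.Trim()),
                    new XElement(Pls + "phoneme", p.Value.Trim()))));
        return new XDocument(new XDeclaration("1.0", "utf-8", null), root);
    }

    public static void Main()
    {
        var words = new Dictionary<string, string> { ["tomato"] = "t ax m ey t ow" };
        XDocument doc = Build(words, "x-microsoft-sapi", "en-US");
        // Save writes the XML declaration; ToString() would omit it.
        doc.Save(Console.Out);
    }
}
```

In an application, the phoneme strings would typically come from the generation step described later in this article rather than from literals.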

# Cepstral Lexicon
#
# Type word, space, and part of speech (currently not implemented, so use '0') to generate default pronunciation
# cepstral 0 k eh1 p s t r ah0 l
# This is an example of phonemes in a Cepstral lexicon
tomato 0 t ah0 m ey1 t ow0
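
Each entry in the Cepstral lexicon above is a single line: the word, the part of speech, and then the phonemes, separated by spaces. The following is a minimal sketch of reading such lines. It assumes only what the example shows (lines starting with '#' are comments, and the first two tokens are the word and the part of speech); it is not based on the Cepstral SDK or LexiconKit, and the CepstralLexiconSketch class and Parse method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CepstralLexiconSketch
{
    // Hypothetical helper (not part of LexiconKit or the Cepstral SDK).
    public static List<(string Word, string PartOfSpeech, string Phonemes)> Parse(
        IEnumerable<string> lines)
    {
        var entries = new List<(string Word, string PartOfSpeech, string Phonemes)>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            // Assumption: blank lines and lines starting with '#' carry no entry.
            if (line.Length == 0 || line[0] == '#') continue;

            string[] tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            // An entry needs a word, a part of speech, and at least one phoneme.
            if (tokens.Length < 3) continue;

            entries.Add((tokens[0], tokens[1], string.Join(" ", tokens.Skip(2))));
        }
        return entries;
    }

    public static void Main()
    {
        var lines = new[] { "# Cepstral Lexicon", "tomato 0 t ah0 m ey1 t ow0" };
        foreach (var entry in Parse(lines))
        {
            Console.WriteLine($"{entry.Word} [{entry.PartOfSpeech}]: {entry.Phonemes}");
        }
    }
}
```

Lines with fewer than three tokens are skipped rather than reported, which keeps the sketch short; a production reader would likely surface them as errors.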

Create, edit, and test lexicon pronunciations in the Developer Workbench.

PLS Lexicon Editing: Edit word pronunciations faster using XML with built-in intelliprompt that suggests valid syntax and with built-in phoneme generation and editing tool windows.
Cepstral Pronunciation Editing: Create and edit Cepstral default pronunciations faster by using the auto generation feature. Type the word, a space, and 0 to generate the default pronunciation for editing.

Applications can manage phonemes with LexiconKit. To generate a lexicon word pronunciation, pass the word, the word type (i.e., part of speech), the language, and the alphabet to LexiconKit. To speak a lexicon word pronunciation, pass the phonemes, the language, and the alphabet to LexiconKit.


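// Instantiate LexiconKit
NLexiconKit _LexiconKit = new NLexiconKit();
if (_LexiconKit != null)
{
    // Set credentials
    _LexiconKit.SetCredentials("Credentials");
    // Create a SAPI 5 synthesizer to generate and speak pronunciations
    NSAPI5Synthesizer _Synthesizer = _LexiconKit.CreateSAPI5Synthesizer();
    if (_Synthesizer != null)
    {
        // Generate the default pronunciation of the word as a noun
        string phonemes = _Synthesizer.GeneratePhonemes("tomato", "Noun", "en-US", "sapi");
        // Speak the generated pronunciation to verify it
        _Synthesizer.SpeakPhonemes(phonemes, "en-US", "sapi");
        // Release synthesizer resources
        _Synthesizer.Dispose();
    }
    // Release LexiconKit resources
    _LexiconKit.Dispose();
}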

To provide alternate pronunciations, you may edit lexicon word pronunciations (i.e., change the phonemes) and add them to the lexicon.

Speech engines (i.e., recognizers and synthesizers) support unique lexicon alphabets, formats, and approaches for runtime inclusion.

Speech API                                         Alphabets        File Format
Acapela TTS                                        ipa, acatts      .dic
Cepstral Swift API                                 swift            .txt
Microsoft (SAPI5, Speech Platform, WindowsMedia)   ipa, sapi, ups   W3C .pls

All synthesizers support speaking phonemes. Acapela TTS synthesizers and Microsoft WindowsMedia recognizers and synthesizers currently do not support generating phonemes.

Use Chant Developer Workbench to edit Cepstral .txt and W3C .pls lexicons.

Once you have saved a lexicon containing word pronunciations, you can deploy it with your application and install the pronunciations on a target system for your application to load at runtime.

Acapela and Cepstral lexicons are loaded at run time by SpeechKit. W3C .pls lexicons are included as part of text-to-speech (TTS) markup. See VoiceMarkupKit for details.
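
In W3C SSML, a document refers to an external pronunciation lexicon with a lexicon element placed directly under speak, before the text to be spoken. The following is a minimal sketch of building such a document with System.Xml.Linq. It does not use VoiceMarkupKit or LexiconKit; the SsmlLexiconSketch class, the BuildSpeak method, and the tomato.pls file name are hypothetical, and how a given engine resolves the URI is engine-specific.

```csharp
using System;
using System.Xml.Linq;

public static class SsmlLexiconSketch
{
    // Namespace of W3C SSML 1.0.
    static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";

    // Hypothetical helper (not part of VoiceMarkupKit): builds an SSML
    // document that references a PLS lexicon ahead of the text to speak.
    public static XElement BuildSpeak(string text, string lexiconUri, string language)
    {
        return new XElement(Ssml + "speak",
            new XAttribute("version", "1.0"),
            new XAttribute(XNamespace.Xml + "lang", language),
            new XElement(Ssml + "lexicon",
                new XAttribute("uri", lexiconUri),
                // Registered media type for PLS documents.
                new XAttribute("type", "application/pls+xml")),
            text);
    }

    public static void Main()
    {
        // "tomato.pls" is a placeholder, not a path defined by the product.
        Console.WriteLine(BuildSpeak("I like tomato soup.", "tomato.pls", "en-US"));
    }
}
```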