How do I fine-tune speech recognition and synthesis with phonemes?

Last reviewed: 10/15/2021

HOW Article ID: H072121

The information in this article applies to:

  • LexiconKit 7

Summary

Manage word pronunciations composed of phonemes with LexiconKit to enhance speech recognition accuracy and speech synthesis clarity.

More Information

A lexicon is a collection of word pronunciations that a speech recognition engine (i.e., recognizer) uses to improve recognition accuracy and a speech synthesis engine (i.e., synthesizer) uses to enhance the quality of its pronunciation.

Lexicons play an important role in the accuracy of speech recognition. A recognizer uses lexicons in the process of recognizing speech. Lexicons consist of the words that a recognizer understands and returns as recognized speech. Because it is impractical for a recognizer to maintain every possible word and context in its spoken language, you can improve recognition accuracy by extending its lexicon.

Lexicons play an important role in the quality of text-to-speech playback. A synthesizer uses lexicons to obtain the pronunciation information associated with a word and to generate the appropriate speech sounds for it. For example, with a lexicon you can ensure that "record" is pronounced correctly both as a noun and as a verb.

Generating, Editing, Speaking, and Persisting Pronunciations

The LexiconKit management class is designed to give you flexibility and minimize the programming necessary to manage lexicon word pronunciations. Lexicon word pronunciations are composed of phonemes, the basic units of sound. Collectively, phonemes are represented in an alphabet format. Speech engine vendors have their own phoneme and pronunciation formats (i.e., alphabets) and may also support the International Phonetic Alphabet (IPA).

For example, the following table illustrates the differences in default pronunciations for the word tomato across speech engines.

Alphabet                               Pronunciation
IPA                                    təme͡ito
Cepstral Swift                         t ah0 m ey1 t ow0
Microsoft SAPI 5                       t ax m ey t ow
Microsoft Universal Phone Set (UPS)    T AX M EI T O

LexiconKit handles the complexities of dealing with these differences for your application.

Pronunciations may be declared in markup embedded in text.


<!-- This is an example of the IPA alphabet in W3C SSML -->
<phoneme alphabet="ipa" ph="təme͡ito">tomato</phoneme>

<!-- This is an example of the SAPI alphabet in W3C SSML -->
<phoneme alphabet="x-microsoft-sapi" ph="t ax m ey t ow">tomato</phoneme>

<!-- This is an example of the UPS alphabet in W3C SSML -->
<phoneme alphabet="x-microsoft-ups" ph="T AX M EI T O">tomato</phoneme>

<!-- This is an example of the SAPI alphabet in SAPI 5 XML markup -->
<PRON SYM="t ax m ey t ow">tomato</PRON>
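
When an application assembles such markup at runtime, the alphabet, phoneme string, and word need XML escaping before they are placed in the element. The following is a minimal sketch in plain Java, assuming a hypothetical helper class named SsmlPhoneme that is not part of LexiconKit or any Chant API; it only builds the W3C SSML phoneme element shown above.

```java
// Hypothetical helper (not part of LexiconKit): builds a W3C SSML <phoneme> element.
final class SsmlPhoneme {
    private SsmlPhoneme() {}

    // Escape the characters that are unsafe in XML text and double-quoted attribute values.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // alphabet: e.g. "ipa" or "x-microsoft-sapi"; phonemes: the pronunciation; word: the spoken text.
    static String build(String alphabet, String phonemes, String word) {
        return "<phoneme alphabet=\"" + escapeXml(alphabet)
             + "\" ph=\"" + escapeXml(phonemes) + "\">"
             + escapeXml(word) + "</phoneme>";
    }

    public static void main(String[] args) {
        System.out.println(build("x-microsoft-sapi", "t ax m ey t ow", "tomato"));
    }
}
```

Building the element through one function keeps escaping in a single place, so a word such as "AT&T" does not produce malformed markup.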

Create, edit, and test pronunciations in the Developer Workbench with SAPI and W3C SSML.

SSML Editing: Edit Acapela TTS Tag, SAPI 5, and W3C Speech Synthesis Markup Language (SSML) faster with built-in intelliprompt that suggests valid markup syntax.

Pronunciations may be declared collectively in a lexicon document.


<?xml version="1.0" ?>
<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
            http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
         alphabet="x-microsoft-sapi"
         xml:lang="en-US"
         version="1.0">
    <!-- This is an example of phonemes in a W3C .pls lexicon -->
    <lexeme>
        <grapheme>tomato</grapheme>
        <phoneme>t ax m ey t ow</phoneme>
    </lexeme>
</lexicon>
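
To check what a .pls document like the one above contains, it can be read with the standard Java XML parser. This is a minimal sketch, assuming a hypothetical class named PlsReader that is unrelated to LexiconKit; it keeps only the first phoneme element of each lexeme and trims surrounding whitespace.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader (not a Chant API): maps each grapheme to its first phoneme string.
final class PlsReader {
    static final String PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon";

    static Map<String, String> read(String plsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // PLS elements live in the PLS namespace
            // Refuse DOCTYPE declarations so untrusted lexicons cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(plsXml)));

            Map<String, String> entries = new LinkedHashMap<>();
            NodeList lexemes = doc.getElementsByTagNameNS(PLS_NS, "lexeme");
            for (int i = 0; i < lexemes.getLength(); i++) {
                Element lexeme = (Element) lexemes.item(i);
                NodeList graphemes = lexeme.getElementsByTagNameNS(PLS_NS, "grapheme");
                NodeList phonemes = lexeme.getElementsByTagNameNS(PLS_NS, "phoneme");
                if (graphemes.getLength() == 0 || phonemes.getLength() == 0) {
                    continue; // skip incomplete entries
                }
                entries.put(graphemes.item(0).getTextContent().trim(),
                            phonemes.item(0).getTextContent().trim());
            }
            return entries;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable PLS document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
            + "<lexicon xmlns=\"" + PLS_NS + "\" alphabet=\"x-microsoft-sapi\""
            + " xml:lang=\"en-US\" version=\"1.0\">"
            + "<lexeme><grapheme>tomato</grapheme><phoneme>t ax m ey t ow</phoneme></lexeme>"
            + "</lexicon>";
        System.out.println(read(xml));
    }
}
```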

# Cepstral Lexicon
#
# Type word, space, and part of speech (currently not implemented, so use '0') to generate default pronunciation
# cepstral 0 k eh1 p s t r ah0 l
# This is an example of phonemes in a Cepstral lexicon
tomato 0 t ah0 m ey1 t ow0
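
The Cepstral entry above follows a word, part-of-speech, phonemes layout, with '#' marking comment lines. A minimal sketch of reading one such line follows, assuming that layout holds for every entry and using a hypothetical class named CepstralEntry that belongs to neither Chant nor Cepstral.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical value type for one Cepstral lexicon entry (not a Chant or Cepstral API).
final class CepstralEntry {
    final String word;
    final String partOfSpeech;
    final List<String> phonemes;

    private CepstralEntry(String word, String partOfSpeech, List<String> phonemes) {
        this.word = word;
        this.partOfSpeech = partOfSpeech;
        this.phonemes = Collections.unmodifiableList(new ArrayList<>(phonemes));
    }

    // Parses "word pos phoneme phoneme ..."; blank lines and '#' comments yield Optional.empty().
    static Optional<CepstralEntry> parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return Optional.empty();
        }
        String[] fields = trimmed.split("\\s+");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected 'word pos phonemes...': " + line);
        }
        List<String> phonemes = Arrays.asList(fields).subList(2, fields.length);
        return Optional.of(new CepstralEntry(fields[0], fields[1], phonemes));
    }

    public static void main(String[] args) {
        parse("tomato 0 t ah0 m ey1 t ow0")
            .ifPresent(e -> System.out.println(e.word + " -> " + e.phonemes));
    }
}
```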

Create, edit, and test lexicon pronunciations in the Developer Workbench.

PLS Lexicon Editing: Edit word pronunciations faster using XML with built-in intelliprompt that suggests valid syntax and with built-in phoneme generation and editing tool windows.
Cepstral Pronunciation Editing: Create and edit Cepstral default pronunciations faster by using the auto generation feature. Simply type the word, space, and 0 to generate the default pronunciation for editing.

Applications can manage phonemes with LexiconKit. To generate a lexicon word pronunciation, simply pass the word, the word type (i.e., part of speech), language, and alphabet to LexiconKit. To speak a lexicon word pronunciation, simply pass the phonemes, language, and alphabet to LexiconKit.


// C#: Instantiate LexiconKit
NLexiconKit _LexiconKit = new NLexiconKit();
if (_LexiconKit != null)
{
    // Set license properties
    //_LexiconKit.SetLicense("LicenseRegistrationNumber", "LicenseSerialNumber");
    // Else, for evaluation, set only evaluation serial number
    _LexiconKit.SetLicense(string.Empty, "EvaluationSerialNumber");
    NSAPI5Synthesizer _Synthesizer = _LexiconKit.CreateSAPI5Synthesizer();
    if (_Synthesizer != null)
    {
        string phonemes = _Synthesizer.GeneratePhonemes("tomato", "Noun", "en-US", "sapi");
        _Synthesizer.SpeakPhonemes(phonemes, "en-US", "sapi");
        _Synthesizer.Dispose();
    }
    _LexiconKit.Dispose();
}
        

// C++: Instantiate LexiconKit object
CLexiconKit* _LexiconKit = new CLexiconKit();
if (_LexiconKit != NULL)
{
    // Set license properties
    //_LexiconKit->SetLicense(L"LicenseRegistrationNumber", L"LicenseSerialNumber");
    // Else, for evaluation, set only evaluation serial number
    _LexiconKit->SetLicense(L"", L"EvaluationSerialNumber");
    // Create synthesizer
    CSAPI5Synthesizer* _Synthesizer = _LexiconKit->CreateSAPI5Synthesizer();
    if (_Synthesizer != NULL)
    {
        // Generate the default pronunciation, then speak it
        wchar_t* phonemes = _Synthesizer->GeneratePhonemes(L"tomato", L"Noun", L"en-US", L"sapi");
        _Synthesizer->SpeakPhonemes(phonemes, L"en-US", L"sapi");
        delete _Synthesizer;
    }
    delete _LexiconKit;
}
    

// C++Builder: Instantiate LexiconKit object
CLexiconKit* _LexiconKit = new CLexiconKit();
if (_LexiconKit != NULL)
{
    // Set license properties
    //_LexiconKit->SetLicense("LicenseRegistrationNumber", "LicenseSerialNumber");
    // Else, for evaluation, set only evaluation serial number
    _LexiconKit->SetLicense("", "EvaluationSerialNumber");
    // Create synthesizer
    CSAPI5Synthesizer* _Synthesizer = _LexiconKit->CreateSAPI5Synthesizer();
    if (_Synthesizer != NULL)
    {
        String phonemes = _Synthesizer->GeneratePhonemes("tomato", "Noun", "en-US", "sapi");
        _Synthesizer->SpeakPhonemes(phonemes, "en-US", "sapi");
        delete _Synthesizer;
    }
    delete _LexiconKit;
}

// Delphi
var
    _LexiconKit: TLexiconKit;
    _Synthesizer: TSAPI5Synthesizer;
    phonemes: string;
begin
    // Instantiate LexiconKit object
    _LexiconKit := TLexiconKit.Create();
    if (_LexiconKit <> nil) then
    begin
        // Set license properties
        //_LexiconKit.SetLicense('LicenseRegistrationNumber', 'LicenseSerialNumber');
        // Else, for evaluation, set only evaluation serial number
        _LexiconKit.SetLicense('', 'EvaluationSerialNumber');

        // Create synthesizer
        _Synthesizer := _LexiconKit.CreateSAPI5Synthesizer();

        if (_Synthesizer <> nil) then
        begin
          phonemes := _Synthesizer.GeneratePhonemes('tomato', 'Noun', 'en-US', 'sapi');
          _Synthesizer.SpeakPhonemes(phonemes, 'en-US', 'sapi');
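          // Free is the conventional way to release Delphi objects; it calls Destroy only when the reference is not nil
          _Synthesizer.Free;
        end;
        _LexiconKit.Free;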
    end;
end;
    

// Java: Instantiate LexiconKit object
JLexiconKit _LexiconKit = new JLexiconKit();
if (_LexiconKit != null)
{
    // Set license properties
    //_LexiconKit.setLicense("LicenseRegistrationNumber", "LicenseSerialNumber");
    // Else, for evaluation, set only evaluation serial number
    _LexiconKit.setLicense("", "EvaluationSerialNumber");

    // Create synthesizer
    JSAPI5Synthesizer _Synthesizer = _LexiconKit.createSAPI5Synthesizer();
    if (_Synthesizer != null)
    {
        // Generate the default pronunciation, then speak it
        String phonemes = _Synthesizer.generatePhonemes("tomato", "Noun", "en-US", "sapi");
        _Synthesizer.speakPhonemes(phonemes, "en-US", "sapi");
        _Synthesizer.dispose();
    }
    _LexiconKit.dispose();
}
    

Dim _LexiconKit As NLexiconKit
Dim WithEvents _Synthesizer As NSAPI5Synthesizer
Dim phonemes As String
' VB.NET: Instantiate LexiconKit
_LexiconKit = New NLexiconKit()
If (_LexiconKit IsNot Nothing) Then
    ' Set license properties
    '_LexiconKit.SetLicense("LicenseRegistrationNumber", "LicenseSerialNumber")
    ' Else, for evaluation, set only evaluation serial number
    _LexiconKit.SetLicense(String.Empty, "EvaluationSerialNumber")
    _Synthesizer = _LexiconKit.CreateSAPI5Synthesizer()
    If (_Synthesizer IsNot Nothing) Then
        phonemes = _Synthesizer.GeneratePhonemes("tomato", "Noun", "en-US", "sapi")
        _Synthesizer.SpeakPhonemes(phonemes, "en-US", "sapi")
        _Synthesizer.Dispose()
    End If
    _LexiconKit.Dispose()
End If

To provide alternate pronunciations, you may edit lexicon word pronunciations (i.e., change the phonemes) and add them to the lexicon.
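
In a W3C .pls lexicon, one way to record an alternate pronunciation is to list more than one phoneme element inside the same lexeme, which the PLS 1.0 specification permits. The sketch below only formats such an entry as text. The class name PlsLexemeWriter is hypothetical, the second pronunciation string is an illustrative assumption rather than output from any engine, and whether a given speech engine honors multiple phoneme elements is not confirmed by this article.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical formatter (not a Chant API): writes one PLS <lexeme> with alternate pronunciations.
final class PlsLexemeWriter {
    private PlsLexemeWriter() {}

    // Escape characters that are unsafe in XML element text.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Emits the grapheme followed by one <phoneme> element per pronunciation, in the order given.
    static String lexeme(String grapheme, List<String> pronunciations) {
        StringBuilder xml = new StringBuilder();
        xml.append("<lexeme>\n");
        xml.append("    <grapheme>").append(escapeXml(grapheme.trim())).append("</grapheme>\n");
        for (String pronunciation : pronunciations) {
            xml.append("    <phoneme>").append(escapeXml(pronunciation.trim())).append("</phoneme>\n");
        }
        xml.append("</lexeme>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // "t ax m aa t ow" is an assumed alternate used only for illustration.
        System.out.print(lexeme("tomato", Arrays.asList("t ax m ey t ow", "t ax m aa t ow")));
    }
}
```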

Speech engines (i.e., recognizers and synthesizers) support unique lexicon alphabets, formats, and approaches for runtime inclusion.

Speech API                                          Alphabets         File Format
Acapela TTS                                         ipa, acatts       .dic
Cepstral Swift API                                  swift             .txt
Microsoft (SAPI5, Speech Platform, WindowsMedia)    ipa, sapi, ups    W3C .pls

All synthesizers support speaking phonemes. Acapela TTS synthesizers and Microsoft WindowsMedia recognizers and synthesizers currently do not support generating phonemes.

Use Chant Developer Workbench to edit Cepstral .txt and W3C .pls lexicons.

Once you have saved the lexicon containing lexicon word pronunciations, you can deploy it with your application and install the pronunciations on a target system for your application to load at runtime.

Acapela and Cepstral lexicons are loaded at runtime by SpeechKit. W3C .pls lexicons are included as part of text-to-speech (TTS) markup. See VoiceMarkupKit for details.