Last reviewed: 3/23/2024 9:17:54 AM

Generating, Editing, Speaking, and Persisting Pronunciations

The LexiconKit management class is designed to offer flexibility while minimizing the programming necessary to manage lexicon word pronunciations.

Pronunciations and Phonemes

Lexicon word pronunciations are composed of phonemes, the basic units of sound. Collectively, a set of phonemes is represented as an alphabet. Speech engine vendors define their own phoneme and pronunciation formats (i.e., alphabets) and may also support the International Phonetic Alphabet (IPA).

For example, the following table illustrates the differences in default pronunciations for the word tomato across speech engines.

Alphabet                               Pronunciation
IPA                                    təme͡ito
Cepstral Swift                         t ah0 m ey1 t ow0
Microsoft SAPI 5                       t ax m ey t ow
Microsoft Universal Phone Set (UPS)    T AX M EI T O
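
The same word therefore maps to a different phoneme string in each alphabet, so a pronunciation is only meaningful together with its alphabet name. A minimal sketch of such a per-alphabet lookup in Java (the alphabet identifiers follow the ones used later in this topic; the class name is illustrative):

```java
import java.util.Map;

public class PronunciationTable {
    // Default pronunciations of "tomato" keyed by alphabet name,
    // taken from the table above.
    static final Map<String, String> TOMATO = Map.of(
        "ipa",   "təme͡ito",
        "swift", "t ah0 m ey1 t ow0",
        "sapi",  "t ax m ey t ow",
        "ups",   "T AX M EI T O"
    );

    public static void main(String[] args) {
        // A pronunciation must always be paired with its alphabet.
        System.out.println(TOMATO.get("sapi")); // t ax m ey t ow
    }
}
```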

LexiconKit handles the complexities of dealing with these differences for applications.

Generating and Speaking Pronunciations

To generate a lexicon word pronunciation, pass the word, the word type (i.e., part of speech), language, and alphabet to LexiconKit. To speak a lexicon word pronunciation, pass the phonemes, language, and alphabet to LexiconKit.


// Instantiate LexiconKit
NLexiconKit _LexiconKit = new NLexiconKit();
if (_LexiconKit != null)
{
    // Set credentials
    _LexiconKit.SetCredentials("Credentials");
    // Create synthesizer
    NSAPI5Synthesizer _Synthesizer = _LexiconKit.CreateSAPI5Synthesizer();
    if (_Synthesizer != null)
    {
        string phonemes = _Synthesizer.GeneratePhonemes("tomato", "Noun", "en-US", "sapi");
        _Synthesizer.SpeakPhonemes(phonemes, "en-US", "sapi");
        _Synthesizer.Dispose();
    }
    _LexiconKit.Dispose();
}
    

// Instantiate LexiconKit object
CLexiconKit* _LexiconKit = new CLexiconKit();
if (_LexiconKit != NULL)
{
	// Set credentials
	_LexiconKit->SetCredentials(L"Credentials");
	// Create synthesizer
	CSAPI5Synthesizer* _Synthesizer = _LexiconKit->CreateSAPI5Synthesizer();
	if (_Synthesizer != NULL)
	{
        wchar_t* phonemes = _Synthesizer->GeneratePhonemes(L"tomato", L"Noun", L"en-US", L"sapi");
        _Synthesizer->SpeakPhonemes(phonemes, L"en-US", L"sapi");
        delete _Synthesizer;
    }
    delete _LexiconKit;
}
    

// Instantiate LexiconKit object
CLexiconKit* _LexiconKit = new CLexiconKit();
if (_LexiconKit != NULL)
{
    // Set credentials
    _LexiconKit->SetCredentials("Credentials");
    // Create synthesizer
    CSAPI5Synthesizer* _Synthesizer = _LexiconKit->CreateSAPI5Synthesizer();
    if (_Synthesizer != NULL)
    {
        String phonemes = _Synthesizer->GeneratePhonemes("tomato", "Noun", "en-US", "sapi");
        _Synthesizer->SpeakPhonemes(phonemes, "en-US", "sapi");
        delete _Synthesizer;
    }
    delete _LexiconKit;
}
    

var 
    _LexiconKit: TLexiconKit;
    _Synthesizer: TSAPI5Synthesizer;
    phonemes: string;
begin
    // Instantiate LexiconKit object
    _LexiconKit := TLexiconKit.Create();
    if (_LexiconKit <> nil) then
    begin
        // Set credentials
        _LexiconKit.SetCredentials('Credentials');
        // Create synthesizer
        _Synthesizer := _LexiconKit.CreateSAPI5Synthesizer();
        if (_Synthesizer <> nil) then
        begin
          phonemes := _Synthesizer.GeneratePhonemes('tomato', 'Noun', 'en-US', 'sapi');
          _Synthesizer.SpeakPhonemes(phonemes, 'en-US', 'sapi');
          _Synthesizer.Free();
        end;
        _LexiconKit.Free();
    end;
end;
    

// Instantiate LexiconKit object
JLexiconKit _LexiconKit = new JLexiconKit();
if (_LexiconKit != null)
{
	// Set credentials
	_LexiconKit.setCredentials("Credentials");
	JSAPI5Synthesizer _Synthesizer = _LexiconKit.createSAPI5Synthesizer();
	if (_Synthesizer != null)
	{
        String phonemes = _Synthesizer.generatePhonemes("tomato", "Noun", "en-US", "sapi");
        _Synthesizer.speakPhonemes(phonemes, "en-US", "sapi");
        _Synthesizer.dispose();
	}
    _LexiconKit.dispose();
}
    

Dim _LexiconKit As NLexiconKit
Dim WithEvents _Synthesizer As NSAPI5Synthesizer
Dim phonemes As String
' Instantiate LexiconKit
_LexiconKit = New NLexiconKit()
If (_LexiconKit IsNot Nothing) Then
    ' Set credentials
    _LexiconKit.SetCredentials("Credentials")
    _Synthesizer = _LexiconKit.CreateSAPI5Synthesizer()
    If (_Synthesizer IsNot Nothing) Then
        phonemes = _Synthesizer.GeneratePhonemes("tomato", "Noun", "en-US", "sapi")
        _Synthesizer.SpeakPhonemes(phonemes, "en-US", "sapi")
        _Synthesizer.Dispose()
    End If
    _LexiconKit.Dispose()
End If
    

All synthesizers support speaking phonemes. Acapela TTS synthesizers and Microsoft WindowsMedia recognizers and synthesizers currently do not support generating phonemes.

Editing Pronunciations

To provide alternate pronunciations, edit lexicon word pronunciations (i.e., change the phonemes) and add them to the lexicon.
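
For example, a US English SAPI pronunciation of tomato can be edited from the "long a" vowel to a "broad a" (tomayto to tomahto) by swapping a single phoneme. The sketch below is plain string editing in Java; the phoneme values are SAPI phonemes from the table earlier in this topic, the helper name is illustrative, and the step of adding the edited pronunciation to the lexicon uses engine-specific LexiconKit calls not shown here:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class EditPronunciation {
    // Replace one phoneme with another in a space-delimited phoneme string.
    static String replacePhoneme(String phonemes, String from, String to) {
        return Arrays.stream(phonemes.split(" "))
                     .map(p -> p.equals(from) ? to : p)
                     .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String original = "t ax m ey t ow";                   // "tomayto"
        String edited = replacePhoneme(original, "ey", "aa"); // "tomahto"
        System.out.println(edited); // t ax m aa t ow
    }
}
```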

Speech engines (i.e., recognizers and synthesizers) support unique lexicon alphabets, formats, and approaches for runtime inclusion.

Speech API                                          Alphabets                  File Format
Acapela TTS                                         ipa, acatts                .dic
Cepstral Swift API                                  swift                      .txt
Microsoft (SAPI5, Speech Platform, WindowsMedia)    ipa, sapi, ups             W3C .pls
Microsoft Azure Speech                              ipa, sapi, ups, x-sampa    W3C .pls

Alphabets

A speech engine may support one or more alphabets. This varies by speech language. To determine which phoneme alphabets are supported by an engine, enumerate the PhonemeAlphabets property. Use the applicable alphabet for generating, editing, and speaking phonemes.


foreach (string alphabet in _Synthesizer.PhonemeAlphabets)
{
    ...
}

for (int i = 0; i < _Synthesizer->GetPhonemeAlphabets()->GetCount(); i++)
{
    const wchar_t* alphabet = _Synthesizer->GetPhonemeAlphabets()->at(i).c_str();
    ...
}

for (int i = 0; i < _Synthesizer->GetPhonemeAlphabets()->Count; i++)
{
    String alphabet =  _Synthesizer->GetPhonemeAlphabets()[i];
    ...
}

var
    i: Integer;
    alphabet: string;
begin
    for i := 0 to _Synthesizer.PhonemeAlphabets.Count-1 do
    begin
        // Access the alphabet
        alphabet := _Synthesizer.PhonemeAlphabets[i];
        ...
    end;
end;

for (String alphabet : _Synthesizer.getPhonemeAlphabets())
{
    ...
}

For Each alphabet In _Synthesizer.PhonemeAlphabets
    ...
Next

In cases where a speech engine supports multiple spoken languages (e.g., Azure), set the alphabetlanguage property to the language value before enumerating the PhonemeAlphabets property and before generating, editing, and speaking phonemes.


// Set the synthesizer language
_Synthesizer.SetProperty("alphabetlanguage","de-DE");
    

// Set the synthesizer language
_Synthesizer->SetProperty(L"alphabetlanguage", L"de-DE");
    

// Set the synthesizer language
_Synthesizer->SetProperty(L"alphabetlanguage", L"de-DE");
    

// Set the synthesizer language
_Synthesizer.SetProperty('alphabetlanguage', 'de-DE');
    

// Set the synthesizer language
_Synthesizer.setProperty("alphabetlanguage", "de-DE");
    

' Set the synthesizer language
_Synthesizer.SetProperty("alphabetlanguage", "de-DE")
    

Acapela TTS Properties

(Source: Acapela Group Acapela TTS Developer Guide)

Property            Value
enginepath          Runtime library path.
license             License file path.
preset              Name of the equalizer preset.
lexicon             A list of user lexicons. Each file name must be separated with a semicolon.
pitch               The baseline pitch expressed in Hz. Man (110) Woman (180). Value ranges between 30 and 500.
speed               The reading speed. Values are in percent of the default speed rate 100. Value ranges between 30 and 300.
maxpitch            The maximum pitch allowed expressed in Hz. Value ranges between pitch and pitch * 2.5.
minpitch            The minimum pitch allowed expressed in Hz. Value ranges between pitch / 5 and pitch.
volume              A ratio percentage of the TTS output volume. Default value is 100. Value ranges 0 to 150.
leadingsilence      The pause duration at the beginning of speaking in milliseconds. Default value is 50. Value ranges 20 to 5000.
trailingsilence     The pause duration at the end of speaking in milliseconds. Default value is 500. Value ranges 20 to 5000.
deviceid            The audio device identifier.
readingmode         The way text is spoken: normal, word at a time, or letter at a time (spelling).
pausepunct          The pause duration for period, exclamation point, and question mark. Value ranges from 0 to 5.
pausesemicolon      The pause duration for semicolon. Value ranges from 0 to 5.
pausecomma          The pause duration for comma and colon. Value ranges from 0 to 5.
pausebracket        The pause duration for quote, braces, and brackets. Value ranges from 0 to 5.
pausespell          The pause duration between letters in spell reading mode. Value ranges from 0 to 5.
usefilter           Enable or disable equalizer use: 0 (no) or 1 (yes).
filtervalue1        A value corresponding to the attenuation for band 1 filter. Value ranges 0 to 200.
filtervalue2        A value corresponding to the attenuation for band 2 filter. Value ranges 0 to 200.
filtervalue3        A value corresponding to the attenuation for band 3 filter. Value ranges 0 to 200.
filtervalue4        A value corresponding to the attenuation for band 4 filter. Value ranges 0 to 200.
vocaltract          The voice shaping (tone) expressed in percentage of the default value 100. Value ranges 50 to 150.
audioboostpreemph   Controls the emphasis of medium and high frequencies. The default value is 0. Value ranges 0 to 90.
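
Many of these properties are range-limited (e.g., pitch 30 to 500, volume 0 to 150). One way to avoid passing out-of-range values to SetProperty is to clamp them first; a minimal sketch in Java, using ranges from the table above (the helper name is illustrative, and whether the engine itself truncates out-of-range values is engine-specific):

```java
public class PropertyRanges {
    // Clamp a requested property value into its documented range.
    static int clamp(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }

    public static void main(String[] args) {
        System.out.println(clamp(700, 30, 500)); // pitch request clamped to 500
        System.out.println(clamp(160, 0, 150));  // volume request clamped to 150
    }
}
```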

Cepstral Swift Properties

(Source: Cepstral Swift SDK documentation)

Property        Value
configfile      Configuration file to use instead of the default.
audiochannels   Number of audio channels [ 1 (mono) or 2 (stereo) ].
audiodeadair    Milliseconds of dead air (silence) to pad at the end of speech.
audiopan        Left-to-right panning [ -1 = left, 0 = center, 1 = right ]. This must be used with audiochannels=2.
audiovolume     Volume multiplication factor as a percentage. Default value is 100.
lexicon         A lexicon.txt file in the voice directory. Review Cepstral Lexicon Editing with LexiconKit.
speechrate      Speaking rate (average WPM).
voicedir        Directory for the voice.
sfx             Special effects output chain file.

CereProc CereVoice Properties

(Source: CereProc CereVoice SDK User Guide)

Property          Value
enginepath        Runtime library path.
voicepath         Voice files path.
configfile        Configuration file path.
license           License file path.
rootcertificate   Root certificate file path.
clientcrt         Client CRT file path.
clientkey         Client key file path.

Microsoft Azure Speech Properties

(Source: learn.microsoft.com)

Property           Value
alphabetlanguage   The voice language to use for synthesizers that support multiple languages from which to obtain supported phoneme alphabets.
devicename         The multimedia device ID that is used by the audio object.
language           The voice language to use for synthesizers that support multiple languages.
languages          One or more languages for which to enumerate available voices.
speechkey          The Azure Speech Services key.
speechregion       The Azure Speech Services region.

Microsoft SAPI 5 Properties

(Source: Microsoft SAPI5 Help File)

Property   Value
deviceid   The multimedia device ID that is used by the audio object.
rate       The current text rendering rate adjustment. Value specifying the speaking rate of the voice. Supported values range from -10 to 10; values outside this range may be truncated.
volume     The synthesizer output volume level of the voice in real time. Volume levels are specified in percentage values ranging from zero to 100. The default base volume for all voices is 100.

Persisting Pronunciations

Once the lexicon containing lexicon word pronunciations is saved, it can be deployed with applications and installed on a target system.

Acapela and Cepstral lexicons are loaded at run time by the SpeechKit class. See the speech synthesis Lexicons topic for details. W3C .pls lexicons are included as part of text-to-speech (TTS) markup. See VoiceMarkupKit for details.
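
As a reference point, a minimal W3C .pls lexicon persisting the IPA pronunciation of tomato from the table earlier in this topic might look like the following sketch. The element names and namespace follow the W3C Pronunciation Lexicon Specification 1.0; consult that specification for the full schema:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>təme͡ito</phoneme>
  </lexeme>
</lexicon>
```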