How do I fine-tune speech recognition and synthesis with phonemes?
Last reviewed: 10/15/2021
HOW Article ID: H072121
The information in this article applies to:
- LexiconKit 7
Summary
Manage word pronunciations composed of phonemes with LexiconKit to improve speech recognition accuracy and speech synthesis clarity.
More Information
A lexicon is a collection of word pronunciations that a speech recognition engine (i.e., recognizer) uses to improve recognition accuracy and a speech synthesis engine (i.e., synthesizer) uses to enhance the quality of its pronunciation.
Lexicons play an important role in the accuracy of speech recognition. A recognizer consults its lexicons while recognizing speech: they hold the words that the recognizer understands and returns as recognized speech. Because it is impractical for a recognizer to store every possible word and context in a spoken language, you can improve recognition accuracy by extending its lexicon.
Lexicons play an important role in the quality of text-to-speech playback. A synthesizer uses lexicons to look up the pronunciation of a word and generate the appropriate speech sounds for it. For example, with a lexicon you can ensure that "record" is pronounced correctly both as a noun and as a verb.
Generating, Editing, Speaking, and Persisting Pronunciations
The LexiconKit management class is designed to give you flexibility while minimizing the programming necessary to manage lexicon word pronunciations. Lexicon word pronunciations are composed of phonemes, the basic units of sound. A set of phonemes is represented as an alphabet. Speech engine vendors have their own phoneme and pronunciation formats (i.e., alphabets) and may also support the International Phonetic Alphabet (IPA).
For example, the following table illustrates the differences in default pronunciations for the word tomato across speech engines.
IPA | Cepstral Swift | Microsoft SAPI 5 | Microsoft Universal Phone Set (UPS) |
---|---|---|---|
təme͡ito | t ah0 m ey1 t ow0 | t ax m ey t ow | T AX M EI T O |
LexiconKit handles the complexities of dealing with these differences for your application.
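To make the difference between alphabets concrete, here is a minimal Java sketch that rewrites the SAPI 5 pronunciation from the table into its UPS form. The class name `PhoneMapSketch` and the five-entry mapping are assumptions made for illustration: the mapping covers only the phones in the tomato row and is not LexiconKit's conversion logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PhoneMapSketch {
    // Hypothetical mapping, limited to the phones in the "tomato" row of the table.
    private static final Map<String, String> SAPI_TO_UPS = Map.of(
        "t", "T", "ax", "AX", "m", "M", "ey", "EI", "ow", "O");

    // Converts a space-separated SAPI phone string to UPS.
    // Throws if a phone is outside the illustrative mapping.
    public static String sapiToUps(String sapiPhones) {
        List<String> out = new ArrayList<>();
        for (String phone : sapiPhones.trim().split("\\s+")) {
            String ups = SAPI_TO_UPS.get(phone);
            if (ups == null) {
                throw new IllegalArgumentException("No mapping for phone: " + phone);
            }
            out.add(ups);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(sapiToUps("t ax m ey t ow")); // prints "T AX M EI T O"
    }
}
```

A real conversion needs the full phone tables published by each engine vendor, which is the kind of detail a lexicon management library hides from the application.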
Pronunciations may be declared in markup embedded in text.
<!-- This is an example of the IPA alphabet in W3C SSML -->
<phoneme alphabet="ipa" ph="təme͡ito">tomato</phoneme>
<!-- This is an example of the sapi alphabet in W3C SSML -->
<phoneme alphabet="x-microsoft-sapi" ph="t ax m ey t ow">tomato</phoneme>
<!-- This is an example of the UPS alphabet in W3C SSML -->
<phoneme alphabet="x-microsoft-ups" ph="T AX M EI T O">tomato</phoneme>
<!-- This is an example of the sapi alphabet in SAPI XML markup -->
<PRON SYM="t ax m ey t ow">tomato</PRON>
Create, edit, and test pronunciations in the Developer Workbench with SAPI and W3C SSML.
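If an application assembles this markup itself, the element can be built from its three parts. The following is a minimal Java sketch, assuming a hypothetical helper class `SsmlPhonemeSketch` (not part of LexiconKit); it produces a W3C SSML `<phoneme>` element and escapes the characters that are significant in XML.

```java
public class SsmlPhonemeSketch {
    // Escapes the characters that are significant in XML text and
    // double-quoted attribute values. '&' must be replaced first.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds a W3C SSML <phoneme> element from an alphabet name,
    // a phoneme string, and the text the phonemes stand for.
    public static String phonemeElement(String alphabet, String ph, String text) {
        return "<phoneme alphabet=\"" + escapeXml(alphabet)
             + "\" ph=\"" + escapeXml(ph) + "\">"
             + escapeXml(text) + "</phoneme>";
    }

    public static void main(String[] args) {
        // Reproduces the sapi-alphabet example for "tomato".
        System.out.println(phonemeElement("x-microsoft-sapi", "t ax m ey t ow", "tomato"));
    }
}
```

The alphabet names accepted in the `alphabet` attribute depend on the speech engine, so the value passed in should be one that the target engine documents.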
Pronunciations may be declared collectively in a lexicon document.
<?xml version="1.0" ?>
<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
alphabet="x-microsoft-sapi"
xml:lang="en-US"
version="1.0">
<!-- This is an example of phonemes in W3C .pls lexicon -->
<lexeme>
<grapheme>tomato</grapheme>
<phoneme>t ax m ey t ow</phoneme>
</lexeme>
</lexicon>
# Cepstral Lexicon
#
# Type word, space, and part of speech (currently not implemented, so use '0') to generate default pronunciation
# cepstral 0 k eh1 p s t r ah0 l
# This is an example of phonemes in a Cepstral lexicon
tomato 0 t ah0 m ey1 t ow0
Create, edit, and test lexicon pronunciations in the Developer Workbench.
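To inspect a W3C .pls lexicon outside the Developer Workbench, the document can be read with a standard XML parser. The following is a minimal Java sketch using only the JDK's DOM parser; the class name `PlsReaderSketch` is an assumption for illustration, and the sketch keeps only the first `<phoneme>` of each `<lexeme>`.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PlsReaderSketch {
    static final String PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon";

    // Returns grapheme -> first phoneme string for each <lexeme>, in document order.
    public static Map<String, String> readLexemes(String plsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // PLS elements live in a namespace
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(plsXml)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList lexemes = doc.getElementsByTagNameNS(PLS_NS, "lexeme");
            for (int i = 0; i < lexemes.getLength(); i++) {
                Element lexeme = (Element) lexemes.item(i);
                NodeList graphemes = lexeme.getElementsByTagNameNS(PLS_NS, "grapheme");
                NodeList phonemes = lexeme.getElementsByTagNameNS(PLS_NS, "phoneme");
                if (graphemes.getLength() > 0 && phonemes.getLength() > 0) {
                    // Trim so stray whitespace inside the elements is ignored.
                    result.put(graphemes.item(0).getTextContent().trim(),
                               phonemes.item(0).getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse lexicon", e);
        }
    }

    public static void main(String[] args) {
        String pls = "<?xml version=\"1.0\"?>"
            + "<lexicon xmlns=\"" + PLS_NS + "\" alphabet=\"x-microsoft-sapi\""
            + " xml:lang=\"en-US\" version=\"1.0\">"
            + "<lexeme><grapheme>tomato</grapheme><phoneme>t ax m ey t ow</phoneme></lexeme>"
            + "</lexicon>";
        System.out.println(readLexemes(pls)); // prints {tomato=t ax m ey t ow}
    }
}
```

The phoneme strings returned are in whatever alphabet the lexicon's `alphabet` attribute names, so a caller that mixes lexicons should read that attribute as well.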
Applications can manage phonemes with LexiconKit. To generate a lexicon word pronunciation, pass the word, the word type (i.e., part of speech), language, and alphabet to the synthesizer's GeneratePhonemes method. To speak a lexicon word pronunciation, pass the phonemes, language, and alphabet to its SpeakPhonemes method.
// Instantiate LexiconKit
NLexiconKit _LexiconKit = new NLexiconKit();
if (_LexiconKit != null)
{
// Set license properties
//_LexiconKit.SetLicense("LicenseRegistrationNumber", "LicenseSerialNumber");
// Else, for evaluation, set only evaluation serial number
_LexiconKit.SetLicense(string.Empty, "EvaluationSerialNumber");
NSAPI5Synthesizer _Synthesizer = _LexiconKit.CreateSAPI5Synthesizer();
if (_Synthesizer != null)
{
string phonemes = _Synthesizer.GeneratePhonemes("tomato", "Noun", "en-US", "sapi");
_Synthesizer.SpeakPhonemes(phonemes, "en-US", "sapi");
_Synthesizer.Dispose();
}
_LexiconKit.Dispose();
}
// Instantiate LexiconKit object
CLexiconKit* _LexiconKit = new CLexiconKit();
if (_LexiconKit != NULL)
{
// Set license properties
//_LexiconKit->SetLicense(L"LicenseRegistrationNumber", L"LicenseSerialNumber");
// Else, for evaluation, set only evaluation serial number
_LexiconKit->SetLicense(L"", L"EvaluationSerialNumber");
// Create synthesizer
CSAPI5Synthesizer* _Synthesizer = _LexiconKit->CreateSAPI5Synthesizer();
if (_Synthesizer != NULL)
{
wchar_t* phonemes = _Synthesizer->GeneratePhonemes(L"tomato", L"Noun", L"en-US", L"sapi");
_Synthesizer->SpeakPhonemes(phonemes, L"en-US", L"sapi");
delete _Synthesizer;
}
delete _LexiconKit;
}
// Instantiate LexiconKit object
CLexiconKit* _LexiconKit = new CLexiconKit();
if (_LexiconKit != NULL)
{
// Set license properties
//_LexiconKit->SetLicense("LicenseRegistrationNumber", "LicenseSerialNumber");
// Else, for evaluation, set only evaluation serial number
_LexiconKit->SetLicense("", "EvaluationSerialNumber");
// Create synthesizer
CSAPI5Synthesizer* _Synthesizer = _LexiconKit->CreateSAPI5Synthesizer();
if (_Synthesizer != NULL)
{
String phonemes = _Synthesizer->GeneratePhonemes("tomato", "Noun", "en-US", "sapi");
_Synthesizer->SpeakPhonemes(phonemes, "en-US", "sapi");
delete _Synthesizer;
}
delete _LexiconKit;
}
var
_LexiconKit: TLexiconKit;
_Synthesizer: TSAPI5Synthesizer;
phonemes: string;
begin
// Instantiate LexiconKit object
_LexiconKit := TLexiconKit.Create();
if (_LexiconKit <> nil) then
begin
// Set license properties
//_LexiconKit.SetLicense('LicenseRegistrationNumber', 'LicenseSerialNumber');
// Else, for evaluation, set only evaluation serial number
_LexiconKit.SetLicense('', 'EvaluationSerialNumber');
// Create synthesizer
_Synthesizer := _LexiconKit.CreateSAPI5Synthesizer();
if (_Synthesizer <> nil) then
begin
phonemes := _Synthesizer.GeneratePhonemes('tomato', 'Noun', 'en-US', 'sapi');
_Synthesizer.SpeakPhonemes(phonemes, 'en-US', 'sapi');
_Synthesizer.Destroy();
end;
_LexiconKit.Destroy();
end;
end;
JLexiconKit _LexiconKit = new JLexiconKit();
if (_LexiconKit != null)
{
// Set license properties
//_LexiconKit.setLicense("LicenseRegistrationNumber", "LicenseSerialNumber");
// Else, for evaluation, set only evaluation serial number
_LexiconKit.setLicense("", "EvaluationSerialNumber");
JSAPI5Synthesizer _Synthesizer = _LexiconKit.createSAPI5Synthesizer();
if (_Synthesizer != null)
{
String phonemes = _Synthesizer.generatePhonemes("tomato", "Noun", "en-US", "sapi");
_Synthesizer.speakPhonemes(phonemes, "en-US", "sapi");
_Synthesizer.dispose();
}
_LexiconKit.dispose();
}
Dim _LexiconKit As NLexiconKit
Dim WithEvents _Synthesizer As NSAPI5Synthesizer
Dim phonemes As String
' Instantiate LexiconKit
_LexiconKit = New NLexiconKit()
If (_LexiconKit IsNot Nothing) Then
' Set license properties
'_LexiconKit.SetLicense("LicenseRegistrationNumber", "LicenseSerialNumber")
' Else, for evaluation, set only evaluation serial number
_LexiconKit.SetLicense(String.Empty, "EvaluationSerialNumber")
_Synthesizer = _LexiconKit.CreateSAPI5Synthesizer()
If (_Synthesizer IsNot Nothing) Then
phonemes = _Synthesizer.GeneratePhonemes("tomato", "Noun", "en-US", "sapi")
_Synthesizer.SpeakPhonemes(phonemes, "en-US", "sapi")
_Synthesizer.Dispose()
End If
_LexiconKit.Dispose()
End If
To provide alternate pronunciations, you may edit lexicon word pronunciations (i.e., change the phonemes) and add them to the lexicon.
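In a W3C .pls lexicon, alternate pronunciations can be recorded as additional `<phoneme>` elements inside the same `<lexeme>`. The following is a minimal Java sketch of that layout; the class name `AlternatePronunciationSketch` is hypothetical, and the second pronunciation ("t ax m aa t ow") is an illustrative edit of the default, not output from a speech engine.

```java
import java.util.List;

public class AlternatePronunciationSketch {
    // Builds a PLS <lexeme> with one <grapheme> and one <phoneme> per pronunciation.
    // Assumes the inputs contain no XML special characters (&, <, >).
    public static String lexeme(String grapheme, List<String> pronunciations) {
        StringBuilder sb = new StringBuilder();
        sb.append("<lexeme>\n");
        sb.append("  <grapheme>").append(grapheme.trim()).append("</grapheme>\n");
        for (String p : pronunciations) {
            sb.append("  <phoneme>").append(p.trim()).append("</phoneme>\n");
        }
        sb.append("</lexeme>");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Default pronunciation first, then the edited alternate.
        System.out.println(lexeme("tomato", List.of("t ax m ey t ow", "t ax m aa t ow")));
    }
}
```

How an engine chooses among multiple pronunciations is engine-specific, so test the resulting lexicon with the target recognizer or synthesizer.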
Speech engines (i.e., recognizers and synthesizers) support unique lexicon alphabets, formats, and approaches for runtime inclusion.
Speech API | Alphabets | File Format |
---|---|---|
Acapela TTS | ipa, acatts | .dic |
Cepstral Swift API | swift | .txt |
Microsoft (SAPI5, Speech Platform, WindowsMedia) | ipa, sapi, ups | W3C .pls |
All synthesizers support speaking phonemes. Acapela TTS synthesizers and Microsoft WindowsMedia recognizers and synthesizers currently do not support generating phonemes.
Use Chant Developer Workbench to edit Cepstral .txt and W3C .pls lexicons.
Once you have saved the lexicon containing lexicon word pronunciations, you can deploy it with your application and install the pronunciations on a target system for your application to load at runtime.
Acapela and Cepstral lexicons are loaded at runtime by SpeechKit. W3C .pls lexicons are included as part of text-to-speech (TTS) markup. See VoiceMarkupKit for details.