Last reviewed: 3/23/2024 10:15:25 AM
Synthesizing Speech
A speech synthesizer converts text to speech and produces audio bytes for playback or persistence. In addition, events are generated to indicate processing states.
The Microsoft Speech API (SAPI5) and WindowsMedia runtimes are part of Windows and give applications control over the playback and processing of a synthesizer's audio bytes and events. They optionally provide audio streaming playback and time-sequenced event posting. Microsoft includes synthesizers in many Windows SKUs.
Synthesizers from other speech technology vendors typically render the audio bytes and events but rely on applications and/or SAPI5 to handle playback and event processing. Some support the Microsoft runtime and all provide their own proprietary speech API with SDK and runtimes. See the section Recognizer and Synthesizer Installation for more information about speech technologies.
SpeechKit provides common speech synthesis management for multiple application scenarios across the various speech technology APIs by managing speech synthesis directly with the synthesizer.
SpeechKit includes libraries for the following Speech APIs for speech synthesis:
Speech API | Platforms |
Acapela TTS | x64, x86 |
Apple AVFoundation | ARM, x64, x86 |
Cepstral Swift | x64, x86 |
CereProc CereVoice | x64, x86 |
Google android.speech.tts | ARM |
Microsoft Azure Speech | ARM, x64, x86 |
Microsoft SAPI 5 | x64, x86 |
Microsoft Speech Platform | x64, x86 |
Microsoft .NET System.Speech | x64, x86 |
Microsoft .NET Microsoft.Speech | x64, x86 |
Microsoft WindowsMedia (UWP) | ARM, x64, x86 |
Microsoft WindowsMedia (WinRT) | x64, x86 |
Libraries for the most popular synthesizer speech APIs are included in Chant Developer Workbench. For additional libraries that support different APIs, runtimes, versions, and vendors, contact Chant Support.
SpeechKit supports speech synthesis with playback or persistence with a single request.
// Synthesize speech for playback
_Synthesizer.speak("Hello world.");
// Synthesize speech for playback
_Synthesizer.Speak("Hello world.");
// Synthesize speech for playback
_Synthesizer->Speak(L"Hello world.");
// Synthesize speech for playback
_Synthesizer->Speak("Hello world.");
// Synthesize speech for playback
_Synthesizer.Speak('Hello world.');
// Synthesize speech for playback
_Synthesizer.speak("Hello world.");
// Synthesize speech for playback
[_synthesizer speak:@"Hello world."];
// Synthesize speech for playback
_Synthesizer!.speak(text: "Hello world.")
' Synthesize speech for playback
_Synthesizer.Speak("Hello world.")
To track the progress or state of speech synthesis, the application processes event callbacks.
public class MainActivity extends AppCompatActivity implements com.speechkit.JChantSpeechKitEvents
// Set the callback object
// Register for callbacks
public void initComplete(Object o, TTSEventArgs ttsEventArgs)
if (_Synthesizer.getChantEngines() != null)
for (JChantEngine engine : _Synthesizer.getChantEngines())
// Add name to list
// Register Event Handler
_Synthesizer.WordPosition += Synthesizer_WordPosition;
private void Synthesizer_WordPosition(object sender, WordPositionEventArgs e)
{
    if (e != null)
    {
        int startPosition = e.Position;
        int wordLength = e.Length;
    }
}
// Register Event Handler
void CALLBACK WordPosition(void* Sender, CWordPositionEventArgs* Args)
{
    if (Args != NULL)
    {
        int startPosition = Args->GetPosition();
        int wordLength = Args->GetLength();
    }
}
// Register Event Handler
void WordPosition(void* Sender, CWordPositionEventArgs* Args)
{
    if (Args != NULL)
    {
        int startPosition = Args->GetPosition();
        int wordLength = Args->GetLength();
    }
}
// Register event handler
_Synthesizer.WordPosition := WordPosition;
procedure TForm1.WordPosition(Sender: TObject; Args: TWordPositionEventArgs);
var
  startPosition: Integer;
  wordLength: Integer;
begin
  if (Args <> nil) then
  begin
    startPosition := Args.Position;
    wordLength := Args.Length;
  end;
end;
public class Frame1 extends JFrame implements com.speechkit.JChantSpeechKitEvents
// Set the callback
// Register Callbacks for visual cues.
public void wordPosition(Object sender, WordPositionEventArgs args)
{
    if (args != null)
    {
        int startPosition = args.getPosition();
        int wordLength = args.getLength();
    }
}
// Set the callback
[_synthesizer setDelegate:(id<SPChantSynthesizerDelegate>)self];
-(void) rangeStart:(NSObject*)sender args:(SPRangeStartEventArgs*)args
{
    [[self textView1] setSelectedRange:NSMakeRange([args location], [args length])];
}
// Set the callback
_Synthesizer!.delegate = self
func rangeStart(sender: SPChantSynthesizer, args: SPRangeStartEventArgs) {
    self.textView1.selectedRange = NSRange(location: args.location, length: args.length)
}
Dim WithEvents _Synthesizer As NSAPI5Synthesizer = Nothing
Private Sub Synthesizer_WordPosition(ByVal sender As System.Object, ByVal e As WordPositionEventArgs) Handles _Synthesizer.WordPosition
Dim startPosition As Integer
Dim wordLength As Integer
If (e IsNot Nothing) Then
startPosition = e.Position
wordLength = e.Length
End If
End Sub
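The position and length values delivered by a word position callback can be used to locate the word currently being spoken, for example to highlight it in a text view. The following is a minimal sketch only. It assumes the position is a zero-based character offset into the text passed to Speak, and the class and method names (WordPositionSketch, wordAt) are hypothetical and not part of SpeechKit.

```java
// Hypothetical helper for illustration; not part of SpeechKit.
final class WordPositionSketch {

    /**
     * Returns the substring identified by a start position and length,
     * clamped to the bounds of the text. Assumes a zero-based character offset.
     */
    static String wordAt(String text, int startPosition, int wordLength) {
        if (text == null || startPosition < 0 || wordLength <= 0
                || startPosition >= text.length()) {
            return "";
        }
        // Use long arithmetic so a very large length cannot overflow.
        int end = (int) Math.min((long) text.length(), (long) startPosition + wordLength);
        return text.substring(startPosition, end);
    }

    public static void main(String[] args) {
        String text = "Hello world.";
        // Values as they might arrive in two successive callbacks (assumed).
        System.out.println(wordAt(text, 0, 5));
        System.out.println(wordAt(text, 6, 5));
    }
}
```

In an application, the returned range would typically drive a selection or highlight in the UI thread rather than console output.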
To control basic properties of how the synthesis occurs, some synthesizers support property settings. Review the vendor Speech API version-specific documentation for supported properties.
// No properties
// Set the speaking volume
// Set the speaking volume
// Set the speaking volume
// Set the speaking volume
// Set the speaking volume
// No properties
// No properties
' Set the speaking volume
Acapela TTS Properties
(Source: Acapela Group Acapela TTS Developer Guide)
Property | Value |
enginepath | Runtime library path. |
license | License file path. |
preset | Name of the equalizer preset. |
lexicon | A list of user lexicons. Each file name must be separated with a semicolon. |
pitch | The baseline pitch expressed in Hz. Man (110) Woman (180). Value ranges between 30 and 500. |
speed | The reading speed. Values are in percent of the default speed rate 100. Value ranges between 30 and 300. |
maxpitch | The maximum pitch allowed expressed in Hz. Value ranges between pitch and pitch * 2.5. |
minpitch | The minimum pitch allowed expressed in Hz. Value ranges between pitch / 5 and pitch. |
volume | A ratio percentage of the TTS output volume. Default value is 100. Value ranges 0 to 150. |
leadingsilence | The pause duration at the beginning of speaking in milliseconds. Default value is 50. Value ranges 20 to 5000. |
trailingsilence | The pause duration at the end of speaking in milliseconds. Default value is 500. Value ranges 20 to 5000. |
deviceid | The audio device identifier. |
readingmode | The way text is spoken: normal, word at a time, or letter at a time (spelling). |
pausepunct | The pause duration for period, exclamation point, and question mark. Value ranges from 0 to 5. |
pausesemicolon | The pause duration for semicolon. Value ranges from 0 to 5. |
pausecomma | The pause duration for comma and colon. Value ranges from 0 to 5. |
pausebracket | The pause duration for quote, braces, and brackets. Value ranges from 0 to 5. |
pausespell | The pause duration between letters in spell reading mode. Value ranges from 0 to 5. |
usefilter | Enable or disable equalizer use: 0 (no) or 1 (yes). |
filtervalue1 | A value corresponding to the attenuation for band 1 filter. Value ranges 0 to 200. |
filtervalue2 | A value corresponding to the attenuation for band 2 filter. Value ranges 0 to 200. |
filtervalue3 | A value corresponding to the attenuation for band 3 filter. Value ranges 0 to 200. |
filtervalue4 | A value corresponding to the attenuation for band 4 filter. Value ranges 0 to 200. |
vocaltract | The voice shaping (tone) expressed in percentage of the default value 100. Value ranges 50 to 150. |
audioboostpreemph | Controls the emphasis of medium and high frequencies. The default value is 0. Value ranges 0 to 90. |
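Several of the ranges in the table above depend on one another: minpitch and maxpitch are bounded relative to pitch. An application may want to check values before applying them. The following is a minimal sketch of such a check using only the ranges listed in the table. The class and method names are hypothetical, and how the synthesizer itself treats out-of-range values is not specified here.

```java
// Hypothetical range checks built from the table above; not part of SpeechKit or Acapela TTS.
final class AcapelaRangeSketch {

    static int clamp(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }

    // speed: percent of the default rate, 30 to 300.
    static int clampSpeed(int speed) {
        return clamp(speed, 30, 300);
    }

    // volume: ratio percentage, 0 to 150.
    static int clampVolume(int volume) {
        return clamp(volume, 0, 150);
    }

    // leadingsilence and trailingsilence: 20 to 5000 milliseconds.
    static int clampSilenceMs(int milliseconds) {
        return clamp(milliseconds, 20, 5000);
    }

    // pitch: 30 to 500 Hz; minpitch in [pitch / 5, pitch]; maxpitch in [pitch, pitch * 2.5].
    static boolean isValidPitchSet(double pitch, double minPitch, double maxPitch) {
        return pitch >= 30 && pitch <= 500
                && minPitch >= pitch / 5 && minPitch <= pitch
                && maxPitch >= pitch && maxPitch <= pitch * 2.5;
    }

    public static void main(String[] args) {
        System.out.println(clampSpeed(400));
        System.out.println(isValidPitchSet(110, 22, 275));
    }
}
```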
Cepstral Swift Properties
(Source: Cepstral Swift SDK documentation)
Property | Value |
configfile | Configuration file to use instead of the default. |
audiochannels | Number of audio channels [ 1 (mono) or 2 (stereo) ]. |
audiodeadair | Milliseconds of dead air (silence) to pad at the end of speech. |
audiopan | Left-to-right panning [ -1 = left, 0 = center, 1 = right ]. This must be used with audiochannels=2. |
audiovolume | Volume multiplication factor as a percentage. Default value is 100. |
lexicon | A lexicon.txt file in the voice directory. Review Cepstral Lexicon Editing with [LexiconKit](xref:lexiconkit "LexiconKit"). |
speechrate | Speaking rate (average WPM). |
voicedir | Directory for the voice. |
sfx | Special effects output chain file. |
CereProc CereVoice Properties
(Source: CereProc CereVoice SDK User Guide)
Property | Value |
enginepath | Runtime library path. |
voicepath | Voice files path. |
configfile | Configuration file path. |
license | License file path. |
rootcertificate | Root certificate file path. |
clientcrt | Client CRT file path. |
clientkey | Client key file path. |
Microsoft Azure Speech Properties
Property | Value |
devicename | The multimedia device ID that is used by the audio object. |
language | The voice language to use for synthesizers that support multiple languages. |
languages | One or more languages for which to enumerate available voices. |
speechkey | The Azure Speech Services key. |
speechregion | The Azure Speech Services region. |
Microsoft SAPI 5 Properties
(Source: Microsoft SAPI5 Help File)
Property | Value |
deviceid | The multimedia device ID that is used by the audio object. |
rate | The text rendering rate adjustment, which specifies the speaking rate of the voice. Supported values range from -10 to 10; values outside this range may be truncated. |
volume | The synthesizer output volume level of the voice in real time. Volume levels are specified in percentage values ranging from zero to 100. The default base volume for all voices is 100. |
Speech SDK Helpers
The Speak method supports two optional Speech API properties, options and output format, that are defined by enumerated constants in the SDKs.
SpeechKit includes helper constants for these properties that map to the Speech API values.
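Option constants of this kind are usually bit flags, so "one or more" means the values are combined with a bitwise OR. The following is a minimal sketch of that idea. The numeric values mirror the SAPI 5 SPEAKFLAGS enumeration as commonly documented and should be confirmed against the SDK header. The class and method names are hypothetical, and the SpeechKit helper constant names themselves are not shown in this section.

```java
// Illustration of combining bit-flag options; not the SpeechKit helper API.
final class SpeakFlagsSketch {

    // Values assumed from the SAPI 5 SPEAKFLAGS enumeration; verify against the SDK.
    static final int SPF_DEFAULT = 0;
    static final int SPF_ASYNC = 1;
    static final int SPF_PURGEBEFORESPEAK = 2;
    static final int SPF_IS_FILENAME = 4;
    static final int SPF_IS_XML = 8;

    /** Combines any number of flags into a single options value. */
    static int combine(int... flags) {
        int options = SPF_DEFAULT;
        for (int flag : flags) {
            options |= flag;
        }
        return options;
    }

    /** Returns true when every bit of a non-zero flag is present in the options value. */
    static boolean isSet(int options, int flag) {
        return flag != 0 && (options & flag) == flag;
    }

    public static void main(String[] args) {
        int options = combine(SPF_ASYNC, SPF_IS_XML);
        System.out.println(options);
        System.out.println(isSet(options, SPF_PURGEBEFORESPEAK));
    }
}
```

The combined value is what would be passed as the options argument of a speak request, alongside an output format constant where one is needed.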
Acapela TTS
The options value may be specified with one or more ACATTSSPEAKFLAGS constants:
The output value may be specified with one or more ACATTSWAVFORMATFLAGS constants:
Cepstral Swift
The options value may be specified with one or more SWIFTSPEAKFLAGS constants:
The output format value may be one of the voice supported SPSTREAMFORMAT constants:
- SPSF_Default
- SPSF_8kHz8BitMono
- SPSF_8kHz8BitStereo
- SPSF_8kHz16BitMono
- SPSF_8kHz16BitStereo
- SPSF_11kHz8BitMono
- SPSF_11kHz8BitStereo
- SPSF_11kHz16BitMono
- SPSF_11kHz16BitStereo
- SPSF_12kHz8BitMono
- SPSF_12kHz8BitStereo
- SPSF_12kHz16BitMono
- SPSF_12kHz16BitStereo
- SPSF_16kHz8BitMono
- SPSF_16kHz8BitStereo
- SPSF_16kHz16BitMono
- SPSF_16kHz16BitStereo
- SPSF_22kHz8BitMono
- SPSF_22kHz8BitStereo
- SPSF_22kHz16BitMono
- SPSF_22kHz16BitStereo
- SPSF_24kHz8BitMono
- SPSF_24kHz8BitStereo
- SPSF_24kHz16BitMono
- SPSF_24kHz16BitStereo
- SPSF_32kHz8BitMono
- SPSF_32kHz8BitStereo
- SPSF_32kHz16BitMono
- SPSF_32kHz16BitStereo
- SPSF_44kHz8BitMono
- SPSF_44kHz8BitStereo
- SPSF_44kHz16BitMono
- SPSF_44kHz16BitStereo
- SPSF_48kHz8BitMono
- SPSF_48kHz8BitStereo
- SPSF_48kHz16BitMono
- SPSF_48kHz16BitStereo
- SPSF_CCITT_ALaw_8kHzMono
- SPSF_CCITT_ALaw_8kHzStereo
- SPSF_CCITT_ALaw_11kHzMono
- SPSF_CCITT_ALaw_11kHzStereo
- SPSF_CCITT_ALaw_22kHzMono
- SPSF_CCITT_ALaw_22kHzStereo
- SPSF_CCITT_ALaw_44kHzMono
- SPSF_CCITT_ALaw_44kHzStereo
- SPSF_CCITT_uLaw_8kHzMono
- SPSF_CCITT_uLaw_8kHzStereo
- SPSF_CCITT_uLaw_11kHzMono
- SPSF_CCITT_uLaw_11kHzStereo
- SPSF_CCITT_uLaw_22kHzMono
- SPSF_CCITT_uLaw_22kHzStereo
- SPSF_CCITT_uLaw_44kHzMono
- SPSF_CCITT_uLaw_44kHzStereo
Microsoft Azure Speech
The output value may be specified with one or more MCSSTREAMFORMAT constants:
- Raw8Khz8BitMonoMULaw
- Riff16Khz16KbpsMonoSiren
- Audio16Khz16KbpsMonoSiren
- Audio16Khz32KBitRateMonoMp3
- Audio16Khz128KBitRateMonoMp3
- Audio16Khz64KBitRateMonoMp3
- Audio24Khz48KBitRateMonoMp3
- Audio24Khz96KBitRateMonoMp3
- Audio24Khz160KBitRateMonoMp3
- Raw16Khz16BitMonoTrueSilk
- Riff16Khz16BitMonoPcm
- Riff8Khz16BitMonoPcm
- Riff24Khz16BitMonoPcm
- Riff8Khz8BitMonoMULaw
- Raw16Khz16BitMonoPcm
- Raw24Khz16BitMonoPcm
- Raw8Khz16BitMonoPcm
- Ogg16Khz16BitMonoOpus
- Ogg24Khz16BitMonoOpus
- Raw48Khz16BitMonoPcm
- Riff48Khz16BitMonoPcm
- Audio48Khz96KBitRateMonoMp3
- Audio48Khz192KBitRateMonoMp3
- Ogg48Khz16BitMonoOpus
- Webm16Khz16BitMonoOpus
- Webm24Khz16BitMonoOpus
- Raw24Khz16BitMonoTrueSilk
- Raw8Khz8BitMonoALaw
- Riff8Khz8BitMonoALaw
- Webm24Khz16Bit24KbpsMonoOpus
- Audio16Khz16Bit32KbpsMonoOpus
- Audio24Khz16Bit48KbpsMonoOpus
- Audio24Khz16Bit24KbpsMonoOpus
- Raw22050Hz16BitMonoPcm
- Riff22050Hz16BitMonoPcm
- Raw44100Hz16BitMonoPcm
- Riff44100Hz16BitMonoPcm
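The constant names above encode a container prefix (Raw, Riff, Audio, Ogg, Webm) followed by the sample rate, written either in kilohertz (24Khz) or in hertz (22050Hz). When an application needs the sample rate, for example to configure its own playback buffer, it can be read from the name. The following is a minimal sketch under that naming assumption. The class and method names are hypothetical and not part of SpeechKit or the Azure Speech SDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the format names listed above.
final class AzureFormatNameSketch {

    // First run of digits followed by "Khz" or "Hz" is taken as the sample rate.
    private static final Pattern RATE = Pattern.compile("(\\d+)(Khz|Hz)");

    /** Returns the sample rate in hertz encoded in the format name. */
    static int sampleRateHz(String formatName) {
        Matcher m = RATE.matcher(formatName);
        if (!m.find()) {
            throw new IllegalArgumentException("No sample rate in: " + formatName);
        }
        int value = Integer.parseInt(m.group(1));
        return m.group(2).equals("Khz") ? value * 1000 : value;
    }

    /** Returns the text before the sample rate, such as Raw, Riff, Audio, Ogg, or Webm. */
    static String container(String formatName) {
        Matcher m = RATE.matcher(formatName);
        if (!m.find()) {
            throw new IllegalArgumentException("No sample rate in: " + formatName);
        }
        return formatName.substring(0, m.start());
    }

    public static void main(String[] args) {
        System.out.println(container("Riff24Khz16BitMonoPcm") + " " + sampleRateHz("Riff24Khz16BitMonoPcm"));
        System.out.println(container("Raw22050Hz16BitMonoPcm") + " " + sampleRateHz("Raw22050Hz16BitMonoPcm"));
    }
}
```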
Microsoft SAPI 5
The options value may be specified with one or more SPEAKFLAGS constants:
The output format value may be one of the SPSTREAMFORMAT constants:
- SPSF_Default
- SPSF_NoAssignedFormat
- SPSF_Text
- SPSF_NonStandardFormat
- SPSF_ExtendedAudioFormat
- SPSF_8kHz8BitMono
- SPSF_8kHz8BitStereo
- SPSF_8kHz16BitMono
- SPSF_8kHz16BitStereo
- SPSF_11kHz8BitMono
- SPSF_11kHz8BitStereo
- SPSF_11kHz16BitMono
- SPSF_11kHz16BitStereo
- SPSF_12kHz8BitMono
- SPSF_12kHz8BitStereo
- SPSF_12kHz16BitMono
- SPSF_12kHz16BitStereo
- SPSF_16kHz8BitMono
- SPSF_16kHz8BitStereo
- SPSF_16kHz16BitMono
- SPSF_16kHz16BitStereo
- SPSF_22kHz8BitMono
- SPSF_22kHz8BitStereo
- SPSF_22kHz16BitMono
- SPSF_22kHz16BitStereo
- SPSF_24kHz8BitMono
- SPSF_24kHz8BitStereo
- SPSF_24kHz16BitMono
- SPSF_24kHz16BitStereo
- SPSF_32kHz8BitMono
- SPSF_32kHz8BitStereo
- SPSF_32kHz16BitMono
- SPSF_32kHz16BitStereo
- SPSF_44kHz8BitMono
- SPSF_44kHz8BitStereo
- SPSF_44kHz16BitMono
- SPSF_44kHz16BitStereo
- SPSF_48kHz8BitMono
- SPSF_48kHz8BitStereo
- SPSF_48kHz16BitMono
- SPSF_48kHz16BitStereo
- SPSF_TrueSpeech_8kHz1BitMono
- SPSF_CCITT_ALaw_8kHzMono
- SPSF_CCITT_ALaw_8kHzStereo
- SPSF_CCITT_ALaw_11kHzMono
- SPSF_CCITT_ALaw_11kHzStereo
- SPSF_CCITT_ALaw_22kHzMono
- SPSF_CCITT_ALaw_22kHzStereo
- SPSF_CCITT_ALaw_44kHzMono
- SPSF_CCITT_ALaw_44kHzStereo
- SPSF_CCITT_uLaw_8kHzMono
- SPSF_CCITT_uLaw_8kHzStereo
- SPSF_CCITT_uLaw_11kHzMono
- SPSF_CCITT_uLaw_11kHzStereo
- SPSF_CCITT_uLaw_22kHzMono
- SPSF_CCITT_uLaw_22kHzStereo
- SPSF_CCITT_uLaw_44kHzMono
- SPSF_CCITT_uLaw_44kHzStereo
- SPSF_ADPCM_8kHzStereo
- SPSF_ADPCM_11kHzMono
- SPSF_ADPCM_11kHzStereo
- SPSF_ADPCM_22kHzMono
- SPSF_ADPCM_22kHzStereo
- SPSF_ADPCM_44kHzMono
- SPSF_ADPCM_44kHzStereo
- SPSF_GSM610_8kHzMono
- SPSF_GSM610_11kHzMono
- SPSF_GSM610_22kHzMono
- SPSF_GSM610_44kHzMono
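For the uncompressed PCM constants in the list above, the name gives enough information to estimate how much audio data a format produces: bytes per second = sample rate x (bits / 8) x channels. The following is a minimal sketch of that arithmetic. It assumes that the 11kHz, 22kHz, and 44kHz names denote 11025, 22050, and 44100 Hz, as is conventional for these SAPI formats. The class and method names are hypothetical, and compressed formats (A-law, u-law, ADPCM, GSM, TrueSpeech) are deliberately not handled.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for PCM SPSTREAMFORMAT names; not part of SpeechKit or SAPI.
final class SpsfByteRateSketch {

    private static final Pattern PCM_NAME =
            Pattern.compile("^SPSF_(\\d+)kHz(\\d+)Bit(Mono|Stereo)$");

    // Nominal kHz label to actual sample rate in Hz (assumed mapping).
    private static final Map<String, Integer> RATE_HZ = Map.of(
            "8", 8000, "11", 11025, "12", 12000, "16", 16000, "22", 22050,
            "24", 24000, "32", 32000, "44", 44100, "48", 48000);

    /** Returns PCM bytes per second, or empty when the name is not a PCM format. */
    static OptionalInt bytesPerSecond(String name) {
        Matcher m = PCM_NAME.matcher(name);
        if (!m.matches()) {
            return OptionalInt.empty();
        }
        Integer rate = RATE_HZ.get(m.group(1));
        if (rate == null) {
            return OptionalInt.empty();
        }
        int bytesPerSample = Integer.parseInt(m.group(2)) / 8;
        int channels = m.group(3).equals("Stereo") ? 2 : 1;
        return OptionalInt.of(rate * bytesPerSample * channels);
    }

    public static void main(String[] args) {
        System.out.println(bytesPerSecond("SPSF_22kHz16BitStereo"));
        System.out.println(bytesPerSecond("SPSF_CCITT_ALaw_8kHzMono"));
    }
}
```

Under these assumptions, SPSF_22kHz16BitStereo works out to 22050 x 2 x 2 = 88200 bytes per second.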
The output format value may be one of the voice supported SPSTREAMFORMAT constants:
- SPSF_Default
- SPSF_8kHz16BitMono (VE_FREQ_8KHZ)
- SPSF_8kHz16BitStereo (VE_FREQ_8KHZ)
- SPSF_11kHz16BitMono (VE_FREQ_11KHZ)
- SPSF_11kHz16BitStereo (VE_FREQ_11KHZ)
- SPSF_16kHz16BitMono (VE_FREQ_16KHZ)
- SPSF_16kHz16BitStereo (VE_FREQ_16KHZ)
- SPSF_22kHz16BitMono (VE_FREQ_22KHZ)
- SPSF_22kHz16BitStereo (VE_FREQ_22KHZ)