Last reviewed: 3/23/2024 10:15:25 AM

Synthesizing Speech

A speech synthesizer converts text to speech and produces audio bytes for playback or persistence. It also generates events that indicate processing states.

The Microsoft Speech API (SAPI 5) and WindowsMedia runtimes are Windows components that give applications control over the playback and processing of a synthesizer's audio bytes and events. They optionally provide audio streaming playback and time-sequenced event posting. Microsoft includes synthesizers in many Windows SKUs.

Synthesizers from other speech technology vendors typically render the audio bytes and events but rely on applications and/or SAPI 5 to handle playback and event processing. Some support the Microsoft runtimes, and all provide their own proprietary speech APIs with SDKs and runtimes. See the section Recognizer and Synthesizer Installation for more information about speech technologies.

SpeechKit provides common speech synthesis management across the various speech technology APIs for multiple application scenarios by managing speech synthesis directly with the synthesizer.

SpeechKit includes libraries for the following Speech APIs for speech synthesis:

Speech API                        Platforms
Acapela TTS                       x64, x86
Apple AVFoundation                ARM, x64, x86
Cepstral Swift                    x64, x86
CereProc CereVoice                x64, x86
Google android.speech.tts         ARM
Microsoft Azure Speech            ARM, x64, x86
Microsoft SAPI 5                  x64, x86
Microsoft Speech Platform         x64, x86
Microsoft .NET System.Speech      x64, x86
Microsoft .NET Microsoft.Speech   x64, x86
Microsoft WindowsMedia (UWP)      ARM, x64, x86
Microsoft WindowsMedia (WinRT)    x64, x86

Libraries for the most popular synthesizer speech APIs are included in Chant Developer Workbench. For additional libraries that support other APIs, runtimes, versions, and vendors, contact Chant Support.

SpeechKit supports speech synthesis with playback or persistence in a single request.


// Android Java: Synthesize speech for playback
_Synthesizer.speak("Hello world.");

// C#: Synthesize speech for playback
_Synthesizer.Speak("Hello world.");

// C++: Synthesize speech for playback
_Synthesizer->Speak(L"Hello world.");
    

// C++Builder: Synthesize speech for playback
_Synthesizer->Speak("Hello world.");

// Delphi: Synthesize speech for playback
_Synthesizer.Speak('Hello world.');
    

// Java: Synthesize speech for playback
_Synthesizer.speak("Hello world.");

// Objective-C: Synthesize speech for playback
[_synthesizer speak:@"Hello world."];

// Swift: Synthesize speech for playback
_Synthesizer!.speak(text: "Hello world.")

' VB: Synthesize speech for playback
_Synthesizer.Speak("Hello world.")
    

To know the progress or state of speech synthesis, the application processes event callbacks.


public class MainActivity extends AppCompatActivity implements com.speechkit.JChantSpeechKitEvents 
{
    ...
    // Set the callback object
    _Synthesizer.setChantSpeechKitEvents(this);
    // Register for callbacks
    _Synthesizer.registerCallback(ChantSpeechKitCallback.CCTTSInitComplete);
    ...
    @Override
    public void initComplete(Object o, TTSEventArgs ttsEventArgs)
    {
        if (_Synthesizer.getChantEngines() != null)
        {
            for (JChantEngine engine : _Synthesizer.getChantEngines())
            {
                // Add name to list
                _Engines.add(engine.getName());
            }
        }
        ...
    }
}
    

// C#: Register event handler
_Synthesizer.WordPosition += Synthesizer_WordPosition;
...
private void Synthesizer_WordPosition(object sender, WordPositionEventArgs e)
{
    if (e != null)
    {
        int startPosition = e.Position;
        int wordLength = e.Length;
        ...
    }
}
    

// C++: Register event handler
_Synthesizer->SetWordPosition(WordPosition);
...
void CALLBACK WordPosition(void* Sender, CWordPositionEventArgs* Args)
{
    if (Args != NULL) 
    {
        int startPosition = Args->GetPosition();
        int wordLength = Args->GetLength();
        ...
    }
}

// C++Builder: Register event handler
_Synthesizer->SetWordPosition(WordPosition);
...
void WordPosition(void* Sender, CWordPositionEventArgs* Args)
{
    if (Args != NULL) 
    {
        int startPosition = Args->GetPosition();
        int wordLength = Args->GetLength();
        ...
    }
}
    

// Register event handler
_Synthesizer.WordPosition := WordPosition;
...
procedure TForm1.WordPosition(Sender: TObject; Args: TWordPositionEventArgs);
var
  startPosition: Integer;
  wordLength: Integer;
begin
    if (Args <> nil) then
    begin
      startPosition := Args.Position;
      wordLength := Args.Length;
      ...
    end;
end;
    

public class Frame1 extends JFrame implements com.speechkit.JChantSpeechKitEvents
...
// Set the callback
_Synthesizer.setChantSpeechKitEvents(this);
// Register Callbacks for visual cues.
_Synthesizer.registerCallback(ChantSpeechKitCallback.CCTTSWordPosition);
...
public void wordPosition(Object sender, WordPositionEventArgs args)
{
    if (args != null)
    {
        int startPosition = args.getPosition();
        int wordLength = args.getLength();
        ...
    }
}

// Set the callback
[_synthesizer setDelegate:(id<SPChantSynthesizerDelegate>)self];
...
-(void) rangeStart:(NSObject*)sender args:(SPRangeStartEventArgs*)args
{
    [[self textView1] setSelectedRange:NSMakeRange([args location], [args length])];
}

// Set the callback
_Synthesizer!.delegate = self
...
func rangeStart(sender: SPChantSynthesizer, args: SPRangeStartEventArgs)
{
    self.textView1.selectedRange = (NSRange(location: args.location, length: args.length))
}
    

Dim WithEvents _Synthesizer As NSAPI5Synthesizer = Nothing
...
Private Sub Synthesizer_WordPosition(ByVal sender As System.Object, ByVal e As WordPositionEventArgs) Handles _Synthesizer.WordPosition
    Dim startPosition As Integer
    Dim wordLength As Integer
    If (e IsNot Nothing) Then
        startPosition = e.Position
        wordLength = e.Length
        ...
    End If
End Sub
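Across all of these languages the word-position pattern is the same: the synthesizer reports a character offset and length into the text being spoken, and the application maps that range onto its display, for example to highlight the current word. A language-neutral sketch in Python, with illustrative offsets rather than output from any SDK:

```python
def highlight(text, position, length):
    """Return the word a word-position event refers to.

    position and length are the character offset and length that a
    synthesizer reports for the text currently being spoken.
    """
    if position < 0 or position + length > len(text):
        raise ValueError("event range outside text")
    return text[position:position + length]

text = "Hello world."
# A synthesizer speaking this text might report (0, 5) and then (6, 5).
events = [(0, 5), (6, 5)]
spoken = [highlight(text, p, n) for p, n in events]  # ["Hello", "world"]
```

A UI would typically select the range rather than extract it, as the Objective-C and Swift examples above do with setSelectedRange.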

Some synthesizers support property settings that control basic aspects of how synthesis occurs. Review the vendor's version-specific Speech API documentation for supported properties.


// Android Java: No properties

// C#: Set the speaking volume
_Synthesizer.SetProperty("volume","50");

// C++: Set the speaking volume
_Synthesizer->SetProperty(L"volume", L"50");
    

// C++Builder: Set the speaking volume
_Synthesizer->SetProperty("volume","50");

// Delphi: Set the speaking volume
_Synthesizer.SetProperty('volume','50');
    

// Java: Set the speaking volume
_Synthesizer.setProperty("volume","50");

// Objective-C: No properties

// Swift: No properties

' VB: Set the speaking volume
_Synthesizer.SetProperty("volume","50")
    

Acapela TTS Properties

(Source: Acapela Group Acapela TTS Developer Guide)

Property            Value
enginepath          Runtime library path.
license             License file path.
preset              Name of the equalizer preset.
lexicon             A list of user lexicons. Each file name must be separated with a semicolon.
pitch               The baseline pitch expressed in Hz. Man (110) Woman (180). Value ranges between 30 and 500.
speed               The reading speed. Values are in percent of the default speed rate 100. Value ranges between 30 and 300.
maxpitch            The maximum pitch allowed expressed in Hz. Value ranges between pitch and pitch * 2.5.
minpitch            The minimum pitch allowed expressed in Hz. Value ranges between pitch / 5 and pitch.
volume              A ratio percentage of the TTS output volume. Default value is 100. Value ranges 0 to 150.
leadingsilence      The pause duration at the beginning of speaking in milliseconds. Default value is 50. Value ranges 20 to 5000.
trailingsilence     The pause duration at the end of speaking in milliseconds. Default value is 500. Value ranges 20 to 5000.
deviceid            The audio device identifier.
readingmode         The way text is spoken: normal, word at a time, or letter at a time (spelling).
pausepunct          The pause duration for period, exclamation point, and question mark. Value ranges from 0 to 5.
pausesemicolon      The pause duration for semicolon. Value ranges from 0 to 5.
pausecomma          The pause duration for comma and colon. Value ranges from 0 to 5.
pausebracket        The pause duration for quote, braces, and brackets. Value ranges from 0 to 5.
pausespell          The pause duration between letters in spell reading mode. Value ranges from 0 to 5.
usefilter           Enable or disable equalizer use: 0 (no) or 1 (yes).
filtervalue1        A value corresponding to the attenuation for band 1 filter. Value ranges 0 to 200.
filtervalue2        A value corresponding to the attenuation for band 2 filter. Value ranges 0 to 200.
filtervalue3        A value corresponding to the attenuation for band 3 filter. Value ranges 0 to 200.
filtervalue4        A value corresponding to the attenuation for band 4 filter. Value ranges 0 to 200.
vocaltract          The voice shaping (tone) expressed in percentage of the default value 100. Value ranges 50 to 150.
audioboostpreemph   Controls the emphasis of medium and high frequencies. The default value is 0. Value ranges 0 to 90.
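Since SetProperty takes string values, an out-of-range setting is typically not caught until synthesis time. A small Python sketch that checks a value against a few of the documented Acapela ranges (the table above is authoritative; ACAPELA_RANGES and check_property are hypothetical helpers, not part of any SDK):

```python
# Illustrative numeric ranges copied from the Acapela property table.
ACAPELA_RANGES = {
    "pitch": (30, 500),
    "speed": (30, 300),
    "volume": (0, 150),
    "leadingsilence": (20, 5000),
    "trailingsilence": (20, 5000),
}

def check_property(name, value):
    """Return True if value names a known property and lies in its documented range."""
    if name not in ACAPELA_RANGES:
        return False
    low, high = ACAPELA_RANGES[name]
    return low <= int(value) <= high

ok = check_property("volume", "50")    # within the documented 0-150 range
bad = check_property("volume", "200")  # above the documented maximum
```

An application could run such a check before calling SetProperty rather than waiting for the synthesizer to reject or clamp the value.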

Cepstral Swift Properties

(Source: Cepstral Swift SDK documentation)

Property        Value
configfile      Configuration file to use instead of the default.
audiochannels   Number of audio channels [ 1 (mono) or 2 (stereo) ].
audiodeadair    Milliseconds of dead air (silence) to pad at the end of speech.
audiopan        Left-to-right panning [ -1 = left, 0 = center, 1 = right ]. This must be used with audiochannels=2.
audiovolume     Volume multiplication factor as a percentage. Default value is 100.
lexicon         A lexicon.txt file in the voice directory. Review Cepstral lexicon editing with LexiconKit.
speechrate      Speaking rate (average WPM).
voicedir        Directory for the voice.
sfx             Special effects output chain file.
sfxSpecial effects output chain file.

CereProc CereVoice Properties

(Source: CereProc CereVoice SDK User Guide)

Property          Value
enginepath        Runtime library path.
voicepath         Voice files path.
configfile        Configuration file path.
license           License file path.
rootcertificate   Root certificate file path.
clientcrt         Client CRT file path.
clientkey         Client key file path.

Microsoft Azure Speech Properties

(Source: learn.microsoft.com)

Property       Value
devicename     The multimedia device ID that is used by the audio object.
language       The voice language to use for synthesizers that support multiple languages.
languages      One or more languages for which to enumerate available voices.
speechkey      The Azure Speech Services key.
speechregion   The Azure Speech Services region.

Microsoft SAPI 5 Properties

(Source: Microsoft SAPI 5 Help File)

Property   Value
deviceid   The multimedia device ID that is used by the audio object.
rate       The current text rendering rate adjustment. Value specifying the speaking rate of the voice. Supported values range from -10 to 10; values outside this range may be truncated.
volume     The synthesizer output volume level of the voice in real time. Volume levels are specified in percentage values ranging from 0 to 100. The default base volume for all voices is 100.

Speech SDK Helpers

The Speak method supports two optional Speech API properties, options and output format, that are defined by enumerated constants in the SDKs.

SpeechKit includes helper constants for these properties that map to the Speech API values.

Acapela TTS

The options value may be specified with one or more ACATTSSPEAKFLAGS constants:

  • ACATTS_SYNC
  • ACATTS_ASYNC
  • ACATTS_TEXT
  • ACATTS_FILE
  • ACATTS_TXT_ANSI
  • ACATTS_TXT_OEM
  • ACATTS_TXT_UC2
  • ACATTS_TXT_UTF8
  • ACATTS_TXT_MACROMAN
  • ACATTS_TAG_NO
  • ACATTS_TAG_SAPI
  • ACATTS_READ_DEFAULT
  • ACATTS_READ_TEXT
  • ACATTS_READ_WORD
  • ACATTS_READ_LETTER
  • ACATTS_FILE_RAW
  • ACATTS_FILE_WAV
  • ACATTS_FILE_AU
  • ACATTS_FILE_VOX
  • ACATTS_FILE_AIFF
  • ACATTS_FILE_EXTAUDIOFILEOBJ

The output value may be specified with one or more ACATTSWAVFORMATFLAGS constants:

  • ACATTS_FORMAT_PCM8
  • ACATTS_FORMAT_PCM
  • ACATTS_FORMAT_ALAW
  • ACATTS_FORMAT_MULAW
  • ACATTS_FORMAT_OKI_ADPCM

Cepstral Swift

The options value may be specified with one or more SWIFTSPEAKFLAGS constants:

  • SWIFT_DEFAULT
  • SWIFT_ASYNC
  • SWIFT_IS_FILENAME
  • SWIFT_IS_XML
  • SWIFT_IS_SSML
  • SWIFT_SPELL_OUT
  • SWIFT_SPEAK_PHONEMES
  • SWIFT_NO_BLOCKING

The output format value may be one of the voice-supported SPSTREAMFORMAT constants:

  • SPSF_Default
  • SPSF_8kHz8BitMono
  • SPSF_8kHz8BitStereo
  • SPSF_8kHz16BitMono
  • SPSF_8kHz16BitStereo
  • SPSF_11kHz8BitMono
  • SPSF_11kHz8BitStereo
  • SPSF_11kHz16BitMono
  • SPSF_11kHz16BitStereo
  • SPSF_12kHz8BitMono
  • SPSF_12kHz8BitStereo
  • SPSF_12kHz16BitMono
  • SPSF_12kHz16BitStereo
  • SPSF_16kHz8BitMono
  • SPSF_16kHz8BitStereo
  • SPSF_16kHz16BitMono
  • SPSF_16kHz16BitStereo
  • SPSF_22kHz8BitMono
  • SPSF_22kHz8BitStereo
  • SPSF_22kHz16BitMono
  • SPSF_22kHz16BitStereo
  • SPSF_24kHz8BitMono
  • SPSF_24kHz8BitStereo
  • SPSF_24kHz16BitMono
  • SPSF_24kHz16BitStereo
  • SPSF_32kHz8BitMono
  • SPSF_32kHz8BitStereo
  • SPSF_32kHz16BitMono
  • SPSF_32kHz16BitStereo
  • SPSF_44kHz8BitMono
  • SPSF_44kHz8BitStereo
  • SPSF_44kHz16BitMono
  • SPSF_44kHz16BitStereo
  • SPSF_48kHz8BitMono
  • SPSF_48kHz8BitStereo
  • SPSF_48kHz16BitMono
  • SPSF_48kHz16BitStereo
  • SPSF_CCITT_ALaw_8kHzMono
  • SPSF_CCITT_ALaw_8kHzStereo
  • SPSF_CCITT_ALaw_11kHzMono
  • SPSF_CCITT_ALaw_11kHzStereo
  • SPSF_CCITT_ALaw_22kHzMono
  • SPSF_CCITT_ALaw_22kHzStereo
  • SPSF_CCITT_ALaw_44kHzMono
  • SPSF_CCITT_ALaw_44kHzStereo
  • SPSF_CCITT_uLaw_8kHzMono
  • SPSF_CCITT_uLaw_8kHzStereo
  • SPSF_CCITT_uLaw_11kHzMono
  • SPSF_CCITT_uLaw_11kHzStereo
  • SPSF_CCITT_uLaw_22kHzMono
  • SPSF_CCITT_uLaw_22kHzStereo
  • SPSF_CCITT_uLaw_44kHzMono
  • SPSF_CCITT_uLaw_44kHzStereo

Microsoft Azure Speech

The output value may be specified with one of the MCSSTREAMFORMAT constants:

  • Raw8Khz8BitMonoMULaw
  • Riff16Khz16KbpsMonoSiren
  • Audio16Khz16KbpsMonoSiren
  • Audio16Khz32KBitRateMonoMp3
  • Audio16Khz128KBitRateMonoMp3
  • Audio16Khz64KBitRateMonoMp3
  • Audio24Khz48KBitRateMonoMp3
  • Audio24Khz96KBitRateMonoMp3
  • Audio24Khz160KBitRateMonoMp3
  • Raw16Khz16BitMonoTrueSilk
  • Riff16Khz16BitMonoPcm
  • Riff8Khz16BitMonoPcm
  • Riff24Khz16BitMonoPcm
  • Riff8Khz8BitMonoMULaw
  • Raw16Khz16BitMonoPcm
  • Raw24Khz16BitMonoPcm
  • Raw8Khz16BitMonoPcm
  • Ogg16Khz16BitMonoOpus
  • Ogg24Khz16BitMonoOpus
  • Raw48Khz16BitMonoPcm
  • Riff48Khz16BitMonoPcm
  • Audio48Khz96KBitRateMonoMp3
  • Audio48Khz192KBitRateMonoMp3
  • Ogg48Khz16BitMonoOpus
  • Webm16Khz16BitMonoOpus
  • Webm24Khz16BitMonoOpus
  • Raw24Khz16BitMonoTrueSilk
  • Raw8Khz8BitMonoALaw
  • Riff8Khz8BitMonoALaw
  • Webm24Khz16Bit24KbpsMonoOpus
  • Audio16Khz16Bit32KbpsMonoOpus
  • Audio24Khz16Bit48KbpsMonoOpus
  • Audio24Khz16Bit24KbpsMonoOpus
  • Raw22050Hz16BitMonoPcm
  • Riff22050Hz16BitMonoPcm
  • Raw44100Hz16BitMonoPcm
  • Riff44100Hz16BitMonoPcm

Microsoft SAPI 5

The options value may be specified with one or more SPEAKFLAGS constants:

  • SPF_DEFAULT
  • SPF_ASYNC
  • SPF_PURGEBEFORESPEAK
  • SPF_IS_FILENAME
  • SPF_IS_XML
  • SPF_IS_NOT_XML
  • SPF_PERSIST_XML
  • SPF_NLP_SPEAK_PUNC
  • SPF_PARSE_SAPI
  • SPF_PARSE_SSML
  • SPF_PARSE_AUTODETECT
  • SPF_NLP_MASK
  • SPF_PARSE_MASK
  • SPF_VOICE_MASK
  • SPF_UNUSED_FLAGS
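Because the SPEAKFLAGS values are bit flags, several options can be combined with a bitwise OR. A minimal Python sketch with illustrative numeric values (the authoritative values are defined in the SAPI headers, not here):

```python
from enum import IntFlag

class SpeakFlags(IntFlag):
    # Illustrative values; the authoritative ones come from the SAPI headers.
    SPF_DEFAULT = 0
    SPF_ASYNC = 1
    SPF_PURGEBEFORESPEAK = 2
    SPF_IS_FILENAME = 4
    SPF_IS_XML = 8

# Speak asynchronously, flush any pending speech, and treat the input as XML.
options = (SpeakFlags.SPF_ASYNC
           | SpeakFlags.SPF_PURGEBEFORESPEAK
           | SpeakFlags.SPF_IS_XML)
```

In C# or C++ the same combination is written with the | operator on the SDK's enum; the resulting integer is what the Speak call receives.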

The output format value may be one of the SPSTREAMFORMAT constants:

  • SPSF_Default
  • SPSF_NoAssignedFormat
  • SPSF_Text
  • SPSF_NonStandardFormat
  • SPSF_ExtendedAudioFormat
  • SPSF_8kHz8BitMono
  • SPSF_8kHz8BitStereo
  • SPSF_8kHz16BitMono
  • SPSF_8kHz16BitStereo
  • SPSF_11kHz8BitMono
  • SPSF_11kHz8BitStereo
  • SPSF_11kHz16BitMono
  • SPSF_11kHz16BitStereo
  • SPSF_12kHz8BitMono
  • SPSF_12kHz8BitStereo
  • SPSF_12kHz16BitMono
  • SPSF_12kHz16BitStereo
  • SPSF_16kHz8BitMono
  • SPSF_16kHz8BitStereo
  • SPSF_16kHz16BitMono
  • SPSF_16kHz16BitStereo
  • SPSF_22kHz8BitMono
  • SPSF_22kHz8BitStereo
  • SPSF_22kHz16BitMono
  • SPSF_22kHz16BitStereo
  • SPSF_24kHz8BitMono
  • SPSF_24kHz8BitStereo
  • SPSF_24kHz16BitMono
  • SPSF_24kHz16BitStereo
  • SPSF_32kHz8BitMono
  • SPSF_32kHz8BitStereo
  • SPSF_32kHz16BitMono
  • SPSF_32kHz16BitStereo
  • SPSF_44kHz8BitMono
  • SPSF_44kHz8BitStereo
  • SPSF_44kHz16BitMono
  • SPSF_44kHz16BitStereo
  • SPSF_48kHz8BitMono
  • SPSF_48kHz8BitStereo
  • SPSF_48kHz16BitMono
  • SPSF_48kHz16BitStereo
  • SPSF_TrueSpeech_8kHz1BitMono
  • SPSF_CCITT_ALaw_8kHzMono
  • SPSF_CCITT_ALaw_8kHzStereo
  • SPSF_CCITT_ALaw_11kHzMono
  • SPSF_CCITT_ALaw_11kHzStereo
  • SPSF_CCITT_ALaw_22kHzMono
  • SPSF_CCITT_ALaw_22kHzStereo
  • SPSF_CCITT_ALaw_44kHzMono
  • SPSF_CCITT_ALaw_44kHzStereo
  • SPSF_CCITT_uLaw_8kHzMono
  • SPSF_CCITT_uLaw_8kHzStereo
  • SPSF_CCITT_uLaw_11kHzMono
  • SPSF_CCITT_uLaw_11kHzStereo
  • SPSF_CCITT_uLaw_22kHzMono
  • SPSF_CCITT_uLaw_22kHzStereo
  • SPSF_CCITT_uLaw_44kHzMono
  • SPSF_CCITT_uLaw_44kHzStereo
  • SPSF_ADPCM_8kHzMono
  • SPSF_ADPCM_8kHzStereo
  • SPSF_ADPCM_11kHzMono
  • SPSF_ADPCM_11kHzStereo
  • SPSF_ADPCM_22kHzMono
  • SPSF_ADPCM_22kHzStereo
  • SPSF_ADPCM_44kHzMono
  • SPSF_ADPCM_44kHzStereo
  • SPSF_GSM610_8kHzMono
  • SPSF_GSM610_11kHzMono
  • SPSF_GSM610_22kHzMono
  • SPSF_GSM610_44kHzMono
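The uncompressed PCM names in this list follow a regular pattern of sample rate, bit depth, and channel count, so those parameters can be decoded from the constant name itself. An illustrative Python sketch (parse_spsf is a hypothetical helper, not part of any SDK):

```python
import re

def parse_spsf(name):
    """Decode a PCM-style SPSF_* constant name into (kHz, bits, channels).

    Handles only the regular names such as SPSF_22kHz16BitMono; compressed
    formats (ALaw, uLaw, ADPCM, GSM610, ...) do not follow this pattern.
    """
    m = re.fullmatch(r"SPSF_(\d+)kHz(\d+)Bit(Mono|Stereo)", name)
    if not m:
        return None
    khz, bits, channels = m.groups()
    return int(khz), int(bits), 1 if channels == "Mono" else 2

fmt = parse_spsf("SPSF_22kHz16BitStereo")  # (22, 16, 2)
```

Note that the kHz figures in the names are nominal; for example, the 11 kHz and 22 kHz families actually use 11025 Hz and 22050 Hz sample rates.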

The output format value may be one of the voice-supported SPSTREAMFORMAT constants:

  • SPSF_Default
  • SPSF_8kHz16BitMono (VE_FREQ_8KHZ)
  • SPSF_8kHz16BitStereo (VE_FREQ_8KHZ)
  • SPSF_11kHz16BitMono (VE_FREQ_11KHZ)
  • SPSF_11kHz16BitStereo (VE_FREQ_11KHZ)
  • SPSF_16kHz16BitMono (VE_FREQ_16KHZ)
  • SPSF_16kHz16BitStereo (VE_FREQ_16KHZ)
  • SPSF_22kHz16BitMono (VE_FREQ_22KHZ)
  • SPSF_22kHz16BitStereo (VE_FREQ_22KHZ)