Speech Synthesis Markup Language

Created by jeetsingh on 2019-10-22 12:21:26
Last updated by arvindpdmn on 2019-10-23 14:25:39

Summary

Speech synthesis is the process of producing natural-sounding human speech from text so that humans can interact with machines via voice interfaces. Typical applications are reading for the blind, speaking aids for the handicapped, remote access to email, proofreading and so on. A speech synthesis system consists of analysis of the input text, followed by synthesis of speech.

There are inherent difficulties in these systems. They often handle symbols or foreign words poorly. Some systems translate leading whitespace into extra pauses. Some words may need extra stress or a change of pitch. It's for these purposes that Speech Synthesis Markup Language (SSML) becomes useful.

SSML adds markup to input text to help speech synthesizers construct speech waveforms that sound more natural. SSML is a W3C standard, though some implementations have proprietary extensions. Popular voice assistants (Alexa, Assistant, Cortana) are known to use SSML.

Milestones

1995
SSML DTD and example. Source: Isard 1995, fig. 5-1, 5-2.

Amy Isard at the University of Edinburgh completes her thesis on SSML with supervisor Paul Taylor. She describes SSML as an application of SGML. She also presents a prototype implementation that's understood by the CSTR Speech Synthesizer. This implementation includes phrase boundaries, emphasized words, specified pronunciations, and inclusion of other sound files. The concept of SSML was first introduced by Paul Taylor in 1992.

Oct
1998
VoiceXML interworks with SSML and others. Source: Froumentin 2004.

W3C organizes a workshop titled "Voice Browsers". The idea is to allow people with telephone connections to access Web content. This leads to the formation of the Voice Browser Working Group (VBWG) in March 1999. These are the first steps towards the later standardization of SSML and related technologies.

1999
Prosody support among speech synthesizers. Source: Moore and Eyckelhof 1999.

One study compares many speech synthesizers in the market. Each supports different aspects of prosody. There's no mention of SSML in the report. TrueTalk is said to be using escape sequences, which is very system specific. Since each system uses its own proprietary annotations or escape sequences, such annotated input is not portable. SSML is an attempt to introduce a standard to solve this.

Jun
2000

JSpeech Markup Language (JSML) is published as a W3C Note. It's inspired by Isard's SSML thesis and is derived from Java Speech API Markup Language that was developed at Sun Microsystems in the late 1990s.

Dec
2003

Version 1.0 of SSML is published as a W3C Candidate Recommendation. A draft of this can be traced to August 2000.

Mar
2004

Voice Extensible Markup Language (VoiceXML) is published as a W3C Recommendation. VoiceXML is designed for "creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations". SSML elements can be used within the prompt element of VoiceXML. SSML elements such as audio and say-as can have additional attributes when used within VoiceXML.

Sep
2010

Version 1.1 of SSML is published as a W3C Recommendation. Compared to V1.0, this version improves support for a broader range of natural languages.

Apr
2017

Amazon Alexa introduces five new SSML tags: amazon:effect name="whispered", say-as interpret-as="expletive", sub, emphasis and prosody.

Mar
2018

Amazon Polly, a text-to-speech service, adds an SSML breath feature. Instead of inserting pauses between words, breath sounds can result in more natural-sounding speech. The feature allows for manual, automated and mixed modes of inserting breath.

Discussion

  • Which are the main features of SSML?
    Example of synthesized speech using SSML. Source: Google Developers 2019.

    SSML has many useful features to make synthesized speech sound more natural:

    • Voice: Different parts of a text could use different voices (male/female/neutral), which is useful for reading out dialogues.
    • Variations: Some words could be emphasized. Others could be stretched in time. Some phrases could be said in a high pitch. Swear words could be censored.
    • Special Cases: Telephone numbers could be read out as individual digits. Date and time fields should not be read out as individual digits. Abbreviations could be expanded or read out as individual letters.
    • Pauses: Pauses could be introduced, for example, to suggest the speaker thinking or expecting a response.
    • Recording: A recorded audio file can be played, and if unavailable, an alternate text could be synthesized.
    • Multilingual: A default language could be specified at the root level. This can be overridden for specific foreign language phrases.
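
    The features listed above can be combined in a single document. The following is a minimal sketch, not a vendor-verified example: the audio URL is a placeholder, and the interpret-as value "telephone" is an assumption since the standard doesn't fix these values. Python's standard xml.etree.ElementTree is used only to confirm that the markup is well-formed and to list the elements used.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML only; the URL and the "telephone" value are assumptions.
SSML = """<speak version="1.1" xml:lang="en-US">
  <voice gender="female">Can you call me back?</voice>
  <break time="500ms"/>
  <voice gender="male">
    Sure, <emphasis level="strong">right away</emphasis>. Is it still
    <say-as interpret-as="telephone">5551234</say-as>?
  </voice>
  <audio src="https://example.com/chime.wav">A chime plays here.</audio>
  They ended the call with <lang xml:lang="fr-FR">au revoir</lang>.
</speak>"""

# Parsing fails with ParseError if a tag or quote is missing.
root = ET.fromstring(SSML)

# Collect the distinct element names used in the document.
used = sorted({el.tag for el in root.iter()})
print(used)
```

    The text inside the audio element is the alternate content that would be synthesized if the recording cannot be played.
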
  • Where does SSML fit in the overall speech synthesis process?
    SSML tags are useful across the speech synthesis process. Source: Baggia and Spa 2019, fig. 4.

    Most text-to-speech (TTS) engines process their input in stages: structure analysis, text normalization, text-to-phoneme conversion, prosody analysis and waveform production. All of these can be enhanced by SSML elements. For example, p and s SSML elements mark paragraphs and sentences; say-as is useful for rendering special cases; sub for expanding abbreviations; and so on.

    Text normalization converts text into tokens suitable for speech. For example, '$200' would be converted to 'two hundred dollars'; '1/2' would become 'half'; and 'AAA' would become 'triple A'. These tokens are then converted into units of sounds called phonemes.
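
    To make the idea of text normalization concrete, here is a toy sketch that uses a lookup table holding only the three examples above. Real engines use far richer rules; the RULES table and the normalize function are hypothetical names introduced for illustration.

```python
# Toy rule table holding only the examples from the text (an assumption,
# not how any real TTS engine stores its normalization rules).
RULES = {
    "$200": "two hundred dollars",
    "1/2": "half",
    "AAA": "triple A",
}

def normalize(text: str) -> str:
    """Replace known tokens with their spoken form, keeping punctuation."""
    out = []
    for tok in text.split():
        core = tok.rstrip(".,;:!?")   # token without trailing punctuation
        trail = tok[len(core):]       # the punctuation that was stripped
        out.append(RULES.get(core, core) + trail)
    return " ".join(out)

spoken = normalize("AAA charges $200 for 1/2 a day.")
print(spoken)
```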

    Speaking all words in the same tone or loudness creates monotony. Prosody is therefore useful to make speech more natural and intelligible. It draws attention to certain words by way of emphasis. Prosody is about volume, pitch, and rate of speech. We can specify the duration of a word and its pitch contour.
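
    As a rough illustration of how markup maps onto these stages, the sketch below marks structure with p and s, a special case with say-as, an abbreviation with sub, and prosody with prosody and emphasis. The sentence content is made up, and the "cardinal" value is an assumption because interpret-as values vary by vendor.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML; content and attribute values are assumptions.
SSML = """<speak version="1.1" xml:lang="en-US">
  <p>
    <s>The total is <say-as interpret-as="cardinal">200</say-as> dollars.</s>
    <s>Payment is due <prosody rate="slow" volume="loud">by Friday</prosody>.</s>
    <s>This is your <emphasis level="strong">final</emphasis> reminder,
       sent by <sub alias="triple A">AAA</sub>.</s>
  </p>
</speak>"""

root = ET.fromstring(SSML)

# Structure analysis: sentences marked explicitly inside the paragraph.
sentences = root.findall("./p/s")

# Prosody analysis: the hints an engine would receive for rate and volume.
prosody = [dict(el.attrib) for el in root.iter("prosody")]
print(len(sentences), prosody)
```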

  • What exactly is a phoneme and how does SSML use it?

    Once tokens are obtained via text normalization, the synthesizer must replace each token with a sequence of phonemes. A phoneme is a unit of sound that distinguishes one word from another. For most cases, a dictionary lookup is adequate but when there's ambiguity or non-standard pronunciation, phoneme or say-as elements can be used. One example is "read", which has differing pronunciation based on verb tense. Another example is "Caius College", which should be pronounced as "keys college".

    Phonemes are language dependent. US English typically has 45 phonemes, Hawaiian has 12-18, and some languages may have as many as 100.

    The phoneme element has an attribute alphabet that must at least support "ipa" as a value, which refers to the International Phonetic Alphabet (IPA) of the International Phonetic Association. Other alphabets include Speech API Phone Set (SAPI), Universal Phone Set (UPS), IBM TTS, and Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA).
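
    The two pronunciation examples above could be marked up as follows. This is a sketch under assumptions: the helper function phoneme is hypothetical, and the IPA strings are common transcriptions for past-tense "read" and for "keys" rather than values taken from the sources.

```python
import xml.etree.ElementTree as ET

def phoneme(text: str, ipa: str) -> ET.Element:
    """Build a <phoneme> element that overrides the default pronunciation."""
    el = ET.Element("phoneme", {"alphabet": "ipa", "ph": ipa})
    el.text = text
    return el

speak = ET.Element("speak")
speak.text = "Yesterday I "

past_read = phoneme("read", "rɛd")    # past tense, rhymes with "red"
past_read.tail = " the notice at "
speak.append(past_read)

caius = phoneme("Caius", "kiːz")      # pronounced like "keys"
caius.tail = " College."
speak.append(caius)

# encoding="unicode" returns a str and keeps the IPA characters readable.
ssml = ET.tostring(speak, encoding="unicode")
print(ssml)
```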

  • Which are the tags defined in SSML?

    Without being exhaustive, we mention a few important ones here. SSML is basically an application of XML. The root of an SSML document is speak. Using attributes, we can also specify the namespace and schema. Attribute xml:lang specifies the language.

    To load a lexicon from a known URI, we can use the lexicon element. To translate tokens into phonemes using a specific lexicon, the lookup element is useful.

    The say-as element has interpret-as as a mandatory attribute. The standard doesn't specify values for this attribute. Typical implementations include address, cardinal, characters, date, digits, fraction, ordinal, telephone, time, expletive, unit, interjection, etc.

    The sub element replaces the contained text with the alias attribute value for pronunciation.

    When we need to specify language for a phrase, we can use lang with attribute xml:lang. Where applicable, prefer to use the attribute with text structural elements p, s, w and token.

    Element phoneme with attribute ph specifies phonemic/phonetic pronunciation. The attribute value doesn't go through text normalization or lexicon lookup.

    Many elements control prosody: voice, emphasis, break, prosody. Attributes of prosody include pitch, contour, range, rate, duration and volume.
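
    Several of the tags described above are brought together in the sketch below. The lexicon URI is a placeholder, the sentence content is invented, and the "date" value for interpret-as is an assumption since such values are implementation defined. Because the root declares the SSML namespace, lookups through ElementTree need a namespace mapping.

```python
import xml.etree.ElementTree as ET

# Namespace used by SSML documents; the prefix "ssml" is our own choice.
NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}

# Illustrative document; the lexicon URI is a placeholder.
SSML = """<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <lexicon uri="https://example.com/names.pls" xml:id="names"/>
  <lookup ref="names">Dr. Caius will see you</lookup>
  on <say-as interpret-as="date">2019-10-23</say-as>.
  <break strength="medium"/>
  Bring your <sub alias="identity card">ID</sub>,
  <lang xml:lang="fr-FR">s'il vous plaît</lang>.
</speak>"""

root = ET.fromstring(SSML)

# Element names in document order, with the namespace part stripped.
tags = [el.tag.split("}")[1] for el in root.iter()]

say_as = root.find("ssml:say-as", NS)
print(tags, say_as.get("interpret-as"))
```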

  • Which are some real-world applications that use SSML?
    Guardian serves news in audio form using Google text-to-speech API. Source: Coleman 2019.

    Since 2019, the Guardian has been providing users with important news in audio form as well. They use Google's text-to-speech API, with input that includes SSML. They noted that SSML parsing is slow. The API took about 8-10 seconds to generate the audio. Therefore, they opted to serve cached audio rather than generate it just in time.
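
    The idea of serving cached audio can be sketched as a cache keyed by a hash of the SSML input. This is not the Guardian's implementation: AudioCache and fake_synthesize are hypothetical names, and fake_synthesize merely stands in for a slow call to a text-to-speech service.

```python
import hashlib
from typing import Callable, Dict

class AudioCache:
    """Synthesize each distinct SSML document once and reuse the result."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.misses = 0  # number of times the slow synthesizer was called

    def get(self, ssml: str) -> bytes:
        key = hashlib.sha256(ssml.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesize(ssml)
        return self._store[key]

def fake_synthesize(ssml: str) -> bytes:
    # Hypothetical stand-in for a slow text-to-speech API call.
    return b"AUDIO:" + ssml.encode("utf-8")

cache = AudioCache(fake_synthesize)
bulletin = "<speak>Here are today's headlines.</speak>"
first = cache.get(bulletin)    # synthesized on first request
second = cache.get(bulletin)   # served from the cache
print(cache.misses)
```

    A production system would more likely persist the audio in object storage or a CDN than in process memory; the in-memory dictionary only keeps the sketch short.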

    In the UK, NHS is using Amazon Polly to stream synthesized speech through telephone lines. This is a low-cost approach that uses widespread telephone networks to deliver healthcare remotely. A typical response latency of 60ms was observed. They use SSML, although many features are not yet used.

    One blogger has suggested voice-based document reviews during long commutes. A document is converted into multiple MP3 files, each with different voice and cadence. AWS Lambda is used to convert the document to multiple SSML files. Another Lambda call triggers conversion of SSML files to MP3 files.

  • What tips can you give for content writers and developers working with SSML?
    An SSML WYSIWYG editor and tester. Source: Top Voice Apps 2019.

    With SSML, content creators can miss a closing tag or double quotes for element attributes. In the world of HTML, this problem was solved by Markdown syntax. Likewise, a replacement for SSML is Speech Markdown. However, we need converters to SSML until synthesizers can natively support Speech Markdown.
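
    Until such tooling is in place, mistakes like a missing closing tag or missing quotes can be caught with any XML parser before the text reaches a synthesizer. The sketch below uses Python's standard library; ssml_error is a hypothetical helper, and it checks well-formedness only, not whether a vendor supports each element.

```python
import xml.etree.ElementTree as ET
from typing import Optional

def ssml_error(ssml: str) -> Optional[str]:
    """Return a description of the first problem found, or None if none."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return str(exc)
    # Strip any namespace before checking the root element name.
    if root.tag.split("}")[-1] != "speak":
        return "root element is not <speak>"
    return None

good = '<speak>I <emphasis level="strong">really</emphasis> mean it.</speak>'
missing_close = '<speak>I <emphasis level="strong">really mean it.</speak>'
missing_quote = '<speak>I <emphasis level=strong>really</emphasis> mean it.</speak>'

for sample in (good, missing_close, missing_quote):
    print(ssml_error(sample))
```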

    An alternative is to use an SSML editor. Examples are from Verndale, PlayX-team at Swedish Radio and SSML Editor. You could even create your own SSML editor; one published approach is based on Sanity.io and React.js.

    Content creators can refer to an Amazon Alexa SSML cheatsheet. YouTube audio library provides more than 5,000 free sounds that we can use in our SSML.

    Developers can refer to a JavaScript implementation of an SSML parser. The Node.js package ssml-builder allows us to create SSML programmatically using the builder pattern.
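
    To show what the builder pattern looks like for SSML, here is a small Python sketch. It does not reproduce the API of ssml-builder; the class SpeechBuilder and its method names are hypothetical. Text and attribute values are escaped so that characters such as & don't break the markup.

```python
from typing import List
from xml.sax.saxutils import escape, quoteattr

class SpeechBuilder:
    """Hypothetical fluent builder that assembles an SSML string."""

    def __init__(self) -> None:
        self._parts: List[str] = []

    def say(self, text: str) -> "SpeechBuilder":
        self._parts.append(escape(text))
        return self

    def pause(self, time: str) -> "SpeechBuilder":
        self._parts.append(f"<break time={quoteattr(time)}/>")
        return self

    def say_as(self, text: str, interpret_as: str) -> "SpeechBuilder":
        self._parts.append(
            f"<say-as interpret-as={quoteattr(interpret_as)}>"
            f"{escape(text)}</say-as>"
        )
        return self

    def ssml(self) -> str:
        return "<speak>" + " ".join(self._parts) + "</speak>"

ssml = (
    SpeechBuilder()
    .say("Tom & Jerry starts in")
    .pause("500ms")
    .say_as("3", "cardinal")
    .say("minutes.")
    .ssml()
)
print(ssml)
```

    Each method returns the builder itself, which is what allows the calls to be chained; closing tags and quoting are handled in one place rather than by the content writer.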

    Among the open source speech synthesizers are FreeTTS (Java) and eSpeak (C). There are also commercial text-to-speech engines that support SSML. Cepstral and CereVoice are examples. CereVoice includes a Scottish-accented female voice, a vocal gesture library and patented Emotional Synthesis.

  • What are some criticisms of SSML?

    Although SSML is a W3C standard, not all its features are being supported by vendors. For example, IBM's text-to-speech service doesn't support or provides only partial support for many SSML elements or attributes. Google Assistant doesn't support the phoneme element.

    Moreover, each vendor is introducing its own proprietary elements. Amazon Alexa's amazon:effect is proprietary. Google Assistant makes use of par, seq and media. These can be used to add background music; or create containers to play media in sequence or in parallel. Elements par and seq are part of another W3C standard called Synchronized Multimedia Integration Language (SMIL).

    Parsing SSML in real time could be slow. For content writers, writing in SSML could be cumbersome and they may prefer Speech Markdown in future.

Sample Code

  • <?xml version="1.0"?>
    <!--
    Source: https://www.w3.org/TR/speech-synthesis11/
    Accessed: 2019-10-23
    -->
    <speak version="1.1"
           xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                     http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
           xml:lang="en-US">
      ... the body ...
    </speak>
     
    <!--
    Source: https://developer.amazon.com/docs/custom-skills/speech-synthesis-markup-language-ssml-reference.html
    Accessed: 2019-10-23
    -->
    <speak>
        Here is a number <w role="amazon:VBD">read</w>
        as a cardinal number:
        <say-as interpret-as="cardinal">12345</say-as>.
        Here is the same number with each digit spoken separately:
        <say-as interpret-as="digits">12345</say-as>.
        Here is a word spelled out:
        <say-as interpret-as="spell-out">hello</say-as>.
    </speak>
     
    <speak>
        I already told you I
        <emphasis level="strong">really like</emphasis>
        that person.
    </speak>
     
    <speak>
        In Paris, they pronounce it <lang xml:lang="fr-FR">Paris</lang>
    </speak>
     
    <speak>
        You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
        I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
    </speak>
     
    <speak>
        Normal volume for the first sentence.
        <prosody volume="x-loud">Louder volume for the second sentence</prosody>.
        When I wake up, <prosody rate="x-slow">I speak quite slowly</prosody>.
        I can speak with my normal pitch,
        <prosody pitch="x-high"> but also with a much higher pitch </prosody>,
        and also <prosody pitch="low">with a lower pitch</prosody>.
    </speak>
     
    <speak>
        My favorite chemical element is <sub alias="aluminum">Al</sub>,
        but Al prefers <sub alias="magnesium">Mg</sub>.
    </speak>
     
    <speak>
        Here's a surprise you did not expect.
        <voice name="Kendra"><lang xml:lang="en-US">I want to tell you a secret.</lang></voice>
        <voice name="Brian"><lang xml:lang="en-GB">Your secret is safe with me!</lang></voice>
        <voice name="Kendra"><lang xml:lang="en-US">I am not a real human.</lang></voice>.
        Can you believe it?
    </speak>

References

  1. Amazon Alexa. 2019. "Speech Synthesis Markup Language (SSML) Reference." Alexa Skills Kit, Amazon Alexa. Accessed 2019-10-22.
  2. Baggia, Paolo and Loquendo Spa. 2019. The impact of standards on today's speech applications. Accessed 2019-10-22.
  3. Bulterman, Dick, Jack Jansen, Pablo Cesar, Sjoerd Mullender, Eric Hyche, Marisa DeMeglio, Julien Quint, Hiroshi Kawamura, Daniel Weck, Xabiel García Pañeda, David Melendi, Samuel Cruz-Lara, Marcin Hanclik, Daniel F. Zucker, and Thierry Michel, eds. 2008. "Synchronized Multimedia Integration Language (SMIL 3.0)." W3C Recommendation, December 01. Accessed 2019-10-23.
  4. Burnett, Daniel C. and Zhi Wei Shuang, eds. 2010. "Speech Synthesis Markup Language (SSML) Version 1.1." W3C Recommendation, September 07. Accessed 2019-10-22.
  5. Burnett, Daniel C., Mark R. Walker, and Andrew Hunt, eds. 2003. "Speech Synthesis Markup Language (SSML) Version 1.0." W3C Candidate Recommendation, December 18. Accessed 2019-10-22.
  6. CereProc. 2019. "CereVoice Engine Text-to-Speech SDK." Accessed 2019-10-22.
  7. Coleman, Susie. 2019. "How we automated audio news bulletins." The Guardian, April 04. Accessed 2019-10-22.
  8. Cover Pages. 2019. "SSML: A Speech Synthesis Markup Language." Accessed 2019-10-22.
  9. Froumentin, Max. 2004. "Voice Applications." Talk. Accessed 2019-10-22.
  10. Google Developers. 2019. "SSML." Google Assistant. Accessed 2019-10-22.
  11. Hunt, Andrew, ed. 2000. "JSpeech Markup Language." W3C Note, June 05. Accessed 2019-10-22.
  12. IBM. 2014. "SSML phonemes." IBM WebSphere Voice Server V6.1.1, IBM Knowledge Center, October 24. Accessed 2019-10-22.
  13. IBM. 2019. "SSML elements." Text to Speech, IBM Cloud Docs, June 21. Accessed 2019-10-22.
  14. Isard, Amy. 1995. "SSML: A Markup Language for Speech Synthesis." University of Edinburgh. Accessed 2019-10-22.
  15. Luciani, Silvano. 2017. "More SSML for Actions on Google!" Medium, December 19. Accessed 2019-10-22.
  16. McGlashan, Scott, Daniel C. Burnett, Jerry Carter, Peter Danielsen, Jim Ferrans, Andrew Hunt, Bruce Lucas, Brad Porter, Ken Rehor, and Steph Tryphonas, eds. 2004. "Voice Extensible Markup Language (VoiceXML) Version 2.0." W3C Recommendation, March 16. Accessed 2019-10-22.
  17. Melvær, Knut. 2019. "How To Make A Speech Synthesis Editor." Smashing Magazine, March 21. Accessed 2019-10-22.
  18. Microsoft Docs. 2019a. "Speech Synthesis Markup Language (SSML) reference." Cortana Dev Center, Microsoft, July 12. Updated 2019-09-13. Accessed 2019-10-22.
  19. Moore, Ben A. and Casper Eyckelhof. 1999. "Speech Synthesizer Review." Information Sciences Institute, University of Southern California, November 05. Accessed 2019-10-23.
  20. Myers, Liz. 2017. "New SSML Features Give Alexa a Wider Range of Natural Expression." Blog, Amazon Alexa, April 27. Accessed 2019-10-23.
  21. Nicholls, Leon. 2017. "SSML for Actions on Google." Google Developers, on Medium, April 18. Accessed 2019-10-23.
  22. Peh, Binny. 2018. "Amazon Polly releases new SSML Breath feature." AWS Machine Learning Blog, March 22. Accessed 2019-10-22.
  23. Shukla, Vinod. 2019. "Turning Microsoft Word documents into audio playlists using Amazon Polly." AWS Machine Learning Blog, July 03. Accessed 2019-10-22.
  24. Top Voice Apps. 2019. "SSML." Accessed 2019-10-23.
  25. Tucker, Mark. 2019. "Speech Markdown is the Simpler Way to Format Text-to-Speech Content Over SSML." Voicebot.ai, June 20. Accessed 2019-10-22.
  26. W3C. 1998. "Voice Browsers: W3C Workshop: Call for Participation." October 13. Accessed 2019-10-23.
  27. W3C. 2006. "'Voice Browser' Activity." March 22. Accessed 2019-10-23.
  28. Walker, Mark R. and Andrew Hunt, eds. 2000. "Speech Synthesis Markup Language Specification for the Speech Interface Framework." W3C Working Draft, August 08. Accessed 2019-10-22.
  29. Wray, Michael. 2017. "Using Amazon Polly to Deliver Health Care for People with Long-Term Conditions." AWS Machine Learning Blog, June 30. Accessed 2019-10-22.

Further Reading

  1. Mikhalenko, Peter. 2004. "Speech Synthesis Markup Language: An Introduction." XML.com, October 20. Accessed 2019-10-22.
  2. Larson, James A. 2003. "The W3C Speech Interface Framework." March-April. Accessed 2019-10-22.
  3. Microsoft Docs. 2019b. "Speech Synthesis Markup Language (SSML)." Microsoft Azure, Microsoft, May 07. Updated 2019-09-27. Accessed 2019-10-22.
  4. Peterson, Terren. 2019. "If you want a winning voice app, implement SSML." Hackernoon, October 14. Accessed 2019-10-23.
  5. Vargas, Garrett. 2019. "How to find errors in your SSML responses." Medium, March 26. Accessed 2019-10-23.
  6. Taylor, Paul and Amy Isard. 1997. "SSML: A speech synthesis markup language." Journal of Speech Communication, vol. 21, no. 1-2. pp. 123-133, February. Accessed 2019-10-22.

Cite As

Devopedia. 2019. "Speech Synthesis Markup Language." Version 3, October 23. Accessed 2019-11-21. https://devopedia.org/speech-synthesis-markup-language