Speech Synthesis Markup Language
Speech synthesis is the process of producing natural-sounding human speech from text so that humans can interact with machines via voice interfaces. Typical applications include reading for the blind, speaking aids for people with disabilities, remote access to email, and proofreading. A speech synthesis system analyzes the input text and then synthesizes speech from it.
There are inherent difficulties in these systems. They don't handle symbols or foreign words suitably. Some systems translate leading whitespaces into extra pauses. Some words may need extra stress or change of pitch. It's for these purposes that Speech Synthesis Markup Language (SSML) becomes useful.
SSML adds markup to input text to help speech synthesizers construct speech waveforms that sound more natural. SSML is a W3C standard, though some implementations have proprietary extensions. Popular voice assistants (Alexa, Assistant, Cortana) are known to use SSML.
Which are the main features of SSML?
SSML has many useful features to make synthesized speech sound more natural:
- Voice: Different parts of a text could use different voices (male/female/neutral), which is useful for reading out dialogues.
- Variations: Some words could be emphasized. Others could be stretched in time. Some phrases could be said in a high pitch. Swear words could be censored.
- Special Cases: Telephone numbers could be read out as individual digits. Date and time fields should not be read out as individual digits. Abbreviations could be expanded or read out as individual letters.
- Pauses: Pauses could be introduced, for example, to suggest the speaker thinking or expecting a response.
- Recording: A recorded audio file can be played, and if unavailable, an alternate text could be synthesized.
- Multilingual: A default language could be specified at the root level. This can be overridden for specific foreign language phrases.
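The features listed above can be seen together in one small document. The following is a minimal sketch, assuming SSML 1.1 element names; the wording, telephone number and audio URL are invented for illustration, and individual synthesizers may not support every element. The Python code only checks that the markup is well-formed XML and lists the elements it uses.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML; the sentences, number and audio URL are made up.
FEATURES_SSML = """<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice gender="female">Did you call <say-as interpret-as="telephone">5551234</say-as>?</voice>
  <break time="500ms"/>
  <voice gender="male">I <emphasis level="strong">never</emphasis> call before noon.</voice>
  <audio src="https://example.com/chime.mp3">chime</audio>
  <s xml:lang="fr-FR">Bonne journée.</s>
</speak>"""

NS = "{http://www.w3.org/2001/10/synthesis}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Parsing fails with ParseError if the markup is not well-formed.
root = ET.fromstring(FEATURES_SSML)

# Collect the distinct element names, with the namespace prefix stripped.
tags_used = sorted({el.tag.replace(NS, "") for el in root.iter()})
print(tags_used)
print("default language:", root.get(XML_LANG))
```

The text inside the audio element is the fallback that would be synthesized if the recording cannot be played, and the xml:lang on the last sentence overrides the default set on the root.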
Where does SSML fit in the overall speech synthesis process?
Most text-to-speech (TTS) engines process their input in stages: structure analysis, text normalization, text-to-phoneme conversion, prosody analysis and waveform production. All of these can be enhanced by SSML elements. For example, p and s SSML elements mark paragraphs and sentences; say-as is useful for rendering special cases; sub is useful for expanding abbreviations; and so on.
Text normalization converts text into tokens suitable for speech. For example, '$200' would be converted to 'two hundred dollars'; '1/2' would become 'half'; and 'AAA' would become 'triple A'. These tokens are then converted into units of sounds called phonemes.
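To make the idea concrete, here is a toy normalizer in Python. It is only a sketch: the lookup tables and the function name normalize are assumptions made for this example, and they cover nothing beyond the three cases mentioned above. A real engine would need full number-to-words logic and locale-specific rules.

```python
import re

# Toy lookup tables covering only the examples from the text.
NUMBER_WORDS = {"200": "two hundred"}
FIXED_FORMS = {"1/2": "half", "AAA": "triple A"}

def normalize(text):
    """Replace a few written forms with their spoken equivalents."""
    def currency(match):
        amount = match.group(1)
        # Fall back to the raw digits when the amount is not in the table.
        words = NUMBER_WORDS.get(amount, amount)
        return f"{words} dollars"

    # '$200' style amounts: move the currency after the number words.
    text = re.sub(r"\$(\d+)", currency, text)
    # Simple whole-string substitutions for the remaining cases.
    for written, spoken in FIXED_FORMS.items():
        text = text.replace(written, spoken)
    return text

print(normalize("It costs $200"))
print(normalize("Add 1/2 a cup"))
```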
Speaking all words in the same tone or loudness creates monotony. Prosody is therefore useful to make speech more natural and intelligible. It draws attention to certain words by way of emphasis. Prosody is about volume, pitch, and rate of speech. We can specify the duration of a word and its pitch contour.
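As a rough illustration, prosody markup can be generated programmatically. The helper with_prosody below is hypothetical, written for this example only; the attribute names (rate, pitch, volume) and the label values used are taken from the W3C SSML specification, though support varies by synthesizer.

```python
import xml.etree.ElementTree as ET

def with_prosody(text, **attrs):
    """Wrap text in a prosody element carrying the given attributes."""
    element = ET.Element("prosody", attrs)
    element.text = text
    return ET.tostring(element, encoding="unicode")

# Slow, low-pitched and soft delivery for a single phrase.
print(with_prosody("Please slow down.", rate="slow", pitch="low", volume="soft"))
```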
What exactly is a phoneme and how does SSML use it?
Once tokens are obtained via text normalization, the synthesizer must replace each token with a sequence of phonemes. A phoneme is a unit of sound that distinguishes one word from another. For most cases, a dictionary lookup is adequate, but when there's ambiguity or non-standard pronunciation, say-as elements can be used. One example is "read", which has differing pronunciation based on verb tense. Another example is "Caius College", which should be pronounced as "keys college".
Phonemes are language dependent. US English typically has 45 phonemes, Hawaiian has 12-18, and some languages may have as many as 100.
The phoneme element has an attribute alphabet that must at least support "ipa" as a value, which refers to the alphabet of the International Phonetic Association (IPA). Other alphabets include Speech API Phone Set (SAPI), Universal Phone Set (UPS), IBM TTS, and Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA).
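A minimal sketch of the phoneme element for the "Caius" case mentioned earlier follows. The IPA string used for "keys" is this example's own transcription and should be treated as an assumption rather than an authoritative pronunciation. The Python code simply parses the markup and reads back the attributes.

```python
import xml.etree.ElementTree as ET

# The ph value is an assumed IPA rendering of "keys".
PHONEME_SSML = (
    '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">'
    'She studied at <phoneme alphabet="ipa" ph="kiːz">Caius</phoneme> College.'
    '</speak>'
)

NAMESPACES = {"ssml": "http://www.w3.org/2001/10/synthesis"}

# Locate the phoneme element and show what the synthesizer would receive.
phoneme = ET.fromstring(PHONEME_SSML).find("ssml:phoneme", NAMESPACES)
print(phoneme.get("alphabet"), phoneme.get("ph"), phoneme.text)
```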
Which are the tags defined in SSML?
Without being exhaustive, we mention a few important ones here. SSML is basically an application of XML. The root of an SSML document is speak. Using attributes, we can also specify the namespace and schema. Attribute xml:lang specifies the language.
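For instance, a root element with namespace, schema and language attributes might look like the sketch below. The namespace and schema location follow the examples in the W3C SSML 1.1 Recommendation; the greeting text is arbitrary. The Python code parses the document and reads the version and language back.

```python
import xml.etree.ElementTree as ET

SPEAK_SSML = """<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  Hello, world.
</speak>"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

speak_root = ET.fromstring(SPEAK_SSML)
# ElementTree reports namespaced names as {namespace-uri}local-name.
print(speak_root.tag)
print(speak_root.get("version"), speak_root.get(XML_LANG))
```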
To load a lexicon from a known URI, we can use the lexicon element. To translate tokens into phonemes using a specific lexicon, the lookup element is useful.
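The two elements work as a pair: in SSML 1.1, lexicon carries a uri and an xml:id, and lookup refers to that identifier through its ref attribute. The sketch below assumes a hypothetical lexicon URL and a hypothetical helper, unresolved_lookups, that reports any lookup whose ref does not match a declared lexicon.

```python
import xml.etree.ElementTree as ET

# The lexicon URL is a placeholder, not a real resource.
LEXICON_SSML = """<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">
  <lexicon uri="https://example.com/lexicons/colleges.pls" xml:id="colleges"/>
  <lookup ref="colleges">Caius College is in Cambridge.</lookup>
</speak>"""

NS = "{http://www.w3.org/2001/10/synthesis}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def unresolved_lookups(ssml):
    """Return the ref values of lookup elements that name no declared lexicon."""
    root = ET.fromstring(ssml)
    declared = {lexicon.get(XML_ID) for lexicon in root.iter(NS + "lexicon")}
    return [
        lookup.get("ref")
        for lookup in root.iter(NS + "lookup")
        if lookup.get("ref") not in declared
    ]

print(unresolved_lookups(LEXICON_SSML))
```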
The say-as element has interpret-as as a mandatory attribute. The standard doesn't specify values for this attribute. Typical implementations include address, cardinal, characters, date, digits, fraction, ordinal, telephone, time, expletive, unit, interjection, etc.
The sub element replaces the contained text with the alias attribute value for pronunciation.
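The say-as and sub elements can be sketched as follows. The sentence is invented, and the helper spoken_text is a hypothetical approximation written for this example: it substitutes sub aliases but leaves say-as content unchanged, since interpreting it is up to each synthesizer.

```python
import xml.etree.ElementTree as ET

SAY_AS_SUB_SSML = (
    '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    'Call <say-as interpret-as="telephone">5551234</say-as> about the '
    '<sub alias="World Wide Web Consortium">W3C</sub> meeting.'
    '</speak>'
)

def spoken_text(element):
    """Flatten an element to text, using the alias in place of sub content."""
    local_name = element.tag.rsplit("}", 1)[-1]
    if local_name == "sub":
        return element.get("alias", "")
    parts = [element.text or ""]
    for child in element:
        parts.append(spoken_text(child))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)

print(spoken_text(ET.fromstring(SAY_AS_SUB_SSML)))
```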
When we need to specify language for a phrase, we can use xml:lang. Where applicable, prefer to use the attribute with text structural elements.
The ph attribute of the phoneme element specifies phonemic/phonetic pronunciation. The attribute value doesn't go through text normalization or lexicon lookup.
Many elements control prosody, among them the prosody element.
Which are some real-world applications that use SSML?
Since 2019, the Guardian has been providing users with important news in audio as well. They use Google's text-to-speech API, with input that includes SSML. They noted that SSML parsing is slow: the API took about 8-10 seconds to generate the audio. Therefore, they opted to serve cached audio rather than generate it just in time.
In the UK, the NHS is using Amazon Polly to stream synthesized speech through telephone lines. This is a low-cost approach that uses widespread telephone networks to deliver healthcare remotely. A typical response latency of 60ms was observed. They use SSML, although many features are not yet used.
One blogger has suggested voice-based document reviews during long commutes. A document is converted into multiple MP3 files, each with different voice and cadence. AWS Lambda is used to convert the document to multiple SSML files. Another Lambda call triggers conversion of SSML files to MP3 files.
What tips can you give for content writers and developers working with SSML?
With SSML, content creators can miss a closing tag or double quotes for element attributes. In the world of HTML, this problem was solved by Markdown syntax. Likewise, a replacement for SSML is Speech Markdown. However, we need converters to SSML until synthesizers can natively support Speech Markdown.
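Until such tooling is in place, a quick well-formedness check can catch the mistakes mentioned above before the markup reaches a synthesizer. The sketch below uses Python's standard XML parser; the function name check_well_formed is an assumption of this example. It detects XML syntax errors only and does not validate against the SSML schema or any vendor's supported subset.

```python
import xml.etree.ElementTree as ET

def check_well_formed(ssml):
    """Return None if the SSML parses as XML, else the parser's error message."""
    try:
        ET.fromstring(ssml)
    except ET.ParseError as error:
        return str(error)
    return None

samples = [
    "<speak>Hi <emphasis>there</emphasis></speak>",   # well-formed
    "<speak>Hi <emphasis>there</speak>",              # missing closing tag
    "<speak>Wait<break time=500ms/></speak>",         # missing double quotes
]
for sample in samples:
    print(check_well_formed(sample) or "ok")
```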
An alternative is to use an SSML editor. Examples are from Verndale, PlayX-team at Swedish Radio and SSML Editor. You could even create your own SSML editor based on Sanity.io and React.js.
Content creators can refer to an Amazon Alexa SSML cheatsheet. YouTube audio library provides more than 5,000 free sounds that we can use in our SSML.
Among the open source speech synthesizers are FreeTTS (Java) and eSpeak (C). There are also commercial text-to-speech engines that support SSML, such as Cepstral and CereVoice. CereVoice includes a Scottish-accented female voice, a vocal gesture library and patented Emotional Synthesis.
What are some criticisms of SSML?
Although SSML is a W3C standard, not all its features are being supported by vendors. For example, IBM's text-to-speech service doesn't support or provides only partial support for many SSML elements or attributes. Google Assistant doesn't support the
Moreover, each vendor is introducing its own proprietary elements. Amazon Alexa's amazon:effect is proprietary. Google Assistant makes use of par, seq and media. These can be used to add background music, or to create containers that play media in sequence or in parallel. Elements par and seq are part of another W3C standard called Synchronized Multimedia Integration Language (SMIL).
Parsing SSML in real time could be slow. For content writers, writing in SSML could be cumbersome and they may prefer Speech Markdown in future.
Amy Isard at the University of Edinburgh completes her thesis on SSML with supervisor Paul Taylor. She describes SSML as an application of SGML. She also presents a prototype implementation that's understood by the CSTR Speech Synthesizer. This implementation includes phrase boundaries, emphasized words, specified pronunciations, and inclusion of other sound files. The concept of SSML was first introduced by Paul Taylor in 1992.
W3C organizes a workshop titled "Voice Browsers". The idea is to allow people with telephone connections to access Web content. This leads to the formation of Voice Browser Working Group (VBWG) in March 1999. These are the first steps towards the later standardization of SSML and related technologies.
One study compares many speech synthesizers in the market. Each supports different aspects of prosody. There's no mention of SSML in the report. TrueTalk is said to be using escape sequences, which is very system specific. Since each system uses its own proprietary annotations or escape sequences, such annotated input is not portable. SSML is an attempt to introduce a standard to solve this.
Voice Extensible Markup Language (VoiceXML) is published as a W3C Recommendation. VoiceXML is designed for "creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations". SSML elements can be used within the prompt element of VoiceXML. SSML elements such as say-as can have additional attributes when used within VoiceXML.