Greetings, Professor Falken: Web Speech APIs Part I

There's a long, rich history of humans figuring out how to make machines talk and listen, and for nearly as long as there has been a W3C there have been efforts to get these features into the browser. It's too complicated to recount here; that's a whole post of its own. Suffice it to say that it's been fits and starts, steps forward and back, and where we are right now is... muddy, but it works... sort of. In this series of posts, I'll tell you all about it...

The Current State of things

The bad news first: There have been several "official standards" that aren't ultimately giving us speech in browsers. There is no Speech API for the browser currently on any standards track.

The good news: There have been unofficial proposals and experiments that give us some (not insignificant) degree of speech in all modern browsers, and together we can change the 'unofficial' part if we want that.

In 2012, Google shipped an implementation of an unofficial API proposal developed in a community group. Other vendors have since shipped parts of mostly the same API despite it not being on a formal standards track.

The draft contains a number of errata and the implementations don't all agree entirely. Since it's unofficial and a Note, they aren't really updating the draft and making it better isn't exactly the highest priority.

There's some more good news in that this is a fairly low-level interface, allowing us to experiment atop it and inform standards, The Extensible Web Manifesto style. If we can renew interest, perhaps we can get it on a legitimate standards track and work out the kinks.

Utterly Basic

Probably the simplest thing you might want to do is have your page say something. You do this by creating a SpeechSynthesisUtterance for those words and handing that to the speechSynthesis.speak(...) method.

// I'll explain later why I'm misspelling Falken
speechSynthesis.speak(
  new SpeechSynthesisUtterance(
    'Greetings, Professor Faulken'
  )
)
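Not every environment exposes these globals, so it can be worth guarding the call. Here's a minimal sketch of that idea. The names canSpeak and say are ones I made up for illustration; they are not part of the API:

```javascript
// Hypothetical helpers (not part of the Web Speech API).
// True only when both globals used in the example above exist.
function canSpeak() {
  return typeof speechSynthesis !== 'undefined' &&
    typeof SpeechSynthesisUtterance !== 'undefined'
}

// Queues the text if speech is available and reports whether it did.
function say(text) {
  if (!canSpeak()) return false
  speechSynthesis.speak(new SpeechSynthesisUtterance(text))
  return true
}

// Usage: say('Greetings, Professor Faulken')
```

A false return value only tells you the globals are missing; a true one means the text was queued, not that it was actually heard.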

On this, everyone agrees, and yay, that is pretty easy! Congratulations, we have easily achieved 1983 War Games speech technology!

From here on, I'll just refer to SpeechSynthesisUtterance as an 'utterance', because that is a lot easier on the eyeballs.

Once you have constructed an utterance, you can mess with several properties before passing it off to be spoken. What we've seen here is just using defaults, and the defaults will vary from machine to machine. All you can really say for sure is that those words will be said (well... almost for sure, but we'll come back to that).

Pitch, rate and volume

The most basic properties are:

.pitch (a float from 0 to 2, defaults to 1)
.rate (a float from 0.1 to 10, defaults to 1)
.volume (a float from 0 to 1, defaults to 1).

If these ranges seem unnecessarily arbitrary and confusing to you, all I can say is: same.
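Since the three ranges all differ, one option is to keep them in a single table and clamp whatever you're given. This is just a sketch under that assumption; SPEECH_RANGES and clampSpeechOptions are hypothetical names of mine, and the numbers are simply the ones listed above:

```javascript
// Ranges as listed above: [min, max, default].
const SPEECH_RANGES = {
  pitch: [0, 2, 1],
  rate: [0.1, 10, 1],
  volume: [0, 1, 1]
}

// Hypothetical helper: returns a new object with pitch, rate and volume
// forced into range. Missing or non-numeric values get the default.
function clampSpeechOptions(options = {}) {
  const result = {}
  for (const [name, [min, max, fallback]] of Object.entries(SPEECH_RANGES)) {
    const raw = options[name]
    result[name] = typeof raw === 'number' && Number.isFinite(raw)
      ? Math.min(max, Math.max(min, raw))
      : fallback
  }
  return result
}

// Usage in a browser:
// let u = new SpeechSynthesisUtterance('hello')
// Object.assign(u, clampSpeechOptions({ pitch: 3 })) // pitch ends up as 2
```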

Let's try varying it a little and pay tribute to Freddie Mercury...

let one = new SpeechSynthesisUtterance(
    `My money, that's all you wanna talk about.`
)
one.pitch = 2
speechSynthesis.speak(one)

let two = new SpeechSynthesisUtterance(
    `But I'm no fool`
)
two.pitch = 1.5
speechSynthesis.speak(two)


let three = new SpeechSynthesisUtterance(
    `It's in the lap of the Gods`
)
three.pitch = 0.25
three.rate = 0.5
speechSynthesis.speak(three)

Queueing

The speak(...) method is asynchronous. It returns immediately and speechSynthesis manages a queue of utterances that are intended to be spoken.

speechSynthesis is responsible for the queue and exposes some queue-related methods, like .pause(), .resume() and .cancel() (which clears the queue), all of which are async. Now, this can get a little confusing if you don't understand what's happening, especially since it's a little buggy (a lot in some cases).

According to the draft, on .speak()'ing an utterance: "The SpeechSynthesis object takes exclusive ownership of the SpeechSynthesisUtterance object. Passing it as a speak() argument to another SpeechSynthesis object should throw an exception." In practice though, no one seems to have implemented this, and it doesn't throw. What this means is that, up until some point, you can still modify (certain) properties of the utterance in the queue as long as you have a reference to it, but what happens depends on whether it's been processed at all yet.

Let's look at a kind of circuitous example with some Neil Young lyrics that illustrates how this can get confusing...


let utteranceOne = new SpeechSynthesisUtterance(
      `My My, Hey Hey`
    ),
    utteranceTwo = new SpeechSynthesisUtterance(
      `Rock and Roll is here to stay`
    ),
    utteranceThree = new SpeechSynthesisUtterance(
      `It's better to burn out`
    ),
    utteranceFour = new SpeechSynthesisUtterance(
      `Than to fade away`
    )

speechSynthesis.speak(utteranceOne)

/* Changing what we've asked to be
  spoken after queueing.. It should
  throw, but it doesn't.  Will
  it be higher pitched? */
utteranceOne.pitch = 2

speechSynthesis.speak(utteranceTwo)

/* Changing this one too, will it? */
utteranceTwo.pitch = 2

speechSynthesis.speak(utteranceThree)
speechSynthesis.speak(utteranceFour)

console.log(
  `this will log immediately
   before anything is spoken
   because it's async`)

/* pause will actually happen before
  anything can be spoken */
speechSynthesis.pause()

setTimeout(() => {
    // this will begin processing again
    // in ~ 2 seconds..
    // but what will the pitch of each be?
    console.log(`resuming...`)
    speechSynthesis.resume()
}, 2000)

Did it work? Hard to say, see the notes below. Don't let any of this confuse you into trying to change an utterance once you've sent it to speak, just consider it immutable from your perspective and your life will be a lot happier.
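One way to stick to that advice is to finish configuring every utterance before it ever reaches the queue. A rough sketch, where speakLines is a hypothetical helper name and the settings object is assumed to hold only writable utterance properties like pitch, rate, volume or lang:

```javascript
// Hypothetical helper: builds one fully configured utterance per line
// and only then queues it, so nothing is modified after speak().
function speakLines(lines, settings = {}) {
  return lines.map((line) => {
    const utterance = new SpeechSynthesisUtterance(line)
    // Apply every setting up front (e.g. { pitch: 2 })
    Object.assign(utterance, settings)
    speechSynthesis.speak(utterance)
    return utterance
  })
}

// Usage in a browser:
// speakLines(['My My, Hey Hey', 'Rock and Roll is here to stay'], { pitch: 2 })
```

The returned array is there so you can keep references around, not so you can change the utterances later.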

In Chrome (desktop, mac) at the time of this writing, the speechSynthesis queue appears to be effectively shared at a very low level, meaning that it carries over between requests and across domains. If anyone has called speechSynthesis.pause(), whatever you add to the queue won't get spoken until it is unpaused, and then it is subject to coming after whatever is already in the queue, no matter where it came from. I'm not going to demo this as it's likely to only cause confusion, but if something suddenly isn't working and your browser supports speech, try running speechSynthesis.resume() in the console and see what happens.

If you're on Chrome on Android, this probably didn't work, and worse: if you go back and run the first example, it won't work either. .resume() doesn't actually seem to work. Don't panic, you can get unstuck by calling speechSynthesis.cancel() to clear the queue. Unfortunately, I swear I have seen cancel actually resume and speak out stuff that was queued, but I can't find a case that illustrates it.
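If you find yourself doing that a lot, the two recovery steps mentioned above can be folded into one function. This is a sketch, not a guaranteed fix: resetSpeech is a hypothetical name, and given the quirks described here I wouldn't assume it behaves identically everywhere.

```javascript
// Hypothetical helper: clears whatever is queued and makes sure the
// synthesizer isn't left paused. `paused` is a boolean attribute
// on speechSynthesis.
function resetSpeech() {
  speechSynthesis.cancel()
  if (speechSynthesis.paused) {
    speechSynthesis.resume()
  }
}

// Usage in a browser console: resetSpeech()
```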

Utterance Events

Because everything is async, if you want to do something like sync up some UI with spoken text, you'll need to know when things happen by watching for events on the utterance. Utterances have several events; the ones you probably care about most are:

onstart "Fired when this utterance has begun to be spoken"
onend "Fired when this utterance has completed being spoken. If this event fires, the error event must not be fired for this utterance."
onerror "Fired if there was an error that prevented successful speaking of this utterance. If this event fires, the end event must not be fired for this utterance" We'll come back to this one...

Let's try it out...


let utteranceOne = new SpeechSynthesisUtterance(
      `We come from the land of the ice and snow`
  ),
  utteranceTwo = new SpeechSynthesisUtterance(
      `From the midnight sun where the hot springs flow`
  ),
  syncUIHandler = (event) => {
    // There is an event.utterance in chrome,
    // but that seems to be non-standard,
    // you want event.target which is the
    // utterance object (though it should be read-only)
    document.querySelector('#zepplin-out').innerText = event.target.text
  }

  utteranceOne.onstart = syncUIHandler
  utteranceTwo.onstart = syncUIHandler

  utteranceTwo.onend = () => {
    console.log('done')
  }

speechSynthesis.speak(utteranceOne)
speechSynthesis.speak(utteranceTwo)


Assuming that that worked for you: Yay! Couldn't be simpler, right? Got your head well around that? Good... Ok, now let me explain why you probably don't.

I think that most people would associate the words "when this utterance has begun to be spoken" with "some kind of sound has begun to flow out of the speakers", but this isn't the case. If we added a speechSynthesis.pause() to the end of that code sample, the first utterance's .onstart would be called, and the words would appear on the screen, but no sound would play. This is universal in all the browsers. Luckily, there are also .onpause and .onresume events (but recall, if pause didn't work for you above, this one won't either). Let's see...


let utteranceOne = new SpeechSynthesisUtterance(
      `Maybe it's not too late`
  ),
  utteranceTwo = new SpeechSynthesisUtterance(
      `To learn how to love, and forget how to hate`
  ),
  syncUIHandler = (event) => {
    let el = document.querySelector('#ozzy-out')

    if (event.type == 'pause') {
      el.innerText = 'Paused'
    } else {
      // Chrome has an event.utterance object,
      // but other browsers don't
      el.innerText = event.target.text
    }
  }

  utteranceOne.onpause = syncUIHandler
  utteranceOne.onstart = syncUIHandler
  utteranceOne.onresume = syncUIHandler
  utteranceTwo.onstart = syncUIHandler
  utteranceTwo.onpause = syncUIHandler
  utteranceTwo.onresume = syncUIHandler

speechSynthesis.speak(utteranceOne)
speechSynthesis.speak(utteranceTwo)
speechSynthesis.pause()

setTimeout(() => {
    speechSynthesis.resume()
}, 2000)


Did it work? Here's some bad news... there are a lot of notes here:

If you're on Android and nothing is happening, you might have missed the note above about .pause() not working right there. Call speechSynthesis.cancel() to clear the queue and start over (don't you wish that worked on the whole world?).

At least in Chrome (that's just the one I use most) it appears to be very easy to create situations where callbacks are not called, despite it being blatantly obvious that the events happened. There have been a couple of bugs opened about this. In my experience, this seems to be closely related to either the 'shared queue' note above or to garbage collection cleaning up utterance objects before their events have actually fired, which longer text seems to make more likely. Either way, you've got to deal with all that yourself.
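If garbage collection really is part of the problem, one thing to try is holding on to each utterance yourself until it finishes. I can't promise this cures the missing callbacks; it's a sketch of the idea, and pendingUtterances and speakRetained are names I invented for it:

```javascript
// Utterances that have been queued but haven't finished yet. Keeping them
// here holds a reference until their end or error event fires.
const pendingUtterances = new Set()

// Hypothetical helper: speaks `text` and calls onDone(errorOrNull) once.
function speakRetained(text, onDone = () => {}) {
  const utterance = new SpeechSynthesisUtterance(text)
  const finish = (error) => {
    // delete() returns false if we already finished, so onDone runs once
    if (!pendingUtterances.delete(utterance)) return
    onDone(error)
  }
  utterance.onend = () => finish(null)
  utterance.onerror = (event) => finish(event.error || 'unknown')
  pendingUtterances.add(utterance)
  speechSynthesis.speak(utterance)
  return utterance
}

// Usage in a browser:
// speakRetained('We come from the land of the ice and snow', (error) => {
//   console.log(error ? `failed: ${error}` : 'done')
// })
```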

One more thing... If .onerror is called, the event it receives will have an .error attribute (or, it's supposed to) which should be an error code indicating why. The draft lists 11 'kinds' of problems. They are informative about the kinds of things you should be thinking about, but some of them don't seem to happen IRL, and they leave gaps where the spec behavior is undefined.
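For reference, here are those codes in a form you can check against. The list is the draft's; the split into 'expected' and 'failure' is purely my own assumption about what an app might want, and classifySpeechError is a hypothetical name:

```javascript
// The 11 error codes listed in the draft.
const SPEECH_ERROR_CODES = [
  'canceled', 'interrupted', 'audio-busy', 'audio-hardware', 'network',
  'synthesis-unavailable', 'synthesis-failed', 'language-unavailable',
  'voice-unavailable', 'text-too-long', 'invalid-argument'
]

// The draft defines both of these as consequences of a cancel() call,
// so an app may not want to report them as failures (my assumption).
const EXPECTED_CODES = ['canceled', 'interrupted']

// Hypothetical helper: returns 'expected', 'failure' or 'unknown'.
function classifySpeechError(event) {
  const code = event && event.error
  if (!SPEECH_ERROR_CODES.includes(code)) return 'unknown'
  return EXPECTED_CODES.includes(code) ? 'expected' : 'failure'
}

// Usage in a browser:
// utterance.onerror = (event) => console.log(classifySpeechError(event))
```

The 'unknown' branch is there because, as noted, the .error attribute isn't always what it's supposed to be.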

Expressiveness

Speech is complicated if you want it to seem natural. In my History of Speech I link to a recording of Bell Labs' Voder in the 1940s demonstrating the many ways one might intone the phrase "she saw me". Similarly, since we're just using strings of text, we're theoretically subject to all sorts of potential problems with ambiguous strings. For example, given the string "12/4 = 3" we'd probably like that to be pronounced "twelve divided by four equals three", whereas we might like "12/12/2012" to be pronounced like "December twelfth, twenty twelve".

The first bit of good news is that all of the speech synthesis implementations I've tried actually do a pretty good job of using punctuation to inflect and pause, and frequently can deal with a lot of those problems automatically... For example, all of the below pronounce sensibly for me without further effort:

speechSynthesis.speak(
  new SpeechSynthesisUtterance(`
    1. Pi is about 3.14
    2. We loaded the 4x4
    3. Please meet me at 3.14pm EST
    4. My birthday is 2/17/1974
  `)
)

But it's not bulletproof, and it's automatic, not consistent. To some, this is unacceptable. The first example in this document purposely misspelled the name of Professor Falken from War Games because the best guess at the real spelling sounds like an expletive.

In my history article I also point to several "official" standards that were developed in the W3C to deal with just these sorts of problems. I'm mentioning this here because the Speech API says that the text provided to an utterance can be either plain text or a well-formed SSML (Speech Synthesis Markup Language) document. SSML lets you express all kinds of stuff: it would allow you to create a single utterance which varied all sorts of things throughout and let you express how you wanted each bit to be said. The spec gives implementations a way out, saying that if the implementation doesn't support SSML, it has to strip the tags and just speak the text content. That's a pretty elegant, progressive compromise I suppose. Well done standards!

It also completely doesn't work that way. All the implementations I've looked at will actually try to pronounce all the markup. That's bad. I suggest you just avoid that.
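If you're handed SSML anyway, you could do the tag stripping yourself before speaking. A deliberately naive sketch (stripMarkup is a made-up name, and this is a regular expression, not an XML parser):

```javascript
// Hypothetical helper: removes anything that looks like a tag and tidies
// whitespace. Naive on purpose: it won't handle comments, CDATA,
// entities, or a '>' inside an attribute value.
function stripMarkup(ssml) {
  return ssml
    .replace(/<[^>]*>/g, '')
    .replace(/\s+/g, ' ')
    .trim()
}

// Usage in a browser:
// speechSynthesis.speak(new SpeechSynthesisUtterance(
//   stripMarkup('<speak>Hello <emphasis>world</emphasis>!</speak>')
// ))
```

Note that you lose everything the markup was trying to express, which is exactly the compromise the spec describes.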

The really good news though is that you can actually achieve much of the same control by doing your own processing and creation of utterances.
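Here's a sketch of what that might look like: a respelling table for words the synthesizer gets wrong, plus a list of segments that each carry their own settings. Everything named here (RESPELLINGS, respell, speakSegments) is hypothetical, and the only respelling I can vouch for is the Falken one from the first example:

```javascript
// Hypothetical respelling table: written form -> what to actually speak.
const RESPELLINGS = { Falken: 'Faulken' }

// Replaces whole-word matches from the table; other text is untouched.
function respell(text, table = RESPELLINGS) {
  return Object.entries(table).reduce((result, [word, spoken]) => {
    // Escape regex metacharacters so the word is matched literally
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
    return result.replace(new RegExp(`\\b${escaped}\\b`, 'g'), () => spoken)
  }, text)
}

// Each segment becomes its own utterance with its own settings, roughly
// standing in for what inline markup would have expressed.
function speakSegments(segments) {
  segments.forEach(({ text, ...settings }) => {
    const utterance = new SpeechSynthesisUtterance(respell(text))
    Object.assign(utterance, settings)
    speechSynthesis.speak(utterance)
  })
}

// Usage in a browser:
// speakSegments([
//   { text: 'Greetings, Professor Falken.', pitch: 0.8 },
//   { text: 'Shall we play a game?', rate: 0.9 }
// ])
```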

Language

Utterances also have a .lang property which you can set. It has to be a string representing a BCP 47 language tag (like en-US). If you don't set it, it will by default be the language of the document. This can give hints to pronunciation. Let's try an example from The Godfather:

let apollonia = new SpeechSynthesisUtterance(
   `io so l'inglese:
      Monday Tuesday Thursday Wednesday
      Friday Sunday Saturday
    `
  )
apollonia.pitch = 1.1
apollonia.lang = 'it-US'
speechSynthesis.speak(apollonia)

If you set the .lang to an invalid language, say, for example "Hobbit" or "Vulcan", it will simply use the default. I couldn't find that in the spec anywhere, but that seems to be the behavior in Chrome.

A million voices cried out

Let's talk about the most painful part of this: voices. You might have noticed that the example above from The Godfather sounded quite a lot different - it was using a different SpeechSynthesisVoice, automatically.

From here out, I'm going to refer to SpeechSynthesisVoice as simply "voice" because, well, that's a mouthful.

You can also set the voice yourself, but there are a number of gotchas here...

.getVoices(...)

You have to get the list of voices supported by your particular speechSynthesis by calling speechSynthesis.getVoices(). The bad news is that if you check the length of the array it returns while the page is still loading, odds are that you'll see 0.

That's because the list is populated asynchronously, and you have to wait for the voiceschanged event to ask for the list of voices. Here's a contrived example that should set the content of an element to the name of each voice in turn (so the result you'd see would be the last one in the Array):


(function () {
    speechSynthesis.onvoiceschanged = function () {
       let el = document.querySelector('#voices-changed-1')
       // Hey, at least it's a real Array! (despite the spec)
       speechSynthesis.getVoices().forEach((voice) => {
          el.innerHTML = voice.name
       })
    }
}())


Very, very likely nothing is happening for you no matter how many times you run that. Why? Because it's a race, as near as I can tell: the voices were already loaded by the time you ran it, so you missed the event, and it's up to you to manage that.
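One way to manage the race is to check for voices first and only fall back to the event if the list is still empty. A minimal sketch; whenVoicesReady is a hypothetical name, and note that it takes over onvoiceschanged, so it would clobber any handler you had already set:

```javascript
// Hypothetical helper: calls callback(voices) as soon as a non-empty
// list of voices is available, whether or not the event already fired.
function whenVoicesReady(callback) {
  const voices = speechSynthesis.getVoices()
  if (voices.length > 0) {
    // Already loaded: we missed the event, so don't wait for it
    callback(voices)
    return
  }
  speechSynthesis.onvoiceschanged = () => {
    const loaded = speechSynthesis.getVoices()
    if (loaded.length === 0) return // still nothing, keep waiting
    speechSynthesis.onvoiceschanged = null
    callback(loaded)
  }
}

// Usage in a browser:
// whenVoicesReady((voices) => console.log(voices.map((v) => v.name)))
```

If a browser never populates the list and never fires the event, the callback simply never runs; adding a timeout would be up to you.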

SpeechSynthesisVoice objects

Assuming that you manage to get the array of voice objects, each has several properties:

.default (a boolean indicating if this is the default voice currently being used)
.lang (a string holding the BCP 47 language tag of the voice)
.localService (a boolean indicating basically whether this uses the network to produce speech or works locally)
.name (a humanly readable string name identifier)
.voiceURI (a URI reference string)

According to the draft, you can set the .voice attribute of an utterance to any of these voices and that should cause the utterance to be spoken in that voice. The idea is that you could find a voice in the set and use it to make it sound nifty. Let's look at an example using a voice called "Bubbles" to speak the lyrics of a song by Jack Johnson:


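// getVoices() may return an empty array if the voices haven't loaded yet
let voice = speechSynthesis.getVoices().find((voice) => {
    return voice.name === 'Bubbles'
  }),
  lyrics = new SpeechSynthesisUtterance(
      `It's as simple as something that nobody knows
       That her eyes are as big as her bubbly toes`
  )

// Only set the voice if 'Bubbles' exists here; otherwise keep the default
if (voice) {
  lyrics.voice = voice
}
speechSynthesis.speak(lyrics)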

Did it work for you? Hard to know because, here's the bad news: The list of voices you get back will vary depending on the browser and OS pretty wildly. On my mac I get a list of 60 voices in Safari v7 (don't ask), 43 voices in Chrome v60, and 24 voices in my Firefox Developer Edition. This voice doesn't exist on my current Chrome on Android.

Worse still, though, the shared speechSynthesis on Android means that it's going to speak in whatever voice was last chosen, pretty much anywhere. If you ran The Godfather example just before this on Android, it probably just spoke those lyrics in an Italian accent. Sweet. Just kidding, this is terrible and caused me many, many problems.

Setting the .voice attribute on the utterance might be what's documented, and it might even be what we just (probably?) successfully ran, but what you've got to set if you want it to work in Android is the voiceURI attribute on the utterance, which isn't even listed in the list of attributes in the draft or most documentation I've read. Sadly, neither seems to work for me universally, so set both if you're choosing a voice... And... good luck finding a way to select a good voice. Finally, not all voices support modification of things like rate and pitch.

Another quick note about the .voiceURI: "Bob" is a valid URI reference. So is the empty string. A URL is a kind of URI. So, basically what I am saying is: It could be pretty wildly variant, and it is, even for voices with the same name.
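Pulling those voice notes together, here's one hedged way to choose and apply a voice: try the name first, fall back to a language match (each voice also exposes a lang tag), and set both attributes on the utterance. pickVoice, applyVoice and normalizeTag are hypothetical names, and normalizing case and separators is a defensive assumption on my part:

```javascript
// Normalizes a language tag for comparison ('en_US' -> 'en-us').
const normalizeTag = (tag) => String(tag).replace(/_/g, '-').toLowerCase()

// Hypothetical helper: prefers an exact name match, then the first voice
// whose lang matches exactly or by prefix ('it' matches 'it-IT').
function pickVoice(voices, { name, lang } = {}) {
  const byName = voices.find((voice) => voice.name === name)
  if (byName) return byName
  if (!lang) return undefined
  const wanted = normalizeTag(lang)
  return voices.find((voice) => {
    const tag = normalizeTag(voice.lang)
    return tag === wanted || tag.startsWith(wanted + '-')
  })
}

// Sets both attributes, per the Android note above. Returns false and
// leaves the utterance alone if no voice was found.
function applyVoice(utterance, voice) {
  if (!voice) return false
  utterance.voice = voice
  // voiceURI on the utterance isn't in the draft's attribute list
  utterance.voiceURI = voice.voiceURI
  return true
}

// Usage in a browser, once the voices have loaded:
// let lyrics = new SpeechSynthesisUtterance('Bubbly toes')
// applyVoice(lyrics, pickVoice(speechSynthesis.getVoices(),
//   { name: 'Bubbles', lang: 'en' }))
// speechSynthesis.speak(lyrics)
```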

Wrapping up

So... It's pretty cool stuff, but it's a little like writing anything during the browser wars. You'll want abstractions to paper over the warts and help you deal with all the quirks, and probably just to make it not hellish to write with. There's a bunch of stuff I didn't cover, for example, there are 'boundary' callbacks and a few other things, but I think this gives you a lot. In my next piece I'll talk a little about papering over some of the warts, and then we will get into listening to speech.

Special thanks to my friend, the great Chris Wilson for proofing/commenting on pieces in this series.
