Author Information

Brian Kardell
  • Developer Advocate at Igalia
  • Original Co-author/Co-signer of The Extensible Web Manifesto
  • Co-Founder/Chair, W3C Extensible Web CG
  • Member, W3C (OpenJS Foundation)
  • Co-author of HitchJS
  • Blogger
  • Art, Science & History Lover
  • Standards Geek
Posted on 08/24/2017

You Don't Say: Web Speech APIs Part II

This is part of a series about making the browser speak and listen to speech. In my last post Greetings, Professor Falken: Web Speech APIs Part I, I talked about the existing APIs we "have" for speaking (as well as why the air quotes) and all the many ways that they are wonky today. In this post I'll share some of my own opinions on that API, as well as how I'm dealing with the speaking part IRL.

So, we have these wonky APIs implemented inconsistently in a lot of browsers. We know how to deal with that; we've done it before. I'd like to review our options to explain how I think it makes sense to approach this:

What's wrong with the current API?

Note: These are just my personal opinions after using it a lot for the past couple of months.

Let's assume that the API worked consistently, as advertised by the existing draft, across browsers and devices. Would it be what we really want or need? I think that, actually, the answer is no.

How low can you go?

On the one hand, it's intended to be a low-level API, but here I feel like it still falls a little short. Speech is sound, and yet it's not clear to me how this really fits into the audio architecture. There was very brief discussion of this, but not addressing it seems like a failure to me. For example, here's a nifty demo from MDN that lets you add effects to your voice, with input from the mic, by connecting it to an audio stream which gets processed to add things like reverb. However, neither the draft nor the implementations seem to have any way to intercept the audio stream generated by speechSynthesis. This seems like a shame, because the set of things you could do there is nearly unbounded, whereas the set of things you can do with the existing levers is exceptionally finite. In my own use cases, I really wanted those abilities. Similarly, the handiest of those levers (pitch, rate, volume) overlap with that kind of processing. Surely there is something 'down there' that deserves explanation, and if these were done 'in post' there's no reason that any voice shouldn't support them. So, I think we need more there.

It seems to me there could easily be an 'audiocreated' callback or something which gave you a chance to deal with the audio itself instead of just piping it to the speakers.
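To make that concrete, here's the rough shape of what I mean. Note that the 'audiocreated' event and the stream it hands you are entirely hypothetical - nothing like them exists in the draft or in any implementation - but if they did, you could route synthesized speech into the Web Audio graph the same way that MDN demo routes the mic:

// HYPOTHETICAL: no 'audiocreated' event exists today; this is just
// a sketch of the shape such a hook could take.
const audioCtx = new AudioContext()
const convolver = audioCtx.createConvolver() // e.g. to add reverb 'in post'

const utterance = new SpeechSynthesisUtterance('Hello there')
utterance.addEventListener('audiocreated', (event) => {
    // Imagine being handed the generated audio as a MediaStream
    // instead of it going straight to the speakers...
    const source = audioCtx.createMediaStreamSource(event.stream)
    source.connect(convolver)
    convolver.connect(audioCtx.destination)
})
speechSynthesis.speak(utterance)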

Off to the races

Ok, so it's not the low-level API I'd like it to be, but there's a lot of room for improvement at higher levels too...

As I explained, .getVoices() is weird. It's kind of typical Web weird, in the way that listeners for events like DOMContentLoaded are also historically racey. The rationale, I believe, is that remote voices may or may not be available, and that can actually change throughout the duration of your program. If you've ever used Google Maps Navigation you might have heard the speech change - I believe that's what's going on there. But it feels like we should be past this: we've had Promises for some time and the two can co-exist. Most of the time what we really want to know is "have some voices loaded or not", and managing and guarding against that is unnecessarily complicated. You can't put something into the queue with speechSynthesis.speak() until you have an utterance. You can't create an utterance until you have voices. You can't simply grab voices because they might not be loaded yet. You can't listen for voices to change if they are already loaded. And, in theory, even once something is queued you could lose it before it is actually processed.

This is, in my opinion, a lot to ask the average author to understand and get right.
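Just to illustrate, here's roughly the kind of guard you end up writing today simply to get voices reliably (a sketch, not any spec or library code):

// A sketch of the boilerplate needed just to get voices reliably.
function voicesReady() {
    return new Promise((resolve) => {
        let voices = speechSynthesis.getVoices()
        if (voices.length) {
            // Some engines have them ready immediately...
            resolve(voices)
        } else {
            // ...others make you wait for an event that may or may
            // not fire depending on the browser.
            speechSynthesis.addEventListener('voiceschanged', () => {
                resolve(speechSynthesis.getVoices())
            }, { once: true })
        }
    })
}

voicesReady().then((voices) => {
    // only now is it safe to pick a voice and build an utterance
})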

It seems to me that since we can't 'speak' without a voice, whatever logic you are going to use to choose a voice from the available set should be available to the queue itself, so that it can choose at the moment just before creating the audio. This would solve a whole lot. It also seems to me that speechSynthesis.speak() should return a promise representing completion or failure of the whole transaction (basically, end or error). For many common cases, this is all we really care about, and it's kind of analogous to the fact that XHR exposes many 'shades of gray' status changes, but the most successful AJAX abstractions had a simpler pass/fail promise on the entire transaction.
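You can wrap that last bit yourself today; something along these lines is all I mean (just a sketch):

// A sketch of the promise-returning speak() I'm describing: resolve
// on 'end', reject on 'error', for the whole transaction.
function speakAsync(utterance) {
    return new Promise((resolve, reject) => {
        utterance.addEventListener('end', resolve)
        utterance.addEventListener('error', reject)
        speechSynthesis.speak(utterance)
    })
}

speakAsync(new SpeechSynthesisUtterance('Shall we play a game?'))
    .then(() => console.log('done'))
    .catch(() => console.log('something went wrong'))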

Immutable?

The fact that you create an instance of an utterance and then set properties seems to lead to a lot of confusion, and it's kind of hard to read too. It definitely seems to me that it would be very handy for it to take an options/config object like MutationObserver does, and for those properties to then be immutable. Generally speaking, I'm not sure what you would hope to 'do' with an utterance once it has been created except put it in the queue or take it out of the queue, and having it present you with an API that throws at some times and not others seems to be begging for classes of problems we don't really need.
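In other words, instead of the current 'create, then mutate' dance, I'd rather be able to hand it everything up front. The second form below is purely hypothetical (and someVoice just stands in for a voice you've already located):

// Today: create an utterance, then mutate its properties one by one.
let current = new SpeechSynthesisUtterance('Hello')
current.voice = someVoice
current.pitch = 1.2
current.rate = 0.8

// HYPOTHETICAL: a config object up front, immutable afterwards,
// in the same spirit as MutationObserver's options.
let proposed = new SpeechSynthesisUtterance('Hello', {
    voice: someVoice,
    pitch: 1.2,
    rate: 0.8
})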

Reuse

Think for a moment of the choice of voice, the pitch, rate, and volume as a "speaker". In my own uses I found that I wanted to reuse these - either as a single speaker I used all the time, or as a few speakers that I wanted to variously "say" things. In the current incarnation, this was more painful than I would like because I had to build things up with an awareness of all the quirks above.

What I did

So, for me, I ended up building something very close to the above that allows me to do simple things simply... Here are some examples.

Look, just say something...

The simplest case is that you just want to say something and you don't care about anything else. To be honest, that was pretty simple before too, unless you wanted to say something long (there were problems with that), or you cared that it might be defaulting you to some voice that wasn't even in the right language. Here's what it looks like for me: it works with long text by chopping it up at hopefully sensible points and chaining the pieces together, and by default it selects a voice that is at least in the same major language as the document.


new BasicVoiceSpeaker().say(`
    To be, or not to be, that is the question:
    Whether 'tis nobler in the mind to suffer
    The slings and arrows of outrageous fortune,
    Or to take Arms against a Sea of troubles,
    And by opposing end them: to die, to sleep
    No more; and by a sleep, to say we end
    the heart-ache, and the thousand natural shocks
    that Flesh is heir to? 'Tis a consummation
    devoutly to be wished. To die, to sleep,
    To sleep, perchance to Dream; aye, there's the rub.
`)
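The chopping itself is nothing magical; conceptually it's just splitting on sentence-ish boundaries and queuing the pieces in order. Something like this sketch (not the actual implementation) captures the idea:

// A rough sketch of chopping long text at sentence-ish boundaries
// so each piece stays short enough for the engine - not the real code.
function chunkText(text, maxLength = 200) {
    let sentences = text.match(/[^.!?;:]+[.!?;:]*/g) || [text]
    let chunks = []
    let current = ''
    sentences.forEach((sentence) => {
        if (current && (current + sentence).length > maxLength) {
            chunks.push(current.trim())
            current = ''
        }
        current += sentence
    })
    if (current.trim()) { chunks.push(current.trim()) }
    return chunks
}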
                

Can you take me higher?

But I can configure my speaker as described above and reuse it. The calls queue in order.


let speaker = new BasicVoiceSpeaker({
    rate: 0.8,
    pitch: 1.2
})
speaker.say(`Greetings, Professor Falken.`)
speaker.say(`Would you like to play a game?`)
                

Voices

Of course, all the crap with loading the voices and finding one is what you really want to be easier, so there is a .filter property that passes you the voices at the right time and lets you use whatever means you like to find and return the one you want. If it doesn't return anything, it'll just use the first voice that matches the preferred language of the document. If it can't find even that, it'll just pick the first voice.

let joshua = new BasicVoiceSpeaker({
    rate: 0.8,
    pitch: 1.2,
    filter: (voices) => {
        // keeping this overly simple for the example
        return voices.find((v) => { return /vox/.test(v.name) })
    }
})
joshua.say('A strange game.')
joshua.say('The only winning move is not to play...')
joshua.say('How about a nice game of chess?')
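For clarity, the fallback behavior I described boils down to something like this (a simplified sketch of the idea, not the library's actual code):

// Simplified sketch of the voice resolution order described above.
function resolveVoice(voices, filter) {
    // 1. Give the author-provided filter first crack at it...
    let choice = filter ? filter(voices) : null
    // 2. ...fall back to a voice in the document's major language...
    if (!choice) {
        let lang = document.documentElement.lang || navigator.language || 'en'
        let major = lang.split('-')[0].toLowerCase()
        choice = voices.find((v) => v.lang.toLowerCase().startsWith(major))
    }
    // 3. ...and finally just take the first voice there is.
    return choice || voices[0]
}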

For what it's worth though, I found that most of the time what I really want to do is test against a number of regexps in preferential order, searching for something, and that's verbose and redundant... To be honest, I'm not really sure what else you could do. Because of this, I added some sugar where you can pass an array of objects, each giving the property you want to check (name or lang) and a regular expression to test it against. It uses these to search, in order, and just avoids all that redundant/potentially complicated/verbose function writing. What I'd really write, most of the time, is something more like this..


let speaker = new BasicVoiceSpeaker({
    // Here are several options, in order
    filter: [{name: /vox/}, {name: /noids/}, {name: /en-us/i}],
    pitch: 1.2
})
speaker.say(`D.O.D. pension files indicate current mailing as:`)
speaker.say(`Dr. Robert Hume, a.k.a. Stephen W. Falken, 5 Tall Cedar Road, Goose Island, Oregon 97`)
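Under the hood, that sugar only really needs to walk the criteria in order and test the named property against its regular expression - roughly like this sketch (not the actual code):

// Sketch of matching an ordered array of criteria like
// [{name: /vox/}, {lang: /en-US/}] against the available voices.
function matchByCriteria(voices, criteria) {
    for (let criterion of criteria) {
        for (let [property, pattern] of Object.entries(criterion)) {
            let match = voices.find((v) => pattern.test(v[property]))
            if (match) { return match }
        }
    }
}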
                

Queuing and Promises

Each call to .say(..) also returns a promise, which is resolved when the entire text is spoken (even if it had to be chopped up), so you can mostly write a synchronous-looking block of calls that do what you expect them to do, but do it async - and you can still do whatever larger coordination with those promises too...


// Just a simple conversation back and forth between
// two actors: An en-US person and one I've given
// several acceptable voices to search for, in order

let englishSpeaker = new BasicVoiceSpeaker({
    filter: [{ lang: /en-US/ }]
})

let italianSpeaker = new BasicVoiceSpeaker({
    filter: [
        {lang: /it/},
        {name: /victoria/},
        {name: /vicki/},
        {name: /female/}
    ]
})

Promise.all([
    englishSpeaker.say(`How do you say "Good evening" in Italian?`),
    italianSpeaker.say(`We say "Buonasera"`),
    englishSpeaker.say(`Baynostera?`),
    italianSpeaker.say(`No, "Buonasera", try again.`),
    englishSpeaker.say(`Bwohnah sehrah?`),
    italianSpeaker.say(`si!! Molto bene!! Very good!!`)
]).then(() => {
    document.querySelector('#convo-sample-out').innerText = 'All done speaking'
})
              

Wrapping up

So, this is 'at the bottom' of what I'm using and why. It still doesn't even attempt to deal with the whole audio stream thing. I'm not sure how to deal with that, but one thing I considered through all this is that you could probably record from the mic, and also offer the ability to capture that as bits you could write out, save, and then reload as simple sound. This might have uses if you have a tight script and really want to guarantee the voices are what you want them to be, and even want the ability to mix or post-process them, but... meh.

The code is still early; it's not a 'finished piece'. It's still rough around the edges and evolving, but that's ok because I'm really just trying to start a conversation and get people thinking and talking about speech again. In my next post I'll start discussing the more complex end of all this: listening, and how that actually also potentially complicates this problem in interesting ways.

Special thanks to my friend, the great Chris Wilson for proofing/commenting on pieces in this series.