Thoughts on Voice Recognition: Web Speech APIs Part IV

This is part of a series about making the browser speak and listen to speech. In my last post, You Don't Say: Web Speech APIs Part III, I talked about the recognition interfaces: listening, or speech-to-text.

To sum up the state of things with recognition:

It's much less widely implemented and deployed than Text-To-Speech (TTS).
Like TTS, it is not an 'official' standard. Browsers are working on it, but not much work seems to be happening 'together' on this anymore.
It requires special privileges.
Its API is kind of confusing.

What's wrong with the current APIs

These are just my personal opinions after using it a lot for the past couple of months. Sometimes eating the dogfood gives you new perspectives and nothing in this should be taken as a criticism of the brave souls that worked hard to bring us what we have. It's easy to forget that we're standing on their shoulders. Hats off to them.

Let's assume that the API worked consistently, as advertised by the existing draft, across browsers and on all devices. The trouble that I see is mostly just that it doesn't "fit" in the Web platform today. It's not really a high level API, nor is it quite the low level we probably need.

Imagine...

When I hear "voice recognition" or "speech to text", this implies to me that I can take a sound, analyze it and turn it into a transcript of text. When I watch a demo and I see someone speak into the mic and words come out on the screen, I have the same impression. I get excited and begin thinking of the possibilities. Yes, I want that. Think of all the things I could do with the ability to turn sounds of spoken word into text - I begin to imagine...

I turn the directory of podcasts I've downloaded into text. I could do lots of interesting things with that: skim them, for example, to see if I want to listen. Better still, I can index them and hyperlink them... I can make a word cloud... I can search for things that were discussed without trying to remember what was said about Web Components in... hmm, some podcast? Or was it a talk? It was Jeremy Keith... I think? Last year maybe? In fact, I don't even have to have watched it to know that.
Speaking of podcasts, I've been thinking of doing one. Imagine I'm in a studio with 4 other people all with their own mics. It'd be super neat to identify them by name and capture their transcribed text in a simple transcript. Hey, that's kind of like meetings. I attend a lot of meetings that aren't minuted. We could fix that.
I have a thing that lets me upload a video to a website I maintain... I know that for accessibility's sake I need captions for this video and that they have to be in Timed Text Markup Language (TTML). Wow, that's a pain in the ass to write from scratch. I could write a thing to take the audio track, turn chunks of it into text, insert that into TTML and upload a draft into my workflow that I can then simply proof, correct, approve and publish. We can't do that with this API either.
I also had an idea for a simplistic, hands-free conversational assistant for while I am driving - kind of my own simplified Jarvis.

But there's the rub: none of these are actually easily accomplishable with the current API, for various reasons, but mostly because it is kind of hard-wired toward the sound coming from the mic without really exposing that.

This is a shame because since The Extensible Web Manifesto (EWM), we've worked very hard to help "explain the magic" already buried in features of the platform and focus our initial efforts on the powers that are fundamentally new. As far as I can tell, really the only thing in these APIs that is fundamentally new is the idea of an interface for transcribing audio of speech into text. Things like promises, web audio, media devices, fetch and streams are all things we've worked on and that can potentially be used to help explain just about everything else in this API. As the speech draft predates all this, it's no surprise that none of that is taken into account, but it's not too late.

In fact, the authors of the spec did recognize some of this, and there are notes in the spec and errata that link to, for example, this observation, with the thought that perhaps it was too late for this draft, as it is actually a 'final report' and not an official spec. I hope we have room to revisit this.

Let's use this as an exercise to start low and build up the concepts we want, as each level affords new opportunities for people to dream up things we weren't even considering. I've written about this before, but it's astonishing how many creations that are fundamental things today turned out to not be what their inventors even thought they would be. Houses were originally wired for artificial lighting - there were no plugs, because there was nothing to plug in. While we were busy standardizing bulbs, an astonishing flourishing of inventions and whole new industries were created by the simple fact that we gave people electricity.

A 'low-level' transcription interface

Let's imagine that we started simply with the idea of an API that gave us just the fundamentally new part: Something that takes some audio and gives you back the transcribed text. Accessing your filesystem or your mic are things that require permission, but what we're talking about here doesn't itself do that, so that's not even something we need to worry about. Actually, asking for analysis of something could be modeled after the HTTP request/response model, and that seems kind of fitting because there's potentially some variance in the response. For example, different implementations can definitely have different interpretations of the same sounds. So, let's start with the common model of being simple async and transactional, pass/fail things that we deal with via promises and resolve some kind of response...something like:

// let's save bikeshedding a name for another day...
let transcriber = new FictionalTranscriber()

transcriber.transcribe(someAudioStream).then((response) => {
  // we'll talk more about this
})

That's pretty straightforward, I think. A transcription resolves with a response, and an error is handled with the promise's .catch method.
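To make the success and failure paths concrete, here's a minimal sketch. Everything in it is made up for illustration: stubTranscriber is a hypothetical stand-in with a canned answer, report is just a helper name I picked, and I'm assuming confidence is on a 0-100 scale.

```javascript
// Hypothetical stand-in so the sketch runs anywhere. A real transcriber
// would hand the audio to a speech engine instead of returning a canned answer.
const stubTranscriber = {
  transcribe(audio) {
    if (!audio) {
      return Promise.reject(new TypeError('transcribe() needs some audio'))
    }
    return Promise.resolve({ transcript: 'hello world', confidence: 90 })
  }
}

// Success and failure both travel down the same promise chain.
function report(transcriber, audio) {
  return transcriber.transcribe(audio)
    .then((response) => `heard "${response.transcript}"`)
    .catch((error) => `failed: ${error.message}`)
}

report(stubTranscriber, new Uint8Array(4)).then((message) => console.log(message))
report(stubTranscriber, null).then((message) => console.log(message))
```

The point is only that the caller never needs a second mechanism (like a separate error event) to find out that something went wrong.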

Results

With the current interface you get these wonky array-likes which default to having exactly one item in them, but let's work backwards and see if we can avoid this. Imagine that the response itself simply contains the properties .transcript and .confidence (from the existing APIs). Then the default 'just give me your best guess' case is as simple as:

let transcriber = new FictionalTranscriber()

transcriber.transcribe(someAudioStream).then((response) => {
  console.log(`
    I am ${response.confidence}% confident
    that I heard "${response.transcript}".
  `)
})

The existing APIs also contain a separate event called onnomatch that is different from onerror in that it's more about receiving something and processing it, but not being able to determine confidently enough what was said. It seems that what we mean to say is "we recognized nothing" or "we heard something but what it was is undefined". Both of these already have programming concepts: the empty string ("") or, in JavaScript, undefined. In JavaScript both are falsy, so I think we can get rid of that extra event entirely with something as simple as saying that that is a possibility...

let transcriber = new FictionalTranscriber()

transcriber.transcribe(someAudioStream).then((response) => {
  // if we didn't match anything sufficiently
  if (!response.transcript) {
    console.log(`Sorry, I didn't catch that`)
  } else {
    console.log(`
      I am ${response.confidence}% confident
      that I heard "${response.transcript}".
    `)
  }

})
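On the implementation side, folding "no match" into the response could be as small as the sketch below. The names toResponse and describeResponse, and the minConfidence knob, are all my own inventions for illustration. I'm again assuming a 0-100 confidence scale, and the threshold of 50 is arbitrary.

```javascript
// Hypothetical helper: collapse "nothing usable" into an empty transcript,
// so callers only ever need a falsy check and no separate nomatch event.
function toResponse(guess, minConfidence = 50) {
  if (!guess || !guess.transcript || guess.confidence < minConfidence) {
    return { transcript: '', confidence: 0 }
  }
  return { transcript: guess.transcript, confidence: guess.confidence }
}

// The calling side is the same falsy check as in the example above.
function describeResponse(response) {
  return response.transcript
    ? `I heard "${response.transcript}"`
    : `Sorry, I didn't catch that`
}

console.log(describeResponse(toResponse({ transcript: 'hello', confidence: 80 })))
console.log(describeResponse(toResponse({ transcript: 'mumble', confidence: 20 })))
```

Whether a low-confidence guess should be dropped or passed through with its low score is exactly the kind of thing that's up for debate; this just shows that either choice fits in the response.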

Alternatives

"Ah," you're thinking, "but those arrays are meaningful... One of them contains N possible 'alternatives'". True, but alternative to what even? This seems like a strange use of the word anyway. Conversely, our model could also allow our response to contain an .alternatives array beyond our best guess transcript. By default, it would be empty because there are no alternatives. This seems to actually match the meaning of the word "alternative" better anyway. Asking for them seems like it could simply be an optional configuration...

let transcriber = new FictionalTranscriber({maxAlternatives: 10})

transcriber.transcribe(someAudioStream).then((response) => {
  console.log(`
    I am ${response.confidence}% confident
    that I heard "${response.transcript}".
  `)

  if (response.alternatives.length) {
    console.log(`But here are ${response.alternatives.length} alternatives... `)
    response.alternatives.forEach((alternative) => {
      console.log(`
        I am ${alternative.confidence}% sure it could
        have been "${alternative.transcript}"
      `)
    })
  }

})
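Here's a rough sketch of how an implementation might assemble that shape from whatever list of guesses its engine produces. buildResponse is a hypothetical helper name, and the guess objects are made-up data, not output from any real recognizer.

```javascript
// Hypothetical helper: the most confident guess becomes the response itself,
// and up to maxAlternatives of the runners-up go in .alternatives.
function buildResponse(guesses, { maxAlternatives = 0 } = {}) {
  // Copy before sorting so the caller's array is left untouched.
  const ranked = [...guesses].sort((a, b) => b.confidence - a.confidence)
  const [best = { transcript: '', confidence: 0 }, ...rest] = ranked
  return {
    transcript: best.transcript,
    confidence: best.confidence,
    alternatives: rest.slice(0, maxAlternatives)
  }
}

const response = buildResponse(
  [
    { transcript: 'wreck a nice beach', confidence: 40 },
    { transcript: 'recognize speech', confidence: 85 }
  ],
  { maxAlternatives: 10 }
)
console.log(response.transcript, response.alternatives.length)
```

One detail worth noting: an empty array is truthy in JavaScript, so a caller would check response.alternatives.length rather than response.alternatives to find out whether there is anything in there.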

Implementation Flexibility and Black Boxing

The existing API designs seem to want to allow you to specify a serviceURI and that sort of thing, but it seems unimplemented and there seems to be some kind of hand-waving on just how that would work, because there are competing ideas.

One of the extra handy things about this kind of design, I think, is that it also neatly sidesteps a number of problems that are impossibly difficult to resolve at this stage, by boxing those problems at an appropriate place where there is easier-to-reach value and yet room for experimentation. Let me explain...

What we are providing is, effectively, a service interface. We're not saying how the actual transcription should be done. There are an astounding number of challenges there. There are lots of different existing remote services with lots of different qualities, lots of different APIs and lots of different sorts of inputs. Wanting to figure all of this out up front will stall a process for a very long time and be, ultimately, as much about politics and sway as actual merit.

On the other hand, if we simply say "Here's what it has to look like on the client, and the client should include a default implementation" we have to agree to very little. While this seems like it isn't giving you 'much', the adoption of this simple pattern would provide the standard we most desperately need: All we have to agree on at this stage is what the client interface looks like.

There are some other practical upshots to this too. The first is that we can also make that extensible by defining basic API lifecycle methods and allow competition and variance to play itself out very easily and employ lines of thinking we can't easily consider in a standards body. Imagine, for example, that Google's cloud platform could release a subclass:

export default class FictionalGoogleCloudTranscriber
  extends FictionalTranscriber {
    // override the lifecycle method that actually talks to the service
    request(audio) {
      // return some promise that meets the expected API
      return fetch(/* ... */).then(/* ... */)
    }
}

The only thing that changes is which one you instantiate - they all work, at the API level, the same way. This means that these adapters are then as easily distributable as jQuery plugins, and Amazon can compete with Google and Microsoft and Mozilla and... well... anyone while we figure out the rest.
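To show how the pieces might fit, here's a minimal sketch of a base class where transcribe() is the public API and request() is the one lifecycle hook adapters override. This is my guess at a workable shape, not a spec. FictionalCannedTranscriber is a made-up adapter that returns canned text so the example runs without any network or real service.

```javascript
// Hypothetical base class. transcribe() is what callers use;
// request() is the hook that vendor adapters would override.
class FictionalTranscriber {
  constructor(options = {}) {
    this.options = options
  }

  // Stand-in default: a built-in implementation would call the browser's
  // own engine here. This one just reports that nothing was recognized.
  request(audio) {
    return Promise.resolve({ transcript: '', confidence: 0 })
  }

  transcribe(audio) {
    // Starting from Promise.resolve() turns a synchronous throw inside
    // request() into an ordinary rejection.
    return Promise.resolve()
      .then(() => this.request(audio))
      // Normalize, so every adapter hands callers the same shape.
      .then(({ transcript = '', confidence = 0, alternatives = [] }) =>
        ({ transcript, confidence, alternatives }))
  }
}

// Made-up adapter with a canned answer. A real one would call its
// vendor's service inside request(), for example with fetch().
class FictionalCannedTranscriber extends FictionalTranscriber {
  request(audio) {
    return Promise.resolve({
      transcript: `${audio.byteLength} bytes of speech`,
      confidence: 75
    })
  }
}

// The calling code is identical whichever class you instantiate.
new FictionalCannedTranscriber()
  .transcribe(new Uint8Array(8))
  .then((response) => console.log(response.transcript))
```

Keeping the normalization in the base class means an adapter author only has to worry about talking to their service.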

This also fits very nicely with allowing us to experiment with the remaining (seemingly unimplemented) ideas in the APIs (like grammars) and provides room for us to work things out. Eventually, perhaps we can derive a higher level protocol allowing us to express things as simply as:

let transcriber = new FictionalTranscriber({
  serviceURI: 'blah',
  grammars: whateverThisIs
})

transcriber.transcribe(someAudioStream).then((response) => {
  console.log(`
    I am ${response.confidence}% confident
    that I heard "${response.transcript}".
  `)
})

Importantly though, this means we don't have to wait for that day and we can be part of the solution in figuring out what it should be.

Interestingly, there was another, more recent proposal, also from someone at Google, that is considerably closer to this idea. However, that idea seems not to have gone far.

Low Level First, Not Exclusively

The idea isn't ever to stop at the low level APIs. It's not even that we absolutely positively have to have the low level APIs entirely in place. As I explain in What Would Bruce Lee Do? The Tao of the Extensible Web, in some cases we can come back and expose details of a carefully designed black box after the fact. At a minimum, describing the API in common platform terms and the thought exercise of what a suitable low level API would look like allows us to design the boxes and layers carefully.

For example, given such an API as described above, it would make sense to add a higher level API which made use of it and exposed it... A pattern which explained the common lifecycle of grabbing a device, getting permission, listening to a channel, chunking that input, sending it to the transcriber and ultimately pumping out transcripts. Perhaps something like..

// Here we create an object not entirely unlike what
// the API already had - it defaults to being initialized
// with the audio context/primary mic, sends a chunk
// of sound to our transcriber and relays the promise
// results or errors to events
let speechRecognizer = new FictionalSpeechRecognizer()


// event based methods relay promise results
speechRecognizer.onresult = (evt) => { /* evt.response.transcript */ }
speechRecognizer.onerror = (evt) => { /* evt.error */ }

// the audioContext is exposed: You can control it
speechRecognizer.audioContext.start()
speechRecognizer.audioContext.stop()

The general idea being that we can answer the question "what's under there?" by saying "it has an audio context and a transcriber" and we can expose those for extensibility... For example, we could allow you to swap out the transcriber implementation without forcing you to reimagine any other part in the system...

let speechRecognizer = new FictionalSpeechRecognizer({
  transcriber: new FictionalGoogleCloudTranscriber()
})

Or if you can get an audio context another way (say, one of N line inputs, or from a file), you can swap that out easily without forcing you to reimagine anything else either..

let speechRecognizer = new FictionalSpeechRecognizer({
  audioContext: someAudioContext
})
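Under the hood, the relaying part of such a recognizer could be very little code. Here's a minimal sketch under some loud assumptions: handleChunk is a method name I invented for "a chunk of captured audio arrived", the transcriber option is anything with a transcribe(audio) method that returns a promise, and the actual capturing and chunking of audio is left out entirely.

```javascript
// Hypothetical sketch: relays a transcriber's promise results to events.
class FictionalSpeechRecognizer {
  constructor({ transcriber, audioContext = null } = {}) {
    this.transcriber = transcriber   // anything with transcribe(audio) -> Promise
    this.audioContext = audioContext // exposed rather than hidden away
    this.onresult = null
    this.onerror = null
  }

  // Assumed entry point, called once per chunk of captured audio.
  handleChunk(chunk) {
    return this.transcriber.transcribe(chunk).then(
      (response) => { if (this.onresult) this.onresult({ response }) },
      (error) => { if (this.onerror) this.onerror({ error }) }
    )
  }
}

// Usage with a canned transcriber, so this runs without a mic or a service.
const recognizer = new FictionalSpeechRecognizer({
  transcriber: {
    transcribe: () => Promise.resolve({ transcript: 'lights on', confidence: 80 })
  }
})
recognizer.onresult = (evt) => console.log(evt.response.transcript)
recognizer.handleChunk(new Uint8Array(16))
```

Because the recognizer only knows about "something with transcribe()" and "some audio source", swapping either one doesn't touch the event relaying at all.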

Given such an explanation, we could easily experiment with further layers, floating ideas as custom elements. For example: an input decorator that could progressively enhance textual form controls to allow you to populate them with speech. All you'd need to say is 'they have one of those speech recognizer things and provide a toggle that pipes voice to it and the output back into the field'... Something like this...


<x-voice-listener>
  <input name="input">
</x-voice-listener>

That's kind of interesting because there were specs trying to do something like that back in the early 2000s, and it seems to be where Google restarted the conversation in 2012 as well. We could figure that out, and do it with real world experimentation and testing! As you can see from this super rough demo, there's a lot to think about... Should there be an attribute to autostart? Should it automatically stop? Should other attributes be present? paused, for example? I don't know, but we could try a lot of things and see what works.
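To make the 'pipes voice to it and the output back into the field' part concrete, here's a minimal sketch of just that glue. pipeVoiceToField is a name I made up, the recognizer is assumed to deliver results through an onresult handler shaped like the one above, and the field is anything with a value property. A custom element would presumably find its child input and wire it up like this when it is connected, but I've left the element itself out.

```javascript
// Hypothetical glue for something like <x-voice-listener>: while listening
// is toggled on, recognized text is written into the field.
function pipeVoiceToField(recognizer, field) {
  let listening = false

  recognizer.onresult = (evt) => {
    // Ignore results while toggled off, and ignore empty (no match) transcripts.
    if (listening && evt.response.transcript) {
      field.value = evt.response.transcript
    }
  }

  return {
    // Flip listening on or off and report the new state.
    toggle() {
      listening = !listening
      return listening
    }
  }
}

// Usage with plain objects standing in for a recognizer and an <input>.
const fakeRecognizer = {}
const fakeInput = { value: '' }
const control = pipeVoiceToField(fakeRecognizer, fakeInput)
control.toggle()
fakeRecognizer.onresult({ response: { transcript: 'hello there', confidence: 90 } })
console.log(fakeInput.value)
```

Even this tiny version forces a decision (replace the field's text or append to it?), which is the kind of question real experimentation would help answer.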

So now what?

Since there's something implemented on Chrome and Android, you have a lot of 0's in the number of machines that actually support the existing API. As with a lot of 'new' features on the Web, you'll probably want to think about how you can use progressive enhancement for a lot of use cases... In some cases, the API is good enough that you can paper over its kinks and that might be 'good enough'.

Unfortunately, the low-level API sketched out in this document isn't possible to achieve using the existing Speech APIs because they don't provide adequate explanations or hooks. Fortunately, however, it is entirely plausible to make this API real today using a remote service that conforms to a design like this. Services that transcribe sounds of speech to text are plentiful: Google's Cloud Platform and Amazon both provide APIs capable of achieving this, and for limited purposes they would be free enough to experiment with and prove out, and fairly low cost beyond that.

I have a kind of rough POC that I'll be building out for purposes of demonstrating ideas to limited groups, but a widely, publicly shareable demonstration is potentially cost prohibitive (unless some vendor wants to donate free time). However, if you have a business use for this, perhaps it's worthwhile. If you're interested in helping out with that, or hearing more, let me know.

Special thanks to my friend, the great Chris Wilson for proofing/commenting on pieces in this series.
