Browsers and Language Features
Web browsers do an astounding amount of stuff with language - and we're always trying to do more. Like most things that we get "for free" and don't give a lot of attention to, there is a lot more to it than you might realize.
Recently, I've been bouncing around looking at several browser language features. Today, our web browsers can listen to us and transcribe our words, they can speak to us, and they can do spell checking and grammar checking. At a surface level these all seem to share a single "need to understand words", right? But what exactly that means actually varies quite a lot!
Speech to Text... but more.
Today, your browser can talk to you via the Web Speech API. That's not new. Recently there was a W3C workshop on Voice Interaction. All of the presentations are available on the W3C's playlist. For some reason the sound on some of them seems pretty bad, and unfortunately many very good discussions aren't recorded. One presentation, Solving Lead vs. Lead: Consistent Pronunciation for Web Content, was very interesting to me. In my 2017 post Greetings, Professor Falken I talked about how the underlying speech system was able to gather a lot from context. For example, even back then, all of the browsers and OSes I could try got these "right", in that they were not read naively but rather read in the correct "forms".
1. Pi is about 3.14
2. We loaded the 4x4
3. Please meet me at 3.14pm EST
4. My birthday is 2/17/1974
Going on 10 years later, today's speech systems would get most "lead" (the heavy element) vs "lead" (being out in front) examples correct because of context. But, in the talk and the later discussion, there are plenty of cases where you shouldn't really rely on that - for example, if we're trying to teach someone how something is said. That isn't just an academic concern; instances abound.
In that same 2017 post I also mentioned that what the speech subsystems didn't get right was "Greetings, Professor Falken". They didn't pronounce it like the movie. I overcame that in the post by feeding it a misspelling, but this was sort of non-deterministic too, and solved by trial and error. Sarah (the presenter of the W3C talk) lays out a lot of examples like this - we discussed way more examples than I was initially considering, where either there was no great context, or where no amount of context is actually likely to help. A lot of these cases are proper nouns or regional pronunciations. Montpelier, VT and Montpellier in the south of France are pronounced very differently. Barre, VT and Martin Barre, the lead guitarist for Jethro Tull, are pronounced very differently. These are somewhat famous examples. Ana can be pronounced several ways - which one is this? How do you pronounce the names of fictional characters? Or companies and products? And so on.
Sarah is from ETS, which also participated in an earlier "Spoken Presentation Task Force" that produced a proposal for Spoken HTML - and, in theory, Web Speech should support SSML too. So, that's the proposed solution for that kind of problem. Keep that in the back of your mind for now.
As you can imagine, this is all true in the reverse direction as well. That is, if you are transcribing speech there are sound-alike words and so on - which do I transcribe? “cache” or “cash”? Is it "Barre" or "Barry" or "Berry" Vermont? Do I write "4 by 4" or "4x4" or "Four by Four"? Again, today's models will do very well, generally - but you'll have problems with most of those same things, including especially all of those proper nouns.
At the end of the day, all of the listening part is statistics based. The listening machine is some percentage confident that it heard X, a little less confident that it heard Y, and so on... Then it's sort of a lot like an LLM: given context, it can do better. Contextual biasing is a simple, common way to improve the result: just tell it a list of words you're more likely to use and it'll bias toward them (optionally with a weight saying how much more likely it is to be this word versus something that sounds similar). So, in our example above, if your page is a discussion forum about Vermont politics, it's probably going to hear what is pronounced like "berry", but we want it to write "Barre" - not "Barry" or "bury" or "berry" or anything else - even if it's just a single word without larger context, like a host asking "where was that?" and someone replying "Barre".
That functionality was recently added to the Web Speech API in Chrome.
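To make that concrete, here's a minimal sketch of what wiring up contextual biasing could look like. The `SpeechRecognitionPhrase`/`phrases` shape reflects the Chromium work at the time of writing and may well change; the helper names, the place-name list, and the boost values are mine, purely for illustration.

```javascript
// Pure helper: build {phrase, boost} descriptors for a vocabulary.
// Boost is a relative weight (higher = more likely to be chosen over
// sound-alikes); clamp it to a 0..10 range here as a sanity measure.
function biasingPhrases(words, boost = 2.0) {
  const clamped = Math.min(Math.max(boost, 0), 10);
  return words.map((phrase) => ({ phrase, boost: clamped }));
}

// Browser-only wiring, guarded so this also loads where the API
// doesn't exist (older browsers, Node, etc.).
function startVermontRecognizer() {
  const SR = globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
  if (!SR) return null; // no speech engine available
  const recognition = new SR();
  // Bias toward Vermont place names so "Barre" wins over "berry"/"Barry".
  if ('phrases' in recognition && 'SpeechRecognitionPhrase' in globalThis) {
    recognition.phrases = biasingPhrases(['Barre', 'Montpelier', 'Winooski'], 3.0)
      .map(({ phrase, boost }) => new SpeechRecognitionPhrase(phrase, boost));
  }
  recognition.onresult = (e) => console.log(e.results[0][0].transcript);
  recognition.start();
  return recognition;
}
```

Note that the biasing list is just data - a word and, optionally, a number - which matters for the comparison coming up next.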
More Different Than You Might Think...
Listening and speaking both feel so related, but the approach to the two is actually very different. In one, the input is just a word and optionally a number; in the other, it's more complex - you need specialized knowledge about SSML and the IPA (International Phonetic Alphabet), some mapping and... in practice, more. That's because while SSML seems very deterministic compared to an arbitrary number and statistics, in practice, results vary. In theory, providing the IPA for both popular pronunciations of "tomato" - /təˈmɑːtəʊ/ and /təˈmeɪtoʊ/ (think of the old song Let's Call the Whole Thing Off) - should make each be pronounced as expected. However, sending this to different speech engines yields unpredictable results, and there isn't a way to check whether an engine will actually honor it. If the engine doesn't support IPA, it may even read the contents of the SSML itself. If you give it a name like Saoirse Ronan - even with SSML and a very good engine that supports phonemes and IPA - it often still won't actually pronounce it properly unless it has a good Irish voice... and there aren't actually a lot of them.
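For reference, here's what that SSML markup actually looks like. The `<phoneme>` element with `alphabet="ipa"` is defined in the SSML spec; whether a given engine honors it (or falls back to reading the raw text) varies, as I said. The helper functions are just my own sketch.

```javascript
// Escape the handful of characters that are unsafe inside XML text
// and attribute values.
function escapeXml(s) {
  return s.replace(/[<>&'"]/g, (c) => ({
    '<': '&lt;', '>': '&gt;', '&': '&amp;', "'": '&apos;', '"': '&quot;',
  }[c]));
}

// Wrap one word in a <phoneme> element with its IPA transcription.
function phoneme(word, ipa) {
  return `<phoneme alphabet="ipa" ph="${escapeXml(ipa)}">${escapeXml(word)}</phoneme>`;
}

// Both pronunciations of "tomato", ready to hand to an SSML-aware engine.
const ssml =
  `<speak>You say ${phoneme('tomato', 'təˈmeɪtoʊ')}, ` +
  `I say ${phoneme('tomato', 'təˈmɑːtəʊ')}.</speak>`;
```

Compare that to the word-plus-number list above: one side is simple data, the other requires authoring a whole phonetic markup language and hoping the engine respects it.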
Contextual biasing is clearly no help on the pronunciation side of things. But flip it around and you might think that IPA would be a good input for how to hear words too - yet I've not really found anyone who even tries to do that.
And now for something completely different
As I hinted at the beginning, those are only two examples. We also have things like spell checking - that's the part that marks things as spelled incorrectly - and grammar checking. You can independently style these two cases with the CSS ::spelling-error and ::grammar-error pseudo-elements, as written about by my colleague Stephen Chenney, who did the work at Igalia thanks to funding from Bloomberg.
Note that neither of these suggests words in any way, and they don't deal with the concepts of autosuggest, autocorrect, autofill, hints, or - maybe soon - features that indicate and generate text rewrites.
None of these are the same thing, and many of them inevitably run into the same specialized-domains problem. For example, if I am editing some Hitchhiker's Guide to the Galaxy wiki and it contains a quote of Vogon poetry:
Oh freddled gruntbuggly, Thy micturations are to me, (with big yawning) As plurdled gabbleblotchits, On a lurgid bee, That mordiously hath blurted out, Its earted jurtles, grumbling Into a rancid festering confectious organ squealer. [drowned out by moaning and screaming]
None of those words are actually spelling errors in this context, and highlighting them as such entirely ruins the experience - there are so many false positives, you miss the real errors.
And almost all of these problems are, somehow, still handled differently from one another.
For example, spell check dictionaries can live at several levels. Sometimes the browser has its own, sometimes it just integrates with the OS-level one, and sometimes a browser has one per profile. On Android, virtual keyboards have their own dictionaries.
But what does that even mean, "dictionaries"?
Just as in the cases of speech to text or text to speech, it can mean something different. A lot of software uses a thing called Hunspell, which packs up language "dictionaries" in a format that works across languages and can efficiently encode complexities like plural rules, autosuggestion help and all sorts of things. For example, here is the LibreOffice en_GB.dic. In this file you'll find simple words like ablaze, but most words have some kind of 'affixes', and you'll find similar words in runs that look something like this:
spoon-feed/SG
spoon/D6GSM
spoonbill/MS
Spooner/M
spoonerism/SM
spoonful/MS
spoonier
spooniest
spooniness/M
spoonsful
spoony/SMY
These connect to affix definitions in a parallel ".aff" file, which is full of entries like:
SFX n e ations [^ckt]e
SFX n 0 ations [^e]r
SFX n e ations [iou]te
SFX n y ations py
SFX n ke cation ke
SFX n ke cation's ke
SFX n ke cations ke
SFX n y ication [^p]y
SFX n y ication's [^p]y
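To give a feel for how those entries work: a line like `SFX n e ations [^ckt]e` means "for words carrying flag n, if the word's ending matches the condition `[^ckt]e`, strip `e` and append `ations`". Here's a toy interpreter for single SFX lines - a sketch of the idea only, nowhere near a full Hunspell implementation, and the function names are mine.

```javascript
// Parse one "SFX <flag> <strip> <add> <condition>" line.
function parseSfxLine(line) {
  const [, flag, strip, add, cond] = line.trim().split(/\s+/);
  return {
    flag,
    strip: strip === '0' ? '' : strip, // "0" means strip nothing
    add,
    cond: new RegExp(cond + '$'),      // condition applies at word end
  };
}

// Apply a rule to a word, or return null if the condition doesn't match.
function applySfx(word, rule) {
  if (!rule.cond.test(word)) return null;
  return word.slice(0, word.length - rule.strip.length) + rule.add;
}

// e.g. applySfx('imagine', parseSfxLine('SFX n e ations [^ckt]e'))
//   → 'imaginations'; 'bake' fails the [^ckt]e condition → null.
// The "ke → cation" rules above handle pairs like provoke → provocation.
```

So a single entry like spoon/D6GSM is really a compact stand-in for a whole family of derived forms.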
In other words... It's complicated, but covers a lot.
If you've ever right-clicked something and selected "learn word" or "add to dictionary" or similar, you're doing the equivalent of adding ablaze to the ".dic" file - just a simple string. But, unlike Hunspell's affixed entries or other complex formats, it's simple enough that even an end user can do it.
Bloomberg is also sponsoring our work on a proposal called the SpellCheckCustomDictionary API. We've been working on an explainer. Realistically, maybe we should call it SpellCheckExclusions to be clearer that all it really does is allow a site to provide a list of words to not match as ::spelling-error. We also let you do that in groups so that any active spellchecking can just re-run once.
There was a lot of debate and thought about how much of this should be shared or centralized, but starting with a simple list that can be well defined in terms of standard exclusions seems like a nice first step. It does mean that you would potentially need to add both "Gandalf" and "Gandalf's" as valid, and "hobbit" and "hobbits" and "hobbit's" and "hobbits'". But at least this is simple and understandable, works in every language, and widens the pool of developers who might do it. A nice trade-off here might be to allow some sense of regexp in this list - though, for practical purposes, it should probably be a limited subset (() for grouping, | for alternation, ? for optional, plus maybe an :i sigil for case insensitivity). This would help some authors express more with less and shrink the memory initially required, while still not getting too specialized and staying fairly easy to make performant.
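Here's a sketch of how such a limited subset could be validated and compiled. To be clear, everything here is speculative: the proposal hasn't settled on any syntax, and the allowed alphabet, the :i sigil handling, and the helper names are all just my illustration of the trade-off.

```javascript
// Allow only word characters plus ( ) | ? and an optional leading :i sigil.
// Restricting the alphabet up front means the body contains no other
// regex metacharacters, so it can be used as a pattern directly.
const ALLOWED = /^(:i)?([A-Za-z'’\-()|?]+)$/;

function compileExclusion(entry) {
  const m = ALLOWED.exec(entry);
  if (!m) throw new Error(`unsupported syntax: ${entry}`);
  const [, sigil, body] = m;
  // Anchor the pattern: an exclusion must match the whole word.
  return new RegExp(`^(?:${body})$`, sigil ? 'iu' : 'u');
}

function isExcluded(word, patterns) {
  return patterns.some((re) => re.test(word));
}

// One entry covers Gandalf and Gandalf's; another covers
// hobbit / hobbits / hobbit's / hobbits'.
const exclusions = ["Gandalf('s)?", "hobbit(s|'s|s')?"].map(compileExclusion);
// isExcluded("Gandalf's", exclusions) → true
```

The appeal of the restricted alphabet is that an engine can compile and match these cheaply, and authors can't accidentally write a pathological pattern.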
It would also be nice to follow this with a purely declarative solution.
I'm pleased that this went to Stage 1 in the WHATWG this week, and I'm looking forward to figuring out how we move it forward.
Anyway, it's been really interesting and fun to dig into all of this and it's always eye-opening to look behind another curtain... I'm looking forward to continuing these conversations and ultimately getting something useful into all of the browsers! Thanks again to Bloomberg Tech for the sponsorship!