monotonous.org

Introducing Spiel

A New Speech API and Framework

[Image: Spiel logo]

I wrote the beginning of what I hope will be an appealing speech API for desktop Linux and beyond. It consists of two parts: a speech provider interface specification and a client library. My hope is that the simplicity of the design, and the way it leverages existing free desktop technologies, will make adoption of this API easy.

Of course, Linux already has a speech framework in the form of Speech Dispatcher. Still, I believe a handful of technologies and recent developments in the free desktop space offer a unique opportunity to build something truly special. They include:

D-Bus

D-Bus came about several years after Speech Dispatcher. It is worth pausing to consider the architectural similarities between a local speech service and a desktop IPC bus. The problems that Speech Dispatcher tackles, such as auto-spawning, wire protocols, IPC transports, session persistence, and modularity, among others, have since been generalized by D-Bus.

Instead of a specialized Speech Dispatcher module, what if speech engines simply exposed an interface on the session bus? With a D-Bus service file they can be spawned automatically and go away when no longer needed.
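To make this concrete, here is a rough sketch of what a session service file could look like; the provider name and executable path are made up for illustration, and the file would live in the usual dbus-1/services directory:

[D-BUS Service]
Name=org.example.SpeechProvider
Exec=/usr/libexec/example-speech-provider

With a file like that in place, the bus can start the engine on demand the first time a client calls it, and the engine can exit when it is idle.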

Flatpak (and Snap??)

Flatpak offers a standardized packaging format that can encapsulate complex setups into a sandboxed installation, with little to no thought given to the dependency hell Linux users have grown accustomed to. One neat feature of Flatpaks is that they can expose fully sandboxed D-Bus services, such as a speech engine. Flatpaks offer an out-of-band distribution model that sidesteps the limitations and fragmentation of traditional distro package streams. Flatpak repositories like Flathub are the perfect vehicle for speech engines because of the mix of proprietary and peculiar licenses that are often associated with them, for example…
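To sketch the packaging side (the bus name here is hypothetical), all the manifest needs is permission to own a name on the session bus in its finish-args:

"finish-args": [
    "--own-name=org.example.SpeechProvider"
]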

Neural text to speech

I have always been frustrated with the lack of natural-sounding speech synthesis in free software. It always seemed that the game was rigged and only the big tech platforms could afford to distribute nice-sounding voices. This is all quickly changing with a flurry of new speech systems covering many languages. It is very exciting to see this happening; it seems like there is a new innovation on this front every day. Because of the size of some of these speech models, and because of the eclectic copyrights associated with them, we can’t expect distros to preinstall them. Flatpaks and neural speech systems are a perfect match for this purpose.

Talking apps that aren’t screen readers

In recent years we have seen many new applications of speech synthesis entering the mainstream: navigation apps, e-book readers, personal assistants, and smart speakers. When Speech Dispatcher was first designed, its primary audience was blind Linux users. As the use cases have ballooned, so has the demand for a more generalized framework that caters to a diverse set of users.

There is precedent for technology that was designed for disabled people becoming mainstream. Everyone benefits when a niche technology becomes conventional, especially those who depend on it most.

Questions and Answers

I’m sure you have questions, and I have some answers. So now we will play our two roles: you, the perplexed skeptic, unsure why another software stack is needed, and me, a benevolent guide who can anticipate your questions.

Why are you starting from scratch? Can’t you improve Speech Dispatcher?

Speech Dispatcher is over 20 years old. Of course, that isn’t a reason to replace it. After all, some of your favorite apps are even older. Perhaps there is room for incremental improvements in Speech Dispatcher. But, as I wrote above, I believe there are several developments in recent years that offer an opportunity for a clean slate.

I love eSpeak, what is all this talk about “naturally sounding” voices?

eSpeak isn’t going anywhere. It has a permissive license, is very responsive, and is ergonomic for screen reader users who consume speech at high rates for long periods of time. We will have an eSpeak speech provider in this new framework.

Many other users, who rely on speech for narration or virtual assistants, will prefer a more natural voice. The goal is to make those speech engines available and easy to install.

I know for a fact that you can use /insert speech engine/ with Speech Dispatcher

It is true that with enough effort you can plug anything into Speech Dispatcher.

Speech Dispatcher depends on a fraught set of configuration files, scripts, executables, and shared libraries. A user who wants a synthesis engine other than the one bundled by default in their distro needs to open a terminal, carefully place resources in the right locations, and edit configuration files.

What plan do you have to migrate all the current applications that rely on Speech Dispatcher?

I don’t. Both APIs can coexist. I’m not a contributor or maintainer of Speech Dispatcher. There might always be a need for the unique features in Speech Dispatcher, and it might have another 20 years of service ahead.

I couldn’t help but notice you chose to write libspiel in C instead of a modern memory safe language with a strong ownership model like Rust.

Yes.

speechSynthesis.getVoices()

Half of the DOM Web Speech API deals with speech synthesis. There is a method called speechSynthesis.getVoices() that returns a list of all the voices the given browser supports. Your website can use it to choose a nice voice, or to present a menu so the user can pick one.

The one tricky thing about the getVoices() method is that the underlying implementation will usually not have a list of voices ready when it is first called. Since speech synthesis is not a commonly used API, most browsers initialize it lazily in the background when a speechSynthesis method is first called. If that method is getVoices(), that first call will return an empty list. So what will conventional wisdom have you do? Something like this:

function getVoices() {
  // Naive approach: busy-wait until the voice list is non-empty.
  let voices = speechSynthesis.getVoices();
  while (!voices.length) {
    voices = speechSynthesis.getVoices();
  }

  return voices;
}

If synthesis is indeed not initialized and first returns an empty list, the page will hang in an infinite CPU-bound loop. This is because the loop is monopolizing the main thread and not allowing synthesis to initialize. Also, an empty voice list is a valid value! For example, Chrome does not have speech synthesis enabled on Linux and will always return an empty list.

So, to get this working we need to avoid blocking the main thread by yielding between calls to getVoices(). We should also put a limit on how many times we attempt to call getVoices() before giving up, for the case where there really are no voices:

async function getVoices() {
  let voices = speechSynthesis.getVoices();
  for (let attempts = 0; attempts < 100; attempts++) {
    if (voices.length) {
      break;
    }

    // Yield to the event loop until the next animation frame, then retry.
    await new Promise(r => requestAnimationFrame(r));
    voices = speechSynthesis.getVoices();
  }

  return voices;
}

But that method still polls, which isn’t great and is needlessly wasteful. There is another way to do it. You could rely on the voiceschanged DOM event that will be fired once synthesis voices become available. We will also add a timeout to that so our async method returns even if the browser never fires that event.

async function getVoices() {
  const GET_VOICES_TIMEOUT = 2000; // two second timeout

  let voices = window.speechSynthesis.getVoices();
  if (voices.length) {
    return voices;
  }

  let voiceschanged = new Promise(
    r => speechSynthesis.addEventListener(
      "voiceschanged", r, { once: true }));

  let timeout = new Promise(r => setTimeout(r, GET_VOICES_TIMEOUT));

  // whatever happens first, a voiceschanged event or a timeout.
  await Promise.race([voiceschanged, timeout]);

  return window.speechSynthesis.getVoices();
}
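And for completeness, here is one way you might call it; the greeting and the voice-picking heuristic are just placeholders:

// Somewhere in an async context.
const voices = await getVoices();
// Prefer an English voice if there is one, otherwise take the first one.
const voice = voices.find(v => v.lang.startsWith("en")) || voices[0];
if (voice) {
  const utterance = new SpeechSynthesisUtterance("Hello, Internet!");
  utterance.voice = voice;
  speechSynthesis.speak(utterance);
}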

You’re welcome, Internet!

HTML AQI Gauge

I needed a meter to tell me what the air quality is like outside. Now I know!

If you need one as well, or if you are looking for an accessible gauge for anything else, here you go.
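If you only want the gist, a gauge like this mostly boils down to an element with role="meter" and the ARIA value attributes. This is just a minimal sketch of that pattern, not the actual gauge from the post, and the AQI numbers are placeholders:

<!-- Minimal accessible gauge: role="meter" plus ARIA value attributes. -->
<div role="meter" aria-label="Air quality index"
     aria-valuemin="0" aria-valuemax="500"
     aria-valuenow="42" aria-valuetext="42, Good">
  42 (Good)
</div>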

You can also mess with it on Codepen.

This is how I surf the Internet

[Image: tab bar with a bunch of new tabs]

Aside from a handful of pinned tabs, I open a new tab for anything I need to do: search the web, file a bug, look up documentation, check on the news, the weather, you get the idea. I am also addicted to Firefox’s new tab page, so I’ll often open a new tab out of boredom and let Pocket suggest an article for me. I hardly ever look at the same tab twice. If I need to get back to something, it is never worth digging through all those tabs; I’ll just type what I am looking for in a new tab and hope for a good suggestion from the awesomebar. After a couple of days I’ll have hundreds of tabs open. Then I declare “tab bankruptcy”: I purge them all and start over.

A while ago I made an addon for myself. It was essentially a tab FIFO. It would only allow 10 tabs to be open at a time. If an 11th tab was created, the least recently activated tab would be closed.
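That addon’s real code isn’t reproduced here, but a minimal sketch of the idea with the WebExtensions tabs API looks something like this; the limit of 10 mirrors the description above:

const MAX_TABS = 10;

browser.tabs.onCreated.addListener(async () => {
  // Ignore pinned tabs; sort the rest from least to most recently activated.
  const tabs = await browser.tabs.query({ pinned: false });
  if (tabs.length <= MAX_TABS) {
    return;
  }
  tabs.sort((a, b) => a.lastAccessed - b.lastAccessed);
  // Close the least recently activated tabs until we are back at the limit.
  const excess = tabs.slice(0, tabs.length - MAX_TABS);
  await browser.tabs.remove(excess.map(t => t.id));
});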

[Image: Throttle Tabs popup]

I came to think: what if I am not the only person who abuses tabs in this way? What if there are other poor souls out there with hundreds or even thousands of open tabs? Are they waiting for Marie Kondo to hold their hand while they deliberate over each tab before discarding it?

So I decided to polish my addon a bit, give it a UI, and put it up on AMO. Since users might not trust an addon that automatically closes tabs, I decided to add an “overflow” feature, which is essentially tab purgatory. Instead of having the addon auto-close a tab, it hides it. The tab is still accessible via the addon’s popup, Firefox’s “Hidden Tabs” submenu, or tab search in the awesomebar. The overflow can be capped too, so old tabs are permanently discarded past a given limit.
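In terms of the sketch above, overflow mode simply swaps closing for hiding; tabs.hide() is a Firefox API that requires the "tabHide" permission:

// With the "tabHide" permission, hide the excess tabs instead of removing them.
await browser.tabs.hide(excess.map(t => t.id));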