eSpeak Web Speech API Addon

Now that eSpeak runs pretty well in JS, it is time for a Web Speech API extension!

What is the Web Speech API? It gives any website access to speech synthesis (and recognition) functionality. Chrome and Safari already have this built in. This extension adds speech synthesis support to Firefox, along with eSpeak voices.
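
For a feel of what this looks like from a page, here is a minimal sketch using the standard speechSynthesis interface. The speak() helper and its options are hypothetical names for illustration, and matching voices by the string "espeak" in their name is an assumption about how the addon's voices are labeled. The injectable synth and Utterance options only exist so the function can be tried outside a browser.

```javascript
// Minimal sketch: speak a phrase with the Web Speech API.
// `speak` is a hypothetical helper; speechSynthesis and
// SpeechSynthesisUtterance are the standard browser globals.
function speak(text, options = {}) {
  const synth = options.synth || globalThis.speechSynthesis;
  const Utterance = options.Utterance || globalThis.SpeechSynthesisUtterance;
  if (!synth || !Utterance) {
    throw new Error("Speech synthesis is not available here");
  }

  const utterance = new Utterance(text);
  utterance.lang = options.lang || "en-US";
  utterance.rate = options.rate || 1; // 1 is the default speaking rate

  // Assumption: eSpeak voices can be recognized by their name.
  const voices = synth.getVoices();
  const espeakVoice = voices.find((v) => /espeak/i.test(v.name));
  if (espeakVoice) {
    utterance.voice = espeakVoice;
  }

  synth.speak(utterance);
  return utterance;
}
```

In a page this would be called as speak("Hello from eSpeak"). Note that getVoices() can return an empty list until the browser has finished loading its voices, in which case the default voice is used.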

For the record, we have had speech synthesis support in Gecko for about two years. It was introduced for accessibility needs in Firefox OS; now it is time to make sure it is supported on desktop as well.

Why an extension instead of built-in support? A few reasons:

  1. An addon will provide speech synthesis to Firefox now, while we implement built-in platform-specific solutions for future releases.
  2. An addon will allow us to surface current bugs both in our Speech API implementation and in the spec.
  3. We designed our speech synthesis implementation to be extensible with addons; this is a good proof of concept.
  4. People are passionate about eSpeak. Some people love it, some people don’t.

So now I will shut up, and let eSpeak do the talking:

(re)Introducing eSpeak.js

tl;dr

Look! A flashy demo with buttons!

Background

A long time ago, we were investigating a way to expose text-to-speech functionality on the web. This was long before the Web Speech API was drafted, and it wasn’t yet clear what this kind of feature would look like. Alon Zakai stepped up and proposed porting eSpeak to JavaScript with Emscripten. This was a provocative idea: was our platform powerful enough to support speech synthesis purely in JS? Alon got back a few days later with a working demo; the answer was “yes”.

While the speak.js port was very impressive, it didn’t answer many of our practical needs. For example, the latency was not good enough for a responsive UI: you could wait more than a couple of seconds to hear a short phrase. In addition, the longer the text you wanted to synthesize, the longer you needed to wait.

It proved a concept, but there were missing pieces we didn’t have four years ago. Today, we live in the future of 2015, and things that were theoretical then are possible now (in the future).

asm.js

Today, Emscripten will compile C/C++ code into a subset of JavaScript called asm.js. This subset is optimized on all current browsers, and allows performance to come within about 2x of native speed. That is really good. eSpeak is a pretty lightweight library already, and the extra performance boost of asm.js makes speech feel instantaneous.
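
To give an idea of what that subset looks like, here is a tiny hand-written module. It is a toy for illustration, not code from eSpeak or Emscripten output, and AddModule is a made-up name. The "use asm" directive and the |0 coercions are what let an engine treat everything as 32-bit integers.

```javascript
// A tiny hand-written asm.js module (hypothetical, not from eSpeak).
// The "use asm" directive and the |0 coercions tell the engine that
// every value here is a 32-bit integer.
function AddModule(stdlib) {
  "use asm";

  function add(a, b) {
    a = a | 0; // parameter type annotation: int
    b = b | 0;
    return (a + b) | 0; // result coerced back to int
  }

  return { add: add };
}

const { add } = AddModule(globalThis);
console.log(add(2, 3)); // 5
```

Because asm.js is plain JavaScript, the same code still runs (just without the extra optimization) in an engine that does not recognize the directive.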

Transferable Objects

Passing data between a web worker and a parent process used to mean a lot of copying, since the worker doesn’t share memory with the parent process. But today, you can transfer ownership of ArrayBuffers with zero copying. When the web worker is ready to send audio data back to the calling process, it can do so while maintaining a single copy of the audio buffer.
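
A rough sketch of what the transfer looks like from the worker side. The sendAudioChunk() helper and the message shape are hypothetical; the transfer list passed as the second argument to postMessage() is the standard mechanism.

```javascript
// Sketch: hand a chunk of audio from a worker to its parent without copying.
// `sendAudioChunk` and the message shape are hypothetical; the second
// argument to postMessage (the transfer list) is the standard mechanism.
function sendAudioChunk(port, samples) {
  // `samples` is a typed array such as Int16Array that owns its whole
  // ArrayBuffer. The buffer is what actually gets transferred.
  const buffer = samples.buffer;
  port.postMessage(
    { type: "audio", buffer: buffer, length: samples.length },
    [buffer] // transfer ownership instead of cloning the bytes
  );
  // After a real transfer the sender's view is detached
  // (samples.byteLength becomes 0), so it must not be reused.
}

// Inside a worker, `self` plays the role of the port:
//   sendAudioChunk(self, samples);
```

One caveat: if the samples are a view into a larger buffer (for example a compiled module's heap), transferring that buffer would give away the whole heap. In that case, copy the chunk out once with samples.slice() and transfer the copy.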

Web Audio API

We have a slick, full-featured Audio API today on the web. When speak.js came out in 2011, it used a prefixed method on an <audio> element to write PCM data. Today, we have a proper API that enables us to take the audio data and send it through an elaborate pipeline of filters and mixers, or even send it into the ether with WebRTC.
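
As a sketch of the simplest possible pipeline, the snippet below plays one chunk of samples straight to the speakers. It assumes the synthesizer hands back signed 16-bit mono PCM, and the 22050 Hz default is only a placeholder. pcm16ToFloat32() and playChunk() are hypothetical helper names; the AudioContext calls are the standard ones.

```javascript
// Convert signed 16-bit PCM to the float samples
// in [-1, 1) that Web Audio buffers hold.
function pcm16ToFloat32(pcm) {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}

// Play one mono chunk through an AudioContext.
// `sampleRate` is an assumption; use whatever the synthesizer reports.
function playChunk(audioCtx, pcm, sampleRate = 22050) {
  const floats = pcm16ToFloat32(pcm);
  const audioBuffer = audioCtx.createBuffer(1, floats.length, sampleRate);
  audioBuffer.getChannelData(0).set(floats);

  const source = audioCtx.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioCtx.destination); // or any chain of filter/gain nodes
  source.start();
  return source;
}
```

In a page, playChunk(new AudioContext(), samples) would be enough to hear something. Swapping audioCtx.destination for another node is where the filters and mixers come in.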

Emscripten Got Fancy

This was my first time playing with it, so I am not sure what was available in 2011. But, if I have to guess, it was not as powerful and fun to work with. Emscripten’s new WebIDL support makes adding bindings extremely easy. You still get a chance to do some pointer arithmetic, but that’s supposed to be fun. Right?

So here is eSpeak.js!

I wanted to do a real API port, as opposed to simply porting a command line program that takes input and writes a WAV file. Why? Two main reasons:

  1. eSpeak can progressively synthesize speech. If you provide a callback to espeak_Synth(), it will be called repeatedly with as many samples as you defined in the buffer size. No matter how long the text you want synthesized is, it will fill the buffer and return it to you immediately. This allows for a consistent low latency from the moment you call espeak_Synth() until you can start playing audio.
  2. eSpeak supports events. If you use a callback, you get access to a list of events that provide a timestamp in the audio, and the type of event that occurs there, such as word or sentence boundaries.
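
Progressive synthesis only pays off if each buffer is played as soon as it arrives, without gaps between buffers. One way to do that is to keep a running timeline and start every chunk exactly where the previous one ends. The ChunkScheduler class below is a hypothetical helper sketched for illustration; it is not part of eSpeak or eSpeak.js.

```javascript
// Sketch: queue chunks for gapless playback as they arrive.
// `ChunkScheduler` is a hypothetical helper, not part of eSpeak.js.
class ChunkScheduler {
  constructor(sampleRate, now = 0) {
    this.sampleRate = sampleRate;
    this.nextStart = now; // time (seconds) at which the next chunk should begin
  }

  // Returns the start time for a chunk of `sampleCount` samples that
  // arrived at time `now`, and reserves its slot on the timeline.
  schedule(sampleCount, now) {
    // If synthesis fell behind, start right away instead of in the past.
    const start = Math.max(this.nextStart, now);
    this.nextStart = start + sampleCount / this.sampleRate;
    return start;
  }
}
```

With Web Audio, the returned value can be passed to an AudioBufferSourceNode, for example source.start(scheduler.schedule(chunk.length, audioCtx.currentTime)). The same timeline could be used to line up word and sentence events with the audio.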

And, of course, with all the recent-ish platform improvements above, it was really time for a fresh attempt.

Future Work

  • Break up the data files. Right now, eSpeak.js is a download of over 2MB. That’s because I packaged all the eSpeak data files indiscriminately. There may be a few bits that are redundant. On the flip side, you get all 99 voice/language combinations (that’s a good deal for 2MB, eh?). It would be cool to break it up into a few data files and allow the developer to choose which voices to bundle or, even better, just grab them on demand.
  • Make a demo of the speech events. It makes my head hurt to think about how to do something compelling. But it is a neat feature that should somehow be shown.
  • ScriptProcessorNode is apparently deprecated. This is going to need to be ported to an AudioWorker once that is widely implemented.
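
For the first item, grabbing voice data on demand could look roughly like the sketch below. Everything about it is an assumption: the createVoiceLoader() name, the one-file-per-voice layout and the ".dat" extension are invented for illustration and do not describe how eSpeak data files are actually organized.

```javascript
// Sketch: fetch voice data files lazily and cache them.
// The URL layout and `createVoiceLoader` are hypothetical.
function createVoiceLoader(baseUrl, fetchFn = globalThis.fetch) {
  const cache = new Map(); // voice id -> Promise<ArrayBuffer>

  return function loadVoice(voiceId) {
    if (!cache.has(voiceId)) {
      const url = baseUrl + "/" + encodeURIComponent(voiceId) + ".dat";
      const pending = fetchFn(url).then((response) => {
        if (!response.ok) {
          throw new Error("Failed to load voice data: " + url);
        }
        return response.arrayBuffer();
      });
      // Drop failed downloads from the cache so a later call can retry.
      pending.catch(() => cache.delete(voiceId));
      cache.set(voiceId, pending);
    }
    return cache.get(voiceId);
  };
}
```

Caching the promise rather than the result means two callers asking for the same voice at the same time share a single download.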

I’m done apologizing, here is the demo.
