Getting to know Voice
By Jonny Axelsson · 30 Oct, 2006
From outside the traditional browsing world comes a range of techniques that let a developer code speech behaviours much more easily than was previously possible. Opera has early support for these techniques, and the W3C is working on standards for combining speech with the ordinary graphical user interface.
Speech Recognition
Understanding natural speech is hard for a machine, sometimes almost impossible. One way to make this easier is to train the machine to recognize the user's voice and speech patterns. The user is asked to read some predetermined texts aloud so that the machine can adapt to the speaker.
Training is not practical for web pages, so another approach is used instead: explicit grammars of expected utterances. Grammars have two advantages. First, the machine has a much higher chance of guessing what the user meant. Second, because the accepted values are restricted and known in advance, it is easier to give a sensible response to each of them.
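To make the grammar idea concrete, here is a minimal sketch in Python. It is only a toy model: text strings stand in for audio, and string similarity stands in for the acoustic scoring a real recognizer would do. The phrase list, the `recognize` function and the `cutoff` value are all assumptions made for illustration and are not part of X+V or Opera.

```python
import difflib

# Hypothetical grammar: the only utterances this page expects to hear.
GRAMMAR = ["coffee", "tea", "hot chocolate"]

def recognize(heard, grammar=GRAMMAR, cutoff=0.6):
    """Return the grammar phrase closest to what was heard.

    Returns None when nothing in the grammar is similar enough,
    which plays the role of a "no match" result.
    """
    matches = difflib.get_close_matches(heard.lower(), grammar, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A slightly garbled input still resolves to a known phrase,
# while an unexpected one is rejected instead of being guessed at.
print(recognize("cofee"))
print(recognize("pizza"))
```

Because the result is always one of the listed phrases or nothing at all, the code that responds only has to handle a handful of known cases.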
Text To Speech
The machine talks back. It is possible to get fairly natural machine-generated voices, but this is a trade-off with file size, resource use, and flexibility. The machine voices typically used for web pages would fool nobody, but they are still easy to understand. Text To Speech is fairly common by now, but Opera is as yet the only browser that lets you style the spoken text using CSS.
XHTML+Voice
The language or profile Opera uses to combine XHTML and Voice is called XHTML+Voice, or X+V for short. More than for normal XHTML pages, X+V design is primarily about interaction design. You create the storyboard for what the user can do and when. You can make a simple web page enhancement, or you can craft elaborate Byzantine labyrinths for the user to get lost in. It is all up to you.
An X+V web page is a normal XHTML page with additional Voice forms in the head. These voice forms cover both speech recognition and text to speech markup. The interaction is event based: you use XML Events to describe which handlers should run in response to which events. An event can for instance be that the page has loaded, that the user clicks on a button, or that the user says something the speech recognizer doesn't understand. The consequence can be a voice monolog or dialog with the user. This in turn can for instance trigger a script, reformat the page, or throw another event.
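The event-based flow described above can be modelled in a few lines of Python. This is a sketch of the idea only: in a real X+V page the bindings are declared in markup with XML Events, not in Python, and the `Page` class, the handler functions and the spoken texts are hypothetical. The event names `load`, `nomatch` and `help` are borrowed from the kinds of events an X+V page deals with.

```python
from collections import defaultdict

class Page:
    """Toy stand-in for an X+V page: events are bound to handlers."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # records events and spoken output, in order

    def on(self, event, handler):
        """Bind a handler to an event name."""
        self.handlers[event].append(handler)

    def dispatch(self, event):
        """Record the event, then run every handler bound to it."""
        self.log.append(event)
        for handler in self.handlers[event]:
            handler(self)

def greet(page):
    page.log.append("say: Welcome. Which drink would you like?")

def reprompt(page):
    page.log.append("say: Sorry, I did not understand.")
    page.dispatch("help")  # a handler may throw another event

def give_help(page):
    page.log.append("say: You can say coffee or tea.")

page = Page()
page.on("load", greet)        # the page has loaded
page.on("nomatch", reprompt)  # the recognizer did not understand
page.on("help", give_help)

page.dispatch("load")
page.dispatch("nomatch")
for entry in page.log:
    print(entry)
```

The design point is that the page author only declares which handler belongs to which event; the order in which things happen is driven by the user and the recognizer.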
Getting started with X+V
- Voice user tutorial
- Testing, testing. One, two, three. How do I use this Voice thing?
- X+V by Example
- Can you say "Hello World"? I knew you could... Start making X+V pages.
- X+V in Style
- After all, it isn't as much what you say as how you say it. Add styled speech to X+V.
- X+V in Action
- Do as I say, not as I do. Use X+V to voice-enable JavaScript.
- How to Add Voice Interactivity to Your Site
- Practical experience of adding voice to a web site
Getting to know it
Even though X+V is still cutting edge, there are places you can go for help and more information.
- Opera accessibility and voice browsing
- Welcome to our forum on everything voice-related.
- ibm.software.speech.multimodal
- IBM Newsgroup for multimodal applications
- IBM multimodal site
- This IBM site has a large collection of documents on voice and multimodal interaction.
- W3C Multimodal Interaction working group
- The documents here are quite technical, but this is the place where the future standards are defined. It has a public mailing list. You might also want to look at the HTML and Voice working group pages. Speech styling is handled by the CSS working group.
Learning to speak
XHTML+Voice
- XHTML+Voice Programmer's Guide (PDF)
- The book of X+V, including a reference for every element added to XHTML.
- XHTML+Voice 1.2
- The technical specification Opera+Voice is based upon.
- X+V Speech Considerations
- The alternatives and trade-offs for text to speech. Why the most natural-sounding voice may not always be the best choice.
- Multimodal Application Design Issues (PDF)
- Many good tips for how to design good X+V pages.
- Developing Multimodal Applications using XHTML+Voice (PDF)
- Much the same topics as the above article, but more code oriented.
- XML Events tutorial
- How to get what you want when something happens. This tutorial explains how XML Events works. Alternatively, you can go directly to the specification.
Speech recognition
Speech recognition is assisted by grammars.
- Speech Recognition Grammar Specification
- The language to describe what user utterances are accepted.
- Semantic Interpretation for Speech Recognition
- This lets you specify which part of the recognized speech to use, for instance to voice-enable a JavaScript application.
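As a rough illustration of what semantic interpretation achieves, the Python sketch below picks the meaningful words out of a recognized sentence and hands back a small structured result that a script could use. It is an analogy, not the specification's own syntax: the word lists and the `interpret` function are hypothetical.

```python
# Hypothetical vocabulary: the words that carry meaning for this page.
SIZES = {"small", "medium", "large"}
DRINKS = {"coffee", "tea"}

def interpret(utterance):
    """Keep only the parts of the utterance the application cares about."""
    result = {}
    for word in utterance.lower().split():
        if word in SIZES:
            result["size"] = word
        elif word in DRINKS:
            result["drink"] = word
    return result  # filler words such as "please" are simply dropped

order = interpret("I would like a large coffee please")
print(order)
```

The script that receives the result never sees the full sentence, only the values it needs.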
VoiceXML, the dialog language
Just as HTML is the language for web documents, VoiceXML is the language for voice interaction.
- VoiceXML Forum
- Industry organization for the promotion of VoiceXML, with some useful information.
- World of VoiceXML
- A collection of VoiceXML resources.
- VoiceXML 2.0
- The specification itself.
Styling speech
- CSS3 Speech Module
- This is the specification currently under development. As it is a work in progress, the final version of the module may be different from the one we know today.
- CSS 2 Aural CSS
- In 1998 CSS2 Aural CSS (ACSS) was specified, years ahead of its time. As CSS2 Aural CSS is superseded by CSS3 Speech, this link is for informational purposes only. Even so, as the companion CSS3 Audio Module hasn't been made yet, CSS2 Aural CSS is actually more powerful than CSS3 Speech.