Speaking with your Voice, Pen, Eyes, Face, Hands, and Fingers

Alex Waibel, Interactive Systems Laboratories, Carnegie Mellon University and University of Karlsruhe

Seminar on People, Computers, and Design
Stanford University, April 4, 1997


Human-computer interaction to date is severely impoverished by the sensory deprivation of machines that can interact with their environment only through the keyboard and mouse. By contrast, human-to-human interaction involves speech, writing, eye contact, lip-reading, hand gestures, pointing, and a host of other sophisticated mechanisms that transmit information from one person to another robustly and flexibly, along with the ability to learn from that interaction.

In this talk I will describe work we are doing at our lab to enable computer agents to bring similar agility, flexibility, and naturalness to human-computer interaction. I will describe the scientific challenges behind recognizing conversational speech, cursive handwriting, gesture, pointing, and lip motion, and behind tracking faces, eyes, head orientation, and sound sources, along with the solutions we have developed to date to track and interpret these signals robustly. An important opportunity arising from the *combination* of these modalities is the increased robustness of the overall interface that results from interpreting human intent from the *joint* and complementary cues a person provides. More interestingly, remaining errors can be repaired quite effectively and naturally by switching between communication modes (speaking, spelling, writing, etc.) that provide another look at the same words or concepts.

I will then describe (and show videos of) several user interfaces that embody these multimodal capabilities. QUICK_DOC and QUICK_TURN are example systems developed to allow medical doctors or image analysts to access and manipulate multimedia records, and to rapidly create and disseminate reports based on this information. The (multimedia) information is manipulated, searched, and delivered by voice, pointing, and handwriting, and Web reports are generated automatically to achieve faster turnaround. The systems are implemented as a Web-based client/server architecture, in which voice, pen, and other signals can be captured at a (potentially mobile) client and processed, interpreted, and serviced elsewhere.


Alex Waibel is a senior research scientist in the School of Computer Science at Carnegie Mellon University. His research interests center on the interpretation and integration of speech, language, and other human communication signals for better human-computer and computer-mediated interaction. Two particularly challenging examples of those interests are the JANUS project, a speech-to-speech translation system that translates spoken language between English, German, Japanese, and Spanish (e.g., a translating telephone), and the INTERACT project, which attempts to design multimodal interfaces.

