Terry Winograd, Stanford University
A further edited version of this will appear in
John Carroll, Ed., HCI in the New Millennium, Addison Wesley,
in press.
Version of August, 2000
This chapter describes a conceptual framework for the development of a new architecture that is oriented towards integrated interaction spaces. It gives brief examples from a research project on interactive workspaces, and poses some research issues for the future.
As a motivating example, consider a group of people in this interactive workspace working together to develop a complex web site. A large wall-mounted display contains items such as graphs representing the structure of the site, detailed work plans and schedules, pieces of text, and images. There will be a variety of other devices and modalities, but for purposes of illustration we will focus on an interaction with this display.
Consider in contrast an analogous scenario in which the display is on a standard workstation with a graphical user interface (GUI):
So why can't we program the first scenario this easily? One answer might simply be that it takes time for technologies to reach maturity. Because there are not yet many integrated interaction spaces, there have not yet been sufficient resources to develop the corresponding mechanisms for new kinds of interaction. This is, of course, true. But there is a deeper problem as well. The needed mechanisms are not just new features and widgets, but require a shift in the way we think about input-output interactions with computers: a shift to a new architecture for interaction spaces.
Figure 1: Elementary input/output architecture
A programmer who built an interactive application needed to know about the specific devices (we will refer to sensors and actuators jointly as "devices") and the details of their data structures and signals, in order to write code that used them appropriately. The code could be carefully tailored to the specific devices, to gain maximal efficiency and/or to take advantage of their special characteristics.
This arrangement worked, but had some obvious shortcomings:
Figure 2: Conventional input/output architecture
This architecture provides two fundamental levels of indirection between devices and programs.
First, the operating system provides for device drivers, which are coded to deal with the specifics of the signals to and from the device, and which provide a higher level interface to programmers. Drivers can unify abstractions for different devices (for example, different physical pointing devices can provide the same form of two-dimensional coordinate information), or can provide multiple abstraction levels for a single physical device (e.g., both interpreted handwriting and digital ink, for a pen device). An operating system can also provide higher level drivers, which further interpret events. For example, the basic motions of a pointing device can be accessed by programs in terms of an event queue whose events are expressed as high level window and menu operations. Application programs can use libraries with program interfaces that provide higher level events and descriptions, while accessing lower level drivers provided by the operating system.
The second level of indirection is in the linking of devices to programs. The operating system provides a time-sharing manager and/or window manager (details have evolved over time), which allocates connections dynamically. For example, the same keyboard may be interpreted as sending keystrokes to different programs at different moments depending on which window is the current focus. It is possible for this function to be distributed among multiple processes and processors, but for the purposes of this discussion we will simply represent it as a single "Manager" component.
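The dynamic linking of devices to programs can be illustrated with a minimal sketch. It assumes a hypothetical Manager class that delivers each keystroke to whichever program holds the focus at the moment the key is pressed. The class and method names are invented for illustration and do not correspond to any particular operating system or window manager.

```python
from collections import defaultdict


class Manager:
    """Hypothetical manager that routes keyboard events to the focused program."""

    def __init__(self):
        self.focus = None                    # name of the program that has focus
        self.delivered = defaultdict(list)   # program name -> keys it received

    def set_focus(self, program):
        self.focus = program

    def keystroke(self, key):
        # The keyboard has no knowledge of the receiving program;
        # the manager decides the destination when the event arrives.
        if self.focus is not None:
            self.delivered[self.focus].append(key)


manager = Manager()
manager.set_focus("editor")
manager.keystroke("a")
manager.set_focus("browser")
manager.keystroke("b")
```

The same keyboard ends up feeding two different programs, and neither program contains any code that refers to the other.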
These mechanisms are all at play in making it easy to write a program that implements the workstation GUI scenario presented above. Selection, object sizing, menus, the tracking of position as a mouse moves, displaying a cursor at the location, etc. are all handled by the drivers, libraries, and toolkits, so the programmer can deal with the events at a level that is close to the user-oriented description of what is happening.
The first problematic question is "What are the devices?" In the GUI example there was a mouse and a graphical display. In the interaction space example, the most obvious candidate devices are "the display, Jane's fingers, and Jane's voice". But the latter two of these are not devices in the sense of Figures 1 and 2. Although the user (and the application programmer) may think of them as devices, they are not attached to the computer through direct signals. Their activity is interpreted through devices such as cameras, trackers, and microphones. The programmer needs to deal with fingers and words at an appropriate level of abstraction, just as the GUI programmer deals with selection and menus. But this cannot be done by simply providing higher level programming interfaces to the "real" devices such as camera and microphone.
The tracking of a user's finger may involve the integration of inputs from multiple visual and proximity-detection devices, along with modeling of the physical dynamics of the body. This integration is not associated with specific devices, nor is it associated with an individual program or application. An integrated "person watcher" would provide information for any number of different programs, just as the windowing system provides keyboard and pointing information for multiple programs.
Even for simpler objects, we are beginning to see a separation between the devices as viewed by a user and those designed into the computer system. For example, "tangible user interfaces" [Fitzmaurice 1995] incorporate passive or semi-passive physical objects into computer systems as though they were virtual devices. Sensors such as cameras are used by programs that track these objects and model their behavior, and then provide a higher level interface to them.
The architecture of Figure 3 adds an explicit layer of "observers": processes that interact with devices and with other observers, to produce integrated higher level accounts of entities and happenings that are relevant to the interaction structure.
Figure 3: Architecture with a network of observers
The layer of observers has replaced, rather than being added to, the previous layer of drivers. Device drivers and single-device-based program interfaces in current systems can be thought of as simple observers, efficient for phenomena that are close to the device structure. In general, some observers will have a close relationship to the devices they interact with (e.g., a pointing device will be associated with an observer that reports its position). A single device may be used by many different observers (e.g., a camera or microphone that is being used to monitor people and their voices, track objects, detect environmental sounds and lighting, etc.). Some observers may maintain elaborate models (for example the detailed position and motion of a person's body parts).
Each observer provides an interface in terms of a specific set of objects, properties, and events. These can range from low level ("the laser pointer is at position 223, 4446") to high level interpretations ("Jane made a 'select' gesture on the screen"). Some observers will be "translators" or "integrators," which do not deal directly with any perceptual or motor devices, but which take descriptions in terms of one set of phenomena and produce others (e.g., a gesture recognition observer taking hand position information from a physical body motion observer, which in turn may take information from a visual observer based on camera input). The observer processes may operate at different places in the computation structure, some on separate machines (e.g., a specialized vision or person-tracking processor), some within the operating system, and some installed as specialized libraries in the code of individual applications processes.
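A "translator" observer of the kind just described might be sketched as follows. This is only an illustration under simplifying assumptions: the HandPosition and Gesture records and the GestureObserver class are hypothetical names, and the recognition rule (report a "select" whenever a touch begins) is far cruder than real gesture recognition would be.

```python
from dataclasses import dataclass


@dataclass
class HandPosition:
    """Low level report, as a body-motion observer might produce it (assumed format)."""
    person: str
    x: float
    y: float
    touching: bool


@dataclass
class Gesture:
    """High level interpretation produced by the translator (assumed format)."""
    person: str
    kind: str
    x: float
    y: float


class GestureObserver:
    """Hypothetical translator: consumes hand positions, produces gestures."""

    def __init__(self):
        self._was_touching = {}   # person -> touching state in the last report

    def observe(self, pos):
        was = self._was_touching.get(pos.person, False)
        self._was_touching[pos.person] = pos.touching
        # Report a "select" gesture only at the moment a touch begins.
        if pos.touching and not was:
            return Gesture(pos.person, "select", pos.x, pos.y)
        return None


observer = GestureObserver()
events = [
    observer.observe(HandPosition("Jane", 10.0, 20.0, False)),
    observer.observe(HandPosition("Jane", 12.0, 21.0, True)),
    observer.observe(HandPosition("Jane", 12.5, 21.0, True)),
]
```

The translator never touches a camera or tracker. It depends only on the format of the hand-position reports, so the observer that supplies them could be replaced without changing this code.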
To summarize this step of expanding the architecture, it separates three distinct conceptual elements that are often conflated or put into simple one-to-one correspondence:
Modern network-based software is moving towards another model, in which communication connections are virtual and dynamic, rather than explicitly represented in configuration files, routing tables, and the like. In our research we have developed a system called the Event Heap [Johanson 2000], based on an underlying model that has often been referred to as a "blackboard". Rather than creating explicit communication paths between individual components, each component can post information to a shared server, and can subscribe to receive information that has been posted that matches a chosen pattern. An observer, for example, can subscribe to events posted by particular sensors and can then publish events based on an interpretation of what it received. The person writing the code for the observer does not need to know anything about how the sensors communicate (they post their results on the blackboard), or which other observers will use what this one produces. Of course there need to be agreed-upon data formats so that the posted events can be meaningfully interpreted by receivers, but these are separated from communication protocols and issues.
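The post-and-subscribe pattern can be made concrete with a small in-process sketch. It is not the Event Heap interface: the Blackboard class, its method names, and the dictionary event format are assumptions made for illustration, and a real system would run the server as a separate networked process.

```python
class Blackboard:
    """Hypothetical shared event server with pattern-based subscription."""

    def __init__(self):
        self._subscriptions = []   # list of (pattern, callback) pairs

    def subscribe(self, pattern, callback):
        self._subscriptions.append((pattern, callback))

    def post(self, event):
        # Deliver the event to every subscriber whose pattern fields all match.
        for pattern, callback in list(self._subscriptions):
            if all(event.get(key) == value for key, value in pattern.items()):
                callback(event)


board = Blackboard()
received = []   # stands in for an application that consumes "select" events


def touch_observer(event):
    # Interpret a raw sensor event and publish a higher level one.
    board.post({"type": "select", "x": event["x"], "y": event["y"]})


board.subscribe({"type": "touch"}, touch_observer)
board.subscribe({"type": "select"}, received.append)

board.post({"type": "touch", "x": 223, "y": 4446})   # as if from a sensor
board.post({"type": "temperature", "value": 21})     # matches no subscription
```

The sensor, the observer, and the application share nothing except the blackboard and the agreed-upon event fields. None of them holds a reference to another.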
The basic blackboard idea was proposed many years ago in systems such as LINDA [Ahuja], and has been widely used in Artificial Intelligence programs, in which each source of information is likely to be partial and even unreliable [Engelmore, Martin]. As we move towards distributed, ubiquitous computing environments, the independence (mutual ignorance), partiality and unreliability of individual components is becoming a fact of life for systems that make no claims to intelligence. At the same time, advances in computation and communication speed have made it practical to use an architecture that adds an extra stop in the middle of the communication process. It is inherently slower for one component to post to a server and a second component to then read the data rather than communicating directly from one to the other. We are now reaching the stage where for most communication paths this overhead is acceptable. (For exceptions, see the discussion of perception-action coupling below.)
In addition to decoupling communication, robustness can be improved by distributing management. In the above diagrams, there is a single "Manager" that maintains information about the overall system and coordinates activities across the components. For the traditional single-user, single-computer environment, this is an excellent architecture. The user wants to interact with a variety of applications and devices in an intermixed sequence, and the manager keeps track of what is going on (the current application, the window focus, the open files, the network connections, etc.). Since the communication paths all go through a single processor, it is a natural place for integration.
In the distributed environment, with communication via a shared blackboard, it makes sense to distribute management functions as well. We still require programs that coordinate other components, but they need not be glued into a monolithic structure that mimics current windowing systems.
Figure 4 illustrates the removal of the explicit communication paths and the splitting up of the manager. There are still implicit communication paths that determine the flow from one component to another, but they are implemented in the components' use of the blackboard, and do not require explicit configuration and recording.
Figure 4. Distributed communication and management
Many programs today apply context models to interpretation. In speech systems, for example, speaker-based models are tuned to the characteristics of a particular speaker. In addition, task-based vocabularies and grammars set dynamically by applications can provide a context in which the interpretation of utterances is shaped by expectations of what would be likely to be said.
In separating the observer from specific applications, we do not want to create a context-blind interpretation. In monolithic system structures, the "Manager" is the place where context is stored and distributed as needed to components. In our architecture there still needs to be a stable shared place for storage and retrieval of contextual information, but it is separated from the processes that use it, just as data files are separated from the processes that read and write them. A context memory can be introduced as a separate component, as shown in Figure 5.
Figure 5: Providing interpretive context
The context memory provides persistent storage (beyond the scope of a particular application or interactive session) for context models, which are produced and used by the other components. Some context models deal with the current context (e.g., who is currently where in the physical environment). Others are based in applications (e.g., task-specific vocabularies and grammars). Others belong to a person in general (e.g., preferences, or speech and handwriting characteristics). We can imagine each person having an extended kind of "home page" which provides these models along with other information about preferences and resources (e.g., personal bookmark collections).
As applications programs run, they provide models for use by the observers, and potentially receive updated models from them. There is no sharp distinction between the kinds of information to be passed directly and the kinds to be stored and retrieved from the memory, but they represent different points in the communication tradeoff space. The memory is persistent with large storage capacity and has relatively high latency, so it is less appropriate for short term events (e.g., selecting an object) and more appropriate for slowly changing persistent information (e.g., user preferences). For example, a speech model for a user would be stored in the memory and downloaded once to the relevant observer during a session. The speech events being interpreted would be passed directly as data, although they might also be saved in memory for replay, logging, etc.
Although the context memory can be thought of as a single conceptual entity, it spans a space of data sizes and speed requirements that make it best thought of as a multi-level memory. In our implementation, large data objects (such as images) are stored in a conventional file structure, while small data objects (such as information about the location of a particular device in the interaction space) are stored as entries in an XML-structured database. The XML database is also used to store metadata about the objects in file storage, so that they can be retrieved by a match of their characteristics.
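The two-level arrangement can be sketched with Python's standard XML library. The element names, attribute names, and file paths below are invented for illustration and say nothing about the schema actually used in the implementation. Small objects live directly in the XML tree, while large objects are represented only by metadata entries that point at file storage.

```python
import xml.etree.ElementTree as ET

# In-memory stand-in for the XML-structured database (hypothetical schema).
db = ET.Element("context")

# A small data object is stored directly as an entry.
ET.SubElement(db, "device", name="table-display", x="2.0", y="3.5")

# Large objects stay in file storage; only their metadata is stored here.
ET.SubElement(db, "file", path="images/site-map.png", kind="image", project="website")
ET.SubElement(db, "file", path="images/logo.png", kind="image", project="branding")


def find_files(root, **traits):
    """Return the paths of file entries whose attributes match every trait."""
    return [
        entry.get("path")
        for entry in root.findall("file")
        if all(entry.get(key) == value for key, value in traits.items())
    ]


matches = find_files(db, kind="image", project="website")
device = db.find("device[@name='table-display']")
```

A component that wants an image for the web site project asks for it by characteristics and receives a path, without needing to know how file storage is organized.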
Many aspects of human-computer interaction have been subject to ever higher levels of abstraction and indirection, with satisfactory performance results. Consider, for example, the level at which a programmer specifies what is to be displayed on a screen. We have progressed from individual vectors to shaded, textured, 3-dimensional objects with controlled lighting and viewpoint. Processing power has expanded to make this possible.
The cases where performance has continued to be a deep problem are those with a tight coupling between action and perception. As a prime example, consider virtual reality using a head-mounted display. In order to maintain the perception of immersion in a 3-dimensional world, the visual rendering needs to be updated to reflect changes in head position with no perceptible lag. As a more mundane example, we require tight action-perception coupling in simple cursor positioning with a mouse. If the motion of the cursor lags too far behind the movement of the hand, effectiveness is greatly decreased. To operate at action-perception coupling speeds (i.e., a latency in the milliseconds), system architectures need to pay special attention to coupling.
Many systems today (from head-mounted VR to the cursor tracker in every GUI OS) achieve satisfactory action-perception coupling by wiring it in specially rather than using the more general interaction mechanisms provided for less time-sensitive processes. This makes it difficult to extend these programs, as discovered, for example, by anyone who has tried to extend a standard GUI system to handle multiple users each with a cursor [Myers 1998]. Some such problems are solved in distributed windowing systems (such as X-Windows) by providing specific coupling mechanisms in the server for operations such as dragging. On the other hand, if the programmer wanted to do live rotation instead of translation of an object, this would not work, since the server does not provide sufficient tools for a rotation coupling. Specialized platforms for applications such as live-action games and music-playing provide for coupling within their specialized domains.
A somewhat more general approach was taken in the Cognitive Coprocessor [Robertson 1989], which had a manager dedicated to maintaining interaction coupling between a task queue and a display queue. This can be generalized in tools for action-perception coupling. To be effective, the following conditions must be met:
However, our touch screens are limited in the kinds of interaction they support. The hardware can identify only a single point of touch at any time (multiple or large-area touches are signaled as a single touch at their center), and there is no simple way to augment a touch with the kind of information provided by the buttons on standard pointing devices. In effect, every touch is interpreted as though the left mouse-button is being clicked at a single point.
For the user, the device being used is not the touch screen, but the hand that touches it. A hand has much more variability than a single point of touch. It can be held in various postures (e.g., touching with two fingers at once), can move in complex ways (gestures, rolling, etc.) and can exert complex pressure patterns. By adding additional observers, we can take advantage of the larger space. In our experiments we have added a camera-based observer that enables the system to identify hand posture. By combining this with the information provided by the touch-screen driver, we have designed interaction modes that use posture to modify touch. In one mapping we have experimented with, a single-finger touch is interpreted as a left-button, a two-finger touch as a right button, the palm of the hand as an eraser and the side of the hand as a highlighter. In another mapping, a touch of the side of the hand held vertically against the board does a Copy command, and a touch with the hand held horizontally does a Paste. The user does not think about the devices, but about using hands in new ways.
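The first mapping described above amounts to combining two observers' outputs: a touch point from the touch-screen driver and a posture label from the camera-based observer. A minimal sketch follows, in which the posture labels, action names, and function name are all assumed for illustration. When the posture is unknown, the sketch falls back to the left-button interpretation that the touch screen gives on its own.

```python
# Posture-to-action table for the first mapping (labels are hypothetical).
POSTURE_ACTIONS = {
    "one_finger": "left_button",
    "two_fingers": "right_button",
    "palm": "eraser",
    "hand_side": "highlighter",
}


def interpret_touch(touch_point, posture):
    """Combine a touch location with a hand posture into one interaction event."""
    # Without a recognized posture, treat the touch as a plain left click.
    action = POSTURE_ACTIONS.get(posture, "left_button")
    x, y = touch_point
    return {"action": action, "x": x, "y": y}


erase = interpret_touch((5, 7), "palm")
fallback = interpret_touch((5, 7), None)
```

Supporting the second mapping (Copy and Paste by hand orientation) would mean swapping the table, not changing either observer.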
As with the previous example, this device-centric view does not correspond to the user's view of the world. If there is a single mouse and keyboard, the result of moving the mouse should be oriented to the space as a whole, not to the underlying computational system. When the cursor reaches the edge of a screen, it should continue moving onto the adjacent screen (as it does in multi-monitor single-desktop systems), without regard to which processor is displaying on that screen. This includes moving across different kinds of devices -- from the back-projected touch screens, onto a front-projected wall, onto the bottom-projected table, and even onto screens of laptops sitting on that table. The keyboard should go along, sending its input to whatever display currently displays the cursor.
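The space-oriented behavior can be sketched by giving the cursor a position in workspace coordinates and deriving the owning screen from that position. The sketch is deliberately one-dimensional, and the screen names, machine names, and pixel widths are invented. It shows only the coordinate mapping and leaves aside the question of how events reach each machine.

```python
from dataclasses import dataclass


@dataclass
class Screen:
    name: str
    machine: str   # processor that drives this screen
    left: int      # left edge in shared workspace coordinates
    width: int


# Hypothetical layout: three screens side by side, each on its own machine.
SCREENS = [
    Screen("touch-wall", "pc-1", 0, 1024),
    Screen("front-wall", "pc-2", 1024, 1280),
    Screen("table", "pc-3", 2304, 1024),
]


def locate(x):
    """Map a workspace x coordinate to (screen, x local to that screen)."""
    total = SCREENS[-1].left + SCREENS[-1].width
    x = max(0, min(x, total - 1))   # keep the cursor inside the workspace
    for screen in SCREENS:
        if screen.left <= x < screen.left + screen.width:
            return screen, x - screen.left


cursor_x = 1000
before, _ = locate(cursor_x)
cursor_x += 50                     # the user keeps moving the mouse to the right
after, local_x = locate(cursor_x)
keyboard_target = after.machine    # keystrokes follow the cursor
```

Moving 50 units to the right carries the cursor off the first screen and onto the second, and the keyboard destination changes with it.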
The PointRight system we have developed [PointRight] provides this space-oriented interaction, allowing any pointing device (the standard one is a wireless GyroMouse) to point anywhere in the workspace. This example, as mentioned in the section on action-perception coupling, demands a low latency so that the user has the experience of directly moving the cursor, not of operating a pointing device which then gives motion instructions to a cursor. This has required ongoing development of the Event Heap architecture to combine the desired speed with the generality of communication paths.
From the user's point of view, this controller operates directly on the physical devices, without regard to the actual sequence of events that are communicated among various components to achieve the effect. From the system builder's point of view, the initial controllers were inflexible, requiring reprogramming whenever the physical situation in the space was modified. This created a need for a context memory. Information about the devices in the room and their physical layout is stored in this memory and updated when the space is changed. The application that provides room-control interfaces to users can make use of this data dynamically to produce a controller that is specific to the setting.
This capability requires observers that can identify which person makes an action using any device. For some observers (such as ones that respond to voice control) this will require specialized programs that do things like speaker recognition. For many devices, it is sufficient to identify the person who is in physical proximity. A variety of methods are possible, including visual observation of people in the space and wearable infrared badges or radio-frequency tags that can be tracked. Although experiments have been done, there are not yet reliable, sufficiently accurate means to do this, and there are additional concerns to be addressed about users' privacy and their willingness to wear tracking devices.
First, in building new systems that implement parts of a general mechanism, we can use structures that are compatible with the larger architecture and open to extension within the framework. In our own work on the interactive workspace, we are taking this approach. We are developing and integrating capabilities using a bottom-up strategy, with the larger-scale view as background. Second, the conceptual distinctions here can be useful in sorting out problems and confusions in designing special purpose systems. This will become increasingly important as more applications begin to make use of broad, rich input devices (e.g., cameras and microphones), with their attendant problems of person identification and context-based interpretation of the phenomena of relevance to the user and computing system. Finally, a shift of perspective may be a catalyst to help provoke new ideas about what to try, and what can be done in improving the ways in which computers and people interact.