Interaction Spaces for 21st Century Computing

Terry Winograd, Stanford University
A further edited version of this will appear in
John Carroll, Ed., HCI in the New Millennium, Addison Wesley, in press.

Version of August, 2000

1. INTRODUCTION

Computing environments of the late twentieth century have been dominated by a standard desktop/laptop configuration. A single user sits in front of a screen with a keyboard and pointing device, interacting with a collection of applications, some of which use resources from the Internet. As many researchers have pointed out, computing today is moving away from this model in a number of ways (for example, see [Weiser]). Each of these changes to today's standard computer interaction modes raises its own technical difficulties and specialized areas of research. Taking a broader view, we need to reexamine some fundamental assumptions about the structure of interactive systems and integrated environments. Today's conventional model of interaction architecture and device communication has served us well, but it will have to evolve towards a different architecture, one focused on multiple users in an interaction space rather than on systems as a network of processors and devices.

This chapter describes a conceptual framework for the development of a new architecture oriented towards integrated interaction spaces. It gives brief examples from a research project on interactive workspaces, and poses some research issues for the future.

1.1 Scenario

Our research group in Graphics and HCI at Stanford University is one of several groups building interactive workspaces, which integrate a number of computer displays and devices in a single room [Rekimoto, Streitz]. Our workspace includes large high-resolution displays (wall mounted and tabletop), personal and mobile devices (PDAs, tablet computers, laser pointers, etc.), and environmental sensors (cameras, microphones, floor pressure sensors, etc.). The space is designed to support joint work by multiple users, who move from device to device and use multiple interaction modalities appropriate to the task and materials. Many of the activities involve more than one physical device (e.g., the large display, pointers, voice, and one or more hand-held devices).

As a motivating example, consider a group of people in this interactive workspace working together to develop a complex web site. A large wall-mounted display contains items such as graphs representing the structure of the site, detailed work plans and schedules, pieces of text, and images. There will be a variety of other devices and modalities, but for purposes of illustration we will focus on an interaction with this display.

  1. Jane places her two index fingers on one of the images and slides them apart and together. As she does, the image expands and shrinks accordingly. She stops when it is the right size.
  2. She touches the screen with her index finger, and draws a circle around a few of the images. The images change appearance to indicate selection.
  3. She says aloud "Hold for product page."
  4. The scaled, selected images are now available for later retrieval under the category "product page."
This scenario is clearly feasible today and we can expect the hardware to soon reach a price where the required devices will be commonplace. Each piece of the functionality has been available for some time: the recognition of freehand gestures [Maes 1993]; gesture-based interaction with whiteboard contents [Moran 1997]; dynamic zooming of images [Bederson 1994]; and voice-driven commands [Bolt 1980]. However, each of the existing systems that provides some of these capabilities is a research system, in which integration is limited and a large amount of specialized coding was required to achieve the desired results.

Consider in contrast an analogous scenario in which the display is on a standard workstation with a graphical user interface (GUI):

  1. Jane clicks the mouse over one of the images. The image displays a set of associated handles. She drags one of the handles until it reflects a new desired size and lets up. The image is resized.
  2. Jane drags her mouse along the diagonal of a rectangle that encloses several images, holding down the left button. When she lets up on the button, the images within the rectangular area change appearance to indicate selection.
  3. She uses the mouse to activate the "Hold" menu in the menu bar at the top of the screen, which contains an item for each of the current categories, and selects "product page."
  4. The selected images, in the specified size, are now available for later retrieval under the category "product page."
This second scenario could be programmed fairly easily by anyone skilled in the use of any of a variety of interface building tools (e.g., Visual Basic, TCL/TK, Java tool kits). It is not far beyond what HyperCard made available more than a decade ago to a wide population of programmers from elementary school age up. All of the interaction elements (selection, positioning, command invocation) are available in the basic operating system, or in the form of widgets, tool kits, and standard libraries.

So why can't we program the first scenario this easily? One answer might simply be that it takes time for technologies to reach maturity. Because there are not yet many integrated interaction spaces, there have not yet been sufficient resources to develop the corresponding mechanisms for new kinds of interaction. This is, of course, true. But there is a deeper problem as well. The needed mechanisms are not just new features and widgets, but require a shift in the way we think about input-output interactions with computers: a shift to a new architecture for interaction spaces.

2. ARCHITECTURE MODELS

Three obvious elements are needed for human-computer interaction: a person, a computer, and one or more physical devices that operate in the person's physical space and exchange signals with the computer. In the early days of computing, the structure was simple, as shown in Figure 1.

Figure 1: Elementary input/output architecture

A programmer who built an interactive application needed to know about the specific devices (we will refer to sensors and actuators jointly as "devices") and the details of their data structures and signals, in order to write code that used them appropriately. The code could be carefully tailored to the specific devices, to gain maximal efficiency and/or to take advantage of their special characteristics.

This arrangement worked, but had some obvious shortcomings:

  1. Each new program had to have code to deal with the specifics of the devices.
  2. Each new device (or modification to an existing device) could require reprogramming of pre-existing applications.
  3. If a computer supported multiple processes, then conflicts could arise when two processes communicated with the same device.

2.1 Decoupling Devices from Programs

Over the first few decades of computing, a more complex architecture emerged to deal with these problems, using indirection to decouple programs from device interaction details, as illustrated in Figure 2.

Figure 2: Conventional input/output architecture

This architecture provides two fundamental levels of indirection between devices and programs.

First, the operating system provides for device drivers, which are coded to deal with the specifics of the signals to and from the device, and which provide a higher level interface to programmers. Drivers can unify abstractions for different devices (for example, different physical pointing devices can provide the same form of two-dimensional coordinate information), or can provide multiple abstraction levels for a single physical device (e.g., both interpreted handwriting and digital ink, for a pen device). An operating system can also provide higher level drivers, which further interpret events. For example, the basic motions of a pointing device can be accessed by programs in terms of an event queue whose events are expressed as high level window and menu operations. Application programs can use libraries with program interfaces that provide higher level events and descriptions, while accessing lower level drivers provided by the operating system.
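
As a rough illustration of these two kinds of driver-level indirection, the following Python sketch shows different physical pointing devices exposing the same two-dimensional abstraction, and a pen driver exposing both digital ink and interpreted text. The class and method names are illustrative only; they do not correspond to any particular operating system's driver API.

    # Illustrative sketch of driver-level indirection (invented names).
    from dataclasses import dataclass
    from typing import Protocol

    @dataclass
    class PointerEvent:
        x: int          # screen coordinates, regardless of the physical device
        y: int
        pressed: bool

    class PointingDriver(Protocol):
        def read(self) -> PointerEvent: ...

    class MouseDriver:
        def read(self) -> PointerEvent:
            # A real driver would decode hardware signals; here we return a fixed sample.
            return PointerEvent(x=223, y=446, pressed=True)

    class TouchScreenDriver:
        def read(self) -> PointerEvent:
            # A touch is reported in the same vocabulary as a mouse.
            return PointerEvent(x=100, y=80, pressed=True)

    class PenDriver:
        """One physical device, two abstraction levels."""
        def read_ink(self) -> list[tuple[int, int]]:
            return [(10, 10), (12, 14), (15, 20)]      # raw digital ink
        def read_text(self) -> str:
            return "Hold for product page"             # interpreted handwriting

    def handle(pointer: PointingDriver) -> None:
        # Application code is written once against the abstraction.
        ev = pointer.read()
        print(f"pointer at ({ev.x}, {ev.y}), pressed={ev.pressed}")

    handle(MouseDriver())
    handle(TouchScreenDriver())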

The second level of indirection is in the linking of devices to programs. The operating system provides a time-sharing manager and/or window manager (details have evolved over time), which allocates connections dynamically. For example, the same keyboard may be interpreted as sending keystrokes to different programs at different moments depending on which window is the current focus. It is possible for this function to be distributed among multiple processes and processors, but for the purposes of this discussion we will simply represent it as a single "Manager" component.
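
A minimal sketch of this second level of indirection, again with invented names: a window-manager-like component routes the same keyboard stream to whichever program currently holds the focus.

    # Illustrative sketch of focus-based routing by a "Manager" component.
    from collections import defaultdict

    class WindowManager:
        def __init__(self):
            self.focus = None
            self.queues = defaultdict(list)     # per-program event queues

        def set_focus(self, program: str) -> None:
            self.focus = program

        def on_keystroke(self, key: str) -> None:
            # The same keyboard feeds different programs at different moments.
            if self.focus is not None:
                self.queues[self.focus].append(key)

    wm = WindowManager()
    wm.set_focus("editor")
    wm.on_keystroke("a")
    wm.set_focus("browser")
    wm.on_keystroke("b")
    print(wm.queues)    # {'editor': ['a'], 'browser': ['b']}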

These mechanisms are all at play in making it easy to write a program that implements the workstation GUI scenario presented above. Selection, object sizing, menus, the tracking of position as a mouse moves, displaying a cursor at the location, etc. are all handled by the drivers, libraries, and toolkits, so the programmer can deal with the events at a level that is close to the user-oriented description of what is happening.

2.2 Decoupling Devices from Phenomena

The problem in trying to support the programming of our interactive workspace scenario is not just one of writing more drivers and APIs; it calls for some fundamental conceptual shifts.

The first problematic question is "What are the devices?" In the GUI example there was a mouse and a graphical display. In the interaction space example, the most obvious candidate devices are "the display, Jane's fingers, and Jane's voice". But the latter two are not devices in the sense of Figures 1 and 2. Although the user (and the application programmer) may think of them as devices, they are not attached to the computer through direct signals. Their activity is interpreted through devices such as cameras, trackers, and microphones. The programmer needs to deal with fingers and words at an appropriate level of abstraction, just as the GUI programmer deals with selection and menus. But this cannot be done by simply providing higher level programming interfaces to the "real" devices such as cameras and microphones.

The tracking of a user's finger may involve the integration of inputs from multiple visual and proximity-detection devices, along with modeling of the physical dynamics of the body. This integration is not associated with specific devices, nor is it associated with an individual program or application. An integrated "person watcher" would provide information for any number of different programs, just as the windowing system provides keyboard and pointing information for multiple programs.

Even for simpler objects, we are beginning to see a separation between the devices as viewed by a user and those designed into the computer system. For example, "tangible user interfaces" [Fitzmaurice 1995] incorporate passive or semi-passive physical objects into computer systems as though they were virtual devices. Sensors such as cameras are used by programs that track these objects and model their behavior, and then provide a higher level interface to them.

The architecture of Figure 3 adds an explicit layer of "observers": processes that interact with devices and with other observers, to produce integrated higher level accounts of entities and happenings that are relevant to the interaction structure.

Figure 3: Architecture with a network of observers

The layer of observers replaces, rather than augments, the previous layer of drivers. Device drivers and single-device-based program interfaces in current systems can be thought of as simple observers, efficient for phenomena that are close to the device structure. In general, some observers will have a close relationship to the devices they interact with (e.g., a pointing device will be associated with an observer that reports its position). A single device may be used by many different observers (e.g., a camera or microphone that is being used to monitor people and their voices, track objects, detect environmental sounds and lighting, etc.). Some observers may maintain elaborate models (for example, the detailed position and motion of a person's body parts).

Each observer provides an interface in terms of a specific set of objects, properties, and events. These can range from low level ("the laser pointer is at position 223, 4446") to high level interpretations ("Jane made a 'select' gesture on the screen"). Some observers will be "translators" or "integrators," which do not deal directly with any perceptual or motor devices, but which take descriptions in terms of one set of phenomena and produce others (e.g., a gesture recognition observer taking hand position information from a physical body motion observer, which in turn may take information from a visual observer based on camera input). The observer processes may operate at different places in the computation structure, some on separate machines (e.g., a specialized vision or person-tracking processor), some within the operating system, and some installed as specialized libraries in the code of individual application processes.
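
This chain of observers can be sketched as follows. The observer classes and their outputs are hypothetical stand-ins: a body-tracking observer interprets camera frames, and a gesture observer interprets the body tracker's output, so that neither the application nor the gesture recognizer deals with the camera directly.

    # Illustrative sketch of a chain of observers (invented names and data).
    class CameraObserver:
        def frame(self):
            # Stand-in for a camera frame; a real observer would return pixels.
            return {"blob_at": (223, 446)}

    class BodyObserver:
        def __init__(self, camera: CameraObserver):
            self.camera = camera

        def hand_position(self):
            # Interprets camera output as the position of a hand.
            return self.camera.frame()["blob_at"]

    class GestureObserver:
        def __init__(self, body: BodyObserver):
            self.body = body
            self._trail = []

        def update(self):
            self._trail.append(self.body.hand_position())

        def current_gesture(self) -> str:
            # A real recognizer would classify the motion trail; here we
            # simply report a "select" gesture once enough samples arrive.
            return "select" if len(self._trail) >= 3 else "none"

    gestures = GestureObserver(BodyObserver(CameraObserver()))
    for _ in range(3):
        gestures.update()
    print(gestures.current_gesture())   # -> "select"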

To summarize this step of expanding the architecture, it separates three distinct conceptual elements that are often conflated or put into simple one-to-one correspondence:

  1. Devices: sensors and actuators, and the signals they accept and produce
  2. Phenomena: a space of things and happenings that are relevant to a program
  3. Observers: processes that produce a particular interpretation of the phenomena using information from devices

3. ROBUST DYNAMIC CONFIGURATION AND COMMUNICATION

The previous figure contains dozens of arrows, representing communication paths between various system components. The arrows are intentionally abstract, not specifying whether communication is between parts of a single program, programs on the same machine, or across a network. In fact, for most systems there will be a mixture, and the resulting complexity creates great problems for building robust systems. Many systems are built today using a mixture of component-to-component connections, which require different kinds of configuration steps (e.g., for adding a new observer process on a machine, or adding a new processor to the network). The result is brittle rather than flexible -- it is difficult to make changes, and difficult to recover from the failures of individual components.

Modern network-based software is moving towards another model, in which communication connections are virtual and dynamic, rather than explicitly represented in configuration files, routing tables, and the like. In our research we have developed a system called the Event Heap [Johanson 2000], based on an underlying model that has often been referred to as a "blackboard". Rather than creating explicit communication paths between individual components, each component can post information to a shared server, and can subscribe to receive information that has been posted that matches a chosen pattern. An observer, for example, can subscribe to events posted by particular sensors and can then publish events based on an interpretation of what it received. The person writing the code for the observer does not need to know anything about how the sensors communicate (they post their results on the blackboard), or which other observers will use what this one produces. Of course there need to be agreed-upon data formats so that the posted events can be meaningfully interpreted by receivers, but these are separate from the communication protocols and connection details.
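
The publish/subscribe pattern can be illustrated with a small tuple-space-style sketch. This is not the actual Event Heap API; the Blackboard class and its field-matching rule are simplified assumptions made for illustration.

    # Generic blackboard sketch in the spirit of the Event Heap (not its real API).
    from typing import Callable

    class Blackboard:
        def __init__(self):
            self.subscribers: list[tuple[dict, Callable[[dict], None]]] = []

        def subscribe(self, pattern: dict, callback: Callable[[dict], None]) -> None:
            # A subscriber asks for every posted event whose fields include the pattern.
            self.subscribers.append((pattern, callback))

        def post(self, event: dict) -> None:
            for pattern, callback in self.subscribers:
                if all(event.get(k) == v for k, v in pattern.items()):
                    callback(event)

    board = Blackboard()

    # A gesture observer subscribes to touch events and posts interpretations,
    # without knowing which sensor produced them or who will consume them.
    def on_touch(event: dict) -> None:
        board.post({"type": "gesture", "name": "select", "source": event["device"]})

    board.subscribe({"type": "touch"}, on_touch)
    board.subscribe({"type": "gesture"}, lambda e: print("application saw:", e))

    # A touch-screen driver posts a low-level event.
    board.post({"type": "touch", "device": "smartboard-1", "x": 223, "y": 446})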

The basic blackboard idea was proposed many years ago in systems such as LINDA [Ahuja], and has been widely used in Artificial Intelligence programs, in which each source of information is likely to be partial and even unreliable [Engelmore, Martin]. As we move towards distributed, ubiquitous computing environments, the independence (mutual ignorance), partiality, and unreliability of individual components is becoming a fact of life for systems that make no claims to intelligence. At the same time, advances in computation and communication speed have made it practical to use an architecture that adds an extra stop in the middle of the communication process. It is inherently slower for one component to post to a server and a second component to then read the data than for one to communicate directly with the other. We are now reaching the stage where for most communication paths this overhead is acceptable. (For exceptions, see the discussion of action-perception coupling below.)

In addition to decoupling communication, robustness can be improved by distributing management. In the above diagrams, there is a single "Manager" that maintains information about the overall system and coordinates activities across the components. For the traditional single-user, single-computer environment, this is an excellent architecture. The user wants to interact with a variety of applications and devices in an intermixed sequence, and the manager keeps track of what is going on (the current application, the window focus, the open files, the network connections, etc.). Since the communication paths all go through a single processor, it is a natural place for integration.

In the distributed environment, with communication via a shared blackboard, it makes sense to distribute management functions as well. We still require programs that coordinate other components, but they need not be glued into a monolithic structure that mimics current windowing systems.

Figure 4 illustrates the removal of the explicit communication paths and the splitting up of the manager. There are still implicit communication paths that determine the flow from one component to another, but they are implemented in the components' use of the blackboard, without requiring explicit configuration and recording.

Figure 4: Distributed communication and management

4. CONTEXT-BASED INTERPRETATION

The use of higher-level observers leads to a problem of interpretive context. An application may need to interpret a certain hand motion as a gesture or a sequence of sounds as a voice saying a particular phrase. The purpose of providing a level of indirection through observers is to be able to add general capabilities such as word and gesture recognition to the overall system (not just to one application). But the interpretation of a sequence of motions or sounds will differ depending on what the application (and the user) is doing, how the particular person moves and talks, etc. A circular wave of the hand may be a selection gesture in one activity, and a circle-drawing gesture (or a meaningless motion) in another. The way that Jane moves her hand in pointing may be consistent over time, but different from Jim's.

Many programs today apply context models to interpretation. In speech systems, for example, speaker-based models are tuned to the characteristics of a particular speaker. In addition, task-based vocabularies and grammars set dynamically by applications can provide a context in which the interpretation of utterances is shaped by expectations of what would be likely to be said.

In separating the observer from specific applications, we do not want to create a context-blind interpretation. In monolithic system structures, the "Manager" is the place where context is stored and distributed as needed to components. In our architecture there still needs to be a stable shared place for storage and retrieval of contextual information, but it is separated from the processes that use it, just as data files are separated from the processes that read and write them. A context memory can be introduced as a separate component, as shown in Figure 5.

Figure 5: Providing interpretive context to observers

The context memory provides persistent storage (beyond the scope of a particular application or interactive session) for context models, which are produced and used by the other components. Some context models deal with the current context (e.g., who is currently where in the physical environment). Others are based in applications (e.g., task-specific vocabularies and grammars). Others belong to a person in general (e.g., preferences, or speech and handwriting characteristics). We can imagine each person having an extended kind of "home page" which provides these models along with other information such as preferences and resources (e.g., personal bookmark collections).

As application programs run, they provide models for use by the observers, and potentially receive updated models from them. There is no sharp distinction between the kinds of information to be passed directly and the kinds to be stored in and retrieved from the memory, but they represent different points in the communication tradeoff space. The memory is persistent with large storage capacity and relatively high latency, so it is less appropriate for short term events (e.g., selecting an object) and more appropriate for slowly changing persistent information (e.g., user preferences). For example, a speech model for a user would be stored in the memory and downloaded once to the relevant observer during a session. The speech events being interpreted would be passed directly as data, although they might also be saved in memory for replay, logging, etc.
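
The division of labor between the context memory and direct event passing might look roughly like the following sketch, with hypothetical names: a speaker model is fetched once from the memory at session start, while individual utterance events are handed to the observer directly.

    # Illustrative sketch of context memory use (invented classes and keys).
    class ContextMemory:
        """Persistent store keyed by (owner, kind); a stand-in for a real store."""
        def __init__(self):
            self._store = {}

        def put(self, owner: str, kind: str, model: dict) -> None:
            self._store[(owner, kind)] = model

        def get(self, owner: str, kind: str) -> dict:
            return self._store.get((owner, kind), {})

    class SpeechObserver:
        def __init__(self, memory: ContextMemory, user: str):
            # The speaker model is downloaded once at session start ...
            self.model = memory.get(user, "speech-model")

        def interpret(self, utterance: str) -> str:
            # ... while individual utterance events arrive directly.
            vocabulary = self.model.get("vocabulary", [])
            return utterance if utterance in vocabulary else "<unrecognized>"

    memory = ContextMemory()
    memory.put("jane", "speech-model", {"vocabulary": ["hold for product page"]})

    observer = SpeechObserver(memory, "jane")
    print(observer.interpret("hold for product page"))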

Although the context memory can be thought of as a single conceptual entity, it spans a space of data sizes and speed requirements that make it best thought of as a multi-level memory. In our implementation, large data objects (such as images) are stored in a conventional file structure, while small data objects (such as information about the location of a particular device in the interaction space) are stored as entries in an XML-structured database. The XML database is also used to store metadata about the objects in file storage, so that they can be retrieved by a match of their characteristics.

5. ACTION AND PERCEPTION

Anyone with experience in writing interactive systems is likely to wonder whether it is practical to make general use of all the levels of indirection and interpretation that have been described so far. There are two primary effects of adding a level of indirection to any computing system:
  1. Consistent levels of indirection make possible a cleaner separation of concerns, which makes systems easier to write, modify, integrate, understand, etc.
  2. Consistent indirection requires additional processing across the entire program, hampering performance.
Whether the structural benefit is worth the efficiency cost is determined by the specifics of the situation. The world is full of examples of successful indirection (how many programs today deal directly with the arrangement of sectors and tracks on a disk?) and examples of failed indirection in systems where the gain in generality simply wasn't worth the performance penalty (as has been the case with many generalized GUI builders).

Many aspects of human-computer interaction have been subject to ever higher levels of abstraction and indirection, with satisfactory performance results. Consider, for example, the level at which a programmer specifies what is to be displayed on a screen. We have progressed from individual vectors to shaded, textured, 3-dimensional objects with controlled lighting and viewpoint. Processing power has expanded to make this possible.

The cases where performance has continued to be a deep problem are those with a tight coupling between action and perception. As a prime example, consider virtual reality using a head-mounted display. In order to maintain the perception of immersion in a 3-dimensional world, the visual rendering needs to be updated to reflect changes in head position with no perceptible lag. As a more mundane example, we require tight action-perception coupling in simple cursor positioning with a mouse. If the motion of the cursor lags too far behind the movement of the hand, effectiveness is greatly decreased. To operate at action-perception coupling speeds (i.e., with latencies in the millisecond range), system architectures need to pay special attention to coupling.

Many systems today (from head-mounted VR to the cursor tracker in every GUI OS) achieve satisfactory action-perception coupling by wiring it in specially rather than using the more general interaction mechanisms provided for less time-sensitive processes. This makes it difficult to extend these programs, as discovered, for example, by anyone who has tried to extend a standard GUI system to handle multiple users each with a cursor [Myers 1998]. Some such problems are solved in distributed windowing systems (such as X-Windows) by providing specific coupling mechanisms in the server for operations such as dragging. On the other hand, if the programmer wanted to do live rotation instead of translation of an object, this would not work, since the server does not provide sufficient tools for a rotation coupling. Specialized platforms for applications such as live-action games and music-playing provide for coupling within their specialized domains.

A somewhat more general approach was taken in the Cognitive Coprocessor [Robertson 1989], which had a manager dedicated to maintaining interaction coupling between a task queue and a display queue. This can be generalized in tools for action-perception coupling. To be effective, the following conditions must be met:

  1. The input observers can provide observations at a guaranteed rate that meets the timing conditions (e.g., the sampling rate of a positioning device)
  2. The output observers can guarantee an update rate that meets the timing conditions (e.g., guaranteed frame rate for visual rendering)
  3. The data that needs to be transmitted to and from the manager is small enough to be transmitted in sufficiently short time (e.g., sending a new set of coordinates, versus sending an entire image for each change)
  4. The computation done by the manager for each iteration of the action-perception loop can be done within the timing conditions. In general this will not allow for a callback to the process that created the coupling.
Not all desired action-perception couplings will be able to meet these conditions. Time characteristics are dependent on the level of control that is available. For example, in current graphical interface systems, dragging of objects with the mouse can be done in a coupled way (rather than dragging an outline), since image translation can be achieved with sufficient update rates. On the other hand, real-time image zooming is not generally possible, since image scaling is not implemented in a sufficiently fast way. Systems such as Pad++ [Bederson 1994] use special purpose programming to achieve live zooming.
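
One way to read the four conditions above is as a single worst-case latency check that a coupling manager could apply before agreeing to establish a coupling. The CouplingSpec structure and the numbers below are illustrative assumptions, not measurements from any real system.

    # Illustrative feasibility check for an action-perception coupling.
    from dataclasses import dataclass

    @dataclass
    class CouplingSpec:
        input_period_ms: float      # time between input observations
        output_period_ms: float     # time between output updates (1000 / frame rate)
        transfer_ms: float          # time to move the data per iteration
        compute_ms: float           # manager computation per iteration

    def can_couple(spec: CouplingSpec, latency_budget_ms: float) -> bool:
        # Conditions 1-4 above, expressed as a single worst-case latency bound.
        worst_case = (spec.input_period_ms + spec.output_period_ms
                      + spec.transfer_ms + spec.compute_ms)
        return worst_case <= latency_budget_ms

    # Dragging: small coordinate updates, cheap translation -> feasible.
    print(can_couple(CouplingSpec(8, 16, 0.1, 1.0), latency_budget_ms=50))   # True

    # Naive live zoom: full-image rescale each step -> over budget.
    print(can_couple(CouplingSpec(8, 16, 5.0, 40.0), latency_budget_ms=50))  # False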

6. EXAMPLES

The overall model described in this chapter is being developed in conjunction with our project to develop an interactive workspace. That work is still in its early stages, and is proceeding by developing components that meet the needs of an integrated space, and using the technical demands of those components as a driving function to develop the interaction architecture. The following examples are in the process of being integrated into the structures described above, and their development is helping to flesh out the details of those structures.

Barehands

The scenario with which this chapter started was motivated by our observations of people using our prototype interactive workspace, which includes three wall-mounted touch-sensitive back-projected displays [SmartBoard]. The other displays in the room are not touch-sensitive, and require the use of some kind of pointing device. People were immediately attracted to the touch-screen interaction and found it natural and convenient. In fact, they often get frustrated when they attempt unsuccessfully to interact with the room's other displays (including a projection on a bare wall) by touching them.

However, our touch screens are limited in the kinds of interaction they support. The hardware can identify only a single point of touch at any time (multiple or large-area touches are signaled as a single touch at their center), and there is no simple way to augment a touch with the kind of information provided by the buttons on standard pointing devices. In effect, every touch is interpreted as though the left mouse-button is being clicked at a single point.

For the user, the device being used is not the touch screen, but the hand that touches it. A hand has much more variability than a single point of touch. It can be held in various postures (e.g., touching with two fingers at once), can move in complex ways (gestures, rolling, etc.), and can exert complex pressure patterns. By adding additional observers, we can take advantage of this larger space of behaviors. In our experiments we have added a camera-based observer that enables the system to identify hand posture. By combining this with the information provided by the touch-screen driver, we have designed interaction modes that use posture to modify touch. In one mapping we have experimented with, a single-finger touch is interpreted as a left button, a two-finger touch as a right button, the palm of the hand as an eraser, and the side of the hand as a highlighter. In another mapping, a touch of the side of the hand held vertically against the board does a Copy command, and a touch with the hand held horizontally does a Paste. The user does not think about the devices, but about using hands in new ways.
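
A sketch of this kind of fusion, with an illustrative (not actual) mapping table: the touch-screen driver contributes a single point, a camera-based posture observer contributes a posture label, and a small integrator combines them into a higher-level event.

    # Illustrative fusion of touch point and hand posture (invented mapping).
    POSTURE_TO_ACTION = {
        "one-finger": "left-click",
        "two-finger": "right-click",
        "flat-palm": "erase",
        "hand-side": "highlight",
    }

    def fuse(touch_point: tuple[int, int], posture: str) -> dict:
        # Fall back to the plain left-click interpretation if posture is unknown.
        action = POSTURE_TO_ACTION.get(posture, "left-click")
        return {"action": action, "x": touch_point[0], "y": touch_point[1]}

    # A single touch at (223, 446) with a flat palm becomes an erase action.
    print(fuse((223, 446), "flat-palm"))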

PointRight

Touch works well for interacting with a screen when a person is right next to it, but there are also uses in which a person seated at a table or standing across the room wants to interact with what is being displayed on a wall-mounted screen. Since our prototype space includes five large displays (including the front-projected wall and a bottom-projected table), each of which can run an independent GUI desktop, the most direct way to support interaction would be to have a pointing device and keyboard for each. An alternative would be to use some kind of monitor switcher that allows users to switch the use of a single pointing device and keyboard between the different desktops.

As with the previous example, this device-centric view does not correspond to the user's view of the world. If there is a single mouse and keyboard, the result of moving the mouse should be oriented to the space as a whole, not to the underlying computational system. When the cursor reaches the edge of a screen, it should continue moving onto the adjacent screen (as it does in multi-monitor single-desktop systems), without regard to which processor is displaying on that screen. This includes moving across different kinds of devices -- from the back-projected touch screens, onto a front-projected wall, onto the bottom-projected table, and even onto screens of laptops sitting on that table. The keyboard should go along, sending its input to whatever display currently displays the cursor.

The PointRight system we have developed [PointRight] provides this space-oriented interaction, allowing any pointing device (the standard one is a wireless GyroMouse) to point anywhere in the workspace. This example, as mentioned in the section on action-perception coupling, demands a low latency so that the user has the experience of directly moving the cursor, not of operating a pointing device which then gives motion instructions to a cursor. This has required ongoing development of the Event Heap architecture to combine the desired speed with the generality of communication paths.
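
The core geometric idea can be sketched as follows. This is not the actual PointRight implementation; the screen layout and host names are invented for illustration. Screens are placed in a shared room coordinate system, and each pointer position is routed to whichever screen contains it, expressed in that screen's local coordinates.

    # Illustrative space-oriented pointer routing (not the real PointRight code).
    from dataclasses import dataclass

    @dataclass
    class Screen:
        host: str
        x: float        # position of the screen in room coordinates
        y: float
        width: float
        height: float

        def contains(self, px: float, py: float) -> bool:
            return (self.x <= px < self.x + self.width
                    and self.y <= py < self.y + self.height)

    SCREENS = [
        Screen("left-smartboard", 0, 0, 100, 75),
        Screen("front-wall", 100, 0, 160, 90),
    ]

    def route(px: float, py: float) -> tuple[str, float, float]:
        """Return the host that should receive the event, with local coordinates."""
        for s in SCREENS:
            if s.contains(px, py):
                return s.host, px - s.x, py - s.y
        return "none", px, py

    print(route(50, 30))    # ('left-smartboard', 50, 30)
    print(route(120, 30))   # ('front-wall', 20, 30) -- the cursor crossed screens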

Room Controller

In any physical environment there will be devices that can be explicitly controlled by users (as well as possibly being automatically controlled for other purposes). The first obvious examples in our interactive workspace were the ceiling lights (controlled by an X10 interface) and the projectors (which are often switched off to preserve bulb life, and which can be switched among alternative input sources). It was immediately clear that the simple direct route (physically operating switches on each device and using the X10 remote control) was inconvenient. It required going to the relevant devices, and demanded a device-centered operation (e.g., there was no way to turn on all the projectors, just switches for each projector). We developed a series of interactive controllers that could be displayed on any of the devices in the room (including a Clio tablet with wireless connections). These controllers provide the user with higher level operations (dealing with more than one device) and with a natural mapping for device-specific actions (displaying a geometrical map of the room, with switch-like icons next to each device for the actions it supports).

From the user's point of view, this controller operates directly on the physical devices, without regard to the actual sequence of events that are communicated among various components to achieve the effect. From the system-builder's point of view, the initial controllers were inflexible, requiring reprogramming whenever the physical situation in the space was modified. This created a need for a context memory. Information about the devices in the room and their physical layout is stored in this memory and updated when the space is changed. The application that provides room-control interfaces to users can make use of this data dynamically to produce a controller that is specific to the setting.
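
A rough sketch of this dynamic generation, with a made-up room model standing in for the data held in the context memory: the controller's widgets are derived from the stored device descriptions, including group-level operations that the physical switches never offered.

    # Illustrative room-controller generation from a stored room model.
    ROOM_MODEL = {
        # In the real system this would be read from the context memory.
        "devices": [
            {"name": "projector-1", "kind": "projector", "actions": ["on", "off", "source"]},
            {"name": "projector-2", "kind": "projector", "actions": ["on", "off", "source"]},
            {"name": "ceiling-lights", "kind": "light", "actions": ["on", "off"]},
        ]
    }

    def build_controller(room: dict) -> list[dict]:
        """Produce one control widget per device action, plus group-level controls."""
        widgets = [{"label": f"{d['name']}: {a}", "target": d["name"], "action": a}
                   for d in room["devices"] for a in d["actions"]]
        # Higher-level operation that the physical switches never offered.
        widgets.append({"label": "all projectors on", "target": "kind:projector", "action": "on"})
        return widgets

    for w in build_controller(ROOM_MODEL):
        print(w["label"])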

7. RESEARCH ISSUES

In order to make use of an architecture for interactive spaces, a number of additional problems need to be addressed.

Person-centered interaction

One of the basic shifts from today's systems to interactive workspaces is letting go of the conventional assumption that each device is associated with a particular user who is logged onto it at a given moment. Shared wall-based displays, for example, need to be usable by anyone at any moment, without going through a separate login or identification step. However, the sharing needs to respect individual information spaces -- when Jane is at the board, she should be able to access her private files, but at the very next moment when Jim walks up to it, he should have access to his instead.

This capability requires observers that can identify which person is performing an action with a given device. For some observers (such as ones that respond to voice control) this will require specialized programs that do things like speaker recognition. For many devices, it is sufficient to identify the person who is in physical proximity. A variety of methods are possible, including visual observation of people in the space and wearable infrared badges or radio-frequency tags that can be tracked. Although experiments have been done, there are not yet reliable, sufficiently accurate means to do this, and there are additional concerns to be addressed about users' privacy and willingness to wear tracking devices.

Dealing efficiently with incomplete and unreliable information

Much of the research on multi-modal interaction has made use of artificial intelligence techniques to infer information from multiple sources, each of which might provide only part of the relevant information, or which might be inaccurate. Both rule-based and probabilistic techniques (such as Bayes nets) have been applied. Our experiments, on the other hand, have come from the opposite direction, building upwards from conventional system architectures and being constrained by the traditional demands on performance. In an interactive space, the power of inferential techniques will need to be reconciled with the performance demands that come from depending on them for the main stream of user interactions, not just for specialized tasks or demonstrations.

Variable quality guaranteed response rate

One of the criteria for implementing an action-perception coupling is that the observers can provide guaranteed timing for their activities. The conservative way to achieve this is to program for the worst case, limiting capabilities to those that can always be achieved. A more flexible strategy is to have varying levels of capability that can be achieved at different speeds. This has been explored in the area of visual rendering, where a lower quality rendering may be perfectly adequate for an object that is in motion, to be replaced by a higher quality one when it is static. It is possible to design variable-quality actions, both for input and output, which make it possible to maintain guarantees of responsiveness by trading off other resource/quality dimensions. In many cases, the properties of human perception will aid the programmer, since rapid change reduces sensory acuity. In other cases this may not be true (for example, a system using force feedback in conjunction with a fingertip motion over a virtual object). Both technical and psychophysical questions need to be explored to make a variable quality strategy effective.
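
A variable-quality strategy can be as simple as choosing the best rendering level whose estimated cost fits the remaining time budget, as in this illustrative sketch (the quality levels and costs are invented).

    # Illustrative quality selection under a per-frame time budget.
    QUALITY_LEVELS = [
        ("full-detail", 40.0),      # estimated cost in milliseconds
        ("textured", 15.0),
        ("flat-shaded", 5.0),
        ("bounding-box", 1.0),
    ]

    def choose_quality(budget_ms: float) -> str:
        for name, cost_ms in QUALITY_LEVELS:      # ordered best first
            if cost_ms <= budget_ms:
                return name
        return "skip-frame"

    print(choose_quality(50.0))   # full-detail, when the object is static
    print(choose_quality(8.0))    # flat-shaded, while it is being dragged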

Multi-person, multi-device, interaction modes

One of the key motivations for the generalizations in this architecture is the desire to support integrated applications with multiple users and multiple devices in an interaction structure that is many-to-many (one person may use several devices, several people may share one). There has been a good deal of work on shared-workspace applications, primarily for remotely linked participants. We have not dealt with questions of telepresence in this chapter, but clearly the design of interaction spaces will extend across more than one physical location. The issues in coordinating multiple activities at any degree of co-presence are both technical and social, and as we expand the space of possible participant-device configurations, we need to better understand and design the ways that people work together.

Standard models

Today's GUI systems have a relatively mature and stable model for visual objects, windows, menus, etc. This makes possible the ease of programming mentioned in our initial GUI scenario. There are no corresponding models for human physical activities, such as speech, gesture, and freehand drawing. These will be more complex to develop, since they need to deal with inputs that can be ambiguous and uncertain, and that may require combining information from multiple modalities. We expect models to emerge for specific aspects of human behavior as the research proceeds, and to evolve through experience to become sufficiently general.

8. CONCLUSION

This chapter has proposed a conceptual framework for the design of interactive spaces. It will take an ambitious research program to develop a general-utility system in accordance with this perspective, and there are many open research questions. There are several shorter-term actions that can be effective in solving some of the problems that motivated the approach presented here.

First, in building new systems that implement parts of a general mechanism, we can use structures that are compatible with the larger architecture and open to extension within the framework. In our own work on the interactive workspace, we are taking this approach. We are developing and integrating capabilities using a bottom-up strategy, with the larger-scale view as background. Second, the conceptual distinctions here can be useful in sorting out problems and confusions in designing special purpose systems. This will become increasingly important as more applications begin to make use of broad, rich input devices (e.g., cameras and microphones), with their attendant problems of person identification and context-based interpretation of the phenomena of relevance to the user and computing system. Finally, a shift of perspective may be a catalyst to help provoke new ideas about what to try, and what can be done in improving the ways in which computers and people interact.

ACKNOWLEDGMENTS

Thanks to Michelle Baldonado, Henry Berg, François Guimbretière, and Debby Hindus for helpful comments on earlier drafts. Also to Pat Hanrahan and the students in the Interactive Workspace project, for discussions and an environment that raises the right questions.

REFERENCES