Human‐centered Vision Systems:
      Ideas for enabling Ambient Intelligence and serving Social Networks

Hamid Aghajan, Stanford Dept. of Electrical Engineering

Seminar on People, Computers, and Design
Stanford University, November 13, 2009, 12:45pm, Gates B01

Vision offers rich information about events involving human activities in user‐centric applications, from gesture recognition to occupancy reasoning. Multi‐camera vision allows for applications based on 3D perception and reconstruction, offers opportunities for collaborative decision making, and enables hybrid processing through task assignment to different cameras based on their views.

In addition to the inherent complexities in vision processing stemming from perspective views and occlusions, setup and calibration requirements have challenged the creation of meaningful applications that can operate in uncontrolled environments. Moreover, the task of studying user acceptance criteria such as privacy management and its implications for visual ambient communication has for the most part stayed out of the realm of technology design, further hindering the roll‐out of vision‐based applications in spite of the available sensing, processing, and networking technologies.

The output of visual processing often consists of instantaneous measurements such as location and pose, enabling the vision module to yield quantitative knowledge to higher levels of reasoning. The extracted information is not always flawless and often needs further interpretation at a data fusion level. Also, while quantitative knowledge is essential in many smart‐environment applications such as gesture control and accident detection, most ambient intelligence applications also need to depend on qualitative knowledge accumulated over time in order to learn the user's behavior models and adapt their services to the preferences explicitly or implicitly stated by the user.

Proper interfacing of vision to high‐level reasoning allows for integration of information arriving at different times and from different cameras, and for application‐level interpretation according to the associated confidence levels, the available contextual data, and the knowledge base accumulated from the user's history and behavior model.

This talk presents ideas for interfacing vision to other modules to enable real applications in the presence of imperfect vision processing output. A number of data fusion models and potential applications involving the recognition of user activities are discussed, namely environment discovery based on user interaction, context‐based ambience control services, a speaker assistance system, and experience sharing using avatars.


Hamid Aghajan has been a consulting professor of Electrical Engineering at Stanford University since 2003. His group's research areas include multi-camera networks and human interfaces for ambient intelligence and smart environments, with applications to smart homes, occupancy-based services, assisted living and well-being, ambience control, smart meetings and speaker assistance systems, and avatar-based communication and social interactions. Hamid is editor-in-chief of the Journal of Ambient Intelligence and Smart Environments. He has edited three volumes: Multi-Camera Networks – Principles and Applications, Human-centric Interfaces for Ambient Intelligence, and Handbook of Ambient Intelligence and Smart Environments. He has chaired workshops at ICCV, ECCV, ECAI, ACM Multimedia, and ICMI-MLMI, and has offered short courses and tutorials at CVPR, ICCV, ICASSP, and ICDSC. Hamid obtained his Ph.D. degree in electrical engineering from Stanford University in 1995.

The talks are open to the public. They are in the Gates Building, Room B01 in the basement. The nearest public parking is in the structure at Campus Drive and Roth Way.

View this talk online at CS547 on Stanford OnLine or using this video link.

Titles and abstracts for previous years are available by year and by speaker.