15.02.15 - 20.02.15, Seminar 15081

Holistic Scene Understanding

This seminar description was published on our web pages before the seminar and was used in the invitation to the seminar.

Motivation

Understanding the scene in an image or video requires much more than recording and storing it, extracting some features and eventually recognizing objects. Ultimately, the overall goal is to find a mapping that derives semantic information from sensor data. Moreover, purposive scene understanding may require different representations for different specific tasks, and the task itself can be used as a driver for the subsequent data processing. However, there is still a need to capture the local, global and dynamic aspects of the acquired observations, which are then used to understand what is occurring in a scene. For example, one might want to determine from an image whether a person is present and where, and beyond that, to identify the person's specific pose, e.g., whether the person is sitting, walking or raising a hand. When people move in a scene, the specific time (e.g., 7:30 in the morning, workdays, weekend), the weather (e.g., rain), objects (e.g., cars, a bus approaching a bus stop, crossing bikes, etc.) or surrounding people (crowded, fast-moving people) yield a mixture of low-level, high-level and abstract cues, which need to be jointly analyzed to gain a profound understanding of a scene. In other words, all information that can be extracted from a scene must be considered in context in order to reach a comprehensive scene understanding, but this information, while easily captured by humans, is still difficult for a machine to obtain.

Next-generation recognition systems require a full, holistic understanding of the scene components and their dynamics in order to cope ever more effectively with real applications such as car driver assistance, urban design, surveillance, and many others.

With such topics in mind, the aim of this seminar is to discuss which elements are necessary and sufficient for a complete scene understanding, i.e., what it really means to understand a scene. Specifically, in this seminar, we want to explore methods that are capable of representing a scene at different levels of semantic granularity and of modeling various degrees of interaction between objects, humans and 3D space. For instance, a scene-object interaction describes the way a scene type (e.g., a dining room or a bedroom) influences the probability of an object's presence, and vice versa. The 3D layout of the environment (e.g., walls, floors, etc.) biases the placement of objects and humans in the scene, and also affects the way they interact. An object-object or object-human interaction describes the way objects, humans and their poses affect each other (e.g., a dining table suggests that a set of chairs is to be found around it). In other words, the 3D configuration of the environment and the relative placements and poses of the objects and humans therein, the associated dynamics (relative distance, human body posture and gesture, gazing, etc.), as well as other contextual information (e.g., weather, temperature, etc.) support the holistic understanding of the observed scene. Since many scenes involve humans, we are also interested in discussing novel methods for analyzing group activities and human interactions at different levels of spatial and semantic resolution.

In this sense, understanding a visual scene requires multidisciplinary discussions between scientists in Computer Vision and Machine Learning, but also in Robotics, Computer Graphics, Mathematics, Natural Language Processing and Cognitive Sciences. Additionally, disciplines like Psychology, Anthropology, Sociology, Linguistics or Neuroscience touch upon this problem, which is inherent in the human comprehension of the environment and our social lives. These communities rarely get an opportunity to share their views on this common topic.

We will gather not only researchers well known in Computer Vision areas such as object detection, classification, motion segmentation, crowd and group behavior analysis or 3D scene reconstruction, but also researchers affiliated with Computer Vision from the aforementioned communities, in order to share each other's points of view on the common topic of scene understanding.