February 15–20, 2015, Dagstuhl Seminar 15081

Holistic Scene Understanding


Jiri Matas (Czech Technical University, CZ)
Vittorio Murino (Italian Institute of Technology – Genova, IT)
Bodo Rosenhahn (Leibniz Universität Hannover, DE)


Laura Leal-Taixé (ETH Zürich, CH)


Dagstuhl Report, Volume 5, Issue 2



Understanding a scene in a given image or video means much more than simply recording and storing it, extracting some features, and eventually recognizing an object. The overall goal is to find a mapping that derives semantic information from sensor data. Purposive scene understanding may require different representations for different specific tasks. The task itself can be used as a prior, but we still require an in-depth understanding of, and balance between, the local, global and dynamic aspects that can occur within a scene. For example, an observer might want to determine from an image whether a person is present, and beyond that, to extract further information, e.g. whether the person is sitting, walking, raising a hand, etc.

When people move in a scene, the specific time (e.g. 7:30 in the morning, workdays, weekends), the weather (e.g. rain), objects (cars, a bus approaching a bus stop, crossing bikes, etc.) or surrounding people (crowds, fast-moving people) yield a mixture of low-level, high-level and abstract cues, which need to be analyzed jointly to gain an in-depth understanding of a scene. In other words, generally speaking, the so-called context has to be considered for comprehensive scene understanding, but this information, while easily captured by human beings, is still difficult for a machine to obtain.

Holistic scene interpretation is crucial for designing the next generation of recognition systems, which are important for several applications, e.g. driver assistance, city modeling and reconstruction, outdoor motion capture, and surveillance.

With such topics in mind, the aim of this workshop was to discuss what the necessary and sufficient elements for complete scene understanding are, i.e. what it really means to understand a scene. Specifically, we wanted to explore methods that are capable of representing a scene at different levels of semantic granularity and of modeling various degrees of interaction between objects, humans and 3D space. For instance, a scene-object interaction describes the way a scene type (e.g. a dining room or a bedroom) influences the presence of objects, and vice versa. An object-3D-layout or human-3D-layout interaction describes the way the 3D layout (e.g. the 3D configuration of walls, floor and the observer's pose) biases the placement of objects or humans in the image, and vice versa. An object-object or object-human interaction describes the way objects, humans and their poses affect each other (e.g. a dining table suggests that a set of chairs is to be found around it). In other words, the 3D configuration of the environment, the relative placements and poses of the objects and humans therein, the associated dynamics (relative distance, human body posture and gesture, gaze, etc.), as well as other contextual information (e.g. weather, temperature, etc.) support the holistic understanding of the observed scene.

As part of a larger system, understanding a scene semantically and functionally makes it possible to predict the presence and locations of unseen objects within the space, and thus to predict behaviors and activities that are yet to be observed. Combining predictions at multiple levels into a global estimate can improve each individual prediction.

Since most scenes involve humans, we were also interested in discussing novel methods for analyzing group activities and human interactions at different levels of spatial and semantic resolution. As advocated in the recent literature, it is beneficial to solve the problems of tracking individuals and understanding their activities jointly, combining bottom-up evidence with top-down reasoning, rather than attacking the two problems in isolation.

Top-down constraints can provide critical contextual information for establishing accurate associations between detections across frames and thus for obtaining more robust tracking results. Bottom-up evidence can percolate upwards to automatically infer action labels for determining the activities of individual actors, interactions among individuals, and complex group activities. Beyond this, it is the cooperation of both data flows that makes inference more manageable and reliable, thereby improving the comprehension of a scene.

We gathered researchers who are not only well known in Computer Vision areas such as object detection, classification, motion segmentation, crowd and group behavior analysis, and 3D scene reconstruction, but also people affiliated with Computer Vision from other communities, in order to share each other's points of view on the common topic of scene understanding.


The main goals of the seminar can be summarized as follows:

  • Address holistic scene understanding, a topic that has not previously been discussed in detail at Dagstuhl seminars, with a special focus on a multidisciplinary perspective for sharing and contrasting different views.
  • Gather well-known researchers from the Computer Vision, Machine Learning, Social Sciences (e.g. Cognitive Psychology), Neuroscience, Robotics and Computer Graphics communities to compare approaches to representing scene geometry, dynamics and constraints, as well as the problem and task formulations adopted in these fields. This interdisciplinary scientific exchange is likely to enrich all the communities involved.
  • Create a platform for discussing and bridging topics such as perception, detection, tracking, activity recognition, multi-person multi-object interaction and human motion analysis, which are, surprisingly, treated independently in their respective communities.
  • Publish LNCS post-proceedings, as previously done for the 2006, 2008 and 2010 seminars. These will include the scientific contributions of the seminar participants, focusing especially on the topics discussed and presented at the seminar.

Organization of the seminar

During the workshop we discussed different modeling techniques and the experiences researchers have collected. We discussed sensitivity, runtime performance, and properties such as the number of parameters required by particular algorithms, as well as the possibilities for context-aware, adaptive and interacting algorithms. Furthermore, we had extensive discussions on open questions in these fields.

On the first day, the organizers provided general information about Dagstuhl seminars, the philosophy behind Dagstuhl, and the expectations of the participants. We also clarified the kitchen rules and organized a running group for the early mornings (five people participated frequently!).

Social event.

On Wednesday afternoon we organized two events: one group made a trip to Trier, and another group went on a three-hour hike in the surrounding area.

Working Groups.

To strongly encourage discussions during the seminar, we organized a set of working groups on the first day (each with between 8 and 12 people). As topics we selected:

  • What does "Scene Understanding" mean?
  • Dynamic Scene: Humans.
  • Recognition in static scenes (in 3D).

Two afternoon slots were reserved for these working groups, and their outcomes were presented in the Friday morning session.

LNCS Post-Proceedings.

We will edit post-proceedings and invite participants to submit articles. In contrast to standard conference articles, we allow more space (typically 25 single-column pages) and permit the integration of open questions, preliminary results, ideas, etc. from the seminar into the proceedings. Additionally, we will encourage joint publications by participants who started to collaborate after the seminar. All articles will be reviewed by at least two reviewers, and based on this evaluation, accepted papers will be published. We will publish the proceedings in Springer's Lecture Notes in Computer Science (LNCS) series. The papers will be collected during the summer months.

Overall, it was a great seminar and we received very positive feedback from the participants. We would like to thank Schloss Dagstuhl for hosting the event and look forward to revisiting Dagstuhl whenever possible.

Summary text license
  Creative Commons BY 3.0 Unported license
  Jiri Matas, Vittorio Murino, Bodo Rosenhahn, and Laura Leal-Taixé

Dagstuhl-Seminar Series


  • Artificial Intelligence / Robotics
  • Computer Graphics / Computer Vision
  • Modelling / Simulation


  • Scene Analysis
  • Image Understanding
  • Crowd Analysis
  • People and Object Recognition


All Dagstuhl Seminars and Dagstuhl Perspectives Workshops are documented in the Dagstuhl Reports series. Together with the seminar's collector, the organizers compile a report that summarizes the authors' contributions and supplements them with an overall summary.



Dagstuhl's Impact

Please let us know if a publication arises from your seminar. Such publications are listed separately in the Dagstuhl's Impact section and presented on the ground floor of the library.


There is also the option of publishing a comprehensive collection of peer-reviewed papers in the Dagstuhl Follow-Ups series.