03.11.13 - 08.11.13, Seminar 13451

Computational Audio Analysis

The following text appeared on our web pages prior to the seminar, and was included as part of the invitation.


Motivation

With the rapid growth and omnipresence of digitized multimedia data, the processing, analysis, and understanding of such data by means of automated methods has become a central issue in computer science and associated areas of research. In the acoustic domain, audio analysis has traditionally focused on speech data, with the goal of recognizing and transcribing the spoken words. In the proposed seminar, we want to consider current and future audio analysis tasks that go beyond the classical speech recognition scenario. On the one hand, we want to look at the computational analysis of speech with regard to the speakers’ traits (e.g., gender, age, height, cultural and social background), physical conditions (e.g., sleepiness, medical and alcohol intoxication, health), or emotion-related and affective states (e.g., stress, interest, confidence, frustration). Rather than recognizing what is being said, the goal is to find out how and by whom it is being said. On the other hand, there is a rich variety of sounds besides speech, such as music recordings, animal sounds, environmental sounds, and mixtures thereof. Here, similarly to the speech domain, we want to study how to decompose and classify the content of complex sound mixtures with the objective of inferring semantically meaningful information.

When dealing with specific audio domains such as speech or music, it is crucial to properly understand and exploit the respective domain-specific properties, be they acoustic, linguistic, or musical. Furthermore, data-driven learning techniques that exploit the availability of carefully annotated audio material have been used successfully for recognition and classification tasks. However, when dealing with rather vague categories, as in emotion recognition, or with general audio sources such as environmental sounds, model assumptions are often violated, or it may even be impossible to define explicit models. Moreover, for non-standard audio material, annotated datasets are rarely available. Therefore, in this context, the data-driven methods used in speech recognition are not directly applicable; instead, semi-supervised or unsupervised learning techniques are a promising way to remedy these issues. Finally, audio sources occurring in the real world are typically superimposed and highly blended, potentially consisting of overlapping speech, music, and general sound sources. Thus, efficient source separation techniques are required that allow for splitting up, re-synthesizing, analyzing, and classifying the individual sources, a problem that, for general audio signals, is not yet well understood.
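As one concrete illustration of such an unsupervised approach, the following is a minimal sketch that decomposes a sound mixture by non-negative matrix factorization (NMF) of its magnitude spectrogram and re-synthesizes the individual components; the input file name, the number of components, and the STFT settings are illustrative assumptions, not part of the seminar material.

import numpy as np
from scipy.io import wavfile
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

# Load a (hypothetical) mixture recording, down-mix to mono, and normalize.
rate, audio = wavfile.read("mixture.wav")
audio = audio.astype(np.float32)
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio /= np.abs(audio).max() + 1e-9

# Short-time Fourier transform; keep magnitude and phase separately.
_, _, spec = stft(audio, fs=rate, nperseg=1024)
magnitude, phase = np.abs(spec), np.angle(spec)

# Factor the magnitude spectrogram into spectral templates (W) and
# time-varying activations (H); the number of components is a rough guess.
model = NMF(n_components=8, init="nndsvda", max_iter=400)
W = model.fit_transform(magnitude)   # shape: (frequency bins, components)
H = model.components_                # shape: (components, time frames)

# Re-synthesize each component via soft-mask filtering with the mixture phase.
reconstruction = W @ H + 1e-10
for k in range(W.shape[1]):
    mask = np.outer(W[:, k], H[k]) / reconstruction
    _, source = istft(mask * magnitude * np.exp(1j * phase), fs=rate, nperseg=1024)
    wavfile.write(f"component_{k}.wav", rate, source.astype(np.float32))

In this sketch the separation is fully blind: no annotated training material is used, and the learned spectral templates would still have to be grouped or labeled before semantically meaningful sources can be identified, which is precisely the kind of open problem addressed in the seminar.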

The aim of this seminar is to gather researchers who are experts in different audio-related multimedia domains and who cover a broad spectrum of data analysis techniques (including audio signal processing, machine learning, and probabilistic models) and tasks (including annotation, alignment, segmentation, classification, and searching). In doing so, we hope to break new ground for analyzing, processing, experiencing, and understanding multi-faceted and highly blended real-world audio sources from a semantic perspective. Furthermore, by bringing together experts from different disciplines, young researchers, and people from industry, we expect this seminar to generate strong interest and to stimulate vibrant discussions, while highlighting opportunities for new collaborations among the attendees.

General questions and issues that will be addressed in this seminar include:

  • Efficient blind source separation with limited knowledge about the type of audio
  • Unsupervised learning for handling general audio
  • Machine learning techniques that allow for coupling various sources of information
  • Development of probabilistic frameworks for multi-faceted analysis
  • Detection and exploitation of mutual dependencies between different aspects
  • Handling partial, uncertain, and scattered information
  • Discovering long-term temporal structures
  • Automatic annotation of noisy raw data
  • Designing and learning robust and expressive mid-level audio features
  • Matching semantically similar data
  • Extracting emotion-related parameters from speech, music, and general audio