Search the Dagstuhl Website
Looking for information on the websites of the individual seminars? - Then please:
Not found what you are looking for? - Some of our services have separate websites, each with its own search option. Please check the following list:
Schloss Dagstuhl - LZI - Logo
Schloss Dagstuhl Services
Within this website:
External resources:
  • DOOR (for registering your stay at Dagstuhl)
  • DOSA (for proposing future Dagstuhl Seminars or Dagstuhl Perspectives Workshops)
Within this website:
External resources:
Within this website:
External resources:
  • the dblp Computer Science Bibliography

Dagstuhl Seminar 18291

Extreme Classification

( Jul 15 – Jul 20, 2018 )

(Click in the middle of the image to enlarge)

Please use the following short url to reference this page:




Extreme classification is a rapidly growing research area within machine learning focusing on multi-class and multi-label problems involving an extremely large number of labels. Many applications of extreme classification have been found in diverse areas ranging from language modeling to document tagging in NLP, face recognition to learning universal feature representations in computer vision, gene function prediction in bioinformatics, etc. Extreme classification has also opened up a new paradigm for key industrial applications such as ranking and recommendation by reformulating them as multi-label learning tasks where each item to be ranked or recommended is treated as a separate label. Such reformulations have led to significant gains over traditional collaborative filtering and content-based recommendation techniques. Consequently, extreme classifiers have been deployed in many real-world applications in industry.

Extreme classification raises a number of interesting research questions including those related to the following topics:

  • Algorithms for extreme classification
  • Theoretical foundations of extreme classification
  • Gathering techniques for supervised data
  • Long tail effects in machine learning
  • Deep learning for extreme classification
  • Counterfactual learning for extreme classification
  • Applications in advertising, bioinformatics, information retrieval, natural language processing, recommender systems, vision and other domains

This Dagstuhl Seminar on extreme classification aims to bring together researchers interested in these topics to encourage discussion, identify important problems and promising research directions, foster collaboration and improve upon the state-of-the-art. We hope to have a healthy mix of participants from both academia and industry as well as researchers from both core machine learning and applied areas such as recommender systems, computer vision, computational advertising, information retrieval and natural language processing.

Copyright Samy Bengio, Krzysztof Dembczynski, Thorsten Joachims, Marius Kloft, and Manik Varma


The topic of this seminar is in the general context of machine learning [10] which concerns the study and development of algorithms that learn from empirical data how to make accurate predictions about yet unseen data without being explicitly programmed. Multi-class and multi-label learning are classical problems in machine learning. The outputs here stem from a finite set of categories (classes), and the aim is to classify each input into one (multi-class) or multiple (multi-label) out of several possible target classes. Classical applications of multi-class and multi-label learning include handwritten optical character recognition [8], part-of-speech tagging [11], and text categorization [7]. However, with the advent of the big data era, learning problems can involve even millions of classes. As examples let us consider the following problems:

  • Person recognition in Facebook images (there are billions of Facebook users; given an image, we might want to predict the subset of users present in the image for such applications like security, surveillance, social network analysis, etc.).
  • Predicting Wikipedia tags for new Wikipedia articles or webpages (Wikipedia has almost 2 million tags now).
  • Recommending Amazon items where each of the 100 million items on Amazon is a separate label.
  • Search on Google/Bing where each of the 100 million queries is a separate label.
  • Language modelling – predicting the next word in a sentence from the millions of words available.

The problems of this type are often referred to as extreme classification. They have posed new computational and statistical challenges and opened a new line of research within machine learning.

The main goal of extreme classification is to design learning and prediction algorithms, characterized by strong statistical guarantees, that exhibit sublinear time and space complexity in the number of classes. Unfortunately, the theoretical results obtained so far are still not satisfactory and very limited. Moreover, the problems at this scale often suffer from unreliable learning information, e.g., there is no chance to identify all positive labels and assign them precisely to training examples. The majority of labels is used very rarely, which leads to the problem of the long-tail distribution. In practical applications, learning algorithms run in rapidly changing environments. Hence, during testing/prediction phase new labels might appear that have not been present in the training set [4, 2]. This is the so-called zero-shot learning problem. Furthermore, typical performance measures used to assess the prediction quality of learning algorithms, such as 0/1 or Hamming loss, do not fit well to the nature of extreme classification problems. Therefore, other measures are often used such as precision@k [9] or the F-measure [6]. However, none of the above is appropriate to measure predictive performance in the long-tail problems or in the zero-shot setting. Hence, the goal is to design measures, which promote a high coverage of sparse labels [5].

The seminar aimed at bringing together researchers interested in extreme classification to encourage discussion on the above mentioned problems, identify the most important ones and promising research directions, foster collaboration and improve upon the state-of-the-art algorithms. The meeting in this regard was very successful as participants from both academia and industry as well as researchers from both core machine learning and applied areas such as recommender systems, computer vision, computational advertising, information retrieval and natural language processing, were given the opportunity to see similar problems from different angles.

The seminar consisted of invited talks, working groups, presentation of their results, and many informal discussions. The talks concerned among others such topics as: common applications of extreme classification, potential applications in bioinformatics and biotechnology, neural networks for extreme classification, learning theory for problems with a large number of labels, approaches for dealing with tail labels, learning and prediction algorithms, extreme classification challenges in natural language processing, multi-task learning with large number of tasks, pitfalls of multi-class classification, recommendation systems and their connection to extreme classification, counterfactual learning and zero-shot learning. The short abstracts of these talks can be found below in this report. The four working groups focused on the following problems: loss functions and types of predictions in multi-label classification, deep networks for extreme classification, zero-shot learning and long tail labels, and generalization bounds and log-time-and-space algorithms. Short summaries of the results obtained by the working groups can also be found below.

During the seminar, we also discussed different definitions of extreme classification. The basic one determines extreme classification as a multi-class or multi-label problem with a very large number of labels. The labels are rather typical identifiers without any explicit meaning. However, there usually exists some additional information about similarities between the labels (or this information can be extracted or learned from data). From this point of view, we can treat extreme classification as a learning problem with a weak structure over the labels. This is in difference to structured output prediction [1], where we assume much stronger knowledge about the structure. The most general definition, however, says that extreme classification concerns all problems with an extreme number of choices. The talks, working groups, and discussions have helped to gain a better understanding of existing algorithms, theoretical challenges, and practical problems not yet solved. We believe that the seminar has initiated many new collaborations and strengthen the existing ones that will soon deliver new results for the extreme classification problems.


  1. Bakir GH, Hofmann T, Schölkopf B, Smola AJ, Taskar B, Vishwanathan SVN (eds) Predicting Structured Data. Neural Information Processing, The MIT Press, 2007
  2. Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
  3. Yann Guermeur. lp-norm Sauer-Shelah lemma for margin multi-category classifiers. arXiv preprint arXiv:1609.07953, 2016.
  4. Bharath Hariharan, S. V. N. Vishwanathan, and Manik Varma. Efficient max-margin multilabel classification with applications to zero-shot learning. Machine Learning, 88(1-2):127–155, 2012.
  5. Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 935–944, New York, NY, USA, 2016. ACM.
  6. K. Jasinska, K. Dembczynski, R. Busa-Fekete, K. Pfannschmidt, T. Klerx, and E. Hüllermeier. Extreme F-measure maximization using sparse probability estimates. In Proceedings of the 33rd International Conference on International Conference on Machine Learning – Volume 48, pages 1435–1444, 2016.
  7. Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. Springer, 1998.
  8. Nei Kato, Masato Suzuki, Shin Ichiro Omachi, Hirotomo Aso, and Yoshiaki Nemoto. A handwritten character recognition system using directional element feature and asymmetric mahalanobis distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(3):258–262, 1999.
  9. Yashoteja Prabhu and Manik Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 263–272. ACM, 2014.
  10. Vladimir Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.
  11. Atro Voutilainen. Part-of-speech tagging. The Oxford handbook of computational linguistics, pages 219–232, 2003.
Copyright Samy Bengio, Krzysztof Dembczynski, Thorsten Joachims, Marius Kloft, and Manik Varma

  • Maximilian Alber (TU Berlin, DE) [dblp]
  • Rohit Babbar (Aalto University, FI) [dblp]
  • Samy Bengio (Google Inc. - Mountain View, US) [dblp]
  • Alexander Binder (Singapore University of Technology and Design, SG) [dblp]
  • Evgenii Chzhen (University Paris-Est - Créteil, FR) [dblp]
  • Kunal Dahiya (Indian Institute of Technology - New Dehli, IN) [dblp]
  • Krzysztof Dembczynski (Poznan University of Technology, PL) [dblp]
  • Urun Dogan (Microsoft Research UK - Cambridge, GB) [dblp]
  • Matthias Enders (NPZ Innovation GmbH, DE)
  • Asja Fischer (Ruhr-Universität Bochum, DE) [dblp]
  • Johannes Fürnkranz (TU Darmstadt, DE) [dblp]
  • Thomas Gärtner (University of Nottingham, GB) [dblp]
  • Edouard Grave (Facebook - Menlo Park, US) [dblp]
  • Yann Guermeur (LORIA & INRIA Nancy, FR) [dblp]
  • Eyke Hüllermeier (Universität Paderborn, DE) [dblp]
  • Christian Igel (University of Copenhagen, DK) [dblp]
  • Himanshu Jain (Indian Institute of Technology - New Dehli, IN) [dblp]
  • Kalina Jasinska (Poznan University of Technology, PL) [dblp]
  • Armand Joulin (Facebook - Menlo Park, US) [dblp]
  • Nikos Karampatziakis (Microsoft Research - Redmond, US) [dblp]
  • Matthias Kirchler (HU Berlin, DE) [dblp]
  • Marius Kloft (TU Kaiserslautern, DE) [dblp]
  • Christoph H. Lampert (IST Austria - Klosterneuburg, AT) [dblp]
  • John Langford (Microsoft Research - Redmond, US) [dblp]
  • Antoine Ledent (TU Kaiserslautern, DE)
  • Christoph Lippert (Max-Delbrück-Centrum - Berlin, DE) [dblp]
  • Nicolas Mayoraz (Google Research - Mountain View, US) [dblp]
  • Jinseok Nam (TU Darmstadt, DE) [dblp]
  • Alexandru Niculescu-Mizil (NEC Laboratories America, Inc. - Princeton, US) [dblp]
  • Yashoteja Prabhu (Indian Institute of Technology - New Dehli, IN) [dblp]
  • Pradeep Ravikumar (Carnegie Mellon University - Pittsburgh, US) [dblp]
  • Adith Swaminathan (Microsoft Research - Redmond, US) [dblp]
  • Manik Varma (Microsoft Research India - Bangalore, IN) [dblp]
  • Willem Waegeman (Ghent University, BE) [dblp]
  • Marek Wydmuch (Poznan University of Technology, PL) [dblp]

Related Seminars
  • Dagstuhl Seminar 15152: Machine Learning with Interdependent and Non-identically Distributed Data (2015-04-07 - 2015-04-10) (Details)

  • artificial intelligence / robotics
  • computer graphics / computer vision
  • data structures / algorithms / complexity

  • algorithms and complexity
  • machine learning
  • artificial intelligence
  • computer vision