22.01.17 - 27.01.17, Seminar 17042

From Characters to Understanding Natural Language (C2NLU): Robust End-to-End Deep Learning for NLP

This seminar description was published on our web pages before the seminar and was used in the invitation to the seminar.

Motivation

Deep learning is currently one of the most active areas of research in machine learning and its applications, including natural language processing (NLP). One hallmark of deep learning is end-to-end learning: all parameters of a deep learning model are optimized directly on the learning objective; e.g., on the objective of accuracy on the binary classification task: is the input image the image of a cat? Crucially, the set of parameters that are optimized includes "first-layer" parameters that connect the raw input representation (e.g., pixels) to the first layer of internal representations of the network (e.g., edge detectors). In contrast, many other machine learning models employ hand-engineered features to take the role of these first-layer parameters.

Even though deep learning has had a number of successes in NLP, research on true end-to-end learning is just beginning to emerge. Most NLP deep learning models still start with a hand-engineered layer of representation, the level of tokens or words, i.e., the input is broken up into units by manually designed tokenization rules. Such rules often fail to capture structure both within tokens (e.g., morphology) and across multiple tokens (e.g., multi-word expressions).

Another problem of token-based end-to-end systems is that they currently have no principled and general way to generate tokens that are not part of the training vocabulary. Since a token is represented as a vocabulary index, and the parameters that govern how the system handles this token refer to that index, a token without a vocabulary index cannot easily be generated in end-to-end systems. In contrast, character-based end-to-end systems can generate new vocabulary items, so that, at least in theory, they do not have an out-of-vocabulary problem.
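The out-of-vocabulary issue can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the use of whitespace splitting as a stand-in for hand-designed tokenization rules, and the helper names `encode_tokens` and `encode_chars`. It is not taken from any system discussed at the seminar.

```python
# Toy training corpus (illustrative assumption, not from the seminar description).
train_text = "the cat sat on the mat"

# Token-level vocabulary: whitespace splitting stands in for hand-designed tokenization rules.
token_vocab = {tok: i for i, tok in enumerate(sorted(set(train_text.split())))}

# Character-level vocabulary built from the same corpus.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(train_text)))}


def encode_tokens(text):
    """Map each token to its vocabulary index, or None if it is out of vocabulary."""
    return [token_vocab.get(tok) for tok in text.split()]


def encode_chars(text):
    """Map each character to its vocabulary index, or None if the character is unseen."""
    return [char_vocab.get(ch) for ch in text]


# "hat" never occurs as a token in the corpus, but all of its characters do.
new_word = "hat"
token_ids = encode_tokens(new_word)  # no index exists for the unseen token
char_ids = encode_chars(new_word)    # every character has an index

print("token-level:", token_ids)
print("char-level: ", char_ids)
```

In this sketch the unseen word has no token index, so a token-based model has no parameters tied to it, while the character-level encoding covers it fully. A real character-based system can still encounter unseen characters, which is one reason the advantage holds only "in theory".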

Character-based processing is also interesting from a theoretical point of view for linguistics and computational linguistics. We generally assume that the relationship between signifiers (tokens) and the signified (meaning) is arbitrary. There are well-known cases of non-arbitrariness, including onomatopoeia and regularities in names (female vs male first names), but these are usually considered to be exceptions. Character-based approaches can deal much better with such non-arbitrariness than token-based approaches. Thus, if non-arbitrariness is more pervasive than generally assumed, then character-based approaches would have an additional advantage.

Given the success of end-to-end learning in other domains, it is likely that it will also be widely used in NLP to alleviate these issues and lead to great advances. This workshop will bring together an interdisciplinary group of researchers from deep learning, machine learning and computational linguistics to develop a research agenda for end-to-end deep learning applied to natural language.

License
Creative Commons BY 3.0 Unported license
Hinrich Schütze