Dagstuhl Seminar 21351: Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics

(30 Aug – 31 Aug 2021)

Permalink

Please use the following short URL to link to this page: https://www.dagstuhl.de/21351

Organizers

Timothy Baldwin (The University of Melbourne, AU)
William Croft (University of New Mexico - Albuquerque, US)
Joakim Nivre (Uppsala University, SE)
Agata Savary (Université de Tours - Blois, FR)

Contact

Shida Kunz (for scientific matters)
Jutka Gasiorowski (for administrative matters)

Motivation

Computational linguistics builds models that can usefully process and produce language and that can increase our understanding of linguistic phenomena. From a computational perspective, language is particularly challenging, notably because of its variable degree of idiosyncrasy (unexpected properties shared by few peer objects) and the pervasiveness of non-compositional phenomena such as multiword expressions (whose meaning cannot be straightforwardly deduced from the meanings of their components, e.g. red tape, by and large, to pay a visit and to pull one’s leg) and constructions (conventional associations of forms and meanings). Additionally, if models and methods are to be consistent and valid across languages, they have to cope with specificities inherent either to particular languages or to various linguistic traditions.

A few existing initiatives, such as Universal Dependencies (UD), PARSEME and UniMorph, have been addressing these challenges with the aim of revealing the universals of idiosyncrasy in language, proposing cross-lingually applicable typologies and methodologies for language modelling, and creating highly multilingual language resources and tools. These efforts have been carried out relatively independently, resulting in partly diverging terminologies and methods.

The objectives of this seminar are threefold:

- Theoretical: To deepen the understanding of language universals, and of how they apply to linguistic idiosyncrasy, so as to further promote unified modelling while preserving diversity.
- Practical: To improve the treatment of idiosyncrasy in treebanking frameworks, in computationally tractable ways and, thus, to foster high quality NLP tools for more languages with greater typological diversity.
- Networking: To promote a higher degree of convergence across typology-driven initiatives, while focusing on three main aspects of language modelling: morphology, syntax, and semantics.

In order to pursue these objectives, we propose a list of research questions grouped into thematic categories:

- Atomic units of language: Identifying words across languages. Relation of syntactic words to lexical units. Morphological universals in words.
- Syntactic annotation in the presence of idiosyncrasies: Annotating expressions which are partly regular and partly irregular. Capturing syntactic idiosyncrasies of MWEs in a way that expresses generalisations at the level of types rather than tokens. The interplay between lexicon and treebanking.
- Syntax-semantics interface in treebanking: Division of labor between syntactic and semantic annotation. Modeling expressions whose regular vs. idiosyncratic nature is particularly hard to capture: serial verbs, light-verb constructions (to pay a visit), verb-particle constructions (to bring about), and functional MWEs (in spite of, because of, not only).
- Universals of idiosyncrasy: Universals of linguistic idiosyncrasy established so far. Cross-lingual characterization of idiomaticity and syntactic irregularity. Relations between syntactic irregularity and semantic non-compositionality.
- Semantics of MWEs: Defining and testing semantic non-compositionality for rigorous and reproducible MWE annotation. Semantic calculus in MWEs.
- Exploratory issues: Long-term objectives to consider for universal-driven initiatives. Extension of the existing models and methods to syntactic constructions.

The expected outcomes of the seminar include: (i) enhanced unified versions of the already existing annotation guidelines put forward by UD, PARSEME and UniMorph, (ii) criteria for applying unified guidelines to specific languages, (iii) recommendations on syntactic and semantic representation of MWEs in lexicons, and (iv) recommendations on how to cover grammatical constructions within treebanking frameworks and NLP tools.

The list of invitees includes researchers in NLP, linguistics and typology, with expertise in morphology, syntax, semantics, MWEs, constructions, annotation, parsing, and dozens of languages from diverse language families. They are based in 22 countries, spread across 4 continents. Due to the SARS-CoV-2 pandemic, a hybrid (on-site and videoconferencing) meeting format will most probably be offered.

Creative Commons BY 3.0 DE

Timothy Baldwin, William Croft, Joakim Nivre, and Agata Savary

Summary

This Dagstuhl Seminar was initially planned as a 1-week event in June 2020 (with number 20261) with the following objectives:

- Theoretical: To deepen the understanding of language universals, and of how they apply to linguistic idiosyncrasy, so as to further promote unified modelling while preserving diversity.
- Practical: To improve the treatment of idiosyncrasy in treebanking frameworks, in computationally tractable ways and, thus, to foster high quality NLP tools for more languages with greater typological diversity.
- Networking: To promote a higher degree of convergence across typology-driven initiatives, while focusing on three main aspects of language modelling: morphology, syntax, and semantics.

Due to the COVID-19 pandemic, the event was first rescheduled and finally reduced to a 2-day online event on 30-31 August 2021, which corresponds to about 20% of the initially planned duration. It comprised two 3-hour sessions, repeated to better accommodate participants in various time zones.

Prior to the event, participants submitted discussion issues, based on which working groups and the program were formed, as described in our Wiki space.

More precisely, the program of the event followed the Dagstuhl model:

- A list of recommended readings was published prior to the event.
- Introductory talks, given by the 4 organizers, ensured a common understanding of the scope and of the challenges to address.
- Personal introductions of all participants helped achieve a community-building effect, despite the online setting.
- Working groups (WGs) were built on the basis of the discussion issues submitted by the participants. Each WG had 4 co-leaders, at least one of whom could attend the repeated sessions, so as to ensure consistency between the 2 time-zone sub-groups. The following WGs were created:
  - WG1: What counts as a word?
  - WG2: What counts as an MWE and as a construction?
  - WG3: Syntax vs. semantics
- Discussion issues were addressed in WGs through short introductions by the proposers, followed by brainstorming.
- Plenary reporting sessions from WGs took place twice for every time zone.

The event attracted 51 participants, who judged it successful and expressed the need for a full-size onsite follow-up event. All the organizational details and outcomes of the seminar are gathered in our Wiki space.

Despite its much reduced and fully online format, the seminar achieved part of its objectives, stressed the importance of some initially defined research questions, gave rise to new questions, and showed the effectiveness of some instruments.

On the networking side, the intended convergence effect was clearly apparent. While the initial proposal and invitee list were dominated by NLP-oriented members of the UD and PARSEME communities, strong contributions came notably from the less numerous typology and UniMorph experts. The four communities interacted actively, and we intend to reinforce these interactions in the near future. Notably, steps were taken towards:
- integrating typology experts into the PARSEME core group
- accompanying a seminal work in typology (Croft, to appear) with a "companion volume" on the practical implementation of morphosyntactic concepts in UD.

On the theoretical side, the event showed:
1. The importance of the research question "How to identify words across languages?" (item I.A in the seminar proposal), to which the whole of Working Group 1 was dedicated. In particular, new insights from lesser-studied languages, brought by typology experts, allowed us to broaden the perspective on this issue.
2. The need for capturing the relationship between the two fundamental notions in this proposal, a multiword expression and a construction, studied by Working Group 2. From the linguistic and typological perspective, an MWE is a special case of a construction, which is rarely made explicit in current NLP models. But the notion of a construction needs a more formal definition to be implementable in NLP, notably as far as the type-token opposition is concerned (question II.B in the seminar proposal). Thus, the typology-NLP interactions are essential in the quest for an optimal model.
3. The scope of the syntax-semantics interface issues (question II in the proposal) addressed by Working Group 3. On the one hand, the interests of the community in this respect exceeded the scope intended by the event organizers. Namely, corpus-lexicon interlinking for all language units, not only for MWEs, was targeted. On the other hand, MWEs are exemplars of condensed syntax-semantics interface issues, and as such provide good case studies in this domain.

On the practical side, some initial proposals emerged for harmonizing UD treebank annotation guidelines with: (i) modelling morphological properties at the subword level (heavily studied by UniMorph), and (ii) labelling MWEs (a core activity of PARSEME).

Any multidisciplinary approach like ours carries a heavy risk of intractability. This is because different communities often have different objectives and points of view on the same phenomena, and they may fail to agree on a unified approach, or even on the usefulness of working towards such a unification. In our case, there is a tension between:

- the diversity and descriptive detail required in linguistics,
- the simplifications necessary for the sake of robustness in NLP.

In other words, it is legitimate to question the usefulness of universality-driven initiatives (in NLP) if idiosyncrasy and diversity are basic properties of language data. Yet even typologists seek language universals which abstract away from the idiosyncrasy.

We feel that the event allowed us to mitigate this tension. Namely, even if a universality-based treebank fails to render the diversity of possible analyses of a language phenomenon, it is still useful not only for NLP applications but also for linguistic and typological analyses. This is because relevant examples are easy to extract (and to further re-interpret), as long as the annotation is consistent and well-documented.
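As a rough illustration of why consistent annotation makes examples easy to extract, the sketch below collects tokens linked by the UD relation `fixed` from a single sentence in the tab-separated CoNLL-U format. The sample sentence and its analysis are hand-written for illustration and are not taken from any released treebank; the helper names `to_conllu` and `fixed_expressions` are hypothetical, and the function deliberately handles only one sentence at a time.

```python
from collections import defaultdict

# Hand-written toy analysis (an assumption for illustration, not treebank data):
# (ID, FORM, LEMMA, UPOS, HEAD, DEPREL)
ROWS = [
    ("1", "He", "he", "PRON", "2", "nsubj"),
    ("2", "left", "leave", "VERB", "0", "root"),
    ("3", "in", "in", "ADP", "7", "case"),
    ("4", "spite", "spite", "NOUN", "3", "fixed"),
    ("5", "of", "of", "ADP", "3", "fixed"),
    ("6", "the", "the", "DET", "7", "det"),
    ("7", "rain", "rain", "NOUN", "2", "obl"),
    ("8", ".", ".", "PUNCT", "2", "punct"),
]


def to_conllu(rows):
    """Render the rows as 10-column, tab-separated CoNLL-U lines."""
    lines = ["# text = He left in spite of the rain."]
    for tok_id, form, lemma, upos, head, deprel in rows:
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        lines.append("\t".join(
            [tok_id, form, lemma, upos, "_", "_", head, deprel, "_", "_"]))
    return "\n".join(lines)


def fixed_expressions(conllu_sentence):
    """Return the surface strings of `fixed` groups in one CoNLL-U sentence."""
    forms = {}
    groups = defaultdict(list)  # head token ID -> IDs of its `fixed` dependents
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (e.g. 1-2) and empty nodes
        tok_id, form, head, deprel = int(cols[0]), cols[1], int(cols[6]), cols[7]
        forms[tok_id] = form
        # Ignore any relation subtype after the colon
        if deprel.split(":")[0] == "fixed":
            groups[head].append(tok_id)
    return [
        " ".join(forms[i] for i in sorted([head] + deps))
        for head, deps in sorted(groups.items())
    ]


SAMPLE = to_conllu(ROWS)
print(fixed_expressions(SAMPLE))
```

Such a query only finds what the guidelines make explicit: expressions annotated with other relations, or MWEs labelled in a separate layer as in PARSEME, would need their own extraction logic.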

Another barrier-lifting effect of the event concerned the relation between UD and PARSEME. It seems that the MWE categories defined by UD and PARSEME are less incompatible than initially expected, simply because the very definition of an MWE differs between UD and PARSEME. This could have been a source of major incompatibility, but since an MWE does not really have a status in the UD annotation process, the discrepancies could (at least in some cases) be overcome relatively easily.

In conclusion, the event provided, in our opinion, a proof of concept for the framing objectives set out in the original Dagstuhl seminar proposal. However, since the effective format and duration were severely reduced compared to the initially intended setting, only part of these objectives could be achieved. Thus, we are currently working to ensure follow-up events. In particular, a proposal for a new Dagstuhl seminar with roughly the same objectives has been submitted.

Creative Commons BY 4.0

Timothy Baldwin, William Croft, Joakim Nivre, and Agata Savary

Participants

Remote:

Timothy Baldwin (The University of Melbourne, AU)
Verginica Barbu Mititelu (Research Institute for A.I. - Bucharest, RO)
Emily M. Bender (University of Washington - Seattle, US)
Archna Bhatia (Florida IHMC - Ocala, US)
Bernd Bohnet (Google - Amsterdam, NL)
Francis Bond (Nanyang TU - Singapore, SG)
Cem Bozsahin (Middle East Technical University - Ankara, TR)
Ryan Cotterell (ETH Zürich, CH)
William Croft (University of New Mexico - Albuquerque, US)
Miryam de Lhoneux (University of Copenhagen, DK)
Marie-Catherine de Marneffe (Ohio State University - Columbus, US)
Jamie Findlay (University of Oslo, NO)
Daniel Flickinger (Stanford University, US)
Kim Gerdes (University Paris-Saclay - Orsay, FR)
Voula Giouli (Athena Research Center, GR)
Tunga Gungor (Bogaziçi University - Istanbul, TR)
Jan Hajic (Charles University - Prague, CZ)
Dag Haug (University of Oslo, NO)
Uxoa Iñurrieta (Donostia, ES)
Laura Kallmeyer (Universität Düsseldorf, DE)
Christo Kirov (Google - New York, US)
Maria Koptjevskaja-Tamm (Stockholm University, SE)
Artur Kulmizev (Uppsala University, SE)
Lori Levin (Carnegie Mellon University - Pittsburgh, US)
Natalia Levshina (Max-Planck-Institut für Psycholinguistik - Nijmegen, NL)
Teresa Lynn (Dublin City University, IE)
Stella Markantonatou (Athena Research Center, GR)
Nurit Melnik (The Open University of Israel - Raanana, IL)
Paola Merlo (University of Geneva, CH)
Yusuke Miyao (University of Tokyo, JP)
Kadri Muischnek (University of Tartu, EE)
Joakim Nivre (Uppsala University, SE)
Petya Osenova (Bulgarian Academy of Sciences - Sofia, BG)
Steve Pepper (University of Oslo, NO)
James Pustejovsky (Brandeis University - Waltham, US)
Alexandre Rademaker (IBM Research - Sao Paulo, BR)
Carlos Ramisch (Aix-Marseille University, FR)
Manfred Sailer (Goethe-Universität Frankfurt am Main, DE)
Agata Savary (Université de Tours - Blois, FR)
Emmanuel Schang (University of Orleans, FR)
Nathan Schneider (Georgetown University - Washington, DC, US)
Ivelina Stoyanova (Bulgarian Academy of Sciences - Sofia, BG)
Sara Stymne (Uppsala University, SE)
Reut Tsarfaty (Bar-Ilan University - Ramat Gan, IL)
Francis M. Tyers (Indiana University - Bloomington, US)
Meagan Vigus (University of New Mexico - Albuquerque, US)
Aline Villavicencio (University of Sheffield, GB)
Veronika Vincze (University of Szeged, HU)
Ekaterina Vylomova (The University of Melbourne, AU)
Nianwen Xue (Brandeis University - Waltham, US)
David Yarowsky (Johns Hopkins University - Baltimore, US)
Amir Zeldes (Georgetown University - Washington, DC, US)
Daniel Zeman (Charles University - Prague, CZ)
Tim Zingler (University of New Mexico - Albuquerque, US)

Classification

artificial intelligence / robotics

Keywords

computational linguistics
morphosyntax
multiword expressions
language universals
idiosyncrasy
