https://www.dagstuhl.de/23191

May 7 – 12, 2023, Dagstuhl Seminar 23191

Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics

Organizers

Timothy Baldwin (The University of Melbourne, AU)
William Croft (University of New Mexico – Albuquerque, US)
Joakim Nivre (Uppsala University, SE)
Agata Savary (University Paris-Saclay – Orsay, FR)

For support, please contact

Christina Schwarz for administrative matters

Michael Gerke for scientific matters

Motivation

Computational linguistics builds models that can usefully process and produce language and that can increase our understanding of linguistic phenomena. From a computational perspective, language is particularly challenging, notably because of its variable degree of idiosyncrasy (unexpected properties shared by few peer objects) and the pervasiveness of non-compositional phenomena such as multiword expressions (whose meaning cannot be straightforwardly deduced from the meanings of their components, e.g. red tape, by and large, to pay a visit and to pull one’s leg) and constructions (conventional associations of forms and meanings). Additionally, if models and methods are to be consistent and valid across languages, they have to face specificities inherent either to particular languages or to various linguistic traditions.

A few existing initiatives, such as Universal Dependencies [1], PARSEME [2] and UniMorph [3], have been addressing these challenges with the aim of revealing the universals of idiosyncrasy in language, proposing cross-lingually applicable typologies and methodologies for language modelling, and creating highly multilingual language resources and tools. These efforts have been carried out relatively independently, resulting in partly diverging terminologies and methods.

The objectives of this Dagstuhl Seminar are threefold:

  • Theoretical: To deepen the understanding of language universals, and of how they apply to linguistic idiosyncrasy, so as to further promote unified modelling while preserving diversity.
  • Practical: To improve the treatment of idiosyncrasy in treebanking frameworks in computationally tractable ways and thus to foster high-quality NLP tools for more languages with greater typological diversity.
  • Networking: To promote a higher degree of convergence across typology-driven initiatives, while focusing on three main aspects of language modelling: morphology, syntax, and semantics.

In order to pursue these objectives, we propose a list of research questions grouped into thematic categories:

  • Atomic units of language: Identifying words across languages. Relation of syntactic words to lexical units. Morphological universals in words.
  • Syntactic annotation in the presence of idiosyncrasies: Annotating expressions which are partly regular and partly irregular. Capturing syntactic idiosyncrasies of MWEs in ways that express generalisations at the level of types rather than tokens. The interplay between lexicon and treebanking.
  • Syntax-semantics interface in treebanking: Division of labour between syntactic and semantic annotation. Modelling expressions whose regular vs. idiosyncratic nature is particularly hard to capture: serial verbs, light-verb constructions (to pay a visit), verb-particle constructions (to bring about), and functional MWEs (in spite of, because of, not only).
  • Universals of idiosyncrasy: Universals of linguistic idiosyncrasy established so far. Cross-lingual characterization of idiomaticity and syntactic irregularity. Relations between syntactic irregularity and semantic non-compositionality.
  • Semantics of MWEs: Defining and testing semantic non-compositionality for rigorous and reproducible MWE annotation. Semantic calculus in MWEs.
  • Exploratory issues: Long-term objectives to consider for universal-driven initiatives. Extension of the existing models and methods to syntactic constructions.

The expected outcomes of the seminar include: (i) enhanced unified versions of the already existing annotation guidelines put forward by UD, PARSEME and UniMorph, (ii) criteria for applying unified guidelines to specific languages, (iii) recommendations on syntactic and semantic representation of MWEs in lexicons, and (iv) recommendations on how to cover grammatical constructions within treebanking frameworks and NLP tools.

The list of invitees includes researchers in NLP, linguistics and typology, with expertise in morphology, syntax, semantics, MWEs, constructions, annotation, parsing, and dozens of languages from diverse language families. They are based in 22 countries, spread across 5 continents.

This Dagstuhl Seminar is a follow-up to the 2-day online seminar of the same title held on 30-31 August 2021 (21351). That earlier seminar met part of the initial objectives and provided a proof of concept for the project behind the current seminar. We would be very happy to have you on board!

[1] http://universaldependencies.org/
[2] http://www.parseme.eu; https://gitlab.com/parseme/corpora/-/wikis/
[3] https://unimorph.github.io/

Motivation text license
  Creative Commons BY 4.0
  Timothy Baldwin, William Croft, Joakim Nivre, and Agata Savary

Classification

  • Artificial Intelligence
  • Computation And Language

Keywords

  • Computational linguistics
  • Morphosyntax
  • Multiword expressions
  • Language universals
  • Idiosyncrasy

Documentation

Each Dagstuhl Seminar and Dagstuhl Perspectives Workshop is documented in the series Dagstuhl Reports. The seminar organizers, in cooperation with the collector, prepare a report that includes contributions from the participants' talks together with a summary of the seminar.

 

Dagstuhl's Impact

Please inform us when a publication appears as a result of your seminar. These publications are listed in the category Dagstuhl's Impact and are presented on a special shelf on the ground floor of the library.

Publications

Furthermore, a comprehensive peer-reviewed collection of research papers can be published in the series Dagstuhl Follow-Ups.