OASIcs, Volume 93

3rd Conference on Language, Data and Knowledge (LDK 2021)



Event

LDK 2021, September 1-3, 2021, Zaragoza, Spain

Editors

Dagmar Gromann
  • University of Vienna, Austria
Gilles Sérasset
  • Université Grenoble Alpes, France
Thierry Declerck
  • DFKI GmbH, Germany
John P. McCrae
  • National University of Ireland Galway, Ireland
Jorge Gracia
  • University of Zaragoza, Spain
Julia Bosque-Gil
  • University of Zaragoza, Spain
Fernando Bobillo
  • University of Zaragoza, Spain
Barbara Heinisch
  • University of Vienna, Austria

Publication Details

  • Published: 2021-08-30
  • Publisher: Schloss Dagstuhl – Leibniz-Zentrum für Informatik
  • ISBN: 978-3-95977-199-3
  • DBLP: db/conf/ldk/ldk2021

Documents

Document
Complete Volume
OASIcs, Volume 93, LDK 2021, Complete Volume

Authors: Dagmar Gromann, Gilles Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo, and Barbara Heinisch


Abstract
OASIcs, Volume 93, LDK 2021, Complete Volume

Cite as

3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 1-516, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@Proceedings{gromann_et_al:OASIcs.LDK.2021,
  title =	{{OASIcs, Volume 93, LDK 2021, Complete Volume}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{1--516},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021},
  URN =		{urn:nbn:de:0030-drops-145352},
  doi =		{10.4230/OASIcs.LDK.2021},
  annote =	{Keywords: OASIcs, Volume 93, LDK 2021, Complete Volume}
}
Document
Front Matter
Front Matter, Table of Contents, Preface, Conference Organization

Authors: Dagmar Gromann, Gilles Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo, and Barbara Heinisch


Abstract
Front Matter, Table of Contents, Preface, Conference Organization

Cite as

3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 0:i-0:xvi, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{gromann_et_al:OASIcs.LDK.2021.0,
  author =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{0:i--0:xvi},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.0},
  URN =		{urn:nbn:de:0030-drops-145364},
  doi =		{10.4230/OASIcs.LDK.2021.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}
Document
Invited Talk
The JeuxDeMots Project (Invited Talk)

Authors: Mathieu Lafourcade


Abstract
The JeuxDeMots project aims at building a very large knowledge base in French, both common sense and specialized, using games, contributory approaches, and inference mechanisms. A dozen games have been designed as part of this project, each one intended to collect specific information or to consolidate the information acquired through the other games. In this presentation, the data collected and constructed since the launch of the project in the summer of 2007 will be analyzed both qualitatively and quantitatively. In particular, the following aspects will be detailed: the structure of the lexical and semantic network, some types of relations (semantic, ontological, subjective, semantic roles, associations of ideas), the annotation of relations (meta-information), semantic refinements (management of polysemy), and the creation of clusters that allow the representation of richer knowledge (n-argument relations) and form an implicit neural network. Finally, I will describe some complementary acquisition methods and applications, such as a bot for endogenous contributions, a chatbot that makes inferences, and semantic extraction from texts.

Cite as

Mathieu Lafourcade. The JeuxDeMots Project (Invited Talk). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, p. 1:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{lafourcade:OASIcs.LDK.2021.1,
  author =	{Lafourcade, Mathieu},
  title =	{{The JeuxDeMots Project}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{1:1--1:1},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.1},
  URN =		{urn:nbn:de:0030-drops-145377},
  doi =		{10.4230/OASIcs.LDK.2021.1},
  annote =	{Keywords: Lexical Semantic Network, Games with a Purpose, Inferences, Knowledge Representation, Semantic Representation}
}
Document
Invited Talk
A Smell is Worth a Thousand Words: Olfactory Information Extraction and Semantic Processing in a Multilingual Perspective (Invited Talk)

Authors: Sara Tonelli


Abstract
More than any other sense, smell is linked directly to our emotions and our memories. However, smells are intangible and very difficult to preserve, making it hard to effectively identify, consolidate, and promote the wide-ranging role scents and smelling have in our cultural heritage. While some novel approaches have recently been proposed to monitor so-called urban smellscapes and analyse the olfactory dimension of our environments (Quercia et al., 2015), when it comes to smellscapes from the past, little research has been done to keep track of how places, events and people have been described from an olfactory perspective. Fortunately, some key prerequisites for addressing this problem are now in place. In recent years, European cultural heritage institutions have invested heavily in large-scale digitisation: we hold a wealth of object, text and image data which can now be analysed using artificial intelligence. What remains missing is a methodology for the extraction of scent-related information from large amounts of text, as well as a broader awareness of the wealth of historical olfactory descriptions, experiences and memories contained within these heritage datasets. In this talk, I will describe ongoing activities towards this goal, focused on text mining and semantic processing of olfactory information. I will present the general framework designed to annotate smell events in documents, along with some preliminary results on information extraction approaches in a multilingual scenario. I will discuss the main findings and the challenges related to modelling textual descriptions of smells, including the metaphorical use of smell-related terms and the well-known limitations of smell vocabulary in European languages compared to other senses.

Cite as

Sara Tonelli. A Smell is Worth a Thousand Words: Olfactory Information Extraction and Semantic Processing in a Multilingual Perspective (Invited Talk). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, p. 2:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{tonelli:OASIcs.LDK.2021.2,
  author =	{Tonelli, Sara},
  title =	{{A Smell is Worth a Thousand Words: Olfactory Information Extraction and Semantic Processing in a Multilingual Perspective}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{2:1--2:1},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.2},
  URN =		{urn:nbn:de:0030-drops-145386},
  doi =		{10.4230/OASIcs.LDK.2021.2},
  annote =	{Keywords: olfactory information extraction, smellscapes, multilingual annotation}
}
Document
Invited Talk
Free/Open-Source Machine Translation for the Low-Resource Languages of Spain (Invited Talk)

Authors: Mikel L. Forcada


Abstract
While machine translation has historically been rule-based, that is, based on dictionaries and rules written by experts, most present-day machine translation is corpus-based. In the last few years, statistical machine translation, the dominant corpus-based approach, has been displaced by neural machine translation in most applications, in view of the better results reported, particularly for languages with very different syntax. But both statistical and neural machine translation need to be trained on large amounts of parallel data, that is, sentences in one language carefully paired with their translations in the other language, and this is a resource that may not be available for some low-resource languages. While some of the languages of Spain may be considered reasonably endowed with parallel corpora connecting them to Spanish or even to English - Basque, Catalan, Galician - and are well served with machine translation systems, there are many other languages which cannot afford them, such as Aranese Occitan, Aragonese, or Asturian/Leonese. Fortunately, the languages in this last group belong to the Romance language family, as Spanish does, and this makes translation from and into Spanish under a rule-based paradigm the only feasible approach. After briefly describing the main machine translation paradigms, I will describe the Apertium free/open-source rule-based machine translation platform, which has been used to build machine translation systems for these low-resource languages of Spain - indeed, sometimes the only ones available. The free/open-source setting has made linguistic data for these languages available for anyone in their linguistic communities to build other language technologies for them. For example, the Apertium family of bilingual and monolingual data has been converted into RDF and made accessible on the Web as linked data.

Cite as

Mikel L. Forcada. Free/Open-Source Machine Translation for the Low-Resource Languages of Spain (Invited Talk). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, p. 3:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{forcada:OASIcs.LDK.2021.3,
  author =	{Forcada, Mikel L.},
  title =	{{Free/Open-Source Machine Translation for the Low-Resource Languages of Spain}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{3:1--3:1},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.3},
  URN =		{urn:nbn:de:0030-drops-145399},
  doi =		{10.4230/OASIcs.LDK.2021.3},
  annote =	{Keywords: free/open-source, machine translation, languages of Spain, low-resource machine translation}
}
Document
Crazy New Idea
A Computational Simulation of Children’s Language Acquisition (Crazy New Idea)

Authors: Ben Ambridge


Abstract
Many modern NLP models are already close to simulating children’s language acquisition; the main thing they currently lack is a "real world" representation of semantics that allows them to map from form to meaning and vice-versa. The aim of this "Crazy Idea" is to spark a discussion about how we might get there.

Cite as

Ben Ambridge. A Computational Simulation of Children’s Language Acquisition (Crazy New Idea). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 4:1-4:3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{ambridge:OASIcs.LDK.2021.4,
  author =	{Ambridge, Ben},
  title =	{{A Computational Simulation of Children’s Language Acquisition}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{4:1--4:3},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.4},
  URN =		{urn:nbn:de:0030-drops-145402},
  doi =		{10.4230/OASIcs.LDK.2021.4},
  annote =	{Keywords: Child language acquisition, language development, deep learning, BERT, ELMo, GPT-3}
}
Document
Crazy New Idea
Get! Mimetypes! Right! (Crazy New Idea)

Authors: Christian Chiarcos


Abstract
This paper identifies three technical requirements - availability of data, sustainable hosting and resolvable URIs for hosted data - as minimal preconditions for Linguistic Linked Open Data technology to develop into a mature technological ecosystem that third-party applications can build upon. While a critical amount of data is available (and it continues to grow), there does not seem to exist a hosting solution that combines the prospect of long-term availability with an unrestricted capability to support resolvable URIs. In particular, data hosting services currently do not allow data to be declared as RDF content by means of its media type (MIME type), so the capability of clients to recognize formats and to resolve URIs on that basis is severely limited.
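
As a concrete illustration of the failure mode described above, the following minimal Python sketch checks whether a hosting service declares an RDF media type for a file; the URL is a placeholder and the media-type list is illustrative:

import requests

# RDF media types a client could rely on for format detection.
RDF_MIME_TYPES = {
    "text/turtle", "application/rdf+xml", "application/n-triples",
    "application/ld+json", "application/trig", "application/n-quads",
}

url = "https://example.org/data/lexicon.ttl"  # hypothetical hosted RDF file
response = requests.head(url, allow_redirects=True, timeout=10)
media_type = response.headers.get("Content-Type", "").split(";")[0].strip()

if media_type in RDF_MIME_TYPES:
    print(f"Declared as RDF ({media_type}); clients can resolve URIs against it.")
else:
    # Typical failure mode on hosting services: text/plain or application/octet-stream.
    print(f"Served as '{media_type or 'unknown'}'; format detection by media type fails.")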

Cite as

Christian Chiarcos. Get! Mimetypes! Right! (Crazy New Idea). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 5:1-5:4, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{chiarcos:OASIcs.LDK.2021.5,
  author =	{Chiarcos, Christian},
  title =	{{Get! Mimetypes! Right!}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{5:1--5:4},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.5},
  URN =		{urn:nbn:de:0030-drops-145418},
  doi =		{10.4230/OASIcs.LDK.2021.5},
  annote =	{Keywords: data hosting, mimetypes, resolvability, URIs, Linked Data foundations}
}
Document
Crazy New Idea
Mind the Gap: Language Data, Their Producers, and the Scientific Process (Crazy New Idea)

Authors: Tobias Weber


Abstract
This paper discusses the role of low-resource languages in NLP through the lens of different stakeholders. It argues that the current "consumerist approach" to language data reinforces a vicious circle which increases the technological exclusion of minority communities. Researchers' decisions directly affect these processes to the detriment of minorities and practitioners engaging in language work in these communities. In line with the conference topic, the paper concludes with strategies and prerequisites for creating a positive feedback loop in our research benefiting language work within the next decade.

Cite as

Tobias Weber. Mind the Gap: Language Data, Their Producers, and the Scientific Process (Crazy New Idea). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 6:1-6:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{weber:OASIcs.LDK.2021.6,
  author =	{Weber, Tobias},
  title =	{{Mind the Gap: Language Data, Their Producers, and the Scientific Process}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{6:1--6:9},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.6},
  URN =		{urn:nbn:de:0030-drops-145424},
  doi =		{10.4230/OASIcs.LDK.2021.6},
  annote =	{Keywords: minority languages, data integration, sociology of technology, documentary linguistics, exclusion}
}
Document
Representing the Under-Represented: a Dataset of Post-Colonial, and Migrant Writers

Authors: Marco Antonio Stranisci, Viviana Patti, and Rossana Damiano


Abstract
In today’s media and in the Web of Data, non-Western people still suffer from a lack of representation. In our work, we address this issue by presenting a pipeline for collecting and semantically encoding Wikipedia biographies of writers who are under-represented due to their non-Western origins or their legal status in a country. The two main components of the ontology are described, together with a framework for mapping textual biographies to their corresponding semantic representations. A description of the dataset and some examples of the conversion of biographical texts to the ontology classes are provided.

Cite as

Marco Antonio Stranisci, Viviana Patti, and Rossana Damiano. Representing the Under-Represented: a Dataset of Post-Colonial, and Migrant Writers. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{stranisci_et_al:OASIcs.LDK.2021.7,
  author =	{Stranisci, Marco Antonio and Patti, Viviana and Damiano, Rossana},
  title =	{{Representing the Under-Represented: a Dataset of Post-Colonial, and Migrant Writers}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{7:1--7:14},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.7},
  URN =		{urn:nbn:de:0030-drops-145431},
  doi =		{10.4230/OASIcs.LDK.2021.7},
  annote =	{Keywords: Ontologies, Knowledge Graph, Language Resources, Migrations}
}
Document
Plenary Debates of the Parliament of Finland as Linked Open Data and in Parla-CLARIN Markup

Authors: Laura Sinikallio, Senka Drobac, Minna Tamper, Rafael Leal, Mikko Koho, Jouni Tuominen, Matti La Mela, and Eero Hyvönen


Abstract
This paper presents a knowledge graph created by transforming the plenary debates of the Parliament of Finland (1907-) into Linked Open Data (LOD). The data, totaling over 900,000 speeches with automatically created semantic annotations and rich ontology-based metadata, are published in a Linked Open Data Service and can be used via a SPARQL API and as data dumps. The speech data is part of the larger LOD publication FinnParla, which also includes prosopographical data about the politicians. The data is being used for studying parliamentary language and culture in Digital Humanities at several universities. To serve a wider variety of users, the entirety of this data was also produced using Parla-CLARIN markup. We present the first publication of all Finnish parliamentary debates as data. Technical novelties in our approach include the use of both Parla-CLARIN and an RDF schema developed for representing the speeches, integration of the data with a new Parliament of Finland Ontology for deeper data analyses, and enrichment of the data with a variety of external national and international data sources.

Cite as

Laura Sinikallio, Senka Drobac, Minna Tamper, Rafael Leal, Mikko Koho, Jouni Tuominen, Matti La Mela, and Eero Hyvönen. Plenary Debates of the Parliament of Finland as Linked Open Data and in Parla-CLARIN Markup. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 8:1-8:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{sinikallio_et_al:OASIcs.LDK.2021.8,
  author =	{Sinikallio, Laura and Drobac, Senka and Tamper, Minna and Leal, Rafael and Koho, Mikko and Tuominen, Jouni and La Mela, Matti and Hyv\"{o}nen, Eero},
  title =	{{Plenary Debates of the Parliament of Finland as Linked Open Data and in Parla-CLARIN Markup}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{8:1--8:17},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.8},
  URN =		{urn:nbn:de:0030-drops-145444},
  doi =		{10.4230/OASIcs.LDK.2021.8},
  annote =	{Keywords: Plenary debates, parliamentary data, Parla-CLARIN, Linked Open Data, Digital Humanities}
}
Document
Towards a Corpus of Historical German Plays with Emotion Annotations

Authors: Thomas Schmidt, Katrin Dennerlein, and Christian Wolff


Abstract
In this paper, we present first work-in-progress annotation results of a project investigating computational methods of emotion analysis for historical German plays around 1800. We report on the development of an annotation scheme focussing on the annotation of emotions that are important from a literary studies perspective for this time span, as well as on the annotation process we have developed. We annotate emotions expressed or attributed by characters of the plays in the written texts. The scheme consists of 13 hierarchically structured emotion concepts as well as the source (who experiences or attributes the emotion) and target (who or what the emotion is directed towards). We have conducted the annotation of five example plays of our corpus with two annotators per play, and report on annotation distributions and agreement statistics. We collected over 6,500 emotion annotations and identified fair agreement for most concepts, with κ-values around 0.4. We discuss how we plan to improve annotator consistency and continue our work. The results also have implications for similar projects in the context of Digital Humanities.
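
As a worked example of the agreement statistic reported above, Cohen's κ can be computed with scikit-learn; the two annotators' label sequences below are invented for illustration:

from sklearn.metrics import cohen_kappa_score

# Hypothetical emotion labels from two annotators for the same six text spans.
annotator_a = ["joy", "fear", "anger", "joy", "none", "fear"]
annotator_b = ["joy", "anger", "anger", "joy", "fear", "fear"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
# Common reading: 0.21-0.40 is fair agreement, 0.41-0.60 moderate agreement.
print(f"Cohen's kappa: {kappa:.2f}")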

Cite as

Thomas Schmidt, Katrin Dennerlein, and Christian Wolff. Towards a Corpus of Historical German Plays with Emotion Annotations. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 9:1-9:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{schmidt_et_al:OASIcs.LDK.2021.9,
  author =	{Schmidt, Thomas and Dennerlein, Katrin and Wolff, Christian},
  title =	{{Towards a Corpus of Historical German Plays with Emotion Annotations}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{9:1--9:11},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.9},
  URN =		{urn:nbn:de:0030-drops-145459},
  doi =		{10.4230/OASIcs.LDK.2021.9},
  annote =	{Keywords: Emotion, Annotation, Digital Humanities, Computational Literary Studies, German Drama, Sentiment Analysis, Emotion Analysis, Corpus}
}
Document
Enriching a Lexical Resource for French Verbs with Aspectual Information

Authors: Anna Kupść, Pauline Haas, Rafael Marín, and Antonio Balvet


Abstract
The paper presents a syntactico-semantic lexicon of over a thousand French verbs, created by manually adding lexical aspect features to verb frames from TreeLex [Kupść and Abeillé, 2008]. We present how the original syntactic resource was adapted to the current project, our aspect assignment procedure, and an overview of the resulting lexical resource.

Cite as

Anna Kupść, Pauline Haas, Rafael Marín, and Antonio Balvet. Enriching a Lexical Resource for French Verbs with Aspectual Information. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 10:1-10:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{kupsc_et_al:OASIcs.LDK.2021.10,
  author =	{Kup\'{s}\'{c}, Anna and Haas, Pauline and Mar{\'\i}n, Rafael and Balvet, Antonio},
  title =	{{Enriching a Lexical Resource for French Verbs with Aspectual Information}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{10:1--10:12},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.10},
  URN =		{urn:nbn:de:0030-drops-145460},
  doi =		{10.4230/OASIcs.LDK.2021.10},
  annote =	{Keywords: computational semantics, corpora-based methods in language engineering, electronic language resources and tools, formalization of natural languages}
}
Document
Annotation of Fine-Grained Geographical Entities in German Texts

Authors: Julián Moreno-Schneider, Melina Plakidis, and Georg Rehm


Abstract
We work on the creation of a corpus about the Berlin district of Moabit, crawled from the internet and primarily meant for training NER systems in German and English. Typical NER corpora and corresponding systems distinguish persons, organisations and locations, but do not distinguish different types of location entities. For our tourism-inspired use case, we need fine-grained annotations for toponyms. In this paper, we outline the fine-grained classification of geographical entities and the resulting annotations, and we present preliminary results on automatically tagging toponyms in a small, bootstrapped gold corpus.
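
A minimal sketch of fine-grained toponym tagging in the spirit of this paper, using spaCy's EntityRuler; the label set and patterns below are illustrative stand-ins, not the paper's actual classification scheme:

import spacy

nlp = spacy.blank("de")
ruler = nlp.add_pipe("entity_ruler")
# Illustrative fine-grained toponym labels instead of a single LOC class.
ruler.add_patterns([
    {"label": "DISTRICT", "pattern": "Moabit"},
    {"label": "STREET", "pattern": "Turmstraße"},
    {"label": "CITY", "pattern": "Berlin"},
])

doc = nlp("Die Turmstraße liegt in Moabit, einem Ortsteil von Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Turmstraße', 'STREET'), ('Moabit', 'DISTRICT'), ('Berlin', 'CITY')]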

Cite as

Julián Moreno-Schneider, Melina Plakidis, and Georg Rehm. Annotation of Fine-Grained Geographical Entities in German Texts. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 11:1-11:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{morenoschneider_et_al:OASIcs.LDK.2021.11,
  author =	{Moreno-Schneider, Juli\'{a}n and Plakidis, Melina and Rehm, Georg},
  title =	{{Annotation of Fine-Grained Geographical Entities in German Texts}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{11:1--11:8},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.11},
  URN =		{urn:nbn:de:0030-drops-145473},
  doi =		{10.4230/OASIcs.LDK.2021.11},
  annote =	{Keywords: Named Entity Recognition, Geographical Entities, Annotation}
}
Document
Supporting the Annotation Experience Through CorEx and Word Mover’s Distance

Authors: Stefania Pecòre


Abstract
Online communities can be used to promote destructive behaviours, as in pro-Eating Disorder (ED) communities. Research needs annotated data to study these phenomena. Even though many platforms have moderated this type of content, Twitter has not, and it can still be used for research purposes. In this paper, we unveiled emojis, words, and uncommon linguistic patterns within the ED Twitter community by using the Correlation Explanation (CorEx) algorithm on unstructured, non-annotated data to retrieve topics. We then annotated the dataset according to these topics. Finally, we analysed the use of CorEx and Word Mover’s Distance to automatically retrieve similar new sentences and augment the annotated dataset.
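
A minimal sketch of the retrieval step described above, ranking unlabelled candidate sentences by Word Mover's Distance to an already-annotated seed sentence with gensim; the vector file and the sentences are placeholders:

from gensim.models import KeyedVectors

# Placeholder path; any pretrained word2vec-format vectors would do.
vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

seed = "nothing eaten today just coffee".split()  # an annotated seed tweet
candidates = [
    "only black coffee since yesterday".split(),
    "great pasta recipe for dinner tonight".split(),
]

# Smaller Word Mover's Distance = semantically closer (the solver needs the POT package).
for tokens in sorted(candidates, key=lambda t: vectors.wmdistance(seed, t)):
    print(vectors.wmdistance(seed, tokens), " ".join(tokens))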

Cite as

Stefania Pecòre. Supporting the Annotation Experience Through CorEx and Word Mover’s Distance. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 12:1-12:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{pecore:OASIcs.LDK.2021.12,
  author =	{Pec\`{o}re, Stefania},
  title =	{{Supporting the Annotation Experience Through CorEx and Word Mover’s Distance}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{12:1--12:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.12},
  URN =		{urn:nbn:de:0030-drops-145481},
  doi =		{10.4230/OASIcs.LDK.2021.12},
  annote =	{Keywords: topic retrieval, annotation, eating disorders, natural language processing}
}
Document
A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian

Authors: Danka Jokić, Ranka Stanković, Cvetana Krstev, and Branislava Šandrih


Abstract
Abusive speech in social media, including profanities, derogatory and hate speech, has reached the level of a pandemic. A system that would be able to detect such texts could help in making the Internet and social media a better and more respectful virtual space. Research and commercial applications in this area were so far focused mainly on the English language. This paper presents the work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated tweets, of which 1,416 were labelled as using some kind of abusive speech. Those 1,416 tweets were further sub-classified, for instance into those using vulgar language, hate speech, derogatory language, etc. In this paper, we explain the process of data acquisition, annotation, and corpus construction. We also discuss the results of an initial analysis of the annotation quality. Finally, we present the structure of an abusive speech lexicon and its enrichment with abusive triggers extracted from the AbCoSER dataset.

Cite as

Danka Jokić, Ranka Stanković, Cvetana Krstev, and Branislava Šandrih. A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 13:1-13:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{jokic_et_al:OASIcs.LDK.2021.13,
  author =	{Joki\'{c}, Danka and Stankovi\'{c}, Ranka and Krstev, Cvetana and \v{S}andrih, Branislava},
  title =	{{A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{13:1--13:17},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.13},
  URN =		{urn:nbn:de:0030-drops-145493},
  doi =		{10.4230/OASIcs.LDK.2021.13},
  annote =	{Keywords: abusive language, hate speech, Serbian, Twitter, lexicon, corpus}
}
Document
Bias in Knowledge Graphs - An Empirical Study with Movie Recommendation and Different Language Editions of DBpedia

Authors: Michael Matthias Voit and Heiko Paulheim


Abstract
Public knowledge graphs such as DBpedia and Wikidata have been recognized as interesting sources of background knowledge for building content-based recommender systems. They can be used to add information about the items to be recommended and links between them. While quite a few approaches for exploiting knowledge graphs have been proposed, most of them aim at optimizing the recommendation strategy while using a fixed knowledge graph. In this paper, we take a different approach: we fix the recommendation strategy and observe changes when using different underlying knowledge graphs. In particular, we use different language editions of DBpedia. We show that the usage of different knowledge graphs leads not only to differently biased recommender systems, but also to recommender systems that differ in performance for particular fields of recommendation.
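
A minimal sketch of the setup the study builds on: the same SPARQL query sent to two DBpedia language editions returns differently sized, and differently biased, sets of films to feed a recommender (public endpoints shown below may change or be rate-limited):

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT (COUNT(?film) AS ?n)
WHERE { ?film a <http://dbpedia.org/ontology/Film> }
"""

# English and German editions; other language editions follow the same scheme.
for endpoint in ("https://dbpedia.org/sparql", "https://de.dbpedia.org/sparql"):
    client = SPARQLWrapper(endpoint)
    client.setQuery(QUERY)
    client.setReturnFormat(JSON)
    bindings = client.query().convert()["results"]["bindings"]
    print(endpoint, "films:", bindings[0]["n"]["value"])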

Cite as

Michael Matthias Voit and Heiko Paulheim. Bias in Knowledge Graphs - An Empirical Study with Movie Recommendation and Different Language Editions of DBpedia. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 14:1-14:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{voit_et_al:OASIcs.LDK.2021.14,
  author =	{Voit, Michael Matthias and Paulheim, Heiko},
  title =	{{Bias in Knowledge Graphs - An Empirical Study with Movie Recommendation and Different Language Editions of DBpedia}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{14:1--14:13},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.14},
  URN =		{urn:nbn:de:0030-drops-145506},
  doi =		{10.4230/OASIcs.LDK.2021.14},
  annote =	{Keywords: Knowledge Graph, DBpedia, Recommender Systems, Bias, Language Bias, RDF2vec}
}
Document
Enriching Word Embeddings with Food Knowledge for Ingredient Retrieval

Authors: Álvaro Mendes Samagaio, Henrique Lopes Cardoso, and David Ribeiro


Abstract
Smart assistants and recommender systems must deal with lots of information coming from different sources and in different formats. This is especially frequent in text data, which presents increased variability and complexity, and is rather common for conversational assistants or chatbots. The issue is very evident in the food and nutrition lexicon, where the semantics present increased variability, namely due to hypernyms and hyponyms. This work describes the creation of a set of word embeddings based on the incorporation of information from a food thesaurus - LanguaL - through retrofitting. The ingredients were classified according to three different facet label groups. Retrofitted embeddings seem to properly encode food-specific knowledge, as shown by an increase in accuracy compared to generic embeddings (+23%, +10% and +31% per group). Moreover, a weighting mechanism based on TF-IDF was applied to embedding creation before retrofitting, also bringing an increase in accuracy (+5%, +9% and +5% per group). Finally, the approach has been tested with human users in an ingredient retrieval exercise, showing a very positive evaluation (77.3% of the volunteer testers preferred this method over a string-based matching algorithm).
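
A minimal sketch of the retrofitting step, in the spirit of Faruqui et al. (2015), which pulls generic word vectors towards their neighbours in a thesaurus graph such as LanguaL; the toy vocabulary, vectors and neighbourhood lists are illustrative only:

import numpy as np

# Toy generic vectors and a LanguaL-like neighbourhood graph.
vectors = {
    "milk":   np.array([0.9, 0.1]),
    "yogurt": np.array([0.2, 0.8]),
    "dairy":  np.array([0.5, 0.5]),
}
neighbours = {"milk": ["dairy"], "yogurt": ["dairy"], "dairy": ["milk", "yogurt"]}

retrofitted = {w: v.copy() for w, v in vectors.items()}
for _ in range(10):  # a handful of iterations suffices for convergence
    for word, nbrs in neighbours.items():
        # New vector = average of the original vector and its thesaurus neighbours.
        nbr_sum = sum(retrofitted[n] for n in nbrs)
        retrofitted[word] = (nbr_sum + len(nbrs) * vectors[word]) / (2 * len(nbrs))

print(retrofitted["milk"])  # pulled towards "dairy" relative to the original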

Cite as

Álvaro Mendes Samagaio, Henrique Lopes Cardoso, and David Ribeiro. Enriching Word Embeddings with Food Knowledge for Ingredient Retrieval. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{samagaio_et_al:OASIcs.LDK.2021.15,
  author =	{Samagaio, \'{A}lvaro Mendes and Lopes Cardoso, Henrique and Ribeiro, David},
  title =	{{Enriching Word Embeddings with Food Knowledge for Ingredient Retrieval}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{15:1--15:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.15},
  URN =		{urn:nbn:de:0030-drops-145510},
  doi =		{10.4230/OASIcs.LDK.2021.15},
  annote =	{Keywords: Word embeddings, Retrofitting, LanguaL, Food Embeddings, Knowledge Graph}
}
Document
TatWordNet: A Linguistic Linked Open Data-Integrated WordNet Resource for Tatar

Authors: Alexander Kirillovich, Marat Shaekhov, Alfiya Galieva, Olga Nevzorova, Dmitry Ilvovsky, and Natalia Loukachevitch


Abstract
We present the first release of TatWordNet (http://wordnet.tatar), a wordnet resource for Tatar. TatWordNet has been constructed by a combination of the expand and merge approaches. The synsets of TatWordNet have been compiled by: (i) automatic conversion of the concepts of TatThes, a socio-political thesaurus of Tatar; (ii) semi-automatic translation of synsets of RuWordNet, a wordnet resource for Russian, followed by manual verification and correction; (iii) manual translation of base RuWordNet synsets; and (iv) manual translation of all hypernyms of the previously translated RuWordNet synsets. The current version of TatWordNet contains 18,583 synsets, 36,540 lexical entries and 49,525 senses. The resource has been published to the Linguistic Linked Open Data cloud and interlinked with the Global WordNet Grid.

Cite as

Alexander Kirillovich, Marat Shaekhov, Alfiya Galieva, Olga Nevzorova, Dmitry Ilvovsky, and Natalia Loukachevitch. TatWordNet: A Linguistic Linked Open Data-Integrated WordNet Resource for Tatar. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 16:1-16:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{kirillovich_et_al:OASIcs.LDK.2021.16,
  author =	{Kirillovich, Alexander and Shaekhov, Marat and Galieva, Alfiya and Nevzorova, Olga and Ilvovsky, Dmitry and Loukachevitch, Natalia},
  title =	{{TatWordNet: A Linguistic Linked Open Data-Integrated WordNet Resource for Tatar}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{16:1--16:12},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.16},
  URN =		{urn:nbn:de:0030-drops-145528},
  doi =		{10.4230/OASIcs.LDK.2021.16},
  annote =	{Keywords: Linguistic Linked Open Data, WordNet, Thesaurus, Tatar language}
}
Document
Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph

Authors: Ismail Harrando and Raphaël Troncy


Abstract
Pre-trained word embeddings constitute an essential building block for many NLP systems and applications, notably when labeled data is scarce. However, since they compress word meanings into a fixed-dimensional representation, their use usually lacks interpretability beyond a measure of similarity and linear analogies that do not always reflect real-world word relatedness, which can be important for many NLP applications. In this paper, we propose a model which extracts topics from text documents based on the common-sense knowledge available in ConceptNet [Speer et al., 2017] - a semantic concept graph that explicitly encodes real-world relations between words - and without any human supervision. When combining both ConceptNet’s knowledge graph and graph embeddings, our approach outperforms other baselines in the zero-shot setting, while generating a human-understandable explanation for its predictions through the knowledge graph. We study the importance of some modeling choices and criteria for designing the model, and we demonstrate that it can be used to label data for a supervised classifier to achieve even better performance without relying on any human-annotated training data. We publish the code of our approach at https://github.com/D2KLab/ZeSTE and we provide a user-friendly demo at https://zeste.tools.eurecom.fr/.
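
A minimal sketch of the zero-shot prediction idea: a document receives the topic label whose ConceptNet-derived neighbourhood overlaps most with its tokens, and the overlap itself serves as the explanation. The tiny neighbourhood table below stands in for the real ConceptNet graph:

# Stand-in for the label neighbourhoods derived from ConceptNet.
neighbourhoods = {
    "sport":    {"match", "team", "goal", "player", "league", "score"},
    "politics": {"election", "vote", "parliament", "minister", "party"},
}

def predict_topic(document: str):
    tokens = set(document.lower().split())
    overlaps = {label: tokens & words for label, words in neighbourhoods.items()}
    best = max(overlaps, key=lambda label: len(overlaps[label]))
    return best, overlaps[best]  # the overlapping words are the explanation

label, explanation = predict_topic("the team scored a late goal to win the match")
print(label, explanation)  # sport, e.g. {'team', 'goal', 'match'}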

Cite as

Ismail Harrando and Raphaël Troncy. Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 17:1-17:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{harrando_et_al:OASIcs.LDK.2021.17,
  author =	{Harrando, Ismail and Troncy, Rapha\"{e}l},
  title =	{{Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{17:1--17:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.17},
  URN =		{urn:nbn:de:0030-drops-145532},
  doi =		{10.4230/OASIcs.LDK.2021.17},
  annote =	{Keywords: Topic Extraction, Zero-Shot Classification, Explainable NLP, Knowledge Graph}
}
Document
Relevance Feedback Search Based on Automatic Annotation and Classification of Texts

Authors: Rafael Leal, Joonas Kesäniemi, Mikko Koho, and Eero Hyvönen


Abstract
The idea behind Relevance Feedback Search (RFBS) is to build search queries as an iterative and interactive process in which they are gradually refined based on the results of the previous search round. This can be helpful in situations where the end user cannot easily formulate their information needs at the outset as a well-focused query, or more generally as a way to filter and focus search results. This paper concerns (1) a framework that integrates keyword extraction and unsupervised classification into the RFBS paradigm and (2) the application of this framework to the legal domain as a use case. We focus on the Natural Language Processing (NLP) methods underlying the framework and application, where an automatic annotation tool is used for extracting document keywords as ontology concepts, which are then transformed into word embeddings to form vectorial representations of the texts. An unsupervised classification system that employs similar techniques is also used in order to classify the documents into broad thematic classes. This classification functionality is evaluated using two different datasets. As the use case, we describe an application perspective in the semantic portal LawSampo - Finnish Legislation and Case Law on the Semantic Web. This online demonstrator uses a dataset of 82,145 sections in 3,725 statutes of Finnish legislation and another dataset that comprises 13,470 court decisions.
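
A minimal sketch of one relevance-feedback round in embedding space, in the classic Rocchio style; this illustrates the iterative refinement loop in general, not the exact LawSampo implementation:

import numpy as np

def refine(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query towards accepted documents and away from rejected ones.
    update = alpha * query
    if len(relevant):
        update += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        update -= gamma * np.mean(non_relevant, axis=0)
    return update

query = np.array([0.1, 0.9])                   # embedding of the initial query
relevant = np.array([[0.8, 0.2], [0.9, 0.3]])  # documents marked relevant
non_relevant = np.array([[0.0, 1.0]])          # documents marked non-relevant
print(refine(query, relevant, non_relevant))   # query vector for the next round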

Cite as

Rafael Leal, Joonas Kesäniemi, Mikko Koho, and Eero Hyvönen. Relevance Feedback Search Based on Automatic Annotation and Classification of Texts. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 18:1-18:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{leal_et_al:OASIcs.LDK.2021.18,
  author =	{Leal, Rafael and Kes\"{a}niemi, Joonas and Koho, Mikko and Hyv\"{o}nen, Eero},
  title =	{{Relevance Feedback Search Based on Automatic Annotation and Classification of Texts}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{18:1--18:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.18},
  URN =		{urn:nbn:de:0030-drops-145543},
  doi =		{10.4230/OASIcs.LDK.2021.18},
  annote =	{Keywords: relevance feedback, keyword extraction, zero-shot text classification, word embeddings, LawSampo}
}
Document
Automatic Construction of Knowledge Graphs from Text and Structured Data: A Preliminary Literature Review

Authors: Maraim Masoud, Bianca Pereira, John McCrae, and Paul Buitelaar


Abstract
Knowledge graphs have been shown to be an important data structure for many applications, including chatbot development, data integration, and semantic search. In the enterprise domain, such graphs need to be constructed based on both structured (e.g. databases) and unstructured (e.g. textual) internal data sources; preferentially using automatic approaches due to the costs associated with manual construction of knowledge graphs. However, despite the growing body of research that leverages both structured and textual data sources in the context of automatic knowledge graph construction, the research community has centered on either one type of source or the other. In this paper, we conduct a preliminary literature review to investigate approaches that can be used for the integration of textual and structured data sources in the process of automatic knowledge graph construction. We highlight the solutions currently available for use within enterprises and point out areas that would benefit from further research.

Cite as

Maraim Masoud, Bianca Pereira, John McCrae, and Paul Buitelaar. Automatic Construction of Knowledge Graphs from Text and Structured Data: A Preliminary Literature Review. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 19:1-19:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{masoud_et_al:OASIcs.LDK.2021.19,
  author =	{Masoud, Maraim and Pereira, Bianca and McCrae, John and Buitelaar, Paul},
  title =	{{Automatic Construction of Knowledge Graphs from Text and Structured Data: A Preliminary Literature Review}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{19:1--19:9},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.19},
  URN =		{urn:nbn:de:0030-drops-145556},
  doi =		{10.4230/OASIcs.LDK.2021.19},
  annote =	{Keywords: Knowledge Graph Construction, Enterprise Knowledge Graph}
}
Document
An Ontology for CoNLL-RDF: Formal Data Structures for TSV Formats in Language Technology

Authors: Christian Chiarcos, Maxim Ionov, Luis Glaser, and Christian Fäth


Abstract
In language technology and the language sciences, tab-separated values (TSV) are a frequently used formalism for representing linguistically annotated natural language, often referred to as "CoNLL formats". A large number of such formats exist, and although they share a number of common features, they are not interoperable, as different pieces of information are encoded differently in these dialects. CoNLL-RDF refers to a programming library and the associated data model that were introduced to facilitate processing and transforming such TSV formats in a serialization-independent way. CoNLL-RDF represents CoNLL data by means of RDF graphs and SPARQL update operations, but so far without machine-readable semantics, as annotation properties are created dynamically on the basis of a user-defined mapping from columns to labels. Current applications of CoNLL-RDF include linking between corpora and dictionaries [Mambrini and Passarotti, 2019] and knowledge graphs [Tamper et al., 2018], syntactic parsing of historical languages [Chiarcos et al., 2018; Chiarcos et al., 2018], the consolidation of syntactic and semantic annotations [Chiarcos and Fäth, 2019], a bridge between RDF corpora and a traditional corpus query language [Ionov et al., 2020], and language contact studies [Chiarcos et al., 2018]. We describe a novel extension of CoNLL-RDF, introducing a formal data model formalized as an ontology. The ontology is a basis for linking RDF corpora with other Semantic Web resources, but more importantly, its application to transformation between different TSV formats is a major step toward providing interoperability between CoNLL formats.
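
A minimal sketch of the TSV-to-RDF mapping that CoNLL-RDF performs, using rdflib: each row becomes a word resource whose properties are created from a user-defined list of column labels. The namespace below is a placeholder, not the new ontology introduced in the paper:

from rdflib import Graph, Literal, Namespace, URIRef

CONLL = Namespace("https://example.org/conll#")  # placeholder namespace
columns = ["WORD", "LEMMA", "POS"]               # user-defined column labels
row = "chat\tchat\tNOUN".split("\t")             # one TSV line of a CoNLL corpus

g = Graph()
word = URIRef("https://example.org/corpus#s1_w1")
for label, value in zip(columns, row):
    g.add((word, CONLL[label], Literal(value)))  # column label -> annotation property

print(g.serialize(format="turtle"))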

Cite as

Christian Chiarcos, Maxim Ionov, Luis Glaser, and Christian Fäth. An Ontology for CoNLL-RDF: Formal Data Structures for TSV Formats in Language Technology. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 20:1-20:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{chiarcos_et_al:OASIcs.LDK.2021.20,
  author =	{Chiarcos, Christian and Ionov, Maxim and Glaser, Luis and F\"{a}th, Christian},
  title =	{{An Ontology for CoNLL-RDF: Formal Data Structures for TSV Formats in Language Technology}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{20:1--20:14},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.20},
  URN =		{urn:nbn:de:0030-drops-145566},
  doi =		{10.4230/OASIcs.LDK.2021.20},
  annote =	{Keywords: language technology, data models, CoNLL-RDF, ontology}
}
Document
On the Utility of Word Embeddings for Enriching OpenWordNet-PT

Authors: Hugo Gonçalo Oliveira, Fredson Silva de Souza Aguiar, and Alexandre Rademaker


Abstract
The maintenance of wordnets and lexical knowledge bases typically relies on time-consuming manual effort. In order to minimise this issue, we propose the exploitation of models of distributional semantics, namely word embeddings learned from corpora, for the automatic identification of relation instances missing in a wordnet. Analogy-solving methods are first used for learning a set of relations from analogy tests focused on each relation. Despite their low accuracy, we noted that a portion of the top-ranked answers are good suggestions of relation instances that could be included in the wordnet. This procedure is applied to the enrichment of OpenWordNet-PT, a public Portuguese wordnet. Relations are learned from data acquired from this resource, and illustrative examples are provided. Results are promising for accelerating the identification of missing relation instances: we estimate that about 17% of the potential suggestions are good, a proportion that almost doubles if some are automatically invalidated.
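
A minimal sketch of the analogy-based suggestion step with gensim's 3CosAdd analogy solver; the vector file is a placeholder, and the top answers are only candidate relation instances to be manually validated:

from gensim.models import KeyedVectors

# Placeholder path; the paper works with Portuguese embeddings.
vectors = KeyedVectors.load_word2vec_format("pt_embeddings.bin", binary=True)

# Known hypernymy instance (cão, animal); which word completes (gato, ?)?
for word, score in vectors.most_similar(positive=["animal", "gato"],
                                        negative=["cão"], topn=5):
    print(word, score)  # candidate relation instances for manual validation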

Cite as

Hugo Gonçalo Oliveira, Fredson Silva de Souza Aguiar, and Alexandre Rademaker. On the Utility of Word Embeddings for Enriching OpenWordNet-PT. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 21:1-21:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{goncalooliveira_et_al:OASIcs.LDK.2021.21,
  author =	{Gon\c{c}alo Oliveira, Hugo and Aguiar, Fredson Silva de Souza and Rademaker, Alexandre},
  title =	{{On the Utility of Word Embeddings for Enriching OpenWordNet-PT}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{21:1--21:13},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.21},
  URN =		{urn:nbn:de:0030-drops-145578},
  doi =		{10.4230/OASIcs.LDK.2021.21},
  annote =	{Keywords: word embeddings, lexical resources, wordnet, analogy tests}
}
Document
Towards Learning Terminological Concept Systems from Multilingual Natural Language Text

Authors: Lennart Wachowiak, Christian Lang, Barbara Heinisch, and Dagmar Gromann


Abstract
Terminological Concept Systems (TCS) provide a means of organizing, structuring and representing domain-specific multilingual information and are important for ensuring terminological consistency in many tasks, such as translation and cross-border communication. While several approaches to (semi-)automatic term extraction exist, learning the interrelations of terms is vastly underexplored. We propose an automated method to extract terms and relations across natural languages and specialized domains. To this end, we adapt pretrained multilingual neural language models, which we evaluate on standard term extraction datasets, with best-performing results, and on a combination of standard relation extraction datasets, with competitive results. Code and dataset are publicly available.
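
For orientation only, a sketch of how a fine-tuned multilingual model is typically applied to term extraction as token classification with the transformers library; "my-term-tagger" is a placeholder for a checkpoint fine-tuned on B/I/O term labels, and no such public model is implied by the paper.

from transformers import pipeline

# Placeholder checkpoint name; assumes a model fine-tuned for term tagging.
tagger = pipeline("token-classification", model="my-term-tagger",
                  aggregation_strategy="simple")
for span in tagger("Die Dialyse ist ein Verfahren zur Blutreinigung."):
    print(span["word"], span["entity_group"], round(span["score"], 2))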

Cite as

Lennart Wachowiak, Christian Lang, Barbara Heinisch, and Dagmar Gromann. Towards Learning Terminological Concept Systems from Multilingual Natural Language Text. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 22:1-22:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{wachowiak_et_al:OASIcs.LDK.2021.22,
  author =	{Wachowiak, Lennart and Lang, Christian and Heinisch, Barbara and Gromann, Dagmar},
  title =	{{Towards Learning Terminological Concept Systems from Multilingual Natural Language Text}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{22:1--22:18},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.22},
  URN =		{urn:nbn:de:0030-drops-145586},
  doi =		{10.4230/OASIcs.LDK.2021.22},
  annote =	{Keywords: Terminologies, Neural Language Models, Multilingual Information Extraction}
}
Document
Encoder-Attention-Based Automatic Term Recognition (EA-ATR)

Authors: Sampritha H. Manjunath and John P. McCrae


Abstract
Automatic Term Recognition (ATR) is the task of finding terminology in raw text. It involves mining candidate terms from the text, filtering the identified candidates based on scores computed with methodologies such as frequency of occurrence, and then ranking the terms. Current approaches often rely on statistics and regular expressions over part-of-speech tags to identify terms, but this is error-prone. We propose a deep learning technique to improve the identification of possible term sequences. We improve term recognition by using Bidirectional Encoder Representations from Transformers (BERT) based embeddings to identify which sequences of words are terms. The model is trained on Wikipedia titles: we take all Wikipedia titles as the positive set and random n-grams generated from the raw text as a weak negative set. The positive and negative sets are used to train the model in the Embed, Encode, Attend and Predict (EEAP) formulation, with BERT as embeddings. The model is then evaluated against domain-specific corpora such as GENIA (annotated biological terms) and Krapivin (scientific papers from the computer science domain).
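
A minimal sketch of the weak negative sampling step described above, under the assumption that the positive set is a collection of titles and negatives are random n-grams drawn from raw text (toy data; not the authors' code):

import random

titles = {"machine learning", "river delta"}   # positive set: a toy sample of Wikipedia titles
text = "the delta of the river is rich in machine learning papers".split()

def random_ngrams(tokens, k, n_max=3):
    """Sample k random n-grams (n <= n_max) as a weak negative set."""
    out = set()
    while len(out) < k:
        n = random.randint(1, n_max)
        i = random.randrange(len(tokens) - n + 1)
        cand = " ".join(tokens[i:i + n])
        if cand not in titles:                 # avoid accidental positives
            out.add(cand)
    return out

print(sorted(random_ngrams(text, k=5)))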

Cite as

Sampritha H. Manjunath and John P. McCrae. Encoder-Attention-Based Automatic Term Recognition (EA-ATR). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 23:1-23:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{manjunath_et_al:OASIcs.LDK.2021.23,
  author =	{Manjunath, Sampritha H. and McCrae, John P.},
  title =	{{Encoder-Attention-Based Automatic Term Recognition (EA-ATR)}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{23:1--23:13},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.23},
  URN =		{urn:nbn:de:0030-drops-145597},
  doi =		{10.4230/OASIcs.LDK.2021.23},
  annote =	{Keywords: Automatic Term Recognition, Term Extraction, BERT, EEAP, Deep Learning for ATR}
}
Document
Universal Dependencies for Multilingual Open Information Extraction

Authors: Massinissa Atmani and Mathieu Lafourcade


Abstract
In this paper, we present our approach to multilingual Open Information Extraction. Our sequence-labeling-based approach builds solely on the Universal Dependencies representation to capture OpenIE's regularities and to perform cross-lingual, multilingual OpenIE. We propose a new two-stage pipeline model for sequence labeling that first identifies all the arguments of the relation and only then classifies them according to their most likely label. This paper also introduces a new benchmark evaluation for French. Experimental evaluation shows that our approach achieves the best results on the available benchmarks (English, French, Spanish, and Portuguese).
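
To make the intuition concrete, here is a naive dependency-based triple reader in Python with spaCy; the paper instead learns a two-stage sequence labeler over UD annotations, so this is only an illustration of how dependency trees expose OpenIE regularities (assumes en_core_web_sm is installed).

import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def naive_triples(text):
    """Read (subject, relation, object) off the dependency tree.
    Illustrative only; not the paper's learned two-stage labeler."""
    doc = nlp(text)
    for tok in doc:
        if tok.pos_ == "VERB":
            subj = [c for c in tok.children if c.dep_ == "nsubj"]
            obj = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
            if subj and obj:
                yield (subj[0].text, tok.lemma_, obj[0].text)

print(list(naive_triples("The committee approved the proposal.")))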

Cite as

Massinissa Atmani and Mathieu Lafourcade. Universal Dependencies for Multilingual Open Information Extraction. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 24:1-24:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{atmani_et_al:OASIcs.LDK.2021.24,
  author =	{Atmani, Massinissa and Lafourcade, Mathieu},
  title =	{{Universal Dependencies for Multilingual Open Information Extraction}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{24:1--24:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.24},
  URN =		{urn:nbn:de:0030-drops-145600},
  doi =		{10.4230/OASIcs.LDK.2021.24},
  annote =	{Keywords: Natural Language Processing, Information Extraction, Machine Learning}
}
Document
Inconsistency Detection in Job Postings

Authors: Joana Urbano, Miguel Couto, Gil Rocha, and Henrique Lopes Cardoso


Abstract
The use of AI in recruitment is growing, and there is AI software that reads job descriptions in order to select the best candidates for these jobs. However, it is not uncommon for these descriptions to contain inconsistencies such as contradictions and ambiguities, which confuse job candidates and fool the AI algorithms. In this paper, we present a model based on natural language processing (NLP), machine learning (ML), and rules to detect such inconsistencies in the description of language requirements and to alert the recruiter to them before the job posting is published. We show that a hybrid model based on ML techniques and a set of domain-specific rules to extract language details from sentences achieves high performance in the detection of inconsistencies.
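
A minimal sketch of the rule component for language requirements, with toy patterns and a toy proficiency scale of our own (the paper's actual rules and ML extraction are more elaborate): extract (language, level) pairs and flag languages mentioned with conflicting levels.

import re

LEVELS = {"basic": 1, "intermediate": 2, "fluent": 3, "native": 4}
PATTERN = re.compile(r"(basic|intermediate|fluent|native)\s+(english|german|french)")

def language_requirements(text):
    """Extract {language: set of mentioned proficiency levels} with toy patterns."""
    reqs = {}
    for level, lang in PATTERN.findall(text.lower()):
        reqs.setdefault(lang, set()).add(LEVELS[level])
    return reqs

def inconsistencies(text):
    """Flag languages required at more than one proficiency level."""
    return {lang: lvls for lang, lvls in language_requirements(text).items()
            if len(lvls) > 1}

posting = "Fluent English required. Basic English is sufficient for this role."
print(inconsistencies(posting))   # {'english': {1, 3}}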

Cite as

Joana Urbano, Miguel Couto, Gil Rocha, and Henrique Lopes Cardoso. Inconsistency Detection in Job Postings. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 25:1-25:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{urbano_et_al:OASIcs.LDK.2021.25,
  author =	{Urbano, Joana and Couto, Miguel and Rocha, Gil and Lopes Cardoso, Henrique},
  title =	{{Inconsistency Detection in Job Postings}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{25:1--25:16},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.25},
  URN =		{urn:nbn:de:0030-drops-145612},
  doi =		{10.4230/OASIcs.LDK.2021.25},
  annote =	{Keywords: NLP, Ambiguities, Contradictions, Recruitment software}
}
Document
A Workbench for Corpus Linguistic Discourse Analysis

Authors: Julia Krasselt, Matthias Fluor, Klaus Rothenhäusler, and Philipp Dreesen


Abstract
In this paper, we introduce the Swiss-AL workbench, an online tool for corpus linguistic discourse analysis. The workbench enables the analysis of Swiss-AL, a multilingual Swiss web corpus with sources from media, politics, industry, science, and civil society. The workbench differs from other corpus analysis tools in three respects: (1) easy access and a tidy interface, (2) a focus on visualizations, and (3) a wide range of analysis options, from classic corpus linguistic analyses (e.g., collocation analysis) to more recent NLP approaches (topic modeling and word embeddings). It is designed for researchers of various disciplines, practitioners, and students.
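
For readers unfamiliar with the classic end of that spectrum, a minimal pointwise mutual information (PMI) collocation measure over a toy corpus looks like this in Python (illustrative only; the workbench's own implementation is not described here):

import math
from collections import Counter

docs = [d.split() for d in [
    "strong tea with milk", "strong tea and cake",
    "powerful computer on sale", "powerful computer and strong coffee",
]]
unigrams = Counter(t for d in docs for t in d)
bigrams = Counter(p for d in docs for p in zip(d, d[1:]))
n_tokens = sum(len(d) for d in docs)
n_pairs = sum(len(d) - 1 for d in docs)

def pmi(w1, w2):
    """PMI of an adjacent word pair: log2 P(w1 w2) / (P(w1) P(w2))."""
    p_xy = bigrams[(w1, w2)] / n_pairs
    return math.log2(p_xy / (unigrams[w1] / n_tokens * unigrams[w2] / n_tokens))

print(round(pmi("strong", "tea"), 2))   # high PMI marks a collocation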

Cite as

Julia Krasselt, Matthias Fluor, Klaus Rothenhäusler, and Philipp Dreesen. A Workbench for Corpus Linguistic Discourse Analysis. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 26:1-26:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{krasselt_et_al:OASIcs.LDK.2021.26,
  author =	{Krasselt, Julia and Fluor, Matthias and Rothenh\"{a}usler, Klaus and Dreesen, Philipp},
  title =	{{A Workbench for Corpus Linguistic Discourse Analysis}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{26:1--26:9},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.26},
  URN =		{urn:nbn:de:0030-drops-145623},
  doi =		{10.4230/OASIcs.LDK.2021.26},
  annote =	{Keywords: corpus analysis software, discourse analysis, data visualization}
}
Document
APiCS-Ligt: Towards Semantic Enrichment of Interlinear Glossed Text

Authors: Maxim Ionov


Abstract
This paper presents APiCS-Ligt, an LLOD version of a collection of interlinear glossed linguistic examples from APiCS, the Atlas of Pidgin and Creole Language Structures. Interlinear glossed text (IGT) plays an important role in typological and theoretical linguistic research, especially with understudied and endangered languages: it provides a way to understand linguistic phenomena without necessarily knowing the source language, which is crucial for these languages since native speakers are not always easily accessible. Previously, we presented Ligt, an RDF vocabulary for representing interlinear glosses in text segments. In this paper, we present our conversion of the APiCS IGT dataset into this model and describe our efforts in linking linguistic annotations to an external ontology to add semantic representation.
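
To show the general shape of tiered IGT in RDF, here is a Ligt-style sketch with rdflib; the namespace, class and property names below are illustrative stand-ins, not the actual Ligt vocabulary, and the example sentence is invented.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

LIGT = Namespace("http://example.org/ligt#")   # illustrative, not the real vocabulary
EX = Namespace("http://example.org/apics#")

g = Graph()
g.bind("ligt", LIGT)

utt = EX.utterance1
g.add((utt, RDF.type, LIGT.Utterance))
g.add((utt, LIGT.translation, Literal("I am going home", lang="en")))
for i, (form, gloss) in enumerate([("mi", "1SG"), ("go", "go.PROG"), ("hos", "home")], 1):
    w = EX[f"utterance1_w{i}"]
    g.add((utt, LIGT.hasWord, w))      # word tier under the utterance
    g.add((w, RDF.type, LIGT.Word))
    g.add((w, LIGT.form, Literal(form)))
    g.add((w, LIGT.gloss, Literal(gloss)))

print(g.serialize(format="turtle"))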

Cite as

Maxim Ionov. APiCS-Ligt: Towards Semantic Enrichment of Interlinear Glossed Text. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 27:1-27:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{ionov:OASIcs.LDK.2021.27,
  author =	{Ionov, Maxim},
  title =	{{APiCS-Ligt: Towards Semantic Enrichment of Interlinear Glossed Text}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{27:1--27:8},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.27},
  URN =		{urn:nbn:de:0030-drops-145633},
  doi =		{10.4230/OASIcs.LDK.2021.27},
  annote =	{Keywords: Linguistic Linked Open Data (LLOD), less-resourced languages in the (multilingual) Semantic Web, interlinear glossed text (IGT), data modeling}
}
Document
Introducing the NLU Showroom: A NLU Demonstrator for the German Language

Authors: Dennis Wegener, Sven Giesselbach, Niclas Doll, and Heike Horstmann


Abstract
We present the NLU Showroom, a platform for interactively demonstrating the functionality of natural language understanding models with easy-to-use visual interfaces. The NLU Showroom focuses primarily on the German language, as few German NLU resources exist; however, it also serves corresponding English models to reach a broader audience. With the NLU Showroom we demonstrate and compare the capabilities and limitations of a variety of NLP/NLU models. The four initial demonstrators include (a) a comparison of how different word representations capture semantic similarity, (b) a comparison of how different sentence representations interpret sentence similarity, (c) a showcase on analyzing reviews with NLU, and (d) a showcase on finding links between entities. The NLU Showroom is built on state-of-the-art architectures for model serving and data processing. It targets a broad audience, from newcomers to researchers, with a focus on placing the presented models in the context of industrial applications.

Cite as

Dennis Wegener, Sven Giesselbach, Niclas Doll, and Heike Horstmann. Introducing the NLU Showroom: A NLU Demonstrator for the German Language. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 28:1-28:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{wegener_et_al:OASIcs.LDK.2021.28,
  author =	{Wegener, Dennis and Giesselbach, Sven and Doll, Niclas and Horstmann, Heike},
  title =	{{Introducing the NLU Showroom: A NLU Demonstrator for the German Language}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{28:1--28:9},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.28},
  URN =		{urn:nbn:de:0030-drops-145642},
  doi =		{10.4230/OASIcs.LDK.2021.28},
  annote =	{Keywords: Natural Language Understanding, Natural Language Processing, NLU, NLP, Showroom, Demonstrator, Demos, Text Similarity, Opinion Mining, Relation Extraction}
}
Document
AAA4LLL - Acquisition, Annotation, Augmentation for Lively Language Learning

Authors: Bartholomäus Wloka and Werner Winiwarter


Abstract
In this paper we describe a method for enhancing the process of studying Japanese through a user-centered approach. The approach comprises three parts: an innovative way of acquiring learning material from topic seeds, multifaceted sentence analysis to present sentence annotations, and browser-integrated augmentation for perusing Wikipedia pages of special interest to the learner. This may yield new topic seeds that produce additional learning content, thus repeating the cycle.

Cite as

Bartholomäus Wloka and Werner Winiwarter. AAA4LLL - Acquisition, Annotation, Augmentation for Lively Language Learning. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 29:1-29:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{wloka_et_al:OASIcs.LDK.2021.29,
  author =	{Wloka, Bartholom\"{a}us and Winiwarter, Werner},
  title =	{{AAA4LLL - Acquisition, Annotation, Augmentation for Lively Language Learning}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{29:1--29:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.29},
  URN =		{urn:nbn:de:0030-drops-145654},
  doi =		{10.4230/OASIcs.LDK.2021.29},
  annote =	{Keywords: Web-based language learning, augmented browsing, natural language annotation, corpus alignment, Japanese computing, semantic representation}
}
Document
Improving Intent Detection Accuracy Through Token Level Labeling

Authors: Michał Lew, Aleksander Obuchowski, and Monika Kutyła


Abstract
Intent detection is traditionally modeled as a sequence classification task in which the role of the model is to map users' utterances to their class. In this paper, however, we show that classification accuracy can be improved with the use of token-level intent annotations, and we introduce new annotation guidelines for labeling sentences in the intent detection task. Moreover, we introduce a method for training the network to predict joint sentence-level and token-level annotations. We also test the effects of different annotation schemes (BIO, binary, sentence intent) on the model's accuracy.
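
A minimal PyTorch sketch of the joint prediction idea, assuming a shared encoder with two heads whose losses are summed; the architecture and sizes are illustrative, not the paper's exact model.

import torch
import torch.nn as nn

class JointIntentTagger(nn.Module):
    """Shared BiLSTM encoder with a sentence-level intent head and a
    token-level (e.g. BIO) tagging head. Illustrative sketch only."""
    def __init__(self, vocab, n_intents, n_tags, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * dim, n_intents)
        self.tag_head = nn.Linear(2 * dim, n_tags)

    def forward(self, x):
        h, _ = self.enc(self.emb(x))                    # (B, T, 2*dim)
        return self.intent_head(h.mean(dim=1)), self.tag_head(h)

model = JointIntentTagger(vocab=1000, n_intents=5, n_tags=3)
x = torch.randint(0, 1000, (2, 7))                      # toy batch: 2 utterances, 7 tokens
intent_logits, tag_logits = model(x)
ce = nn.CrossEntropyLoss()
loss = ce(intent_logits, torch.tensor([1, 4])) \
     + ce(tag_logits.reshape(-1, 3), torch.randint(0, 3, (14,)))
loss.backward()                                         # joint training signal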

Cite as

Michał Lew, Aleksander Obuchowski, and Monika Kutyła. Improving Intent Detection Accuracy Through Token Level Labeling. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 30:1-30:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{lew_et_al:OASIcs.LDK.2021.30,
  author =	{Lew, Micha{\l} and Obuchowski, Aleksander and Kuty{\l}a, Monika},
  title =	{{Improving Intent Detection Accuracy Through Token Level Labeling}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{30:1--30:11},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.30},
  URN =		{urn:nbn:de:0030-drops-145662},
  doi =		{10.4230/OASIcs.LDK.2021.30},
  annote =	{Keywords: Intent Detection, Annotation, NLP, Chatbots}
}
Document
Towards Scope Detection in Textual Requirements

Authors: Ole Magnus Holter and Basil Ell


Abstract
Requirements are an integral part of industry operations and projects. Not only do requirements dictate industrial operations, they are also used in legally binding contracts between supplier and purchaser. Some companies even have requirements as their core business. Most requirements are found in textual documents, which brings challenges such as ambiguity, scalability, maintenance, and finding relevant and related requirements. Having the requirements in a machine-readable format would address these challenges; however, existing requirements first need to be transformed into machine-readable requirements using NLP technology. Applying state-of-the-art NLP methods based on end-to-end neural modelling to such documents is not trivial, because the language is technical and domain-specific and training data is not available. In this paper, we focus on one step in that direction, namely scope detection in textual requirements using weak supervision and a simple classifier based on BERT general-domain word embeddings, and show that with openly available data it is possible to obtain promising results on domain-specific requirements documents.
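
A minimal sketch of a "simple classifier over general-domain BERT embeddings": mean-pooled, frozen bert-base-uncased features feeding a logistic regression. The two weakly labelled sentences are toy stand-ins, and this is not the authors' pipeline.

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    """Mean-pooled last-layer BERT embeddings, no fine-tuning."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch).last_hidden_state           # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((out * mask).sum(1) / mask.sum(1)).numpy()

# Toy weak labels: 1 = requirement-like, 0 = not a requirement.
texts = ["The valve shall withstand 10 bar.", "See Appendix B for history."]
clf = LogisticRegression().fit(embed(texts), [1, 0])
print(clf.predict(embed(["The pump shall be stainless steel."])))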

Cite as

Ole Magnus Holter and Basil Ell. Towards Scope Detection in Textual Requirements. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 31:1-31:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{holter_et_al:OASIcs.LDK.2021.31,
  author =	{Holter, Ole Magnus and Ell, Basil},
  title =	{{Towards Scope Detection in Textual Requirements}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{31:1--31:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.31},
  URN =		{urn:nbn:de:0030-drops-145674},
  doi =		{10.4230/OASIcs.LDK.2021.31},
  annote =	{Keywords: Scope Detection, Textual requirements, NLP}
}
Document
Discrepancies Between Database- and Pragmatically Driven NLG: Insights from QUD-Based Annotations

Authors: Christoph Hesse, Maurice Langner, Anton Benz, and Ralf Klabunde


Abstract
We present annotation findings from using an annotated corpus of driving reports, i.e. informational texts with an elaborated pragmatics, for the automatic generation of corresponding texts. The generation process requires access to a database providing the technical details of the vehicles, as well as an annotated corpus for sophisticated, pragmatically motivated text planning. We focus on the annotation results, since they are the basic framework for linking text planning with database queries and microplanning. We show that the annotations point to a variety of linguistic phenomena that have received little or no attention in the literature so far, and that they raise corresponding questions regarding access to information from databases during generation.

Cite as

Christoph Hesse, Maurice Langner, Anton Benz, and Ralf Klabunde. Discrepancies Between Database- and Pragmatically Driven NLG: Insights from QUD-Based Annotations. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 32:1-32:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{hesse_et_al:OASIcs.LDK.2021.32,
  author =	{Hesse, Christoph and Langner, Maurice and Benz, Anton and Klabunde, Ralf},
  title =	{{Discrepancies Between Database- and Pragmatically Driven NLG: Insights from QUD-Based Annotations}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{32:1--32:9},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.32},
  URN =		{urn:nbn:de:0030-drops-145681},
  doi =		{10.4230/OASIcs.LDK.2021.32},
  annote =	{Keywords: NLG, question-under-discussion analysis, information structure, database retrieval}
}
Document
Bridging the Gap Between Ontology and Lexicon via Class-Specific Association Rules Mined from a Loosely-Parallel Text-Data Corpus

Authors: Basil Ell, Mohammad Fazleh Elahi, and Philipp Cimiano


Abstract
There is a well-known lexical gap between content expressed in the form of natural language (NL) texts and content stored in an RDF knowledge base (KB). For tasks such as Information Extraction (IE), this gap needs to be bridged from NL to KB, so that facts extracted from text can be represented in RDF and added to an RDF KB. For tasks such as Natural Language Generation, the gap needs to be bridged from KB to NL, so that facts stored in an RDF KB can be verbalized and read by humans. In this paper we propose LexExMachina, a new methodology that induces correspondences between lexical elements and KB elements by mining class-specific association rules. As an example, consider the rule that predicts that if the text about a person contains the token "Greek", then this person has the relation nationality to the entity Greece. Another rule predicts that if the text about a settlement contains the token "Greek", then this settlement has the relation country to the entity Greece. Such rules can help in question answering, as they map an adjective to the relevant KB terms, and they can help in information extraction from text. We propose and empirically investigate a set of 20 types of class-specific association rules, together with different interestingness measures to rank them. We apply our method to a loosely-parallel text-data corpus consisting of data from DBpedia and texts from Wikipedia, and we evaluate the rules and provide empirical evidence for their utility in Question Answering.
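
A minimal sketch of mining one such rule type for one class, ranked here by confidence as a single, simple interestingness measure (the paper investigates 20 rule types and several measures; the data below is a toy stand-in for the DBpedia/Wikipedia corpus):

from collections import Counter

# Toy loosely-parallel corpus: per entity of class Person, its text tokens
# and its KB facts as (relation, object) pairs.
persons = [
    ({"greek", "philosopher"}, {("nationality", "Greece")}),
    ({"greek", "poet"},        {("nationality", "Greece")}),
    ({"french", "poet"},       {("nationality", "France")}),
]

token_count, joint = Counter(), Counter()
for tokens, facts in persons:
    for t in tokens:
        token_count[t] += 1
        for f in facts:
            joint[(t, f)] += 1

# Rule "token => fact", ranked by confidence = P(fact | token).
rules = sorted(((t, f, joint[(t, f)] / token_count[t]) for (t, f) in joint),
               key=lambda r: -r[2])
for t, (rel, obj), conf in rules:
    print(f'token "{t}" => {rel}({obj})   conf={conf:.2f}')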

Cite as

Basil Ell, Mohammad Fazleh Elahi, and Philipp Cimiano. Bridging the Gap Between Ontology and Lexicon via Class-Specific Association Rules Mined from a Loosely-Parallel Text-Data Corpus. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 33:1-33:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{ell_et_al:OASIcs.LDK.2021.33,
  author =	{Ell, Basil and Elahi, Mohammad Fazleh and Cimiano, Philipp},
  title =	{{Bridging the Gap Between Ontology and Lexicon via Class-Specific Association Rules Mined from a Loosely-Parallel Text-Data Corpus}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{33:1--33:21},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.33},
  URN =		{urn:nbn:de:0030-drops-145691},
  doi =		{10.4230/OASIcs.LDK.2021.33},
  annote =	{Keywords: Ontology, Lexicon, Association Rules, Pattern Mining}
}
Document
HISTORIAE, History of Socio-Cultural Transformation as Linguistic Data Science. A Humanities Use Case

Authors: Florentina Armaselu, Elena-Simona Apostol, Anas Fahad Khan, Chaya Liebeskind, Barbara McGillivray, Ciprian-Octavian Truică, and Giedrė Valūnaitė Oleškevičienė


Abstract
The paper proposes an interdisciplinary approach combining methods from disciplines such as the history of concepts, linguistics, natural language processing (NLP) and the Semantic Web to create a comparative framework for detecting semantic change in multilingual historical corpora and generating diachronic ontologies as linguistic linked open data (LLOD). Initiated as a use case (UC4.2.1) within the COST Action Nexus Linguarum, the European network for Web-centred linguistic data science, the study will explore emerging trends in knowledge extraction, analysis and representation in linguistic data science, and apply the devised methodology to datasets in the humanities to trace the evolution of concepts from the domain of socio-cultural transformation. The paper describes the main elements of the methodological framework and the preliminary planning of the intended workflow.

Cite as

Florentina Armaselu, Elena-Simona Apostol, Anas Fahad Khan, Chaya Liebeskind, Barbara McGillivray, Ciprian-Octavian Truică, and Giedrė Valūnaitė Oleškevičienė. HISTORIAE, History of Socio-Cultural Transformation as Linguistic Data Science. A Humanities Use Case. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 34:1-34:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{armaselu_et_al:OASIcs.LDK.2021.34,
  author =	{Armaselu, Florentina and Apostol, Elena-Simona and Khan, Anas Fahad and Liebeskind, Chaya and McGillivray, Barbara and Truic\u{a}, Ciprian-Octavian and Val\={u}nait\.{e} Ole\v{s}kevi\v{c}ien\.{e}, Giedr\.{e}},
  title =	{{HISTORIAE, History of Socio-Cultural Transformation as Linguistic Data Science. A Humanities Use Case}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{34:1--34:13},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.34},
  URN =		{urn:nbn:de:0030-drops-145704},
  doi =		{10.4230/OASIcs.LDK.2021.34},
  annote =	{Keywords: linguistic linked open data, natural language processing, semantic change, diachronic ontologies, digital humanities}
}
Document
An Automatic Partitioning of Gutenberg.org Texts

Authors: Davide Picca and Cyrille Gay-Crosier


Abstract
Over the last 10 years, the automatic partitioning of texts has attracted growing interest in the community. The automatic identification of parts of texts can provide faster and easier access to textual analysis. We introduce here exploratory work on multi-part book identification. In this early attempt, we focus on Gutenberg.org, one of the projects that has received the largest public support in recent years. The purpose of this article is to present a preliminary system that automatically classifies parts of texts into 35 semantic categories. An accuracy of more than 93% on the test set was achieved. We plan to extend this effort to other repositories in the future.
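
A generic baseline for this kind of part classification, sketched with scikit-learn on invented snippets (the paper's actual features, model, and 35 categories are not reproduced here):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in: two of the 35 semantic categories, a few snippets each.
parts = ["chapter i it was a dark night", "preface to the first edition",
         "chapter ii the journey begins", "editor's preface and notes"]
labels = ["chapter", "preface", "chapter", "preface"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(parts, labels)
print(clf.predict(["a preface by the translator"]))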

Cite as

Davide Picca and Cyrille Gay-Crosier. An Automatic Partitioning of Gutenberg.org Texts. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 35:1-35:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{picca_et_al:OASIcs.LDK.2021.35,
  author =	{Picca, Davide and Gay-Crosier, Cyrille},
  title =	{{An Automatic Partitioning of Gutenberg.org Texts}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{35:1--35:9},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.35},
  URN =		{urn:nbn:de:0030-drops-145714},
  doi =		{10.4230/OASIcs.LDK.2021.35},
  annote =	{Keywords: Digital Humanities, Machine Learning, Corpora}
}
Document
A Data Augmentation Approach for Sign-Language-To-Text Translation In-The-Wild

Authors: Fabrizio Nunnari, Cristina España-Bonet, and Eleftherios Avramidis


Abstract
In this paper, we describe the current main approaches to sign language translation, which use deep neural networks with videos as input and text as output. We highlight that, in our view, their main weakness is a lack of generalization to daily-life contexts. Our goal is to build a state-of-the-art system for the automatic interpretation of sign language under unpredictable video framing conditions. Our main contribution is the shift from image features to landmark positions, in order to reduce the size of the input data and facilitate the combination of data augmentation techniques for landmarks. We describe the set of hypotheses for building such a system and the list of experiments that will lead us to their verification.
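
A minimal numpy sketch of the kind of augmentation that becomes cheap once the input is landmark coordinates rather than pixels: random rotation, scaling, and jitter of 2D keypoints. The parameter ranges are illustrative assumptions, not the paper's settings.

import numpy as np

rng = np.random.default_rng(0)

def augment(landmarks, max_deg=10, scale_range=0.1, jitter=0.01):
    """Randomly rotate, scale, and jitter 2D landmarks of shape (N, 2),
    centred on their mean. Parameter ranges are illustrative."""
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    scale = 1 + rng.uniform(-scale_range, scale_range)
    centre = landmarks.mean(axis=0)
    out = (landmarks - centre) @ rot.T * scale + centre
    return out + rng.normal(0, jitter, size=landmarks.shape)

hand = rng.random((21, 2))        # e.g. 21 hand keypoints in one frame
print(augment(hand).shape)        # (21, 2): same layout, perturbed geometry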

Cite as

Fabrizio Nunnari, Cristina España-Bonet, and Eleftherios Avramidis. A Data Augmentation Approach for Sign-Language-To-Text Translation In-The-Wild. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 36:1-36:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{nunnari_et_al:OASIcs.LDK.2021.36,
  author =	{Nunnari, Fabrizio and Espa\~{n}a-Bonet, Cristina and Avramidis, Eleftherios},
  title =	{{A Data Augmentation Approach for Sign-Language-To-Text Translation In-The-Wild}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{36:1--36:8},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.36},
  URN =		{urn:nbn:de:0030-drops-145728},
  doi =		{10.4230/OASIcs.LDK.2021.36},
  annote =	{Keywords: sign language, video recognition, end-to-end translation, data augmentation}
}
Document
A Review and Cluster Analysis of German Polarity Resources for Sentiment Analysis

Authors: Bettina M. J. Kern, Andreas Baumann, Thomas E. Kolb, Katharina Sekanina, Klaus Hofmann, Tanja Wissik, and Julia Neidhardt


Abstract
The domain of German polarity dictionaries is heterogeneous, with many small dictionaries created for different purposes and using different methods. This paper aims to map out the landscape of freely available German polarity dictionaries by clustering them to uncover similarities and shared features. We find that, although most dictionaries seem to agree in their assessment of a word's sentiment, subsets of them form groups of interrelated dictionaries. These dependencies are in most cases a direct reflection of how the dictionaries were designed and compiled. As a consequence, we argue that sentiment evaluation should be based on multiple and diverse sentiment resources in order to avoid error propagation and the amplification of potential biases.
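
One plausible way to cluster such dictionaries, sketched with scipy on toy data (the paper's exact features and clustering setup are not reproduced): treat each dictionary as a vector of polarity scores over a shared vocabulary and cluster on correlation distance.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy scores: rows = 4 dictionaries, columns = words in their shared vocabulary.
scores = np.array([
    [ 0.8,  0.7, -0.9, -0.5],
    [ 0.9,  0.6, -0.8, -0.4],
    [-0.1,  0.2,  0.3, -0.9],
    [ 0.0,  0.1,  0.4, -0.8],
])
dist = 1 - np.corrcoef(scores)            # correlation distance between dictionaries
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))   # e.g. [1 1 2 2]: two dictionary families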

Cite as

Bettina M. J. Kern, Andreas Baumann, Thomas E. Kolb, Katharina Sekanina, Klaus Hofmann, Tanja Wissik, and Julia Neidhardt. A Review and Cluster Analysis of German Polarity Resources for Sentiment Analysis. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 37:1-37:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{kern_et_al:OASIcs.LDK.2021.37,
  author =	{Kern, Bettina M. J. and Baumann, Andreas and Kolb, Thomas E. and Sekanina, Katharina and Hofmann, Klaus and Wissik, Tanja and Neidhardt, Julia},
  title =	{{A Review and Cluster Analysis of German Polarity Resources for Sentiment Analysis}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{37:1--37:17},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.37},
  URN =		{urn:nbn:de:0030-drops-145734},
  doi =		{10.4230/OASIcs.LDK.2021.37},
  annote =	{Keywords: cluster analysis, sentiment polarity, sentiment analysis, German, review}
}
Document
Exploring Causal Relationships Among Emotional and Topical Trajectories in Political Text Data

Authors: Andreas Baumann, Klaus Hofmann, Bettina Kern, Anna Marakasova, Julia Neidhardt, and Tanja Wissik


Abstract
We explore relationships between the dynamics of emotion (arousal and valence) and topical stability in political discourse in two diachronic corpora of Austrian German. In doing so, we assess interactions among emotional and topical dynamics related to political parties, as well as interactions between two domains of discourse: debates in parliament and journalistic media. Methodologically, we employ unsupervised techniques, namely time-series clustering and Granger-causal modeling, to detect potential interactions. We find that emotional and topical dynamics in the media are only rarely a reflection of dynamics in parliamentary discourse.
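
For readers unfamiliar with the test, a minimal Granger-causality example with statsmodels on synthetic series (the corpora, variables, and lag choices here are stand-ins, not the study's data): we ask whether the "parliament" series helps predict the "media" series.

import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)
n = 200
parliament = rng.normal(size=n)
media = np.roll(parliament, 2) + 0.5 * rng.normal(size=n)   # media trails by 2 steps

# Column order matters: the test asks whether the 2nd column Granger-causes the 1st.
data = np.column_stack([media, parliament])
res = grangercausalitytests(data, maxlag=3)
print(res[2][0]["ssr_ftest"])   # (F statistic, p-value, df_denom, df_num) at lag 2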

Cite as

Andreas Baumann, Klaus Hofmann, Bettina Kern, Anna Marakasova, Julia Neidhardt, and Tanja Wissik. Exploring Causal Relationships Among Emotional and Topical Trajectories in Political Text Data. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 38:1-38:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{baumann_et_al:OASIcs.LDK.2021.38,
  author =	{Baumann, Andreas and Hofmann, Klaus and Kern, Bettina and Marakasova, Anna and Neidhardt, Julia and Wissik, Tanja},
  title =	{{Exploring Causal Relationships Among Emotional and Topical Trajectories in Political Text Data}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{38:1--38:8},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.38},
  URN =		{urn:nbn:de:0030-drops-145740},
  doi =		{10.4230/OASIcs.LDK.2021.38},
  annote =	{Keywords: time-series clustering, Granger causality, topical stability, emotion, political discourse}
}
Document
Calculating Argument Diversity in Online Threads

Authors: Cedric Waterschoot, Antal van den Bosch, and Ernst van den Hemel


Abstract
We propose a method for estimating argument diversity and interactivity in online discussion threads. Using a case study on the subject of Black Pete ("Zwarte Piet") in the Netherlands, we present an approach for the automatic detection of echo chambers. Dynamic thread scoring calculates the status of the discussion at the thread level, while individual messages receive a contribution score reflecting the extent to which the post contributed to the overall interactivity in the thread. We obtain platform-specific results: Gab hosts only echo chambers, the majority of Reddit threads are balanced in terms of perspectives, and Twitter threads cover the whole spectrum of interactivity. While the results of the case study mirror previous research, this calculation is only a first step towards better understanding and automatic detection of echo effects in online discussions.
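
One simple way to make such scores concrete (an illustration, not the paper's exact formulas): score a thread by the Shannon entropy of its stance distribution, and score each message by the change in entropy it induces on arrival. Stance labels are assumed to be given.

import math
from collections import Counter

def entropy(labels):
    """Stance entropy: with two stances, 1.0 = perfectly balanced thread,
    0.0 = echo chamber."""
    counts, n = Counter(labels), len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def contributions(stances):
    """Per-message contribution: change in thread entropy when the message arrives."""
    out = []
    for i in range(1, len(stances) + 1):
        prev = entropy(stances[: i - 1]) if i > 1 else 0.0
        out.append(entropy(stances[:i]) - prev)
    return out

thread = ["pro", "pro", "con", "pro", "con"]     # stance labels assumed given
print(round(entropy(thread), 2))                 # thread-level balance score
print([round(c, 2) for c in contributions(thread)])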

Cite as

Cedric Waterschoot, Antal van den Bosch, and Ernst van den Hemel. Calculating Argument Diversity in Online Threads. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 39:1-39:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{waterschoot_et_al:OASIcs.LDK.2021.39,
  author =	{Waterschoot, Cedric and van den Bosch, Antal and van den Hemel, Ernst},
  title =	{{Calculating Argument Diversity in Online Threads}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{39:1--39:9},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.39},
  URN =		{urn:nbn:de:0030-drops-145751},
  doi =		{10.4230/OASIcs.LDK.2021.39},
  annote =	{Keywords: Social Media, Echo Chamber, Interactivity, Argumentation, Stance}
}
Document
Linking Discourse Marker Inventories

Authors: Christian Chiarcos and Maxim Ionov


Abstract
The paper describes the first comprehensive edition of machine-readable discourse marker lexicons. Discourse markers such as and, because, but, though or thereafter are essential communicative signals in human conversation, as they indicate how an utterance relates to its communicative context. Because much of this information is implicit or expressed differently in different languages, discourse parsing, context-adequate natural language generation and machine translation are considered particularly challenging aspects of Natural Language Processing. Providing this data in machine-readable, standard-compliant form will facilitate such technical tasks and, moreover, make it possible to explore techniques for translation inference on this particular group of lexical resources, which was previously largely neglected in the context of Linguistic Linked (Open) Data.
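
To make "machine-readable, standard-compliant form" concrete, here is an OntoLex-Lemon style entry for "because" built with rdflib; the OntoLex classes and properties are real, but the ex: namespace and the sense-level property linking the marker to a discourse relation are illustrative assumptions.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("http://example.org/dimlex#")     # illustrative namespace

g = Graph()
g.bind("ontolex", ONTOLEX)

entry, form, sense = EX.because, EX.because_form, EX.because_sense1
g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, ONTOLEX.writtenRep, Literal("because", lang="en")))
g.add((entry, ONTOLEX.sense, sense))
# Linking the sense to a discourse relation inventory is the interesting part;
# this property and target URI are purely illustrative.
g.add((sense, EX.signalsRelation, URIRef("http://example.org/rel#Cause")))

print(g.serialize(format="turtle"))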

Cite as

Christian Chiarcos and Maxim Ionov. Linking Discourse Marker Inventories. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 40:1-40:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{chiarcos_et_al:OASIcs.LDK.2021.40,
  author =	{Chiarcos, Christian and Ionov, Maxim},
  title =	{{Linking Discourse Marker Inventories}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{40:1--40:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.40},
  URN =		{urn:nbn:de:0030-drops-145769},
  doi =		{10.4230/OASIcs.LDK.2021.40},
  annote =	{Keywords: discourse processing, discourse markers, linked data, OntoLex, OLiA}
}
Document
Tackling Domain-Specific Winograd Schemas with Knowledge-Based Reasoning and Machine Learning

Authors: Suk Joon Hong and Brandon Bennett


Abstract
The Winograd Schema Challenge (WSC) is a commonsense reasoning task that requires background knowledge. In this paper, we contribute to tackling the WSC in four ways. Firstly, we suggest a keyword method for defining a restricted domain in which distinctive high-level semantic patterns can be found; a thanking domain was defined by keywords, and the dataset in this domain is used in our experiments. Secondly, we develop a high-level knowledge-based reasoning method using semantic roles, based on the method of Sharma [Sharma, 2019]. Thirdly, we propose an ensemble method combining knowledge-based reasoning and machine learning, which shows the best performance in our experiments; as the machine learning component, we use Bidirectional Encoder Representations from Transformers (BERT) [Jacob Devlin et al., 2018; Vid Kocijan et al., 2019]. Lastly, in terms of evaluation, we suggest a "robust" accuracy measurement by modifying that of Trichelair et al. [Trichelair et al., 2018]: as with their switching method, we evaluate a model by considering its performance on trivial variants of each sentence in the test set.
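
One common shape for such an ensemble is precedence: answer with the knowledge-based reasoner when its patterns fire, otherwise fall back to the learned model. The sketch below uses toy stand-in components and is not the authors' combination rule.

def kb_reasoner(schema):
    """Stand-in for knowledge-based reasoning over semantic roles:
    returns an answer when its patterns apply, else None (abstains)."""
    if "thanked" in schema["sentence"]:
        return schema["candidates"][0]      # toy rule for the thanking domain
    return None

def bert_model(schema):
    """Stand-in for a fine-tuned BERT coreference scorer."""
    return schema["candidates"][1]

def ensemble(schema):
    answer = kb_reasoner(schema)
    return answer if answer is not None else bert_model(schema)

schema = {
    "sentence": "Sam thanked Alex because he had helped him move.",
    "candidates": ["Alex", "Sam"],
}
print(ensemble(schema))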

Cite as

Suk Joon Hong and Brandon Bennett. Tackling Domain-Specific Winograd Schemas with Knowledge-Based Reasoning and Machine Learning. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 41:1-41:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Copy BibTex To Clipboard

@InProceedings{hong_et_al:OASIcs.LDK.2021.41,
  author =	{Hong, Suk Joon and Bennett, Brandon},
  title =	{{Tackling Domain-Specific Winograd Schemas with Knowledge-Based Reasoning and Machine Learning}},
  booktitle =	{3rd Conference on Language, Data and Knowledge (LDK 2021)},
  pages =	{41:1--41:13},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-199-3},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{93},
  editor =	{Gromann, Dagmar and S\'{e}rasset, Gilles and Declerck, Thierry and McCrae, John P. and Gracia, Jorge and Bosque-Gil, Julia and Bobillo, Fernando and Heinisch, Barbara},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2021.41},
  URN =		{urn:nbn:de:0030-drops-145779},
  doi =		{10.4230/OASIcs.LDK.2021.41},
  annote =	{Keywords: Commonsense Reasoning, Winograd Schema Challenge, Knowledge-based Reasoning, Machine Learning, Semantics}
}
