Dagstuhl Seminar 14061: Statistical Techniques for Translating to Morphologically Rich Languages

( Feb 02 – Feb 07, 2014 )

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/14061

Organizers

Alexander M. Fraser (LMU München, DE)
Kevin Knight (USC - Marina del Rey, US)
Philipp Koehn (University of Edinburgh, GB)
Helmut Schmid (LMU München, DE)
Hans Uszkoreit (Universität des Saarlandes, DE)

Contact

Annette Beyer (for administrative matters)

Publications

Statistical Techniques for Translating to Morphologically Rich Languages (Dagstuhl Seminar 14061). Alexander M. Fraser, Kevin Knight, Philipp Koehn, Helmut Schmid, and Hans Uszkoreit. In Dagstuhl Reports, Volume 4, Issue 2, pp. 1-16, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2014)

Motivation

This Dagstuhl Seminar will bring together disparate communities working in the area of morphologically rich languages to discuss an important research problem: translation to morphologically rich languages. While statistical techniques for machine translation have made significant progress in the last 20 years, results for translating to morphologically rich languages are still mixed when compared against previous generation rule-based systems, so this is a critical and timely topic. Current research in statistical techniques for translating to morphologically rich languages varies greatly with respect to the amount and the form of the linguistic knowledge used. This variation is strongest with respect to target language; for example, the resources currently used for translating to Czech are very different from those used for translating to German. Given this research diversity, there is a great need to move the discussion of these translation tasks and their related issues into a broader venue than that of the ACL Workshops on Machine Translation, which are primarily attended by statistical machine translation researchers.

It is clear that more linguistically sophisticated methods are required to solve many of the problems of translating to morphologically rich languages. It is critically important that SMT researchers and experts in statistical parsing and morphology who work with morphologically rich languages come together to discuss what sort of representations of linguistic features are appropriate, and which linguistic features can be accurately determined by state-of-the-art disambiguation techniques. We expect this interaction to create a new community crossing these research areas. We are also inviting a few experts in structured prediction who are interested in SMT and who have insight on how to jointly model some of these phenomena, rather than combining separate tools in ad-hoc pipelines as is currently done.

Some of the research questions to be addressed are:

Which linguistic features (from syntax, morphology and other areas such as coreference resolution) need to be modeled in SMT?
Which statistical models and tools should be used to annotate linguistic features on training data useful for SMT modeling?
How can we integrate these features into existing SMT models?
Which structured prediction techniques and types of features are appropriate for training the extended models and determining the best output translations?
What data sets should be used to allow a common test bed for evaluation?
How should evaluation be conducted, given the poor results of current automatic evaluation metrics on morphologically rich languages?

This Dagstuhl Seminar brings together researchers from four different communities (statistical machine translation, statistical parsing, morphology and structured prediction) to jointly address these questions.

Summary

This report documents the program and the outcomes of Dagstuhl Seminar 14061 "Statistical Techniques for Translating to Morphologically Rich Languages". The website of the seminar, which allows access to most of the materials created for and during the seminar, is http://www.dagstuhl.de/14061. The seminar allowed disparate communities working on problems related to morphologically rich languages to meet and discuss an important research problem: translation to morphologically rich languages. While statistical techniques for machine translation have made significant progress in the last 20 years, results for translating to morphologically rich languages are still mixed when compared against previous generation rule-based systems, so this is a critical and timely topic. Current research in statistical techniques for translating to morphologically rich languages varies greatly in the amount and the form of the linguistic knowledge used. This variation is strongest with respect to target language; for instance, the resources currently used for translating to Czech are very different from those used for translating to German. The seminar met a pressing need to discuss the issues involved in these translation tasks in a broader venue than the ACL Workshops on Machine Translation, which are primarily attended by statistical machine translation researchers.

Important background for the discussion was the recent realization that more linguistically sophisticated methods are required to solve many of the problems of translating to morphologically rich languages. It was therefore critically important that SMT researchers be able to interact with experts in statistical parsing and morphology who work with morphologically rich languages, to discuss what sort of representations of linguistic features are appropriate and which linguistic features can be accurately determined by state-of-the-art disambiguation techniques. This was an important step in creating a new community crossing these research areas. Additionally, a few experts in structured prediction were invited. The discussions took advantage of their insight into how to jointly model some of these phenomena, rather than combining separate tools in ad-hoc pipelines as is currently done. The overall discussion was driven by the following questions:

Which linguistic features (from syntax, morphology and other areas such as coreference resolution) need to be modeled in SMT?
Which statistical models and tools should be used to annotate linguistic features on training data useful for SMT modeling?
How can we integrate these features into existing SMT models?
Which structured prediction techniques and types of features are appropriate for training the extended models and determining the best output translations?
What data sets should be used to allow a common test bed for evaluation?
How should evaluation be conducted, given the poor results of current automatic evaluation metrics on morphologically rich languages?

The Dagstuhl seminar on Statistical Techniques for Translating to Morphologically Rich Languages addressed these questions by allowing four different communities to meet together: statistical machine translation, statistical parsing, morphology and structured prediction.

Outcome in brief. The Dagstuhl seminar on Statistical Techniques for Translating to Morphologically Rich Languages was a great success. The discussions held will play an important role in allowing researchers to significantly advance the state-of-the-art. In particular, strong and weak points in current research approaches were identified and proposals to address the weak points were made. In addition, the seminar acted as a valuable venue for more junior researchers to spend more time talking with senior researchers than is possible in a conference setting. Finally, several new community building ideas were discussed, including a DFG proposal connecting all of the major sites for statistical machine translation research in Germany, see below.

Invited Talks. We begin the detailed discussion with a brief overview of the three invited keynote talks (as well as the introductory overview and motivational talk). All of these talks were very well received, with several seminar participants commenting that they learned a significant amount by being able to see a synthesis of the problems, current approaches and possible future approaches to translating to morphologically rich languages. The three keynote talks were:

Philipp Koehn of the University of Edinburgh presented a general discussion of dealing with the phenomena of morphologically rich languages in translation.
Kristina Toutanova of Microsoft Research presented a detailed overview of the state-of-the-art in statistical machine translation research related to morphologically rich languages in translation.
Kevin Knight of the University of Southern California presented a vision of where the field could go in the future, in terms of both better modelling of morphologically rich languages and the use of more language-independent structure (at the semantic level) in translation.

After this, people interested in leading a discussion group held talks.

Discussion Groups. There were initially nine proposed topics for discussion groups (note that these are listed as topic-focused talks subsequently in the report):

Nivre/Petrov: Parallel dependency treebanks and linguistic resources
Tiedemann: The use of synthetic training data and pivot languages to overcome data sparseness
Kirchhoff: Language modeling
Dyer: Modeling morphemes vs. modeling words and smoothing with morphemes
Habash: Arabic morphology and deep morphology representation for MT
Williams/Koehn: Syntactic SMT for morphologically rich languages
Knight: Semantics
Webber: Discourse/aspects of semantics
Bojar/Hajic: Generating morphology for SMT

Following this, all participants emailed the organizers with their discussion group preferences. In the end, all but two participants were assigned to their first preference. We eliminated two groups (on synthetic training data and generating morphology), and their proposers joined other groups.

Following initial group presentations by some groups on Wednesday morning, three groups dissolved and several decided to continue. The three new groups that were proposed were:

Virpioja/Dyer: Unsupervised morphology for statistical machine translation
Wu/Lavie: Evaluation of machine translation output
Nivre/Knight: Universal Annotation and Abstract Meaning Representation

Highlights of what was accomplished by the discussion groups were:

Dyer, Virpioja and their groups looked at morphologically aware translation models which use morphology to cover the long tail without requiring morphological modelling of very frequent tokens, and looked at the state-of-the-art in unsupervised modeling.
Kirchhoff and her group carried out a detailed survey of the state-of-the-art for language modeling of morphologically rich languages and documented this on the Wiki.
Nivre and his two groups (one co-led with Petrov) defined a new proposed annotation standard for working on two levels (surface forms and lemmas, including multi-word-entities and decomposed compounds).
Habash and his group carried out a literature review of attempts to deal with Arabic morphology in translation, discussing the strengths and weaknesses of the approaches, and identifying a new direction for future work.
Williams, Koehn and group looked at the application of unification to modelling agreement in multiple languages.
Knight and his two groups worked on general applications of semantically-aware processing to morphologically rich languages and on identifying areas where the Abstract Meaning Representation could be applied to this problem.
Webber and group created a list of resources and research papers on applying discourse modeling to statistical machine translation and looked at machine translation output to find errors caused by broken discourse constraints.
Wu, Lavie and group discussed and documented the different levels of linguistic analysis required for high quality automatic evaluation when the target language is morphologically rich.

See the individual abstracts for more information and further details.

Other activities. In addition to the formal work carried out in the talks and discussion groups, Dagstuhl offered an intimate environment strongly encouraging networking and discussion. The meal system of Dagstuhl, with random assignment of people to tables, is an excellent idea and was particularly useful for the more junior participants who did not know many of the senior researchers attending (several people mentioned informally that this was the best experience of this sort they have had). The informal evening activities centering around social gatherings and the music room were also very well attended and a variety of interesting discussions took place. The excursion to Trier was a welcome mid-week break and provided another networking opportunity, as well as being highly interesting for the vast majority of participants who had not previously visited a city with a similar historical background.

The seminar was unusual for Dagstuhl itself in that very few of the participants had participated in a Dagstuhl seminar previously. Due to the strongly positive reaction we anticipate that other research areas within Natural Language Processing will apply for Dagstuhl seminars.

We would like to take the opportunity here to thank Dagstuhl for the wonderful logistic support and for providing such a stimulating environment for our work.

Communities represented in more detail. The seminar was a success in terms of the strong participation of women and a good geographical distribution (although Asia could have been somewhat more strongly represented). Our only strong area of concern was that, of the numerous participants invited from companies, only two attended (Kristina Toutanova of Microsoft Research and Slav Petrov of Google, who gave one of the keynotes and co-led a discussion group respectively). Nevertheless, the networking opportunities were excellent and many participants informally told us that this was an excellent meeting which they expected to have a strong impact on their research.

One characteristic of the proposal which was successfully carried out was a meeting of four different communities: statistical machine translation, statistical parsing, morphology and structured prediction. In particular, we felt that the interaction between the statistical machine translation researchers and the researchers working on statistical parsing and morphology was highly productive and will likely lead to new techniques for analyzing morphologically rich languages which will be more useful in translation research than the current approaches. We believe that the Dagstuhl seminar has been unique in terms of providing the opportunity for these communities to meet together for five days and understand each other's perspectives on research.

Conclusion and Impact. In conclusion, we believe the Dagstuhl seminar has met the goals we set out for it, in terms of providing a forum for discussion of the current problems with the state-of-the-art and allowing a focusing of research effort which was not previously present in the research community.

As we previously mentioned, in addition to the less quantifiable aspects in terms of networking and connections made, there were several prominent concrete outcomes of the Dagstuhl seminar. The new annotation standard suggested by the two Universal Annotation groups led by Nivre, Petrov and Knight is one strong outcome which will change the basic tools that the statistical machine translation community will have available. The Kirchhoff group is working on a position paper that will help to refocus effort on language modeling for morphologically rich languages, which will have an impact not only on machine translation research but also research on speech recognition and other research areas.

Five of the six most prominent researchers in machine translation in Germany were able to attend the Dagstuhl seminar, and while there decided to launch a new research program in translating spoken language in an educational context, with a particular focus on translation to German (a morphologically rich language), by submitting a Paketantrag to the DFG. The work will be carried out with a view toward creating a DFG Schwerpunktprogramm focusing on Natural Language Processing for German after the successful completion of the work in the Paketantrag. The researchers are Fraser, van Genabith, Ney, Riezler and Uszkoreit, and they are joined by Alex Waibel (who was invited to the seminar but unable to attend due to scheduling conflicts). This new funding effort would not have been possible without the opportunity to meet at Dagstuhl several times to find common ground and determine an overall strategy.

In short, we were very happy with the discussions, work and impact of the Dagstuhl seminar on translation to morphologically rich languages. We plan to apply to hold a second meeting at Dagstuhl in the summer of 2016 on the same topic.

Finally, we would like to once again thank the staff of Dagstuhl for facilitating these unique scientific discussions which we are confident will have a strong impact on future research on the important problem of statistical techniques for translation to morphologically rich languages.

Creative Commons BY 3.0 Unported license

Alexander M. Fraser, Kevin Knight, Philipp Koehn, Helmut Schmid, and Hans Uszkoreit

Participants

Arianna Bisazza (University of Amsterdam, NL)
Fabienne Braune (LMU München, DE)
Fabienne Cap (LMU München, DE)
Marine Carpuat (NRC - Ottawa, CA)
David Chiang (USC - Marina del Rey, US)
Ann Clifton (Simon Fraser University - Burnaby, CA)
Hal Daumé III (University of Maryland - College Park, US)
Gideon Maillette de Buy Wenniger (University of Amsterdam, NL)
Chris Dyer (Carnegie Mellon University, US)
Andreas Eisele (European Commission Luxembourg, LU)
Richard Farkas (University of Szeged, HU)
Marcello Federico (Bruno Kessler Foundation - Trento, IT)
Mark Fishel (Universität Zürich, CH)
Anette Frank (Universität Heidelberg, DE)
Alexander M. Fraser (LMU München, DE)
Spence Green (Stanford University, US)
Nizar Habash (Columbia University - New York, US)
Jan Hajic (Charles University - Prague, CZ)
Katrin Kirchhoff (Univ. of Washington - Seattle, US)
Kevin Knight (USC - Marina del Rey, US)
Philipp Koehn (University of Edinburgh, GB)
Jonas Kuhn (Universität Stuttgart, DE)
Alon Lavie (Carnegie Mellon University - Pittsburgh, US)
Krister Lindén (University of Helsinki, FI)
Andreas Maletti (Universität Stuttgart, DE)
Maria Nadejde (University of Edinburgh, GB)
Preslav Nakov (QCRI - Doha, QA)
Hermann Ney (RWTH Aachen, DE)
Joakim Nivre (Uppsala University, SE)
Slav Petrov (Google - New York, US)
Maja Popovic (DFKI - Berlin, DE)
Anita Ramm (LMU München, DE)
Stefan Riezler (Universität Heidelberg, DE)
Hassan Sajjad (QCRI - Doha, QA)
Helmut Schmid (LMU München, DE)
Hinrich Schütze (LMU München, DE)
Khalil Sima'an (University of Amsterdam, NL)
Sara Stymne (Uppsala University, SE)
Jörg Tiedemann (Uppsala University, SE)
Kristina Toutanova (Microsoft Corporation - Redmond, US)
Hans Uszkoreit (Universität des Saarlandes, DE)
Josef van Genabith (Dublin City University, IE)
Sami Virpioja (Aalto University, FI)
Stephan Vogel (QCRI - Doha, QA)
Martin Volk (Universität Zürich, CH)
Bonnie Webber (University of Edinburgh, GB)
Marion Weller (LMU München, DE)
Phil Williams (University of Edinburgh, GB)
Shuly Wintner (University of Haifa, IL)
Dekai Wu (HKUST - Kowloon, HK)
Francois Yvon (University Paris Sud, FR)

Classification

artificial intelligence / robotics

Keywords

Statistical Machine Translation
Statistical Parsing
Morphology
Structured Prediction Machine Learning
Natural Language Processing
