02.02.14 - 07.02.14, Seminar 14061

Statistical Techniques for Translating to Morphologically Rich Languages

This seminar description was published on our web pages before the seminar and was used in the invitation to the seminar.


This Dagstuhl Seminar will bring together disparate communities working in the area of morphologically rich languages to discuss an important research problem: translation to morphologically rich languages. While statistical techniques for machine translation have made significant progress in the last 20 years, results for translating to morphologically rich languages are still mixed when compared against previous-generation rule-based systems, so this is a critical and timely topic. Current research in statistical techniques for translating to morphologically rich languages varies greatly with respect to the amount and the form of the linguistic knowledge used. This variation is strongest with respect to the target language; for example, the resources currently used for translating to Czech are very different from those used for translating to German. Given this research diversity, there is a great need to move the discussion of these translation tasks and their related issues into a broader venue than that of the ACL Workshops on Machine Translation, which are primarily attended by statistical machine translation researchers.

It is clear that more linguistically sophisticated methods are required to solve many of the problems of translating to morphologically rich languages. It is critically important that statistical machine translation (SMT) researchers and experts in statistical parsing and morphology who work with morphologically rich languages come together to discuss what sort of representations of linguistic features are appropriate, and which linguistic features can be accurately determined by state-of-the-art disambiguation techniques. We expect this interaction to create a new community crossing these research areas. We are also inviting a few experts in structured prediction who are interested in SMT and who have insight into how to jointly model some of these phenomena, rather than combining separate tools in ad hoc pipelines as is currently done.

Some of the research questions to be addressed are:

  • Which linguistic features (from syntax, morphology and other areas such as coreference resolution) need to be modeled in SMT?
  • Which statistical models and tools should be used to annotate linguistic features on training data useful for SMT modeling?
  • How can we integrate these features into existing SMT models?
  • Which structured prediction techniques and types of features are appropriate for training the extended models and determining the best output translations?
  • What data sets should be used to allow a common test bed for evaluation?
  • How should evaluation be conducted, given the poor results of current automatic evaluation metrics on morphologically rich languages?

This Dagstuhl Seminar brings together researchers from four different communities (statistical machine translation, statistical parsing, morphology and structured prediction) to jointly address these questions.