Lingenio GmbH: Language mastery through language technology

Linguistic Engineering | Sprachtechnologie | Ingénieurie linguistique
 

For roughly a decade, statistical machine translation (SMT) has predominated in academic research. However, most commercial MT suppliers continue to offer systems based on more traditional rule-based architectures (RBMT). The difficulty of replacing the translation engine in an existing product set-up may explain this discrepancy in part. The main reasons, however, are that RBMT provides a range of functions that SMT does not, including human-readable, fully worked-out 'conventional' dictionaries, and that for a number of language pairs RBMT quality is still higher.

SMT needs huge bilingual text corpora to compute satisfactory translation models, and it is inherently weak when dealing with rare data and non-local phenomena. Its advantages are low cost and robustness. The main disadvantages of RBMT are high cost and shortcomings with respect to resolving structural and lexical ambiguities.

We propose a hybrid architecture for high-quality machine translation that combines the strengths of both approaches and minimizes their weaknesses. At the core is a rule-based MT system which provides morphology, declarative grammars, semantic categories, and small dictionaries, but which avoids all expensive kinds of intellectual knowledge acquisition. Instead of manually working out large dictionaries and compiling information on disambiguation preferences, we suggest a novel corpus-based bootstrapping method for automatically expanding dictionaries and for training the analytical performance and the choice of transfer alternatives.

As bilingual corpora with good literal translations are a scarce resource, we focus in particular on exploiting comparable monolingual corpora. We locate unknown words and expressions, and then use a statistically tuned analysis component in combination with similarity assumptions to identify relations across languages. This approach should make it possible to overcome the data acquisition bottleneck of conventional SMT.
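To make the idea of relating words across languages through similarity assumptions more concrete, here is a minimal sketch of one common way to do it: comparing co-occurrence context vectors from two monolingual corpora via a small seed dictionary. This is an illustrative assumption, not the project's confirmed method. The function names, the window size, and the toy sentences and seed dictionary are all hypothetical, and the project's statistically tuned analysis component is not modelled here.

```python
from collections import Counter
from math import sqrt


def context_vector(word, sentences, window=2):
    """Count the words that co-occur with `word` within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def translate_vector(vec, seed_dict):
    """Map a source-language context vector into the target language.

    Context words missing from the (small) seed dictionary are dropped.
    """
    out = Counter()
    for w, c in vec.items():
        if w in seed_dict:
            out[seed_dict[w]] += c
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_candidates(unknown, src_sents, tgt_sents, seed_dict, candidates, window=2):
    """Rank target-language candidates for an unknown source word."""
    src_vec = translate_vector(context_vector(unknown, src_sents, window), seed_dict)
    scored = [(cosine(src_vec, context_vector(c, tgt_sents, window)), c)
              for c in candidates]
    return sorted(scored, reverse=True)


# Toy data (hypothetical): two tiny, non-parallel monolingual "corpora".
SRC = [["der", "hund", "bellt", "laut"], ["die", "katze", "schläft", "ruhig"]]
TGT = [["the", "cat", "sleeps", "quietly"], ["the", "dog", "barks", "loudly"]]
SEED = {"der": "the", "die": "the", "bellt": "barks", "laut": "loudly",
        "schläft": "sleeps", "ruhig": "quietly"}

ranked = rank_candidates("hund", SRC, TGT, SEED, ["dog", "cat"])
print(ranked)  # list of (similarity, candidate) pairs, best match first
```

In this toy setting the unknown word "hund" shares more translated context with "dog" than with "cat", so "dog" is ranked first. A realistic system would need far larger corpora, association measures more robust than raw counts, and linguistic analysis of the kind the rule-based core provides.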

Project overview

We design and implement a hybrid architecture for high-quality machine translation (HyghTra) which combines the strengths of the statistical and the rule-based approach and minimizes their weaknesses.

HyghTra will consist of a rule-based MT core system which provides morphology, declarative grammars, semantic categories, and small (cheap) bilingual dictionaries, and which omits all kinds of (expensive) disambiguating preference knowledge. Instead of compiling such knowledge and working out large dictionaries manually, we make use of a bootstrapping method for automatically extending dictionaries and for training the analytical performance and the choice of transfer alternatives, using monolingual and bilingual corpora.

Since bilingual data with good literal translations are scarce, we focus in particular on searching monolingual corpora for new words, and we use the statistically tuned analysis components of the system together with similarity assumptions to relate these words to each other cross-linguistically. This should overcome the data acquisition bottleneck of conventional SMT to a significant degree.

Project Participants

University of Leeds    

 

Project details

Research area: FP7-PEOPLE-2009-IAPP Marie Curie IAPP transfer of knowledge programme
Project Acronym: HYGHTRA
Project Reference: 251534
Start Date: 2010-12-01
Duration: 48 months
Contract Type: Industry-Academia Partnerships and Pathways (IAPP)
End Date: 2014-11-30
Project Status: Execution

FP7 reference number: 251534

Last modified: 20.05.2011

© Copyright Lingenio GmbH 2011