These flashcards cover the definitions, history, models, evaluation methods, and tools associated with Machine Translation based on the lecture notes.
Machine Translation
The process of converting text in one language into another while preserving its meaning.
Statistical Approach
Introduced by IBM in 1988, this approach uses large corpora of translated texts and statistical models to learn translation rules rather than relying on linguists to define transformations and lexicons.
Phrase-based Models
The currently dominant approach in statistical machine translation, which maps short text chunks (phrases), typically 1 to 3 words long, from one language to the other.
Fluency
One of the two fine-grained criteria used in human assessment: whether the translation reads naturally and smoothly in the target language.
Adequacy
One of the two fine-grained criteria used in human assessment: whether the translation conveys the full meaning of the original text.
Task-based Evaluation
An evaluation method where quality is tested by seeing if the translation fulfills an information need, such as an assessor being able to answer questions about the content of the translated text.
Automatic Evaluation Metrics
Computational metrics such as WER, BLEU, and METEOR that allow machine translation systems to be ranked frequently and cheaply; their validity is usually checked through correlation studies against human judgments.
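As an illustration of one such metric, here is a minimal sketch of word error rate (WER), assuming the usual definition: word-level edit distance between output and reference, divided by the reference length. The function name `wer` and the example sentences are made up for this card.

```python
def wer(reference: str, output: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, out = reference.split(), output.split()
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j words of the output
    prev = list(range(len(out) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, o in enumerate(out, 1):
            cur.append(min(
                prev[j] + 1,             # reference word missing from output
                cur[j - 1] + 1,          # extra word in output
                prev[j - 1] + (r != o),  # match (cost 0) or substitution (cost 1)
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substitution ("sat" -> "is") and one missing "the": 2 edits over 6 words.
print(wer("the cat sat on the mat", "the cat is on mat"))
```

Because the edit distance is sensitive to word order, a correct but reordered translation is penalized by this metric.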
Matches
In automatic evaluation, words that appear in both the reference translation and the machine translation output.
Insertions
In automatic evaluation, words that appear only in the machine translation output and not in the reference.
Deletions
In automatic evaluation, words that appear only in the reference translation and are missing from the machine translation output.
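The three counts above can be sketched with multisets of words. This is a simplified illustration that ignores word order; the function name `word_overlap` and the example sentences are assumptions made for this card.

```python
from collections import Counter

def word_overlap(reference: str, output: str) -> dict:
    """Count matches, insertions, and deletions as bags of words."""
    ref = Counter(reference.lower().split())
    out = Counter(output.lower().split())
    return {
        "matches": sum((ref & out).values()),     # words in both
        "insertions": sum((out - ref).values()),  # only in the MT output
        "deletions": sum((ref - out).values()),   # only in the reference
    }

counts = word_overlap("the cat sat on the mat", "the cat is on mat")
print(counts)  # {'matches': 4, 'insertions': 1, 'deletions': 2}
```

From these counts, precision is matches divided by output length (4/5 here) and recall is matches divided by reference length (4/6 here).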
PER (Position-independent error rate)
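One of the earliest automatic evaluation metrics proposed for measuring translation accuracy; it counts word-level errors against the reference while ignoring word order, treating both sentences as bags of words.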
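A minimal sketch of PER, assuming one common formulation: 1 − (correct − max(0, output length − reference length)) / reference length, where "correct" is the number of words shared by both sentences regardless of position. The function name `per` and the examples are made up for this card.

```python
from collections import Counter

def per(reference: str, output: str) -> float:
    """Position-independent error rate under the formulation stated above."""
    ref, out = reference.split(), output.split()
    correct = sum((Counter(ref) & Counter(out)).values())  # order ignored
    surplus = max(0, len(out) - len(ref))                  # penalty for extra words
    return 1.0 - (correct - surplus) / len(ref)

# Same words in a different order: no position-independent errors.
print(per("the cat sat on the mat", "on the mat the cat sat"))  # 0.0
```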
Word Alignment
A fundamental step in statistical machine translation models that involves detecting word-level translations from parallel corpora.
Sentence-aligned Parallel Corpus
A collection of texts where each foreign sentence f is paired with its English translation e.
IBM Model 1
A very simplistic model for word alignment used as a stepping stone to more sophisticated models.
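A minimal sketch of how IBM Model 1 can be trained with expectation maximization on a sentence-aligned corpus. It is simplified: there is no NULL word, the number of iterations is fixed, and the function name `train_ibm_model1` and the three-sentence toy corpus are assumptions made for this card.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=20):
    """corpus: list of (foreign_tokens, english_tokens) pairs.
    Returns a dict t with t[(e, f)] approximating t(e | f)."""
    e_vocab = {e for _, es in corpus for e in es}
    uniform = 1.0 / len(e_vocab)
    t = defaultdict(lambda: uniform)  # start from uniform probabilities
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e aligned to f
        total = defaultdict(float)  # expected count of f being aligned
        # E-step: distribute each English word over the foreign words
        for fs, es in corpus:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: re-estimate translation probabilities
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_ibm_model1(corpus)
print(round(t[("the", "das")], 3), round(t[("house", "das")], 3))
```

Even though no word alignments are given, co-occurrence across sentences pushes probability mass toward pairs such as "das" and "the".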
IBM Model 2
An alignment model that introduces the use of absolute word positions within sentences.
Fertility (IBM Model 3)
A concept introduced in IBM Model 3 describing the phenomenon where a single word can produce multiple words in translation.
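To make the idea concrete, this sketch counts the observed fertility of each source word in one given alignment. IBM Model 3 itself models fertility as a probability distribution; the helper name `fertilities` and the example alignment are assumptions made for this card.

```python
from collections import Counter

def fertilities(source, alignment):
    """alignment: set of (source_index, target_index) links.
    Returns (word, number of target words it is linked to) per source word."""
    counts = Counter(s for s, _ in alignment)
    return [(word, counts[i]) for i, word in enumerate(source)]

source = ["ich", "gehe", "zum", "Haus"]
# target: "i go to the house"; "zum" is linked to both "to" and "the"
alignment = {(0, 0), (1, 1), (2, 2), (2, 3), (3, 4)}
print(fertilities(source, alignment))
# [('ich', 1), ('gehe', 1), ('zum', 2), ('Haus', 1)]
```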
Symmetrization
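A refinement step that runs word alignment in both translation directions and merges the two results, addressing a fundamental flaw of the original IBM models: each word on one side can be aligned to at most one word on the other.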
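A minimal sketch of the two simplest ways to combine alignments from both directions: intersection (fewer, more reliable links) and union (more links, more noise). Heuristics used in practice typically start from the intersection and add selected links from the union. The function name `symmetrize` and the example links are assumptions made for this card.

```python
def symmetrize(f2e, e2f):
    """f2e: set of (f_index, e_index) links from the foreign-to-English run.
    e2f: set of (e_index, f_index) links from the English-to-foreign run.
    Returns (intersection, union), both as (f_index, e_index) links."""
    flipped = {(f, e) for e, f in e2f}  # bring both runs into one orientation
    return f2e & flipped, f2e | flipped

f2e = {(0, 0), (1, 1), (2, 2)}
e2f = {(0, 0), (1, 1), (3, 2)}  # English word 3 linked to foreign word 2
intersection, union = symmetrize(f2e, e2f)
print(sorted(intersection))  # [(0, 0), (1, 1)]
print(sorted(union))         # [(0, 0), (1, 1), (2, 2), (2, 3)]
```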
Phrase Translation Table
The massive knowledge source used in phrase-based models to store mappings between short text chunks.
Hypotheses
The term used for partial translations, which are organized in stacks during the decoding process.
Beam Search
A decoding method that explores only the most promising part of the search space, like a beam of light illuminating a limited number of alternatives while the rest are pruned.
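A toy sketch of stack-based beam search over hypotheses. It is heavily simplified: translation is monotone (no reordering), there is no language model or future cost estimate, and each stack keeps only the `beam_size` best hypotheses. The phrase table entries, probabilities, and all names are made up for this card.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    covered: int   # number of source words translated so far
    output: tuple  # target phrases produced so far
    score: float   # sum of log phrase probabilities

def beam_decode(source, phrase_table, beam_size=2, max_phrase_len=3):
    # stacks[n] holds hypotheses that have translated the first n source words
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0].append(Hypothesis(0, (), 0.0))
    for n in range(len(source)):
        # pruning: expand only the best beam_size hypotheses in this stack
        stacks[n].sort(key=lambda h: h.score, reverse=True)
        for hyp in stacks[n][:beam_size]:
            for length in range(1, max_phrase_len + 1):
                end = n + length
                if end > len(source):
                    break
                phrase = tuple(source[n:end])
                for target, prob in phrase_table.get(phrase, []):
                    stacks[end].append(Hypothesis(
                        end, hyp.output + (target,), hyp.score + math.log(prob)))
    final = stacks[-1]
    return max(final, key=lambda h: h.score) if final else None

phrase_table = {  # illustrative entries only
    ("das",): [("the", 0.7), ("that", 0.3)],
    ("Haus",): [("house", 0.9)],
    ("das", "Haus"): [("the house", 0.8)],
    ("ist",): [("is", 1.0)],
    ("klein",): [("small", 0.8), ("little", 0.2)],
    ("ist", "klein"): [("is small", 0.9)],
}
best = beam_decode("das Haus ist klein".split(), phrase_table)
print(" ".join(best.output))  # the house is small
```

A smaller `beam_size` makes decoding faster but raises the risk of pruning away the hypothesis that would have led to the best full translation.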
Cube Pruning
A popular variation of the decoding heuristic used in machine translation systems.
MERT (Minimum Error Rate Training)
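A method for parameter tuning: setting the weights of the model's components so that translation error on a tuning set is minimized, which is a multi-dimensional optimization problem.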
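To illustrate what parameter tuning optimizes, this sketch searches for component weights that minimize the total error of the top-scoring candidate in each n-best list. It uses a brute-force grid search, which is a stand-in and not the search procedure used by MERT itself; the feature values, error counts, and the name `tune_weights` are made up for this card.

```python
from itertools import product

def tune_weights(nbest_lists, grid):
    """nbest_lists: one list per sentence of (feature_vector, error) candidates.
    Tries every weight combination from grid and returns (weights, total_error)."""
    dims = len(nbest_lists[0][0][0])
    best_w, best_err = None, float("inf")
    for w in product(grid, repeat=dims):
        err = 0.0
        for candidates in nbest_lists:
            # the model picks the candidate with the highest weighted score
            top = max(candidates,
                      key=lambda c: sum(wi * fi for wi, fi in zip(w, c[0])))
            err += top[1]
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

# Two features per candidate (for example, log scores of two model components).
nbest_lists = [
    [((-2.0, -5.0), 0), ((-4.0, -1.0), 1)],
    [((-3.0, -1.0), 1), ((-1.0, -2.0), 0)],
]
weights, error = tune_weights(nbest_lists, grid=[0.0, 0.5, 1.0])
print(weights, error)
```

The cost of a grid grows exponentially with the number of weights, which is one reason dedicated tuning methods are needed for this multi-dimensional problem.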
Recursion
A fundamental property of language that is addressed by tree-based machine translation models.
Berkeley Word Aligner
A tool that integrates the idea of symmetrizing word alignments closely into the alignment method.
SRILM and IRSTLM
Toolkits developed for language modeling, with IRSTLM specifically targeting compact representation and scalable training.
Moses
The most widely used toolkit for machine translation, implementing most standard statistical methods and drawing on tools for alignment and language modeling.
Joshua
A more recent decoder focused on hierarchical and syntax-based translation models.
Apertium
A project aimed at constructing rule-based machine translation systems for many language pairs.
Canadian Hansards
A parallel corpus consisting of the proceedings of the Canadian parliament translated between French and English.
Europarl Corpus
A corpus consisting of translated proceedings of the European parliament, offering about 40 million words in each of 11 languages.
Acquis Corpus
A corpus of legal documents from the European Union covering 22 languages and up to 40 million words per language.
OPUS Project
A project that collects parallel corpora from various sources, including open source documentation and movie subtitles.