Comprehensive Study Guide to Machine Translation Problems, Localization, and Corpora

Fundamental Limitations of Computers in Machine Translation

Machine translation (MT) research has spent several decades attempting to automate the translation process. Despite significant advances, computers still face four fundamental limitations:

Inability to Perform Vaguely Defined Tasks: Computers require highly precise and explicit instructions to function. They struggle significantly when faced with directives that are ambiguous, open-ended, or lack formal definition.
Inability to Learn from Experience: Unlike human beings, computers do not instinctively apply knowledge gained from past experiences to new, unstructured situations. Every new context generally requires explicit programming rather than intuitive adaptation.
Inability to Perform Common-Sense Reasoning: Computers lack the everyday, intuitive knowledge and contextual understanding that humans use to navigate reality. They cannot automatically "fill in the blanks" using logic that seems obvious to a human.
Difficulty in Dealing with Combinatorial Explosive Problems: In language translation, the number of possible combinations of word senses, structures, and phrasings can grow exponentially with the length of the text. These combinations often exceed the computational capabilities of current hardware and software systems.
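
To make the scale of the fourth limitation concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that every word in a sentence has the same number of candidate translations; the function name candidate_count and the figures chosen are hypothetical, not measurements from any real system.

```python
def candidate_count(sentence_length: int, options_per_word: int) -> int:
    """Number of word-by-word combinations when every word has the same
    number of candidate translations (a simplifying assumption)."""
    return options_per_word ** sentence_length


# With only 3 options per word, the count grows exponentially with length.
for length in (5, 10, 20):
    print(length, candidate_count(length, 3))
# 5 243
# 10 59049
# 20 3486784401
```

Even under this simplified assumption, a 20-word sentence yields billions of combinations, which is why exhaustive enumeration is rarely practical.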

The Four Core Problems of Machine Translation

Each of these limitations gives rise to a corresponding problem area in translation:

1. The Analysis Problem

Core Limitation: Inability to Perform Vaguely Defined Tasks.
Description: For a computer to translate, it must first analyze the meaning of the source text. However, many words are ambiguous (polysemous or homonymous), meaning they have different definitions depending on context.
Example: Consider the word "pen."
- It could refer to a tool used for writing.
- It could refer to an enclosure for keeping animals.
Human vs. Computer: A human translator utilizes context to instantly identify the correct meaning. A computer may fail to choose correctly if the surrounding context is not explicitly defined in a way the machine can parse.
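
The point about explicitly defined context can be sketched as a toy disambiguator. This is only an illustration under strong assumptions: the cue-word lists in SENSE_CUES and the function guess_sense are hypothetical and hand-written, and a real system would need far richer knowledge.

```python
import re

# Hypothetical cue words for each sense of "pen" (illustrative only).
SENSE_CUES = {
    "writing tool": {"write", "ink", "paper", "sign", "notebook"},
    "animal enclosure": {"sheep", "pigs", "farm", "cattle", "fence"},
}


def guess_sense(sentence: str) -> str:
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no matching cue word, the machine has nothing to go on.
    return best if scores[best] > 0 else "undetermined"


print(guess_sense("She wrote on the paper with a pen."))    # writing tool
print(guess_sense("The farmer kept the sheep in the pen."))  # animal enclosure
print(guess_sense("The box is in the pen."))                 # undetermined
```

The third sentence shows the failure mode described above: a human would infer the enclosure sense from the size of a box, but the sketch has no explicit cue to rely on.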

2. The Transfer Problem

Core Limitation: Inability to Learn from Experience.
Description: Even if a computer accurately identifies the intended meaning, expressing that meaning in a target language is difficult because languages use different grammatical and syntactic structures.
Structural Comparison Example (English vs. French):
- English: "He ran into the room."
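- French: "Il est entré dans la pièce en courant." (literally, "He entered the room running.")
Requirement: English expresses the manner of motion in the verb ("ran") and the path in the preposition ("into"), whereas French expresses the path in the verb ("est entré") and the manner in an adverbial ("en courant"). The machine must be programmed to recognize that these two very different structures convey the same underlying meaning.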

3. The Synthesis Problem

Core Limitation: Inability to Perform Common-Sense Reasoning.
Description: After meaning is transferred, the computer must synthesize and select the most natural-sounding expression in the target language. There are often multiple valid ways to say something, but only one may fit the specific tone or situation.
Example: "I miss you"
- French: "Tu me manques." (literally, "You are missing to me.")
- German: "Ich vermisse dich." or "Du fehlst mir."
Difficulty: The computer struggles to decide which phrasing is most appropriate based on social nuances, situational context, or specific tones that common sense would dictate to a human.

4. The Problem of Description

Core Limitation: Difficulty in Dealing with Combinatorial Explosive Problems.
Description: An effective MT system requires an immense repository of knowledge, including grammar rules, vocabulary, and cultural nuances. Collecting and organizing this information is a monumental task because language variation is nearly infinite.
Necessary Knowledge Components:
- Grammar Rules: For instance, complex systems of subject-verb agreement.
- Vocabulary: Managing idioms and figurative language.
- Cultural Knowledge: Determining how to translate jokes, puns, or local cultural references that have no direct equivalent.

Error Classification in Post-Editing (SAE J2450)

To ensure objective quality evaluation during the post-editing phase of machine translation, many organizations use standardized error categories. A prominent standard is SAE J2450, developed by the Society of Automotive Engineers. It is used extensively in technical and post-editing assessments and defines 7 key error types:

a. Wrong Term (WT): The use of an incorrect technical or non-technical term.
- Incorrect: "Open the door of the printer to replace the cartridge."
- Correct: "Open the cover of the printer to replace the cartridge."
- Reasoning: "Cover" is the specific technical term for a printer component.
b. Syntactic Error (SE): Faults in sentence structure or basic grammar.
- Incorrect: "I no want nothing."
- Correct: "I don't want anything."
- Reasoning: The incorrect version lacks the auxiliary "do" needed to form the negative and uses a double negative.
c. Omission (OM): Crucial information from the source text is missing in the output.
- Incorrect: "Put in the oven."
- Correct: "Put the cake in the oven."
- Reasoning: Without the noun "cake," the sentence is unclear.
d. Word-Structure or Agreement Error (SA): Problems with morphological forms or parts of speech that do not match.
- Incorrect: "She are a teacher."
- Correct: "She is a teacher."
- Reasoning: Failure of subject-verb agreement.
e. Misspelling (SP): Orthographic or typographical mistakes.
- Incorrect: "Turn off the machin."
- Correct: "Turn off the machine."
f. Punctuation Error (PE): Missing or incorrectly used punctuation marks.
- Incorrect: "Lets go."
- Correct: "Let's go."
- Reasoning: Missing apostrophe for the contraction.
g. Miscellaneous Error (ME): Errors that do not fall into the previous categories, such as unnecessary word additions.
- Incorrect: "The car it drives very fast."
- Correct: "The car drives very fast."
- Reasoning: Double subject usage is redundant.
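
One possible way to summarize a post-editing review is to tally the annotated errors by category code. The sketch below assumes a reviewer has already labeled each error with one of the seven two-letter codes listed above; the function tally_errors and the sample annotations are hypothetical. It only counts errors per category and does not attempt to reproduce the standard's own scoring method.

```python
from collections import Counter

# The seven category codes listed above.
CATEGORY_CODES = ("WT", "SE", "OM", "SA", "SP", "PE", "ME")


def tally_errors(annotations: list[str]) -> dict[str, int]:
    """Count annotated errors per category, including categories with zero hits."""
    unknown = [code for code in annotations if code not in CATEGORY_CODES]
    if unknown:
        raise ValueError(f"Unknown category codes: {unknown}")
    counts = Counter(annotations)
    return {code: counts[code] for code in CATEGORY_CODES}


# Hypothetical annotations from reviewing one translated document.
sample = ["WT", "SP", "WT", "PE"]
print(tally_errors(sample))
# {'WT': 2, 'SE': 0, 'OM': 0, 'SA': 0, 'SP': 1, 'PE': 1, 'ME': 0}
```

Rejecting unknown codes keeps the tally comparable across reviewers, since every error has to be assigned to one of the agreed categories.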

Localization, Internationalization, and Globalization

Localization (L10n)

Definition by LISA (Localization Industry Standards Association): "Making a product linguistically, technically, and culturally appropriate for its target market."
Etymology: The term comes from "locale," which refers to a region's unique traits. The abbreviation L10n is used because there are 10 letters between the 'L' and the 'n' in "Localization."
Key Activities:
- Translating text accurately.
- Adjusting cultural references like symbols and idioms.
- Updating technical specifications, such as changing paper sizes (e.g., to A4 in Germany) or currency (e.g., to the euro, €).
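
The abbreviation rule described under Etymology can be checked with a few lines of Python. The function name numeronym is a hypothetical helper introduced here for illustration; the same rule also produces the abbreviations for internationalization and globalization.

```python
def numeronym(word: str) -> str:
    """First letter, number of letters in between, last letter."""
    if len(word) < 3:
        return word  # nothing to abbreviate
    return f"{word[0]}{len(word) - 2}{word[-1]}"


for term in ("Localization", "internationalization", "globalization"):
    print(term, "->", numeronym(term))
# Localization -> L10n
# internationalization -> i18n
# globalization -> g11n
```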

Internationalization (i18n)

Definition: The process of designing and developing a product so it can support multiple languages and formats from the start without requiring hard-coded changes to the core software.
Etymology: The abbreviation i18n signifies the 18 letters between 'I' and 'n'.
Example: Implementing Unicode so a software program can handle Japanese characters alongside Latin script.
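
As a small illustration of why Unicode matters here, the following Python sketch stores Japanese and Latin characters in one string and compares encodings. The sample string and the helper can_encode are hypothetical choices made for this example.

```python
# One string mixing Latin script and Japanese katakana.
MIXED = "File ファイル"


def can_encode(text: str, encoding: str) -> bool:
    """Report whether the text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(len(MIXED))                   # 9 characters (code points)
print(len(MIXED.encode("utf-8")))   # 17 bytes: 5 for "File " + 3 per katakana
print(can_encode(MIXED, "utf-8"))   # True
print(can_encode(MIXED, "ascii"))   # False
```

A program that assumes one byte per character, or an ASCII-only encoding, would have to be reworked before it could be localized for Japan, which is the kind of change internationalization is meant to avoid.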

Globalization (g11n)

Definition: The comprehensive process of bringing a product to international markets. It acts as an umbrella term covering internationalization, localization, marketing, sales, and global support.
Etymology: The abbreviation g11n signifies the 11 letters between 'G' and 'n'.
Web Globalization: Building and managing multilingual websites, often for e-commerce, to reach international markets.
Example Workflow:
1. i18n: Build software that inherently supports multiple languages.
2. L10n: Adapt the software for Japan by implementing the Yen, local date formats, and Japanese text.
3. g11n: Market and sell the adapted product to the Japanese audience.
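
The division of labor between the first two steps can be sketched in Python. This is a simplified illustration, not a recommended architecture: the Locale class, the LOCALES table, and the formatting conventions in it are assumptions made for the example. The formatting functions play the role of i18n (no market-specific constants in the code), while each entry in LOCALES plays the role of L10n (the data for one market).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Locale:
    currency_symbol: str
    currency_decimals: int
    date_format: str  # strftime pattern


# L10n: one entry per target market (values are illustrative assumptions).
LOCALES = {
    "en_US": Locale("$", 2, "%m/%d/%Y"),
    "ja_JP": Locale("¥", 0, "%Y/%m/%d"),
}


# i18n: the code below works for any locale added to the table.
def format_price(amount: float, locale_id: str) -> str:
    loc = LOCALES[locale_id]
    return f"{loc.currency_symbol}{amount:,.{loc.currency_decimals}f}"


def format_date(day: date, locale_id: str) -> str:
    return day.strftime(LOCALES[locale_id].date_format)


print(format_price(1980, "ja_JP"))              # ¥1,980
print(format_price(19.8, "en_US"))              # $19.80
print(format_date(date(2024, 3, 5), "ja_JP"))   # 2024/03/05
print(format_date(date(2024, 3, 5), "en_US"))   # 03/05/2024
```

Supporting a further market then means adding one entry to the table rather than editing the formatting code.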

The Use of Corpora in Translation

Definition of a Corpus

A corpus is a collection of texts stored on a computer, typically utilized for automatic or semi-automatic linguistic analysis. A corpus consists of full texts or specific extracts.

John Sinclair (1992): Defines it as a collection of texts representing a language, dialect, or a specific subset of a language.
EAGLES Project (1996): Defines it as a collection of language samples selected and organized according to explicit linguistic criteria.

Current Applications of Corpora

Terminology Research: Relying on linguistic and statistical analysis to build dictionaries.
Tool Improvement: Enhancing translation memories and MT systems.
Translation Studies: Researching the actual product and process of translation.
Contrastive Linguistics: Scholarly comparison between different languages.
Translator Training: Creating specialized corpora to help students understand text types.
Professional Translation: Facilitating literary analysis, context retrieval, and comparing previous versions of translated texts.
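
Context retrieval, mentioned under Professional Translation, is often done with a keyword-in-context (KWIC) search. The following Python sketch shows the idea on a two-sentence toy corpus; the function kwic, its width parameter, and the sample sentences are hypothetical and far simpler than a real concordancer.

```python
import re


def kwic(texts: list[str], keyword: str, width: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    target = keyword.lower()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token == target:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits


corpus = [
    "Replace the cartridge before closing the printer cover.",
    "Open the cover of the printer to replace the cartridge.",
]

for left, word, right in kwic(corpus, "cover", width=2):
    print(f"{left:>12} [{word}] {right}")
```

Seeing a term together with its neighboring words helps a translator confirm how it is actually used in a given text type.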

Corpus Typology

Corpora are categorized into several types based on their composition and language count:

Parallel Corpus (Bi-/Multilingual):
- Mono-directional: Texts in Language A translated into Language B.
- Bi-directional: Texts in both Language A and Language B, with translations moving in both directions.
Comparable Corpus:
- Bilingual: Two collections of original texts (not translations) in Language A and Language B that share similar genres, topics, time frames, and functions.
Monolingual Corpus:
- Mono-SL: Contains translations from one specific source language alongside comparable original texts in the target language.
- Multi-SL: Contains translations from multiple source languages alongside comparable original texts.
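
The difference between mono-directional and bi-directional parallel corpora can be expressed as a small data structure. This Python sketch is an assumption-laden illustration: the AlignedPair class, the corpus_direction function, and the sample sentence pairs are hypothetical, and real corpora carry much more metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    source: str        # original text
    target: str        # its translation
    source_lang: str
    target_lang: str


def corpus_direction(pairs: list[AlignedPair]) -> str:
    """Classify a parallel corpus by the translation directions it contains."""
    directions = {(p.source_lang, p.target_lang) for p in pairs}
    if len(directions) == 1:
        return "mono-directional"
    if len(directions) == 2:
        (a, b), (c, d) = directions
        if (a, b) == (d, c):
            return "bi-directional"
    return "mixed"


en_to_fr = AlignedPair("I miss you.", "Tu me manques.", "en", "fr")
fr_to_en = AlignedPair("Bonjour.", "Hello.", "fr", "en")

print(corpus_direction([en_to_fr]))            # mono-directional
print(corpus_direction([en_to_fr, fr_to_en]))  # bi-directional
```

Recording the source language of each pair is what makes it possible to tell originals from translations later, which comparable and monolingual corpora also depend on.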

Specialized Concepts and Minority Languages

Sublanguage

Description: A smaller, specialized version of a full language used by experts in a specific field (e.g., engineers, doctors, weather forecasters).
Distinction: Unlike "controlled language," which is governed by strict, human-made rules, a sublanguage develops naturally over time through professional usage.

Minority Languages

A minority language is commonly defined as any language spoken by fewer than 50% of the population within a geopolitical area (such as a region or country). The European Charter for Regional or Minority Languages uses two criteria:

Numerical Size: The language is traditionally used by a group that is numerically smaller than the rest of the state's population.
Non-official Status: The language is different from the official language(s) of the state. In practice, such languages often have limited use in education, government administration, or institutional media.