Transformer Impact: Has Machine Translation Been Solved? | By The Digital Insider

Google recently announced the addition of 110 new languages to Google Translate as part of its 1,000 Languages Initiative, launched in 2022. The initiative began that year with 24 new languages; with the latest 110, Google Translate now covers 243 languages. This rapid expansion was possible thanks to Zero-Shot Machine Translation, a technique in which machine learning models learn to translate into a language without ever having seen example translations for it. Whether this advancement can be the ultimate solution to the challenge of machine translation remains to be seen; in the meantime, we can explore how it might be. But first, some history.

How Was It Before?

Statistical Machine Translation (SMT) 

This was the original method that Google Translate used. It relied on statistical models that analyzed large parallel corpora (collections of aligned sentence translations) to determine the most likely translations. The system first translated text into English as an intermediate step before converting it into the target language, cross-referencing phrases against extensive datasets such as United Nations and European Parliament transcripts. This differed from traditional approaches, which required compiling exhaustive grammatical rules; the statistical approach let the system adapt and learn from data without relying on static linguistic frameworks that could quickly become obsolete.

But this approach had disadvantages, too. First, Google Translate used phrase-based translation, where the system broke sentences down into phrases and translated them individually. This was an improvement over word-for-word translation but still suffered from awkward phrasing and context errors; it simply did not grasp nuance the way we do. Second, SMT depends heavily on parallel corpora, so any relatively rare language was hard to translate because there was not enough parallel data for it.
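To make the phrase-based idea concrete, here is a deliberately simplified sketch. The phrase table and its entries are invented for illustration; real SMT systems learn millions of scored phrase pairs from parallel corpora and search over segmentations and reorderings.

```python
# Toy illustration of phrase-based translation (not a real SMT system).
# We greedily match the longest known phrase, which reproduces the
# "awkward phrasing" problem described above.

PHRASE_TABLE = {             # invented example entries (English -> Spanish)
    ("kick", "the", "bucket"): "estirar la pata",
    ("the", "bucket"): "el cubo",
    ("kick",): "patear",
    ("the",): "el",
    ("red",): "rojo",
    ("house",): "casa",
}

def translate(words):
    out, i = [], 0
    while i < len(words):
        # Greedily try the longest phrase starting at position i.
        for length in range(len(words) - i, 0, -1):
            phrase = tuple(words[i:i + length])
            if phrase in PHRASE_TABLE:
                out.append(PHRASE_TABLE[phrase])
                i += length
                break
        else:
            out.append(words[i])  # unknown word: copy it through unchanged
            i += 1
    return " ".join(out)

print(translate("kick the bucket".split()))  # idiom survives only because it is in the table
print(translate("the red house".split()))    # word order stays English-like: "el rojo casa"
```

Even in this toy version, idioms are handled only when they happen to be in the table, and word order follows the source language rather than the target.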

Neural Machine Translation (NMT)

In 2016, Google switched to Neural Machine Translation. NMT uses deep learning models to translate entire sentences at once, producing more fluent and accurate translations; it works rather like a sophisticated multilingual assistant inside your computer. Using a sequence-to-sequence (seq2seq) architecture, NMT processes a sentence in one language to capture its meaning and then generates a corresponding sentence in the other language. This method learns from huge datasets, in contrast to Statistical Machine Translation, which relies on statistical models analyzing parallel corpora to determine the most probable translations. Unlike SMT, which focused on phrase-based translation and required considerable manual effort to develop and maintain linguistic rules and dictionaries, NMT's ability to process entire sequences of words lets it capture the nuanced context of language more effectively. As a result, translation quality improved across many language pairs, often reaching levels of fluency and accuracy comparable to human translators.
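As a rough sketch of the seq2seq idea, here is a minimal PyTorch encoder-decoder. The vocabulary sizes and dimensions are invented for illustration; this is not Google's production architecture.

```python
import torch
import torch.nn as nn

# Minimal seq2seq skeleton: an encoder reads the source sentence into a
# hidden representation, and a decoder generates the target sentence
# token by token from that representation.

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128  # invented sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src_ids):                    # (batch, src_len)
        _, hidden = self.rnn(self.embed(src_ids))  # hidden: (1, batch, HID)
        return hidden                              # summary of the source sentence

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt_ids, hidden):            # (batch, tgt_len), encoder state
        output, _ = self.rnn(self.embed(tgt_ids), hidden)
        return self.out(output)                    # scores over the target vocabulary

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (2, 7))          # a fake batch of 2 source sentences
tgt = torch.randint(0, TGT_VOCAB, (2, 5))          # corresponding target prefixes
logits = decoder(tgt, encoder(src))
print(logits.shape)                                # torch.Size([2, 5, 1200])
```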

In fact, traditional NMT models used Recurrent Neural Networks (RNNs) as the core architecture, since they are designed to process sequential data by maintaining a hidden state that evolves as each new input (word or token) is processed. This hidden state serves as a kind of memory that captures the context of the preceding inputs, letting the model learn dependencies over time. But RNNs were computationally expensive and difficult to parallelize effectively, which limited their scalability.
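The sequential bottleneck is easy to see in a toy update loop (plain NumPy, invented sizes): the hidden state at step t depends on the state at step t-1, so the steps cannot run in parallel.

```python
import numpy as np

# Toy vanilla-RNN update: each step mixes the current input with the
# previous hidden state, forcing strictly sequential processing.
hidden_size, emb_size, seq_len = 8, 4, 6
W_xh = np.random.randn(hidden_size, emb_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1

h = np.zeros(hidden_size)                     # initial "memory"
inputs = np.random.randn(seq_len, emb_size)   # embedded tokens of one sentence

for x_t in inputs:                            # must run in order: no parallelism over time
    h = np.tanh(W_xh @ x_t + W_hh @ h)        # new state = current input + past context

print(h)                                      # final state summarizes the whole sequence
```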

Introduction of Transformers 

In 2017, Google Research published the paper titled “Attention is All You Need,” introducing transformers to the world and marking a pivotal shift away from RNNs in neural network architecture.

Transformers rely solely on the attention mechanism, specifically self-attention, which allows neural machine translation models to focus selectively on the most relevant parts of input sequences. Unlike RNNs, which process the words of a sentence one after another, self-attention evaluates each token against the entire text, determining which other tokens are crucial for understanding its context. Because all words are computed simultaneously, transformers can capture both short- and long-range dependencies without relying on recurrent connections or convolutional filters.
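A minimal sketch of scaled dot-product self-attention follows the formula from "Attention Is All You Need"; the projection matrices and sizes here are invented for illustration.

```python
import torch
import torch.nn.functional as F

# Scaled dot-product self-attention over a whole sequence at once:
# every token attends to every other token via matrix multiplies,
# which is what makes the computation parallelizable.
def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v                # project tokens to queries/keys/values
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5      # similarity of every token to every other
    weights = F.softmax(scores, dim=-1)                # attention weights, each row sums to 1
    return weights @ V                                 # context-mixed token representations

seq_len, d_model = 5, 16                               # invented sizes
x = torch.randn(seq_len, d_model)                      # one sentence's token embeddings
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
print(self_attention(x, W_q, W_k, W_v).shape)          # torch.Size([5, 16])
```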

So by eliminating recurrence, transformers offer several key benefits:

  • Parallelizability: Attention can be computed in parallel across different segments of the sequence, which accelerates training on modern hardware such as GPUs.
  • Training Efficiency: Transformers require significantly less training time than traditional RNN- or CNN-based models while delivering better performance on tasks like machine translation.

Zero-Shot Machine Translation and PaLM 2

In 2022, Google added support for 24 new languages using Zero-Shot Machine Translation, marking a significant milestone in machine translation technology. It also announced the 1,000 Languages Initiative, aimed at supporting the world's 1,000 most spoken languages, and has now rolled out 110 more languages. Zero-shot machine translation enables translation without parallel data between source and target languages, eliminating the need to build training data for each language pair, a process that was previously costly, time-consuming, and for some language pairs simply impossible.

This advancement became possible because of the transformer architecture and its self-attention mechanism. The transformer's ability to learn contextual relationships across languages, combined with its scalability to handle many languages simultaneously, enabled the development of more efficient and effective multilingual translation systems. However, zero-shot models generally show lower quality than those trained on parallel data.
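Google's production system is not public, but the many-to-many multilingual idea can be sketched with an open model such as Meta's M2M-100, assuming the Hugging Face transformers library is installed; the model name and language codes below follow that model's documented usage and are not part of Google Translate.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# One multilingual model covers many language pairs, including pairs it
# never saw direct parallel data for. This uses the open M2M-100 model
# purely as an illustration; it is not Google's production system.
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "en"                                # source language code
encoded = tokenizer("Machine translation is not solved yet.", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("fr"),     # target language code
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```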

Then, building on the progress of transformers, Google introduced PaLM 2 in 2023, which paved the way for the release of 110 new languages in 2024. PaLM 2 significantly enhanced Translate's ability to learn closely related languages such as Awadhi and Marwadi (related to Hindi) and French creoles like Seychellois Creole and Mauritian Creole. Improvements in PaLM 2, such as compute-optimal scaling, enhanced datasets, and a refined design, enabled more efficient language learning and support Google's ongoing effort to broaden language coverage and accommodate diverse linguistic nuances.

Can we claim that the challenge of machine translation has been fully tackled with transformers?

The evolution described here took 18 years, from Google's adoption of SMT to the recent addition of 110 languages using Zero-Shot Machine Translation. This represents a huge leap that could reduce the need for extensive parallel corpus collection, a historically labor-intensive task the industry has pursued for over two decades. But asserting that machine translation is completely solved would be premature, for both technical and ethical reasons.

Current models still struggle with context and coherence, and they make subtle mistakes that can change the intended meaning of a text. These issues are most visible in longer, more complex sentences, where maintaining logical flow and grasping nuance is essential to a good result. Cultural nuances and idiomatic expressions also frequently get lost or distorted, producing translations that may be grammatically correct but lack the intended impact or sound unnatural.

Data for Pre-training: PaLM 2 and similar models are pre-trained on a diverse multilingual text corpus, larger than that of its predecessor PaLM. This equips PaLM 2 to excel at multilingual tasks and underscores the continued importance of traditional datasets for improving translation quality.

Domain-specific or Rare Languages: In specialized domains like legal, medical, or technical fields, parallel corpora ensure that models encounter domain-specific terminology and language nuances. Advanced models can struggle with domain-specific jargon or evolving language trends, which poses challenges for Zero-Shot Machine Translation. Low-resource languages are also still poorly translated, because there is not enough data to train accurate models for them.

Benchmarking: Parallel corpora remain essential for evaluating and benchmarking translation models, which is particularly challenging for languages lacking sufficient parallel data. Automated metrics like BLEU, BLEURT, and METEOR have limits when assessing nuance in translation quality beyond grammar. Human evaluators, in turn, are hindered by their own biases; qualified evaluators are also scarce, and finding a suitable bilingual evaluator for every language pair to catch subtle errors is difficult.
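For reference, corpus-level BLEU is straightforward to compute with the sacrebleu library; the hypothesis and reference strings below are invented, and the point of the example is precisely that the score needs human reference translations and can miss meaning-level errors.

```python
import sacrebleu

# BLEU compares n-gram overlap between system output and references.
# It still requires human reference translations, and a high score does
# not guarantee that nuance or idiom survived the translation.
hypotheses = ["The cat sits on the mat.", "He kicked the bucket yesterday."]
references = [["The cat is sitting on the mat.", "He died yesterday."]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # a single number that can hide meaning-level errors
```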

Resource Intensity: The resource-intensive nature of training and deploying LLMs remains a barrier, limiting accessibility for some applications or organizations.

Cultural Preservation: The ethical dimension is profound. As Isaac Caswell, a Google Translate Research Scientist, describes Zero-Shot Machine Translation: “You can think of it as a polyglot that knows lots of languages. But then additionally, it gets to see text in 1,000 more languages that isn’t translated. You can imagine if you’re some big polyglot, and then you just start reading novels in another language, you can start to piece together what it could mean based on your knowledge of language in general.” Yet it is crucial to consider the long-term impact on minority languages lacking parallel corpora, since cultural preservation may suffer when reliance shifts away from the languages themselves.


Published on The Digital Insider at https://is.gd/Kggovc.
