Transformers in Machine Learning: introduction (1/5)
Since their introduction, transformers have revolutionized the field of artificial intelligence, especially in natural language processing (NLP) applications. This architecture, introduced by Vaswani et al. in their 2017 paper “Attention Is All You Need,” has become the cornerstone of technologies such as GPT and BERT. But what exactly is a transformer, and how does it work? This article explains the key concepts of transformers and illustrates how they work with a practical example: translating English to French.
1. What is a transformer?
Transformers are a neural network architecture designed to process sequences of data, such as sentences or documents. Their main innovation lies in the attention mechanism, which allows the model to focus on relevant parts of a sequence while ignoring less useful information.
Difference from previous models:
- RNNs (recurrent neural networks): These process words one at a time sequentially. For example, to analyze the sentence “The cat sits on the mat,” they must read “The,” then “cat,” and so on, which slows down processing for long sequences.
- LSTMs (long short-term memory): These improved the ability to manage long-term dependencies but still suffered from sequential processing.
- Transformers: In contrast, they analyze the entire sentence in parallel. For instance, each word considers all other words in the sentence, capturing relationships like “cat” being connected to “mat” in our example.
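To make this parallelism concrete, here is a minimal sketch of scaled dot-product attention, the core operation behind a transformer, written in PyTorch; the 6×4 matrix of token vectors is made up purely for illustration (real models use learned projections and far larger dimensions):
import torch
import torch.nn.functional as F
# one made-up vector per token of "The cat sits on the mat" (6 tokens, 4 dimensions each)
x = torch.randn(6, 4)
# in a real transformer, queries, keys and values come from learned linear projections of x
q, k, v = x, x, x
# every token is compared with every other token in a single matrix product
scores = q @ k.T / (k.shape[-1] ** 0.5)   # shape (6, 6): one row of attention scores per token
weights = F.softmax(scores, dim=-1)       # each row sums to 1
output = weights @ v                      # each token becomes a weighted mix of all tokens
print(weights.shape, output.shape)        # torch.Size([6, 6]) torch.Size([6, 4])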
2. Structure of a transformer
A transformer consists of two main parts:
- The encoder: Transforms an input sequence into a rich internal representation.
- The decoder: Uses this representation to produce an output, such as a translation or a summary.
Component details:
- Multi-head attention mechanism: Allows focusing on different parts of the sequence in parallel. For example, in a complex sentence, one “head” may analyze grammatical relationships while another focuses on the overall context (a minimal sketch combining these components follows this list).
- Fully connected feed-forward networks: Apply nonlinear transformations to better capture relationships.
- Normalization and residual connections: Help stabilize training by preserving essential information.
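As a rough sketch (not the exact layout of any particular model), a single encoder block combining these three components could look like this in PyTorch:
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # a simplified encoder block: multi-head attention and feed-forward, each followed by residual + norm
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention over the whole sequence
        x = self.norm1(x + attn_out)       # residual connection + normalization
        x = self.norm2(x + self.ff(x))     # feed-forward network + residual + normalization
        return x

block = EncoderBlock()
tokens = torch.randn(1, 6, 512)            # a batch of 1 sentence with 6 tokens of dimension 512
print(block(tokens).shape)                 # torch.Size([1, 6, 512])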
3. Practical example: Automatic translation
Let’s use the example of translating a sentence from English (“The cat sits on the mat”) to French (“Le chat est assis sur le tapis”). Here are the detailed steps:
3.1. Word representation (tokenization)
The first step is to convert words into machine-readable vectors.
- Tokenization: The sentence is divided into “tokens” (“The,” “cat,” “sits,” etc.). If a word is not in the vocabulary, it can be split into subwords (for example, “sits” into “sit” and “s”).
- Embedding: Each token is converted into a dense vector (a numerical representation) encoding its meaning. For instance, “cat” could correspond to a vector like [0.2, 0.8, 0.1, …].
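With the same Hugging Face model used in section 4 below, both steps can be inspected directly (the exact subword splits and vector values depend on the model):
import torch
from transformers import MarianMTModel, MarianTokenizer
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
# tokenization: the sentence becomes subword tokens, then integer ids
tokens = tokenizer.tokenize("The cat sits on the mat.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, ids)
# embedding: each id is mapped to a dense vector of size model.config.d_model
embeddings = model.get_input_embeddings()(torch.tensor([ids]))
print(embeddings.shape)   # (1, number_of_tokens, d_model)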
3.2. Adding positional information
Transformers have no innate understanding of word order. To address this, “positional embeddings” are added to indicate the position of each word in the sentence. For example, the positional vector for “The” could be [0.0, 1.0, 0.5, …] to signify it is the first word.
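The original paper computes these positional vectors with fixed sinusoidal functions (other models learn them instead); here is a minimal sketch of that formula:
import torch
def positional_encoding(seq_len, d_model):
    # sinusoidal positional encoding from "Attention Is All You Need"
    positions = torch.arange(seq_len).unsqueeze(1)       # (seq_len, 1)
    dims = torch.arange(d_model).unsqueeze(0)            # (1, d_model)
    angles = positions / (10000 ** ((2 * (dims // 2)) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles[:, 0::2])             # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles[:, 1::2])             # odd dimensions use cosine
    return pe
pe = positional_encoding(seq_len=6, d_model=512)
print(pe.shape)   # torch.Size([6, 512]): one positional vector per word, added to its embedding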
3.3. The encoder
The encoder processes the sequence of vectors and produces enriched representations that capture relationships between words.
- Example: The word “cat” is enriched with information about its relationship with “sits” and “mat.” Thus, the vector for “cat” reflects its role as the subject that “sits on” something.
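These enriched representations can be extracted from the Hugging Face model used in section 4 by running its encoder alone (a sketch; the individual numbers are not meant to be read by hand):
import torch
from transformers import MarianMTModel, MarianTokenizer
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
inputs = tokenizer("The cat sits on the mat.", return_tensors="pt")
with torch.no_grad():
    encoder_outputs = model.get_encoder()(**inputs)
# one contextualized vector per token: the vector for "cat" now also depends on "sits" and "mat"
print(encoder_outputs.last_hidden_state.shape)   # (1, number_of_tokens, d_model)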
3.4. The decoder
The decoder generates the translation word by word, using:
- The representations produced by the encoder.
- The words predicted previously.
Example:
- The decoder starts with a special start-of-sequence token and predicts “Le.”
- Then, based on “Le” and the encoder’s representations, it predicts “chat.”
- This process continues until the model produces an end-of-sequence token.
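This word-by-word generation can be reproduced by hand with a simple greedy loop, a rough sketch of what model.generate does internally (without beam search) for the model used in section 4:
import torch
from transformers import MarianMTModel, MarianTokenizer
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
inputs = tokenizer("The cat sits on the mat.", return_tensors="pt")
# the decoder starts from the model's start token and appends one predicted token at a time
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(30):
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # pick the most probable next token
    decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:           # stop at the end-of-sequence token
        break
print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))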
3.5. Final result
The final output is “Le chat est assis sur le tapis.” Each word is selected based on relationships captured by the attention mechanism.
4. Python use case
Here’s a simple example using Hugging Face’s transformers library to translate a sentence:
from transformers import MarianMTModel, MarianTokenizer
# load the pretrained model and its tokenizer
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
# the text to translate
text = "The cat sits on the mat."
# tokenization: convert the sentence into tensors of token ids
encoded_text = tokenizer([text], return_tensors="pt")
# translation: generate the French token ids
translated = model.generate(**encoded_text)
# decoding: convert the generated ids back into text
translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(translated_text[0])
Output: “Le chat est assis sur le tapis.”
Conclusion
Transformers represent a major leap forward in machine learning, enabling models to tackle complex tasks with impressive efficiency. By understanding their key concepts and applying them to real-world cases, like automatic translation, we can better appreciate their impact and potential. The next step? Exploring practical implementations with libraries like PyTorch or TensorFlow.
Sirine Amrane