The Transformer is a deep learning architecture introduced in the seminal 2017 paper *Attention Is All You Need* by Vaswani et al. Designed primarily for natural language processing (NLP) tasks, it has enabled highly effective machine translation, text generation, and language understanding. Transformers depart from traditional sequence models, such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), by replacing recurrence with a mechanism called *self-attention*, which allows the model to capture long-range dependencies in data more efficiently. This architectural shift has led to significant advances in NLP and has influenced other domains, including computer vision and speech processing.
The Transformer model is composed of an encoder-decoder structure, where each part consists of layers built from self-attention and feed-forward networks. The encoder processes the input sequence into intermediate representations, which the decoder then uses to produce the output sequence. Each stack is made of identical layers: an encoder layer contains a self-attention mechanism followed by a feed-forward neural network, while a decoder layer additionally includes a cross-attention sub-layer that attends to the encoder's output. A minimal sketch of this layout appears below.
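As an illustration of this encoder-decoder layout, the sketch below instantiates PyTorch's built-in `nn.Transformer` module, which stacks exactly these self-attention and feed-forward layers. The dimensions mirror the base configuration from the paper (6 encoder layers, 6 decoder layers, `d_model=512`, 8 heads), and the random tensors are stand-ins for already-embedded source and target sequences rather than part of the original description.

```python
import torch
import torch.nn as nn

# Minimal sketch: PyTorch's built-in Transformer with the base configuration
# from the paper (6 encoder layers, 6 decoder layers, d_model=512, 8 heads).
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

# Stand-ins for already-embedded token sequences (batch, length, d_model).
src = torch.randn(2, 10, 512)  # input sequence fed to the encoder
tgt = torch.randn(2, 7, 512)   # shifted output sequence fed to the decoder
out = model(src, tgt)          # decoder output, shape (2, 7, 512)
```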
Self-attention is the core innovation of the Transformer, enabling the model to weigh the relevance of each token in a sequence with respect to others. It operates by projecting each token in the sequence into three vectors: *query (Q)*, *key (K)*, and *value (V)*. The attention score between any two tokens is computed by taking the dot product of their query and key vectors, which is then scaled and passed through a softmax function to obtain the attention weights. The weighted sum of the values gives the final self-attention output for each token.
The self-attention calculation can be expressed as follows:

`Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V`

where `d_k` is the dimensionality of the key vectors; dividing by `√d_k` keeps the dot products in a range where the softmax produces useful gradients.
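The following is a minimal sketch of this computation for a single attention head, written with PyTorch tensors; the function name and the toy dimensions are illustrative, not taken from the paper.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Single-head self-attention: softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # pairwise query-key dot products, scaled
    weights = torch.softmax(scores, dim=-1)        # attention weights, each row sums to 1
    return weights @ v                             # weighted sum of value vectors

# Toy example: a sequence of 5 tokens, each projected to 64-dimensional Q, K, V.
q = torch.randn(5, 64)
k = torch.randn(5, 64)
v = torch.randn(5, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (5, 64)
```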
Since the Transformer architecture has no inherent sense of sequence order (unlike RNNs), it uses positional encodings to inject information about token positions. Positional encodings are added to the input embeddings at each position so the model can differentiate token order. The positional encoding for a token at position `pos` and embedding dimension index `i` is given by:

`PE(pos, 2i) = sin(pos / 10000^(2i / d_model))`
`PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))`

Here, `pos` is the token's position in the sequence, `i` indexes pairs of embedding dimensions, and `d_model` is the dimensionality of the embeddings, so each dimension corresponds to a sinusoid with a different wavelength.
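A minimal sketch of these sinusoidal encodings, again in PyTorch; the function name and shapes are illustrative, and `d_model` is assumed to be even so the sine and cosine channels interleave cleanly.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) table of sin/cos positional encodings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dimension indices
    angles = pos / (10000.0 ** (i / d_model))                      # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions use cosine
    return pe

# Added to the token embeddings before the first layer, e.g.:
# embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
pe = sinusoidal_positional_encoding(seq_len=10, d_model=512)  # shape (10, 512)
```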
The Transformer’s objective during training is to maximize the likelihood of generating the correct output sequence `Y` given the input sequence `X`. For each token `y_t` in the output sequence, the model computes the conditional probability `P(y_t | y_1, ..., y_(t-1), X)` using the softmax function on the final decoder output.
The loss function `L` for a given training example, typically cross-entropy loss, is given by:
`L = -Σ log P(y_t | y_1, ..., y_(t-1), X)`
where the sum is taken over all tokens in the target sequence `Y`.
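The sketch below computes this token-level cross-entropy on made-up decoder logits; the shapes are illustrative, and the call to `torch.nn.functional.cross_entropy` (which combines the softmax and the negative log-likelihood sum) is shown alongside the manual form of the formula above.

```python
import torch
import torch.nn.functional as F

# Made-up decoder outputs: one row of vocabulary logits per target position.
target_len, vocab_size = 7, 1000
logits = torch.randn(target_len, vocab_size)            # final decoder outputs before softmax
targets = torch.randint(0, vocab_size, (target_len,))   # correct token ids y_1 ... y_T

# Manual form of L = -Σ log P(y_t | y_1, ..., y_(t-1), X):
log_probs = F.log_softmax(logits, dim=-1)
loss_manual = -log_probs[torch.arange(target_len), targets].sum()

# Equivalent library call (softmax + negative log-likelihood in one step).
loss = F.cross_entropy(logits, targets, reduction="sum")
```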
The Transformer architecture has become foundational in NLP and beyond: because self-attention processes all positions in a sequence in parallel rather than one step at a time, it removes the sequential bottleneck of RNNs and scales efficiently to large datasets and long sequences on modern hardware. It has led to the development of widely used models, including BERT, GPT, and T5, all of which leverage this architecture's efficiency and effectiveness in learning contextual information from vast amounts of text data.