Self-attention is a mechanism in artificial intelligence and machine learning, particularly in neural networks, that allows a model to weigh the importance of different elements within the same input sequence. Self-attention, also known as intra-attention, identifies which parts of an input sequence are most relevant to each other, regardless of their positional distance. The mechanism was popularized by the Transformer model, a neural network architecture widely used in natural language processing (NLP) and other sequence-based applications. Self-attention enables models to capture complex dependencies within a sequence, improving the model's understanding of relationships and its predictive accuracy.
The self-attention mechanism operates through three main steps:
1. Each element of the input sequence is projected into three vectors, a query, a key, and a value, using learned weight matrices.
2. Attention scores are computed by comparing each query with every key (typically via a dot product), and the scores are normalized with a softmax function to produce attention weights.
3. Each output element is computed as the weighted sum of the value vectors, with the attention weights determining how much each position contributes.
This results in each element of the sequence containing a summary of the relevant information across the entire sequence, dynamically emphasizing the parts deemed more important.
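As a concrete illustration of these three steps, here is a minimal NumPy sketch (not from the original text); the weight matrices are random placeholders standing in for learned parameters, and the sequence length and embedding size are chosen arbitrarily:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy input: a sequence of 4 elements, each an 8-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))

# Step 1: project each element into query, key, and value vectors
# with weight matrices (random placeholders instead of learned weights).
d_model, d_k = 8, 8
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: compute attention scores between every pair of positions
# and normalize each row with a softmax to obtain attention weights.
scores = Q @ K.T                 # shape (4, 4)
weights = softmax(scores, axis=-1)

# Step 3: each output element is a weighted sum of the value vectors.
output = weights @ V             # shape (4, d_k)
print(output.shape)
```

Each row of `output` now mixes information from the whole sequence, weighted by how strongly that position attends to every other position.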
Self-attention offers several distinguishing characteristics that make it essential in sequence-based models: it relates elements regardless of their distance in the sequence, so long-range dependencies are captured as easily as local ones; its weights are computed dynamically for each input, so the emphasis adapts to context; and because all positions are compared directly rather than processed step by step, the computation can be parallelized across the sequence.
The scaled dot-product attention variant is commonly used to stabilize the self-attention mechanism when the dimensionality of the input is high. Without scaling, the dot products of query and key vectors grow large in magnitude, pushing the softmax into regions where its gradients become vanishingly small and training is destabilized. Scaled dot-product attention therefore introduces a scaling factor, dividing each attention score by the square root of the dimension of the key vectors, denoted \( d_k \):
\( \text{ScaledScore}_{i,j} = \frac{Q_i K_j^\top}{\sqrt{d_k}} \)
This scaling factor reduces the magnitude of the scores, making the softmax function smoother and gradients more stable.
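The sketch below expresses this formula as a standalone function, again using NumPy; the function name, argument shapes, and dimensions are illustrative assumptions rather than part of the original text:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scale before the softmax
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V

# Example: 4 positions, key/value dimension 16.
rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 16))
K = rng.normal(size=(4, 16))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```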
To capture diverse patterns in a sequence, multi-head attention extends self-attention by creating multiple attention heads. Each head independently computes self-attention with distinct learned transformations, enabling the model to focus on different aspects of the sequence. The results of each head are concatenated and linearly transformed to produce the final output of the self-attention layer:
\( \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h) \, W_O \)
where each \( \text{head}_i = \text{Attention}(Q W_{Q_i}, K W_{K_i}, V W_{V_i}) \), and \( W_O \) is a learned output projection weight matrix.
Multi-head attention provides the model with multiple representation subspaces, which allows it to capture more complex relationships within the data.
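A minimal sketch of this construction follows, assuming NumPy, random placeholder projections in place of learned weights, and a model dimension divisible by the number of heads; names and shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Minimal multi-head self-attention over X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads

    heads = []
    for _ in range(num_heads):
        # Each head has its own projections (random placeholders here).
        W_Q = rng.normal(size=(d_model, d_k))
        W_K = rng.normal(size=(d_model, d_k))
        W_V = rng.normal(size=(d_model, d_k))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        # Scaled dot-product attention inside each head.
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        heads.append(weights @ V)

    # Concatenate the heads and apply the output projection W_O.
    concat = np.concatenate(heads, axis=-1)      # (seq_len, d_model)
    W_O = rng.normal(size=(d_model, d_model))
    return concat @ W_O

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 32))
print(multi_head_attention(X, num_heads=4, rng=rng).shape)  # (6, 32)
```

Because each head uses its own projections, different heads can attend to different positions and relationships before their outputs are recombined by \( W_O \).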
Self-attention was first applied extensively in natural language processing, where it addressed issues faced by recurrent neural networks, such as limited scalability with long sequences and ineffective handling of long-range dependencies. With self-attention, models can attend to distant parts of a sentence as easily as adjacent parts, greatly improving context comprehension in tasks like language translation, text summarization, and sentiment analysis. Self-attention has since been adapted to other domains, including computer vision (e.g., Vision Transformers), speech recognition and synthesis, bioinformatics (e.g., protein structure prediction), and recommender systems.
Self-attention's ability to emphasize relevant information based on learned relationships within the input has made it foundational in modern deep learning architectures.