Self-attention is a mechanism in artificial intelligence and machine learning, particularly in neural networks, that allows a model to weigh the importance of different elements within the same input sequence. Self-attention, also known as intra-attention, identifies which parts of an input sequence are most relevant to each other, regardless of their positional distance. The mechanism was popularized by the Transformer model, a neural network architecture widely used in natural language processing (NLP) and other sequence-based applications. Self-attention enables models to capture complex dependencies within a sequence, improving the model's understanding of relationships and its predictive accuracy.
The self-attention mechanism operates through three main steps:
1. Each element of the input sequence is projected into three vectors, a query, a key, and a value, using learned weight matrices.
2. Attention scores are computed by comparing each query with every key (typically via a dot product), and the scores are normalized with a softmax function to produce attention weights.
3. Each output element is computed as the weighted sum of the value vectors, with the attention weights determining how much each position contributes.
This results in each element of the sequence containing a summary of the relevant information across the entire sequence, dynamically emphasizing the parts deemed more important.
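As a concrete illustration of these three steps, here is a minimal NumPy sketch (not from the original text); the weight matrices are random placeholders standing in for learned parameters, and the sequence length and embedding size are chosen arbitrarily:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy input: a sequence of 4 elements, each an 8-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))

# Step 1: project each element into query, key, and value vectors
# with weight matrices (random placeholders instead of learned weights).
d_model, d_k = 8, 8
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: compute attention scores between every pair of positions
# and normalize each row with a softmax to obtain attention weights.
scores = Q @ K.T                 # shape (4, 4)
weights = softmax(scores, axis=-1)

# Step 3: each output element is a weighted sum of the value vectors.
output = weights @ V             # shape (4, d_k)
print(output.shape)
```

Each row of `output` now mixes information from the whole sequence, weighted by how strongly that position attends to every other position.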
Self-attention offers several distinguishing characteristics that make it essential in sequence-based models: it relates elements regardless of their distance in the sequence, so long-range dependencies are captured as easily as local ones; its weights are computed dynamically for each input, so the emphasis adapts to context; and because all positions are compared directly rather than processed step by step, the computation can be parallelized across the sequence.
The scaled dot-product attention variant is commonly used to stabilize the self-attention mechanism when the dimensionality of the input is high. Without scaling, the dot products of query and key vectors grow large in magnitude, pushing the softmax into regions where its gradients become vanishingly small and training is destabilized. Scaled dot-product attention therefore introduces a scaling factor, dividing each attention score by the square root of the dimension of the key vectors, denoted \( d_k \):
\( \text{ScaledScore}_{i,j} = \frac{Q_i K_j^\top}{\sqrt{d_k}} \)
This scaling factor reduces the magnitude of the scores, making the softmax function smoother and gradients more stable.
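The sketch below expresses this formula as a standalone function, again using NumPy; the function name, argument shapes, and dimensions are illustrative assumptions rather than part of the original text:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scale before the softmax
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V

# Example: 4 positions, key/value dimension 16.
rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 16))
K = rng.normal(size=(4, 16))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```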
To capture diverse patterns in a sequence, multi-head attention extends self-attention by creating multiple attention heads. Each head independently computes self-attention with distinct learned transformations, enabling the model to focus on different aspects of the sequence. The results of each head are concatenated and linearly transformed to produce the final output of the self-attention layer:
\( \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h) \, W_O \)
where each \( \text{head}_i = \text{Attention}(Q W_{Q_i}, K W_{K_i}, V W_{V_i}) \), and \( W_O \) is a learned output projection weight matrix.
Multi-head attention provides the model with multiple representation subspaces, which allows it to capture more complex relationships within the data.
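A minimal sketch of this construction follows, assuming NumPy, random placeholder projections in place of learned weights, and a model dimension divisible by the number of heads; names and shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Minimal multi-head self-attention over X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads

    heads = []
    for _ in range(num_heads):
        # Each head has its own projections (random placeholders here).
        W_Q = rng.normal(size=(d_model, d_k))
        W_K = rng.normal(size=(d_model, d_k))
        W_V = rng.normal(size=(d_model, d_k))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        # Scaled dot-product attention inside each head.
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        heads.append(weights @ V)

    # Concatenate the heads and apply the output projection W_O.
    concat = np.concatenate(heads, axis=-1)      # (seq_len, d_model)
    W_O = rng.normal(size=(d_model, d_model))
    return concat @ W_O

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 32))
print(multi_head_attention(X, num_heads=4, rng=rng).shape)  # (6, 32)
```

Because each head uses its own projections, different heads can attend to different positions and relationships before their outputs are recombined by \( W_O \).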
Self-attention was first applied extensively in natural language processing, where it addressed issues faced by recurrent neural networks, such as limited scalability with long sequences and ineffective handling of long-range dependencies. With self-attention, models can attend to distant parts of a sentence as easily as adjacent parts, greatly improving context comprehension in tasks like language translation, text summarization, and sentiment analysis. Self-attention has since been adapted to other domains, including computer vision (e.g., Vision Transformers), speech recognition and synthesis, bioinformatics (e.g., protein structure prediction), and recommender systems.
Self-attention's ability to emphasize relevant information based on learned relationships within the input has made it foundational in modern deep learning architectures.