Layer normalization is a normalization technique used in neural networks to stabilize and accelerate training by normalizing activations across features for each individual sample. It is widely used in transformer architectures, recurrent neural networks (RNNs), and generative AI models—especially where batch size may be small or variable.
Core Characteristics
- Definition and Purpose
Layer normalization standardizes layer activations by computing the mean and variance across features for a single training example. This improves model stability and reduces internal covariate shift, enabling faster and more reliable training.
- Mathematical Formulation
For an input vector $x = [x_1, x_2, \dots, x_n]$:
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
Normalization: $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
Affine Transformation: $y_i = \gamma \hat{x}_i + \beta$
where $\gamma$ and $\beta$ are learnable parameters.
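As a concrete illustration of the formulas above, here is a minimal NumPy sketch of layer normalization for a single feature vector (the function name, the example values, and $\epsilon = 10^{-5}$ are illustrative assumptions, not from the original text):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalize one feature vector x following the formulas above:
    per-sample mean and variance, then a learnable affine transform."""
    mu = x.mean()                        # mean over the feature dimension
    var = ((x - mu) ** 2).mean()         # variance over the feature dimension
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # learnable scale and shift

x = np.array([1.0, 2.0, 3.0, 4.0])
gamma = np.ones_like(x)                  # identity scale
beta = np.zeros_like(x)                  # zero shift
print(layer_norm(x, gamma, beta))        # roughly zero-mean, unit-variance output
```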
- Advantages Over Batch Normalization
Layer normalization does not depend on batch size, making it more effective for variable-length sequences, small-batch training, or streaming inference. It is also easier to incorporate into sequential architectures where normalization must occur at every time step.
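A small PyTorch check (using the library's built-in nn.LayerNorm; the tensor shapes are arbitrary) shows this batch-size independence: a sample is normalized identically whether it is processed alone or inside a larger batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16)                  # batch of 8 samples, 16 features each
ln = nn.LayerNorm(16)

# Statistics are computed per sample, so batch size does not affect the result.
alone = ln(x[:1])                       # first sample normalized by itself
in_batch = ln(x)[:1]                    # same sample normalized within the batch
print(torch.allclose(alone, in_batch))  # True
```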
- Implementation in Neural Networks
Layer normalization is built into modern NLP and transformer models such as GPT, BERT, T5, and Llama. It is applied before or after key components (attention blocks, feed-forward layers) to help maintain stable gradients.
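For example, a pre-norm transformer block applies layer normalization before each sublayer, with residual connections wrapping both. The PyTorch sketch below is an illustrative assumption of such a block (module name and dimensions are hypothetical); production models like GPT or BERT differ in details:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Illustrative pre-norm transformer block: LayerNorm is applied
    before the attention and feed-forward sublayers."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)                  # normalize before attention
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                   # residual around attention
        x = x + self.ff(self.norm2(x))     # normalize before feed-forward, residual
        return x

block = PreNormBlock()
out = block(torch.randn(2, 10, 512))       # (batch, sequence, features)
print(out.shape)
```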
- Computational Efficiency
Complexity is O(n), proportional to the number of features. Unlike batch normalization, there is no need to aggregate statistics across samples, making it efficient in distributed and online learning environments.
- Usage Scenarios
Layer normalization is crucial in transformer architectures, recurrent networks that normalize at every time step, and training or inference settings with small or variable batch sizes, such as streaming inference.
- Relation to Other Normalization Techniques
| Technique | Normalizes Across | Best For |
| --- | --- | --- |
| Batch Normalization | Batch dimension | Large-batch training |
| Layer Normalization | Feature dimension | Transformers, RNNs |
| Instance Normalization | Per-instance, per-channel | Style transfer |
| Group Normalization | Channel groups | Computer vision tasks |
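To make the "Normalizes Across" column concrete, the NumPy sketch below (the normalize helper and the 4-D tensor shape are illustrative assumptions) computes each variant simply by choosing which axes the mean and variance are taken over:

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Shared core of the normalization variants (without the learnable
    scale/shift): normalize x over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(4, 8, 16, 16)              # (batch, channels, height, width)

batch_norm    = normalize(x, axes=(0, 2, 3))   # across the batch, per channel
layer_norm    = normalize(x, axes=(1, 2, 3))   # across all features, per sample
instance_norm = normalize(x, axes=(2, 3))      # per sample, per channel
# Group norm first reshapes channels into groups, then normalizes within each group.
```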