t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE, or t-Distributed Stochastic Neighbor Embedding, is a nonlinear dimensionality reduction technique used primarily for visualizing high-dimensional data. Developed by Laurens van der Maaten and Geoffrey Hinton in 2008, t-SNE maps high-dimensional data points into two or three dimensions, making complex structures in the data visible. Unlike linear techniques such as PCA, t-SNE focuses on preserving the local structure within data clusters, which makes it highly effective for exploring complex patterns in machine learning and data science.
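Before walking through the algorithm, here is a minimal usage sketch with scikit-learn's TSNE; the digits dataset, the perplexity value, and the fixed seed are illustrative choices for demonstration, not requirements of the method.

```python
# Minimal t-SNE sketch with scikit-learn (assumed installed).
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 dimensions

# Map the 64-dimensional points down to 2D for plotting.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (1797, 2)
```

The rest of this entry explains what that fit_transform call is doing under the hood.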

Core Characteristics of t-SNE

  1. Dimensionality Reduction with a Focus on Local Structure:
    • t-SNE preserves local structures, meaning it maintains the relative distances between nearby points, creating clusters that reflect local similarities within high-dimensional data. This property makes t-SNE especially suitable for visualizing datasets with intricate clustering or non-linear structures.  
    • By emphasizing pairwise similarity at a local scale, t-SNE reveals clusters within the data, making patterns and groupings visually apparent without explicitly preserving global distances.
  2. Stochastic Neighbor Embedding and Similarity in High Dimensions:
    • t-SNE starts by computing pairwise similarities between data points in the high-dimensional space. For each point i, it calculates the conditional probability p_{j|i} that point j would be picked as a neighbor of point i, using a Gaussian centered at x_i:
    • p_{j|i} = exp(-||x_i - x_j||² / (2 * σ_i²)) / Σ_{k≠i} exp(-||x_i - x_k||² / (2 * σ_i²))
      where x_i and x_j are data points, ||x_i - x_j|| is their Euclidean distance, and σ_i is a per-point bandwidth that adjusts the scale of local similarity. The conditional probabilities are then symmetrized into joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n); a sketch of this computation follows the list.
  3. Mapping Similarities to Low Dimensions with t-Distribution:
    • In the lower-dimensional space (typically 2D or 3D), t-SNE creates a new set of points, y_i, intended to maintain the same local relationships. It defines a new pairwise similarity, q_ij, using a Student's t-distribution with one degree of freedom, which has heavy tails:
      q_ij = (1 + ||y_i - y_j||²)⁻¹ / Σ_{k≠l} (1 + ||y_k - y_l||²)⁻¹
    • The heavy tails of the t-distribution counteract the crowding problem (common in other dimensionality reduction techniques): points that are only moderately far apart in high dimensions can be placed much farther apart in the embedding, so points do not become too densely packed in low-dimensional space.
  4. Cost Function: Kullback-Leibler Divergence:
    • t-SNE optimizes the positions of the low-dimensional points, y_i, by minimizing the Kullback-Leibler (KL) divergence between the high-dimensional and low-dimensional distributions. The KL divergence is calculated as:    
      KL(P || Q) = Σ_{i≠j} p_ij * log(p_ij / q_ij)
      where p_ij and q_ij are the probabilities for the pair of points i and j in high and low dimensions, respectively.
    • Because the KL divergence heavily penalizes representing a large p_ij (close neighbors) with a small q_ij, this objective prioritizes preserving local similarity: the low-dimensional embedding is pushed to replicate the neighborhood structure of the original data, while long-range distances carry little weight.
  5. Perplexity and Hyperparameter Tuning:
    • Perplexity is the key hyperparameter in t-SNE: it controls the effective number of neighbors considered for each point, and thus the granularity of the clustering in the output. Higher perplexity values take more neighbors into account, producing fewer but larger clusters, while lower values emphasize smaller, more local clusters.
    • Perplexity typically ranges from 5 to 50 and is selected based on the dataset's size and complexity. It acts as a smooth measure of the number of nearby data points used to compute similarity, balancing local and global structure in the embedding; in practice, each σ_i is tuned by binary search until the entropy of the conditional distribution matches the chosen perplexity, as shown in the first sketch after this list.
  6. Gradient Descent Optimization:
    • The positions of the low-dimensional points are optimized with gradient descent to minimize the KL divergence. Starting from random initial positions, the algorithm iteratively updates each point's position to reduce the divergence between the high- and low-dimensional similarity distributions; a minimal version of this loop appears in the second sketch below.
    • The iteration continues until the cost function converges, so that the relative positions of points in the reduced space reflect the local similarities of the original high-dimensional data.
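The following NumPy sketch, written for this entry rather than taken from any particular library, illustrates steps 2 and 5 together: a binary search tunes each σ_i until the entropy of the conditional distribution matches log(perplexity), and the conditional probabilities are then symmetrized. The function names, tolerance, and iteration counts are arbitrary, and numerical safeguards are kept to a minimum.

```python
import numpy as np

def conditional_p(sq_dists, sigma):
    """Conditional probabilities p_{j|i} for one point i, given its
    squared distances to every other point."""
    p = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return p / p.sum()

def input_affinities(X, perplexity=30.0, tol=1e-5):
    """Symmetric joint probabilities p_ij, with each sigma_i found by
    binary search so the entropy of p_{.|i} matches log(perplexity)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    target = np.log(perplexity)                     # desired entropy (nats)
    P = np.zeros((n, n))
    for i in range(n):
        d = np.delete(D[i], i)        # drop the self-distance
        lo, hi = 1e-10, 1e10
        for _ in range(50):           # binary search on sigma_i
            sigma = (lo + hi) / 2.0
            p = conditional_p(d, sigma)
            entropy = -np.sum(p * np.log(p + 1e-12))
            if abs(entropy - target) < tol:
                break
            if entropy > target:      # distribution too flat: shrink sigma
                hi = sigma
            else:                     # too peaked: grow sigma
                lo = sigma
        P[i, np.arange(n) != i] = p
    # Symmetrize, as in the original paper: p_ij = (p_{j|i} + p_{i|j}) / 2n
    P = (P + P.T) / (2.0 * n)
    return np.maximum(P, 1e-12)       # avoid zeros in later logarithms
```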

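Continuing the same sketch, this fragment covers steps 3, 4, and 6: the Student-t similarities q_ij, the KL cost, and plain gradient descent using the standard t-SNE gradient 4 Σ_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||²)⁻¹. Production implementations add momentum, early exaggeration, and Barnes-Hut or FFT acceleration, all omitted here; the learning rate and iteration count are arbitrary.

```python
def tsne_embed(P, n_iter=500, lr=100.0, seed=0):
    """Minimal gradient descent on KL(P || Q) for a 2D embedding."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    Y = rng.normal(scale=1e-4, size=(n, 2))    # random initial positions
    for _ in range(n_iter):
        # Student-t similarities q_ij with one degree of freedom.
        sq = np.sum(Y ** 2, axis=1)
        num = 1.0 / (1.0 + sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T)
        np.fill_diagonal(num, 0.0)             # q_ii is not defined
        Q = np.maximum(num / num.sum(), 1e-12)
        # Gradient: 4 * sum_j (p_ij - q_ij) * num_ij * (y_i - y_j)
        PQ = (P - Q) * num
        Y -= lr * 4.0 * ((np.diag(PQ.sum(axis=1)) - PQ) @ Y)
    kl = np.sum(P * np.log(P / Q))             # final cost, for monitoring
    return Y, kl
```

Chaining the two pieces, Y, kl = tsne_embed(input_affinities(X)) yields 2D coordinates for a small dataset; for anything beyond a few thousand points, a library implementation such as scikit-learn's TSNE is the practical choice.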
In data science, t-SNE is highly regarded for visualizing high-dimensional data, including image embeddings, word vectors, and neural network activations. It is commonly used in exploratory data analysis to understand clustering behavior and assess the separability of classes in a dataset, especially in applications such as image classification, NLP, genomics, and customer segmentation. While t-SNE does not itself produce features or labels for supervised learning, it offers valuable insight into the structure of the data, helping analysts uncover latent patterns or cluster distributions in datasets that are otherwise difficult to visualize.
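A common way to assess class separability, as described above, is to color the embedding by known labels: well-separated colored clusters suggest the classes are distinguishable in the original feature space. This sketch assumes matplotlib is available and reuses X_2d and y from the scikit-learn example at the top of this entry.

```python
import matplotlib.pyplot as plt

# Color each embedded point by its class label (digits 0-9 here).
plt.figure(figsize=(6, 5))
pts = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=8)
plt.colorbar(pts, label="digit class")
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```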

By effectively mapping complex data into a visually interpretable form, t-SNE aids in identifying patterns, clusters, and anomalies, making it an indispensable tool for data-driven insights and visualization in machine learning and AI.
