Word Embeddings

Word embeddings are a class of techniques in natural language processing (NLP) that represent words as dense vectors of fixed dimensions, capturing semantic relationships between words based on their usage and context. Unlike traditional bag-of-words models, where words are represented as sparse vectors, word embeddings are designed to encode semantic similarity by placing similar words close together in a continuous vector space. This compact representation allows models to perform effectively on NLP tasks, such as sentiment analysis, named entity recognition, and machine translation, by understanding word relationships based on context rather than simple frequency.
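
The sketch below (plain Python with NumPy) contrasts a sparse one-hot vector with a dense embedding lookup; the three-word vocabulary and the 4-dimensional embedding matrix are invented purely for illustration, not learned values.

  # Contrast between a sparse one-hot vector and a dense embedding lookup;
  # the vocabulary and the 4-dimensional embedding matrix are toy values.
  import numpy as np

  vocab = {"king": 0, "queen": 1, "apple": 2}

  # One-hot: one dimension per vocabulary word, almost all zeros.
  one_hot_king = np.zeros(len(vocab))
  one_hot_king[vocab["king"]] = 1.0

  # Dense embedding: a learned |V| x d matrix in which each row is a word vector.
  embedding_matrix = np.array([
      [0.51, 0.82, -0.10, 0.33],   # king
      [0.49, 0.80, -0.05, 0.35],   # queen
      [-0.61, 0.12, 0.77, -0.20],  # apple
  ])
  dense_king = embedding_matrix[vocab["king"]]

  print(one_hot_king)  # [1. 0. 0.]
  print(dense_king)    # [ 0.51  0.82 -0.1   0.33]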

Core Characteristics of Word Embeddings

  1. Vector Representation:
    • Word embeddings represent each word as a fixed-length vector, where each dimension corresponds to a learned feature capturing semantic properties. In contrast to one-hot encoding, where each word is represented by a high-dimensional sparse vector, embeddings use dense vectors, typically with dimensions ranging from 50 to 300.  
    • This representation condenses meaning into fewer dimensions, significantly reducing memory and computation requirements compared to sparse encoding.
  2. Semantic Proximity and Vector Space:
    • Words that appear in similar contexts are placed closer together in the vector space, reflecting their semantic proximity. For example, words like "king" and "queen" or "apple" and "orange" would have similar vector representations due to their contextual similarity.  
    • The Euclidean or cosine distance between vectors can measure similarity, with smaller distances indicating higher similarity. This structure allows embedding models to perform analogical reasoning, such as finding the relationship \( king - man + woman ≈ queen \), by exploiting linear relationships in the vector space.
  3. Learning Techniques:
    • Word embeddings are generally learned from large corpora using unsupervised learning techniques such as Word2Vec, GloVe, or FastText (short code sketches of each appear after this list):
    • Word2Vec: Developed by Mikolov et al., Word2Vec uses a shallow neural network to produce embeddings, employing either the Continuous Bag of Words (CBOW) or Skip-gram model. The CBOW model predicts a word from its surrounding words, while Skip-gram predicts the surrounding words given a target word. Both models optimize vector positions so that words appearing in similar contexts receive similar vectors. The objective function for Skip-gram is:
      J = (1/T) * Σ_{t=1}^{T} Σ_{-c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t)
      where T is the total number of words in the corpus, c is the size of the context window, and P(w_{t+j} | w_t) is the probability of a context word given the target word w_t.
    • GloVe (Global Vectors for Word Representation): GloVe, developed by Stanford researchers, combines local context information with global co-occurrence statistics to learn embeddings. It models word co-occurrence counts, capturing semantic relationships by minimizing the difference between the dot product of word vectors and the logarithm of their co-occurrence counts.
    • The objective function in GloVe is:
      J = Σ_{i,j} f(X_ij) * (w_i^T * w_j + b_i + b_j - log(X_ij))²
      where \( X_ij \) is the co-occurrence count of words i and j, \( w_i \) and \( w_j \) are the word and context-word vectors, \( b_i \) and \( b_j \) are their scalar biases, and \( f \) is a weighting function that down-weights rare co-occurrences.
    • FastText: Developed by Facebook, FastText extends Word2Vec by representing each word as a collection of character n-grams. This approach allows FastText to generate embeddings for out-of-vocabulary words by composing them from known subwords, improving performance for languages with complex morphology.
  4. Contextualized Embeddings:
    • Traditional embeddings like Word2Vec and GloVe create a single vector for each word, regardless of context, leading to limitations for polysemous words (words with multiple meanings). Contextualized embeddings, generated by models like BERT and GPT, overcome this limitation by generating context-specific embeddings based on surrounding words. For example, "bank" would have different embeddings depending on whether it appears in the context of finance or a river.  
    • Contextual embeddings capture a word’s meaning based on its specific usage, improving performance on tasks that require disambiguation or contextual understanding; a short BERT-based example appears after this list.
  5. Mathematical Representation and Operations:
    • In word embeddings, each word \( w \) is represented as a vector \( v_w \) of fixed dimension. The similarity between two words \( w_1 \) and \( w_2 \) is commonly measured with cosine similarity:
      Cosine Similarity = (v_w1 • v_w2) / (||v_w1|| * ||v_w2||)
      where \( • \) denotes the dot product and \( ||v|| \) represents the vector norm.
    • Analogies and relationships between words are modeled through vector arithmetic. Given vectors for words like "king" (v_king), "man" (v_man), and "woman" (v_woman), one can compute "queen" (v_queen) by:    
      v_queen ≈ v_king - v_man + v_woman  
    • This capability allows embedding models to capture analogical reasoning, associating related concepts through simple arithmetic in the vector space; a small NumPy sketch of both operations appears after this list.
  6. Evaluation Metrics:
    • Word embeddings are evaluated using intrinsic and extrinsic methods:    
      • Intrinsic Evaluation: Measures how well embeddings capture semantic relationships by assessing them on tasks like word similarity and analogy. Datasets like WordSim-353 and Google’s word analogy dataset are commonly used for intrinsic evaluation; a minimal example of this style of evaluation appears after this list.
      • Extrinsic Evaluation: Evaluates embeddings in the context of downstream tasks, such as named entity recognition, text classification, or machine translation, to measure their effectiveness in real-world applications.
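
As referenced above, the following is a minimal Skip-gram training sketch, assuming the gensim library (version 4 or later) is installed; the three-sentence corpus is invented, so the resulting vectors are only illustrative.

  # A minimal Skip-gram training sketch with gensim (assumed installed);
  # the tiny corpus is invented, so the learned vectors are not meaningful.
  from gensim.models import Word2Vec

  corpus = [
      ["the", "king", "rules", "the", "kingdom"],
      ["the", "queen", "rules", "the", "kingdom"],
      ["apples", "and", "oranges", "are", "fruit"],
  ]

  model = Word2Vec(
      sentences=corpus,
      vector_size=50,   # dimensionality of the dense word vectors
      window=2,         # context window size
      min_count=1,      # keep every word in this toy corpus
      sg=1,             # 1 = Skip-gram, 0 = CBOW
      epochs=50,
  )

  print(model.wv["king"].shape)                 # (50,) dense vector for "king"
  print(model.wv.most_similar("king", topn=3))  # nearest neighbours by cosine similarity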
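
The next sketch evaluates a single term of the GloVe objective with NumPy; the vectors, biases, and co-occurrence count are toy values, and the weighting-function constants x_max = 100 and alpha = 0.75 follow the defaults reported in the GloVe paper.

  # One term of the GloVe objective, computed with toy values.
  import numpy as np

  def weight(x, x_max=100.0, alpha=0.75):
      # Down-weights rare co-occurrences; caps the weight of frequent ones.
      return (x / x_max) ** alpha if x < x_max else 1.0

  w_i = np.array([0.2, -0.4, 0.7])   # word vector for word i (toy values)
  w_j = np.array([0.1, -0.3, 0.6])   # vector for context word j (toy values)
  b_i, b_j = 0.01, -0.02             # scalar biases
  X_ij = 25.0                        # co-occurrence count of i and j

  # f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2
  error = w_i @ w_j + b_i + b_j - np.log(X_ij)
  loss_term = weight(X_ij) * error ** 2
  print(loss_term)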
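
The FastText sketch below, again assuming gensim, shows how a vector is produced for a word that never appears in the training corpus; the corpus and the query word "kingdoms" are invented for illustration.

  # FastText builds vectors from character n-grams, so it can embed
  # out-of-vocabulary words; gensim is assumed, the corpus is invented.
  from gensim.models import FastText

  corpus = [
      ["the", "king", "rules", "the", "kingdom"],
      ["the", "queen", "rules", "the", "kingdom"],
  ]

  model = FastText(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)

  # "kingdoms" never appears in the corpus, but FastText composes a vector
  # for it from subwords it shares with "king" and "kingdom".
  print("kingdoms" in model.wv.key_to_index)   # False: not in the vocabulary
  print(model.wv["kingdoms"].shape)            # (50,): a vector is still returned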
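
For contextualized embeddings, the following sketch assumes the Hugging Face transformers and torch packages and the pretrained bert-base-uncased model; it encodes "bank" in two sentences and shows that the two contextual vectors differ.

  # Contextual vectors for "bank" from a pretrained BERT model; assumes the
  # transformers and torch packages and the bert-base-uncased checkpoint.
  import torch
  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModel.from_pretrained("bert-base-uncased")

  def bank_vector(sentence):
      # Returns the hidden state of the "bank" token in the given sentence.
      inputs = tokenizer(sentence, return_tensors="pt")
      with torch.no_grad():
          hidden = model(**inputs).last_hidden_state[0]
      tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
      return hidden[tokens.index("bank")]

  v_finance = bank_vector("she deposited the money at the bank")
  v_river = bank_vector("they sat on the bank of the river")

  # The same word receives clearly different vectors in the two contexts.
  print(torch.nn.functional.cosine_similarity(v_finance, v_river, dim=0).item())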
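
The cosine-similarity and analogy operations from point 5 can be reproduced with a few lines of NumPy; the 3-dimensional vectors below are hand-picked toy values, not learned embeddings.

  # Cosine similarity and the king - man + woman analogy with toy 3-d vectors.
  import numpy as np

  def cosine_similarity(a, b):
      # (a . b) / (||a|| * ||b||)
      return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

  vectors = {
      "king":  np.array([0.8, 0.7, 0.1]),
      "man":   np.array([0.6, 0.1, 0.1]),
      "woman": np.array([0.6, 0.1, 0.8]),
      "queen": np.array([0.8, 0.7, 0.8]),
      "apple": np.array([-0.5, 0.9, -0.2]),
  }

  print(cosine_similarity(vectors["king"], vectors["queen"]))  # high: related words

  # v_queen ≈ v_king - v_man + v_woman: find the nearest remaining word.
  target = vectors["king"] - vectors["man"] + vectors["woman"]
  best = max(
      (w for w in vectors if w not in {"king", "man", "woman"}),
      key=lambda w: cosine_similarity(target, vectors[w]),
  )
  print(best)  # "queen" with these toy vectors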
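
Lastly, a minimal sketch of intrinsic evaluation: rank-correlating model similarities with human judgements using scipy's spearmanr. The word pairs and human scores are invented stand-ins for a dataset such as WordSim-353.

  # Intrinsic evaluation sketch: Spearman rank correlation between model
  # cosine similarities and (invented) human similarity judgements.
  import numpy as np
  from scipy.stats import spearmanr

  def cosine_similarity(a, b):
      return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

  vectors = {
      "king":   np.array([0.8, 0.7, 0.1]),
      "queen":  np.array([0.8, 0.7, 0.8]),
      "apple":  np.array([-0.5, 0.9, -0.2]),
      "orange": np.array([-0.4, 0.8, -0.1]),
  }

  # (word1, word2, human score on a 0-10 scale) -- toy stand-ins for WordSim-353.
  pairs = [
      ("king", "queen", 8.5),
      ("apple", "orange", 7.9),
      ("king", "apple", 1.2),
  ]

  model_scores = [cosine_similarity(vectors[a], vectors[b]) for a, b, _ in pairs]
  human_scores = [score for _, _, score in pairs]

  rho, _ = spearmanr(model_scores, human_scores)
  print(rho)  # closer to 1.0 = embeddings agree better with human judgements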

Word embeddings are essential in NLP and data science, enabling models to learn and utilize word meanings in a continuous vector space, facilitating efficient and accurate text-based predictions. They support diverse applications, from sentiment analysis and question answering to chatbots and recommendation systems. Embeddings enable models to capture semantic richness in textual data, making them a foundational technique for modern NLP and machine learning pipelines, where understanding word relationships and context is critical.
