Understanding Word Embeddings: Word2Vec vs GloVe + Python Code

Word embeddings are vector representations of words that capture their meanings, relationships, and contexts in a numerical format. Two popular techniques for generating word embeddings are Word2Vec and GloVe.

1. Word2Vec

Word2Vec is a neural network-based approach developed by Google that learns word embeddings by predicting the context of words. It comes in two main architectures:

  • CBOW (Continuous Bag of Words): Predicts a target word from surrounding context words.
  • Skip-gram: Predicts surrounding context words given a target word.

Key Features of Word2Vec:

  • Learns word meanings based on context.
  • Produces high-dimensional vector representations.
  • Captures semantic relationships (e.g., king – man + woman = queen).
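
As a quick illustration of that analogy, the sketch below uses gensim's downloader to fetch Google's pre-trained vectors (an assumption on my part: it needs internet access, gensim installed as shown later in this post, and roughly a 1.6 GB download on first use):

import gensim.downloader as api

# Load Google's pre-trained Word2Vec vectors (large download the first time)
pretrained_wv = api.load("word2vec-google-news-300")

# Vector arithmetic: add "king" and "woman", subtract "man"
print(pretrained_wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# With the pre-trained model, the top result is typically "queen"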

2. GloVe (Global Vectors for Word Representation)

GloVe is a matrix factorization-based approach that constructs word embeddings using word co-occurrence statistics from a large corpus.

Key Features of GloVe:

  • Based on word co-occurrence rather than predicting context.
  • Captures both syntactic and semantic relationships.
  • Often performs better on analogy tasks and semantic similarity.

Comparison: Word2Vec vs. GloVe

Feature        | Word2Vec                          | GloVe
Approach       | Predictive (neural network-based) | Count-based (matrix factorization)
Training data  | Local context of words            | Global word co-occurrence matrix
Performance    | Works well for smaller datasets   | Better for large corpora
Example models | Google’s pre-trained Word2Vec     | Stanford’s pre-trained GloVe

Both techniques have been widely used in NLP tasks such as sentiment analysis, machine translation, and text similarity.

Word2Vec in Detail (with Python Code)

Word2Vec learns vector representations of words using a shallow neural network. By analyzing large amounts of text, it captures semantic meaning, so that words used in similar contexts end up with similar vectors. This lets a model uncover relationships and nuances between words, which is why Word2Vec has become a fundamental tool in natural language processing, supporting applications such as recommendation systems, sentiment analysis, and machine translation.

You can run the following in Colab. Here is an explanation of the code:


1. Install & Import Necessary Libraries

To use Word2Vec, you need the gensim library. If you haven’t installed it, run:

pip install gensim

Then, import the required libraries:

import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

2. Prepare Text Data

Word2Vec requires a list of tokenized sentences as input. Let’s use some sample text:

# Sample text corpus
sentences = [
    "Word embeddings capture the meaning of words.",
    "Machine learning models benefit from word vectors.",
    "Natural language processing relies on word representations.",
    "Deep learning and NLP work well together.",
    "GloVe and Word2Vec are popular word embedding techniques."
]

# Tokenizing sentences
tokenized_sentences = [simple_preprocess(sentence) for sentence in sentences]

print(tokenized_sentences)

Output:

[['word', 'embeddings', 'capture', 'the', 'meaning', 'of', 'words'],
 ['machine', 'learning', 'models', 'benefit', 'from', 'word', 'vectors'],
 ['natural', 'language', 'processing', 'relies', 'on', 'word', 'representations'],
 ['deep', 'learning', 'and', 'nlp', 'work', 'well', 'together'],
 ['glove', 'and', 'word2vec', 'are', 'popular', 'word', 'embedding', 'techniques']]

Each sentence is now tokenized into words.


3. Train a Word2Vec Model

Now, train a Word2Vec model on the tokenized data.

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=0)  # CBOW Model

# Print the vocabulary
print(model.wv.index_to_key)

Parameters Explained:

  • vector_size=100: Each word is represented by a 100-dimensional vector.
  • window=5: Context window size.
  • min_count=1: Ignores words that appear fewer than min_count times; with min_count=1, every word is kept (see the sketch after this list).
  • workers=4: Uses 4 CPU cores for training.
  • sg=0: Uses CBOW (if sg=1, it uses Skip-gram).
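
To see what min_count actually filters, here is a minimal sketch on the toy corpus above, comparing vocabulary sizes for two settings:

# Effect of min_count: compare vocabulary sizes on the same tokenized sentences
model_all  = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1)
model_freq = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=2)

print(len(model_all.wv))         # all tokens are kept
print(len(model_freq.wv))        # only tokens seen at least twice (e.g. "word", "learning")
print(model_all.wv.vector_size)  # 100, matching vector_size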

Explanations of CBOW and Skip-gram:

  • CBOW (Continuous Bag of Words) predicts a target word from the context words surrounding it within a specified window. It rests on the assumption that words appearing in similar contexts tend to have similar meanings, so the learned vectors capture semantic relationships between words. In practice, CBOW takes several context words as input and outputs the probability of the target word occurring in that context.
  • Skip-gram works the other way around: given a target word, it predicts the surrounding context words. Trained on a large corpus, this also yields vector representations that reflect the contexts and relationships between words, which is pivotal for NLP tasks such as sentiment analysis and machine translation. The sketch after this list enumerates the training pairs each model would see for one sentence.
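
To make the two objectives concrete, the sketch below (plain Python, not part of gensim) lists the context words that pair with each target word for a window of 2; CBOW maps each context set to its target, while Skip-gram maps each target to every word in its context:

# Enumerate (target, context) pairs for one sentence with a window of 2
tokens = ["deep", "learning", "and", "nlp", "work", "well", "together"]
window = 2

for i, target in enumerate(tokens):
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    # CBOW:      input = context, output = target
    # Skip-gram: input = target,  output = each word in context
    print(f"target={target!r:<12} context={context}")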

4. Get Word Embeddings

After training, you can get the vector representation of any word.

# Get vector for a word
word_vec = model.wv["word"]
print(word_vec[:10])  # Print first 10 values of the vector
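
Note that indexing model.wv with a token that is not in the vocabulary raises a KeyError, so it is worth checking membership first:

# Membership check avoids a KeyError for out-of-vocabulary tokens
if "word" in model.wv:
    print(model.wv["word"].shape)  # (100,), matching vector_size
else:
    print("'word' is not in the vocabulary")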

5. Find Similar Words

Word2Vec allows you to find similar words based on their vector representation.

# Find similar words
similar_words = model.wv.most_similar("word")
print(similar_words)
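
Two related KeyedVectors calls are often useful as well: scoring a single pair of words, and limiting how many neighbours are returned (on this tiny toy corpus the numbers are essentially noise):

# Cosine similarity between two specific words
print(model.wv.similarity("word", "words"))

# Only the top 3 most similar words
print(model.wv.most_similar("word", topn=3))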

6. Save & Load the Model

To reuse the trained model, you can save and load it.

# Save the model
model.save("word2vec.model")

# Load the model
new_model = Word2Vec.load("word2vec.model")
print(new_model.wv.most_similar("word"))
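
If you only need the vectors and not the ability to keep training, you can also save just the KeyedVectors part, which is smaller on disk; a sketch with an arbitrary filename:

from gensim.models import KeyedVectors

# Save only the word vectors (cannot be trained further, but lighter than the full model)
model.wv.save("word2vec.wordvectors")

# Load them back
wv = KeyedVectors.load("word2vec.wordvectors")
print(wv.most_similar("word"))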

7. Using Skip-gram Instead of CBOW

To train using Skip-gram, set sg=1:

# Train using Skip-gram instead of CBOW
model_skipgram = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=1)

Key Takeaways

CBOW vs. Skip-gram

  • CBOW (sg=0): Predicts a word from its context (faster, works well for frequent words).
  • Skip-gram (sg=1): Predicts context from a word (better for infrequent words).

Word Similarity

  • Word2Vec captures semantic meaning: king – man + woman = queen.

GloVe (Global Vectors for Word Representation)

GloVe is a word embedding technique that represents words as vectors based on their co-occurrence in a large text corpus. Unlike Word2Vec, which learns embeddings through a neural network, GloVe relies on matrix factorization over a word co-occurrence matrix.


1. How GloVe Works

  1. Build a Co-occurrence Matrix: Counts how often words appear together in a large text corpus (a toy counting sketch appears below).
  2. Apply Matrix Factorization: Decomposes the co-occurrence matrix into lower-dimensional word vectors.
  3. Generate Word Embeddings: Words with similar meanings get similar vectors.

Example:

  • The words “king”, “queen”, and “royal” will have embeddings that reflect their relationships.
  • Arithmetic operations like king - man + woman ≈ queen still work with GloVe.
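
The counting in step 1 can be sketched on a toy corpus (this only illustrates the co-occurrence counts; the actual GloVe objective then fits word vectors to these counts):

from collections import defaultdict

# Toy illustration of step 1: symmetric co-occurrence counts within a window of 2
corpus = [["the", "king", "and", "the", "queen"],
          ["the", "royal", "king"]]
window = 2

cooc = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[(word, sentence[j])] += 1

print(cooc[("king", "the")])  # 3 co-occurrences in this toy corpus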

2. Using Pre-trained GloVe Embeddings in Python

Instead of training GloVe from scratch, we can use pre-trained GloVe embeddings from Stanford. They provide word vectors trained on massive datasets like Wikipedia.

Step 1: Download Pre-trained GloVe Vectors

Download GloVe embeddings from Stanford NLP (e.g., glove.6B.zip which contains different dimensions like 50D, 100D, 200D, etc.).

Extract the file and choose a vector size (glove.6B.100d.txt for 100-dimensional vectors).
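
Alternatively, if you would rather not download and unzip the file by hand, gensim's downloader hosts the same Stanford vectors; the 100-dimensional glove.6B vectors are published there as "glove-wiki-gigaword-100" (a sketch, assuming internet access and gensim installed):

import gensim.downloader as api

# Pre-trained 100-dimensional GloVe vectors (Wikipedia + Gigaword)
glove_wv = api.load("glove-wiki-gigaword-100")
print(glove_wv.most_similar("king", topn=5))

The steps below work with the raw text file directly, which is useful for seeing what the file actually contains.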


Step 2: Load GloVe into Python

import numpy as np

# Load GloVe embeddings into a dictionary
def load_glove_embeddings(file_path):
    embeddings_index = {}
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]  # First word is the key
            vector = np.asarray(values[1:], dtype="float32")  # The rest are vector values
            embeddings_index[word] = vector
    return embeddings_index

# Load GloVe (use correct path)
glove_path = "glove.6B.100d.txt"  # Ensure the file exists
glove_embeddings = load_glove_embeddings(glove_path)

# Check word vector for "king"
print("Vector for 'king':", glove_embeddings["king"][:10])  # Print first 10 values

Step 3: Find Similar Words Using Cosine Similarity

Since GloVe embeddings do not provide a built-in similarity function (like Word2Vec), we calculate cosine similarity manually.

from sklearn.metrics.pairwise import cosine_similarity

# Function to find most similar words
def find_similar_words(word, embeddings, top_n=5):
    if word not in embeddings:
        return "Word not found in GloVe vocabulary."
    
    word_vec = embeddings[word].reshape(1, -1)  # Reshape for similarity calculation
    similarities = {}
    
    for other_word, other_vec in embeddings.items():
        other_vec = other_vec.reshape(1, -1)
        sim = cosine_similarity(word_vec, other_vec)[0][0]  # Compute cosine similarity
        similarities[other_word] = sim
    
    # Sort by similarity (excluding the word itself)
    sorted_words = sorted(similarities.items(), key=lambda x: x[1], reverse=True)
    return sorted_words[1:top_n+1]  # Return top N similar words

# Find similar words to "king"
similar_to_king = find_similar_words("king", glove_embeddings, top_n=5)
print("Words similar to 'king':", similar_to_king)


3. GloVe vs. Word2Vec

Feature            | GloVe                                           | Word2Vec
Approach           | Matrix factorization (co-occurrence statistics) | Predictive neural network (CBOW / Skip-gram)
Focus              | Captures global context                         | Captures local context
Training data      | Word co-occurrence matrix                       | Sliding window over text
Performance        | Works well for large corpora                    | Works well for smaller datasets
Analogy tasks      | Performs well                                   | Performs well
Pre-trained models | Available (Stanford)                            | Available (Google News)

4. When to Use GloVe vs. Word2Vec

Use GloVe if:

  • You need a pre-trained model that captures global co-occurrence information.
  • You work with large-scale datasets like Wikipedia or Common Crawl.
  • You prefer embeddings that are more stable and less affected by corpus bias.

Use Word2Vec if:

  • You need a model that captures context dynamically.
  • Your dataset is relatively small.
  • You want custom training with CBOW or Skip-gram.

