Understanding Word Embeddings: Word2Vec vs GloVe + Python Code

Word embeddings are vector representations of words that capture their meanings, relationships, and contexts in a numerical format. Two popular techniques for generating word embeddings are Word2Vec and GloVe.

1. Word2Vec

Word2Vec is a neural network-based approach developed by Google that learns word embeddings by predicting the context of words. It comes in two main architectures:

  • CBOW (Continuous Bag of Words): Predicts a target word from surrounding context words.
  • Skip-gram: Predicts surrounding context words given a target word.

Key Features of Word2Vec:

  • Learns word meanings based on context.
  • Produces high-dimensional vector representations.
  • Captures semantic relationships (e.g., king – man + woman = queen).
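
As a quick illustration of that analogy, the sketch below uses gensim's downloader to fetch Google's pre-trained vectors (an assumption on my part: it needs internet access, gensim installed as shown later in this post, and roughly a 1.6 GB download on first use):

import gensim.downloader as api

# Load Google's pre-trained Word2Vec vectors (large download the first time)
pretrained_wv = api.load("word2vec-google-news-300")

# Vector arithmetic: add "king" and "woman", subtract "man"
print(pretrained_wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# With the pre-trained model, the top result is typically "queen"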

2. GloVe (Global Vectors for Word Representation)

GloVe is a matrix factorization-based approach that constructs word embeddings using word co-occurrence statistics from a large corpus.

Key Features of GloVe:

  • Based on word co-occurrence rather than predicting context.
  • Captures both syntactic and semantic relationships.
  • Often performs better on analogy tasks and semantic similarity.

Comparison: Word2Vec vs. GloVe

Feature        | Word2Vec                          | GloVe
Approach       | Predictive (neural network-based) | Count-based (matrix factorization)
Training data  | Local context of words            | Global word co-occurrence matrix
Performance    | Works well for smaller datasets   | Better for large corpora
Example models | Google’s pre-trained Word2Vec     | Stanford’s pre-trained GloVe

Both techniques have been widely used in NLP tasks such as sentiment analysis, machine translation, and text similarity.

Word2Vec in Detail (with Python Code)

Word2Vec learns vector representations of words using a shallow neural network. By analyzing large amounts of text, it captures semantic meaning, so that words used in similar contexts end up with similar vectors. This lets a model uncover relationships and nuances between words, which is why Word2Vec has become a fundamental tool in natural language processing, supporting applications such as recommendation systems, sentiment analysis, and machine translation.

You can run the following in Colab. Here is an explanation of the code:


1. Install & Import Necessary Libraries

To use Word2Vec, you need the gensim library. If you haven’t installed it, run:

pip install gensim

Then, import the required libraries:

import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

2. Prepare Text Data

Word2Vec requires a list of tokenized sentences as input. Let’s use some sample text:

# Sample text corpus
sentences = [
    "Word embeddings capture the meaning of words.",
    "Machine learning models benefit from word vectors.",
    "Natural language processing relies on word representations.",
    "Deep learning and NLP work well together.",
    "GloVe and Word2Vec are popular word embedding techniques."
]

# Tokenizing sentences
tokenized_sentences = [simple_preprocess(sentence) for sentence in sentences]

print(tokenized_sentences)

Output:

[['word', 'embeddings', 'capture', 'the', 'meaning', 'of', 'words'],
 ['machine', 'learning', 'models', 'benefit', 'from', 'word', 'vectors'],
 ['natural', 'language', 'processing', 'relies', 'on', 'word', 'representations'],
 ['deep', 'learning', 'and', 'nlp', 'work', 'well', 'together'],
 ['glove', 'and', 'word2vec', 'are', 'popular', 'word', 'embedding', 'techniques']]

Each sentence is now tokenized into words.


3. Train a Word2Vec Model

Now, train a Word2Vec model on the tokenized data.

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=0)  # CBOW Model

# Print the vocabulary
print(model.wv.index_to_key)

Parameters Explained:

  • vector_size=100: Each word is represented by a 100-dimensional vector.
  • window=5: Context window size.
  • min_count=1: Ignores words that appear fewer than min_count times; with min_count=1, every word is kept (see the sketch after this list).
  • workers=4: Uses 4 CPU cores for training.
  • sg=0: Uses CBOW (if sg=1, it uses Skip-gram).
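
To see what min_count actually filters, here is a minimal sketch on the toy corpus above, comparing vocabulary sizes for two settings:

# Effect of min_count: compare vocabulary sizes on the same tokenized sentences
model_all  = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1)
model_freq = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=2)

print(len(model_all.wv))         # all tokens are kept
print(len(model_freq.wv))        # only tokens seen at least twice (e.g. "word", "learning")
print(model_all.wv.vector_size)  # 100, matching vector_size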

Explanations of CBOW and Skip-gram:

  • CBOW (Continuous Bag of Words) predicts a target word from the context words surrounding it within a specified window. It rests on the assumption that words appearing in similar contexts tend to have similar meanings, so the learned vectors capture semantic relationships between words. In practice, CBOW takes several context words as input and outputs the probability of the target word occurring in that context.
  • Skip-gram works the other way around: given a target word, it predicts the surrounding context words. Trained on a large corpus, this also yields vector representations that reflect the contexts and relationships between words, which is pivotal for NLP tasks such as sentiment analysis and machine translation. The sketch after this list enumerates the training pairs each model would see for one sentence.
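
To make the two objectives concrete, the sketch below (plain Python, not part of gensim) lists the context words that pair with each target word for a window of 2; CBOW maps each context set to its target, while Skip-gram maps each target to every word in its context:

# Enumerate (target, context) pairs for one sentence with a window of 2
tokens = ["deep", "learning", "and", "nlp", "work", "well", "together"]
window = 2

for i, target in enumerate(tokens):
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    # CBOW:      input = context, output = target
    # Skip-gram: input = target,  output = each word in context
    print(f"target={target!r:<12} context={context}")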

4. Get Word Embeddings

After training, you can get the vector representation of any word.

# Get vector for a word
word_vec = model.wv["word"]
print(word_vec[:10])  # Print first 10 values of the vector
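
Note that indexing model.wv with a token that is not in the vocabulary raises a KeyError, so it is worth checking membership first:

# Membership check avoids a KeyError for out-of-vocabulary tokens
if "word" in model.wv:
    print(model.wv["word"].shape)  # (100,), matching vector_size
else:
    print("'word' is not in the vocabulary")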

5. Find Similar Words

Word2Vec allows you to find similar words based on their vector representation.

# Find similar words
similar_words = model.wv.most_similar("word")
print(similar_words)
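
Two related KeyedVectors calls are often useful as well: scoring a single pair of words, and limiting how many neighbours are returned (on this tiny toy corpus the numbers are essentially noise):

# Cosine similarity between two specific words
print(model.wv.similarity("word", "words"))

# Only the top 3 most similar words
print(model.wv.most_similar("word", topn=3))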

6. Save & Load the Model

To reuse the trained model, you can save and load it.

# Save the model
model.save("word2vec.model")

# Load the model
new_model = Word2Vec.load("word2vec.model")
print(new_model.wv.most_similar("word"))
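
If you only need the vectors and not the ability to keep training, you can also save just the KeyedVectors part, which is smaller on disk; a sketch with an arbitrary filename:

from gensim.models import KeyedVectors

# Save only the word vectors (cannot be trained further, but lighter than the full model)
model.wv.save("word2vec.wordvectors")

# Load them back
wv = KeyedVectors.load("word2vec.wordvectors")
print(wv.most_similar("word"))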

7. Using Skip-gram Instead of CBOW

To train using Skip-gram, set sg=1:

# Train using Skip-gram instead of CBOW
model_skipgram = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=1)

Key Takeaways

CBOW vs. Skip-gram

  • CBOW (sg=0): Predicts a word from its context (faster, works well for frequent words).
  • Skip-gram (sg=1): Predicts context from a word (better for infrequent words).

Word Similarity

  • Word2Vec captures semantic meaning: king – man + woman = queen.

GloVe (Global Vectors for Word Representation)

GloVe is a word embedding technique that represents words as vectors based on their co-occurrence in a large text corpus. Unlike Word2Vec, which learns embeddings through a neural network, GloVe relies on matrix factorization over a word co-occurrence matrix.


1. How GloVe Works

  1. Build a Co-occurrence Matrix: Counts how often words appear together in a large text corpus (a toy counting sketch appears below).
  2. Apply Matrix Factorization: Decomposes the co-occurrence matrix into lower-dimensional word vectors.
  3. Generate Word Embeddings: Words with similar meanings get similar vectors.

Example:

  • The words “king”, “queen”, and “royal” will have embeddings that reflect their relationships.
  • Arithmetic operations like king - man + woman ≈ queen still work with GloVe.
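
The counting in step 1 can be sketched on a toy corpus (this only illustrates the co-occurrence counts; the actual GloVe objective then fits word vectors to these counts):

from collections import defaultdict

# Toy illustration of step 1: symmetric co-occurrence counts within a window of 2
corpus = [["the", "king", "and", "the", "queen"],
          ["the", "royal", "king"]]
window = 2

cooc = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[(word, sentence[j])] += 1

print(cooc[("king", "the")])  # 3 co-occurrences in this toy corpus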

2. Using Pre-trained GloVe Embeddings in Python

Instead of training GloVe from scratch, we can use pre-trained GloVe embeddings from Stanford. They provide word vectors trained on massive datasets like Wikipedia.

Step 1: Download Pre-trained GloVe Vectors

Download GloVe embeddings from Stanford NLP (e.g., glove.6B.zip which contains different dimensions like 50D, 100D, 200D, etc.).

Extract the file and choose a vector size (glove.6B.100d.txt for 100-dimensional vectors).
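
Alternatively, if you would rather not download and unzip the file by hand, gensim's downloader hosts the same Stanford vectors; the 100-dimensional glove.6B vectors are published there as "glove-wiki-gigaword-100" (a sketch, assuming internet access and gensim installed):

import gensim.downloader as api

# Pre-trained 100-dimensional GloVe vectors (Wikipedia + Gigaword)
glove_wv = api.load("glove-wiki-gigaword-100")
print(glove_wv.most_similar("king", topn=5))

The steps below work with the raw text file directly, which is useful for seeing what the file actually contains.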


Step 2: Load GloVe into Python

import numpy as np

# Load GloVe embeddings into a dictionary
def load_glove_embeddings(file_path):
    embeddings_index = {}
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]  # First word is the key
            vector = np.asarray(values[1:], dtype="float32")  # The rest are vector values
            embeddings_index[word] = vector
    return embeddings_index

# Load GloVe (use correct path)
glove_path = "glove.6B.100d.txt"  # Ensure the file exists
glove_embeddings = load_glove_embeddings(glove_path)

# Check word vector for "king"
print("Vector for 'king':", glove_embeddings["king"][:10])  # Print first 10 values

Step 3: Find Similar Words Using Cosine Similarity

Since GloVe embeddings do not provide a built-in similarity function (like Word2Vec), we calculate cosine similarity manually.

from sklearn.metrics.pairwise import cosine_similarity

# Function to find most similar words
def find_similar_words(word, embeddings, top_n=5):
    if word not in embeddings:
        return "Word not found in GloVe vocabulary."
    
    word_vec = embeddings[word].reshape(1, -1)  # Reshape for similarity calculation
    similarities = {}
    
    for other_word, other_vec in embeddings.items():
        other_vec = other_vec.reshape(1, -1)
        sim = cosine_similarity(word_vec, other_vec)[0][0]  # Compute cosine similarity
        similarities[other_word] = sim
    
    # Sort by similarity (excluding the word itself)
    sorted_words = sorted(similarities.items(), key=lambda x: x[1], reverse=True)
    return sorted_words[1:top_n+1]  # Return top N similar words

# Find similar words to "king"
similar_to_king = find_similar_words("king", glove_embeddings, top_n=5)
print("Words similar to 'king':", similar_to_king)


3. GloVe vs. Word2Vec

Feature            | GloVe                                           | Word2Vec
Approach           | Matrix factorization (co-occurrence statistics) | Predictive neural network (CBOW / Skip-gram)
Focus              | Captures global context                         | Captures local context
Training data      | Word co-occurrence matrix                       | Sliding window over text
Performance        | Works well for large corpora                    | Works well for smaller datasets
Analogy tasks      | Performs well                                   | Performs well
Pre-trained models | Available (Stanford)                            | Available (Google News)

4. When to Use GloVe vs. Word2Vec

Use GloVe if:

  • You need a pre-trained model that captures global co-occurrence information.
  • You work with large-scale datasets like Wikipedia or Common Crawl.
  • You prefer embeddings that are more stable and less affected by corpus bias.

Use Word2Vec if:

  • You need a model that captures context dynamically.
  • Your dataset is relatively small.
  • You want custom training with CBOW or Skip-gram.

