Word Embeddings in PyTorch: A Complete Guide

This guide covers:

  • Implementing word embeddings in PyTorch
  • Training word embeddings
  • Saving the trained embeddings
  • Loading the saved embeddings for reuse

1. Implementing Word Embeddings in PyTorch

PyTorch provides the nn.Embedding layer for creating word embeddings: a lookup table that maps integer word indices to dense, trainable vectors.

import torch
import torch.nn as nn

# Define the vocabulary size and embedding dimension
vocab_size = 10  # Example vocabulary size
embedding_dim = 5  # Dimension of word vectors

# Create an embedding layer
embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Example input (word indices)
word_indices = torch.tensor([1, 3, 5, 7])  # Example words

# Get embeddings for input words
word_embeddings = embedding_layer(word_indices)
print(word_embeddings)

  • nn.Embedding(num_embeddings, embedding_dim): creates a trainable embedding matrix of shape [num_embeddings, embedding_dim] (here [10, 5]), initialized randomly.
  • word_indices: a tensor of integer indices, one per word to look up.
  • The output is a tensor of shape [num_words, embedding_dim] (here [4, 5]), with one embedding vector per input index. A sketch of building the layer from existing pretrained vectors follows below.
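
If pretrained vectors are already available (from GloVe, word2vec, or an earlier run), nn.Embedding.from_pretrained can build the layer directly from them. A minimal sketch; the random tensor below is only a placeholder for real pretrained vectors:

import torch
import torch.nn as nn

# Placeholder "pretrained" vectors: 10 words, 5 dimensions each.
# In practice these would come from GloVe, word2vec, or a previous training run.
pretrained_vectors = torch.randn(10, 5)

# Build an embedding layer directly from existing vectors.
# freeze=True keeps them fixed during later training; use freeze=False to fine-tune.
frozen_embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

# Lookup works exactly as with a freshly initialized nn.Embedding
print(frozen_embedding(torch.tensor([1, 3, 5, 7])).shape)  # torch.Size([4, 5])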

2. Training Word Embeddings in PyTorch

Training embeddings typically uses a shallow neural network with a Skip-gram or CBOW (continuous bag-of-words) objective, as in word2vec. The example below follows the Skip-gram idea: predict a context word from a given center word.

Dataset Preparation and Training Loop
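
The code below uses hand-written (center, context) index pairs for brevity. As a rough sketch of how such pairs could be generated from a tokenized sentence (the toy sentence and window size of 1 are illustrative assumptions):

# Toy tokenized sentence (illustrative only; real corpora are much larger)
tokens = ["the", "cat", "sat", "on", "mat"]

# Map each unique word to an integer index
word_to_idx = {word: idx for idx, word in enumerate(tokens)}

# Collect (center, context) index pairs within a window of 1 word on each side
window = 1
pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((word_to_idx[center], word_to_idx[tokens[j]]))

print(pairs)  # [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)]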

import torch
import torch.nn as nn
import torch.optim as optim

# Sample dataset (word pairs)
data = [(0, 1), (1, 2), (2, 3), (3, 4)]  # (center word, context word)
vocab_size = 5  # Number of unique words
embedding_dim = 3

# Create model
class WordEmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(WordEmbeddingModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
    
    def forward(self, center_word):
        embed = self.embeddings(center_word)
        output = self.linear(embed)
        return output

# Initialize model
model = WordEmbeddingModel(vocab_size, embedding_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(100):
    total_loss = 0
    for center, context in data:
        center_tensor = torch.tensor([center], dtype=torch.long)
        target_tensor = torch.tensor([context], dtype=torch.long)

        optimizer.zero_grad()
        output = model(center_tensor)
        loss = criterion(output, target_tensor)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss}")

  • The model learns embeddings by predicting the context word given a center word, which is the Skip-gram objective.
  • nn.CrossEntropyLoss() treats the prediction as classification over the vocabulary, so the linear layer outputs one score per word.
  • The SGD optimizer updates both the embedding matrix and the linear layer from the loss gradients. A small sketch of inspecting the learned vectors follows below.
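
Once the loop finishes, the learned vectors can be inspected directly, for example by comparing two words with cosine similarity. A small sketch that reuses the model variable trained above:

import torch
import torch.nn.functional as F

# Pull out the learned vectors for two word indices (1 and 2 here)
vec_a = model.embeddings(torch.tensor([1]))
vec_b = model.embeddings(torch.tensor([2]))

# Cosine similarity lies in [-1, 1]; higher means the two words ended up
# closer together in the embedding space
similarity = F.cosine_similarity(vec_a, vec_b)
print(similarity.item())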

3. Saving the Trained Embedding Model

Once training is done, we can save either the embedding layer alone or the whole model.

# Save only the embedding layer weights
torch.save(model.embeddings.state_dict(), "word_embeddings.pth")

# Save the entire model (optional)
torch.save(model.state_dict(), "embedding_model.pth")

  • torch.save(model.embeddings.state_dict(), "word_embeddings.pth") saves just the embedding layer's weights.
  • torch.save(model.state_dict(), "embedding_model.pth") saves the full model (embedding layer plus the linear output layer). A sketch of saving the raw weight matrix together with a vocabulary mapping follows below.
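
Beyond state dicts, it can be handy to save the raw weight matrix together with the vocabulary mapping, so the vectors can be reused without this model class. A minimal sketch; the word_to_idx dictionary here is a hypothetical mapping you would have built during preprocessing:

import torch

# Detach the [vocab_size, embedding_dim] weight matrix from the computation graph
embedding_matrix = model.embeddings.weight.detach().clone()

# Hypothetical word-to-index mapping built during preprocessing
word_to_idx = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# Save both objects in one file so the vectors stay aligned with the vocabulary
torch.save(
    {"embedding_matrix": embedding_matrix, "word_to_idx": word_to_idx},
    "embeddings_with_vocab.pth",
)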

4. Loading the Saved Embeddings

To reuse the trained embeddings:

# Load model structure
loaded_model = WordEmbeddingModel(vocab_size, embedding_dim)

# Load saved embeddings
loaded_model.embeddings.load_state_dict(torch.load("word_embeddings.pth"))

# Example usage
word_idx = torch.tensor([2])  # Example word index
print(loaded_model.embeddings(word_idx))

  • WordEmbeddingModel(vocab_size, embedding_dim): recreate the model with the same structure and sizes used during training.
  • load_state_dict(torch.load("word_embeddings.pth")): load the saved weights into the embedding layer.
  • The embeddings can now be used for further training or inference, for example inside a downstream model as sketched below.
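
A common way to reuse the loaded vectors is to plug them into a new downstream model via nn.Embedding.from_pretrained. The sketch below assumes a hypothetical classifier that averages the word vectors of its input; the class name and number of classes are illustrative:

import torch
import torch.nn as nn

class SimpleClassifier(nn.Module):
    """Hypothetical downstream model that averages word embeddings."""
    def __init__(self, pretrained_weights, num_classes=2):
        super().__init__()
        # Reuse the trained vectors; freeze=False allows further fine-tuning
        self.embeddings = nn.Embedding.from_pretrained(pretrained_weights, freeze=False)
        self.classifier = nn.Linear(pretrained_weights.shape[1], num_classes)

    def forward(self, word_indices):
        embedded = self.embeddings(word_indices)  # [num_words, embedding_dim]
        pooled = embedded.mean(dim=0)             # average over the words
        return self.classifier(pooled)            # [num_classes]

# Build the classifier from the embeddings loaded above
clf = SimpleClassifier(loaded_model.embeddings.weight.detach(), num_classes=2)
print(clf(torch.tensor([0, 2, 4])))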


