Sentiment Analysis in Python: TextBlob, NLTK & Transformers

This post shows how to implement three sentiment analysis techniques in Python: TextBlob, NLTK with VADER, and Hugging Face transformers. First, here are the basic characteristics of each technique:

TextBlob

  • Technique: Uses rule-based algorithms based on lexicons to calculate sentiment scores (polarity and subjectivity).
  • Strengths: Easy to use for quick sentiment analysis. It provides polarity (negative to positive range: [-1, 1]) and subjectivity (objective to subjective range: [0, 1]).
  • Limitations: It lacks the depth and nuance of more advanced models. The results might not be as accurate for complex sentences or sarcasm.
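
To make the lexicon idea concrete, here is a minimal sketch of a rule-based scorer. It is not TextBlob's actual algorithm or lexicon: the word list, the scores, and the single negation rule are all made up for illustration, and the names TOY_LEXICON and toy_polarity are hypothetical.

```python
import re

# Hypothetical mini-lexicon for illustration only; real tools ship far larger ones.
TOY_LEXICON = {
    "love": 0.5,
    "amazing": 0.6,
    "okay": 0.5,
    "better": 0.5,
    "worst": -1.0,
    "terrible": -1.0,
}
NEGATIONS = {"not", "never", "no"}


def toy_polarity(text):
    """Average the polarity of known words, flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    for i, token in enumerate(tokens):
        if token in TOY_LEXICON:
            score = TOY_LEXICON[token]
            # Simple rule: a negation directly before the word flips its sign.
            if i > 0 and tokens[i - 1] in NEGATIONS:
                score = -score
            scores.append(score)
    # Text with no known words is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("I love this, it is amazing"))
print(toy_polarity("This is not amazing"))
print(toy_polarity("The parcel arrived on Tuesday"))
```

Because the score depends only on which words appear and a few fixed rules, this kind of approach is fast and transparent, but it cannot pick up meaning that is not encoded in the lexicon.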

NLTK (VADER Model)

  • Technique: Uses VADER, a lexicon- and rule-based model designed specifically for analyzing sentiment in text, including social media content.
  • Strengths: Can handle more nuanced text like emojis, slang, and social media phrases. Outputs detailed sentiment scores (positive, negative, neutral, and compound).
  • Limitations: Still rule-based, so it may struggle with very complex language structures or with vocabulary that its lexicon does not cover.

Transformers (via Hugging Face)

  • Technique: Uses deep learning and pre-trained transformer models like BERT or RoBERTa for sentiment analysis. These models learn contextual word meanings from vast datasets.
  • Strengths: Typically more accurate than lexicon-based methods and better at capturing nuanced, context-dependent sentiment, although sarcasm can still be difficult. Well suited to large-scale analysis and fine-tuning.
  • Limitations: Requires more computational power and time. May be overkill for simpler sentiment tasks.

Summary Comparison

Feature          | TextBlob    | NLTK (VADER) | Transformers
Ease of Use      | Very easy   | Easy         | Moderate (setup needed)
Accuracy         | Moderate    | Good         | High
Context Handling | Limited     | Better       | Excellent
Performance      | Lightweight | Lightweight  | Requires more resources

Python Implementation

TextBlob

Here’s a basic Python example for analyzing sentiment in text data using the TextBlob library. The TextBlob library provides a simple interface for performing natural language processing tasks like sentiment analysis.

You can install the TextBlob library by running pip install textblob.

from textblob import TextBlob

# Example function to analyze sentiment
def analyze_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment
    return {
        'polarity': sentiment.polarity,  # Range [-1,1], where -1 is negative and +1 is positive
        'subjectivity': sentiment.subjectivity  # Range [0,1], where 0 is objective and 1 is subjective
    }

# Sample text data for sentiment analysis
sample_texts = [
    "I absolutely love this product! It's amazing!",
    "It's okay, but I expected something better.",
    "This is the worst experience I've ever had."
]

# Analyze sentiments of sample texts
for text in sample_texts:
    sentiment_analysis = analyze_sentiment(text)
    print(f"Text: {text}")
    print(f"Sentiment Analysis: {sentiment_analysis}")
    print("-" * 40)

Output Description:

It outputs the polarity (positive/negative) and subjectivity (objective/subjective) for each piece of text:

Text: I absolutely love this product! It's amazing!
Sentiment Analysis: {'polarity': 0.6875, 'subjectivity': 0.75}
----------------------------------------
Text: It's okay, but I expected something better.
Sentiment Analysis: {'polarity': 0.3, 'subjectivity': 0.4666666666666666}
----------------------------------------
Text: This is the worst experience I've ever had.
Sentiment Analysis: {'polarity': -1.0, 'subjectivity': 1.0}
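
If you need a label rather than a number, one option is to threshold the polarity. The sketch below assumes a cutoff of 0.1, which is an arbitrary illustrative choice and not a TextBlob default; label_from_polarity is a hypothetical helper name. The polarity values are copied from the output above, so the snippet runs without TextBlob installed.

```python
def label_from_polarity(polarity, threshold=0.1):
    """Map a polarity in [-1, 1] to a coarse label.

    The threshold is an arbitrary illustrative cutoff, not a TextBlob default.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Polarity values copied from the example output above.
example_polarities = {
    "I absolutely love this product! It's amazing!": 0.6875,
    "It's okay, but I expected something better.": 0.3,
    "This is the worst experience I've ever had.": -1.0,
}

for text, polarity in example_polarities.items():
    print(f"{label_from_polarity(polarity):>8}  {text}")
```

Note that the second sentence is labeled positive even though it expresses mild disappointment, which illustrates the limitation of lexicon-based scoring mentioned earlier.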

NLTK with VADER pre-trained model

Let’s explore an alternative sentiment analysis approach using the NLTK (Natural Language Toolkit) library and its VADER model.

Install the NLTK library by running pip install nltk.

from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import download

# Download necessary resources
download('vader_lexicon')

# Initialize the Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()

# Sample text data for sentiment analysis
sample_texts = [
    "I absolutely love this product! It's amazing!",
    "It's okay, but I expected something better.",
    "This is the worst experience I've ever had."
]

# Analyze sentiments of sample texts
for text in sample_texts:
    sentiment_score = sia.polarity_scores(text)
    print(f"Text: {text}")
    print(f"Sentiment Scores: {sentiment_score}")
    print("-" * 40)

How It Works:

  • The SentimentIntensityAnalyzer from NLTK uses the VADER lexicon and its rules to determine sentiment scores.
  • You’ll receive four scores:
    • Positive, Negative, Neutral: The proportions of the text that fall into each sentiment category.
    • Compound: A single normalized score from -1 (most negative) to +1 (most positive) that aggregates the word-level scores.
  • Output for this example:
Text: I absolutely love this product! It's amazing!
Sentiment Scores: {'neg': 0.0, 'neu': 0.311, 'pos': 0.689, 'compound': 0.8713}
----------------------------------------
Text: It's okay, but I expected something better.
Sentiment Scores: {'neg': 0.0, 'neu': 0.43, 'pos': 0.57, 'compound': 0.6486}
----------------------------------------
Text: This is the worst experience I've ever had.
Sentiment Scores: {'neg': 0.369, 'neu': 0.631, 'pos': 0.0, 'compound': -0.6249}
----------------------------------------
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
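
To turn the compound score into a label, a commonly cited convention from VADER's authors is to treat values of 0.05 and above as positive, -0.05 and below as negative, and anything in between as neutral. The sketch below applies that convention to the scores copied from the output above; label_from_compound is a hypothetical helper name, and the thresholds should be tuned for your own data.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Score dictionaries copied from the example output above.
example_scores = [
    {"neg": 0.0, "neu": 0.311, "pos": 0.689, "compound": 0.8713},
    {"neg": 0.0, "neu": 0.43, "pos": 0.57, "compound": 0.6486},
    {"neg": 0.369, "neu": 0.631, "pos": 0.0, "compound": -0.6249},
]

labels = [label_from_compound(scores["compound"]) for scores in example_scores]

for scores, label in zip(example_scores, labels):
    # neg, neu and pos are proportions, so they should add up to roughly 1.
    total = scores["neg"] + scores["neu"] + scores["pos"]
    print(f"{label:>8}  compound={scores['compound']:+.4f}  neg+neu+pos={total:.3f}")
```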

Transformers

Next, let’s use a transformer model for sentiment analysis through Hugging Face’s transformers library. Install it by running pip install transformers. The pipeline also needs a deep learning backend such as PyTorch (pip install torch).

Here’s an example:

from transformers import pipeline

# Initialize a sentiment-analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# Sample text data
sample_texts = [
    "I absolutely love this product! It's amazing!",
    "It's okay, but I expected something better.",
    "This is the worst experience I've ever had."
]

# Analyze sentiments of sample texts
for text in sample_texts:
    result = sentiment_analyzer(text)
    print(f"Text: {text}")
    print(f"Sentiment Analysis: {result}")
    print("-" * 40)

How It Works:

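When no model is specified, the pipeline downloads a default pre-trained model and classifies each text as “POSITIVE” or “NEGATIVE” with a confidence score. Because this default model has only two labels, there is no neutral category, so the lukewarm second sentence is assigned NEGATIVE with a relatively low score. Output for this example: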

Device set to use cpu
Text: I absolutely love this product! It's amazing!
Sentiment Analysis: [{'label': 'POSITIVE', 'score': 0.9998856782913208}]
----------------------------------------
Text: It's okay, but I expected something better.
Sentiment Analysis: [{'label': 'NEGATIVE', 'score': 0.6220217943191528}]
----------------------------------------
Text: This is the worst experience I've ever had.
Sentiment Analysis: [{'label': 'NEGATIVE', 'score': 0.9997679591178894}]
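
Since the default pipeline returns only a label and a confidence, it can help to post-process the results. The sketch below shows two simple options, using the predictions copied from the output above: converting each result to a signed score, and flagging low-confidence predictions. The helper names and the 0.75 cutoff are assumptions for illustration, and low confidence in a two-class model is not the same thing as neutral sentiment.

```python
def to_signed_score(prediction):
    """Convert a {'label', 'score'} dict into a value in [-1, 1]."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


def label_with_uncertain_band(prediction, min_confidence=0.75):
    """Keep the label only when the model is reasonably confident.

    min_confidence is an arbitrary illustrative cutoff.
    """
    if prediction["score"] < min_confidence:
        return "UNCERTAIN"
    return prediction["label"]


# Predictions copied from the example output above.
example_results = [
    {"label": "POSITIVE", "score": 0.9998856782913208},
    {"label": "NEGATIVE", "score": 0.6220217943191528},
    {"label": "NEGATIVE", "score": 0.9997679591178894},
]

for prediction in example_results:
    print(f"{label_with_uncertain_band(prediction):>9}  {to_signed_score(prediction):+.4f}")
```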


How to fine-tune Transformers

Fine-tuning captures contextual nuances and improves performance on domain-specific texts like medical reviews or social media posts. When using transformer models like BERT or RoBERTa, you can fine-tune them on your own dataset to adapt the sentiment analysis for your context.

Steps:

  • Prepare your dataset: Collect labeled text data indicating sentiment (e.g., positive, negative, neutral).
  • Tokenize and preprocess: Split sentences into tokens using the model’s tokenizer (e.g., BERT tokenizer).
  • Fine-tune the model: Use frameworks like Hugging Face’s transformers library to load the pre-trained model and fine-tune it on your data. Here’s an example:

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset

# Define a custom dataset class
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        text = self.texts[index]
        label = self.labels[index]
        inputs = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.max_len,
            return_tensors="pt"
        )
        return {
            "input_ids": inputs["input_ids"].squeeze(),
            "attention_mask": inputs["attention_mask"].squeeze(),
            "labels": torch.tensor(label, dtype=torch.long)
        }

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Prepare training and evaluation datasets
train_texts = ["I love this!", "It's terrible.", "Not bad, not great."]
train_labels = [0, 2, 1]  # 0 = positive, 1 = neutral, 2 = negative
eval_texts = ["Amazing experience!", "This was awful."]
eval_labels = [0, 2]

train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
eval_dataset = SentimentDataset(eval_texts, eval_labels, tokenizer)

# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch (newer transformers releases name this argument eval_strategy)
    logging_dir='./logs',
    save_strategy="epoch",
    load_best_model_at_end=True
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset  # Evaluation dataset for monitoring performance
)

# Fine-tune the model
trainer.train()

# Evaluate the model
metrics = trainer.evaluate()
print("Evaluation Metrics:", metrics)

Output:

Epoch | Training Loss | Validation Loss
1     | No log        | 1.215670
2     | No log        | 1.194448
3     | No log        | 1.211473
Evaluation Metrics: {'eval_loss': 1.1944482326507568, 'eval_runtime': 1.3283, 'eval_samples_per_second': 1.506, 'eval_steps_per_second': 0.753, 'epoch': 3.0}
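
The metrics above only include the loss and timing information. To also report accuracy, the Trainer accepts a compute_metrics function. The following is a minimal sketch written in plain Python so it runs on its own; the logits and labels at the bottom are made up purely to exercise the function.

```python
def compute_metrics(eval_pred):
    """Return accuracy for one evaluation pass.

    eval_pred unpacks into (logits, labels); each row of logits holds
    one score per class.
    """
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == int(label)) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Made-up logits for the 3-class setup (0 = positive, 1 = neutral, 2 = negative).
fake_logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 1.5], [0.9, 0.8, 0.1]]
fake_labels = [0, 2, 1]
result = compute_metrics((fake_logits, fake_labels))
print(result)
```

To use it during fine-tuning, pass compute_metrics=compute_metrics when creating the Trainer; the evaluation metrics should then include an eval_accuracy entry alongside eval_loss. With only a handful of training examples, as in this toy setup, neither number is meaningful.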

When running the script, you may be prompted to enter credentials for Wandb.ai (Weights & Biases), an AI developer platform for tracking and monitoring training runs and managing models from experimentation through to production. The prompt appears because the Trainer can report training metrics to Weights & Biases when the wandb package is installed. If you don’t need this logging, you can turn it off by passing report_to="none" in TrainingArguments.

