1. What is WordPiece Tokenization?
WordPiece is a subword tokenization algorithm originally developed for speech recognition and later adopted by modern NLP models like BERT, ALBERT, and DistilBERT. It breaks words into subwords to efficiently handle rare words and out-of-vocabulary (OOV) tokens.
2. Why WordPiece?
Traditional word-based tokenization struggles with:
- Rare words (e.g., “electroencephalography” is too rare for a fixed vocabulary).
- Out-of-vocabulary (OOV) words (e.g., a model trained on “running” may not recognize “runner”).
- Morphologically rich languages (e.g., German, Finnish, and Turkish have complex word forms).
WordPiece solves these issues by breaking words into frequent subword units, allowing better generalization.
3. How Does WordPiece Work?
Step 1: Start with Individual Characters
The tokenizer starts with an initial vocabulary of individual characters, plus a special "##" prefix that marks subwords occurring inside a word (i.e., not at the start).
- Example vocabulary: ["t", "h", "e", "##re", "##fore", "##ment"]
Step 2: Use Data Statistics to Merge the Most Common Pairs
The tokenizer iteratively merges the most frequent subword pairs based on their co-occurrence in the training data.
Example:
- Suppose we start with:
["play", "##ing", "##ed", "ground", "##s"]
- If “playing” is frequent in the data, it gets merged:
["playing", "##ed", "ground", "##s"]
- If “playground” is also common, we merge further:
["playing", "##ed", "playground", "##s"]
This process builds an optimal subword vocabulary.
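To make the merge step concrete, here is a minimal sketch of a WordPiece-style training loop in plain Python. The word frequencies and the target vocabulary size are toy values chosen for illustration, not real corpus statistics; production tokenizers (e.g., Hugging Face’s tokenizers library) implement this far more efficiently.
# Toy sketch of WordPiece training: repeatedly merge the adjacent pair with
# the highest score freq(pair) / (freq(first) * freq(second)).
from collections import defaultdict

# Made-up corpus statistics: word -> frequency
word_freqs = {"play": 10, "playing": 8, "played": 6, "ground": 7, "playground": 5}

# Start by splitting every word into characters; non-initial characters
# get the "##" continuation prefix.
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs
}

def best_pair(splits, word_freqs):
    """Return the adjacent pair with the highest WordPiece score."""
    pair_freqs = defaultdict(int)
    token_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for token in tokens:
            token_freqs[token] += freq
        for a, b in zip(tokens, tokens[1:]):
            pair_freqs[(a, b)] += freq
    scores = {
        pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return max(scores, key=scores.get) if scores else None

def merge_pair(a, b, splits):
    """Replace every adjacent occurrence of (a, b) with the merged token."""
    merged = a + (b[2:] if b.startswith("##") else b)
    for word, tokens in splits.items():
        new_tokens, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        splits[word] = new_tokens
    return merged

vocab = {t for tokens in splits.values() for t in tokens}
while len(vocab) < 30:                 # toy target vocabulary size
    pair = best_pair(splits, word_freqs)
    if pair is None:                   # every word is fully merged
        break
    vocab.add(merge_pair(*pair, splits))

print(sorted(vocab))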
Step 3: Tokenizing New Words
If a new word is encountered, it’s split into known subwords from the vocabulary.
Example: Tokenizing “unhappiness”
- Assume our WordPiece vocabulary contains: ["un", "##happy", "##ness"]
- “unhappiness” → ["un", "##happy", "##ness"]
Example: Tokenizing “playgrounds”
- Assume our WordPiece vocabulary contains: ["play", "##ground", "##s"]
- “playgrounds” → ["play", "##ground", "##s"]
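The lookup itself is a greedy, longest-match-first scan against the vocabulary. Below is a minimal sketch of that matching step, using a toy vocabulary built from the examples above; real tokenizers also handle lowercasing, punctuation splitting, and a maximum word length.
# Greedy longest-match-first WordPiece lookup (illustrative sketch).
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest possible substring first, shrinking until a match.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry the ## prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return [unk_token]         # no subword matches: the whole word is unknown
        tokens.append(current)
        start = end
    return tokens

vocab = {"un", "##happy", "##ness", "play", "##ground", "##s"}
print(wordpiece_tokenize("unhappiness", vocab))   # ['un', '##happy', '##ness']
print(wordpiece_tokenize("playgrounds", vocab))   # ['play', '##ground', '##s']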
4. WordPiece in Action (Python Example)
You can use Hugging Face’s Transformers library to try WordPiece tokenization.
1. Install Required Libraries
First, ensure you have the transformers library installed. If not, install it using:
pip install transformers
2. Load a WordPiece Tokenizer (BERT Tokenizer)
BERT models use WordPiece tokenization. We will use Hugging Face’s AutoTokenizer to load a pre-trained tokenizer.
from transformers import AutoTokenizer
# Load BERT tokenizer (which uses WordPiece)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Example text
text = "Tokenization helps NLP models understand language better."
# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokenized Output:", tokens)
Expected Output:
Tokenized Output: ['token', '##ization', 'helps', 'nlp', 'models', 'understand', 'language', 'better', '.']
Explanation
- “Tokenization” is split into “token” + “##ization”
- “helps”, “NLP”, and “models” stay as single tokens (they are already in the vocabulary; “NLP” is simply lowercased to “nlp”)
- The ## prefix in ##ization indicates that this subword attaches to the previous token
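To see the prefix in action, you can join the tokens back into a string. convert_tokens_to_string() is a standard method on Hugging Face tokenizers that strips the ## markers when gluing continuation pieces onto the previous token; this short check reuses tokenizer and tokens from the snippet above.
# Rejoin the subword tokens into a string; "##" pieces are glued back on.
print(tokenizer.convert_tokens_to_string(tokens))
# Roughly: tokenization helps nlp models understand language better .
# (The "." stays space-separated because it is not a "##" continuation piece.)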
3. Convert Tokens Back to IDs
Each token in WordPiece Tokenization is mapped to a unique token ID.
# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)
Example Output:
Token IDs: [19204, 21375, 7235, 20739, 4271, 3305, 2653, 2488, 1012]
Each token is now converted into a unique numerical ID, which the model processes.
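Note that when you prepare real model input, the tokenizer also adds BERT’s special tokens around the sequence. A short sketch (reusing tokenizer and text from above) shows the [CLS] and [SEP] markers that wrap the token IDs:
# Calling the tokenizer directly adds BERT's special tokens before producing IDs.
encoded = tokenizer(text)
print("Input IDs:", encoded["input_ids"])
print("Tokens:   ", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# The token list starts with '[CLS]' and ends with '[SEP]'.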
4. Decoding Back to Text
We can reverse the process using decode() to get back the text.
# Decode the token IDs back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded Text:", decoded_text)
Output:
Decoded Text: tokenization helps nlp models understand language better.
Note that “Tokenization” is converted to “tokenization”, as BERT is case-insensitive (bert-base-uncased lowercases all input).
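If preserving case matters, a cased WordPiece vocabulary can be used instead. A quick illustrative check with the standard bert-base-cased checkpoint (its splits will differ from the uncased model, because it has its own learned vocabulary):
# A cased checkpoint keeps the original capitalization in its tokens.
cased_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(cased_tokenizer.tokenize("Tokenization helps NLP models understand language better."))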
5. Tokenizing a New Word (OOV Handling)
Let’s check how WordPiece handles a new, uncommon word.
# Example of an OOV (Out-of-Vocabulary) word
new_word = "unhappiness"
# Tokenize
tokens = tokenizer.tokenize(new_word)
print("OOV Word Tokenization:", tokens)
Expected Output:
OOV Word Tokenization: ['un', '##happiness']
WordPiece breaks “unhappiness” into “un” and “##happiness” because “##happiness” exists as a continuation subword in the vocabulary, while the full word “unhappiness” does not.
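If a word contains a character the vocabulary cannot cover at all, the whole word falls back to the unknown token rather than being split. A small hedged check, reusing tokenizer from above (the exact behavior depends on the checkpoint’s vocabulary):
# Words with characters outside the vocabulary become the unknown token.
print(tokenizer.unk_token)              # '[UNK]'
print(tokenizer.tokenize("qwrtz🜲xv"))   # likely ['[UNK]'] for this made-up string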
5. WordPiece vs. Other Tokenization Methods
| Method | How It Works | Used In | Handling of OOV Words |
|---|---|---|---|
| Word Tokenization | Splits at spaces | NLTK, spaCy | Poor (OOV words are unknown) |
| Subword Tokenization (WordPiece) | Merges frequent subwords | BERT, ALBERT | Great (rare words are split) |
| Byte Pair Encoding (BPE) | Merges frequent byte/character pairs | GPT-2, GPT-3, LLaMA | Great (efficient subword merging) |
| Character Tokenization | Breaks text into characters | OCR, speech NLP | Excellent (no OOV words) |
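To see the difference between the first two subword rows in practice, the sketch below (assuming the transformers library and the standard bert-base-uncased and gpt2 checkpoints) tokenizes the same word with WordPiece and with byte-level BPE; the splits differ because each model learned its own vocabulary.
# Side-by-side comparison: WordPiece (BERT) vs. byte-level BPE (GPT-2).
from transformers import AutoTokenizer

word = "unhappiness"
wordpiece_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bpe_tok = AutoTokenizer.from_pretrained("gpt2")

print("WordPiece:", wordpiece_tok.tokenize(word))
print("BPE:      ", bpe_tok.tokenize(word))
# Note: GPT-2's BPE prefixes tokens that follow a space with 'Ġ'.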
6. Advantages of WordPiece Tokenization
- Solves OOV issues – rare words are broken into subwords instead of being ignored.
- Balances vocabulary size and efficiency – a smaller vocabulary than word-based tokenization, but larger than character-based.
- Works well in multiple languages – helps handle complex languages like Chinese, Korean, and German.
- Improves model performance – especially useful in BERT, DistilBERT, and ALBERT.
7. Conclusion
WordPiece Tokenization is a powerful subword-based technique that enables NLP models to handle rare words, improve generalization, and reduce vocabulary size. It is widely used in BERT-based models and provides a great balance between efficiency and accuracy.