
RAG with EmbeddingGemma with Python Code using Ollama

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances the capabilities of Large Language Models (LLMs) by connecting them to external knowledge sources.

According to the Google Developers site, EmbeddingGemma is a compact, open embedding model built for fast, high-quality retrieval on everyday devices such as smartphones. With just 308 million parameters, it is powerful enough to run advanced AI techniques, including Retrieval-Augmented Generation (RAG), directly on your local machine without an internet connection.

Here’s a breakdown of how RAG and EmbeddingGemma work together.


The Core Concept

At its heart, RAG prevents a language model from just “making things up” (hallucinating) by forcing it to base its answers on specific information you provide. It works in two main stages:

  1. Retrieve: Find relevant information from a knowledge base.
  2. Augment & Generate: Give that information to the LLM as context along with the user’s question, and ask it to generate an answer based only on that context.

EmbeddingGemma is the crucial tool used in the Retrieve step.


How RAG Works with EmbeddingGemma

Let’s walk through the process from setting up the system to answering a user’s query.

Phase 1: Indexing Your Knowledge (The Offline Step)

First, you need to prepare your knowledge base so it’s searchable.

  1. Load & Chunk Data: Your knowledge base (e.g., internal documents, PDFs, website articles) is loaded and broken down into smaller, manageable chunks of text.
  2. Create Embeddings: This is where EmbeddingGemma comes in. It reads each text chunk and converts it into a numerical vector called an “embedding.” This vector represents the chunk’s semantic meaning. Chunks with similar meanings will have mathematically similar vectors.
  3. Store in a Vector Database: These embeddings (along with their original text) are stored in a specialized database called a vector database. This database is highly optimized for finding similar vectors very quickly (a code sketch of this indexing phase follows the list).
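A minimal sketch of this indexing phase, assuming Ollama is running locally with the embeddinggemma model already pulled and the ollama, faiss-cpu, and numpy packages installed (the full script appears later in this post):

import ollama
import faiss
import numpy as np

# Phase 1: index a tiny knowledge base (each string stands in for a document chunk)
chunks = [
    "The capital of France is Paris.",
    "Mars is known as the Red Planet.",
]

# Create one embedding per chunk with EmbeddingGemma
vectors = [ollama.embeddings(model="embeddinggemma", prompt=c)["embedding"] for c in chunks]
vectors_np = np.array(vectors).astype("float32")

# Store the vectors in a FAISS index so similar chunks can be found quickly
index = faiss.IndexFlatL2(vectors_np.shape[1])
index.add(vectors_np)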

Phase 2: Generating an Answer (The Live Step)

Now, when a user asks a question, the RAG system performs the following steps in real time (sketched in code after the list):

  1. Embed the Query: The user’s query is sent to EmbeddingGemma, which converts it into a query vector using the exact same model that was used for the documents.
  2. Search for Similarity: The system takes this query vector and searches the vector database to find the text chunk embeddings that are most similar or “closest” to it. This is the “retrieval” part.
  3. Augment the Prompt: The original text from the most relevant chunks is retrieved. This text is then combined with the user’s original query into a new, expanded prompt for a generative LLM (like the instruction-tuned version of Gemma). The prompt essentially says: “Using only the following context, answer this question.”
    • Context: [Text from the most relevant document chunks]
    • Question: [User’s original query]
  4. Generate the Final Answer: The generative LLM receives this augmented prompt and produces an answer that is grounded in the provided context, not just its own general knowledge. Because the context is highly relevant to the query (thanks to the embedding search), the final answer is accurate and specific.
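A minimal sketch of this live phase, continuing from the chunks and index built in the Phase 1 sketch above and assuming the same local Ollama setup with gemma:2b pulled:

question = "What is the capital of France?"

# Steps 1-2: embed the query with the same model and retrieve the closest chunks
q = ollama.embeddings(model="embeddinggemma", prompt=question)["embedding"]
q_np = np.array([q]).astype("float32")
_, idx = index.search(q_np, 2)
context = "\n".join(chunks[i] for i in idx[0])

# Steps 3-4: augment the prompt with the retrieved context and generate a grounded answer
prompt = f"Using only the following context, answer the question.\nContext:\n{context}\nQuestion: {question}"
reply = ollama.chat(model="gemma:2b", messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])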

Why Use EmbeddingGemma for RAG?

Using a specialized model like EmbeddingGemma within a RAG pipeline provides significant advantages:

  • High Accuracy: It creates high-quality embeddings that accurately capture the meaning of text, ensuring the most relevant documents are retrieved.
  • Efficiency: EmbeddingGemma models are designed to be lightweight and performant, making the process of creating and searching embeddings fast and cost-effective.
  • Reduces Hallucinations: Grounding the final output in retrieved passages from your knowledge base makes the LLM’s answer verifiable and far less likely to be invented.
  • Access to Private Data: It allows you to build question-answering systems over your own private or up-to-the-minute documents without having to retrain a massive LLM.

Python Code for RAG with EmbeddingGemma

This Python code shows you how to build a simple, complete RAG (Retrieval-Augmented Generation) pipeline using EmbeddingGemma for embeddings and the instruction-tuned Gemma model for generation.

The entire process runs locally using Ollama, making it great for privacy and experimentation. 🐍

Setup & Prerequisites

First, you need to set up your environment. This involves installing a few Python libraries and pulling the required models with Ollama.

  • Install Ollama: Follow the instructions on the Ollama website to install it on your system.
  • Install Python Libraries:
    pip install ollama faiss-cpu numpy
    • ollama: The official Python client to interact with Ollama.
    • faiss-cpu: Facebook AI’s efficient library for similarity search (our vector store).
    • numpy: A core library for numerical operations, required by FAISS.
  • Pull the Models: Open your terminal and run these two commands to download EmbeddingGemma and the 2B generative Gemma model (a quick smoke test follows):
ollama pull embeddinggemma
ollama pull gemma:2b
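Once both models are downloaded, a quick smoke test confirms that the Python client can reach them. This is a minimal check, assuming the Ollama server is running in the background:

import ollama

# Confirm the embedding model responds and returns a fixed-length vector
emb = ollama.embeddings(model="embeddinggemma", prompt="hello world")
print("Embedding dimensions:", len(emb["embedding"]))

# Confirm the generative model responds
chat = ollama.chat(model="gemma:2b", messages=[{"role": "user", "content": "Say hello in one word."}])
print("Reply:", chat["message"]["content"])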

Complete Python Code for RAG

This script contains everything you need: setting up a knowledge base, creating embeddings, storing them in a FAISS vector store, and querying the system. The explanation is provided after the code.

You can view and copy the code on Colab, but note that it cannot be run there, because Ollama must be running on your local machine.

import ollama
import faiss
import numpy as np

# 1. In-memory "Knowledge Base" 📚
# In a real-world scenario, this would be a collection of documents.
documents = [
    "The capital of France is Paris.",
    "The Eiffel Tower is a famous landmark in Paris.",
    "Mars is known as the Red Planet.",
    "The solar system has eight planets.",
    "The main component of Earth's atmosphere is nitrogen.",
    "Water is composed of hydrogen and oxygen atoms.",
]

# 2. Use EmbeddingGemma to create embeddings for our documents
print(  "Embedding documents...")
embeddings = []
for doc in documents:
    response = ollama.embeddings(model="embeddinggemma", prompt=doc)
    embeddings.append(response["embedding"])

# Convert embeddings to a NumPy array
embeddings_np = np.array(embeddings).astype('float32')

# 3. Create and populate a FAISS vector store 🧠
# FAISS (Facebook AI Similarity Search) is a library for efficient similarity search.
print("Creating FAISS index...")
dimension = embeddings_np.shape[1]  # Get the dimension of the embeddings
index = faiss.IndexFlatL2(dimension) # Using a simple L2 distance index
index.add(embeddings_np)
print(f"FAISS index created with {index.ntotal} vectors.")

# 4. The RAG Query Function
def query_rag(query: str, k: int = 2): 
    """
    Queries the RAG system.
    1. Embeds the query.
    2. Searches the vector store for the top-k most similar documents.
    3. Creates a prompt with the retrieved context.
    4. Calls the generative model to get an answer.
    """
    # Step 1: Embed the query
    query_embedding_response = ollama.embeddings(model="embeddinggemma", prompt=query)
    query_embedding = np.array([query_embedding_response["embedding"]]).astype('float32')

    # Step 2: Search the vector store
    print(f"\nSearching for the top {k} most relevant documents...")
    distances, indices = index.search(query_embedding, k)
    
    retrieved_chunks = [documents[i] for i in indices[0]]
    print("Retrieved context:", retrieved_chunks)

    # Step 3: Create a prompt with the context
    # Join the retrieved chunks into a bulleted list outside the f-string
    # (backslashes inside f-string expressions are a syntax error before Python 3.12)
    context = "\n    - ".join(retrieved_chunks)
    prompt_template = f"""
    Based ONLY on the following context, answer the user's question.
    If the context does not contain the answer, state that you don't have enough information.
    Context:
    - {context}
    Question: {query}
    """

    # Step 4: Call the generative model
    print("Generating answer...")
    response = ollama.chat(
        model='gemma:2b',
        messages=[{'role': 'user', 'content': prompt_template}]
    )
    return response['message']['content']

# 5. Run the RAG pipeline with a sample query
user_query = "What is the capital of France and what is a famous landmark there?"
final_answer = query_rag(user_query)

print("\n--- Final Answer ---")
print(final_answer)

user_query_2 = "What is Mars known as?"
final_answer_2 = query_rag(user_query_2)

print("\n--- Final Answer ---")
print(final_answer_2)


How It Works

  1. Knowledge Base: We start with a simple Python list called documents. This simulates the data you want the model to know about.
  2. Embedding with embeddinggemma: We loop through each document and use the ollama.embeddings function to call EmbeddingGemma. This converts each piece of text into a numerical vector that captures its meaning.
  3. FAISS Vector Store: The generated embeddings are stored in a FAISS index. This index is highly optimized to quickly find vectors that are “close” to each other in mathematical terms, which corresponds to semantic similarity (a cosine-similarity variation is shown after this list).
  4. Querying:
    • When you ask a question (user_query), the RAG function first uses EmbeddingGemma to convert your question into a vector.
    • It then uses index.search() to find the two most similar document vectors in our FAISS store.
    • The original text chunks corresponding to these vectors are retrieved.
  5. Generation with gemma:2b: The retrieved text chunks are combined with your original question into a single, comprehensive prompt. This prompt is sent to the generative Gemma model (gemma:2b), which then formulates an answer based only on the context it was given.
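As mentioned in step 3 above, IndexFlatL2 ranks documents by Euclidean distance. If you prefer cosine similarity, a common alternative is to normalize the vectors and use an inner-product index. A minimal variation, reusing the embeddings and documents lists from the script above:

import ollama
import faiss
import numpy as np

# On unit-length vectors, inner product equals cosine similarity,
# so normalize the embeddings and use an inner-product index.
docs_np = np.array(embeddings).astype("float32")  # `embeddings` from the script above
faiss.normalize_L2(docs_np)                       # normalizes each vector in place

index_cos = faiss.IndexFlatIP(docs_np.shape[1])
index_cos.add(docs_np)

q = ollama.embeddings(model="embeddinggemma", prompt="What is Mars known as?")["embedding"]
q_np = np.array([q]).astype("float32")
faiss.normalize_L2(q_np)
scores, idx = index_cos.search(q_np, 2)           # higher score = more similar
print([documents[i] for i in idx[0]], scores[0])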
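If you would rather see the answer appear token by token, the Ollama Python client also supports streaming responses. A minimal sketch, using a stand-in prompt in place of the augmented prompt built inside query_rag:

import ollama

# Stream the reply chunk by chunk instead of waiting for the full message
prompt = "Based ONLY on this context: 'Mars is known as the Red Planet.' Question: What is Mars known as?"
stream = ollama.chat(
    model="gemma:2b",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()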