Cache Augmented Generation (CAG) is an architecture for Large Language Models (LLMs) that removes the need for real-time data retrieval by pre-loading a knowledge base directly into the model’s active memory.
In practical terms, while RAG (Retrieval Augmented Generation) acts like a student searching a library to answer a question, CAG acts like a student who has memorized the textbook before the test begins.
How It Works: The “KV Cache”
The technical foundation of CAG is the Key-Value (KV) Cache found in Transformer models.
- Pre-loading: Instead of chunking documents and putting them into a database, you feed the entire dataset (context) into the LLM at the start.
- Pre-computation: The model processes this data once and calculates the mathematical representations (attention states known as “Keys” and “Values”). These are stored in the GPU’s memory.
- Instant Access: When a user asks a question, the model does not search a database. It references the pre-computed cache in memory and generates an answer immediately.
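To make those three steps concrete, here is a minimal sketch of KV-cache pre-loading with the Hugging Face transformers library. The model name, document text, and question are placeholders, and the exact cache-handling API shifts a bit between library versions; the pattern is what matters: run the full context through the model once, keep the resulting past_key_values, and reuse that cache for every follow-up question.

```python
# Minimal CAG-style sketch with Hugging Face transformers (model and text are placeholders).
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small instruct model for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

# 1) Pre-loading + pre-computation: feed the whole knowledge base once and
#    keep the resulting Key/Value attention states (the KV cache).
knowledge = "<< entire manual / contract / project docs go here >>"
ctx = tokenizer(knowledge, return_tensors="pt").to(model.device)
with torch.no_grad():
    kv_cache = model(**ctx, use_cache=True).past_key_values

# 2) Instant access: for each question, append only the new tokens and reuse
#    the cached states instead of re-reading the whole context.
def ask(question: str) -> str:
    full = tokenizer(knowledge + "\n\nQ: " + question + "\nA:", return_tensors="pt").to(model.device)
    out = model.generate(
        **full,
        past_key_values=copy.deepcopy(kv_cache),  # generation mutates the cache, so reuse a copy
        max_new_tokens=100,
    )
    return tokenizer.decode(out[0, full["input_ids"].shape[-1]:], skip_special_tokens=True)

print(ask("What does the warranty cover?"))
```

Note that only the question's tokens incur new attention computation; the "textbook" portion of the prompt is answered from the pre-computed cache.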
CAG vs. RAG
| Feature | RAG (Retrieval Augmented) | CAG (Cache Augmented) |
| --- | --- | --- |
| Workflow | Retrieve → Read → Answer. Searches a database for relevant snippets first. | Preload → Cache → Answer. Loads full context once; reuses the memory state. |
| Speed | Slower. Incurs latency during the search and retrieval phase. | Faster. Zero retrieval time; generation is nearly instant. |
| Knowledge Scope | Unlimited. Can search across millions of documents. | Limited. Bounded by the model’s context window size. |
| Updates | Dynamic. Easy to add new files to the database. | Static. Adding new data usually requires reloading the cache. |
| Recall Quality | Dependent. If the search fails to find the right snippet, the answer is wrong. | High. The model sees the whole context, improving global reasoning. |
Why is CAG popular now?
This approach was previously impractical because older models had context windows of only a few thousand tokens (roughly 10-20 pages of text).
Today, the rise of Long Context LLMs allows models to hold the equivalent of hundreds of books or massive codebases in their immediate “working memory” (context windows of 1 million+ tokens). This makes it feasible to cache an entire project’s documentation rather than building a complex retrieval system for it.
When should you use CAG?
- Complex Reasoning: You need the model to connect dots across different parts of a document (e.g., “Compare the introduction’s thesis with the conclusion”), which RAG struggles to do because it retrieves in fragments.
- Low Latency Requirements: You need the fastest possible response times.
- Static Datasets: You are querying a specific set of files (like a legal contract, a book, or a manual) that won’t change during the conversation.
- Deep Interaction: Users will ask many questions about the same set of documents.
When should you stick with RAG?
- Massive Scale: You have terabytes of data that simply cannot fit into a prompt, even with large context windows.
- High Velocity Data: Your data changes every minute (like news feeds or stock tickers).
- Cost Sensitivity: Depending on the provider, keeping a large cache “warm” in memory can sometimes be more expensive than simple database lookups.
How it Reduces Cost & Latency
Prompt Caching is the feature that makes Cache Augmented Generation (CAG) economically viable. Without it, you would have to pay full price to re-upload and re-process your “entire textbook” (context) for every single question you ask. Prompt caching allows the model provider to keep that “textbook” open in memory (RAM) so you only pay to reference it, not to read it again.
When you send a prompt to an LLM, the most expensive step is the “Prefill” (reading your input and calculating the Key-Value attention states).
- Without Caching: You send 100 pages. The model reads all 100 pages ($$$), then generates the answer ($). Next question? You send the same 100 pages again, and the model reads all 100 pages again ($$$).
- With Caching: You send 100 pages once. The model reads them ($$$) and keeps the calculated math in memory. Next question? You send only the question. The model skips reading the 100 pages and just generates the answer ($).
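As a concrete illustration, here is a hedged sketch of how this looks with Anthropic's prompt-caching feature, where a cache_control marker on a system block asks the provider to keep that block's prefill warm. The model alias, handbook file, and helper function are assumptions for illustration; check field names against the current API docs before relying on them.

```python
# Sketch of prompt caching via the Anthropic Python SDK
# (field names per the prompt-caching docs at the time of writing; verify before use).
import anthropic

client = anthropic.Anthropic()               # reads ANTHROPIC_API_KEY from the environment
handbook_text = open("handbook.txt").read()  # the "100 pages" we want kept warm

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",    # assumed model alias for illustration
        max_tokens=512,
        system=[
            {"type": "text", "text": "Answer using only the handbook below."},
            # cache_control asks the provider to keep this block's pre-computed
            # Key/Value states in memory, so follow-up calls pay the cheap
            # cache-read rate instead of re-processing the whole handbook.
            {"type": "text", "text": handbook_text, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(ask("What is the refund policy?"))      # first call: pays to write the cache
print(ask("Who approves travel expenses?"))   # follow-ups: read the cache cheaply
```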
The Result:
- Cost: Input-token costs drop by roughly 90-95% after the first load.
- Latency: Time to First Token drops by 80% or more, because the model skips the reading (prefill) phase.
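To see where the cost figure comes from, here is a back-of-the-envelope calculation under assumed prices (a one-time premium to write the cache, cache reads at about 10% of the normal input rate); the exact ratios and prices vary by provider.

```python
# Back-of-the-envelope cost comparison (all prices are assumptions for illustration).
CONTEXT_TOKENS = 100_000                 # the "100 pages" of pre-loaded context
QUESTION_TOKENS = 50                     # a typical follow-up question
N_QUESTIONS = 20                         # questions asked against the same context

INPUT_PRICE = 3.00 / 1_000_000           # $ per input token (assumed base rate)
CACHE_WRITE_PRICE = INPUT_PRICE * 1.25   # assumed one-time premium to write the cache
CACHE_READ_PRICE = INPUT_PRICE * 0.10    # assumed discounted rate for cached tokens

# Per-question input cost after the first load.
per_q_no_cache = (CONTEXT_TOKENS + QUESTION_TOKENS) * INPUT_PRICE
per_q_cached = CONTEXT_TOKENS * CACHE_READ_PRICE + QUESTION_TOKENS * INPUT_PRICE
print(f"per-question savings: {100 * (1 - per_q_cached / per_q_no_cache):.0f}%")  # ~90%

# Whole session, including the one-time cost of writing the cache.
no_cache_total = N_QUESTIONS * per_q_no_cache
cached_total = CONTEXT_TOKENS * CACHE_WRITE_PRICE + N_QUESTIONS * per_q_cached
print(f"without caching: ${no_cache_total:.2f}")
print(f"with caching:    ${cached_total:.2f}")
```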
Recommendation
- If you are building a “Chat with your PDF” app where a user uploads a file and asks 10 questions in a row, use Anthropic’s caching (it’s cheaper for short, bursty sessions).
- If you are building an Enterprise Bot that needs to know the company handbook 24/7 and serve thousands of employees, use Google’s Context Caching (the hourly storage fee is cheaper than reloading the handbook constantly).
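For the second case, here is a hedged sketch using the google-generativeai SDK's context-caching feature. Class and parameter names follow that SDK's caching docs at the time of writing and should be verified; the API key, handbook file, model name, and TTL are placeholders.

```python
# Sketch of Gemini context caching with the google-generativeai SDK
# (API names per that SDK's caching docs; verify against current docs).
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")                   # placeholder
handbook_text = open("company_handbook.txt").read()       # placeholder file

# Create the cached context once; storage is billed per token-hour while it lives.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",                  # assumed cache-capable model
    system_instruction="Answer employee questions from the company handbook.",
    contents=[handbook_text],
    ttl=datetime.timedelta(hours=1),                      # keep the handbook warm for an hour
)

# Every request from every employee reuses the same cached handbook.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How many vacation days do new hires get?")
print(response.text)
```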