Cache Augmented Generation (CAG) is an architecture for Large Language Models (LLMs) that removes the need for real-time data retrieval by pre-loading a knowledge base directly into the model’s active memory.
In practical terms, while RAG (Retrieval Augmented Generation) acts like a student searching a library to answer a question, CAG acts like a student who has memorized the textbook before the test begins.
How It Works: The “KV Cache”
The technical foundation of CAG is the Key-Value (KV) Cache found in Transformer models.
- Pre-loading: Instead of chunking documents and putting them into a database, you feed the entire dataset (context) into the LLM at the start.
- Pre-computation: The model processes this data once and calculates the mathematical representations (attention states known as “Keys” and “Values”). These are stored in the GPU’s memory.
- Instant Access: When a user asks a question, the model does not search a database. It references the pre-computed cache in memory and generates an answer immediately.
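To make those three steps concrete, here is a minimal sketch of KV-cache pre-loading with the Hugging Face transformers library. The model name, document text, and question are placeholders, and the exact cache-handling API shifts a bit between library versions; the pattern is what matters: run the full context through the model once, keep the resulting past_key_values, and reuse that cache for every follow-up question.

```python
# Minimal CAG-style sketch with Hugging Face transformers (model and text are placeholders).
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small instruct model for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

# 1) Pre-loading + pre-computation: feed the whole knowledge base once and
#    keep the resulting Key/Value attention states (the KV cache).
knowledge = "<< entire manual / contract / project docs go here >>"
ctx = tokenizer(knowledge, return_tensors="pt").to(model.device)
with torch.no_grad():
    kv_cache = model(**ctx, use_cache=True).past_key_values

# 2) Instant access: for each question, append only the new tokens and reuse
#    the cached states instead of re-reading the whole context.
def ask(question: str) -> str:
    full = tokenizer(knowledge + "\n\nQ: " + question + "\nA:", return_tensors="pt").to(model.device)
    out = model.generate(
        **full,
        past_key_values=copy.deepcopy(kv_cache),  # generation mutates the cache, so reuse a copy
        max_new_tokens=100,
    )
    return tokenizer.decode(out[0, full["input_ids"].shape[-1]:], skip_special_tokens=True)

print(ask("What does the warranty cover?"))
```

Note that only the question's tokens incur new attention computation; the "textbook" portion of the prompt is answered from the pre-computed cache.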
CAG vs. RAG
| Feature | RAG (Retrieval Augmented) | CAG (Cache Augmented) |
| --- | --- | --- |
| Workflow | Retrieve → Read → Answer. Searches a database for relevant snippets first. | Preload → Cache → Answer. Loads full context once; reuses the memory state. |
| Speed | Slower. Incurs latency during the search and retrieval phase. | Faster. Zero retrieval time; generation is nearly instant. |
| Knowledge Scope | Unlimited. Can search across millions of documents. | Limited. Bounded by the model’s context window size. |
| Updates | Dynamic. Easy to add new files to the database. | Static. Adding new data usually requires reloading the cache. |
| Recall Quality | Dependent. If the search fails to find the right snippet, the answer is wrong. | High. The model sees the whole context, improving global reasoning. |
Why is CAG popular now?
This approach was previously impractical because older models had context windows of only a few thousand tokens (roughly 10-20 pages of text).
Today, the rise of Long Context LLMs allows models to hold the equivalent of hundreds of books or massive codebases in their immediate “working memory” (context windows of 1 million+ tokens). This makes it feasible to cache an entire project’s documentation rather than building a complex retrieval system for it.
When should you use CAG?
- Complex Reasoning: You need the model to connect dots across different parts of a document (e.g., “Compare the introduction’s thesis with the conclusion”), which RAG struggles to do because it retrieves in fragments.
- Low Latency Requirements: You need the fastest possible response times.
- Static Datasets: You are querying a specific set of files (like a legal contract, a book, or a manual) that won’t change during the conversation.
- Deep Interaction: Users will ask many questions about the same set of documents.
When should you stick with RAG?
- Massive Scale: You have terabytes of data that simply cannot fit into a prompt, even with large context windows.
- High Velocity Data: Your data changes every minute (like news feeds or stock tickers).
- Cost Sensitivity: Depending on the provider, keeping a large cache “warm” in memory can sometimes be more expensive than simple database lookups.
How it Reduces Cost & Latency
Prompt Caching is the feature that makes Cache Augmented Generation (CAG) economically viable. Without it, you would have to pay full price to re-upload and re-process your “entire textbook” (context) for every single question you ask. Prompt caching allows the model provider to keep that “textbook” open in memory (RAM) so you only pay to reference it, not to read it again.
When you send a prompt to an LLM, the most expensive step is the “Prefill” (reading your input and calculating the Key-Value attention states).
- Without Caching: You send 100 pages. The model reads all 100 pages ($$$), then generates the answer ($). Next question? You send the same 100 pages again, and the model reads all 100 pages again ($$$).
- With Caching: You send 100 pages once. The model reads them ($$$) and keeps the calculated math in memory. Next question? You send only the question. The model skips reading the 100 pages and just generates the answer ($).
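As a concrete illustration, here is a hedged sketch of how this looks with Anthropic's prompt-caching feature, where a cache_control marker on a system block asks the provider to keep that block's prefill warm. The model alias, handbook file, and helper function are assumptions for illustration; check field names against the current API docs before relying on them.

```python
# Sketch of prompt caching via the Anthropic Python SDK
# (field names per the prompt-caching docs at the time of writing; verify before use).
import anthropic

client = anthropic.Anthropic()               # reads ANTHROPIC_API_KEY from the environment
handbook_text = open("handbook.txt").read()  # the "100 pages" we want kept warm

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",    # assumed model alias for illustration
        max_tokens=512,
        system=[
            {"type": "text", "text": "Answer using only the handbook below."},
            # cache_control asks the provider to keep this block's pre-computed
            # Key/Value states in memory, so follow-up calls pay the cheap
            # cache-read rate instead of re-processing the whole handbook.
            {"type": "text", "text": handbook_text, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(ask("What is the refund policy?"))      # first call: pays to write the cache
print(ask("Who approves travel expenses?"))   # follow-ups: read the cache cheaply
```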
The Result:
- Cost: Input-token costs drop by roughly 90-95% after the first load.
- Latency: Time to First Token drops by 80% or more, because the model skips the reading (prefill) phase.
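To see where the cost figure comes from, here is a back-of-the-envelope calculation under assumed prices (a one-time premium to write the cache, cache reads at about 10% of the normal input rate); the exact ratios and prices vary by provider.

```python
# Back-of-the-envelope cost comparison (all prices are assumptions for illustration).
CONTEXT_TOKENS = 100_000                 # the "100 pages" of pre-loaded context
QUESTION_TOKENS = 50                     # a typical follow-up question
N_QUESTIONS = 20                         # questions asked against the same context

INPUT_PRICE = 3.00 / 1_000_000           # $ per input token (assumed base rate)
CACHE_WRITE_PRICE = INPUT_PRICE * 1.25   # assumed one-time premium to write the cache
CACHE_READ_PRICE = INPUT_PRICE * 0.10    # assumed discounted rate for cached tokens

# Per-question input cost after the first load.
per_q_no_cache = (CONTEXT_TOKENS + QUESTION_TOKENS) * INPUT_PRICE
per_q_cached = CONTEXT_TOKENS * CACHE_READ_PRICE + QUESTION_TOKENS * INPUT_PRICE
print(f"per-question savings: {100 * (1 - per_q_cached / per_q_no_cache):.0f}%")  # ~90%

# Whole session, including the one-time cost of writing the cache.
no_cache_total = N_QUESTIONS * per_q_no_cache
cached_total = CONTEXT_TOKENS * CACHE_WRITE_PRICE + N_QUESTIONS * per_q_cached
print(f"without caching: ${no_cache_total:.2f}")
print(f"with caching:    ${cached_total:.2f}")
```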
Recommendation
- If you are building a “Chat with your PDF” app where a user uploads a file and asks 10 questions in a row, use Anthropic’s caching (it’s cheaper for short, bursty sessions).
- If you are building an Enterprise Bot that needs to know the company handbook 24/7 and serve thousands of employees, use Google’s Context Caching (the hourly storage fee is cheaper than reloading the handbook constantly).
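For the second case, here is a hedged sketch using the google-generativeai SDK's context-caching feature. Class and parameter names follow that SDK's caching docs at the time of writing and should be verified; the API key, handbook file, model name, and TTL are placeholders.

```python
# Sketch of Gemini context caching with the google-generativeai SDK
# (API names per that SDK's caching docs; verify against current docs).
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")                   # placeholder
handbook_text = open("company_handbook.txt").read()       # placeholder file

# Create the cached context once; storage is billed per token-hour while it lives.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",                  # assumed cache-capable model
    system_instruction="Answer employee questions from the company handbook.",
    contents=[handbook_text],
    ttl=datetime.timedelta(hours=1),                      # keep the handbook warm for an hour
)

# Every request from every employee reuses the same cached handbook.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How many vacation days do new hires get?")
print(response.text)
```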