- Context windows have expanded from ~4K tokens to hundreds of thousands or even millions, enabling entire docs, contracts, or codebases to fit in a single prompt.
- Many teams still use the same chunking+embeddings+vector DB RAG architecture built for tiny context windows, without reassessing whether it’s necessary.
- For smaller corpora (up to roughly 150 pages, or ~100K tokens), full-context “stuffing” is often simpler, more accurate, and more reliable than retrieval because it avoids missed chunks and fragmentation.
- RAG is still useful at larger scales or with fast-changing data, but between ~100K and ~2M tokens you can often use filtered stuffing or lightweight retrieval instead of a full vector pipeline.
- RAG pipelines add operational complexity and multiple failure points (chunking strategy, embeddings, indexing, evaluation, monitoring) that may be unjustified for typical internal knowledge bases.

Two years ago, if you wanted an AI system to answer questions about your company's documentation, you had exactly one option: chop everything into tiny chunks, embed them into vectors, retrieve the top five matches, and pray the model could synthesize a coherent answer from fragments. The context window was 4,096 tokens. You had no choice. RAG wasn't a preference, it was a survival mechanism.
Today, Claude offers 200,000 tokens of context. Gemini gives you two million. GPT-4.1 supports just over a million. You can fit entire codebases, full legal contracts, complete documentation sets into a single prompt. And yet, if you look at what most teams are actually building, they are still running the same chunking, embedding, retrieval pipeline they designed when the window was 500 times smaller. They never went back and asked the obvious question: do we still need this?
The context window revolution, in numbers
The expansion happened fast enough that many teams missed it entirely.
| Model | Release | Context Window | Equivalent Pages |
|---|---|---|---|
| GPT-3 | 2020 | 2K tokens | ~3 pages |
| GPT-3.5 | 2023 | 16K tokens | ~24 pages |
| Claude 2 | 2023 | 100K tokens | ~150 pages |
| GPT-4 Turbo | 2023 | 128K tokens | ~192 pages |
| Claude 3.5 | 2024 | 200K tokens | ~300 pages |
| Gemini 1.5 Pro | 2024 | 2M tokens | ~3,000 pages |
| GPT-4.1 | 2025 | 1M tokens | ~1,500 pages |
Three thousand pages in a single prompt. That is not a marginal improvement over 4K. That is a different category of capability. But the architectures most teams deployed in 2023 were designed for six pages, and those architectures are still running in production.
The RAG pipeline you probably don't need
Here is the standard RAG setup most teams are running: documents go through a chunking step (usually 512 tokens per chunk with some overlap), then through an embedding model, then into a vector database. At query time, the user's question gets embedded, the top-k nearest chunks are retrieved, and those chunks are stuffed into the prompt alongside the question.
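For concreteness, here is roughly what that pipeline looks like in code. This is a minimal sketch, not anyone's production system: chromadb stands in for whatever vector store you actually run, and the character-based chunker is a deliberately crude stand-in for a real chunking strategy.

```python
# A minimal version of the standard RAG pipeline described above.
# chromadb is a stand-in vector store; any equivalent works the same way.
import chromadb

CHUNK_TOKENS = 512  # typical chunk size; overlap omitted for brevity

def chunk(text: str, size: int = CHUNK_TOKENS * 4) -> list[str]:
    # Rough character-based chunking (~4 characters per token).
    return [text[i:i + size] for i in range(0, len(text), size)]

client = chromadb.Client()
collection = client.create_collection("docs")

def index(doc_id: str, text: str) -> None:
    pieces = chunk(text)
    collection.add(
        documents=pieces,
        ids=[f"{doc_id}-{i}" for i in range(len(pieces))],
    )

def retrieve(question: str, k: int = 5) -> list[str]:
    # Embed the question, pull the top-k nearest chunks.
    result = collection.query(query_texts=[question], n_results=k)
    return result["documents"][0]

def build_prompt(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```

Every function in that sketch is a place where a real system can quietly go wrong, which is the point of the next paragraph.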
This pipeline has real costs. You need to maintain a vector database. You need an embedding model. You need a chunking strategy, and getting chunk size wrong either loses context or retrieves noise. You need to handle updates when documents change. You need retrieval evaluation to make sure you are actually pulling the right chunks. Every component is a failure point. Every component needs monitoring.
For a corpus of 50 pages, none of this is necessary anymore. You can put the entire thing in the prompt. It will work better, because the model sees the full document instead of disconnected fragments. It will be simpler, because you eliminated five components from your architecture. And it will be more reliable, because there is no retrieval step that can miss the relevant passage.
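The full-context alternative barely qualifies as an architecture. A minimal sketch, assuming the corpus is a folder of markdown files and using the Anthropic SDK (the model ID is a placeholder for whatever current model you use):

```python
# Full-context "stuffing": no chunking, no embeddings, no vector store.
from pathlib import Path
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def load_corpus(folder: str) -> str:
    parts = []
    for path in sorted(Path(folder).glob("**/*.md")):
        parts.append(f"<document path='{path}'>\n{path.read_text()}\n</document>")
    return "\n\n".join(parts)

CORPUS = load_corpus("docs/")  # the entire knowledge base, verbatim

def answer(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; substitute your current model
        max_tokens=1024,
        system=f"Answer questions using these documents:\n\n{CORPUS}",
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```

That is the whole system: a file read and an API call.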
When to RAG, when to stuff, when to do both
The decision is not RAG versus no-RAG. It is about matching your architecture to your actual data volume. Here is a practical framework:
| Corpus Size | Approach | Why |
|---|---|---|
| Under 100K tokens (~150 pages) | Full context stuffing | Fits in one prompt. Simpler, more accurate, no retrieval failures. |
| 100K to 500K tokens | Filtered context stuffing | Pre-filter by metadata or section, then stuff what's relevant. |
| 500K to 2M tokens | Lightweight RAG or Gemini full-context | Use large-window models, or simple keyword/BM25 retrieval. |
| Over 2M tokens | Full RAG pipeline | Genuine need for vector search and sophisticated retrieval. |
| Rapidly changing data | RAG with live indexing | When the corpus updates hourly, you need an indexing pipeline regardless. |
Most internal documentation sets, most company knowledge bases, most customer support libraries fall under 150 pages. Most of them do not need RAG.
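If you prefer the framework as code, a rough router might look like the sketch below. The thresholds simply mirror the table, and the hourly-update heuristic is an assumption, not a hard rule:

```python
# Rough router mirroring the table above. Thresholds are guidelines, not laws.
from enum import Enum

class Approach(Enum):
    FULL_CONTEXT = "full context stuffing"
    FILTERED_STUFFING = "filtered context stuffing"
    LIGHTWEIGHT_RAG = "lightweight RAG or large-window full context"
    FULL_RAG = "full RAG pipeline"

def choose_approach(corpus_tokens: int, updates_per_day: int = 0) -> Approach:
    if updates_per_day > 24:          # roughly "updates hourly"
        return Approach.FULL_RAG      # you need an indexing pipeline anyway
    if corpus_tokens < 100_000:
        return Approach.FULL_CONTEXT
    if corpus_tokens < 500_000:
        return Approach.FILTERED_STUFFING
    if corpus_tokens < 2_000_000:
        return Approach.LIGHTWEIGHT_RAG
    return Approach.FULL_RAG
```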
The cost argument is weaker than you think
The first objection is always cost. Stuffing 100K tokens into every prompt is expensive, right? Let's look at the actual numbers.
| Approach | Input tokens per query | Cost per 1K queries (Claude Sonnet, $3/M input) |
|---|---|---|
| RAG (5 chunks, ~2,500 tokens of context) | ~3,500 | ~$10.50 |
| Full context (100K-token corpus) | ~101,000 | ~$303.00 |
| Full context with prompt caching | ~101,000 (100K read from cache at 90% off) | ~$33.30 |
Without caching, full context is roughly 29x more expensive. That sounds bad. With prompt caching, it drops to about 3x. For an application making a few thousand queries a day, that difference is tens of dollars a day, not thousands. And you are eliminating the cost of running a vector database, an embedding pipeline, and the engineering time to maintain them.
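The arithmetic is easy to reproduce. The sketch below uses Claude Sonnet's listed input rates at the time of writing ($3 per million input tokens, 10% of that for cache reads); swap in your provider's current prices.

```python
# Back-of-envelope cost comparison. Prices are per million input tokens;
# substitute your provider's current rates.
PRICE_PER_MTOK = 3.00          # uncached input
CACHED_PRICE_PER_MTOK = 0.30   # cache reads at ~90% off

def cost_per_1k_queries(input_tokens: int, cached_tokens: int = 0) -> float:
    fresh = input_tokens - cached_tokens
    per_query = (fresh * PRICE_PER_MTOK + cached_tokens * CACHED_PRICE_PER_MTOK) / 1e6
    return per_query * 1000

print(cost_per_1k_queries(3_500))                           # RAG-sized prompt: ~$10.50
print(cost_per_1k_queries(101_000))                         # full context, no cache: ~$303
print(cost_per_1k_queries(101_000, cached_tokens=100_000))  # context cached, query fresh: ~$33
```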
The real cost comparison is not input tokens versus input tokens. It is total system cost, including infrastructure, maintenance, debugging time, and the cost of wrong answers when retrieval fails.
Prompt caching changes the math entirely
If you are stuffing the same large context into repeated queries, prompt caching is not optional, it is the entire strategy. Claude's prompt caching gives you 90% off cached input tokens. That means a 100K token context that gets reused across queries costs roughly the same as processing 10K tokens fresh each time.
The implementation is straightforward: structure your prompts so the static context comes first (system prompt, documents, reference material) and the variable part (user query) comes last. Some providers cache matching prompt prefixes automatically; with Claude, you mark the static blocks with a cache breakpoint, and every subsequent query against the same document set reads from cache.
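With the Anthropic SDK, that means attaching a `cache_control` breakpoint to the document block in the system prompt. A minimal sketch (the model ID and file path are placeholders):

```python
# Reuse a large static context across queries via Anthropic prompt caching.
# The cache_control breakpoint marks everything up to that block as cacheable.
import anthropic

client = anthropic.Anthropic()
CORPUS = open("docs/all_docs.md").read()  # the static ~100K-token context

def answer(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # placeholder; substitute your current model
        max_tokens=1024,
        system=[
            {"type": "text", "text": "Answer questions using the documents below."},
            {
                "type": "text",
                "text": CORPUS,
                "cache_control": {"type": "ephemeral"},  # cached after the first call
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```

The first call pays a small cache-write premium; subsequent calls within the cache's lifetime read the 100K-token prefix at the discounted rate, so the approach pays off under steady query traffic.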
The architectures nobody is building
What frustrates me is not that teams are using RAG when they shouldn't. It is that the expanded context windows enable entirely new patterns that almost nobody is exploring.
Full-codebase reasoning. With the largest windows, you can fit a 50,000-line codebase into a single prompt. That means an AI assistant that understands your entire application, not just the file you have open. Claude Code leans in this direction: rather than chunk-and-embed retrieval, it reads project files directly, follows the relationships between modules, and makes changes that are consistent across the codebase. Most coding assistants are still doing file-level RAG.
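A crude version of the pattern is nothing more than a directory walk that packs source files, with their paths as headers, into one string. The extension list and size guard below are assumptions to adapt to your stack:

```python
# Concatenate an entire codebase into one prompt-ready string.
from pathlib import Path

SOURCE_EXTENSIONS = {".py", ".ts", ".go", ".rs", ".java"}  # adjust to your stack
MAX_CHARS = 3_000_000  # rough guard, ~750K tokens; sized for a large-window model

def pack_codebase(root: str) -> str:
    parts, total = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_EXTENSIONS:
            continue
        text = path.read_text(errors="ignore")
        total += len(text)
        if total > MAX_CHARS:
            raise ValueError("Codebase exceeds the window; filter or fall back to retrieval.")
        parts.append(f"### {path}\n{text}")
    return "\n\n".join(parts)
```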
Multi-document synthesis. Legal teams reviewing contracts could load ten related agreements into a single prompt and ask the model to identify conflicts between them. Instead, they are running each document through a separate RAG query and trying to stitch the answers together manually.
Longitudinal analysis. You can load six months of weekly reports into one prompt and ask for trend analysis. The model sees the full timeline, catches patterns that span months, and identifies gradual shifts that chunked retrieval would miss entirely.
Debug-by-context. Load your application logs, configuration files, and recent code changes into one prompt. Ask the model what went wrong. It can correlate a config change three weeks ago with an error pattern that started two weeks ago, something that RAG would never connect because the chunks would never be retrieved together.
These patterns are not theoretical. They work today with current models. But they require engineers to abandon the mental model that says "large context is wasteful" and replace it with "large context is a feature."
The real reason teams don't change
Technical inertia is the polite explanation. The honest one is that nobody wants to rip out infrastructure they spent months building. If you led the effort to set up Pinecone, built the chunking pipeline, tuned the embedding model, and wrote the retrieval evaluation suite, you have a professional incentive to keep that system running. Replacing it with "just put everything in the prompt" feels like admitting the work was unnecessary.
It wasn't unnecessary. When the context window was 4K, all of that work was essential. The mistake is treating past necessity as current necessity. The models changed. The constraints changed. The architecture should change too.
There is also a knowledge gap. Many teams set up their RAG pipeline using a tutorial from 2023 and never revisited the decision. They don't know that prompt caching exists. They don't know that today's context windows are 50x what they were working with when they started. They are optimizing a system that was designed for constraints that no longer exist.
What to do on Monday
If you have a RAG pipeline in production, do this: measure your actual corpus size. Not the theoretical maximum, the real volume of data you are searching over for a typical query. If it is under 200,000 tokens, run an experiment. Take a representative set of queries, answer them with full context stuffing (with caching enabled), and compare the results to your RAG pipeline.
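The experiment does not need much tooling. A bare-bones harness runs the same questions through both paths and writes the answers side by side for review; `answer_with_rag` and `answer_with_full_context` stand in for your two implementations:

```python
# Side-by-side comparison of RAG vs. full-context answers on the same queries.
import csv
import time

def compare(queries: list[str], answer_with_rag, answer_with_full_context,
            out_path: str = "rag_vs_full_context.csv") -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "rag_answer", "rag_seconds",
                         "full_context_answer", "full_context_seconds"])
        for q in queries:
            t0 = time.time()
            rag = answer_with_rag(q)
            t_rag = time.time() - t0

            t0 = time.time()
            full = answer_with_full_context(q)
            t_full = time.time() - t0

            writer.writerow([q, rag, round(t_rag, 2), full, round(t_full, 2)])
```

Score the answers however you already evaluate your pipeline, whether that is human review or an LLM judge; the point is to compare both paths on identical inputs.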
I have done this with four different clients in the past six months. In every case, the full-context approach produced better answers, was simpler to maintain, and cost less than expected once caching was factored in. In two cases, we decommissioned the vector database entirely.
You do not have to tear everything down at once. Start with one use case. Run both approaches in parallel. Measure answer quality, latency, and total cost. Let the data tell you whether your architecture still fits your constraints.
The context window got orders of magnitude bigger. Your architecture should at least get a second look.
Audit your current knowledge-base or doc-Q&A setup and measure how big your corpus actually is in tokens before defaulting to RAG. If it fits comfortably in a modern model’s context window, try a stuffing or filtered-stuffing prototype and compare answer quality and reliability against your retrieval pipeline. Reserve full RAG for genuinely large corpora or situations where content changes so frequently that you need continuous indexing anyway.