Gloss Key Takeaways

DeepSeek V4 Pro’s jump to a 1M-token context window makes “just include the whole document” viable for 600-page PDF sets at a low per-query cost.
For many document-heavy copilots, long-context prompting can replace traditional RAG pipelines more often than people assume, simplifying architecture.
A minimal ingest pipeline can extract PDF text page-by-page, add source markers, and rely on those markers to produce reliable citations in answers.
Explicit citation instructions in the system prompt (and a strict “don’t guess” rule) are key to getting grounded, source-linked responses.
The real tradeoff shifts from building and operating embeddings/vector infrastructure to paying context tokens on each request, which can be cheaper overall at current pricing.


Build a 1M Context Document Copilot with DeepSeek V4 Pro

DeepSeek V4 Pro jumped from 128k to 1M tokens of context this quarter, and unlike most context window jumps, this one is priced low enough to actually use. At roughly $0.14 per million input tokens, you can put a 600-page PDF set into the prompt on every request and, at modest query volumes, still come out ahead on total cost compared with an OpenAI call backed by proper RAG infrastructure. That changes the architectural calculus for a lot of document-heavy applications.

This is a hands-on build of a document copilot that reads a 600-page legal or technical PDF set, answers questions with citations, and runs in production at sane cost. We will compare it head-to-head with a RAG pipeline doing the same job, and end with a decision matrix that tells you when to pick which approach.

The argument is not "long context replaces RAG." The argument is "long context replaces RAG more often than people realize, and the cases where you still need a vector store are narrower than they were a year ago."

The naive approach that now works

Before V4 Pro, ingesting 600 pages meant chunking into 800-token segments, embedding them, indexing in a vector store, and writing retrieval logic. The naive alternative, "just put the whole document in the prompt," was either impossible or financially insane. Now it is the simpler and often better option.

import os

import pypdf
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com/v1",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

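def load_pdf_set(paths):
    """Extract text page by page, keeping the source file and page number."""
    chunks = []
    for path in paths:
        reader = pypdf.PdfReader(path)
        # Number pages from 1 so citations match what a reader sees.
        for page_num, page in enumerate(reader.pages, start=1):
            # Image-only (scanned) pages yield no text; store an empty string.
            text = page.extract_text() or ""
            chunks.append({
                "source": path,
                "page": page_num,
                "text": text,
            })
    return chunks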

def build_context(chunks):
    parts = []
    for c in chunks:
        parts.append(f"[{c['source']} p.{c['page']}]\n{c['text']}\n")
    return "\n".join(parts)

That is the entire ingest pipeline. No embeddings. No vector store. No chunk overlap tuning. The PDFs become structured text with source markers, and the source markers are what make citations possible at the end.
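
Before sending anything, it is worth checking that the assembled context plausibly fits the window and that extraction did not silently return blank pages. The sketch below is one way to do that. It assumes roughly 4 characters per token, which is a common rule of thumb for English text and not the model's real tokenizer; the names `estimate_tokens`, `fits_window`, `empty_pages`, and the reserved headroom value are hypothetical.

```python
# Rough pre-flight checks on the extracted chunks and assembled context.
# ASSUMPTION: ~4 characters per token is a heuristic for English prose,
# not the model's actual tokenizer. Treat the result as an estimate only.
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW = 1_000_000   # the 1M-token window discussed above
RESERVED_TOKENS = 4_096      # hypothetical headroom for prompt, question, answer


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length (heuristic)."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_window(context: str,
                window: int = CONTEXT_WINDOW,
                reserved: int = RESERVED_TOKENS) -> bool:
    """True if the estimated context size leaves the reserved headroom free."""
    return estimate_tokens(context) + reserved <= window


def empty_pages(chunks):
    """List (source, page) pairs where extraction produced no text,
    for example scanned pages that would need OCR."""
    return [(c["source"], c["page"])
            for c in chunks
            if not (c["text"] or "").strip()]
```

If `fits_window` returns False or `empty_pages` is non-empty, that is a signal to inspect the input before paying for a query, not a precise measurement.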


The query function with citations

The trick to good citations is asking the model to use the source markers as part of its output. Most models cooperate when you make the format explicit and give them an example.

SYSTEM = """You answer questions using the provided documents.
For every claim, cite the source in this exact format: [filename.pdf p.X].
If the answer is not in the documents, say so. Do not guess."""

def ask(question, context):
    resp = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Documents:\n\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.0,
        max_tokens=2048,
    )
    return resp.choices[0].message.content

chunks = load_pdf_set(["contract-2024.pdf", "amendment-1.pdf", "amendment-2.pdf"])
context = build_context(chunks)
print(ask("What are the termination clauses?", context))

That is the whole copilot. Roughly 50 lines of Python. You are paying for context tokens on every query, which is the obvious cost. You are saving on every other piece of infrastructure that a RAG system requires, which is the less obvious savings.
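
Because the citation format is fixed, the answer can be checked mechanically: every cited file and page should correspond to a chunk that was actually loaded. The following is a minimal sketch of that check, assuming the model reproduces the marker format exactly; `extract_citations` and `unknown_citations` are hypothetical helper names, and a passing check only shows the page exists, not that it supports the claim.

```python
import re

# Matches markers shaped like "[contract-2024.pdf p.12]".
CITATION_RE = re.compile(r"\[([^\[\]]+?) p\.(\d+)\]")


def extract_citations(answer: str):
    """Return the (source, page) pairs cited in an answer, in order."""
    return [(src, int(page)) for src, page in CITATION_RE.findall(answer)]


def unknown_citations(answer: str, chunks):
    """Return cited (source, page) pairs that match no loaded chunk."""
    known = {(c["source"], c["page"]) for c in chunks}
    return [cite for cite in extract_citations(answer) if cite not in known]
```

A non-empty result from `unknown_citations` means the model cited a page that was never in the context, which is a reasonable trigger for a retry or a warning to the user.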

The cost math people get wrong

A 600 page PDF set tokenizes to roughly 250k tokens. At $0.14 per million input tokens with V4 Pro, every query costs $0.035 in input. Output is small, maybe 500 tokens at $0.28 per million, so $0.00014. Round trip per query: about 3.5 cents.
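
The arithmetic above is easy to rerun for your own document sizes. This sketch uses the prices quoted in this article as defaults; they are inputs to adjust, and `query_cost` is a hypothetical helper name.

```python
# Prices in USD per million tokens, taken from the figures quoted above.
INPUT_PRICE_PER_M = 0.14
OUTPUT_PRICE_PER_M = 0.28


def query_cost(input_tokens: int, output_tokens: int,
               input_price: float = INPUT_PRICE_PER_M,
               output_price: float = OUTPUT_PRICE_PER_M) -> float:
    """Cost in USD of one request, given token counts and per-million prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# The 600-page example: ~250k input tokens, ~500 output tokens.
cost = query_cost(250_000, 500)
print(f"${cost:.5f} per query")
```

Swapping in your measured token counts is the quickest way to see how sensitive the per-query figure is to document size.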

Compare that to a RAG setup. Embedding the document set once: $0.50 with text-embedding-3-large. Vector store hosting: roughly $20 a month for a managed Qdrant or $40 for Pinecone, plus storage. Per-query: embed the question ($0.0001), retrieve top-k ($0), generate with retrieved chunks at maybe 8k input tokens through GPT-5.5: $0.024. Round trip: about 2.4 cents, on top of the fixed monthly hosting bill.

RAG is cheaper per query, by about a cent. Long context is cheaper to build and to maintain. The break-even depends on query volume. Below roughly 100 queries a day, long context wins on total cost of ownership. Above 1000 queries a day, RAG wins. Between those numbers it depends on how much engineering time you spend tuning the RAG pipeline, which is almost always more than people budget.
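
To locate the break-even for your own numbers, compare the two monthly bills directly. The sketch below covers infrastructure cost only: it ignores the one-time embedding cost and, more importantly, engineering time, which the text argues is the larger factor. The per-query figures are the ones quoted above (about $0.03514 for long context, about $0.0241 for RAG); the function names are hypothetical.

```python
def monthly_cost(queries_per_day: float, per_query: float,
                 fixed_monthly: float = 0.0, days: int = 30) -> float:
    """Monthly bill in USD: fixed hosting plus per-query spend."""
    return fixed_monthly + queries_per_day * days * per_query


def break_even_queries_per_day(lc_per_query: float, rag_per_query: float,
                               rag_fixed_monthly: float, days: int = 30):
    """Daily volume at which the two monthly bills are equal.

    Returns None when RAG is not cheaper per query, since long context
    then wins at every volume under this simple model.
    """
    delta = lc_per_query - rag_per_query
    if delta <= 0:
        return None
    return rag_fixed_monthly / (days * delta)


# Figures from the text: $20/month hosting at the low end, $40 at the high end.
for hosting in (20.0, 40.0):
    volume = break_even_queries_per_day(0.03514, 0.0241, hosting)
    print(f"hosting ${hosting:.0f}/month: break-even near {volume:.0f} queries/day")
```

Adding an estimate of monthly engineering cost to `rag_fixed_monthly` pushes the break-even higher, which is how the gap between the infrastructure-only figure and the 1000-a-day threshold arises.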

When long context is genuinely better

Multi-document reasoning is where long context shines. A RAG system retrieves the chunks that look most similar to the question. If your question requires synthesizing across three different sections of three different documents, the retrieval step often misses one of them. The model then sees only some of the relevant chunks and produces a confidently wrong answer.

With long context, every chunk is in scope on every query. The model can connect the section in document A that defines the term with the table in document B that uses it and the appendix in document C that lists the exceptions. RAG can do this too, but only if your retrieval ranks all three high enough.

The other case is iterative refinement. With long context, follow-up questions reference the same context implicitly. With RAG, every follow-up triggers another retrieval round, and the retrievals can drift. "Tell me more about that" is harder for a RAG system than people expect.
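
In code, iterative refinement amounts to keeping the message history and appending to it. The sketch below builds the `messages` list in the same shape the chat completions call expects; `start_conversation` and `add_follow_up` are hypothetical helper names. Note that the API is stateless, so the full history, documents included, is sent and billed as input on every turn.

```python
def start_conversation(system_prompt: str, context: str, question: str):
    """First turn: the documents travel with the first user message."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": f"Documents:\n\n{context}\n\nQuestion: {question}"},
    ]


def add_follow_up(messages, previous_answer: str, follow_up: str):
    """Return a new history with the last answer and the follow-up appended.

    The documents are not repeated: they are already in the first user
    message, so "that" in a follow-up resolves against the same context.
    """
    return messages + [
        {"role": "assistant", "content": previous_answer},
        {"role": "user", "content": follow_up},
    ]


history = start_conversation("Answer from the documents only.",
                             "[a.pdf p.1]\nSample text.\n",
                             "What are the termination clauses?")
history = add_follow_up(history, "Clause 12 covers termination [a.pdf p.1].",
                        "Tell me more about that.")
```

The resulting `history` can be passed as `messages` to the same completion call used for the first question.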

When RAG is still the right call

Document sets that change frequently. If your knowledge base updates daily, re-extracting the whole thing and re-sending it on every query is wasteful, and you want incremental indexing. RAG with a real vector store handles this naturally.

Multi-tenant systems where each query needs to scope to a customer's documents, but the total corpus is huge. You cannot put 50GB of documents into a 1M context window. You retrieve the right slice per tenant.

High-volume search applications. If you are serving 50 queries a second, RAG dominates on cost. Long context is for human-scale querying, where one query a minute is normal.


The decision matrix

When you are deciding which architecture to use, run through these dimensions in order. Pick long context if most answers point that way. Pick RAG if most answers point the other way. Mixed answers usually mean you should prototype both for a week.

Dimension	Long context fits	RAG fits
Corpus size	< 1M tokens (the 600-page set is ~250k)	> 1M tokens
Query volume	< 1k/day	> 1k/day
Update frequency	weekly or less	daily or hourly
Question type	synthesis across documents	lookup of specific facts
Tenancy	single corpus	per-tenant scoping
Engineering time	1 day	1-2 weeks
Latency tolerance	5-15 seconds	1-3 seconds
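
The matrix can be tallied mechanically as a first pass. The sketch below turns six of the rows into votes and follows the majority, treating an even split as "prototype both". The thresholds are read off the table; the latency cutoff of 5 seconds and the choice to leave out the engineering-time row (it is an outcome of the decision, not an input) are assumptions, and `Workload` and `recommend` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    corpus_tokens: int
    queries_per_day: int
    updates_per_week: float      # 1 = weekly, 7 = daily
    synthesis_heavy: bool        # questions span several documents
    per_tenant_scoping: bool
    max_latency_seconds: float   # how long a user will wait for an answer


def recommend(w: Workload) -> str:
    """Majority vote over the matrix rows; True means long context fits."""
    votes = [
        w.corpus_tokens < 1_000_000,
        w.queries_per_day < 1_000,
        w.updates_per_week <= 1,
        w.synthesis_heavy,
        not w.per_tenant_scoping,
        w.max_latency_seconds >= 5,   # ASSUMPTION: cutoff taken from the 5-15 s row
    ]
    long_context = sum(votes)
    rag = len(votes) - long_context
    if long_context > rag:
        return "long context"
    if rag > long_context:
        return "RAG"
    return "prototype both"


contract_review = Workload(250_000, 50, 0.5, True, False, 10)
print(recommend(contract_review))
```

A simple tally treats every row as equally important. A corpus that does not fit the window is a hard constraint in practice, so a "long context" result should still be checked against that row on its own.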

The biggest wins come from picking long context for what looks like a RAG problem, but where the corpus is small enough and the questions are synthesis-heavy. Legal review of a contract bundle. Onboarding documentation Q&A. Code review of a small repo. These were RAG-by-default a year ago. They are long-context-first now.

What changes next

The 1M context milestone is not the ceiling. Anthropic, Google, and DeepSeek are all signaling 2M to 10M context within the year. Pricing per token continues to drop. The architectural decision that was clearly RAG in 2023 has become "it depends" in 2025 and will become "long context unless you have a specific reason" by 2027.

The smart move today is to stop assuming RAG and start asking which architecture actually fits the problem. Most teams are running RAG pipelines that they no longer need, paying maintenance cost on infrastructure that solves a problem the model can now solve directly. Build the long context version first. Add RAG only when the decision matrix tells you to.

Gloss What This Means For You

If you’re building a document Q&A copilot, try a long-context prototype first: extract your PDFs, prepend clear page-level source markers, and enforce a strict citation format in the prompt. Then run the cost math on your actual document sizes and query volume to see whether paying a few cents per request beats standing up embeddings, a vector store, and retrieval logic. Keep RAG in your toolbox for cases where documents exceed the context window, need frequent incremental updates, or require very high recall across large corpora.