# TL;DR
- Longer context usually means higher price per token (e.g. 128K vs 8K tiers).
- Filling a large window is expensive: 100K input tokens at $2.50/1M = $0.25 per request before any output.
- For RAG: retrieve less, not more—bigger context ≠ better answers, but it always costs more.
- For agents: cap conversation + tool history; summarize or drop old turns instead of appending everything.
# Who This Is For
Developers using models with 32K–1M+ context windows. You want to use long context “when needed” without accidentally 5x-ing your bill.
# Assumptions & Inputs
- Models: GPT-4o, Claude 3.5, or similar with 128K+ context
- Use case: RAG, agents, or long-document QA
- Goal: minimize cost while keeping quality
# The “Bigger Window = Higher Rate” Rule
Many providers charge more for the same token when it’s part of a long-context tier. So:
- 8K context: $X per 1M input tokens
- 128K context: often 1.5–2× $X per 1M input tokens
- 1M context: premium pricing
Always check pricing by context tier, not just “per token.” Using a 200K window and sending 150K tokens can be 2–3× more expensive than sending 20K tokens in a 32K window.
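The tier comparison above is easy to sanity-check with a few lines of arithmetic. A minimal sketch, assuming hypothetical per-1M-token rates ($5.00 for a premium long-context tier, $2.50 for a base tier; substitute your provider’s actual numbers):

```python
# Rough input-side cost comparison across context tiers.
# Rates are HYPOTHETICAL placeholders -- check your provider's pricing page.

def request_cost(input_tokens: int, rate_per_million: float) -> float:
    """Input-side cost of one request at a per-1M-token rate."""
    return input_tokens / 1_000_000 * rate_per_million

# 150K tokens in a premium long-context tier vs 20K in a base tier:
long_ctx = request_cost(150_000, rate_per_million=5.00)   # assumed premium rate
short_ctx = request_cost(20_000, rate_per_million=2.50)   # assumed base rate
print(f"long:  ${long_ctx:.3f}")
print(f"short: ${short_ctx:.3f}")
print(f"ratio: {long_ctx / short_ctx:.0f}x")
```

With these example rates the large-window request is 15× the cost of the small one: 7.5× more tokens, doubled by the tier premium.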
# RAG: More Context = More Cost, Not Always More Quality
Typical mistake: “We have 128K context, so let’s retrieve 50 chunks and send them all.”
- 50 chunks × 500 tokens = 25K input tokens per query.
- At $2.50/1M input, that’s ~$0.0625 per query for the retrieved context alone (before any output tokens at $10/1M).
- At 100K queries/month, that’s $6,250/month just for retrieved context.
Better: Retrieve fewer, better chunks (e.g. top 5–10), use a reranker, and keep context under 5–10K tokens. Quality often improves (less noise) and cost drops a lot.
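The numbers above can be reproduced with a back-of-envelope estimator. A sketch using the rates and chunk sizes from this section (all of which are assumptions; plug in your own):

```python
# Back-of-envelope monthly cost of RAG context. Rates, chunk sizes, and
# query volume are illustrative assumptions from the text above.

def monthly_context_cost(chunks: int, tokens_per_chunk: int,
                         queries_per_month: int,
                         input_rate_per_million: float = 2.50) -> float:
    """Input-token cost of the retrieved context alone, per month."""
    tokens_per_query = chunks * tokens_per_chunk
    cost_per_query = tokens_per_query / 1_000_000 * input_rate_per_million
    return cost_per_query * queries_per_month

# 50 chunks vs a reranked top-8 at 100K queries/month:
print(monthly_context_cost(50, 500, 100_000))  # 50-chunk stuffing
print(monthly_context_cost(8, 500, 100_000))   # reranked top-8
```

At these assumed numbers, dropping from 50 chunks to a reranked top-8 cuts the context bill from $6,250 to $1,000 per month.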
# Agents: Don’t Append the Whole History
Agent loops often do: `messages = [system] + full_conversation + tool_results`.
After 10 turns, that can be 50K+ tokens per request. Cost and latency explode.
Fixes:
- Summarize old turns every N messages.
- Drop tool payloads after using them (keep only a short “Tool X returned: success”).
- Cap total context (e.g. last 5 user + 5 assistant messages).
- Use a cheap model for the “what should I do next?” routing step and the expensive model only for the final answer.
# When Long Context Actually Pays Off
- Single long document: One 80K token doc, one call. Beats chunking + many calls if pricing is favorable.
- Legal / contracts: Need to reference many sections in one go; long context avoids losing nuance at chunk boundaries.
- Codebases: “Answer about this repo” with full file(s) in context—when the model’s long-context pricing is acceptable.
Even then: measure. Compare one 80K call vs ten 8K calls for your provider and model.
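A minimal way to run that comparison, assuming hypothetical tier rates and an assumed per-call prompt overhead (chunked calls repeat the system prompt and instructions each time):

```python
# "One big call" vs "many small calls" -- rates and overhead are
# hypothetical assumptions; substitute your provider's real numbers.

def total_cost(calls: int, tokens_per_call: int, rate_per_million: float) -> float:
    """Total input-token cost across a batch of calls."""
    return calls * tokens_per_call / 1_000_000 * rate_per_million

one_big = total_cost(1, 80_000, rate_per_million=5.00)        # long-context tier
many_small = total_cost(10, 8_000, rate_per_million=2.50)     # base tier
# Chunked calls repeat the prompt; assume ~1K overhead tokens per call:
many_small_real = total_cost(10, 8_000 + 1_000, rate_per_million=2.50)
```

With these assumed rates the big call costs twice the chunked total on input tokens alone, but the gap narrows once per-call prompt overhead is included, and chunking may also require extra synthesis calls. Only a measurement with your real prices settles it.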
# What to Do in Practice
- Check pricing by context tier for your model (8K vs 32K vs 128K).
- Set a max context budget per request (e.g. 10K tokens for RAG, 20K for agents) and design prompts around it.
- Summarize or trim history in agents; don’t blindly append.
- Retrieve less, rank better in RAG; tune top_k and use a reranker.
- Log context length and cost per request so you see when something starts filling the whole window.
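The logging step above can be a few lines with the standard library. A sketch with illustrative rates (field names and defaults are assumptions, not any provider’s API):

```python
# Minimal per-request cost logging. Rates are illustrative defaults --
# override them with your model's actual pricing.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm.cost")

def log_request(input_tokens: int, output_tokens: int,
                input_rate: float = 2.50, output_rate: float = 10.00) -> float:
    """Estimate and log the cost of one request; returns the estimate in $."""
    cost = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    log.info("input_tokens=%d output_tokens=%d est_cost=$%.4f",
             input_tokens, output_tokens, cost)
    return cost
```

Graph `input_tokens` over time: a slow upward drift is usually an agent history or retrieval top_k quietly growing until it fills the window.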
# Conclusion
Long context is a feature, not a free one. Higher per-token rates and bigger payloads quickly increase cost. Use long context only where it clearly helps; everywhere else, cap and compress.
For RAG-specific cost control, see RAG cost breakdown. For prompt-level savings, see Prompt Caching.
TokenBurner Team
AI Infrastructure Engineers
Engineers with hands-on experience building production AI systems. We've optimized context usage and seen bills spike from careless context stuffing.