# TL;DR
- RAG: Pay per query (vector DB + context tokens + LLM). Low fixed cost, cost scales with usage.
- Fine-tuning: Pay once for training (or use hosted fine-tuning), then lower per-token inference. High upfront cost, low marginal cost.
- Break-even is usually at tens of thousands to hundreds of thousands of queries for custom fine-tuning vs RAG, depending on context size and model.
- Hybrid: Use RAG for knowledge, small fine-tune for style/format; often the best cost/quality tradeoff.
# Who This Is For
Product and eng teams deciding between RAG and fine-tuning for a knowledge-heavy or domain-specific assistant. You care about total cost over 6–12 months, not just demo cost.
# Assumptions & Inputs
- Use case: Q&A or task completion over private/knowledge-base content
- Expected query volume: 10K–500K queries/month
- Knowledge size: hundreds to thousands of documents
- Willing to consider hosted fine-tuning (OpenAI, Anthropic, etc.) or self-hosted
# The Two Cost Curves
RAG:
- Fixed cost ≈ embedding (one-time) + vector DB (monthly).
- Variable cost ≈ (vector query + retrieved context tokens + LLM generation) × queries.

Fine-tuning:
- Fixed cost ≈ data prep + training job + evaluation.
- Variable cost ≈ inference only × queries; often cheaper per query than RAG if the retrieved context would be large.
So: low volume → RAG is usually cheaper. High volume + stable behavior → fine-tuning can win.
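The two curves above can be sketched as simple linear cost functions. All dollar figures here are illustrative assumptions, taken from the ranges used later in this article, not vendor quotes:

```python
# Illustrative cost curves; dollar figures are assumptions, not quotes.

def rag_cost(queries: int, fixed: float = 50.0, per_query: float = 0.03) -> float:
    """Total RAG cost: small fixed setup (embedding + index) plus per-query spend."""
    return fixed + per_query * queries

def finetune_cost(queries: int, fixed: float = 1000.0, per_query: float = 0.0075) -> float:
    """Total fine-tuning cost: one-time training plus cheaper per-query inference."""
    return fixed + per_query * queries

# Low volume: RAG's small fixed cost wins. High volume: fine-tuning's
# lower slope wins.
print(rag_cost(5_000), finetune_cost(5_000))
print(rag_cost(100_000), finetune_cost(100_000))
```

The only structural difference between the two is the intercept (fixed cost) and the slope (marginal cost per query), which is why volume decides the winner.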
# Rough Break-Even Intuition
Assume RAG: ~$0.02–0.05 per query (vector DB + 2K context + GPT-4o-mini-level generation).
Assume fine-tune: $500–2,000 one-time, then ~$0.005–0.01 per query (smaller context, cheaper model).
- Break-even = fixed cost ÷ per-query saving: $1,000 ÷ ($0.03 − $0.0075) ≈ 44K queries. (Dividing by the RAG cost alone, $1,000 ÷ $0.03 ≈ 33K, gives an optimistic lower bound.)
- If you do 100K queries/month, fine-tuning pays off in under a month; if you do 5K/month, RAG is cheaper for a long time.
Your numbers will vary with context length, model choice, and vector DB pricing, but the shape of the decision is the same.
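A minimal check of that arithmetic: the break-even point divides the fine-tune's fixed cost by the per-query saving (RAG cost minus fine-tune inference cost). The figures are the assumed mid-range values from above:

```python
# Break-even: queries at which the fine-tune's fixed cost is recouped
# by per-query savings. Figures are this article's assumed values.

rag_per_query = 0.03    # vector DB + context + generation
ft_per_query = 0.0075   # smaller context, cheaper model
ft_fixed = 1000.0       # one-time data prep + training

break_even = ft_fixed / (rag_per_query - ft_per_query)
print(round(break_even))  # ~44,444 queries
```

Plug in your own per-query costs; halving the saving doubles the break-even, which is why context length and model choice dominate the decision.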
# When RAG Is the Better Deal
- Low or unpredictable volume. No point paying for fine-tuning if you're at 1K–10K queries/month.
- Knowledge changes often. Re-embedding is cheaper than re-training.
- You need citations/sources. RAG is built for this; fine-tuning is not.
- Many domains/products. One RAG pipeline can serve many indices; fine-tuning usually means one model per use case.
# When Fine-Tuning Can Win
- Very high, stable volume. Same model, same task, millions of queries.
- Strict format/style. E.g., structured JSON or a fixed tone; fine-tuning can reduce prompt size and retries.
- Latency/cost per query matters. Smaller context + smaller or cheaper model after fine-tuning = lower marginal cost.
- Knowledge is stable. Manual or rare updates; re-training cost is amortized over many queries.
# The Hybrid Option
Often the best balance:
- RAG for retrieval (fresh, cited knowledge).
- Light fine-tune (or few-shot in prompt) for output format, terminology, and style.
You get citation and updatability from RAG, and lower prompt/output cost from a model that doesn’t need long instructions every time.
# What to Actually Calculate
- RAG:
  - One-time: embedding + vector DB setup.
  - Monthly: (vector DB + embedding of new docs) + (cost per query × expected queries).
- Fine-tuning:
  - One-time: data prep + training (hosted or self-hosted).
  - Monthly: inference cost × expected queries (+ re-training if you retrain periodically).
- Plot both over 6–12 months at low/medium/high volume and pick the curve that fits your traffic and roadmap.
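The comparison above can be sketched as follows, assuming the per-query figures from earlier (with the vector DB fee folded into the RAG per-query cost) and a 12-month horizon:

```python
# Cumulative 12-month cost at three monthly volumes. All dollar figures
# are assumptions: RAG ~$0.03/query (vector DB folded in), fine-tune
# $1,000 one-time plus ~$0.0075/query inference.

RAG_SETUP = 50.0       # one-time embedding + index setup
RAG_PER_QUERY = 0.03
FT_FIXED = 1000.0      # data prep + training + evaluation
FT_PER_QUERY = 0.0075

def cumulative(months: int, monthly_queries: int) -> tuple[float, float]:
    """Return (rag_total, finetune_total) after `months` of steady traffic."""
    q = months * monthly_queries
    return RAG_SETUP + RAG_PER_QUERY * q, FT_FIXED + FT_PER_QUERY * q

for volume in (3_000, 50_000, 200_000):
    rag, ft = cumulative(12, volume)
    winner = "RAG" if rag < ft else "fine-tune"
    print(f"{volume:>7,}/mo over 12 mo: RAG ${rag:,.0f} vs fine-tune ${ft:,.0f} -> {winner}")
```

Swap in your real per-query costs and volumes; the crossover month, if any, tells you when a fine-tune starts paying for itself.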
# Conclusion
RAG = lower fixed cost, cost scales with usage. Fine-tuning = higher fixed cost, lower marginal cost. Break-even depends on volume and your exact RAG vs inference costs. For most products, start with RAG; add fine-tuning (or hybrid) when volume and stability justify it.
For RAG cost details, see RAG cost breakdown. For vector DB pricing, use the Vector DB calculator.
TokenBurner Team
AI Infrastructure Engineers
Engineers with hands-on experience building production AI systems. We've shipped both fine-tuned and RAG-based products and compared total cost of ownership.