Insights · 2026-01-08 · By TokenBurner Team

RTX 4090 VRAM Limits: What Models Actually Fit

A single RTX 4090 can't run Llama-3 70B at usable speeds. Here's the VRAM math, quantization tradeoffs, and what actually works on 24GB.

local-llm · hardware · rtx-4090 · llama-3 · vram-optimization

# TL;DR

  • A single RTX 4090 (24GB) cannot run Llama-3 70B at usable speeds without extreme quantization
  • Q4 quantization requires ~42GB VRAM—you need dual 3090s/4090s (48GB total)
  • CPU offloading works but drops speed to 2-4 tokens/sec (unusable for production)
  • Practical options: 30B-35B models (Yi-34B, DeepSeek-Coder-33B) or Mixtral 8x7B fit in 24GB
  • Hardware costs add up: electricity (~$54/month at 24/7), second card requires new mobo/PSU

# Who This Is For

Engineers considering local LLM deployment to reduce API costs. You have a GPU budget ($1,500-$4,000) and want to know what models actually run on consumer hardware.

# Assumptions & Inputs

  • RTX 4090: 24GB VRAM
  • Target: Llama-3 70B or similar 70B-class models
  • Use case: coding assistance, RAG, or general chat
  • Speed requirement: >10 tokens/sec for interactive use

I'm looking at my OpenAI usage dashboard thinking:

"I'm burning $200/month renting intelligence. If I buy a GPU, it pays for itself in 9 months. Free tokens forever."

So I did what any rational engineer with poor impulse control does: I bought an RTX 4090 (24GB).

My plan was simple:

  1. Install ollama or ExLlamaV2.
  2. Download Llama-3 70B.
  3. Fire OpenAI.

Then I hit run, and my computer froze for 45 seconds before spitting out one token per second.

I didn't escape the API mines. I just bought a very expensive space heater.

Here's the technical reality check nobody gives you before you swipe your card.

# 1. The Misconception: "24GB is Huge"

In gaming, 24GB VRAM is god-tier. In LLM land, 24GB is a studio apartment.

Most people assume:

“70B is just a number. Compression (Quantization) is magic. It’ll fit.”

Nope.

Llama-3 70B has 70 billion parameters. At FP16 (standard precision), that’s 70B * 2 bytes = 140GB.

Your 4090 has 24GB.

You are trying to park a Boeing 747 in a residential garage.
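Here's that weight-only math as a quick Python sanity check. It counts nothing but the parameters themselves; activations, KV cache, and framework overhead all come on top, so the real number is worse:

```python
# Weight-only memory for Llama-3 70B at FP16 (2 bytes per parameter).
# Activations, KV cache, and runtime overhead are NOT included.
params = 70e9          # 70 billion parameters
bytes_per_param = 2    # FP16

weights_gb = params * bytes_per_param / 1e9
print(f"FP16 weights: {weights_gb:.0f} GB")                        # ~140 GB
print(f"RTX 4090 shortfall: {weights_gb - 24:.0f} GB over 24 GB")  # ~116 GB over
```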

# 2. The "Quantization" Gamble

"But what about 4-bit quantization?" you ask.

Let's look at the actual GGUF sizes for a 70B model:

  • Q8_0 (8-bit): ~75 GB (Need 4x 3090s)
  • Q4_K_M (4-bit): ~42 GB (Need 2x 3090s/4090s)
  • Q2_K (2-bit): ~26 GB (Still doesn't fit on one card)

Even if you crush the model down to 4-bit (which is the industry standard for "usable intelligence"), you need 42GB of VRAM.

With a single 4090, you are short by 18GB.
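If you want to reproduce those numbers, the same arithmetic works for quantized formats; just swap in bits per weight. The bpw values below are rough averages I'm assuming for each GGUF quant type (real files keep embeddings and a few other tensors at higher precision, so actual downloads vary by a couple of GB):

```python
# Approximate in-VRAM size of a 70B model at common GGUF quant levels.
# Bits-per-weight values are assumed averages, not exact per-file figures.
params = 70e9
approx_bpw = {"Q8_0": 8.5, "Q4_K_M": 4.8, "Q2_K": 3.0}

for name, bpw in approx_bpw.items():
    size_gb = params * bpw / 8 / 1e9
    verdict = "fits" if size_gb <= 24 else "does NOT fit"
    print(f"{name:>7}: ~{size_gb:.0f} GB -> {verdict} on a single 24 GB card")
```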

⚠️ Warning: Measure before you buy

I realized this after the card arrived. Don't be like me. Check the VRAM Calculator first. The difference between Q4 and Q2 is massive.

# 3. The "Offloading" Lie

The internet will tell you:

"Just offload the rest to your System RAM! It’s fine!"

It is not fine.

When you split a model between GPU (VRAM) and CPU (DDR5 RAM), the layers that spill into system RAM run at system-RAM bandwidth, with PCIe transfers adding overhead on top. Both are an order of magnitude slower than VRAM.

The Speed Penalty:

  • Full GPU offload: ~40-60 tokens/sec (Instant coding assistance)
  • Mixed CPU/GPU: ~2-4 tokens/sec (Painfully slow reading speed)

If you are building a RAG app or an Agent loop, 3 tokens/second is useless. You will wait 5 minutes for a code refactor that GPT-4o does in 10 seconds.
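You can see why with a napkin roofline model. Single-stream decoding is memory-bandwidth bound: every generated token has to stream all the weights once, so tokens/sec is roughly bandwidth divided by model size. The bandwidth figures below are ballpark assumptions (≈1 TB/s for 4090 VRAM, ≈80 GB/s for dual-channel DDR5), not benchmarks:

```python
# Napkin roofline: each generated token streams every weight once, so
# time per token = (GB in VRAM / VRAM bandwidth) + (GB in system RAM / RAM bandwidth).
GPU_BW_GBS = 1000   # assumed ~1 TB/s for RTX 4090 GDDR6X
CPU_BW_GBS = 80     # assumed ~80 GB/s for dual-channel DDR5

def tokens_per_sec(vram_gb: float, ram_gb: float) -> float:
    seconds_per_token = vram_gb / GPU_BW_GBS + ram_gb / CPU_BW_GBS
    return 1 / seconds_per_token

print(f"34B Q4 fully in VRAM (20 GB):        {tokens_per_sec(20, 0):.0f} tok/s")   # ~50
print(f"70B Q4 split 24 GB VRAM / 18 GB RAM: {tokens_per_sec(24, 18):.0f} tok/s")  # ~4
```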

# 4. So... What Can One 4090 Actually Run?

If you stick to a single card, you have to choose: High IQ (Slow) or Medium IQ (Fast).

# The Sweet Spot: 30B - 35B Models

This is where the 4090 actually shines.

  • Yi-34B (Q4): ~20GB. Fits entirely in VRAM.
  • Speed: 50+ tokens/sec.
  • Quality: Better than GPT-3.5, slightly below GPT-4.

# The "Mixture of Experts" (Mixtral 8x7B)

  • Mixtral 8x7B (Q4): ~26GB. That already overflows 24GB before you add any context.
  • Hack: drop to Q3_K_M (~20GB) and it fits, with headroom left for the KV cache (see the sketch after this list).
  • Result: This is currently the best coding assistant you can run on a single card.
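Why the context window is the thing that tips it over: every token of context costs KV-cache memory on top of the weights. A rough estimate using Mixtral 8x7B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache, which is my assumption here:

```python
# Rough KV-cache size for Mixtral 8x7B: K and V tensors per layer,
# n_kv_heads * head_dim values per token, stored at FP16 (2 bytes).
n_layers, n_kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2

def kv_cache_gb(context_tokens: int) -> float:
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes  # ~128 KB/token
    return per_token_bytes * context_tokens / 1e9

for ctx in (4096, 16384, 32768):
    print(f"{ctx:>5}-token context -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
# A ~20 GB Q3_K_M file plus ~4 GB of cache at full 32k context is already
# brushing the 24 GB ceiling.
```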

# The "Lobotomy" Option (70B at IQ2_XXS)

You can run Llama-3 70B on one card if you use IQ2_XXS quantization (approx 2.0 bits per weight).

  • Size: ~22GB.
  • Result: It runs fast, but it's brain-damaged. It forgets instructions, hallucinates libraries, and fails logic tests that the 8B model passes.

Don't run a lobotomized 70B just to say you're running 70B.

# 5. Hardware Costs Beyond the GPU

API costs are visible. Hardware costs are invisible until you check the meter.

  1. Electricity: My 4090 rig pulls ~500W under load. If I run it 24/7 as a server, that's $54/month in electricity alone (the arithmetic is sketched below).
  2. The "Second Card" Trap: Once you realize 24GB isn't enough, you'll want a second card. But 4090s are huge. You'll need a new motherboard, a massive case, and a 1600W PSU. Suddenly your "$1,800 project" is a "$4,000 workstation."
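The electricity figure is just power × hours × rate. I'm assuming roughly $0.15/kWh (close to the US average), which is what makes the $54 work out; plug in your own tariff:

```python
# Monthly electricity cost for a rig running 24/7 at steady load.
# The $/kWh rate is an assumption; use your local rate.
load_watts = 500
hours_per_month = 24 * 30
rate_per_kwh = 0.15

kwh_per_month = load_watts / 1000 * hours_per_month                              # 360 kWh
print(f"{kwh_per_month:.0f} kWh/month -> ${kwh_per_month * rate_per_kwh:.0f}/month")  # ~$54
```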

# Conclusion: My Survival Strategy

I didn't sell the card. But I stopped trying to force Llama-3 70B into it.

My Daily Driver Stack:

  1. Coding: DeepSeek-Coder-33B (Q4). Fits perfectly. Fast completion.
  2. General Chat: Llama-3 8B (FP16). Lightning fast (100+ t/s).
  3. Complex Logic: API (Claude 3.5 Sonnet).

I use the GPU for the 90% of "dumb tasks" (autocomplete, simple refactors, summarization) and pay the API for the 10% of "genius tasks."

That cut my API bill from $200/month to $20.
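The routing itself doesn't have to be clever. Here's a minimal sketch of the split; the model names and the keyword heuristic are placeholders I'm assuming, not a real classifier:

```python
# Minimal sketch of "local for the 90% of dumb tasks, API for the 10% of genius tasks".
# Model names and the HARD_HINTS heuristic are placeholders, not production logic.
LOCAL_MODEL = "deepseek-coder:33b"   # runs on the 4090
API_MODEL = "claude-3-5-sonnet"      # paid fallback for hard problems

HARD_HINTS = ("architecture", "design", "prove", "race condition")

def pick_model(prompt: str) -> str:
    """Route long or clearly hard prompts to the API; keep everything else local."""
    if len(prompt) > 4000 or any(hint in prompt.lower() for hint in HARD_HINTS):
        return API_MODEL
    return LOCAL_MODEL

print(pick_model("rename this variable across the file"))        # deepseek-coder:33b
print(pick_model("design a multi-tenant billing architecture"))  # claude-3-5-sonnet
```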

If you're browsing eBay for used 3090s right now, stop. Do the math first. Check if the specific model + quantization + context window you want actually fits in the VRAM you're buying.

For more on VRAM requirements, see Llama 70B VRAM Requirements. For RAG workloads, vector database costs can also surprise you—check RAG cost analysis before building.

# Try the Calculator

Check if your GPU can run Llama-3 70B — select your exact hardware, model, and quantization to see VRAM breakdown.


TokenBurner Team

AI Infrastructure Engineers

Engineers with hands-on experience building production AI systems. We've tested local LLM deployments on various hardware configurations to find what actually works.

Learn more about TokenBurner →