NVIDIA GeForce RTX 4090
LLM Inference Performance
| Model | Tokens / sec | Local Fit |
|---|---|---|
| Mistral 7B Q4 | 150 tok/s | fits · single GPU |
| Llama 3 8B Q4 | 140 tok/s | fits · single GPU |
| Llama 3 13B Q4 | 78 tok/s | fits · single GPU |
| Llama 3 70B Q4 | OOM | OOM / offload |
Quick Summary
For AI work in 2026, the RTX 4090 is still the fastest single consumer card in this database. With 16,384 CUDA cores, 512 fourth-gen Tensor cores, 24GB of GDDR6X and 82.6 TFLOPS of FP32 compute, it runs local LLM inference on quantized models up to 13B parameters at high speed and handles fine-tuning jobs that lesser cards can’t. The catches are a 450W power draw and a flagship price.
Specs That Matter for AI
The 24GB of VRAM is the headline spec because it sets the ceiling on model size. The 4090 comfortably runs quantized 7B and 13B models, but a 70B model at Q4 (~40GB) exceeds its memory and forces slow CPU offload. Memory bandwidth of ~1 TB/s keeps token generation fast when the model fits entirely in VRAM.
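The ~40GB figure for a 70B model can be reproduced with simple arithmetic. The sketch below is a rough estimate, not a measurement: it assumes about 4.5 bits per weight for a Q4-style quantization and reserves a hypothetical 2 GB for KV cache and runtime buffers. Both values are illustrative assumptions, and real usage varies with the quantization variant and context length.

```python
# Rough VRAM check for quantized model weights.
# ASSUMPTIONS (not from the article): ~4.5 bits per weight for a Q4-style
# quantization, and 2 GB reserved for KV cache and runtime buffers.


def weight_memory_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Memory for the weights alone, in GB (1 GB = 10**9 bytes)."""
    # billions of params * bits per weight / 8 bits per byte = billions of bytes
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, vram_gb: float = 24.0,
                 bits_per_weight: float = 4.5, overhead_gb: float = 2.0) -> bool:
    """True if the weights plus the assumed overhead fit in the given VRAM."""
    needed = weight_memory_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= vram_gb


if __name__ == "__main__":
    for size in (7, 8, 13, 70):
        gb = weight_memory_gb(size)
        verdict = "fits" if fits_in_vram(size) else "needs offload"
        print(f"{size}B at Q4: ~{gb:.1f} GB of weights, {verdict} on 24 GB")
```

Under these assumptions a 70B model needs about 39 GB for weights alone, which matches the roughly 40GB cited above, while a 13B model needs a little over 7 GB and leaves generous headroom on a 24GB card.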
Performance
Expect ~140 tok/s on Llama 3 8B Q4 and ~78 tok/s on the 13B Q4 model, the fastest results in this database. For training and LoRA fine-tuning of smaller models, the Tensor cores and memory bandwidth make it the clear leader among consumer hardware.
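One way to sanity-check these numbers is against a memory-bandwidth ceiling. Single-stream token generation has to read roughly every weight once per token, so bandwidth divided by model size gives an upper bound on tok/s. The sketch below is a simplified model, not a benchmark: it assumes the ~1 TB/s bandwidth quoted above and about 4.5 bits per weight for Q4 (an illustrative assumption), and it ignores KV cache reads and compute time.

```python
# Bandwidth-bound ceiling on single-stream decode speed.
# ASSUMPTIONS (not from the article): every weight is read once per token,
# Q4 weights take ~4.5 bits each, KV cache traffic and compute are ignored.


def weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * bits_per_weight / 8


def decode_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float = 1000.0) -> float:
    """Upper bound on tokens/sec if decoding is limited only by memory reads."""
    return bandwidth_gb_s / model_gb


def bandwidth_efficiency(measured_tok_s: float, model_gb: float,
                         bandwidth_gb_s: float = 1000.0) -> float:
    """Fraction of the theoretical ceiling that a measured speed reaches."""
    return measured_tok_s / decode_ceiling_tok_s(model_gb, bandwidth_gb_s)


if __name__ == "__main__":
    # Measured speeds are the figures quoted in this article.
    for name, params_b, measured in (("8B Q4", 8, 140.0), ("13B Q4", 13, 78.0)):
        size = weights_gb(params_b)
        ceiling = decode_ceiling_tok_s(size)
        eff = bandwidth_efficiency(measured, size)
        print(f"{name}: ceiling ~{ceiling:.0f} tok/s, measured {measured:.0f} ({eff:.0%})")
```

Under these assumptions the 8B model has a ceiling of roughly 222 tok/s, so the quoted 140 tok/s sits at a little over 60% of the theoretical limit. That is a plausible range for a bandwidth-bound workload.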
Verdict
If you want maximum consumer AI speed and can absorb the power and price, the 4090 is unmatched. If your workload is inference-only and VRAM-bound, a cheaper 24GB card delivers most of the capability for far less money.
Frequently Asked Questions
- Can the RTX 4090 run a 70B model?
- Not in VRAM alone. A 70B model at Q4 needs roughly 40GB, beyond the 4090's 24GB, so it requires CPU offload and slows to a crawl. The 4090 is ideal for 7B–13B models and excellent for fine-tuning smaller models.
- Is the RTX 4090 worth it for AI in 2026?
- For the fastest single-card consumer inference and training of models up to 13B, yes. If you only need inference and care about cost, a used 24GB card like the 3090 offers similar VRAM for far less.
- How much power does the RTX 4090 draw?
- 450W TDP with transient spikes toward 600W. Budget a 1000W PSU and good case airflow; it is the most power-hungry consumer GPU here.
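
The 1000W recommendation can be sketched as a simple headroom calculation. The GPU figure below is the ~600W transient peak mentioned above. The CPU and rest-of-system figures are hypothetical placeholders rather than measurements, so substitute values for your own build.

```python
import math

# PSU sizing sketch.
# ASSUMPTIONS (not from the article): 250 W CPU peak and 100 W for the rest
# of the system are illustrative placeholders; only the 600 W GPU transient
# figure comes from the text above.


def recommended_psu_watts(gpu_peak_w: int = 600, cpu_peak_w: int = 250,
                          other_w: int = 100, step_w: int = 100) -> int:
    """Sum the peak draws and round up to the next PSU size step."""
    total = gpu_peak_w + cpu_peak_w + other_w
    return math.ceil(total / step_w) * step_w


if __name__ == "__main__":
    print(f"Suggested PSU: {recommended_psu_watts()} W")
```

With these placeholder values the peaks sum to 950W, which rounds up to a 1000W unit. Sizing against the transient peak rather than the 450W TDP is the design choice here, since brief spikes are what trip an undersized supply.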