How to Calculate Hardware Requirements for Running LLMs Locally
The complete guide to estimating VRAM, RAM, storage, and compute for self-hosting LLMs. Covers quantization, context length, KV cache, multi-GPU setups, and practical GPU recommendations for every budget.


Running LLMs locally is no longer a niche hobby. In 2026, open-weight models like Nemotron 3 Super, Qwen 3.5, Llama 4 Scout, and Kimi K2.5 rival proprietary APIs on most benchmarks. But the first question everyone asks is always the same: will it run on my hardware?
The answer comes down to arithmetic. This guide gives you the exact formulas, the tradeoffs behind each variable, and worked examples so you can estimate requirements for any model before you download a single byte.
The VRAM Formula
Every local LLM deployment boils down to three numbers that must fit inside your GPU's memory:
Total VRAM = Model Weights + KV Cache + Runtime Overhead
Model weights are the dominant cost (70-75% of total), the KV cache scales with context length (15-20%), and runtime overhead covers CUDA context, activations, and framework buffers (5-10%). The following sections break down each component.
Component 1: Model Weights
The base memory cost of any model is determined by a simple product:
Weight Memory (GB) = Parameters (billions) × Bytes per Parameter
At full FP16 precision, each parameter takes 2 bytes. A 70B model occupies 140 GB. A 7B model occupies 14 GB. This is the floor you're working with before any optimization.
| Precision | Bytes / Param | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | 4.0 | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | 2.0 | 14 GB | 26 GB | 140 GB |
| FP8 | 1.0 | 7 GB | 13 GB | 70 GB |
| INT4 | 0.5 | 3.5 GB | 6.5 GB | 35 GB |
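The multiplication is trivial to script. Below is a minimal Python sketch (the function name and precision labels are our own) that reproduces the table's values; real checkpoint files add a little overhead for embeddings and metadata, so treat the result as a floor:

```python
# Minimal weight-memory estimator from the formula above (idealized).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight Memory (GB) = Parameters (billions) x Bytes per Parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_memory_gb(70, "fp16"))  # 140.0 GB, matching the table
print(weight_memory_gb(7, "int4"))   # 3.5 GB
```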
Dense vs. MoE: A Critical Distinction
In 2026, most frontier open-weight models use Mixture-of-Experts (MoE) architectures, and their headline specs are misleading if you only look at "active parameters." The active count determines compute cost per token, but all weights must be loaded into VRAM:
| Model | Total Params | Active / Token | FP16 Weight Memory |
|---|---|---|---|
| Qwen 3.5-35B-A3B | 35B | 3B | 70 GB |
| Llama 4 Scout | 109B | 17B | 218 GB |
| Nemotron 3 Super | 120B | 12B | 240 GB |
| Qwen 3.5-122B-A10B | 122B | 10B | 244 GB |
| Kimi K2.5 | 1,000B | 32B | 2,000 GB |
A token enters. The router selects just 3 experts to compute it. But all 32 experts must sit in VRAM constantly, waiting to be called.
A model like Qwen 3.5-35B-A3B generates tokens at the speed of a 3B dense model, but requires VRAM for all 35B parameters. At Q4_K_M, that's ~22 GB — a 24 GB GPU can handle it, but you're paying for memory you're not computing with. FloE (2025) demonstrated 9.3x parameter compression for MoE models, but this is still experimental.
Component 1b: Quantization — The Key Lever
Quantization compresses model weights from high-precision formats (FP16, 2 bytes) to lower-precision ones (4-bit, ~0.5 bytes), directly reducing the weight memory from the formula above. This is the single most impactful optimization for fitting models on consumer hardware.
GGUF Quantization Levels
GGUF (used by llama.cpp and Ollama) is the most common format for local inference. The K-quant variants use two-level block quantization with double-quantized scales, delivering better quality per bit than older formats.
| Quant Level | Bytes / Param | 7B Size | 70B Size | Quality Retention |
|---|---|---|---|---|
| Q8_0 | ~1.06 | ~7.4 GB | ~75 GB | ~99.5% |
| Q6_K | ~0.81 | ~5.7 GB | ~57 GB | ~99% |
| Q5_K_M | ~0.69 | ~4.8 GB | ~48 GB | ~98% |
| Q4_K_M | ~0.55 | ~3.9 GB | ~38.5 GB | ~95% |
| Q3_K_M | ~0.43 | ~3.0 GB | ~30 GB | ~90% |
| Q2_K | ~0.31 | ~2.2 GB | ~22 GB | ~85% |
Q4_K_M is the sweet spot for most users. It retains ~95% of full-precision quality while cutting memory by nearly 4x. Below Q4, degradation becomes noticeable, particularly on reasoning and code tasks.
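The same arithmetic works per quant level. A small sketch (the bytes-per-param ratios mirror the table above; actual GGUF file sizes vary slightly by architecture):

```python
# Approximate GGUF size from effective bytes per parameter.
GGUF_BYTES_PER_PARAM = {
    "Q8_0": 1.06, "Q6_K": 0.81, "Q5_K_M": 0.69,
    "Q4_K_M": 0.55, "Q3_K_M": 0.43, "Q2_K": 0.31,
}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    return round(params_billions * GGUF_BYTES_PER_PARAM[quant], 1)

print(gguf_size_gb(70, "Q4_K_M"))  # 38.5 -> needs ~48 GB of VRAM, not 24
print(gguf_size_gb(70, "Q2_K"))    # 21.7 -> squeezes onto a 24 GB card
```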
GGUF vs. GPTQ vs. AWQ
Three quantization ecosystems compete in 2026, each with different tradeoffs. Recent comparisons show:
| Format | Runtime | Calibration | Best For |
|---|---|---|---|
| GGUF | llama.cpp, Ollama | None required | CPU/GPU hybrid, Apple Silicon, edge devices |
| GPTQ | vLLM, TGI, ExLlamaV2 | 512-2K samples | NVIDIA GPU production serving, ~20% faster tok/s |
| AWQ | vLLM, TGI | Small calibration set | Highest accuracy at 4-bit, instruction-tuned models |
If you're running locally on a single NVIDIA GPU or Apple Silicon, GGUF is the default choice. If you're serving to multiple users with vLLM, GPTQ or AWQ deliver better throughput.
Component 2: The KV Cache
LLMs generate text one token at a time. To produce token #100, the model needs to "attend" to all 99 previous tokens — it computes how relevant each past token is to the current prediction. This attention mechanism requires two vectors per past token per layer: a Key (what information the token offers) and a Value (the actual information to retrieve if relevant).
Without caching, the model would recompute these Key and Value vectors for all 99 previous tokens every time it generates a new one. The KV cache solves this by storing them in VRAM after the first computation. When generating token #100, only that single token's Key and Value need to be computed — the other 99 are read from cache. This turns an O(n²) recomputation problem into O(n) append-and-read.
Each new token adds one K and one V vector per layer. At 32 layers and 128K tokens, that's over 8 million cached vectors.
The tradeoff is memory. The cache grows linearly with context length and becomes the dominant memory consumer for long-context workloads.
KV Cache (GB) = 2 × Layers × KV Heads × Head Dim × Seq Length × Batch Size × Precision / 10⁹
The "2" accounts for both keys and values. Each transformer layer maintains its own KV pair. KV Heads is the number of key-value heads after any grouping (GQA/MQA), Head Dim is typically 128 for modern models, and Precision is the byte width of the cached values (typically 2 for FP16).
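Translated into code, the formula is a direct one-liner (the example numbers are Llama 3.1 8B's configuration, used as the reference below):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> float:
    """KV Cache (GB) = 2 x Layers x KV Heads x Head Dim x Seq x Batch x Precision / 1e9."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128, FP16 cache
print(kv_cache_gb(32, 8, 128, 8_192))    # ~1.07 GB at 8K context
print(kv_cache_gb(32, 8, 128, 131_072))  # ~17.2 GB at 128K context
```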
Grouped-Query Attention Reduces KV Cost
Modern models use Grouped-Query Attention (GQA) to share KV heads across multiple query heads. Llama 3.1 8B, for example, has 32 query heads but only 8 KV heads, reducing KV cache by 4x compared to standard multi-head attention. This is already baked into the formula above when you use the actual KV head count.
KV Cache by Context Length
Using Llama 3.1 8B as a reference (32 layers, 8 KV heads, 128 head dim, FP16, batch size 1):
| Context Length | KV Cache | % of Total (Q4 Weights) |
|---|---|---|
| 4K tokens | 0.5 GB | 11% |
| 8K tokens | 1.1 GB | 20% |
| 32K tokens | 4.3 GB | 49% |
| 128K tokens | 17.2 GB | 80% |
Llama 3.1 8B at Q4_K_M — same model, same quantization, dramatically different memory profiles.
At 128K context, the KV cache consumes nearly 4× the memory of the model weights themselves. This is why a model that fits comfortably in VRAM at 8K context can OOM at longer contexts. The cache also scales linearly with batch size: serving 4 concurrent users at 32K context would consume ~17 GB for KV cache alone.
KV Cache Optimization Techniques
Several techniques reduce KV cache pressure without changing the model:
- PagedAttention (vLLM): allocates KV cache in non-contiguous blocks on demand, reducing waste from 60-80% to near zero and enabling 2-4x more concurrent requests
- KV cache quantization: compressing cached keys/values to INT4 or INT8 cuts memory by 2-4x with minimal quality loss. Google's TurboQuant (March 2026) achieves 6x reduction with 3-bit storage
- FlashAttention: doesn't reduce KV cache size, but avoids materializing the full N×N attention matrix in HBM, reducing peak memory by ~33x during the attention computation itself
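To make PagedAttention's on-demand allocation concrete, here is a deliberately simplified toy allocator. It is illustrative only, not vLLM's implementation, and the class and method names are invented for this sketch:

```python
BLOCK_TOKENS = 16  # vLLM's typical block size

class PagedKVAllocator:
    """Toy model of paged KV allocation: blocks are claimed on demand
    instead of reserving max_seq_len worth of cache per request."""
    def __init__(self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))
        self.block_tables = {}   # request id -> physical block ids
        self.token_counts = {}   # request id -> tokens cached so far

    def append_token(self, req_id: str) -> None:
        n = self.token_counts.get(req_id, 0)
        if n % BLOCK_TOKENS == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; must preempt a request")
            self.block_tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.token_counts[req_id] = n + 1

    def release(self, req_id: str) -> None:
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.token_counts.pop(req_id, None)

alloc = PagedKVAllocator(total_blocks=8)
for _ in range(40):  # 40 generated tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("req-A")
print(len(alloc.block_tables["req-A"]))  # 3
```

The point of the sketch: memory is consumed in proportion to tokens actually generated, and freed blocks are immediately reusable by other requests.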
Component 3: Runtime Overhead
Beyond weights and KV cache, the inference runtime consumes additional VRAM:
- CUDA context: ~300-500 MB. Allocated when the GPU is initialized, before any model is loaded. This is fixed and unavoidable on NVIDIA GPUs.
- Activation memory: temporary tensors computed during each forward pass. For inference (not training), this is small — typically 200-500 MB — because only one token is processed at a time during generation.
- Framework buffers: PyTorch, vLLM, and llama.cpp each reserve scratch space for internal operations. Ranges from ~200 MB (llama.cpp) to ~1 GB (vLLM with large batch sizes).
A conservative rule of thumb: reserve 1-1.5 GB beyond model weights and KV cache. In practice, this means keeping 5-10% of your GPU's VRAM headroom. vLLM's default --gpu-memory-utilization 0.9 flag reflects this.
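Putting the three components together, the full estimator is only a few lines. This is our own helper, using the 1-1.5 GB overhead rule of thumb as a default:

```python
def total_vram_gb(params_b: float, bytes_per_param: float,
                  layers: int, kv_heads: int, head_dim: int,
                  context: int, batch: int = 1, kv_bytes: int = 2,
                  overhead_gb: float = 1.25) -> float:
    """Total VRAM = Model Weights + KV Cache + Runtime Overhead."""
    weights = params_b * bytes_per_param
    kv = 2 * layers * kv_heads * head_dim * context * batch * kv_bytes / 1e9
    return weights + kv + overhead_gb

# Llama 3.1 8B at Q4_K_M (0.55 bytes/param), 8K context
need = total_vram_gb(8, 0.55, 32, 8, 128, 8_192)
print(f"{need:.1f} GB needed")  # 6.7 GB needed -> comfortable fit on 16 GB
```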
GPU Recommendations (2026)
For LLM inference, VRAM capacity is the hard ceiling and memory bandwidth determines token generation speed. Recent research confirms that LLM decoding remains memory-bandwidth-bound even at large batch sizes, with most GPU compute capacity underutilized. A GPU with 50% more bandwidth generates tokens ~50% faster.
| GPU | VRAM | Bandwidth | Max Model (Q4) | Price (USD) |
|---|---|---|---|---|
| RTX 3060 | 12 GB | 360 GB/s | ~14B | ~$200 used |
| RTX 4060 Ti | 16 GB | 288 GB/s | ~22B | ~$350 |
| RTX 3090 | 24 GB | 936 GB/s | ~35B | ~$700 used |
| RTX 4090 | 24 GB | 1,008 GB/s | ~35B | ~$1,600 |
| RTX 5080 | 16 GB | 960 GB/s | ~22B | $999 |
| RTX 5090 | 32 GB | 1,792 GB/s | ~50B | $1,999 |
| A100 | 80 GB | 2,039 GB/s | ~120B | ~$8,000 used |
| H100 | 80 GB | 3,350 GB/s | ~120B | ~$25,000 |
Benchmark Speeds (Q4_K_M Quantization)
Approximate tokens/second for single-user inference using llama.cpp or Ollama, based on recent benchmarks:
| GPU | 8B Model | 32B Model | 70B Model |
|---|---|---|---|
| RTX 3090 | ~48 tok/s | ~15 tok/s | ~9 tok/s (offloaded) |
| RTX 4090 | ~127 tok/s | ~30 tok/s | ~12 tok/s (offloaded) |
| RTX 5080 | ~132 tok/s | ~20 tok/s (offloaded) | ~12 tok/s (offloaded) |
| RTX 5090 | ~213 tok/s | ~78 tok/s | ~35 tok/s |
The RTX 5090's 32 GB is the inflection point: it fits models like Qwen 3.5-35B-A3B (Q4, ~22 GB) entirely in VRAM with room for KV cache, eliminating the offloading penalty that hobbles the 5080 and 4090. For Llama 4 Scout (109B MoE, ~60 GB at Q4), you need 2-3 GPUs or accept CPU offloading at ~5-15x speed penalty.
Apple Silicon: The Unified Memory Advantage
Apple Silicon takes a fundamentally different approach. Instead of discrete VRAM, the CPU and GPU share a unified memory pool. A MacBook Pro with an M4 Max chip has up to 128 GB of unified memory accessible at 546 GB/s bandwidth, and the entire pool is available for model weights.
This means a single laptop can load Llama 4 Scout (109B MoE, ~60 GB at Q4) into memory without any multi-GPU configuration. Using the MLX framework, the M4 Max achieves ~18-20 tok/s on 70B-class dense models at Q4 quantization — slower than a desktop RTX 5090 (35 tok/s) but faster than any offloaded configuration. MoE models run even faster on Apple Silicon because only active parameters hit the compute units while the full model sits comfortably in unified memory.
| Chip | Max Memory | Bandwidth | 70B Q4 Speed | Price |
|---|---|---|---|---|
| M4 Max | 128 GB | 546 GB/s | ~20 tok/s | ~$4,000 |
| M3 Ultra | 192 GB | 800 GB/s | ~28 tok/s | ~$5,500 |
| M5 Max | 128 GB | 614 GB/s | ~22 tok/s | ~$4,200 |
The tradeoff is clear: Apple Silicon offers the simplest path to running large models locally — no multi-GPU configuration, no driver issues, no PCIe bottlenecks. But for raw speed at the same model size, two NVIDIA GPUs with NVLink or even a single RTX 5090 will outperform it on models that fit.
Multi-GPU Setups
When a model won't fit on a single GPU, you split it across multiple cards. There are two strategies, and choosing the right one depends on whether you're scaling within a single machine or across nodes.
Tensor Parallelism (TP)
Tensor parallelism shards individual weight matrices across GPUs. Each GPU holds a slice of every layer and they synchronize via AllReduce after each operation. This requires fast interconnect: NVLink (600-900 GB/s) works well, PCIe (32-64 GB/s) creates a bottleneck.
Best for: 2-8 GPUs within a single node. Set --tensor-parallel-size N in vLLM. Each GPU needs roughly total_model_size / N VRAM plus its share of KV cache.
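That sizing rule can be sketched as a rough back-of-envelope helper (not vLLM's actual allocator; interconnect and fragmentation overheads are ignored):

```python
def tp_per_gpu_gb(model_gb: float, kv_gb: float, tp: int,
                  overhead_gb: float = 1.0) -> float:
    """Per-GPU memory under tensor parallelism: weights and KV cache are
    sharded across ranks, but runtime overhead is paid on every GPU."""
    return model_gb / tp + kv_gb / tp + overhead_gb

# e.g. a ~60 GB quantized model with ~2.5 GB of KV cache across 3 GPUs
print(round(tp_per_gpu_gb(59.9, 2.5, tp=3), 1))  # 21.8 -> fits a 24 GB card
```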
Pipeline Parallelism (PP)
Pipeline parallelism assigns complete layers to different GPUs. Activations flow sequentially through the pipeline. Communication is point-to-point (not AllReduce), so it works over slower interconnects like InfiniBand.
Best for: multi-node setups or GPUs connected via PCIe. However, PP introduces pipeline bubbles (idle time) for single-request serving. It shines with batched inference.
How a model with 8 transformer layers gets distributed across 2 GPUs using each strategy.
Consumer Multi-GPU: The PCIe Reality
Most consumer motherboards connect GPUs via PCIe, not NVLink. With llama.cpp, this is handled transparently — the model is split across GPUs by layer count using --n-gpu-layers. For two RTX 5090s (32 GB each = 64 GB total), you can fit Llama 4 Scout at Q4_K_M (~60 GB weights + ~2.5 GB KV cache at 8K context). Three RTX 4090s (72 GB total) also work.
The speed penalty from PCIe versus NVLink is real but manageable for inference: expect ~10-20% overhead compared to an equivalent single-GPU setup. For training, the penalty would be prohibitive, but inference involves far less communication.
Worked Examples
Three step-by-step calculations for common scenarios. Each uses the formula: Total = Weights + KV Cache + Overhead.
Example 1: Llama 3.1 8B on an RTX 4060 Ti 16 GB
Model: 8B parameters, 32 layers, 8 KV heads, 128 head dim. Target: Q4_K_M, 8K context, batch size 1.
- Weights: 8B × 0.55 bytes/param = 4.4 GB
- KV Cache: 2 × 32 × 8 × 128 × 8,192 × 1 × 2 bytes = 1.07 GB ≈ 1.0 GB
- Overhead: ~1.0 GB
- Total: 4.4 + 1.0 + 1.0 = 6.4 GB
Fits comfortably on 16 GB with 9.6 GB to spare. You could increase context to 32K (~4.3 GB KV cache) and still have headroom, or even run the model at Q6_K (~6.5 GB weights) for better quality.
Example 2: Qwen 3.5-35B-A3B (MoE) on an RTX 5090
Model: 35B total parameters (3B active), 40 layers, hybrid attention (25% full attention with 2 KV heads and 256 head dim; 75% linear attention with minimal KV state). Target: Q4_K_M, 16K context, batch size 1.
This is an MoE model — all 35B weights must be loaded despite only 3B being active per token. The hybrid Gated DeltaNet architecture reduces KV cache significantly: only 10 of 40 layers use full attention with traditional KV caching.
- Weights: 35B × 0.55 = 19.3 GB
- KV Cache: 2 × 10 × 2 × 256 × 16,384 × 1 × 2 bytes = 0.33 GB ≈ 0.3 GB (only full-attention layers cache KV)
- Overhead: ~1.2 GB
- Total: 19.3 + 0.3 + 1.2 = 20.8 GB
Fits on the RTX 5090's 32 GB with 11 GB to spare. The hybrid attention architecture is the key insight here: because only 25% of layers use full attention, the KV cache stays small even at 262K context (~5 GB), making this model uniquely long-context friendly on consumer hardware.
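As a sanity check, the arithmetic above can be reproduced in a few lines (architecture numbers as assumed in this example):

```python
# Example 2 re-checked: Qwen 3.5-35B-A3B at Q4_K_M, 16K context
weights = 35 * 0.55                           # all 35B MoE params in VRAM
kv = 2 * 10 * 2 * 256 * 16_384 * 1 * 2 / 1e9  # only 10 full-attention layers cache KV
total = weights + kv + 1.2                    # plus runtime overhead
print(round(total, 1))  # 20.8 -> fits a 32 GB RTX 5090
```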
Example 3: Llama 4 Scout (109B MoE) on 3× RTX 4090
Model: 109B total parameters (17B active), 16 experts, MoE architecture. Target: Q4_K_M, 8K context, batch size 1.
This is where MoE memory math bites. Despite only 17B active parameters, all 109B weights need VRAM.
- Weights: 109B × 0.55 = 59.9 GB (~20 GB per GPU across 3 cards)
- KV Cache: ~2.5 GB total at 8K context (~0.8 GB per GPU)
- Overhead: ~1.0 GB per GPU
- Per GPU: 20.0 + 0.8 + 1.0 = 21.8 GB
Fits on 3× RTX 4090 (24 GB each = 72 GB total) with ~2.2 GB per GPU to spare. On 2× RTX 4090 (48 GB total), you'd need ~62.4 GB — over the limit. The RTX 5090's 32 GB makes 2 cards viable (64 GB total). Despite the large memory footprint, token generation is fast because only 17B parameters are active per forward pass.
Inference Software: The Four Engines
The inference engine you choose directly affects throughput, latency, memory efficiency, and which hardware you can use. In 2026, four engines dominate, each built around a fundamentally different architecture. HuggingFace TGI entered maintenance mode in late 2025, leaving these as the primary production choices.
| Engine | KV Cache Strategy | Batching | Quantization | Hardware |
|---|---|---|---|---|
| llama.cpp | Static allocation | Simple | GGUF (Q2-Q8, K-quants) | CPU, CUDA, Metal, Vulkan, ROCm |
| vLLM | PagedAttention | Continuous | GPTQ, AWQ, FP8 | CUDA, ROCm, TPU, Intel |
| SGLang | RadixAttention | Continuous | GPTQ, AWQ, FP8, GGUF | CUDA, ROCm, TPU |
| TensorRT-LLM | Paged KV + fusion | In-flight | FP8, FP4, NVFP4 | CUDA only (Hopper/Blackwell) |
llama.cpp — The Portable Engine
llama.cpp is a C++ inference engine with zero dependencies. It runs on anything: NVIDIA, AMD, Intel Arc, Apple Silicon, and pure CPU. The GGUF format supports K-quant quantization (Q2_K through Q8_0) without calibration data, and the engine handles CPU/GPU hybrid inference transparently via --n-gpu-layers.
Recent 2026 optimizations include Metal Tensor API support (+26% geomean improvement on Apple Silicon), Vulkan shared-memory kernels (2.5x speedup on Intel Arc), and MCP client support for tool calling.
Throughput: ~127 tok/s single-user on RTX 4090 (8B Q4), consistent with the benchmark table above. Excels at single-user latency but throughput drops at high concurrency due to simple batching. Ollama wraps llama.cpp in a Go server (ollama run qwen3.5:9b) with model management and a REST API: same performance, easier setup.
vLLM — The Production Workhorse
vLLM is a PyTorch-based engine built around PagedAttention: KV cache is divided into fixed-size blocks (typically 16 tokens) allocated on demand in non-contiguous GPU memory, eliminating the 60-80% waste from traditional pre-allocation. Combined with continuous batching (requests enter and exit the batch dynamically each iteration), vLLM achieves 2-4x throughput over naive PyTorch serving.
The architecture has five components: an Engine Core coordinator, a Scheduler managing waiting/running queues, KV Cache Managers handling block-level allocation, Workers executing GPU inference, and Model Runners preparing inputs. Tensor parallelism is native (--tensor-parallel-size 4). Hardware support spans NVIDIA, AMD ROCm (first-class since v0.17.0), Intel Gaudi, and TPUs.
Throughput: ~12,500 tok/s on Llama 3.1 8B (H100). The broadest model compatibility and largest community of any production engine.
SGLang — The Latency Leader
SGLang introduces RadixAttention: a token-level radix tree that automatically detects and reuses common prefixes across requests. When a new request arrives, the RadixCache finds the longest cached prefix and returns device-mapped KV indices. Unlike PagedAttention (which pages within a single request), RadixAttention shares KV cache across requests, making it significantly faster for workloads with repeated prefixes: multi-turn chat, RAG with shared system prompts, and few-shot learning.
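A stripped-down sketch of the prefix-reuse idea follows. SGLang actually uses a radix tree over token sequences; this linear scan, with invented names, only demonstrates the matching concept:

```python
class PrefixCache:
    """Toy prefix matcher: a new request reuses cached KV for the longest
    token prefix that an earlier request already computed."""
    def __init__(self):
        self.cached = []  # token sequences whose KV is resident on the GPU

    def insert(self, tokens):
        self.cached.append(tuple(tokens))

    def match(self, tokens):
        """Number of leading tokens whose KV can be reused."""
        best = 0
        for seq in self.cached:
            n = 0
            for a, b in zip(seq, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])        # e.g. a shared system prompt
print(cache.match([1, 2, 3, 9, 9]))  # 3 -> only 2 new tokens need prefill
```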
SGLang also leads on structured output: native regex, JSON schema, and grammar constraints are compiled into an FSM (finite state machine) that runs with minimal overhead during decoding. For agentic workflows that need tool calls as JSON, this eliminates retry loops.
Throughput: ~16,200 tok/s on Llama 3.1 8B (H100) — 29% faster than vLLM. Under 16 concurrent requests, SGLang completes 4.6x faster than vLLM. TTFT p95 is consistently 5-8% lower at all concurrency levels.
TensorRT-LLM — Maximum NVIDIA Performance
TensorRT-LLM trades portability for raw speed. It compiles models into optimized CUDA graphs with kernel fusion (LayerNorm + MatMul + bias + activation in a single kernel), hand-tuned Flash Attention variants, and in-flight batching that mixes prefill and decode phases within the same batch.
The quantization support is NVIDIA-native: FP8 on Hopper, NVFP4 on Blackwell (with automatic backend selection between TRT-LLM and FlashInfer kernels based on profiled performance). On H100 with FP8, TensorRT-LLM delivers 10,000+ output tok/s with sub-100ms TTFT — the highest absolute performance of any engine, but it only runs on NVIDIA datacenter GPUs and requires a compilation step before serving.
When to Use Each
| Scenario | Best Choice | Why |
|---|---|---|
| Personal laptop / Apple Silicon | llama.cpp (or Ollama) | Metal backend, CPU/GPU hybrid, GGUF flexibility |
| Single NVIDIA GPU, dev use | Ollama or SGLang | Ollama for simplicity, SGLang for structured output |
| Multi-GPU production API | vLLM or SGLang | Broadest hardware support (vLLM) or best latency (SGLang) |
| Multi-turn / RAG with shared prompts | SGLang | RadixAttention prefix sharing across requests |
| Maximum throughput, NVIDIA datacenter | TensorRT-LLM | Kernel fusion, FP8/FP4, in-flight batching |
| AMD / Intel GPU | vLLM or llama.cpp | First-class ROCm in vLLM; Vulkan in llama.cpp |
For Apple Silicon specifically, MLX is purpose-built for the unified memory architecture and outperforms llama.cpp on M-series chips for models that fit entirely in memory.
The hardware equation for local LLM inference is ultimately straightforward: weights plus KV cache plus overhead must fit in your available memory. Quantization is the primary lever, context length is the hidden variable, and memory bandwidth determines speed. Run the numbers for your specific model and use case before making any hardware decisions — the formulas above will get you within 10% of actual usage.
Check our model directory for detailed specs on any model, or use the LLM leaderboard to compare performance before deciding what to run locally.