
How to Calculate Hardware Requirements for Running LLMs Locally

The complete guide to estimating VRAM, RAM, storage, and compute for self-hosting LLMs. Covers quantization, context length, KV cache, multi-GPU setups, and practical GPU recommendations for every budget.

Jonathan Chavez · 18 min read

Running LLMs locally is no longer a niche hobby. In 2026, open-weight models like Nemotron 3 Super, Qwen 3.5, Llama 4 Scout, and Kimi K2.5 rival proprietary APIs on most benchmarks. But the first question everyone asks is always the same: will it run on my hardware?

The answer comes down to arithmetic. This guide gives you the exact formulas, the tradeoffs behind each variable, and worked examples so you can estimate requirements for any model before you download a single byte.


The VRAM Formula

Every local LLM deployment boils down to three numbers that must fit inside your GPU's memory:

Total VRAM = Model Weights + KV Cache + Runtime Overhead

Interactive VRAM Estimator

See how parameters, precision, and context length dramatically shift the hardware requirements for local inference.

[Interactive widget — example reading: ~5.0 GB weights + ~4.0 GB KV cache + ~1.2 GB overhead ≈ 10 GB total VRAM required.]

Model weights are the dominant cost (70-75% of total), the KV cache scales with context length (15-20%), and runtime overhead covers CUDA context, activations, and framework buffers (5-10%). The following sections break down each component.
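The arithmetic above is simple enough to script. A minimal sketch — the function name and example inputs are illustrative, not from any library:

```python
def estimate_total_vram_gb(weights_gb: float, kv_cache_gb: float,
                           overhead_gb: float = 1.2) -> float:
    """Total VRAM = Model Weights + KV Cache + Runtime Overhead."""
    return weights_gb + kv_cache_gb + overhead_gb

# An 8B model at Q4_K_M (~4.4 GB weights) with an 8K context (~1.0 GB KV cache)
print(f"{estimate_total_vram_gb(4.4, 1.0, overhead_gb=1.0):.1f} GB")  # 6.4 GB
```

The sections below show where each of the three inputs comes from.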


Component 1: Model Weights

The base memory cost of any model is determined by a simple product:

Weight Memory (GB) = Parameters (billions) × Bytes per Parameter

At full FP16 precision, each parameter takes 2 bytes. A 70B model occupies 140 GB. A 7B model occupies 14 GB. This is the floor you're working with before any optimization.

| Precision | Bytes / Param | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | 4.0 | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | 2.0 | 14 GB | 26 GB | 140 GB |
| FP8 | 1.0 | 7 GB | 13 GB | 70 GB |
| INT4 | 0.5 | 3.5 GB | 6.5 GB | 35 GB |

Dense vs. MoE: A Critical Distinction

In 2026, most frontier open-weight models use Mixture-of-Experts (MoE) architectures. Their headline "active parameter" counts are easy to misread: the active count determines compute cost per token, but all weights must be loaded into VRAM:

| Model | Total Params | Active / Token | FP16 Weight Memory |
|---|---|---|---|
| Qwen 3.5-35B-A3B | 35B | 3B | 70 GB |
| Llama 4 Scout | 109B | 17B | 218 GB |
| Nemotron 3 Super | 120B | 12B | 240 GB |
| Qwen 3.5-122B-A10B | 122B | 10B | 244 GB |
| Kimi K2.5 | 1,000B | 32B | 2,000 GB |

The MoE Memory Paradox

A token enters. The router selects just 3 experts to compute it. But all 32 experts must sit in VRAM constantly, waiting to be called.

[Diagram: the router and all experts sit inside the VRAM boundary — inactive experts consume VRAM; active experts consume VRAM and compute.]

A model like Qwen 3.5-35B-A3B generates tokens at the speed of a 3B dense model, but requires VRAM for all 35B parameters. At Q4_K_M, that's ~22 GB — a 24 GB GPU can handle it, but you're paying for memory you're not computing with. FloE (2025) demonstrated 9.3x parameter compression for MoE models, but this is still experimental.
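The total-versus-active split can be made concrete with a small helper. A hedged sketch using ~0.55 bytes/param for Q4_K_M (real GGUF files run somewhat larger); moe_footprint is an illustrative name, not a real API:

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bytes_per_param: float = 0.55) -> dict:
    """VRAM is set by TOTAL parameters; per-token compute by ACTIVE ones."""
    return {
        "weight_vram_gb": total_params_b * bytes_per_param,   # every expert resident
        "active_fraction": active_params_b / total_params_b,  # share doing work per token
    }

# Qwen 3.5-35B-A3B: pays VRAM for all 35B, computes like a ~3B dense model
stats = moe_footprint(35, 3)
print(round(stats["weight_vram_gb"], 2), round(stats["active_fraction"], 3))  # 19.25 0.086
```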


Component 1b: Quantization — The Key Lever

Quantization compresses model weights from high-precision formats (FP16, 2 bytes) to lower-precision ones (4-bit, ~0.5 bytes), directly reducing the weight memory from the formula above. This is the single most impactful optimization for fitting models on consumer hardware.

GGUF Quantization Levels

GGUF (used by llama.cpp and Ollama) is the most common format for local inference. The K-quant variants use two-level block quantization with double-quantized scales, delivering better quality per bit than older formats.

| Quant Level | Bytes / Param | 7B Size | 70B Size | Quality Retention |
|---|---|---|---|---|
| Q8_0 | ~1.06 | 8.5 GB | ~75 GB | ~99.5% |
| Q6_K | ~0.81 | 6.3 GB | ~57 GB | ~99% |
| Q5_K_M | ~0.69 | 5.5 GB | ~48 GB | ~98% |
| Q4_K_M | ~0.55 | 4.7 GB | ~38.5 GB | ~95% |
| Q3_K_M | ~0.43 | 3.9 GB | ~30 GB | ~90% |
| Q2_K | ~0.31 | 3.0 GB | ~22 GB | ~85% |

Q4_K_M is the sweet spot for most users. It retains ~95% of full-precision quality while cutting memory by nearly 4x. (Small-model file sizes run above a raw parameters × bytes product because K-quant mixes keep some tensors at higher precision, and that overhead weighs proportionally more at 7B than at 70B.) Below Q4, degradation becomes noticeable, particularly on reasoning and code tasks.
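Given a table like the one above, picking a quant level becomes a lookup. A minimal sketch assuming the approximate bytes/param values listed here and a fixed reserve for KV cache plus overhead; best_quant_for is a hypothetical helper:

```python
from typing import Optional

QUANT_BYTES = {  # approximate bytes/param, highest quality first
    "Q8_0": 1.06, "Q6_K": 0.81, "Q5_K_M": 0.69,
    "Q4_K_M": 0.55, "Q3_K_M": 0.43, "Q2_K": 0.31,
}

def best_quant_for(params_b: float, vram_gb: float,
                   reserve_gb: float = 2.0) -> Optional[str]:
    """Highest-quality quant whose weights still leave `reserve_gb`
    free for KV cache and runtime overhead."""
    for level, bpp in QUANT_BYTES.items():  # dicts preserve insertion order
        if params_b * bpp + reserve_gb <= vram_gb:
            return level
    return None

print(best_quant_for(70, 48))  # Q4_K_M — a 70B model on 48 GB of VRAM
print(best_quant_for(8, 16))   # Q8_0 — an 8B model fits at full 8-bit on 16 GB
```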

GGUF vs. GPTQ vs. AWQ

Three quantization ecosystems compete in 2026, each with different tradeoffs. Recent comparisons show:

| Format | Runtime | Calibration | Best For |
|---|---|---|---|
| GGUF | llama.cpp, Ollama | None required | CPU/GPU hybrid, Apple Silicon, edge devices |
| GPTQ | vLLM, TGI, ExLlamaV2 | 512-2K samples | NVIDIA GPU production serving, ~20% faster tok/s |
| AWQ | vLLM, TGI | Small calibration set | Highest accuracy at 4-bit, instruction-tuned models |

If you're running locally on a single NVIDIA GPU or Apple Silicon, GGUF is the default choice. If you're serving to multiple users with vLLM, GPTQ or AWQ deliver better throughput.


Component 2: The KV Cache

LLMs generate text one token at a time. To produce token #100, the model needs to "attend" to all 99 previous tokens — it computes how relevant each past token is to the current prediction. This attention mechanism requires two vectors per past token per layer: a Key (what information the token offers) and a Value (the actual information to retrieve if relevant).

Without caching, the model would recompute these Key and Value vectors for all 99 previous tokens every time it generates a new one. The KV cache solves this by storing them in VRAM after the first computation. When generating token #100, only that single token's Key and Value need to be computed — the other 99 are read from cache. This turns an O(n²) recomputation problem into O(n) append-and-read.
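The mechanism can be sketched as a toy data structure. Real caches hold GPU tensors; this illustrative class only shows the append-and-read pattern:

```python
class ToyKVCache:
    """Append-only cache: one (K, V) entry per generated token, per layer."""
    def __init__(self, num_layers: int):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer: int, k, v):
        # K and V are computed once, when the token is generated...
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def context(self, layer: int):
        # ...and every later step just reads the history back.
        return self.keys[layer], self.values[layer]

cache = ToyKVCache(num_layers=2)
for step in range(3):                        # generate 3 tokens
    cache.append(0, f"k{step}", f"v{step}")
ks, vs = cache.context(0)
print(len(ks))  # 3 — no recomputation of past tokens' K/V
```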

Why the KV Cache Exists

Each new token needs to attend to every previous token. Without caching, that means recomputing keys and values for the entire history on every step.

[Diagram: as tokens are generated ("The cat sat on the mat"), each token's K and V vectors are cached per layer — repeated across all 32 layers.]

Each new token adds one K and one V vector per layer. At 32 layers and 128K tokens, that's over 8 million cached vectors.

The tradeoff is memory. The cache grows linearly with context length and becomes the dominant memory consumer for long-context workloads.

KV Cache (GB) = 2 × Layers × KV Heads × Head Dim × Seq Length × Batch Size × Precision / 10⁹

The "2" accounts for both keys and values. Each transformer layer maintains its own KV pair. KV Heads is the number of key-value heads after any grouping (GQA/MQA), Head Dim is typically 128 for modern models, and Precision is the byte width of the cached values (typically 2 for FP16).
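The formula translates directly to code. A sketch with Llama 3.1 8B's published dimensions plugged in:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> float:
    """2 (K and V) x layers x KV heads x head dim x tokens x batch x byte width."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA: 32 query heads share them), dim 128
print(round(kv_cache_gb(32, 8, 128, 8192), 2))    # 1.07 GB at 8K context, FP16
print(round(kv_cache_gb(32, 8, 128, 131072), 2))  # ~17 GB at 128K
```

With the decimal 10⁹ divisor, 128K lands at ~17.2 GB; the 16 GB figure in the table below is the same quantity expressed in binary GiB.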

Grouped-Query Attention Reduces KV Cost

Modern models use Grouped-Query Attention (GQA) to share KV heads across multiple query heads. Llama 3.1 8B, for example, has 32 query heads but only 8 KV heads, reducing KV cache by 4x compared to standard multi-head attention. This is already baked into the formula above when you use the actual KV head count.

KV Cache by Context Length

Using Llama 3.1 8B as a reference (32 layers, 8 KV heads, 128 head dim, FP16, batch size 1):

| Context Length | KV Cache | % of Total (with Q4 Weights) |
|---|---|---|
| 4K tokens | 0.5 GB | 10% |
| 8K tokens | 1.0 GB | 18% |
| 32K tokens | 4.0 GB | 46% |
| 128K tokens | 16.0 GB | 77% |

How Context Length Shifts the Memory Balance

Llama 3.1 8B at Q4_K_M — same model, same quantization, dramatically different memory profiles.

| Context | Total VRAM | Breakdown |
|---|---|---|
| 4K | 6.4 GB | 4.7 GB weights + 0.5 GB KV cache + 1.2 GB overhead |
| 8K | 6.9 GB | 4.7 GB weights + 1.0 GB KV cache + 1.2 GB overhead |
| 32K | 9.9 GB | 4.7 GB weights + 4.0 GB KV cache + 1.2 GB overhead |
| 128K | 21.9 GB | 4.7 GB weights + 16.0 GB KV cache + 1.2 GB overhead |

At 128K context, the KV cache consumes 3.4× more memory than the model weights themselves.

At 128K context, the KV cache dwarfs the model weights. This is why a model that fits comfortably in VRAM at 8K context can OOM at longer contexts. The cache also scales linearly with batch size: serving 4 concurrent users at 32K context would consume 16 GB for KV cache alone.

KV Cache Optimization Techniques

Several techniques reduce KV cache pressure without changing the model:

  • PagedAttention (vLLM): allocates KV cache in non-contiguous blocks on demand, reducing waste from 60-80% to near zero and enabling 2-4x more concurrent requests
  • KV cache quantization: compressing cached keys/values to INT4 or INT8 cuts memory by 2-4x with minimal quality loss. Google's TurboQuant (March 2026) achieves 6x reduction with 3-bit storage
  • FlashAttention: doesn't reduce KV cache size, but avoids materializing the full N×N attention matrix in HBM, reducing peak memory by ~33x during the attention computation itself

Component 3: Runtime Overhead

Beyond weights and KV cache, the inference runtime consumes additional VRAM:

  • CUDA context: ~300-500 MB. Allocated when the GPU is initialized, before any model is loaded. This is fixed and unavoidable on NVIDIA GPUs.
  • Activation memory: temporary tensors computed during each forward pass. For inference (not training), this is small — typically 200-500 MB — because only one token is processed at a time during generation.
  • Framework buffers: PyTorch, vLLM, and llama.cpp each reserve scratch space for internal operations. Ranges from ~200 MB (llama.cpp) to ~1 GB (vLLM with large batch sizes).

A conservative rule of thumb: reserve 1-1.5 GB beyond model weights and KV cache. In practice, this means keeping 5-10% of your GPU's VRAM headroom. vLLM's default --gpu-memory-utilization 0.9 flag reflects this.
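That headroom rule is easy to encode. A hedged sketch mirroring the spirit of vLLM's default (the function itself is hypothetical):

```python
def fits(total_needed_gb: float, gpu_vram_gb: float,
         utilization: float = 0.9) -> bool:
    """Treat only `utilization` of physical VRAM as usable, keeping the
    remainder as headroom (cf. vLLM's --gpu-memory-utilization default)."""
    return total_needed_gb <= gpu_vram_gb * utilization

# A 20.8 GB workload on a 24 GB card: 21.6 GB usable, so it fits.
print(fits(20.8, 24))  # True
```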


GPU Recommendations (2026)

For LLM inference, VRAM capacity is the hard ceiling and memory bandwidth determines token generation speed. Recent research confirms that LLM decoding remains memory-bandwidth-bound even at large batch sizes, with most GPU compute capacity underutilized. A GPU with 50% more bandwidth generates tokens ~50% faster.

| GPU | VRAM | Bandwidth | Max Model (Q4) | Price (USD) |
|---|---|---|---|---|
| RTX 3060 | 12 GB | 360 GB/s | ~14B | ~$200 used |
| RTX 4060 Ti | 16 GB | 288 GB/s | ~22B | ~$350 |
| RTX 3090 | 24 GB | 936 GB/s | ~35B | ~$700 used |
| RTX 4090 | 24 GB | 1,008 GB/s | ~35B | ~$1,600 |
| RTX 5080 | 16 GB | 960 GB/s | ~22B | $999 |
| RTX 5090 | 32 GB | 1,792 GB/s | ~50B | $1,999 |
| A100 | 80 GB | 2,039 GB/s | ~120B | ~$8,000 used |
| H100 | 80 GB | 3,350 GB/s | ~120B | ~$25,000 |

Benchmark Speeds (Q4_K_M Quantization)

Approximate tokens/second for single-user inference using llama.cpp or Ollama, based on recent benchmarks:

| GPU | 8B Model | 32B Model | 70B Model |
|---|---|---|---|
| RTX 3090 | ~120 tok/s | ~15 tok/s | ~9 tok/s (offloaded) |
| RTX 4090 | ~127 tok/s | ~30 tok/s | ~12 tok/s (offloaded) |
| RTX 5080 | ~132 tok/s | ~20 tok/s (offloaded) | ~12 tok/s (offloaded) |
| RTX 5090 | ~213 tok/s | ~78 tok/s | ~35 tok/s |

The RTX 5090's 32 GB is the inflection point: it fits models like Qwen 3.5-35B-A3B (Q4, ~22 GB) entirely in VRAM with room for KV cache, eliminating the offloading penalty that hobbles the 5080 and 4090. For Llama 4 Scout (109B MoE, ~60 GB at Q4), you need 2-3 GPUs or accept CPU offloading at ~5-15x speed penalty.


Apple Silicon: The Unified Memory Advantage

Apple Silicon takes a fundamentally different approach. Instead of discrete VRAM, the CPU and GPU share a unified memory pool. A MacBook Pro with an M4 Max chip has up to 128 GB of unified memory accessible at 546 GB/s bandwidth, and the entire pool is available for model weights.

This means a single laptop can load Llama 4 Scout (109B MoE, ~60 GB at Q4) into memory without any multi-GPU configuration. Using the MLX framework, the M4 Max achieves ~18-20 tok/s on 70B-class dense models at Q4 quantization — slower than a desktop RTX 5090 (35 tok/s) but faster than any offloaded configuration. MoE models run even faster on Apple Silicon because only active parameters hit the compute units while the full model sits comfortably in unified memory.

| Chip | Max Memory | Bandwidth | 70B Q4 Speed | Price |
|---|---|---|---|---|
| M4 Max | 128 GB | 546 GB/s | ~20 tok/s | ~$4,000 |
| M3 Ultra | 192 GB | 800 GB/s | ~28 tok/s | ~$5,500 |
| M5 Max | 128 GB | 614 GB/s | ~22 tok/s | ~$4,200 |

The tradeoff is clear: Apple Silicon offers the simplest path to running large models locally — no multi-GPU configuration, no driver issues, no PCIe bottlenecks. But for raw speed at the same model size, two NVIDIA GPUs with NVLink or even a single RTX 5090 will outperform it on models that fit.


Multi-GPU Setups

When a model won't fit on a single GPU, you split it across multiple cards. There are two strategies, and choosing the right one depends on whether you're scaling within a single machine or across nodes.

Tensor Parallelism (TP)

Tensor parallelism shards individual weight matrices across GPUs. Each GPU holds a slice of every layer and they synchronize via AllReduce after each operation. This requires fast interconnect: NVLink (600-900 GB/s) works well, PCIe (32-64 GB/s) creates a bottleneck.

Best for: 2-8 GPUs within a single node. Set --tensor-parallel-size N in vLLM. Each GPU needs roughly total_model_size / N VRAM plus its share of KV cache.
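The per-GPU arithmetic can be sketched as follows — an estimate only, since real engines don't shard the KV cache perfectly evenly:

```python
def tp_per_gpu_gb(weights_gb: float, kv_gb: float, n_gpus: int,
                  overhead_gb: float = 1.0) -> float:
    """Per-GPU VRAM under tensor parallelism: weights and KV cache shard
    roughly 1/N per card, but runtime overhead is paid on every GPU."""
    return weights_gb / n_gpus + kv_gb / n_gpus + overhead_gb

# Llama 4 Scout at Q4 (~59.9 GB weights, ~2.5 GB KV at 8K) across 3 cards
print(round(tp_per_gpu_gb(59.9, 2.5, 3), 1))  # 21.8
```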

Pipeline Parallelism (PP)

Pipeline parallelism assigns complete layers to different GPUs. Activations flow sequentially through the pipeline. Communication is point-to-point (not AllReduce), so it works over slower interconnects like InfiniBand.

Best for: multi-node setups or GPUs connected via PCIe. However, PP introduces pipeline bubbles (idle time) for single-request serving. It shines with batched inference.

Tensor Parallelism vs Pipeline Parallelism

How a model with 8 transformer layers gets distributed across 2 GPUs using each strategy.

[Diagram summary:]

Tensor Parallelism — every one of the 8 layers is split across both GPUs; an AllReduce follows each layer; needs NVLink.

Pipeline Parallelism — each GPU owns complete layers (e.g., layers 1-4 on GPU 1, layers 5-8 on GPU 2); point-to-point transfers between stages; works over PCIe.

Consumer Multi-GPU: The PCIe Reality

Most consumer motherboards connect GPUs via PCIe, not NVLink. llama.cpp handles this transparently — the model is split across GPUs layer by layer (--split-mode layer, with --tensor-split to control the ratio; --n-gpu-layers sets how many layers are offloaded from the CPU at all). For two RTX 5090s (32 GB each = 64 GB total), Llama 4 Scout at Q4_K_M (~60 GB weights + ~2.5 GB KV cache at 8K context) is a tight but workable fit. Three RTX 4090s (72 GB total) also work.

The speed penalty from PCIe versus NVLink is real but manageable for inference: expect ~10-20% overhead compared to an equivalent single-GPU setup. For training, the penalty would be prohibitive, but inference involves far less communication.


Worked Examples

Three step-by-step calculations for common scenarios. Each uses the formula: Total = Weights + KV Cache + Overhead.

Example 1: Llama 3.1 8B on an RTX 4060 Ti 16 GB

Model: 8B parameters, 32 layers, 8 KV heads, 128 head dim. Target: Q4_K_M, 8K context, batch size 1.

  1. Weights: 8B × 0.55 bytes/param = 4.4 GB
  2. KV Cache: 2 × 32 × 8 × 128 × 8,192 × 1 × 2 bytes = 1.07 GB ≈ 1.0 GB
  3. Overhead: ~1.0 GB
  4. Total: 4.4 + 1.0 + 1.0 = 6.4 GB

Fits comfortably on 16 GB with 9.6 GB to spare. You could increase context to 32K (~4 GB KV cache) and still have headroom, or even run the model at Q6_K (6.3 GB weights) for better quality.

Example 2: Qwen 3.5-35B-A3B (MoE) on an RTX 5090

Model: 35B total parameters (3B active), 40 layers, hybrid attention (25% full attention with 2 KV heads, 75% linear attention with minimal KV). Target: Q4_K_M, 16K context, batch size 1.

This is an MoE model — all 35B weights must be loaded despite only 3B being active per token. The hybrid Gated DeltaNet architecture reduces KV cache significantly: only 10 of 40 layers use full attention with traditional KV caching.

  1. Weights: 35B × 0.55 = 19.3 GB
  2. KV Cache: 2 × 10 × 2 × 256 × 16,384 × 1 × 2 bytes = 0.33 GB ≈ 0.3 GB (only full-attention layers cache KV)
  3. Overhead: ~1.2 GB
  4. Total: 19.3 + 0.3 + 1.2 = 20.8 GB

Fits on the RTX 5090's 32 GB with 11 GB to spare. The hybrid attention architecture is the key insight here: because only 25% of layers use full attention, the KV cache barely grows even at 262K context (~3 GB), making this model uniquely long-context friendly on consumer hardware.

Example 3: Llama 4 Scout (109B MoE) on 3× RTX 4090

Model: 109B total parameters (17B active), 16 experts, MoE architecture. Target: Q4_K_M, 8K context, batch size 1.

This is where MoE memory math bites. Despite only 17B active parameters, all 109B weights need VRAM.

  1. Weights: 109B × 0.55 = 59.9 GB (~20 GB per GPU across 3 cards)
  2. KV Cache: ~2.5 GB total at 8K context (~0.8 GB per GPU)
  3. Overhead: ~1.0 GB per GPU
  4. Per GPU: 20.0 + 0.8 + 1.0 = 21.8 GB

Fits on 3× RTX 4090 (24 GB each = 72 GB total) with ~2.2 GB per GPU to spare. On 2× RTX 4090 (48 GB total), you'd need ~62.4 GB — over the limit. The RTX 5090's 32 GB makes 2 cards viable (64 GB total). Despite the large memory footprint, token generation is fast because only 17B parameters are active per forward pass.


Inference Software: The Four Engines

The inference engine you choose directly affects throughput, latency, memory efficiency, and which hardware you can use. In 2026, four engines dominate, each built around a fundamentally different architecture. HuggingFace TGI entered maintenance mode in late 2025, leaving these as the primary production choices.

| Engine | KV Cache Strategy | Batching | Quantization | Hardware |
|---|---|---|---|---|
| llama.cpp | Static allocation | Simple | GGUF (Q2-Q8, K-quants) | CPU, CUDA, Metal, Vulkan, ROCm |
| vLLM | PagedAttention | Continuous | GPTQ, AWQ, FP8 | CUDA, ROCm, TPU, Intel |
| SGLang | RadixAttention | Continuous | GPTQ, AWQ, FP8, GGUF | CUDA, ROCm, TPU |
| TensorRT-LLM | Paged KV + fusion | In-flight | FP8, FP4, NVFP4 | CUDA only (Hopper/Blackwell) |

llama.cpp — The Portable Engine

llama.cpp is a C++ inference engine with zero dependencies. It runs on anything: NVIDIA, AMD, Intel Arc, Apple Silicon, and pure CPU. The GGUF format supports K-quant quantization (Q2_K through Q8_0) without calibration data, and the engine handles CPU/GPU hybrid inference transparently via --n-gpu-layers.

Recent 2026 optimizations include Metal Tensor API support (+26% geomean improvement on Apple Silicon), Vulkan shared-memory kernels (2.5x speedup on Intel Arc), and MCP client support for tool calling.

Throughput: ~120 tok/s single-user on RTX 3090 (8B Q4). Excels at single-user latency but throughput drops at high concurrency due to simple batching. Ollama wraps llama.cpp in a Go server (ollama run qwen3.5:9b) with model management and a REST API — same performance, easier setup.

vLLM — The Production Workhorse

vLLM is a PyTorch-based engine built around PagedAttention: KV cache is divided into fixed-size blocks (typically 16 tokens) allocated on demand in non-contiguous GPU memory, eliminating the 60-80% waste from traditional pre-allocation. Combined with continuous batching (requests enter and exit the batch dynamically each iteration), vLLM achieves 2-4x throughput over naive PyTorch serving.
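The block-allocation idea behind PagedAttention can be illustrated with a toy allocator — no eviction, no real tensors, purely a sketch of the bookkeeping:

```python
class PagedKVAllocator:
    """Toy PagedAttention-style bookkeeping: KV memory is carved into
    fixed-size blocks handed out on demand (no eviction in this sketch)."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # request id -> list of (non-contiguous) block ids
        self.lengths = {}  # request id -> tokens cached so far

    def append_token(self, req: str) -> None:
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        # Finished requests hand their blocks straight back for reuse.
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):                # 40 tokens occupy ceil(40/16) = 3 blocks,
    alloc.append_token("req-1")    # not a pre-allocated max-context slab
print(len(alloc.tables["req-1"]), len(alloc.free))  # 3 5
```

Because blocks return to the pool the moment a request finishes, memory that would sit reserved under static pre-allocation is immediately available to other requests.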

The architecture has five components: an Engine Core coordinator, a Scheduler managing waiting/running queues, KV Cache Managers handling block-level allocation, Workers executing GPU inference, and Model Runners preparing inputs. Tensor parallelism is native (--tensor-parallel-size 4). Hardware support spans NVIDIA, AMD ROCm (first-class since v0.17.0), Intel Gaudi, and TPUs.

Throughput: ~12,500 tok/s on Llama 3.1 8B (H100). The broadest model compatibility and largest community of any production engine.

SGLang — The Latency Leader

SGLang introduces RadixAttention: a token-level radix tree that automatically detects and reuses common prefixes across requests. When a new request arrives, the RadixCache finds the longest cached prefix and returns device-mapped KV indices. Unlike PagedAttention (which pages within a single request), RadixAttention shares KV cache across requests, making it significantly faster for workloads with repeated prefixes: multi-turn chat, RAG with shared system prompts, and few-shot learning.
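The prefix-reuse idea can be shown with a toy lookup. Real RadixAttention uses a radix tree over tokens; this linear-scan sketch only illustrates the payoff:

```python
class ToyPrefixCache:
    """Toy prefix lookup in the spirit of RadixAttention (illustrative only)."""
    def __init__(self):
        self.cached = []  # token sequences whose KV is already on the GPU

    def insert(self, tokens):
        self.cached.append(list(tokens))

    def match(self, tokens):
        """Length of the longest cached prefix — prefill can skip this many."""
        best = 0
        for seq in self.cached:
            n = 0
            while n < min(len(seq), len(tokens)) and seq[n] == tokens[n]:
                n += 1
            best = max(best, n)
        return best

cache = ToyPrefixCache()
cache.insert(["<sys>", "You", "are", "helpful.", "Hi!"])
# A second turn shares the system prompt: 4 tokens of prefill are reused.
print(cache.match(["<sys>", "You", "are", "helpful.", "What", "is", "2+2?"]))  # 4
```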

SGLang also leads on structured output: native regex, JSON schema, and grammar constraints are compiled into an FSM (finite state machine) that runs with minimal overhead during decoding. For agentic workflows that need tool calls as JSON, this eliminates retry loops.

Throughput: ~16,200 tok/s on Llama 3.1 8B (H100) — 29% faster than vLLM. Under 16 concurrent requests, SGLang completes 4.6x faster than vLLM. TTFT p95 is consistently 5-8% lower at all concurrency levels.

TensorRT-LLM — Maximum NVIDIA Performance

TensorRT-LLM trades portability for raw speed. It compiles models into optimized CUDA graphs with kernel fusion (LayerNorm + MatMul + bias + activation in a single kernel), hand-tuned Flash Attention variants, and in-flight batching that mixes prefill and decode phases within the same batch.

The quantization support is NVIDIA-native: FP8 on Hopper, NVFP4 on Blackwell (with automatic backend selection between TRT-LLM and FlashInfer kernels based on profiled performance). On H100 with FP8, TensorRT-LLM delivers 10,000+ output tok/s with sub-100ms TTFT — the highest absolute performance of any engine, but it only runs on NVIDIA datacenter GPUs and requires a compilation step before serving.

When to Use Each

| Scenario | Best Choice | Why |
|---|---|---|
| Personal laptop / Apple Silicon | llama.cpp (or Ollama) | Metal backend, CPU/GPU hybrid, GGUF flexibility |
| Single NVIDIA GPU, dev use | Ollama or SGLang | Ollama for simplicity, SGLang for structured output |
| Multi-GPU production API | vLLM or SGLang | Broadest hardware support (vLLM) or best latency (SGLang) |
| Multi-turn / RAG with shared prompts | SGLang | RadixAttention prefix sharing across requests |
| Maximum throughput, NVIDIA datacenter | TensorRT-LLM | Kernel fusion, FP8/FP4, in-flight batching |
| AMD / Intel GPU | vLLM or llama.cpp | First-class ROCm in vLLM; Vulkan in llama.cpp |

For Apple Silicon specifically, MLX is purpose-built for the unified memory architecture and outperforms llama.cpp on M-series chips for models that fit entirely in memory.


The hardware equation for local LLM inference is ultimately straightforward: weights plus KV cache plus overhead must fit in your available memory. Quantization is the primary lever, context length is the hidden variable, and memory bandwidth determines speed. Run the numbers for your specific model and use case before making any hardware decisions — the formulas above will get you within 10% of actual usage.

Check our model directory for detailed specs on any model, or use the LLM leaderboard to compare performance before deciding what to run locally.

Frequently Asked Questions

How much VRAM does Llama 4 Scout need?

Llama 4 Scout has 109B total parameters (17B active, MoE). At FP16, that's ~218 GB. With 4-bit quantization (Q4_K_M), it drops to ~60 GB, fitting on 3× RTX 4090 (24 GB each), 2× RTX 5090 (32 GB each), or a single M4 Max with 128 GB unified memory. Add 2-8 GB for KV cache depending on context length.

Can I run an LLM without a GPU?

Yes. Tools like llama.cpp support CPU-only inference using system RAM. It works but is 10-20x slower than GPU inference. A hybrid approach that offloads some layers to the GPU and keeps the rest in RAM offers a practical middle ground. The rule of thumb: hybrid inference is worthwhile when 60%+ of the model's layers fit in VRAM.

What's the difference between GGUF, GPTQ, and AWQ?

GGUF (llama.cpp) supports CPU+GPU hybrid inference with flexible quantization levels and no calibration required. GPTQ is GPU-only with calibration-based quantization, offering ~20% faster token generation on NVIDIA GPUs. AWQ preserves critical weights using activation awareness, delivering slightly better accuracy at the same bit width but with a smaller ecosystem of pre-quantized models.

How much memory does the KV cache consume?

The KV cache scales linearly with context length, batch size, and the number of attention layers. For a Llama 3.1 8B model, the KV cache uses ~4 GB at 32K context and ~16 GB at 128K context. This is often the factor that pushes you from one GPU tier to the next. Techniques like GQA and KV cache quantization can reduce this by 4-8x.

Do MoE models need less VRAM because fewer parameters are active?

No. MoE models need VRAM for all parameters, not just the active ones. A 120B MoE model with 12B active parameters still loads all 120B weights into memory. The active parameter count only determines compute cost and token generation speed, not memory. Compression techniques like FloE (2025) can reduce MoE memory by up to 9.3x but are not yet mainstream.

What's the best GPU for my budget?

For small models (Qwen 3.5-9B, Gemma 3 12B) at Q4: a used RTX 3060 12 GB (~$200) or RTX 4060 Ti 16 GB (~$350) works well. For MoE models (Qwen 3.5-35B-A3B): a single RTX 4090 (24 GB, ~$1,600) or RTX 5090 (32 GB, $1,999). For Llama 4 Scout (109B): three used RTX 3090s or two RTX 5090s.
