Research·Technical Deep Dive

The Position of Your Context Matters for LLMs

Three years of research, 50 frontier models, one stubborn finding: where you put a piece of information in the prompt changes the answer more than what is in it. The geometry, the evidence, and what to do about it.

Jonathan Chavez
Co-Founder @ LLM Stats
·12 min read

The most reliable finding in long-context research is also the most uncomfortable. Where you put a piece of information inside a prompt changes the model's answer more than what the information is. The same documents, the same question, the same model. Move the relevant passage from the start of the context to the middle and accuracy can drop by twenty points. Move it back and the model is fluent again.

This is not a quirk of one architecture or one training run. It shows up in every frontier model that has been tested rigorously, across three years and a dozen benchmarks. It shows up at 4K tokens and at 1M tokens. It shows up before any weights are even trained. The shape is so consistent that recent theory has begun treating it as an architectural property of decoder-only transformers, not a bug to be patched.

This piece covers what the research actually says: the papers, the numbers, the geometry, and what to do with it on Monday morning.

Figure: the U-shaped accuracy curve from Liu et al., "Lost in the Middle" (Stanford, 2023; TACL 2024, arXiv:2307.03172), multi-document QA over 20 documents. When the answer sits in the middle of a long prompt, accuracy collapses; when it sits at the edges, accuracy holds. With the answer mid-context, GPT-3.5-turbo scored below its closed-book baseline of 56.1%: the model performs worse with the answer in the prompt than without any documents at all.

The Shape Everyone Found

The reference experiment is Liu et al., “Lost in the Middle” (Stanford, TACL 2024). Take a multi-document QA setup with 20 retrieved passages where exactly one contains the answer. Hold everything else fixed and slide the answer passage from position 1 to position 20. Plot accuracy against position.

The curve is a U. Models score highest when the answer is at the beginning of the context (the primacy bias) or at the very end (the recency bias). When the answer is buried in the middle of a 20-document context, performance collapses, often below what the same model scores with no documents at all. GPT-3.5-turbo with the answer mid-context dropped under its own closed-book baseline of 56.1%. The model performed worse with the answer in the prompt than without it.

What made the result hard to dismiss was that it persisted across model size, architecture, and explicit long-context training. The paper found the shape in GPT-3.5, Claude 1.3, Llama-2, and purpose-built long-context models. Shortening the context did not remove the trough; lengthening it made the trough deeper.
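The position sweep behind that curve is easy to reproduce in outline. The sketch below is a minimal illustration, not the authors' harness: the function name, the `Document [i]` prompt layout, and the placeholder passages are all assumptions made for the example.

```python
def build_position_sweep(gold, distractors, question):
    """Return one prompt per gold position; everything else is held fixed."""
    prompts = []
    for position in range(len(distractors) + 1):
        # Slide the gold passage through the otherwise unchanged distractor list.
        docs = distractors[:position] + [gold] + distractors[position:]
        body = "\n\n".join(
            f"Document [{i}]: {doc}" for i, doc in enumerate(docs, start=1)
        )
        prompts.append(f"{body}\n\nQuestion: {question}\nAnswer:")
    return prompts


# Hypothetical data: 19 filler passages plus one gold passage gives 20 positions.
distractors = [f"Distractor passage {n}." for n in range(1, 20)]
prompts = build_position_sweep(
    "GOLD passage with the answer.", distractors, "What is the answer?"
)
```

Scoring each of the 20 prompts with the same model and plotting accuracy against the gold position is what produces the curve; only the position changes between runs.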

A year later, Hsieh et al., “Found in the Middle” (Google + UW, ACL Findings 2024), measured the same effect inside the attention mechanism itself. They quantified the average attention weight that LLMs assigned to each position, independent of content, and found a U-shape there too. The model attends more to the start and end of its input as a function of position alone. Their calibration method, which subtracts this intrinsic bias from attention scores, recovered up to 15 percentage points on tasks where the gold document was in the middle. The U was not just an output artifact. It was inside the math.
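To make the subtraction idea concrete, here is a toy sketch. It is not the paper's implementation, which works on attention inside the model; the five-slot layout, the invented attention values, and the assumption that the positional baseline is measured with a content-free dummy document in each slot are all illustrative.

```python
def calibrate(observed, positional_baseline):
    """Subtract the content-independent baseline measured at each position."""
    return [o - b for o, b in zip(observed, positional_baseline)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical attention mass per document slot; all numbers are invented.
positional_baseline = [0.30, 0.12, 0.08, 0.12, 0.38]  # dummy document in every slot
observed = [0.29, 0.11, 0.15, 0.11, 0.34]             # gold document sits in slot 2

calibrated = calibrate(observed, positional_baseline)
raw_top = argmax(observed)           # slot favored by raw attention
calibrated_top = argmax(calibrated)  # slot favored once position is removed
```

In this toy, raw attention favors the last slot purely because of position, while the calibrated scores single out the middle slot that actually holds the gold document.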


Why the U Exists

For two years the prevailing explanation was that the U was learned. Models saw most of their important content at the start or end of training documents (titles, summaries, conclusions), so they inherited a positional prior at training time. Recent theory says the cause runs deeper than that.

Wu, Wang, Jegelka, and Jadbabaie (MIT, ICML 2025) built a graph-theoretic framework for position bias in multi-layer attention. Modeling each attention mask as a directed graph, they proved that causal masking inherently biases attention toward earlier positions, because tokens in deeper layers attend to increasingly contextualized representations of those early tokens. The bias is not learned; it is the geometry of stacking causal attention.
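A small numerical toy shows the direction of that result. It assumes uniform causal attention with no residual connections, no learned weights, and no positional encoding, so it illustrates the attention-only drift toward early tokens rather than reproducing the paper's framework; the function names are made up for the example.

```python
def uniform_causal_attention(n):
    """Row i attends equally to tokens 0..i (lower triangular, rows sum to 1)."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def last_token_influence(n, depth):
    """Weight of each input token in the final token's output after `depth` layers."""
    layer = uniform_causal_attention(n)
    total = layer
    for _ in range(depth - 1):
        total = matmul(layer, total)
    return total[-1]


for depth in (1, 2, 4, 8):
    row = last_token_influence(n=8, depth=depth)
    # Share of the first token versus share of the last token.
    print(depth, round(row[0], 3), round(row[-1], 6))
```

With one layer every token contributes equally. As layers stack, the first token's share grows and the last token's share shrinks, which is the pure-primacy behavior that attention-only analysis predicts.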

Their attention-only analysis predicted pure primacy (full collapse onto the first token at infinite depth), which empirical models do not show. Two 2026 papers, Herasimchyk et al. (A Residual-Aware Theory) and Chowdhury (Lost in the Middle at Birth), resolved the discrepancy by adding residual connections to the theory. Once you do, the closed-form influence density splits into three regimes:

  • The primacy tail. Causal masking forces a logarithmic divergence of gradient influence at the start of the prompt. Early tokens become anchors that every later token reaches back to.
  • The recency delta. Residual connections create an isolated, full-strength anchor at the final token. The token generating the next output is one residual hop away from everything it just produced.
  • The factorial dead zone. Between the two extremes, influence falls as 1/(H−1)! where H is network depth. The middle is a topological valley that training does not climb out of.
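The first two regimes can be seen in a deliberately crude toy: uniform causal attention mixed half-and-half with an identity (residual) path. The mixing weight, the sequence length, and the function names are assumptions for illustration; the toy does not derive the factorial decay, so the last helper simply evaluates the 1/(H−1)! factor quoted above.

```python
from math import factorial


def causal_with_residual(n, mix=0.5):
    """One toy layer: `mix` of the token itself plus (1 - mix) uniform causal attention."""
    return [
        [(mix if i == j else 0.0) + ((1 - mix) / (i + 1) if j <= i else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def influence_profile(n, depth):
    """Weight of each input token in the final token's output after `depth` layers."""
    layer = causal_with_residual(n)
    total = layer
    for _ in range(depth - 1):
        total = matmul(layer, total)
    return total[-1]


def dead_zone_scale(depth):
    """The 1/(H-1)! factor from the text, for network depth H."""
    return 1.0 / factorial(depth - 1)


profile = influence_profile(n=16, depth=4)
first, middle, last = profile[0], profile[8], profile[-1]
```

In this toy both `first` and `last` exceed `middle`: the causal path feeds the early tokens and the residual path protects the final one. For a sense of scale, `dead_zone_scale(13)` is about 2e-9.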

Chowdhury validates the theory empirically. Untrained Qwen2 and GPT-2 architectures exhibit the U at Step 0, with random weights, and the shape is identical with or without RoPE. Standard pretraining does not flatten it. The U is what the architecture looks like before anything has happened to it. Everything we do downstream is built on that prior.

Figure: causal attention matrix, schematic, with three annotated regions (the primacy tail, the recency delta, the factorial dead zone). Decoder-only transformers see the prompt through a triangle of attention: the first token is read by every later token, the last token reads itself directly at full strength, and the tokens in between are read by fewer tokens and less clearly. Early tokens act as attention sinks before any positional encoding is applied. The closed-form analysis of primacy and recency at initialization is from Lost in the Middle at Birth (arXiv:2603.10123), building on the graph-theoretic position-bias theory of Wu et al. (MIT, ICML 2025, arXiv:2502.01951). The U-shape persists with or without RoPE.

Longer Means Deeper

The U is the part of the picture that survives at any length. The part that gets worse with length is the trough. Modarressi et al., NoLiMa (Adobe Research, ICML 2025) is the cleanest evidence. They built a needle-in-a-haystack benchmark in which the question and the needle share minimal lexical overlap, forcing the model to retrieve via semantic association rather than string match. Twelve frontier models that all claim 128K+ context windows scored near-perfect under 1K tokens. At 32K, ten of the twelve fell below 50% of their short-context baseline. Even GPT-4o, the leader, dropped from 99.3% to 69.7%.
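The 50% line in that result is relative to each model's own short-context score, not an absolute accuracy. A quick check with the GPT-4o figures quoted above (the helper name is made up for the example):

```python
def retained_fraction(baseline, at_length):
    """Share of the short-context score that survives at the longer length."""
    return at_length / baseline


# GPT-4o figures from the text: 99.3% short-context baseline, 69.7% at 32K.
gpt4o = retained_fraction(baseline=99.3, at_length=69.7)
below_half = gpt4o < 0.5
```

GPT-4o keeps roughly 70% of its baseline at 32K, so it stays above the 50% line that ten of the twelve models fell through.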

Chroma's Context Rot study (Hong, Troynikov, Huber, July 2025) made the result domain-general. Across 18 frontier models including Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, and the Qwen3 family, every single model degraded with length on five distinct experiments. The headline numbers from their report:

  • All 18 models showed accuracy drops with longer input, well before the advertised window limit.
  • Models did worse on logically coherent haystacks than on randomly shuffled ones, inverting the “tidy your context” instinct.
  • A length-only accuracy loss of about 7.9 percent shows up before any distractors are added; pure length is a tax on accuracy regardless of what fills the tokens.

Read alongside RULER, NoLiMa, and Lost in the Middle, the picture is consistent. The U is real. The middle deepens with length. The 1M-token window is a marketing number, not a working number.

Figure (schematic heatmap): retrieval accuracy by answer position and prompt length. Vertical axis: where the answer is buried, from the start of the context (0%) to the very end (100%). Horizontal axis: total prompt length, from a comfortable 1K tokens out to a 1M advertised window. The longer the context, the bigger and deeper the trough: accuracy falls fastest in the middle, and the cliff arrives well before the advertised window limit.

Schematic. Numbers reflect the consistent pattern measured in needle-in-a-haystack and RULER-style suites: edges hold up, middle collapses, and the gap widens at length. See Hsieh et al. (NVIDIA, arXiv:2404.06654), Modarressi et al. NoLiMa (ICML 2025, arXiv:2502.05167), Hong et al. Context Rot (Chroma, 2025).

The Window Is Not the Window

To make this measurable, Hsieh et al. at NVIDIA built RULER (2024), a synthetic benchmark with 13 long-context tasks across retrieval, multi-hop tracing, aggregation, and question answering. They define effective context length as the longest input where a model averages above Llama-2-7B's 4K score (85.6%) on the suite. The threshold is deliberately low; it is what a small, short-context model already cleared in 2023.

On that bar, half the frontier of 2024 falls short of its own number plate. Gemini 1.5 Pro (claimed 1M) is effective at 128K. GPT-4 (claimed 128K) is effective at 64K. Claude 3.5 Sonnet (claimed 200K) is effective at 64K. Mistral-v0.2 (claimed 32K) is effective at 16K. DBRX (claimed 32K) is effective at 8K. LWM (claimed 1M) is effective below 4K. The gap is not a few percent; it is a factor of two to eight for most of these models, and more than two orders of magnitude for LWM.

The 2025 generation of models pushed the bar higher. Qwen3-32B and GLM4 hold above the threshold past 128K on the same suite. But the structure of the gap persists: claimed length is the maximum a model will accept, effective length is the maximum where it answers reliably, and the two are different numbers. Use the wrong one to design a system and you have a bug, not a feature.
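The effective-length definition reduces to a few lines. The sketch below follows the definition as stated above (longest tested length whose suite average is above 85.6%); the function name and the per-length scores are hypothetical, not results for any real model.

```python
THRESHOLD = 85.6  # Llama-2-7B's 4K score on RULER, as quoted in the text


def effective_context_length(scores_by_length, threshold=THRESHOLD):
    """Longest tested length whose average score is above the threshold.

    Returns None when the model is below the threshold at every tested length.
    """
    passing = [length for length, score in scores_by_length.items()
               if score > threshold]
    return max(passing) if passing else None


# Hypothetical suite averages for an imaginary model with a 128K claimed window.
claimed = 128_000
scores = {4_000: 95.1, 8_000: 93.4, 16_000: 90.2,
          32_000: 86.0, 64_000: 81.3, 128_000: 70.5}

effective = effective_context_length(scores)
ratio = effective / claimed
```

For these made-up scores the effective length is 32K, a quarter of the claimed window. That ratio, not the claimed window, is the number to design against.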

Figure: claimed versus effective context length on RULER at the 85.6% threshold, drawn on a log-scale axis from 4K to 1M tokens. Effective length as a share of claimed length:

  • Gemini 1.5 Pro: 13%
  • GPT-4 (turbo): 50%
  • Claude 3.5 Sonnet: 32%
  • Llama 3.1 70B: 50%
  • Mixtral 8x22B: 50%
  • Command-R+: 25%
  • Qwen2 72B: 25%
  • DBRX: 25%

Effective length is the longest input where the model averages above Llama-2-7B's 4K-token RULER score (85.6%). Hsieh et al., NVIDIA, RULER (arXiv:2404.06654). Ratios are rounded to the nearest percent.

Length Itself Is a Tax

The next question is whether the degradation comes from distractors, from coherent-but-irrelevant context, or from raw length. Levy, Jacoby, and Goldberg (Bar-Ilan + AI2, ACL 2024 Outstanding Paper) settled it with FLenQA, a controlled True/False reasoning benchmark.

The setup is sterile by design. Each sample contains two short facts that together imply the answer. The reasoning never changes. What changes is the amount of irrelevant padding wrapped around them, with controls for padding type, location, and dispersion. They generated versions at roughly 250, 500, 1K, 2K, and 3K tokens, every model getting the same logic at every length.

Average accuracy drops from around 0.92 at the shortest input to 0.68 at 3,000 tokens, across GPT-4, GPT-3.5, Claude 2.1, Gemini-Pro, Mistral 70B, and Mixtral 8x7B. The trend is universal across models, padding types, and dispersion.
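The padding protocol is simple enough to sketch. This is not the FLenQA generator: the example facts and filler sentences are invented, word counts stand in for token counts, and placing the facts a third and two thirds of the way through is just one of the dispersion settings one might test.

```python
from itertools import cycle


def pad_sample(facts, question, filler_sentences, target_words):
    """Surround two fixed facts with irrelevant filler until target_words is reached."""
    fact_a, fact_b = facts
    words = len(f"{fact_a} {fact_b} {question}".split())
    padding = []
    for sentence in cycle(filler_sentences):
        if words >= target_words:
            break
        padding.append(sentence)
        words += len(sentence.split())
    # Place the two facts roughly one third and two thirds of the way through.
    third = len(padding) // 3
    parts = (padding[:third] + [fact_a] + padding[third:2 * third]
             + [fact_b] + padding[2 * third:])
    return " ".join(parts) + f"\n\nTrue or False: {question}"


# Hypothetical sample: the logic is identical at every length.
facts = ("Ana is taller than Ben.", "Ben is taller than Cal.")
question = "Ana is taller than Cal."
filler = ["The harbor was quiet that morning.", "Rain was expected by noon."]

versions = {n: pad_sample(facts, question, filler, n)
            for n in (250, 500, 1000, 2000, 3000)}
```

Because only the filler grows, any accuracy difference between the versions can be attributed to length rather than to a harder question.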

Three thousand tokens. Not three hundred thousand. Well below the advertised window of every model tested. The lift you would intuitively expect from “more context” reverses into a consistent loss the moment the relevant signal stops fitting in a short prompt. Length is not free; it is paid out of accuracy before any distractor is added.

Combine FLenQA with the 7.9% length-only floor that Chroma measured in 2025 and a clean rule emerges. Two budgets are at stake on every long prompt: the budget the model says it has, and the budget the model can actually reason over. Treat the second as a fraction of the first.

Figure: reasoning accuracy on FLenQA, a controlled true/false QA benchmark where the underlying logic is fixed and only the surrounding padding grows. The question and answer stay the same while the prompt is padded with irrelevant text, and every model tested degrades by three thousand tokens, well below its advertised maximum. Levy, Jacoby, Goldberg (Bar-Ilan / AI2, ACL 2024, arXiv:2402.14848). The ranking and shape generalize across padding type and location.

The Practical Playbook

You cannot retrain the attention mechanism, and the calibration methods that work on paper (Found-in-the-Middle, residual rebalancing) are not exposed in production APIs. What you can do is engineer around the geometry. The four moves below survive contact with production systems because they respect the U instead of fighting it.

1. Edge-load the prompt

Place the most important documents and the question at the very start or the very end. Anthropic's own long-context guidance recommends putting 20K+ token documents at the top and restating the actual query at the bottom; their internal tests show up to a 30 percent improvement on multi-document tasks from this layout alone. Treat the middle 60% as eviction-cache territory: bulk reference material lives there, but no critical instruction or load-bearing fact does.
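One way to apply this when you already have relevance scores is to interleave documents so the strongest land at the two edges and the weakest in the middle. The sketch below is one possible heuristic, not Anthropic's recommended procedure; the function names and the scores are assumptions.

```python
def edge_order(documents):
    """Order (score, text) pairs so the highest scores sit at the edges."""
    ranked = sorted(documents, key=lambda d: d[0], reverse=True)
    front, back = [], []
    for rank, (_, text) in enumerate(ranked):
        # Alternate: best to the front, second best to the back, and so on.
        (front if rank % 2 == 0 else back).append(text)
    return front + back[::-1]


def build_prompt(documents, question):
    """Documents first, edge-loaded, with the question at the very end."""
    return "\n\n".join(edge_order(documents)) + f"\n\n{question}"


# Hypothetical relevance scores from an upstream ranker.
docs = [(0.9, "A"), (0.1, "E"), (0.5, "C"), (0.7, "B"), (0.3, "D")]
ordered = edge_order(docs)
prompt = build_prompt(docs, "Which document mentions the outage?")
```

Here the order comes out as A, C, E, D, B: the two most relevant documents take the first and last slots, and the least relevant one sits in the middle where a miss costs the least.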

2. Cap the budget aggressively

The window on the spec sheet is not the window the model uses. A working rule that survives both RULER and Context Rot data is to budget 25 to 30 percent of the advertised maximum. Past that, the length-only tax compounds with the U-shape and accuracy falls off nonlinearly. If you must go longer, instrument what you ship: run needle-in-a-haystack regression tests at the lengths your prompts actually reach, not just at 4K.
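The budget rule is plain arithmetic, which makes it easy to enforce in code. A minimal sketch, with hypothetical helper names and the 25 to 30 percent figure taken from the rule of thumb above:

```python
def working_budget(advertised_tokens, percent=25):
    """Token budget at `percent` of the advertised window (integer arithmetic)."""
    return advertised_tokens * percent // 100


def needs_trimming(prompt_tokens, advertised_tokens, percent=25):
    """True when a prompt exceeds the working budget and should be compacted."""
    return prompt_tokens > working_budget(advertised_tokens, percent)


budget_1m = working_budget(1_000_000)                # conservative end of the rule
budget_200k = working_budget(200_000, percent=30)    # generous end of the rule
too_long = needs_trimming(90_000, 200_000, percent=30)
```

A 1M-token window becomes a 250K working budget, and a 200K window becomes 60K at the generous end, so a 90K prompt against a 200K model is already a candidate for compaction.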

3. Retrieve, do not stuff

BABILong showed that retrieval-augmented generation hits roughly 60% on single-fact QA independent of context length, while in-context reasoning collapses past 32K and frontier models effectively use only 10 to 20% of their advertised context. The implication for production: prefer compaction, summarization, and just-in-time tool calls over stuffing the whole corpus into the prompt. Keep the working set small and the relevant signal close to the model's edges.
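The shape of "retrieve, do not stuff" is: score every passage against the query, keep the top few, and send only those. The sketch below uses plain word overlap as a stand-in scorer so it stays self-contained; a production system would normally use BM25 or embeddings, and the passages here are invented.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, passages, k=3):
    """Return the k passages sharing the most words with the query."""
    query_terms = tokenize(query)
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(passages, key=lambda p: len(query_terms & tokenize(p)),
                    reverse=True)
    return ranked[:k]


# Hypothetical corpus.
passages = [
    "The invoice total was 412 euros.",
    "Shipping takes five days.",
    "The invoice was issued in March.",
    "Our office is in Lyon.",
]
working_set = retrieve("What was the invoice total?", passages, k=2)
```

Only the two invoice passages reach the prompt, so the relevant signal sits in a short context instead of somewhere inside a long one.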

4. Structure the surface

When you must place documents in long context, wrap them in delimiters (XML tags, named blocks), keep their order stable across calls, and instruct the model to quote relevant passages before answering. The quote-first pattern is a poor man's attention calibration: it forces the model to re-emit the load-bearing tokens at the recency anchor before generating an answer, which moves them from the dead zone to the edge of the prompt. Found in the Middle measures up to 15 percentage points of recovery from the formal version of this; the informal version is one extra sentence in your system prompt.
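A prompt builder for this pattern might look like the sketch below. The tag names (`documents`, `document`, `source`, `content`, `quotes`) and the wording of the instruction are assumptions chosen for illustration, not a required schema.

```python
from xml.sax.saxutils import escape


def build_structured_prompt(documents, question):
    """Wrap (name, text) pairs in tags, then ask for quotes before the answer.

    The caller's document order is preserved so it stays stable across calls.
    """
    blocks = [
        f'<document index="{i}">\n<source>{escape(name)}</source>\n'
        f"<content>\n{escape(text)}\n</content>\n</document>"
        for i, (name, text) in enumerate(documents, start=1)
    ]
    instructions = (
        "First quote the passages relevant to the question inside <quotes> tags. "
        "Then answer using only those quotes."
    )
    return ("<documents>\n" + "\n".join(blocks) + "\n</documents>\n\n"
            + instructions + f"\n\nQuestion: {question}")


# Hypothetical documents; escape() keeps stray angle brackets from breaking the tags.
docs = [("q3-report.txt", "Revenue grew 4% when churn < 2%."),
        ("notes.txt", "Churn was 1.8% in Q3.")]
prompt = build_structured_prompt(docs, "Did revenue grow in Q3?")
```

The documents sit at the top, and the quote-first instruction and the question sit at the bottom, so the model's own quotes end up next to the answer it generates.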

The playbook in brief, with the headline number behind each move:

  • Edge-load the prompt: put the answer where attention lives. Up to +30% multi-document accuracy (Anthropic).
  • Cap the budget aggressively: use 25 to 30 percent of the advertised window. 7.9% length-only accuracy loss across all 18 models tested (Chroma).
  • Retrieve, do not stuff: fetch on demand, summarize between turns. 10 to 20% useful share of context (BABILong).
  • Structure the surface: delimiters, headings, repeated grounding. Up to +15pp middle-position recovery (Found in the Middle).


What Comes Next

The 2026 research is converging on three honest answers about what can fix this and what cannot.

Position bias is structural, not stylistic. The theoretical work from MIT and the residual-aware analyses from early 2026 establish that the U-shape is what causal decoder-only transformers do by default. Asking standard pretraining to remove it is asking gradient descent to undo geometry. The shape persists at random initialization; pretraining does not climb out of the topological valley.

Calibration helps, but at the model layer. Found-in-the-Middle, Wu's graph-theoretic interventions, and subsequent calibration work all show consistent recoveries (5 to 15 percentage points) by editing attention scores after softmax. None of this is exposed in production APIs today. The vendors who ship calibrated long-context behavior natively will have a real structural advantage; the rest will compete on prompt structure and effective context length.

The benchmark gap is closing, slowly. The Qwen3 family in 2025 and the latest frontier releases in 2026 have pushed effective context past 128K on RULER and held NoLiMa scores higher than the 2024 cohort. But every cleanly controlled length-isolation study, including FLenQA and Context Rot, finds the same general shape, and the residual-aware theory work predicts it. The U is shallower in 2026 than in 2023. It is still a U.

The takeaway is not that long context is broken. It is that long context is a budget, and budgets need to be planned. The advertised window is a ceiling, not a target. The middle of that window is the most expensive real estate in your prompt. The edges, structure, and retrieval discipline are still the levers that actually move accuracy.

Position is a feature of your prompt. Treat it like one.

Frequently Asked Questions

  • Position bias means the model's answer depends on where a fact sits inside the prompt, not just whether the fact is present. Liu et al. (TACL 2024) first measured a U-shaped accuracy curve: when the answer to a multi-document question is at the start or end of the context, models score highest; when it is in the middle, accuracy often falls below the closed-book baseline. Three years and dozens of models later, the curve has not flattened.
  • The bias is structural, not learned. Wu et al. (ICML 2025) showed graph-theoretically that causal masking biases attention toward early tokens. Chowdhury (2026, arXiv:2603.10123) and Herasimchyk et al. (2026) prove the full U is already present at random initialization: causal masking creates a primacy tail, residual connections create a recency anchor, and the middle is a factorial dead zone of order 1/(H−1)! where H is depth. Pretraining does not climb out of it.
  • Effective context is smaller than the spec sheet. NVIDIA's RULER (2024) defines effective context as the longest input where average accuracy stays above Llama-2-7B's 4K-token score (85.6%). On that bar, Gemini 1.5 Pro's 1M window is effective at 128K, GPT-4's 128K window is effective at 64K, and DBRX's 32K is effective at 8K. Chroma's 2025 Context Rot study confirms this on 18 frontier models: every one degrades with length, with a length-only accuracy loss of about 7.9% before any distractors are added.
  • Critical content belongs at the edges. Anthropic's long-context guidance recommends placing 20K+ token documents at the very top of the prompt and the actual query at the very bottom, with examples and instructions sandwiched between. Their internal tests show this can improve multi-document accuracy by up to 30 percent. Treat the middle 60% as a cache eviction zone: nothing load-bearing belongs there.
  • For point lookups, retrieval mostly beats stuffing the context. BABILong (NeurIPS 2024) finds that RAG hits roughly 60% on single-fact QA independent of context length, while in-context reasoning collapses past 32K. That is not because RAG is smarter; it is because RAG keeps the working set small, which keeps the relevant signal close to the model's edges. The right mental model in 2026 is: in-context for cross-document reasoning, RAG for point lookups, and never mix the two by stuffing 200K tokens hoping the model sorts it out.
  • Calibration fixes the bias only partially. Found in the Middle (Hsieh et al., ACL Findings 2024) introduced an attention-calibration mechanism that estimates and removes the intrinsic U-shaped positional bias from attention scores, recovering up to 15 percentage points when the relevant document is in the middle of the input. The technique requires modifying the attention mechanism, which is not exposed by production APIs. Until vendors adopt it natively, the practical levers remain prompt-side: edge-load the prompt, structure with delimiters, and ask the model to quote relevant passages before answering.
