Meta shipped 10M-token context. The model scores 15.6% at 128K tokens. Here's what actually works and what doesn't.
Llama 4 Scout claims a 10-million-token context window. That's roughly 20,000 pages of text, 2.5 million lines of code, or the entire Harry Potter series seven times over. When Meta announced it in April 2025, the number dominated every headline. What didn't make headlines: on Fiction.LiveBench -- a test that measures whether a model actually understands what it reads -- Scout scored 15.6% at just 128K tokens. Gemini 2.5 Pro scored 90.6% on the same test.
The model that claims to read 10 million tokens can barely comprehend 128 thousand.
This is the story of the context window arms race: where the marketing is enormous, the reality is complicated, and "more context" is almost never the answer you think it is.
Every major AI lab now ships at least 1 million tokens of context. The race has produced a clear hierarchy:
| Model | Context Window | Provider | Type | Release |
|---|---|---|---|---|
| LTM-2-Mini | 100M tokens | Magic.dev | Research/private | Aug 2024 |
| Llama 4 Scout | 10M tokens | Meta | Open-weight | Apr 2025 |
| Grok 4.20 | 2M tokens | xAI | Proprietary | Mar 2026 |
| Claude Opus 4.6 | 1M tokens | Anthropic | Proprietary | Mar 2026 |
| GPT-5.4 | 1.05M tokens | OpenAI | Proprietary | Mar 2026 |
| Gemini 3.1 Pro | 1M tokens | Google | Proprietary | Feb 2026 |
| Llama 4 Maverick | 1M tokens | Meta | Open-weight | Apr 2025 |
| Qwen 3.6 Plus | 1M tokens | Alibaba | Proprietary | Mar 2026 |
| Mistral Small 4 | 256K tokens | Mistral | Open-weight | Mar 2026 |
| DeepSeek V3.2 | 128K tokens | DeepSeek | Open-weight | 2025 |
Two years ago, 128K was impressive. Now five models offer 1M+, two offer 2M+, and Meta claims 10M. Context size has become table stakes. The question isn't "how big" anymore. It's "how well."
The engineering behind Scout's context window is genuinely clever, even if the end result disappoints.
The core innovation is iRoPE -- Interleaved Rotary Position Embeddings. Here's how it works:
Every four transformer layers follow a repeating pattern: three layers use local attention, restricted to 8K-token chunks, with standard rotary position embeddings (RoPE); the fourth uses global attention over the entire context with no positional embeddings at all (NoPE).
So 75% of the model's layers only look at 8K tokens at a time. The other 25% periodically sweep the full context. This is how you make 10M tokens computationally feasible -- you cheat, elegantly. Most of the model operates locally. A minority of layers operate globally.
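The interleave is easier to see as code. This is a toy sketch of the schedule, not Meta's implementation -- the 48-layer count is an illustrative assumption; only the 3:1 local/global ratio and the 8K chunk size come from the description above.

```python
# Toy sketch of the iRoPE-style layer interleave.
# Assumption (hypothetical dims): 48 transformer layers, 8K-token local chunks.
NUM_LAYERS = 48
CHUNK = 8192  # local attention span, in tokens

def layer_kind(layer_idx: int) -> str:
    """Every 4th layer attends globally (and uses no positional
    embedding); the other three attend within an 8K chunk with RoPE."""
    return "global-nope" if (layer_idx + 1) % 4 == 0 else "local-rope"

def visible_tokens(layer_idx: int, query_pos: int) -> range:
    """Token positions a query at `query_pos` can attend to (causal)."""
    if layer_kind(layer_idx) == "global-nope":
        return range(0, query_pos + 1)           # full context sweep
    chunk_start = (query_pos // CHUNK) * CHUNK   # stay inside the local chunk
    return range(chunk_start, query_pos + 1)

kinds = [layer_kind(i) for i in range(NUM_LAYERS)]
print(kinds.count("local-rope") / NUM_LAYERS)  # -> 0.75: 75% of layers see only 8K
```

Note what `visible_tokens` implies: at position 10,000, a local layer sees only tokens 8,192 onward, while a global layer sees everything. That asymmetry is the whole trick.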
The model was pre-trained on over 30 trillion tokens across 200 languages, but at a 256K context length -- not 10M. The jump from 256K to 10M happens through length generalization during instruction tuning. The model was never trained on 10M-token sequences. It extrapolates.
This matters. A lot.
Before we talk about whether it works, let's ground the number in reality.
| Content Type | Approximate Token Count | Fits in 10M? |
|---|---|---|
| One page of text | ~500 tokens | Yes (20,000 pages) |
| Average novel (80K words) | ~100K tokens | Yes (~100 novels) |
| War and Peace + entire Harry Potter | ~1.5M tokens | Yes, 6x over |
| Full codebase (500K lines) | ~2M tokens | Yes, 5x over |
| 10 hours of audio transcription | ~1M tokens | Yes, 10x over |
| One hour of video at 1fps | ~1M tokens | Yes, 10x over |
| Entire English Wikipedia | ~4B tokens | No |
Source: estimates from Digital Applied and Gemini 1.5 technical report
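You can sanity-check the table yourself with the common heuristic of ~1.33 tokens per English word (the exact figure is tokenizer-dependent, so treat these as estimates):

```python
# Rough token-count checks using ~1.33 tokens per English word
# (a common heuristic; real counts depend on the tokenizer).
TOKENS_PER_WORD = 1.33
WINDOW = 10_000_000

def words_to_tokens(words: int) -> int:
    return round(words * TOKENS_PER_WORD)

novel = words_to_tokens(80_000)   # average novel
print(novel, WINDOW // novel)     # -> 106400 tokens; 93 novels fit in 10M

page = words_to_tokens(375)       # ~375 words per page
print(page, WINDOW // page)       # -> 499 tokens/page; ~20,000 pages
```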
The promise is tantalizing. Dump your entire codebase, all your legal contracts, or a semester's worth of lecture transcripts into a single prompt. No chunking, no RAG pipeline, no retrieval step. Just... throw everything in.
The reality is different.
Here's the data that context window marketing doesn't want you to see.
Research consistently shows that models' effective context -- the range where they maintain reliable performance -- is roughly 60-70% of their advertised maximum. A 200K model becomes unreliable around 130K. A 1M model starts degrading around 600-700K.
For Scout's 10M, applying this rule generously gives an effective range of 6-7M for simple retrieval. For synthesis and reasoning? Much lower.
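The rule of thumb is simple enough to write down. The 60-70% band is the research finding quoted above; everything else here is just arithmetic:

```python
# The "effective context" rule of thumb: reliable performance out to
# roughly 60-70% of the advertised window (integer math avoids float drift).
def effective_range(advertised: int) -> tuple[int, int]:
    return (advertised * 60 // 100, advertised * 70 // 100)

print(effective_range(200_000))     # (120000, 140000) -- "unreliable around 130K"
print(effective_range(1_000_000))   # (600000, 700000)
print(effective_range(10_000_000))  # (6000000, 7000000) -- retrieval only, at best
```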
Chroma Research tested 18 frontier models -- including GPT-4.1, Claude Opus 4, and Gemini 2.5 -- and found that every single model exhibited continuous performance degradation as input length increased. Not a cliff. A slope. It starts early and gets worse.
Their most surprising finding: models performed worse on logically structured documents than on randomly shuffled text. Coherent structure triggers recency bias -- the model leans on the last few paragraphs rather than synthesizing the whole document. More structure, paradoxically, means worse long-context performance.
Stanford and UC Berkeley researchers demonstrated that LLMs exhibit a U-shaped performance curve: they recall information well from the beginning and end of the context, but accuracy drops 30%+ when relevant information is buried in the middle. This isn't a Scout-specific problem. It affects every model. And it gets worse as context grows, because there's more "middle" to get lost in.
Meta released almost no long-context evaluations beyond needle-in-a-haystack. Nathan Lambert (Interconnects.ai) noted that the absence of RULER benchmark results and NoLiMa evaluations was conspicuous. NIAH is the easiest possible long-context test -- find one specific sentence in a pile of text. It's necessary but not sufficient.
Independent evaluations told a grim story:
| Test | Scout Score | Competitor Score | Gap |
|---|---|---|---|
| Fiction.LiveBench (128K) | 15.6% | Gemini 2.5 Pro: 90.6% | -75 points |
| GPQA Diamond | 57.2% | Maverick: 69.8% | -12.6 points |
| LiveCodeBench | 32.8% | Maverick: 43.4% | -10.6 points |
| ARC-AGI-2 | 0.0% | — | Failed outright |
| Aider Polyglot (coding) | 16% | Qwen 2.5 Coder: higher | Poor |
At 300K tokens, one independent tester reported the model "collapsed completely" -- failing to identify hidden test sentences and instead generating responses from pre-training knowledge rather than analyzing the provided text.
Maverick (the smaller sibling with 1M context) outperforms Scout on all 11 standard benchmarks. The model with one-tenth the context is better at everything.
Let's say you believe in the 10M promise and want to use it. What does it cost?
The KV cache -- the memory structure that stores information about all previous tokens -- grows linearly with context length. At 10M tokens, one analysis calculated the KV cache alone requires approximately 32TB of memory. A single H100 has 80GB. You'd need roughly 240 H100 GPUs just for the cache.
More conservative estimates still require massive hardware:
| Context Length | KV Cache Memory | Hardware Needed | Approximate Cost |
|---|---|---|---|
| 32K tokens | ~2 GB | 1x H100 (with model) | $20K |
| 128K tokens | ~8 GB | 1x H100 | $20K |
| 1M tokens | ~64 GB | 8x H100 | $160K |
| 3.6M tokens | ~230 GB | 8x H200 | $280K |
| 10M tokens | ~410-32,000 GB | 7-240x H100 | $140K-$4.8M |
Source: vLLM Llama 4 blog, hardware analysis, APXML
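The spread in that table comes down to one formula. Here is a back-of-the-envelope KV-cache calculator; the model dimensions below are illustrative assumptions (fp16, grouped-query attention), not Scout's published config, which is why quantization and head-count choices move the answer by orders of magnitude:

```python
# KV-cache sizing with the standard formula:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes * tokens
# Dimensions below are illustrative assumptions, not Scout's real config.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 48, 8, 128, 2  # fp16, GQA

def kv_cache_gb(tokens: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
    return per_token * tokens / 1e9

for ctx in (128_000, 1_000_000, 10_000_000):
    gb = kv_cache_gb(ctx)
    h100s = gb / 80  # H100 = 80 GB, ignoring weights and activations
    print(f"{ctx:>10,} tokens: {gb:8.0f} GB (~{h100s:.1f} H100s for cache alone)")
```

Swap in fp8 or int4 and wider or narrower KV heads and you reproduce both ends of the 410GB-32TB range.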
The range depends on quantization and optimization. But even the optimistic end -- 7 H100s at ~$90/hour in the cloud -- means a single 10M-token query costs real money. Contrast that with RAG, where retrieving 5-10K relevant tokens costs fractions of a penny.
In practice, most API providers cap Scout at 128K to 1M tokens for consistent performance and economic viability.
Enough doom. Here are the use cases where long context genuinely works -- and where it doesn't.
Entire-codebase analysis. Loading 40-50K lines of code into a 1M context window lets you ask "if I change this function, what breaks?" without building a retrieval pipeline. Sourcegraph testing showed improvements in recall and helpfulness when full codebases fit in context. This is the killer app for long context in 2026.
Bounded document analysis. A set of 50-100 legal contracts. A full quarterly financial report. A patient's complete medical history. When the corpus is fixed, bounded, and fits in context, dumping it all in works. No chunking artifacts. No retrieval misses.
Long-running agent sessions. AI agents that make 20-30+ tool calls accumulate significant context. A 1M window means the agent remembers everything it's done in a session without summarization tricks that lose detail.
Cross-document reasoning. Finding contradictions across five research papers. Comparing clauses across a dozen contracts. Tasks where the answer depends on information scattered across multiple documents and you need the model to see all of them simultaneously.
Synthesis over massive corpora. Asking "summarize the themes across these 500 documents" sounds like a context window problem. It's not. Beyond ~1-2M tokens, synthesis quality degrades severely. The model can retrieve from 10M tokens (find a specific fact), but it can't reason across 10M tokens (connect ideas from beginning, middle, and end).
Dynamic or frequently updated data. If your knowledge base changes weekly, stuffing it into context every query is expensive and wasteful. RAG with an updated index is orders of magnitude more efficient.
Permission-sensitive data. Long context has no concept of access control. If User A shouldn't see Document B, you can't dump everything into one context. RAG systems can filter by permissions.
Cost-sensitive applications. Processing 10M tokens costs $1.10-$120 per query depending on model and provider. A well-optimized RAG system retrieving the relevant 5-10K tokens costs ~$0.00008. That's a cost difference of four to six orders of magnitude.
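Running the arithmetic on those per-query figures makes the gap concrete:

```python
# Per-query cost ratio, using the figures quoted above.
long_ctx_low, long_ctx_high = 1.10, 120.0  # $ per 10M-token query
rag_per_query = 0.00008                    # $ per retrieval of 5-10K tokens

lo = long_ctx_low / rag_per_query
hi = long_ctx_high / rag_per_query
print(f"{lo:,.0f}x to {hi:,.0f}x")  # -> 13,750x to 1,500,000x
```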
Every time a larger context window ships, someone writes "RAG is dead." It's been declared dead after Gemini 1.5's 1M, after Llama 4's 10M, and probably will be after Magic.dev's 100M.
RAG is not dead. The math doesn't support it.
| Dimension | Long Context (10M) | RAG |
|---|---|---|
| Cost per query | $1.10-$120 | ~$0.00008 |
| Latency | 30-60 seconds at high token counts | ~1 second |
| Compute overhead | 260% overhead vs 2K context at 128K | Minimal per-query |
| Access control | None | Permission-aware filtering |
| Citations | Weak unless post-processed | Built into pipeline |
| Corpus size limit | 10M tokens (~20K pages) | Unlimited |
| Data freshness | Stale after context creation | Updated with index |
The 2026 consensus, and I agree with it: use both. RAG retrieves the most relevant documents from your entire knowledge base. Long context reasons over those retrieved documents. RAG handles scale (millions of documents). Long context handles depth (hundreds of pages). Andrej Karpathy calls this "context engineering" -- the art and science of filling the context window with just the right information for the next step.
The hybrid approach outperforms either method alone. Use RAG to narrow 10 million documents to 100 relevant ones. Load those 100 into a 1M context window. Let the model reason across them. That's the architecture that works.
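That architecture fits in a page of code. This is a deliberately minimal sketch: keyword-overlap scoring stands in for a real embedding retriever, and the token estimate is the usual ~1.33 tokens/word heuristic -- both are assumptions for brevity, not a production design:

```python
# Hybrid pattern sketch: RAG narrows the corpus, long context reasons
# over the survivors. Keyword overlap stands in for a real retriever.
def score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve_then_stuff(query: str, corpus: list[str],
                        top_k: int = 100, budget_tokens: int = 1_000_000) -> str:
    """Stage 1: rank the whole corpus. Stage 2: pack the best docs into
    one long-context prompt until the token budget runs out."""
    ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
    context, used = [], 0
    for doc in ranked[:top_k]:
        doc_tokens = len(doc.split()) * 4 // 3  # ~1.33 tokens/word estimate
        if used + doc_tokens > budget_tokens:
            break
        context.append(doc)
        used += doc_tokens
    return "\n---\n".join(context) + f"\n\nQuestion: {query}"

corpus = ["retry logic lives in net/client.py",
          "the cafeteria menu changes on Fridays",
          "timeouts and retry policy for the client"]
print(retrieve_then_stuff("how does client retry logic work", corpus, top_k=2))
```

The interesting parameter is `budget_tokens`: it is where the RAG stage hands off to the long-context stage, and tuning it is context engineering in miniature.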
Meta's Llama 4 launch was marred by a credibility crisis that colors everything about the 10M context claim.
Meta submitted a specially crafted, non-public version of Llama 4 Maverick to LM Arena -- one "optimized for conversationality" with verbose, emoji-filled outputs. The public release version was nothing like it. When LM Arena tested the actual release, Maverick dropped from #2 to #32. Scout fell out of the top 100.
LM Arena stated that Meta's interpretation of their policy "did not match what we expect from model providers." A departing Meta AI researcher reportedly alleged that "results were fudged."
AI commentator Zvi Mowshowitz called it "by far the most negative reaction I have seen to a model release." The LocalLlama subreddit described Scout as "severely underwhelming on all fronts."
When a company manipulates benchmarks on one dimension, it's reasonable to scrutinize their other claims. Including the 10M number.
Here's what I'd actually recommend based on the data:
Use a 1M context model (Maverick, Claude Opus 4.6, Gemini 3.1 Pro) with your full codebase loaded. Don't bother with Scout's 10M -- Maverick performs better on every coding benchmark at 1M context. If your codebase exceeds 1M tokens, use RAG to select the relevant modules first.
Under 100 documents: Load them all into 1M context. Simple and effective.
100-1,000 documents: Hybrid approach. RAG retrieves the top 20-50 most relevant. Load those into context for cross-document reasoning.
Over 1,000 documents: Pure RAG. No context window is large enough to hold thousands of documents effectively, and even if it were, the lost-in-the-middle problem would destroy accuracy.
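The three rules above reduce to a routing function. The thresholds and strategy names are this article's, not any library's API:

```python
# The document-count decision rule as a routing function.
# Thresholds are the article's heuristics, not a library API.
def choose_strategy(num_docs: int) -> str:
    if num_docs < 100:
        return "long_context"  # load everything into a 1M window
    if num_docs <= 1000:
        return "hybrid"        # RAG narrows to top 20-50, then stuff context
    return "rag"               # pure retrieval; no window holds this

print(choose_strategy(50), choose_strategy(500), choose_strategy(5000))
# -> long_context hybrid rag
```

In practice you'd route on total token count rather than document count, since document sizes vary wildly, but the shape of the decision is the same.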
| Priority | Best Choice | Why |
|---|---|---|
| Maximum recall quality | Claude Opus 4.6 | 76% on MRCR 8-needle at 1M -- best in class |
| Cost-efficient bulk processing | Gemini 3.1 Pro | $2/MTok input, strong 1M performance |
| Self-hosted / private | Llama 4 Maverick | Apache-like license, 1M context, runs on 8x H100 |
| Budget self-hosted | Qwen 3.5-122B | MoE, fits on 1x H100 quantized, 262K context |
| Pure retrieval at scale | Llama 4 Scout | If you literally only need "find X in 10M tokens" |
The 10M context window is a marketing number. Not a lie -- the architecture supports it -- but a number designed to generate headlines rather than solve problems.
Here's my evidence. The model was trained at 256K context and extrapolates to 10M through length generalization. The KV cache at 10M tokens requires hardware that most organizations don't have. Independent evaluations show the model collapsing at 300K tokens for synthesis tasks. Meta released no serious long-context benchmarks beyond needle-in-a-haystack. And the company was caught manipulating benchmarks on the same model release.
The useful Scout is a 256K-1M retrieval model. In that range, it's a solid open-weight option for finding specific information in large document sets. It's surprisingly good at low-resource language translation -- Swahili, Georgian, languages that other models struggle with. That's a real strength that Meta barely marketed.
But 10M tokens of genuine comprehension? No. Not with current architecture. Not with length generalization from 256K training. The gap between "can technically accept 10M tokens as input" and "can reason meaningfully over 10M tokens" is enormous.
More broadly, I think the context window arms race has reached diminishing returns. We went from 4K (GPT-3.5) to 128K (GPT-4) to 1M (five models in 2026) in three years. Each jump mattered less than the one before. The 4K-to-128K jump was transformative -- suddenly you could fit entire documents instead of paragraphs. The 128K-to-1M jump was useful for codebases and large document sets. The 1M-to-10M jump is... theoretical. I can't name a production use case where 10M tokens of context is the right solution and RAG + 1M isn't.
The future isn't bigger context windows. It's smarter context engineering. Filling the window with the right 100K tokens is worth more than filling it with 10M irrelevant ones.
That's the skill that matters. Not how big your window is. How well you use it.