DeepSeek V4

How hybrid attention, million-token context, and agent post-training push an open model close to the frontier.

This article is based on the official DeepSeek-V4 technical report and DeepSeek's preview release. The paper figures shown below were extracted from the official PDF and are stored locally on this site.

Why V4 matters

The headline is easy to miss.

Yes, DeepSeek-V4-Pro is large at 1.6T total parameters with 49B active, and DeepSeek-V4-Flash is still huge at 284B total with 13B active. But the more important point is that both models support a 1M-token context window, and the paper is explicit about why that matters: DeepSeek sees test-time scaling and long-horizon agentic work as the next frontier.
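
To put the parameter counts above in proportion, here is a small arithmetic sketch. It uses only the totals quoted in this article and assumes "active" means parameters used per token, as is usual for sparse MoE models; the helper name `active_fraction` is just for illustration.

```python
# Parameter counts quoted in the article, in billions (1.6T = 1600B).
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that are active for one token."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
# DeepSeek-V4-Pro: 3.1% of parameters active per token
# DeepSeek-V4-Flash: 4.6% of parameters active per token
```

Both models touch only a few percent of their weights per token, which is part of why the cost of carrying context, rather than raw parameter count, becomes the interesting bottleneck.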

That changes the way we should read the release. V4 is not just a bigger open model. It is a model trying to make very long context economically usable by default.

Overview figure from the DeepSeek V4 paper showing benchmark performance on the left and inference FLOPs plus KV cache savings on the right
Figure from the DeepSeek-V4 paper. Left: V4-Pro-Max against recent frontier models on knowledge, reasoning, and agentic benchmarks. Right: the reason the paper is strategically interesting, namely much cheaper long-context inference and lower KV-cache usage than V3.2.

The right half of that figure is the real story. DeepSeek reports that at 1M context, V4 needs far fewer inference FLOPs and a much smaller KV cache than V3.2.

If those gains hold in practice, the model can spend more of its budget on actual reasoning instead of just hauling around context.

The architecture at a glance

Before we zoom into the long-context mechanism, it helps to see the whole block once.

Overall DeepSeek V4 architecture with MTP modules, prediction head, residual mixing, pre-block and post-block mixing, CSA or HCA attention, and DeepSeekMoE
Overall architecture from the paper. The block has three ideas that matter for the rest of this article: hybrid attention (`CSA / HCA`), stronger residual routing (`mHC` via the residual mixing path), and a large MoE feed-forward stack (`DeepSeekMoE`).

There are three moving parts worth tracking:

  1. CSA / HCA: the long-context attention mechanism
  2. Residual mixing: where mHC strengthens the residual path
  3. DeepSeekMoE: the expert-based feed-forward stack

The easiest mistake here is to focus only on the MoE size. The paper’s real novelty is the attention stack.

What problem V4 is actually solving

DeepSeek V3.2 had already pushed agent training and reasoning much further than the older V3 line. It introduced “thinking in tool-use” and a large synthetic agent-training corpus spanning 1,800+ environments and 85k+ complex instructions. But V3.2 still lived with the old long-context cost curve.

That matters because many tasks that feel “frontier” are really context problems.

The model is not only limited by its weights. It is limited by how much context and scratch space it can afford to use.

The central V4 bet: if you make million-token context cheap enough, you unlock better test-time scaling, more reliable agents, and a wider range of long-horizon tasks without needing a completely different base architecture.

Build the attention mechanism together

The paper introduces two complementary attention modules, CSA and HCA.

They are interleaved across layers. One preserves more detail and retrieves selectively. The other is cheaper and coarser.

CSA: keep detail where it matters

CSA compresses the KV cache first, then uses a sparse selection step to choose which compressed regions deserve attention. It also keeps a local sliding window so nearby detail is never lost.

CSA diagram from the DeepSeek V4 paper showing token-level compression, Lightning Indexer, top-k selector, and sliding window KV entries
CSA from the paper. The key pieces are the token-level compressor, the Lightning Indexer, the top-k selector, and the local sliding window. V4 does not pay full attention cost everywhere. It compresses first, then spends detail selectively.

The flow is:

  1. compress many KV tokens into fewer entries
  2. score which compressed regions matter
  3. retrieve only the top relevant compressed entries
  4. combine that with a local window of raw nearby tokens
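
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not the paper's implementation: mean pooling stands in for the learned token-level compressor, a dot product stands in for the Lightning Indexer, the query is assumed to sit at the end of the sequence, and `csa_select`, `block_size`, `top_k`, and `window` are hypothetical names chosen for this example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(block: Sequence[Vector]) -> Vector:
    """Compress a block of vectors into one entry (assumed compressor: mean)."""
    dim = len(block[0])
    return [sum(v[i] for v in block) / len(block) for i in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def csa_select(
    query: Vector,
    keys: Sequence[Vector],
    block_size: int,
    top_k: int,
    window: int,
) -> Tuple[List[int], List[int]]:
    """Return (ids of selected compressed blocks, ids of raw local tokens)."""
    # 1. Compress many KV tokens into fewer entries.
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    compressed = [mean_pool(b) for b in blocks]
    # 2. Score which compressed regions matter (assumed scorer: dot product).
    scores = [dot(query, c) for c in compressed]
    # 3. Retrieve only the top-k compressed entries.
    ranked = sorted(range(len(compressed)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:top_k])
    # 4. Add a local window of raw tokens right before the query.
    local = list(range(max(0, len(keys) - window), len(keys)))
    return selected, local


keys = [
    [0.0, 1.0], [0.0, 1.0],   # block 0
    [2.0, 0.0], [4.0, 0.0],   # block 1
    [1.0, 1.0], [1.0, 1.0],   # block 2
    [0.0, 0.0], [-1.0, 0.0],  # block 3
]
print(csa_select([1.0, 0.0], keys, block_size=2, top_k=2, window=3))
# → ([1, 2], [5, 6, 7])
```

The query ends up reading two compressed entries plus three raw tokens instead of all eight keys, and that is the whole point of the design: the number of entries read per query is set by `top_k` and `window`, not by sequence length.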

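This is a nice design because it preserves two kinds of structure at once: relevant long-range content, through the selectively retrieved compressed entries, and fine-grained nearby detail, through the local window.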

HCA: go much cheaper on long-range memory

HCA makes a different trade. It compresses the KV cache much more aggressively and uses that compressed memory as a cheap global scaffold.

HCA diagram from the DeepSeek V4 paper showing heavier KV compression plus a local sliding window
HCA from the paper. Same high-level idea as CSA, but the compression is much stronger and the global memory is much coarser. This is where V4 gets a large part of its million-token efficiency.

Where CSA says “compress, then retrieve precisely,” HCA says “compress harder, then use the compressed stream itself as the global memory.”

That is why the two belong together. CSA is the higher-detail mode. HCA is the cheaper long-range mode.
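
A toy sketch of the HCA side of that contrast, again under stated assumptions: mean pooling stands in for the learned compressor, the attention is plain unscaled softmax attention, the whole sequence is compressed (including the part also covered by the local window), and `hca_attend`, `block_size`, and `window` are hypothetical names. Unlike the CSA flow, there is no selection step: the query attends over every compressed entry.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(block: Sequence[Vector]) -> Vector:
    """Compress a block of vectors into one entry (assumed compressor: mean)."""
    dim = len(block[0])
    return [sum(v[i] for v in block) / len(block) for i in range(dim)]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hca_attend(
    query: Vector,
    keys: Sequence[Vector],
    values: Sequence[Vector],
    block_size: int,
    window: int,
) -> Tuple[Vector, int]:
    """Attend over heavily compressed memory plus a raw local window.

    Returns the attention output and the number of entries attended to.
    """
    n = len(keys)
    starts = range(0, n, block_size)
    # Heavy compression: one entry per large block of tokens.
    mem_k = [mean_pool(keys[i:i + block_size]) for i in starts]
    mem_v = [mean_pool(values[i:i + block_size]) for i in starts]
    # Raw nearby tokens keep fine-grained local detail.
    local_start = max(0, n - window)
    all_k = mem_k + list(keys[local_start:])
    all_v = mem_v + list(values[local_start:])
    # No top-k step: the compressed stream itself is the global memory.
    scores = [sum(q * k for q, k in zip(query, key)) for key in all_k]
    weights = softmax(scores)
    dim = len(all_v[0])
    out = [sum(w * v[i] for w, v in zip(weights, all_v)) for i in range(dim)]
    return out, len(all_k)


keys = [[float(i % 3), 1.0] for i in range(16)]
values = [[1.0] for _ in range(16)]
out, attended = hca_attend([1.0, 0.0], keys, values, block_size=8, window=4)
print(attended)  # 2 compressed entries + 4 local tokens
```

With a large `block_size`, the global memory stays tiny even for very long sequences, at the price of much coarser long-range detail than the CSA path.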

An intuition builder

The paper’s actual FLOP curves depend on more than one knob. But we can still build the right intuition: compression lowers storage cost, sparse or compressed retrieval lowers per-query fan-out, and a local window preserves nearby detail.

This intuition is an approximation for understanding the mechanism. DeepSeek's exact implementation also depends on grouped projections, indexer paths, MoE kernels, and runtime optimizations described elsewhere in the paper.
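
The three effects named in this section can be turned into a toy cost model that counts KV entries only. Everything numeric here is a hypothetical setting picked for illustration (compression ratios of 4 and 128, `top_k` of 2048, a window of 128); none of it is taken from the paper, and the model ignores indexer cost, projections, and kernels entirely.

```python
import math
from dataclasses import dataclass


@dataclass
class Cost:
    stored: int    # KV entries kept in the cache
    attended: int  # KV entries each query actually reads


def dense(n: int) -> Cost:
    """Full attention: store and read every token."""
    return Cost(stored=n, attended=n)


def compressed_sparse(n: int, ratio: int, top_k: int, window: int) -> Cost:
    """CSA-like: compress, read only top-k compressed entries plus a window."""
    compressed = math.ceil(n / ratio)
    w = min(window, n)
    return Cost(stored=compressed + w, attended=min(top_k, compressed) + w)


def compressed_dense(n: int, ratio: int, window: int) -> Cost:
    """HCA-like: compress harder, read all compressed entries plus a window."""
    compressed = math.ceil(n / ratio)
    w = min(window, n)
    return Cost(stored=compressed + w, attended=compressed + w)


n = 1_000_000
print("dense     ", dense(n))
print("CSA-like  ", compressed_sparse(n, ratio=4, top_k=2048, window=128))
print("HCA-like  ", compressed_dense(n, ratio=128, window=128))
```

Varying `ratio`, `top_k`, and `window` shows the trade directly: compression shrinks what is stored, selection caps what each query reads, and the window adds a small fixed cost for local detail.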

Why this is better than V3.2

Now the upgrade story becomes clear.

DeepSeek-V3.2-Exp introduced DSA, which already made long-context attention more efficient than dense attention. But V4 pushes the idea further by turning long-context handling into a hybrid system:

The V4 Trade
$$\text{dense attention everywhere} \;\longrightarrow\; \text{compressed memory + selective retrieval + local detail}$$
Compressed memory: shrink the KV footprint before retrieval
Selective retrieval: spend detailed attention only where relevance is high
Local detail: preserve fine-grained nearby token interactions

That is the architectural side. On the training side, V4 also switches to a more serious specialist-then-distill pipeline.

This is just as revealing as the attention change. The capability gap is no longer explained by pretraining alone.

Two other upgrades that matter

mHC: stronger residual routing

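V4 also uses Manifold-Constrained Hyper-Connections (mHC). As described in the separate mHC paper, the idea is to make richer residual routing trainable by constraining the routing matrix to the Birkhoff polytope, the set of doubly stochastic matrices (non-negative entries, with every row and every column summing to 1).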

The short version is:

This is not the headline of V4, but it matters. It is another example of a frontier pattern: add expressivity, then add the mathematical constraint that keeps it trainable.
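
To make the Birkhoff-polytope constraint less abstract, here is a minimal sketch of one standard way to push a positive matrix toward it: Sinkhorn-style alternating row and column normalization. This illustrates the constraint only. It assumes the routing matrix is a small square matrix that mixes parallel residual streams, and it is not a claim about how V4 implements mHC; the function name `sinkhorn` and the example matrix are made up for this example.

```python
from typing import List

Matrix = List[List[float]]


def sinkhorn(matrix: Matrix, iters: int = 50) -> Matrix:
    """Alternately normalize rows and columns of a strictly positive
    square matrix so it approaches a doubly stochastic matrix."""
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iters):
        # Make every row sum to 1.
        for r in range(n):
            row_sum = sum(m[r])
            m[r] = [x / row_sum for x in m[r]]
        # Make every column sum to 1.
        for c in range(n):
            col_sum = sum(m[i][c] for i in range(n))
            for i in range(n):
                m[i][c] /= col_sum
    return m


raw = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
routing = sinkhorn(raw)
print([round(sum(row), 6) for row in routing])                              # row sums
print([round(sum(routing[i][c] for i in range(3)), 6) for c in range(3)])   # column sums
```

One reason a constraint like this helps: when every column sums to 1, mixing the streams with the matrix leaves their total unchanged, so the routing can redistribute signal across streams without amplifying or shrinking it.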

Muon: optimization is part of the capability stack

The paper also highlights the Muon optimizer. That may sound like a footnote, but it should not be read as one.

At this scale, optimizer choice changes:

In frontier systems, optimization and kernels are no longer “implementation details.” They are part of the model story.

Reasoning effort is now a first-class knob

One more figure from the paper is useful because it shows the new shape of model quality.

Figure comparing HLE and TerminalBench 2.0 performance versus token usage across different reasoning effort settings
Reasoning effort from the paper. V4 is not a single point on a benchmark chart. The model is explicitly designed to trade more tokens for better outcomes. That is exactly the test-time scaling story the paper frames as the next frontier.

This figure is a quiet but important clue.

The frontier is shifting from “who has the best static base model?” to “who can turn extra context and extra reasoning budget into useful work at acceptable cost?”

V4 is engineered for that regime.

What V4 suggests about closed frontier models

This is why the paper matters beyond DeepSeek itself.

If an open model can get this close, what is the remaining moat for the best closed models?

V4 suggests a fairly concrete answer:

  1. Long-context efficiency is part of intelligence. A model that can cheaply search and retain huge context can look smarter in practice even if the base weights are only somewhat better.
  2. Post-training matters enormously. Specialist training, tool-use trajectories, RL infrastructure, and distillation appear to explain a large fraction of the real-world gap.
  3. Inference systems are part of the product. KV-cache design, sparse retrieval, batching, and context management now shape user-visible capability.
  4. The secret sauce is probably a stack, not a trick. Better data, better evals, better optimizers, better tooling, better test-time scaling, and better agents combine into the frontier.

That is why V4 is such a useful paper to read. It makes the closed-source recipe feel less mystical.

Final thoughts

DeepSeek V4 is important because it changes the center of gravity.

The old scaling story was mostly about more parameters and more data. The V4 story is about making very long context cheap enough to use routinely, then layering on better post-training for specialists, tools, and agents.

That does not mean V4 has fully matched the strongest closed models. The paper itself is more careful than that. But it does mean the path to getting close is increasingly visible.

And right now, that path looks like this:

That is what makes V4 more than a leaderboard event. It is a map of where the frontier is actually being built.

Sources