This article is based on the official DeepSeek-V4 technical report and DeepSeek's preview release. The paper figures shown below were extracted from the official PDF and are stored locally on this site.
Why V4 matters
The headline is easy to miss.
Yes, DeepSeek-V4-Pro is large at 1.6T total parameters with 49B active, and DeepSeek-V4-Flash is still huge at 284B total with 13B active. But the more important point is that both models support a 1M-token context window, and the paper is explicit about why that matters: DeepSeek sees test-time scaling and long-horizon agentic work as the next frontier.
That changes the way we should read the release. V4 is not just a bigger open model. It is a model trying to make very long context economically usable by default.
The right half of that figure is the real story. DeepSeek reports that at 1M context:
- V4-Pro needs only 27% of the single-token inference FLOPs of DeepSeek-V3.2
- V4-Pro needs only 10% of the KV cache
- V4-Flash goes even lower on both curves
If those gains hold in practice, the model can spend more of its budget on actual reasoning instead of just hauling around context.
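To make those percentages concrete, here is a small back-of-the-envelope calculation. It only inverts the two ratios quoted above; the helper name `reduction_factor` is ours, not the paper's, and the memory-budget reading is an illustration rather than a measured result.

```python
# Ratios quoted in the article for V4-Pro vs. DeepSeek-V3.2 at 1M context.
FLOPS_RATIO = 0.27      # single-token inference FLOPs, V4-Pro / V3.2
KV_CACHE_RATIO = 0.10   # KV cache size, V4-Pro / V3.2


def reduction_factor(ratio: float) -> float:
    """Turn 'needs only X of the baseline' into an 'N times less' factor."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    return 1.0 / ratio


flops_factor = reduction_factor(FLOPS_RATIO)
kv_factor = reduction_factor(KV_CACHE_RATIO)

print(f"FLOPs per token: about {flops_factor:.1f}x lower")
print(f"KV cache: about {kv_factor:.1f}x smaller")
```

Read the second number as a budget statement: if KV cache is the binding constraint, the same memory could hold roughly ten times as much context, or ten times as many concurrent long sessions, assuming nothing else changes.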
The architecture at a glance
Before we zoom into the long-context mechanism, it helps to see the whole block once.
There are three moving parts worth tracking:
- CSA / HCA: the long-context attention mechanism
- Residual mixing: where mHC strengthens the residual path
- DeepSeekMoE: the expert-based feed-forward stack
The easiest mistake here is to focus only on the MoE size. The paper’s real novelty is the attention stack.
What problem V4 is actually solving
DeepSeek V3.2 had already pushed agent training and reasoning much further than the older V3 line. It introduced “thinking in tool-use” and a large synthetic agent-training corpus spanning 1,800+ environments and 85k+ complex instructions. But V3.2 still lived with the old long-context cost curve.
That matters because many tasks that feel “frontier” are really context problems:
- searching across many documents
- carrying forward long tool traces
- browsing large codebases
- keeping intermediate reasoning alive instead of compressing it away
The model is not only limited by its weights. It is limited by how much context and scratch space it can afford to use.
Build the attention mechanism together
The paper introduces two complementary attention modules:
- CSA: Compressed Sparse Attention
- HCA: Heavily Compressed Attention
They are interleaved across layers. One preserves more detail and retrieves selectively. The other is cheaper and coarser.
CSA: keep detail where it matters
CSA compresses the KV cache first, then uses a sparse selection step to choose which compressed regions deserve attention. It also keeps a local sliding window so nearby detail is never lost.
The flow is:
- compress many KV tokens into fewer entries
- score which compressed regions matter
- retrieve only the top relevant compressed entries
- combine that with a local window of raw nearby tokens
This is a nice design because it preserves two kinds of structure at once:
- local precision for nearby dependencies
- cheap global reach for faraway dependencies
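The four-step flow above can be sketched for a single query in plain Python. This is a toy under loud assumptions, not the paper's kernel: compression is plain mean-pooling, the selector is a dot product against the pooled keys, and the names `csa_like_attention`, `block`, `top_k`, and `window` are invented for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def csa_like_attention(query, keys, values, block=4, top_k=2, window=3):
    """Toy single-query attention: top-k compressed blocks plus a raw local window."""
    n = len(keys)
    local_start = max(0, n - window)

    # 1) Compress everything before the local window, one pooled entry per block
    #    (the last block may be shorter). Mean-pooling is an assumption here.
    comp_keys, comp_values = [], []
    for start in range(0, local_start, block):
        end = min(start + block, local_start)
        comp_keys.append(mean_pool(keys[start:end]))
        comp_values.append(mean_pool(values[start:end]))

    # 2) Score each compressed block against the query, 3) keep only the top_k.
    ranked = sorted(range(len(comp_keys)),
                    key=lambda i: dot(query, comp_keys[i]), reverse=True)
    chosen = sorted(ranked[:top_k])

    # 4) Attend over the chosen compressed entries plus the raw nearby tokens.
    att_keys = [comp_keys[i] for i in chosen] + keys[local_start:]
    att_values = [comp_values[i] for i in chosen] + values[local_start:]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in att_keys])
    dim = len(att_values[0])
    out = [sum(w * v[d] for w, v in zip(weights, att_values)) for d in range(dim)]
    return out, chosen, len(att_keys)


# Toy data: 19 tokens of dimension 2; tokens 8..11 point the same way as the query.
keys = [[1.0, 0.0] if 8 <= i < 12 else [0.0, 1.0] for i in range(19)]
values = [[float(i)] for i in range(19)]
query = [1.0, 0.0]

out, chosen, fan_out = csa_like_attention(query, keys, values)
print("selected blocks:", chosen, "entries attended:", fan_out)
```

The point to notice is the fan-out: the query touches `top_k + window` entries no matter how long the sequence gets, while the block that actually matches the query still makes it into the selected set.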
HCA: go much cheaper on long-range memory
HCA makes a different trade. It compresses the KV cache much more aggressively and uses that compressed memory as a cheap global scaffold.
Where CSA says “compress, then retrieve precisely,” HCA says “compress harder, then use the compressed stream itself as the global memory.”
That is why the two belong together. CSA is the higher-detail mode. HCA is the cheaper long-range mode.
An intuition builder
The paper’s actual FLOP curves depend on more than one knob. But we can still build the right intuition: compression lowers storage cost, sparse or compressed retrieval lowers per-query fan-out, and a local window preserves nearby detail.
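Here is one way to turn that intuition into numbers. It is a counting model, not a reproduction of the paper's FLOP curves, and every setting in it (block sizes, `top_k`, `window`) is a hypothetical value picked for illustration. It tracks two quantities per layer: how many KV entries are stored, and how many entries a single query attends to.

```python
import math
from dataclasses import dataclass


@dataclass
class Cost:
    stored_entries: int   # KV entries kept in memory for one layer
    fan_out: int          # entries a single query attends to


def dense(n: int) -> Cost:
    """Plain attention: store everything, attend to everything."""
    return Cost(n, n)


def csa_like(n: int, block: int, top_k: int, window: int) -> Cost:
    """Compress by `block`, retrieve `top_k` compressed entries, keep a raw window."""
    compressed = math.ceil(n / block)
    local = min(window, n)
    return Cost(compressed + local, min(top_k, compressed) + local)


def hca_like(n: int, block: int) -> Cost:
    """Compress much harder and attend to the whole compressed stream."""
    compressed = math.ceil(n / block)
    return Cost(compressed, compressed)


n = 1_000_000  # context length from the article; all other numbers are made up
for name, cost in [
    ("dense", dense(n)),
    ("csa-like", csa_like(n, block=4, top_k=1024, window=128)),
    ("hca-like", hca_like(n, block=128)),
]:
    print(f"{name:9s} stored={cost.stored_entries:>9,} fan_out={cost.fan_out:>9,}")
```

With these invented settings the two modes trade off in opposite directions: the CSA-like layer stores more but touches very little per query, while the HCA-like layer stores far less and simply reads all of it. That difference is the reason to interleave them rather than pick one.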
Why this is better than V3.2
Now the upgrade story becomes clear.
DeepSeek-V3.2-Exp introduced DSA, which already made long-context attention more efficient than dense attention. But V4 pushes the idea further by turning long-context handling into a hybrid system:
- CSA for selective, more detailed long-range retrieval
- HCA for even cheaper coarse global memory
- layer interleaving so the model does not pay one uniform attention cost everywhere
That is the architectural side. On the training side, V4 also switches to a more serious specialist-then-distill pipeline:
- specialist models are trained for math, code, agent work, and instruction following
- those specialists are merged back into a general model with On-Policy Distillation
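The distillation step is easier to picture with a toy. A common formulation of on-policy distillation has the student generate its own trajectory while a teacher scores every step, with the loss being the reverse KL divergence between the two next-token distributions. The article does not spell out the paper's exact objective, so treat the sketch below as that generic formulation; `toy_student` and `toy_teacher` are stand-ins for real models.

```python
import math
import random


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution.

    Assumes the teacher assigns non-zero probability wherever the student does.
    """
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0.0)


def on_policy_distill_loss(student, teacher, prompt, steps, rng):
    """Average reverse KL along a trajectory sampled from the student itself."""
    context = list(prompt)
    total = 0.0
    for _ in range(steps):
        s_probs = student(context)
        t_probs = teacher(context)      # the teacher grades the student's own state
        total += reverse_kl(s_probs, t_probs)
        token = rng.choices(range(len(s_probs)), weights=s_probs)[0]
        context.append(token)           # on-policy: continue from the student's sample
    return total / steps


def toy_student(context):
    return [0.5, 0.3, 0.2]              # fixed 3-token vocabulary, ignores context


def toy_teacher(context):
    return [0.2, 0.3, 0.5]              # a "specialist" with different preferences


rng = random.Random(0)
loss = on_policy_distill_loss(toy_student, toy_teacher, prompt=[0], steps=5, rng=rng)
print(f"on-policy distillation loss: {loss:.3f}")
```

In a multi-specialist setup one would pick the teacher by domain (math, code, agent work, instruction following) for each prompt. The key property is that the student is corrected on states it actually visits, not only on states the teacher would have reached.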
This is just as revealing as the attention change. The capability gap is no longer explained by pretraining alone.
Two other upgrades that matter
mHC: stronger residual routing
V4 also uses Manifold-Constrained Hyper-Connections (mHC). If you read the separate mHC paper, the idea is to make richer residual routing trainable by constraining the routing matrix to the Birkhoff polytope.
The short version is:
- richer routing than a plain residual stream
- better stability than unconstrained hyper-connections
- a cleaner path to deeper, more expressive information flow
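The Birkhoff polytope is the set of doubly stochastic matrices: non-negative entries, every row and every column summing to 1. One standard way to push a positive matrix toward that set is Sinkhorn-Knopp normalization, sketched below. Whether V4 uses exactly this projection is not something the article states, so read it as an illustration of the constraint, with `sinkhorn` and the example logits being our own.

```python
import math


def sinkhorn(matrix, iterations=200):
    """Alternately normalize rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        for row in m:                       # make every row sum to 1
            s = sum(row)
            for j in range(n):
                row[j] /= s
        for j in range(n):                  # make every column sum to 1
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m


# Unconstrained routing logits between 3 residual streams (made-up values).
logits = [[0.5, -1.0, 2.0],
          [1.0, 0.0, -0.5],
          [-0.3, 1.5, 0.2]]
raw = [[math.exp(v) for v in row] for row in logits]   # exp keeps entries positive

routing = sinkhorn(raw)
for row in routing:
    print([round(v, 3) for v in row], "row sum:", round(sum(row), 6))
```

A doubly stochastic mixing matrix is a weighted average of permutations, and its spectral norm is exactly 1. Mixing residual streams with it therefore cannot blow up the signal from layer to layer, which is the stability benefit in the list above.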
This is not the headline of V4, but it matters. It is another example of a frontier pattern: add expressivity, then add the mathematical constraint that keeps it trainable.
Muon: optimization is part of the capability stack
The paper also highlights the Muon optimizer. That may sound like a footnote, but it is not one.
At this scale, optimizer choice changes:
- how quickly capabilities appear during training
- how stable a large MoE run remains
- which architectural ideas are practical instead of merely elegant
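For readers who have not met Muon: its core idea is to replace a weight matrix's momentum update with an approximately orthogonalized version of it, computed by a Newton-Schulz iteration. The sketch below uses the textbook cubic iteration on a tiny matrix in pure Python. Real Muon implementations use a tuned higher-order polynomial, fewer steps, and extra scaling, and nothing here reflects V4's actual settings; `muon_like_step`, `lr`, and `beta` are illustrative names and values.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal factor of a non-zero, full-rank matrix g.

    Uses the cubic iteration X <- 1.5 X - 0.5 X X^T X, which drives every
    singular value toward 1 once they start in (0, sqrt(3)).
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]      # singular values now <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def muon_like_step(weight, grad, momentum, lr=0.1, beta=0.9):
    """One simplified update: accumulate momentum, orthogonalize it, then step."""
    momentum = [[beta * m + g for m, g in zip(mr, gr)] for mr, gr in zip(momentum, grad)]
    update = newton_schulz_orthogonalize(momentum)
    weight = [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(weight, update)]
    return weight, momentum


grad = [[3.0, 1.0], [0.0, 2.0]]
q = newton_schulz_orthogonalize(grad)
gram = matmul(q, transpose(q))                      # should be close to the identity
print([[round(v, 6) for v in row] for row in gram])

weight, momentum = muon_like_step([[1.0, 0.0], [0.0, 1.0]], grad, [[0.0, 0.0], [0.0, 0.0]])
```

The design choice worth noticing is that the update keeps the gradient's directions but equalizes their magnitudes, so no single direction dominates the step. That is one plausible reason an optimizer like this matters for the stability points listed above.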
In frontier systems, optimization and kernels are no longer “implementation details.” They are part of the model story.
Reasoning effort is now a first-class knob
One more figure from the paper is useful because it shows the new shape of model quality.
This figure is a quiet but important clue.
The frontier is shifting from “who has the best static base model?” to “who can turn extra context and extra reasoning budget into useful work at acceptable cost?”
V4 is engineered for that regime.
What V4 suggests about closed frontier models
This is why the paper matters beyond DeepSeek itself.
If an open model can get this close, what is the remaining moat for the best closed models?
V4 suggests a fairly concrete answer:
- Long-context efficiency is part of intelligence. A model that can cheaply search and retain huge context can look smarter in practice even if the base weights are only somewhat better.
- Post-training matters enormously. Specialist training, tool-use trajectories, RL infrastructure, and distillation appear to explain a large fraction of the real-world gap.
- Inference systems are part of the product. KV-cache design, sparse retrieval, batching, and context management now shape user-visible capability.
- The secret sauce is probably a stack, not a trick. Better data, better evals, better optimizers, better tooling, better test-time scaling, and better agents combine into the frontier.
That is why V4 is such a useful paper to read. It makes the closed-source recipe feel less mystical.
Final thoughts
DeepSeek V4 is important because it changes the center of gravity.
The old scaling story was mostly about more parameters and more data. The V4 story is about making very long context cheap enough to use routinely, then layering on better post-training for specialists, tools, and agents.
That does not mean V4 has fully matched the strongest closed models. The paper itself is more careful than that. But it does mean the path to getting close is increasingly visible.
And right now, that path looks like this:
- compress context before paying full attention cost
- preserve detail selectively instead of uniformly
- treat optimization and inference as part of the architecture
- push much harder on agent training and post-training consolidation
That is what makes V4 more than a leaderboard event. It is a map of where the frontier is actually being built.
Michael Wan Interactive Insights