DeepSeek-V2 is a Mixture-of-Experts (MoE) language model featuring 236B total parameters with 21B active parameters per token.
It was pretrained on 8.1 trillion tokens, supports a 128K context window, and is particularly strong at math, code, and reasoning.
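The gap between 236B total and 21B active parameters comes from MoE routing: each token is sent to only a few experts, so most parameters sit idle on any given forward pass. The following is a minimal sketch of top-k expert routing with toy sizes chosen for illustration (`num_experts`, `top_k`, and `d_model` are hypothetical and much smaller than DeepSeek-V2's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts = 8   # hypothetical toy count; DeepSeek-V2 uses far more experts
top_k = 2         # experts activated per token
d_model = 16

# One tiny FFN "expert" per slot: d_model -> d_model.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_forward(h):
    """Route a token to its top-k experts and mix their outputs."""
    logits = h @ router
    topk = np.argsort(logits)[-top_k:]       # indices of the selected experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                 # softmax over the selected experts only
    out = sum(w * (h @ experts[i]) for w, i in zip(weights, topk))
    return out, topk

h = rng.standard_normal(d_model)
out, used = moe_forward(h)
print(f"experts used: {sorted(used.tolist())} of {num_experts}")
```

Only `top_k` of the `num_experts` expert matmuls run per token, which is why the per-token compute tracks the 21B active parameters rather than the 236B total.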
The model introduces two key architectural innovations: Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent vector, and DeepSeekMoE, a sparse Feed-Forward Network design that activates only a small subset of experts per token.
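The core idea behind MLA's key-value compression can be sketched as a low-rank factorization: instead of caching full keys and values, cache a small latent vector per token and reconstruct K and V from it at attention time. This is a toy illustration, not DeepSeek-V2's actual implementation; the dimensions and weight names (`W_dkv`, `W_uk`, `W_uv`) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent, seq_len = 64, 8, 32   # hypothetical toy sizes

W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02  # down-projection to latent
W_uk  = rng.standard_normal((d_latent, d_model)) * 0.02  # up-projection for keys
W_uv  = rng.standard_normal((d_latent, d_model)) * 0.02  # up-projection for values

H = rng.standard_normal((seq_len, d_model))  # hidden states of cached tokens

# Cache only the compressed latent C instead of full K and V.
C = H @ W_dkv      # (seq_len, d_latent) -- this is what gets cached
K = C @ W_uk       # keys reconstructed on the fly
V = C @ W_uv       # values reconstructed on the fly

full_cache = 2 * seq_len * d_model   # floats needed to cache K and V directly
mla_cache = seq_len * d_latent       # floats needed to cache the latent
print(f"cache elements per layer: {full_cache} -> {mla_cache}")
```

Shrinking the per-token cache this way is what makes a 128K context window practical at inference time.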
The model requires 8x80GB GPUs for BF16 inference and can be run with Hugging Face Transformers or vLLM.
DeepSeek-V2 is released under a license that permits commercial use.