
DeepSeek-V2 is a Mixture-of-Experts (MoE) language model with 236B total parameters, of which 21B are activated per token.
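The gap between total and active parameters comes from sparse routing: a router picks a few experts per token, so only those experts' weights participate in each forward pass. A toy NumPy sketch of top-k expert routing (dimensions and expert count here are illustrative, not DeepSeek-V2's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2      # toy sizes, NOT the real DeepSeek-V2 config
tokens = rng.standard_normal((4, d_model))

# Router: score each token against every expert, keep only the top-k per token.
router_w = rng.standard_normal((d_model, n_experts))
scores = tokens @ router_w
top_experts = np.argsort(scores, axis=-1)[:, -top_k:]   # (n_tokens, top_k)

# Each expert is a tiny feed-forward layer; only the selected ones run per token.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x, chosen):
    # Softmax gate over the chosen experts' scores only.
    gates = np.exp(scores[np.arange(len(x))[:, None], chosen])
    gates /= gates.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(len(x)):
        for slot in range(top_k):
            e = chosen[t, slot]
            out[t] += gates[t, slot] * (x[t] @ experts[e])
    return out

y = moe_forward(tokens, top_experts)
```

With 2 of 8 experts active, only a quarter of the expert weights touch any given token; scaled up, that is how a 236B-parameter model can activate just 21B parameters per token.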

It was pretrained on 8.1 trillion tokens, supports a 128K context window, and specialises in math, code, and reasoning.

The model introduces two key architectural innovations: Multi-head Latent Attention (MLA), which compresses the key-value (KV) cache into a low-rank latent for efficient inference, and DeepSeekMoE, a sparse architecture that strengthens the Feed-Forward Networks.
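The core idea of MLA can be sketched in a few lines: hidden states are down-projected into a small shared latent, and only that latent is cached; keys and values are re-expanded from it at attention time. The dimensions below are toy values chosen for illustration, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, d_latent, n_heads, d_head = 64, 8, 4, 16   # toy sizes (assumed)
seq = rng.standard_normal((10, d_model))            # 10 cached token states

# Down-project to a shared low-rank latent; this is all the KV cache stores.
W_down = rng.standard_normal((d_model, d_latent))
latent_cache = seq @ W_down                         # (10, d_latent)

# At attention time, keys and values are re-expanded from the cached latent.
W_up_k = rng.standard_normal((d_latent, n_heads * d_head))
W_up_v = rng.standard_normal((d_latent, n_heads * d_head))
k = (latent_cache @ W_up_k).reshape(10, n_heads, d_head)
v = (latent_cache @ W_up_v).reshape(10, n_heads, d_head)

# Per-token floats a standard multi-head KV cache would hold instead.
full = 2 * n_heads * d_head
print(f"cache floats per token: {d_latent} vs {full}")
```

Caching 8 floats per token instead of 128 is the kind of reduction that makes long-context inference cheap; the real model's projection shapes differ, but the structure is the same.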

The model requires 8×80 GB GPUs for BF16 inference and can be run with Hugging Face Transformers or vLLM.
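A minimal loading sketch with Hugging Face Transformers, assuming the `deepseek-ai/DeepSeek-V2` repository id and the hardware described above; the custom model code requires `trust_remote_code=True`:

```python
def load_deepseek_v2(model_id: str = "deepseek-ai/DeepSeek-V2"):
    """Sketch: load DeepSeek-V2 in BF16, sharded across available GPUs.

    The model id is an assumption; check the Hugging Face hub for the
    exact repository name and any chat/instruct variants.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # BF16 inference per the requirements above
        device_map="auto",            # shard layers across the available GPUs
        trust_remote_code=True,       # the repo ships custom MLA/MoE model code
    )
    return tokenizer, model
```

vLLM offers an alternative serving path with its own loading API; consult its documentation for multi-GPU tensor-parallel settings.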

DeepSeek-V2 is released under a license that permits commercial use.