June 20, 2025
How DeepSeek Works
Deconstructing DeepSeek: An Exhaustive Technical Analysis of Architecture, Training, and Reasoning
I. Introduction: The DeepSeek Paradigm
In the rapidly evolving landscape of artificial intelligence, the emergence of a new player capable of challenging established leaders is a rare and significant event. DeepSeek, a Chinese AI company, has achieved precisely this, not merely by replicating existing technologies but by introducing a distinct paradigm rooted in extreme computational efficiency. This report provides an exhaustive technical analysis of how DeepSeek’s models work, deconstructing the company’s foundational principles, its novel architectural innovations, its sophisticated data and training methodologies, and the unique reinforcement learning framework that underpins its advanced reasoning capabilities. By examining the synergy between its corporate origins, its strategic response to geopolitical constraints, and its technological breakthroughs, we can understand how DeepSeek has managed to achieve state-of-the-art performance while fundamentally altering the economic calculus of large-scale AI development.
1.1 A New Contender: From Quantitative Finance to Frontier AI
DeepSeek’s identity and strategy are inextricably linked to its origins. The company, officially Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., was founded in Hangzhou, China, in mid-2023.¹ Its founder and CEO is Liang Wenfeng, who also co-founded and leads High-Flyer, one of China’s premier quantitative hedge funds.² This lineage is the crucible in which DeepSeek’s core philosophy was forged. High-Flyer specializes in AI-driven quantitative trading, a domain where competitive advantage is dictated by algorithmic efficiency, speed, and the ability to extract maximum predictive power from computational resources.¹ By 2021, the fund was operating exclusively on AI trading algorithms, powered by its own custom computing clusters.¹
The journey toward general-purpose AI began as an internal project within the hedge fund. In April 2023, High-Flyer announced the formation of an artificial general intelligence (AGI) research lab dedicated to developing AI tools unrelated to its core financial business.¹ This initiative was rapidly spun off into the independent entity DeepSeek just three months later, in July 2023, with High-Flyer serving as its principal investor and financial backer.¹ This swift transition from a specialized internal lab to a full-fledged AGI company underscores a deliberate and strategic pivot. The company’s stated focus is on fundamental research with no immediate plans for commercialization, a posture that not only emphasizes its long-term vision but also allows it to navigate certain Chinese government regulations aimed at consumer-facing AI technologies.¹
1.2 The Philosophy of Extreme Efficiency: Challenging the Scaling-Law Status Quo
DeepSeek’s market entry was nothing short of disruptive, sending shockwaves through the technology sector and financial markets.² The release of its models, particularly DeepSeek-R1 in January 2025, was hailed by prominent figures like venture capitalist Marc Andreessen as AI’s “Sputnik moment”.² This potent analogy was not based on raw performance alone, but on the astonishing efficiency with which that performance was achieved. DeepSeek demonstrated the ability to create models with capabilities comparable to top-tier American counterparts, such as OpenAI’s GPT-4 and o1, but at a dramatically lower cost.¹ For instance, the company reported that its DeepSeek-V3 model was trained for approximately $6 million, a stark contrast to the estimated $100 million-plus cost for developing GPT-4.¹ This revelation triggered a significant selloff in tech stocks, with companies like Nvidia, Oracle, and Meta losing nearly $1 trillion in combined market capitalization, as it called into question the prevailing “spend-at-all-costs” approach to AI leadership.²
This obsession with efficiency is not merely an ideological choice but a strategic imperative shaped by DeepSeek’s unique corporate DNA and the geopolitical environment. The culture inherited from High-Flyer prizes maximizing output from minimal input. This was compounded by the necessity of operating under U.S. trade restrictions that limit China’s access to the most advanced AI semiconductors.¹ Faced with a hardware ceiling, DeepSeek was forced to innovate on the software and architectural layers to compensate. The company leveraged a stockpile of older Nvidia A100 chips acquired before the ban, as well as weaker, export-compliant GPUs, and crucially, used fewer of them than its Western rivals.¹ This constraint became a catalyst for invention, forcing the development of a technological stack—from model architecture to training algorithms—designed from the ground up for extreme resource optimization. In a statement, Nvidia itself commended DeepSeek’s work as an “excellent AI advancement” that leveraged “widely-available models and compute that is fully export-control compliant”.² This confluence of a finance-driven efficiency culture and hardware limitations is the primary driver behind DeepSeek’s entire technological strategy.
Another pillar of DeepSeek’s strategy is its commitment to an “open-weight” model, which serves as a powerful tool for market penetration and talent acquisition. By releasing powerful models with permissive licenses for their weights, DeepSeek rapidly cultivated a global community of developers and researchers.⁵ This approach crowdsources improvements, bug fixes, and third-party integrations, as evidenced by the vibrant ecosystem that has formed around its models.⁸ As a new and relatively unknown entity, this strategy allowed DeepSeek to build credibility and a user base with remarkable speed. The buzz generated by the open-source community, with the DeepSeek app rocketing to the top of download charts in the U.S. and globally, served as highly effective, capital-efficient marketing.² This visibility, combined with the company’s focus on fundamental research, has made it an attractive destination for top AI talent, particularly from leading Chinese universities.¹
1.3 An Overview of the Model Ecosystem: From Coders to Reasoners
DeepSeek’s model releases show a clear and logical progression of increasing capability, specialization, and architectural sophistication. The evolution of its ecosystem provides a roadmap to understanding the company’s iterative approach to research and development.
The journey began with specialized models before moving to general-purpose and highly advanced reasoning systems:
DeepSeek Coder (November 2023): This was the company’s inaugural release, a family of models specifically designed for coding tasks. Trained on a 2 trillion token dataset heavily weighted towards code (87%), it established DeepSeek’s initial competence in a high-value domain.³
DeepSeek-LLM (December 2023): A month later, the company released its first general-purpose large language model. The 67 billion parameter model demonstrated broader capabilities in reasoning, math, and language understanding, signaling DeepSeek’s ambition to compete with established giants.³
DeepSeek-V2 (May 2024): This release marked a major architectural leap. DeepSeek-V2 introduced the two foundational innovations that define the company’s efficiency-first approach: Multi-Head Latent Attention (MLA) and the DeepSeekMoE architecture. It featured 236 billion total parameters, with only 21 billion active during inference, and a 128K token context window.³
DeepSeek-Coder-V2 (July 2024): Building on the V2 architecture, this model applied the new MLA and MoE concepts back to the coding domain, scaling to 236B parameters and supporting 338 programming languages.³
DeepSeek-V3 (December 2024): This model represents the scaling of the V2 architecture to a massive 671 billion total parameters (with 37B active). It incorporated further refinements to the training process and MoE load balancing, setting new benchmarks for open-weight models.³
DeepSeek-R1 (January 2025): The company’s current flagship reasoning model, DeepSeek-R1, is built upon the V3 architecture but is distinguished by its unique training pipeline. It employs a novel reinforcement learning algorithm, Group Relative Policy Optimization (GRPO), to achieve advanced, multi-step reasoning capabilities on par with leading closed-source models like OpenAI’s o1.³
Beyond these headline releases, the DeepSeek ecosystem also includes a variety of specialized variants, such as DeepSeek-Math for mathematical reasoning, DeepSeek-VL for vision-language tasks, and numerous distilled and quantized versions designed for deployment in resource-constrained environments.¹ This comprehensive family of models is summarized in Table 1.
Table 1: DeepSeek Model Evolution and Key Specifications
Model Name | Release Date | Total Parameters | Active Parameters (MoE) | Context Window | Key Innovation Introduced |
---|---|---|---|---|---|
DeepSeek Coder | Nov 2023 | 1B - 33B | N/A (Dense) | 16K | Project-level code understanding |
DeepSeek-LLM | Dec 2023 | 67B | N/A (Dense) | 4K | First general-purpose model |
DeepSeek-V2 | May 2024 | 236B | 21B | 128K | Multi-Head Latent Attention (MLA), DeepSeekMoE |
DeepSeek-Coder-V2 | July 2024 | 236B | 21B | 128K | Application of V2 architecture to coding |
DeepSeek-V3 | Dec 2024 | 671B | 37B | 128K | Scaled V2 architecture, Auxiliary-Loss-Free MoE |
DeepSeek-R1 | Jan 2025 | 671B | 37B | 64K / 128K | Group Relative Policy Optimization (GRPO) for reasoning |
Data compiled from sources.³
II. The Architectural Foundations of Efficiency
At the heart of DeepSeek’s ability to deliver high performance at low cost are two synergistic architectural innovations: a refined Mixture-of-Experts (MoE) paradigm called DeepSeekMoE, and a novel attention mechanism known as Multi-Head Latent Attention (MLA). The former is primarily a training-time optimization that reduces the computational load (FLOPs) of each forward pass, while the latter is an inference-time optimization that tackles the memory bandwidth bottleneck of the KV cache. Together, they form a holistic solution that addresses the two primary costs of developing and deploying large models. This deliberate design for efficiency is what enables DeepSeek’s open-source strategy to be a practical reality, making it feasible for a broad community to run, fine-tune, and deploy its models without requiring hyperscaler-level resources.
2.1 The Mixture-of-Experts (MoE) Revolution
The fundamental concept behind MoE is a shift from dense to sparse activation. In a traditional dense transformer model, every parameter in the network is activated to process every single input token. This is computationally expensive, especially as models scale into the hundreds of billions of parameters. MoE architectures, in contrast, replace the dense Feed-Forward Network (FFN) layers of a transformer with a collection of smaller FFNs, known as “experts”.¹⁸ For each token, a lightweight “gating network” or “router” dynamically selects a small subset of these experts to activate.¹⁹ This means that while the model may have a very large number of total parameters, only a fraction are used for any given computation, drastically reducing the FLOPs required for both training and inference.⁷
DeepSeekMoE: A Deeper Dive
DeepSeek’s implementation, known as DeepSeekMoE, introduces critical refinements to earlier MoE designs to address common issues like knowledge redundancy (where multiple experts learn the same things) and knowledge hybridity (where individual experts are not specialized enough).²¹ The architecture employs two principal strategies to achieve what it calls “ultimate expert specialization”.²¹
Fine-Grained Expert Segmentation: Instead of using a small number of large experts, DeepSeekMoE segments the FFNs into a much larger number of smaller, more fine-grained experts.¹⁸ For example, the MoE layers in DeepSeek-V2-Lite contain 64 routed experts, and the full DeepSeek-V2 and V3 scale this further to 160 and 256 routed experts respectively.²⁴ While the total number of parameters remains constant, this segmentation allows for a more precise decomposition of knowledge. The router then activates a larger number of these smaller experts (e.g., 6 in DeepSeek-V2), allowing for a more flexible and adaptable combination of specialized knowledge for any given token.²²
Shared Expert Isolation: This is a key architectural innovation. DeepSeekMoE isolates a small number of experts (e.g., 2 in DeepSeek-V2) to serve as “shared experts”.²¹ These shared experts are activated for every token, regardless of the router’s decision. Their purpose is to capture and consolidate common knowledge that is broadly applicable across different contexts and domains. By compressing this universal knowledge into the shared experts, the model mitigates redundancy among the routed experts, allowing them to focus on learning highly distinct and specialized knowledge.²²
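To ground these two strategies, the following is a minimal PyTorch sketch of a DeepSeekMoE-style layer: a router picks the top-k fine-grained experts per token, while a handful of shared experts process every token unconditionally. All sizes, expert counts, and the top-k value are illustrative stand-ins, not DeepSeek’s production configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSeekMoESketch(nn.Module):
    """Fine-grained routed experts plus always-active shared experts.
    All sizes here are illustrative, not DeepSeek's production config."""
    def __init__(self, d_model=512, d_ff=128, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                           # x: (n_tokens, d_model)
        out = sum(e(x) for e in self.shared)        # shared experts see every token
        scores = F.softmax(self.router(x), dim=-1)  # token-to-expert affinities
        topv, topi = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter_(-1, topi, topv)
        for i, expert in enumerate(self.routed):    # dispatch tokens to chosen experts
            mask = gates[:, i] > 0
            if mask.any():
                out[mask] = out[mask] + gates[mask, i:i+1] * expert(x[mask])
        return out

moe = DeepSeekMoESketch()
print(moe(torch.randn(3, 512)).shape)  # torch.Size([3, 512])
```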
Innovations in Load Balancing
A persistent challenge in MoE models is ensuring that the computational load is distributed evenly across all experts. If the routing mechanism consistently favors a few experts, they become over-trained while others are under-utilized and may “collapse,” wasting model capacity. Traditional methods address this by adding an “auxiliary loss” term to the main training objective, which penalizes imbalanced routing. However, optimizing this secondary objective can interfere with the primary goal of minimizing prediction loss, thereby degrading overall model performance.¹⁴
DeepSeek-V3 pioneers an auxiliary-loss-free load balancing strategy to circumvent this trade-off.¹⁴ Instead of a separate loss term, this method introduces a dynamically adjusted bias term directly into the token-to-expert affinity scores calculated by the router. During training, the system monitors the load on each expert at every step. If an expert is identified as being overloaded or underloaded, its corresponding bias term is adjusted by a fixed step γ. This gently discourages or encourages the router from selecting that expert in subsequent steps, achieving balanced load distribution without the performance penalty associated with an auxiliary loss function.²⁵
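A toy simulation makes the mechanism concrete. Here the bias enters only the expert-selection step (not the gate values), and a fixed step γ nudges it against the observed load; the exact update rule and hyperparameters in DeepSeek-V3 differ from this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, gamma = 8, 2, 1e-3
bias = np.zeros(n_experts)                    # adjusted online, not by gradient

for step in range(1_000):
    affinity = rng.random((256, n_experts))   # stand-in token-to-expert scores
    # The bias influences WHICH experts are selected, not the gate weights.
    chosen = np.argsort(affinity + bias, axis=-1)[:, -top_k:]
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    target = 256 * top_k / n_experts          # perfectly uniform load
    # Nudge overloaded experts down and underloaded experts up by a fixed step.
    bias -= gamma * np.sign(load - target)

print(np.round(bias, 3))  # biases settle so that expert load stays near uniform
```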
2.2 Taming the KV Cache: The Multi-Head Latent Attention (MLA) Mechanism
While MoE addresses the computational cost of training, Multi-Head Latent Attention (MLA) tackles the primary bottleneck of inference: the size of the Key-Value (KV) cache. In a standard transformer, during autoregressive generation (predicting one token at a time), the Key (K) and Value (V) vectors for all previous tokens in the sequence are cached in GPU memory.²⁸ This avoids recomputing them for every new token, which would be prohibitively slow. However, for models with long context windows, the memory required to store this cache becomes enormous, limiting the maximum sequence length and reducing inference throughput.¹²
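Back-of-the-envelope arithmetic shows the scale of the problem. The numbers below describe a hypothetical dense transformer, not DeepSeek’s actual dimensions:

```python
# Hypothetical dense model: 32 layers, 32 attention heads, head_dim 128,
# BF16 storage (2 bytes), with both K and V cached for every past token.
layers, heads, head_dim, bytes_per_elem = 32, 32, 128, 2
per_token = layers * heads * head_dim * 2 * bytes_per_elem   # K and V
print(per_token / 1024, "KiB per token")                     # 512.0 KiB
print(per_token * 131_072 / 2**30, "GiB at a 128K context")  # 64.0 GiB
```

A single 128K-token sequence already consumes tens of gigabytes of accelerator memory before any batching, which is precisely the bottleneck MLA targets.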
MLA Explained
Introduced in DeepSeek-V2, MLA is a novel attention mechanism that directly attacks this problem by fundamentally changing what is stored in the cache.⁷
Low-Rank Projection: The central idea of MLA is to compress the high-dimensional Key and Value vectors into a much smaller, low-dimensional “latent space” before they are written to the KV cache.³⁰ This is achieved by passing the full-dimension vectors through a “down-projection” linear layer. The resulting compressed vectors are what get cached. This single change dramatically reduces the memory footprint of the cache. For DeepSeek-V2, this resulted in a 93.3% reduction in KV cache size compared to its dense predecessor.³²
Operations in Compressed Space: To avoid having to decompress the entire cache for every new token, attention calculations are reordered to be performed within this compressed latent space. The query vector is projected into the compressed space, and attention scores are computed against the compressed keys. The final output is then generated by aggregating the compressed values.²⁹ This approach significantly reduces the memory bandwidth required during inference, which is often the limiting factor on modern accelerators.³⁰
Decoupled Positional Encoding: To further improve efficiency with long contexts, MLA decouples the Rotary Position Embedding (RoPE) from the main semantic content. It introduces a separate, dedicated set of queries and keys for handling positional information, which avoids certain computational inefficiencies that arise in standard attention mechanisms when sequence lengths become very large.²⁸
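A minimal sketch of the caching idea follows, with illustrative dimensions. Only the small latent vector is stored per token; per-head keys and values are reconstructed on the fly (in the optimized kernels, the up-projections are absorbed into the query and output projections so attention runs directly in the compressed space). The decoupled RoPE path is omitted for brevity.

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64   # illustrative sizes

W_dkv = nn.Linear(d_model, d_latent, bias=False)          # down-projection
W_uk = nn.Linear(d_latent, n_heads * d_head, bias=False)  # key up-projection
W_uv = nn.Linear(d_latent, n_heads * d_head, bias=False)  # value up-projection

h = torch.randn(1, 10, d_model)   # hidden states for a 10-token prefix
kv_cache = W_dkv(h)               # cache 64 dims/token instead of
                                  # 2 * n_heads * d_head = 1024 for full K + V
k = W_uk(kv_cache).view(1, 10, n_heads, d_head)   # reconstructed when needed
v = W_uv(kv_cache).view(1, 10, n_heads, d_head)
print(kv_cache.shape, k.shape)  # torch.Size([1, 10, 64]) torch.Size([1, 10, 8, 64])
```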
Hardware-Centric Analysis and Refinements
The design of MLA reflects a deep understanding of hardware constraints. By drastically reducing memory access, MLA shifts the bottleneck of the attention mechanism away from memory bandwidth and more towards computation. This leads to more stable and efficient performance, particularly on the kind of bandwidth-limited hardware that DeepSeek was often compelled to use due to export controls.²⁹
The architecture was further improved in DeepSeek-V3 to handle extremely long contexts of up to 128K tokens. These refinements include techniques like dynamic low-rank projection, where the strength of the KV compression is adjusted based on the sequence length (less compression for short sequences, more for long ones), and adaptive query compression, which scales the query dimension at different layers of the model to balance expressiveness and memory usage.²⁶ This hardware-aware co-design is a hallmark of DeepSeek’s engineering philosophy.
Table 2: Architectural Comparison: MHA/GQA vs. MLA
Feature | Multi-Head Attention (MHA) | Grouped-Query Attention (GQA) | Multi-Head Latent Attention (MLA) |
---|---|---|---|
KV Cache Size | Very Large (L × n_h × d_k) | Medium (L × n_kv × d_k) | Very Small (L × d_latent) |
Memory Bandwidth | High | Medium | Low |
Computational Cost | High | Slightly Reduced | Low (in compressed space) |
Long Context Suitability | Poor (Memory bottleneck) | Better (Reduced cache) | Excellent (Drastically reduced cache) |
Key Innovation | Parallel attention heads | Shared K and V heads across query groups | Low-rank projection of K and V into a latent space |
Where L is sequence length, n_h is the number of query heads, n_kv is the number of KV heads, d_k is the head dimension, and d_latent is the compressed latent dimension. Data compiled from sources.¹²
III. The Engine of Intelligence: Data Curation and Training Methodology
The remarkable capabilities of DeepSeek’s models are not solely the result of architectural innovation; they are equally dependent on the vast and meticulously curated datasets used for training and the highly optimized, hardware-aware process through which that training is conducted. DeepSeek’s approach demonstrates a philosophy where data quality and structure are treated as first-class components of the model’s design, and where training stability is viewed as a key competitive advantage achieved through intense hardware-software co-design.
3.1 Crafting the Corpus: Scaling to 14.8 Trillion Tokens
DeepSeek’s models are trained on datasets of staggering scale, which have grown with each successive model generation. The initial DeepSeek Coder and DeepSeek-LLM models were trained on a corpus of 2 trillion tokens.⁹ This was scaled to 8.1 trillion tokens for DeepSeek-V2 and ultimately to a massive 14.8 trillion tokens for the pre-training of DeepSeek-V3.¹⁴
This corpus is intentionally diverse and multilingual to ensure the models develop broad capabilities. The data is sourced from a wide array of materials, including large-scale web scrapes (such as Common Crawl), extensive code repositories from GitHub, scientific literature, books, and a rich collection of mathematical texts.¹⁹ While primarily focused on English and Chinese, the datasets include other languages to enhance multilingual performance.¹⁶
For its specialized models, DeepSeek carefully engineers the composition of the training data. For example, the original DeepSeek Coder was trained on a dataset comprising 87% code and 13% natural language (including code-related text from sources like GitHub Markdown and StackExchange).¹⁰ For the more advanced DeepSeek-Coder-V2, this mix was adjusted to 60% code, 10% math-focused text, and 30% general natural language, reflecting a strategy to imbue the coding model with stronger mathematical reasoning abilities.³⁶
3.2 The Curation Gauntlet: A Multi-Stage Quality Pipeline
DeepSeek’s methodology for data preparation is far more sophisticated than a simple “data dump.” It employs a rigorous, multi-stage pipeline designed to maximize data quality, information density, and logical coherence.³⁴
Iterative Refinement: The process is iterative. A small, high-quality seed dataset is used to train a classifier, which then scores and retrieves similar documents from a massive raw dataset. This refined data is then used to train a better classifier for the next round, progressively improving the quality of the collected data over multiple cycles.³⁴
Filtering: The raw data is subjected to stringent filtering. This includes rule-based filters (e.g., removing files with excessively long lines or a low percentage of alphabetic characters) and more advanced linguistic and semantic assessments to discard low-quality documents or entire web domains.¹⁶
Deduplication: To combat redundancy and increase the informational value of the training data, DeepSeek utilizes advanced deduplication techniques. For code, this includes repository-level minhash algorithms that can identify and remove near-duplicate projects, ensuring the model sees a wider variety of unique code structures.¹⁰
Dependency-Aware Structuring for Code: A particularly insightful innovation for the Coder models is the treatment of data structure as an architectural component. Instead of feeding code files to the model in an arbitrary order, the curation pipeline parses the dependencies between files within a single software repository. The files are then reordered and concatenated into a single training example, ensuring that dependencies (e.g., class definitions, imported libraries) appear before the code that relies on them.¹⁰ This presents the model with a logically coherent curriculum that mirrors how a human developer would read and understand a project, fostering a deeper, project-level comprehension of code (a minimal sketch of this ordering follows the list).
Remixing: In the final stage, the composition of the curated dataset is carefully adjusted or “remixed” to ensure a balanced representation across different domains (e.g., STEM, creative writing, Q&A), which helps to prevent model bias and improve performance on underrepresented topics.¹⁶
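The dependency-aware ordering described above amounts to a topological sort over the repository’s import graph. A minimal sketch with a hypothetical three-file project:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical repository: each file maps to the files it depends on.
deps = {
    "utils.py": set(),
    "model.py": {"utils.py"},
    "train.py": {"model.py", "utils.py"},
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['utils.py', 'model.py', 'train.py'] -- dependencies first

# Concatenate in dependency order to form one coherent training example.
training_example = "\n\n".join(f"# file: {name}" for name in order)
```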
3.3 The Pre-Training Process: Hardware-Software Co-Design
The ability to train a 671B parameter model on 14.8T tokens economically is a testament to DeepSeek’s mastery of hardware-software co-design. The company operates its own high-performance computing (HPC) clusters, such as “Fire-Flyer 2,” which feature Nvidia GPUs (including H800s for V3) connected by high-speed (200 Gbit/s) interconnects like InfiniBand.¹ The key is maximizing the utilization of this hardware.
FP8 Mixed Precision Training: DeepSeek-V3 was one of the first models to validate the feasibility and effectiveness of training at an extremely large scale using FP8 (8-bit floating point) mixed precision.¹⁴ Compared to the more common 16-bit precision (FP16/BF16), FP8 significantly reduces the memory footprint of model weights and activations, and it can accelerate computation on compatible hardware (see the footprint demo after this list). Achieving this without sacrificing performance or stability is a major engineering feat.¹⁴
Overcoming Communication Bottlenecks: In large-scale, distributed MoE training, the “all-to-all” communication required to send tokens to their designated experts across different nodes can become a severe bottleneck. DeepSeek’s team developed custom communication kernels and optimized scheduling frameworks to nearly achieve full overlap between computation and communication.¹⁴ This means that while some parts of the GPU are performing calculations, other parts are simultaneously handling data transfer, maximizing hardware utilization and drastically improving training efficiency.¹⁴
Remarkable Stability: The culmination of this meticulous co-design is a remarkably stable training process. The company reported that for the entire pre-training run of DeepSeek-V3, they encountered no irrecoverable loss spikes and did not have to perform any rollbacks to previous checkpoints.¹⁴ At this scale, where a single loss spike can waste days or weeks of expensive GPU time, this stability is not a minor detail. It is a direct indicator of a mature and highly optimized training stack, which provides a significant competitive advantage by enabling faster iteration cycles and more predictable R&D timelines.
Post-Training Context Extension: After the initial pre-training is complete, models like DeepSeek-V3 undergo a separate, two-stage fine-tuning process to extend their context length, first to 32K tokens and subsequently to 128K tokens, enabling them to handle very long documents.²⁷
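The storage-side benefit of FP8, noted in the list above, is easy to demonstrate. This sketch assumes a recent PyTorch build that ships the float8 dtypes; real FP8 training additionally requires scaled matmul kernels, careful scaling factors, and higher-precision master weights, none of which is shown here.

```python
import torch  # requires a recent PyTorch with float8 dtype support

w_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8 = w_bf16.to(torch.float8_e4m3fn)   # storage cast only, no scaled kernels

print(w_bf16.element_size(), "bytes/elem ->", w_fp8.element_size(), "byte/elem")  # 2 -> 1
# FP8's coarse mantissa introduces rounding error, which is why training
# recipes keep optimizer states and sensitive accumulations in higher precision.
err = (w_fp8.to(torch.float32) - w_bf16.to(torch.float32)).abs().max()
print(f"max abs round-trip error: {err:.4f}")
```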
IV. Emergent Reasoning: The Group Relative Policy Optimization (GRPO) Framework
Beyond architectural efficiency and data scale, DeepSeek’s most advanced contribution lies in its methodology for teaching models how to reason. This is most evident in the training of the DeepSeek-R1 model, which moves beyond the standard industry practice of using reinforcement learning for simple preference alignment. Instead, DeepSeek employs RL as a core mechanism for instilling complex, multi-step problem-solving capabilities. This is achieved through a novel and highly efficient algorithm called Group Relative Policy Optimization (GRPO) and a sophisticated multi-stage training pipeline that pragmatically balances emergent discovery with controlled, stable behavior.
4.1 Beyond Standard Alignment: RL as a Core Capability Driver
Reinforcement Learning from Human Feedback (RLHF) has become the standard final step in training large language models. Its primary purpose is to align a pre-trained model to be more “helpful and harmless” by fine-tuning it based on human preferences for different responses.⁴⁰ DeepSeek, however, reimagines the role of RL. Instead of using it merely for stylistic tuning, the company leverages it as a primary driver for teaching the model the fundamental skill of reasoning.⁴²
This new approach was first tested in an experiment to create DeepSeek-R1-Zero. This model was developed by taking a pre-trained base model and applying reinforcement learning directly, with no intermediate supervised fine-tuning (SFT) stage.¹⁶ The experiment was a success in demonstrating that complex reasoning abilities—such as generating long, coherent Chain-of-Thought (CoT) explanations and even self-verifying its answers—could emerge solely from a reward signal.⁴² However, this “pure RL” approach also led to instability and undesirable “quirky behaviors,” such as the model inexplicably mixing different languages within a single response.⁴⁰ This highlighted the power of RL for capability discovery but also its limitations in ensuring controlled, reliable behavior.
4.2 From PPO to GRPO: A More Efficient RL Pipeline
The standard algorithm for RLHF is Proximal Policy Optimization (PPO). While effective, PPO is computationally expensive and memory-intensive. This is because it requires training and running two large neural networks in tandem: the policy model (the LLM itself, which generates actions/text) and a critic model (also called a value model), which is trained to evaluate the policy’s outputs and estimate the expected future reward from a given state.⁴³ The critic model is typically similar in size to the policy model, effectively doubling the resource requirements for the RL training phase.⁴⁵
This high cost is antithetical to DeepSeek’s efficiency-first philosophy. The development of GRPO is the algorithmic embodiment of this philosophy, optimizing the learning algorithm itself for resource efficiency. Group Relative Policy Optimization (GRPO), first introduced in the DeepSeekMath paper, is DeepSeek’s innovative alternative to PPO.⁴³ The core innovation of GRPO is the complete elimination of the critic model.⁴⁸ The key insight is that an absolute measure of a response’s quality from a critic is not necessary; a relative measure of its quality compared to other possible responses is sufficient to provide a strong learning signal.
The GRPO process works as follows:
Group Sampling: For a given input prompt, the policy model is used to generate a group of multiple, distinct responses (e.g., trying several different approaches to solve a math problem).⁴⁷
Reward Scoring: Each response in the group is scored using a reward function. This function can be a separate reward model or, crucially for reasoning tasks, a rule-based or outcome-based check (e.g., does the code pass its unit tests?).⁴³
Relative Advantage Calculation: This is the critical step. Instead of querying a critic model, GRPO calculates a relative advantage for each response directly from the group’s rewards. This is typically done by normalizing the reward of each response against the mean and standard deviation of all rewards within the group. A response is thus judged not in isolation, but on how it performs compared to its peers generated from the same prompt.⁴⁷
Policy Update: The policy model’s parameters are then updated using a clipped surrogate objective, mathematically similar to the one used in PPO. However, it uses the far more efficient, group-relative advantage signal as its gradient. This process dramatically reduces the memory and compute overhead of the RL stage, making large-scale reasoning training economically viable.⁴⁵
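The arithmetic at the heart of GRPO is compact enough to show directly. The sketch below turns a group’s rewards into relative advantages and plugs them into a PPO-style clipped objective; the rewards and probability ratios are stand-in values.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: normalize each response's reward against
    the mean and std of its own group -- no critic network required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, six sampled solutions scored by a verifiable reward
# (e.g., 1.0 if the final answer checks out, 0.0 otherwise):
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0]
adv = grpo_advantages(rewards)
print(adv.round(2))  # correct samples get positive advantage, wrong ones negative

# PPO-style clipped surrogate, reusing the group-relative advantage:
ratio = np.array([1.10, 0.80, 1.30, 0.95, 1.00, 1.05])  # pi_new / pi_old
eps = 0.2
loss = -np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()
print(round(loss, 3))
```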
Table 3: Algorithmic Comparison: PPO vs. GRPO
Feature | Proximal Policy Optimization (PPO) | Group Relative Policy Optimization (GRPO) |
---|---|---|
Core Components | Policy Model, Critic (Value) Model, Reward Model | Policy Model, Reward Model (No Critic) |
Advantage Estimation | Uses a trained Critic/Value model to estimate the baseline reward (A(s,a)=R−V(s)) | Calculates a relative advantage by normalizing rewards within a group of sampled responses |
Computational Cost | High (Requires training and inference for two large models) | Low (Requires only one model, plus simple statistical calculations) |
Memory Usage | High (Memory for gradients and activations of both policy and critic models) | Low (Roughly half the memory of PPO due to elimination of the critic) |
Key Use Case | General LLM alignment (InstructGPT/ChatGPT) | Efficient reasoning capability training (DeepSeek-R1, DeepSeekMath) |
Data compiled from sources.⁴³
4.3 The Multi-Stage RL Pipeline for DeepSeek-R1
The final, production-grade DeepSeek-R1 model was not trained with the unstable “pure RL” approach of R1-Zero. Instead, it was crafted using a sophisticated, multi-stage pipeline that synthesizes the strengths of SFT and RL to achieve both high capability and reliable behavior.¹⁷ This pipeline represents a pragmatic solution to the trade-off between emergent discovery and controlled instruction-following.
Stage 1: Cold-Start Supervised Fine-Tuning (SFT): The process begins with the pre-trained base model (e.g., DeepSeek-V3). To address the stability and readability issues observed in R1-Zero, this base model is first fine-tuned on a small but very high-quality, curated dataset of examples featuring long, structured Chain-of-Thought reasoning. This “cold start” SFT stage provides the model with a strong, stable foundation in generating coherent, human-readable reasoning traces.¹⁷
Stage 2: Reasoning-Oriented RL with GRPO: The SFT-tuned model then enters the first RL phase using GRPO. Crucially, the rewards in this stage are not based on subjective human preference but on verifiable outcomes. For math and coding problems, rule-based reward models are used: code is executed in a sandboxed environment to see if it compiles and passes unit tests, or a mathematical answer is checked against the known correct solution (a toy example of such a reward follows this pipeline). This directly incentivizes the model to learn correct, logical reasoning paths.²⁵
Stage 3: Synthetic Data Generation and Further SFT: The model resulting from Stage 2, now possessing stronger reasoning skills, is used to generate a large synthetic dataset of new reasoning problems and solutions (e.g., 800k samples).⁴⁷ This dataset is then filtered for quality (a process called rejection sampling), and the high-quality synthetic data is used for a second round of SFT. This step serves to broaden the model’s capabilities and distill its newfound reasoning skills into a more stable, instruction-following format.⁴²
Stage 4: Final RL Alignment: The pipeline concludes with a final RL phase, again using GRPO. This stage is more akin to traditional RLHF, where the goal is to align the model for general helpfulness and harmlessness, fine-tuning its conversational style and ensuring it adheres to safety guidelines.⁴²
This iterative cycle of SFT → RL → SFT → RL acts as a powerful bootstrapping mechanism for intelligence. SFT provides stability, RL builds capability, and the newly acquired capability is then used to generate higher-quality data for the next SFT stage, creating a virtuous feedback loop.
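To make Stage 2’s verifiable rewards concrete, here is the toy example promised above. Real pipelines execute generated code against unit tests in a sandbox; this sketch merely string-matches a boxed final answer against a reference.

```python
import re

def math_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: 1.0 if the boxed final answer matches, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return float(match is not None and match.group(1).strip() == reference_answer)

print(math_reward(r"... therefore the result is \boxed{42}", "42"))  # 1.0
print(math_reward("I believe the answer is 41", "42"))               # 0.0
```

Because the reward is computed by a rule rather than a learned preference model, it cannot be gamed stylistically: the only way for the policy to score well is to actually reach the correct answer.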
4.4 Knowledge Distillation: Transferring Reasoning from R1 to V3
DeepSeek also employs an innovative post-training technique to enhance its general-purpose models. The specialized reasoning abilities developed in the R1 model through the intensive GRPO pipeline are “distilled” into the standard DeepSeek-V3 model. This process involves a fine-tuning pipeline that elegantly incorporates the verification and reflection patterns learned by R1 into V3. The result is a notable improvement in V3’s reasoning performance on standard benchmarks, without making it as verbose or single-mindedly focused as the dedicated R1 reasoner, thus preserving its general-purpose utility.¹⁴
V. Performance, Benchmarks, and Competitive Analysis
The ultimate measure of DeepSeek’s innovative architecture and training methodologies is its performance on standardized industry benchmarks. A quantitative and qualitative analysis reveals that DeepSeek’s models are not just efficient but also highly competitive, often outperforming established leaders in key areas, particularly those requiring logical reasoning. This performance, when viewed through the lens of cost, establishes a new and disruptive value proposition in the AI market. DeepSeek is not just competing on capability, but on performance per dollar.
5.1 Quantitative Evaluation: Performance Across Key Benchmarks
DeepSeek’s models have consistently posted impressive scores across a range of benchmarks designed to test general knowledge, reasoning, mathematics, and coding.
General Knowledge and Reasoning: On broad, multi-task evaluations, DeepSeek’s flagship models are highly competitive. On the widely cited MMLU (Massive Multitask Language Understanding) benchmark, DeepSeek-V3 achieves a score of 88.5%, and DeepSeek-R1 scores 90.8%, placing them in the same elite tier as GPT-4o at 88.7%.⁵⁵ On the more difficult GPQA benchmark, designed to be “Google-proof,” DeepSeek-V3 outperforms GPT-4o with a score of 59.1% to 46.0%.⁵⁷
Mathematical Reasoning: This is a standout domain for DeepSeek, directly reflecting the success of its GRPO training methodology, which optimizes for verifiable correctness. On the MATH benchmark, DeepSeek-R1 achieves a remarkable score of 97.3%, far surpassing GPT-4o’s 76.6%.⁵⁶ On the competition-level AIME 2024 math problems, DeepSeek-V3 scores 39.2% compared to GPT-4o’s 13.1%.⁵⁷ The precursor model, DeepSeekMath 7B, had already demonstrated performance approaching that of much larger closed-source models like Gemini-Ultra and GPT-4.⁵¹
Coding Prowess: DeepSeek’s Coder series and its general-purpose models demonstrate state-of-the-art coding abilities.
DeepSeek-Coder-V2 achieves an exceptional 90.2% pass@1 rate on the HumanEval benchmark, a score on par with or exceeding the best closed-source models.³⁶ It was also the first open-weight model to surpass a 10% score on the notoriously difficult SWE-Bench, which involves fixing real-world bugs in GitHub repositories.⁵⁹
On the BigCodeBench, a challenging suite of practical programming tasks, both DeepSeek-R1 and DeepSeek-V3 rank in the top 10 globally, competitive with various versions of Claude-3.7 Sonnet, o1, and o3-mini.⁶⁰
DeepSeek-V3 scores 82.6% on the multilingual version of HumanEval, slightly ahead of GPT-4o’s 80.5%.⁵⁵
API Performance: In terms of serving speed, DeepSeek’s first-party API delivers a median output of around 28 tokens per second for both the R1 and V3 models, with a time-to-first-token of approximately 2.5 seconds.⁶¹
5.2 Qualitative Analysis and User Feedback
Beyond the numbers, qualitative assessments and user feedback provide a more nuanced picture of the models’ strengths and weaknesses.
Strengths: Reviewers consistently praise DeepSeek models for their proficiency in structured tasks that require logic and precision, such as mathematics, coding, and technical problem-solving.¹⁵ A key feature highlighted by users is the DeepSeek-R1 model’s ability to expose its reasoning process through detailed Chain-of-Thought outputs, which enhances transparency and trust.⁵
Weaknesses: The models’ intense focus on logical correctness appears to come at the expense of creative and stylistic flair. Some users find that models like Claude 3.5 Sonnet or GPT-4o are superior for creative writing and are more enjoyable for general “daily driving” tasks.⁶² More significant are the concerns around safety and censorship. Analyses have noted that DeepSeek models can have a higher rate of producing unsafe responses compared to some competitors.¹⁵ Furthermore, the models actively self-censor on politically sensitive topics. For instance, when queried about the 1989 Tiananmen Square massacre, the model may initially provide an accurate response but then replace it within seconds with a generic refusal message, raising concerns about transparency and ideological alignment.⁵
Training Data Contamination: Some analysts have observed that the response style of DeepSeek-V3 can be “eerily similar” to that of GPT-4o, down to specific word choices and phrasing. This has led to speculation that DeepSeek’s training data may have been “contaminated” with a large volume of GPT-4o-generated content, either deliberately or inadvertently.⁶²
5.3 Comparative Landscape: DeepSeek vs. The Incumbents
When placed alongside its primary competitors, DeepSeek carves out a distinct position based on a combination of specialized strengths and a disruptive economic model.
DeepSeek vs. OpenAI (GPT-4o, o1): DeepSeek’s R1 and V3 models are positioned as direct competitors. As the benchmarks show, they often meet or exceed OpenAI’s models in reasoning, math, and coding tasks.⁵⁵ OpenAI currently maintains an edge in multimodality, as GPT-4o can process image and audio inputs, a capability DeepSeek’s current models lack (though it is planned for R2).⁵⁷ The most significant differentiator, however, is cost. DeepSeek’s API is priced at a fraction of OpenAI’s, making it roughly 9 to 10 times cheaper for a comparable volume of input and output tokens.⁵⁷ This fundamentally changes the economic viability of building applications on top of high-end AI.
DeepSeek vs. Anthropic (Claude): Anthropic’s Claude 3.5 Sonnet is frequently cited by users as being superior for creative writing and certain coding tasks, with a more polished and “enjoyable” conversational style.⁶² However, DeepSeek’s models consistently demonstrate a lead in quantitative reasoning and mathematics benchmarks.⁶⁵
This performance data, summarized in Table 4, underscores that DeepSeek’s benchmark dominance in specific areas is a predictable outcome of its training priorities. The GRPO pipeline, with its rule-based rewards for verifiable correctness, directly optimizes for the skills measured by benchmarks like MATH and HumanEval.
Table 4: Multi-Benchmark Performance Comparison: DeepSeek vs. Competitors
Benchmark | DeepSeek-V3 | DeepSeek-R1 | GPT-4o | Claude 3.5 Sonnet |
---|---|---|---|---|
MMLU (General Knowledge) | 88.5% | 90.8% | 88.7% | N/A |
MATH (Math Reasoning) | N/A | 97.3% | 76.6% | N/A |
AIME 2024 (Comp. Math) | 39.2% | N/A | 13.1% | N/A |
HumanEval (Code Gen) | 82.6% (Mul) | N/A | 90.2% | N/A |
BigCodeBench (Code Gen) | 34.5 | 35.1 | N/A | 35.8 |
SWE-Bench (Code Repair) | 42.0% | N/A | 33.2% | N/A |
GPQA (Reasoning) | 59.1% | N/A | 46.0% | N/A |
Aider-Polyglot (Code Edit) | 49.6% | 71.0% | 30.7% | N/A |
Scores represent pass@1 accuracy or the equivalent primary metric for each benchmark. Data compiled from sources.⁵⁵
VI. The Developer Ecosystem: Implementation and Deployment
For any AI model to have a real-world impact, it must be accessible, affordable, and legally viable for developers and businesses to build upon. DeepSeek has established a comprehensive ecosystem that addresses these practical considerations through a well-documented API, a disruptive pricing model, a nuanced approach to open-source licensing, and a rapidly growing community of third-party integrations.
6.1 Accessing the Models: The DeepSeek API Platform
DeepSeek provides programmatic access to its models through a straightforward API platform that is compatible with OpenAI’s API framework, simplifying integration for developers familiar with that ecosystem.⁶⁷ The platform offers two primary model endpoints:
deepseek-chat: This endpoint points to a version of the DeepSeek-V3 model and is optimized for general-purpose conversational tasks, content generation, and summarization.⁶⁹
deepseek-reasoner: This endpoint utilizes the DeepSeek-R1 model and is designed specifically for tasks requiring complex, multi-step reasoning, such as advanced mathematics and coding problems.⁶⁹
The API supports a 64K token context length and standard features like JSON output and function calling.⁶⁹ A key detail for the deepseek-reasoner model is that its output token count, for billing purposes, includes the intermediate Chain-of-Thought (CoT) steps it generates before providing the final answer.⁶⁹
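Because the API follows OpenAI’s request format, switching an existing application over is typically a one-line base-URL change. A minimal example (the endpoint URL, model names, and the reasoning_content field follow DeepSeek’s public documentation; the API key is a placeholder):

```python
from openai import OpenAI  # DeepSeek's API is OpenAI-compatible

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",
                base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-reasoner",  # or "deepseek-chat" for general-purpose tasks
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
print(resp.choices[0].message.reasoning_content)  # intermediate CoT (billed as output)
print(resp.choices[0].message.content)            # final answer
```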
The pricing structure is a core part of DeepSeek’s disruptive strategy. It is token-based, with different rates for input and output, but with a unique feature called Context Caching.⁶⁴ This feature is a direct monetization of the model’s internal efficiency. The system automatically detects if the beginning (prefix) of a new prompt is identical to one it has processed recently.
Cache Miss: A completely new prompt is a “cache miss” and is charged at the standard input rate.
Cache Hit: If a prompt prefix is recognized from the cache, those tokens are a “cache hit” and are charged at a drastically reduced rate—often 4-5 times cheaper than a cache miss.⁶⁹
This pricing model creates a powerful financial incentive for developers to design cache-friendly applications (e.g., by structuring multi-turn conversations to reuse the historical context as a static prefix), aligning the user’s economic goals with the system’s computational efficiency.⁶⁴ The standard pricing, detailed in Table 5, is already an order of magnitude cheaper than leading competitors, and the caching mechanism provides a path to even greater cost savings.⁵⁷
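The savings compound quickly in multi-turn use. A back-of-the-envelope calculation using the Table 5 rates for deepseek-chat, assuming a hypothetical 20K-token conversation history that is resent verbatim (and therefore cached) on each turn:

```python
# Table 5 rates for deepseek-chat, in USD per 1M tokens.
MISS, HIT, OUT = 0.27, 0.07, 1.10

history = 20_000   # reused conversation prefix -> cache hit
new_in = 500       # fresh user message -> cache miss
out = 1_000        # model response

with_cache = (history * HIT + new_in * MISS + out * OUT) / 1e6
without_cache = ((history + new_in) * MISS + out * OUT) / 1e6
print(f"${with_cache:.6f} vs ${without_cache:.6f} per turn")  # ~$0.0026 vs ~$0.0066
```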
Table 5: DeepSeek API Pricing and Model Endpoints
API Model Name | Underlying Model | Primary Use Case | Input Price (Cache Miss) | Input Price (Cache Hit) | Output Price | Max Context |
---|---|---|---|---|---|---|
deepseek-chat | DeepSeek-V3 | General, Chat, Content | $0.27 / 1M tokens | $0.07 / 1M tokens | $1.10 / 1M tokens | 64K tokens |
deepseek-reasoner | DeepSeek-R1 | Reasoning, Math, Code | $0.55 / 1M tokens | $0.14 / 1M tokens | $2.19 / 1M tokens | 64K tokens |
Standard pricing as of early 2025. Off-peak discounts may also apply. Data compiled from sources.⁶⁹
6.2 A Dual Approach to Licensing: Open yet Restricted
DeepSeek has adopted a sophisticated dual-license strategy that attempts to balance the viral adoption of open source with the need to enforce responsible use. This approach separates the licensing of the underlying code from the model weights themselves.⁷²
Code License (MIT): The source code for the models, available in DeepSeek’s GitHub repositories, is released under the highly permissive MIT License.⁷³ This allows anyone to freely use, copy, modify, and distribute the code for any purpose, including commercial use, with the only major requirement being the preservation of the original copyright and license notices.⁷⁴
Model License (Custom): The powerful model weights, however, are governed by a custom DeepSeek Model License.⁷² This license has two distinct characteristics:
Permissive Grants: It grants broad, perpetual, worldwide, royalty-free rights for users to use, modify, create derivative works from, and commercially deploy the models. It is not a “copyleft” license, meaning developers who fine-tune a DeepSeek model are not required to open-source their resulting derivative model.⁷³
Use-Based Restrictions: Crucially, the license includes an appendix of explicit use-based restrictions. It prohibits the use of the model and its derivatives for certain activities, including military applications, illegal purposes, generating harmful or false content, and violating personal rights.⁷² This clause is binding and must be included in the license of any derivative works that are distributed.
This dual-license framework is a carefully crafted legal strategy. The permissive MIT license for the code encourages widespread developer adoption and experimentation. The custom model license, with its embedded use-based restrictions, functions as a mandatory Acceptable Use Policy, allowing DeepSeek to foster an open ecosystem while attempting to mitigate the ethical and legal risks associated with the misuse of its powerful technology. However, because these restrictions forbid use in certain fields of endeavor, the model license does not technically meet the definition of “Open Source” as defined by the Open Source Initiative (OSI), making terms like “source-available” or “open-weight” more accurate descriptors.⁷²
6.3 Applications and Integrations: A Survey of the Ecosystem
The combination of high performance, low cost, and a permissive licensing framework has catalyzed the rapid growth of a vibrant ecosystem around DeepSeek’s technology. The models are being integrated into a wide array of applications across numerous industries, from healthcare and finance to software development and education.⁶⁷
A survey of the awesome-deepseek-integration repository on GitHub reveals the breadth of this adoption.⁸ Third-party developers have built:
Chat Clients and Interfaces: Tools like LibreChat, Chatbox, and Chatworm provide desktop and web interfaces for interacting with DeepSeek models.⁸
Developer and IDE Tools: A multitude of plugins for code editors like Neovim and VS Code (e.g., Continue, codecompanion.nvim) leverage DeepSeek-Coder for AI-powered code completion, debugging, and explanation.⁸
Productivity and Research Applications: DeepSeek has been integrated into the popular Zotero reference manager for analyzing research papers, as well as into various translation tools, browser extensions, and IM application bots.⁸
Application Development Platforms: Low-code/no-code platforms like Dify and Wordware allow users to build custom AI applications and workflows powered by DeepSeek’s API.⁸
Significantly, DeepSeek’s models have also achieved legitimacy in the enterprise space through integrations with major Western technology companies. Amazon has made DeepSeek models available on its AWS Bedrock platform, Microsoft offers them on Azure AI Foundry, and partnerships exist with Dell and the AI search engine Perplexity.⁷⁷ These integrations signal a high level of trust in the performance and stability of DeepSeek’s technology, even as broader geopolitical concerns remain.
VII. Synthesis and Future Directions
DeepSeek’s rapid ascent from an obscure spin-off of a Chinese hedge fund to a formidable player on the global AI stage is a story of strategic focus and technical innovation. The company has successfully challenged the prevailing “scaling at all costs” paradigm by engineering a suite of models that deliver state-of-the-art performance with remarkable computational and economic efficiency. This achievement is not the result of a single breakthrough but the synergistic effect of a coherent philosophy that permeates every layer of its technology stack—from its hardware-aware architectural design and meticulous data curation to its novel, efficiency-focused training algorithms.
7.1 Recapitulation of Key Innovations and Their Synergy
The core of “how DeepSeek works” lies in the interplay of several key pillars:
An Efficiency-First Philosophy: Inherited from its quantitative finance parent, High-Flyer, and reinforced by the necessity of operating under hardware constraints, DeepSeek’s entire R&D process is optimized for maximizing performance per unit of compute.
Synergistic Architectural Innovations: The DeepSeekMoE architecture makes training massive models economically feasible by activating only a fraction of parameters per token. Simultaneously, Multi-Head Latent Attention (MLA) makes inference efficient by drastically reducing the memory footprint of the KV cache. These are not independent features but a coupled solution to the primary costs of building and running large models.
Advanced Reinforcement Learning: The development of Group Relative Policy Optimization (GRPO) represents a fundamental optimization of the learning algorithm itself. By eliminating the computationally expensive critic model used in standard PPO, GRPO makes it viable to use RL to teach complex, verifiable reasoning skills at scale.
Data as Architecture: The company’s approach to data curation, particularly the dependency-aware structuring of its code datasets, treats the logical flow of information as a critical component of the model’s learning environment, moving beyond simple data cleaning.
Disruptive Economics: The combination of these efficiencies allows DeepSeek to offer its models via API at a price point that is an order of magnitude lower than its performance-tier competitors, fundamentally altering the economic calculus for developers and businesses.
7.2 Analysis of Limitations, Controversies, and Ethical Considerations
Despite its technical prowess, DeepSeek faces significant challenges and controversies that could temper its future growth, particularly in Western markets. The company’s greatest hurdle may not be technical, but one of trust.
Data Privacy and Geopolitics: As a Chinese company, DeepSeek operates under a different legal and political framework. Its privacy policy states that user data, including prompts and chat history, is stored on servers located in China.⁴ For many Western enterprises concerned with data sovereignty, security, and potential government access, this is a significant barrier to adoption. As one analyst noted, it is unlikely that a U.S. Global 2000 company would choose to build its core AI infrastructure on a Chinese startup’s platform, regardless of cost or performance advantages.⁴
Censorship and Transparency: The models have been observed to actively self-censor on topics deemed politically sensitive by the Chinese government. This practice raises concerns about transparency, ideological alignment, and the reliability of the models as sources of objective information.⁵
Safety and Alignment: While the model license includes use-based restrictions to prevent harmful applications, some independent analyses have found that DeepSeek’s models may produce a higher rate of unsafe or biased responses compared to some competitors. This suggests a need for more robust safety protocols and alignment techniques.¹⁵
Hardware Dependency: While DeepSeek has demonstrated remarkable ingenuity in optimizing for less powerful hardware, its ability to train the next generation of even larger models remains fundamentally dependent on access to high-performance GPUs, a supply chain that is subject to ongoing geopolitical tensions and trade restrictions.¹
7.3 The Future Trajectory: DeepSeek-R2 and the Path to AGI
DeepSeek’s stated mission is the long-term pursuit of AGI, and its roadmap indicates a clear strategy to move from catching up with the state of the art to defining it.⁸⁰ The announced DeepSeek-R2 model, scheduled for release in early 2025, signals a strategic shift from achieving parity on existing modalities to pioneering the next frontier of capabilities.⁶³
The planned features for DeepSeek-R2 include:
Robust Multimodal Capabilities: A primary focus is on introducing the ability to natively process and reason across text, images, audio, and basic video. This is a direct move to compete with the next generation of models like GPT-4o and Gemini, where multimodality is a core feature.⁶³
Advanced Multilingual Reasoning: The goal is to move beyond strong performance in just English and Chinese to achieve consistent, high-level reasoning across a much broader set of global languages.⁶³
Novel Training Techniques: DeepSeek has indicated that R2 will be trained with new, proprietary methods such as Generative Reward Modeling (GRM) and Self-Principled Critique Tuning. This suggests a move beyond refining existing paradigms (like PPO to GRPO) and toward the development of fundamentally new approaches to model training and alignment.⁶³
This evolution from the V3/R1 generation to R2 marks a maturation of DeepSeek’s R&D efforts. The first wave of models proved they could compete with the world’s best by leveraging extreme efficiency. The next wave aims to lead innovation by defining new capabilities and training methodologies. Backed by a stable financial parent in High-Flyer and guided by a long-term, research-focused vision, DeepSeek has firmly established itself as an enduring and influential force in the quest to develop artificial general intelligence.
Cited works
DeepSeek - Wikipedia, https://en.wikipedia.org/wiki/DeepSeek
What is DeepSeek? Here’s a quick guide to the Chinese AI company | PBS News, https://www.pbs.org/newshour/science/what-is-deepseek-heres-a-quick-guide-to-the-chinese-ai-company
What is DeepSeek AI? (Features, OpenAI Comparison, & More) - Exploding Topics, https://explodingtopics.com/blog/deepseek-ai
What is DeepSeek, and why is it causing Nvidia and other stocks to slump? - CBS News, https://www.cbsnews.com/news/what-is-deepseek-ai-china-stock-nvidia-nvda-asml/
What Is DeepSeek? Everything to Know About the New Chinese AI Tool - CNET, https://www.cnet.com/tech/services-and-software/what-is-deepseek-everything-to-know-about-the-new-chinese-ai-tool/
DeepSeek V3 vs GPT-4o: Which is Better? - Analytics Vidhya, https://www.analyticsvidhya.com/blog/2024/12/gpt-4o-vs-deepseek-v3/
DeepSeek: Everything you need to know about this new LLM in one place - Daily.dev, https://daily.dev/blog/deepseek-everything-you-need-to-know-about-this-new-llm-in-one-place
deepseek-ai/awesome-deepseek-integration: Integrate the DeepSeek API into popular softwares - GitHub, https://github.com/deepseek-ai/awesome-deepseek-integration
DeepSeek AI Versions Breakdown : A Detailed Guide to Every Versions, https://www.oneclickitsolution.com/centerofexcellence/aiml/deepseek-ai-versions-breakdown-detailed-guide-to-every-versions
deepseek-ai/DeepSeek-Coder: DeepSeek Coder: Let the … - GitHub, https://github.com/deepseek-ai/DeepSeek-Coder
DeepSeek LLM: Let there be answers - GitHub, https://github.com/deepseek-ai/DeepSeek-LLM
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model - arXiv, https://arxiv.org/html/2405.04434v4
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence : r/LocalLLaMA - Reddit, https://www.reddit.com/r/LocalLLaMA/comments/1dhx449/deepseekcoderv2_breaking_the_barrier_of/
deepseek-ai/DeepSeek-V3 - GitHub, https://github.com/deepseek-ai/DeepSeek-V3
DeepSeek-V3 Technical Report - ResearchGate, https://www.researchgate.net/publication/387512415_DeepSeek-V3_Technical_Report
DeepSeek AI: Advancing Open-Source LLMs with MoE & Reinforcement Learning | DeepSeek-R1 & V3 Explained - Inferless, https://www.inferless.com/learn/the-ultimate-guide-to-deepseek-models
Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies - arXiv, https://arxiv.org/html/2501.17030v1
DeepSeek MoE & V2 - Creative Strategies, https://creativestrategies.com/deepseek-moe-v2/
DeepSeek and the Power of Mixture of Experts (MoE) - DEV Community, https://dev.to/sayed_ali_alkamel/deepseek-and-the-power-of-mixture-of-experts-moe-ham
DeepSeek-V3: Overview and Insights from arXiv - BytePlus, https://www.byteplus.com/en/topic/375666
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models - arXiv, https://arxiv.org/html/2401.06066v1
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models - arXiv, https://arxiv.org/pdf/2401.06066
Yet Another DeepSeek Overview - Tamanna Hossain-Kay, https://www.tamanna-hossain-kay.com/post/2025/02/08/deepseek/
deepseek-ai/DeepSeek-V2-Lite - Hugging Face, https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
DeepSeek-V3 Technical Report - The VITALab website, https://vitalab.github.io/article/2025/02/11/DeepSeekV3.html
The DeepSeek Series: A Technical Overview - Martin Fowler, https://martinfowler.com/articles/deepseek-papers.html
DeepSeek-V3 Technical Report - arXiv, https://arxiv.org/html/2412.19437v1
A Review of DeepSeek Models’ Key Innovative Techniques - arXiv, https://arxiv.org/pdf/2503.11486?
Multi-Head Latent Attention: Benefits in Memory and Computation — Blog - DataCrunch, https://datacrunch.io/blog/multi-head-latent-attention-benefits-in-memory-and-computation
Hardware-Centric Analysis of DeepSeek’s Multi-Head Latent Attention - arXiv, https://www.arxiv.org/pdf/2506.02523
junfanz1/MiniGPT-and-DeepSeek-MLA-Multi-Head-Latent-Attention - GitHub, https://github.com/junfanz1/MiniGPT-and-DeepSeek-MLA-Multi-Head-Latent-Attention
DeepSeek-V2 Large Language Model (LLM) Architecture: An Introduction - Metric Coders, https://www.metriccoders.com/post/deepseek-v2-large-language-model-llm-architecture-an-introduction
DeepSeek V2 · Models - Dataloop, https://dataloop.ai/library/model/deepseek-ai_deepseek-v2/
Top 5 Most Successful Data Curation Strategies in DeepSeek | Label Studio, https://labelstud.io/blog/top-5-most-successful-data-curation-strategies-in-deepseek/
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence - arXiv, https://arxiv.org/pdf/2401.14196
DeepSeek-Coder-V2 Tutorial: Examples, Installation, Benchmarks | DataCamp, https://www.datacamp.com/tutorial/deepseek-coder-v2
DeepSeek Explained: Why This AI Model Is Gaining Popularity | DigitalOcean, https://www.digitalocean.com/resources/articles/deepseek-explained
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures - arXiv, https://arxiv.org/html/2505.09343v1
Deepseek V3 is officially released (code, paper, benchmark results) : r/LocalLLaMA - Reddit, https://www.reddit.com/r/LocalLLaMA/comments/1hmmtt3/deepseek_v3_is_officially_released_code_paper/
How DeepSeek-R1 and Kimi k1.5 Use Reinforcement Learning to Improve Reasoning, https://www.deeplearning.ai/the-batch/how-deepseek-r1-and-kimi-k1-5-use-reinforcement-learning-to-improve-reasoning/
The State of Reinforcement Learning for LLM Reasoning - Sebastian Raschka, https://sebastianraschka.com/blog/2025/the-state-of-reinforcement-learning-for-llm-reasoning.html
DeepSeek-R1: How Did They Make an OpenAI-Level Reasoning Model So Damn Efficient? : r/singularity - Reddit, https://www.reddit.com/r/singularity/comments/1i9lkbh/deepseekr1_how_did_they_make_an_openailevel/
Bite: How Deepseek R1 was trained - Philschmid, https://www.philschmid.de/deepseek-r1
DeepSeek R1 Explained: A Cost-Efficient Reasoning Focused LLM - Turing, https://www.turing.com/resources/understanding-deepseek-r1
Beyond Supervised Fine Tuning: How Reinforcement Learning Empowers AI with Minimal Labels - Fireworks AI, https://fireworks.ai/blog/reinforcement-learning-with-verifiable-reward
GRPO (Group Relative Policy Optimization) explanation compared to PPO - Reddit, https://www.reddit.com/r/ChatGPTPro/comments/1ibph6u/grpo_group_relative_policy_optimization/
Why GRPO is Important and How it Works - Oxen.ai, https://ghost.oxen.ai/why-grpo-is-important-and-how-it-works/
PPO vs. GRPO: The Future of AI Training | OpenAI o1 vs. DeepSeek R1 - Appy Pie Automate, https://www.appypieautomate.ai/blog/comparison/openai-o1-ppo-vs-deepseek-r1-grpo
Demystifying Policy Optimization in RL: An Introduction to PPO and GRPO, https://towardsdatascience.com/demystifying-policy-optimization-in-rl-an-introduction-to-ppo-and-grpo/
GRPO Reinforcement Learning Explained (DeepSeekMath Paper) - AI Papers Academy, https://aipapersacademy.com/deepseekmath-grpo/
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models - arXiv, https://arxiv.org/pdf/2402.03300
Group Relative Policy Optimization (GRPO) Illustrated Breakdown - Ebrahim Pichka, https://epichka.com/blog/2025/grpo/
Training DeepSeek-R1: The Math Behind Group Relative Policy Optimization (GRPO), https://wordgptpro.com/blog/deepseek-r1-grpo-training
Understanding DeepSeek R1—A Reinforcement Learning-Driven Reasoning Model, https://kili-technology.com/large-language-models-llms/understanding-deepseek-r1
DeepSeek V3 vs ChatGPT 4o - Codefinity, https://codefinity.com/blog/DeepSeek-V3-vs-ChatGPT-4o
GPT-4o vs Deepseek-R1 - Eden AI, https://www.edenai.co/post/gpt-4o-vs-deepseek-r1
GPT-4o vs DeepSeek-V3 - LLM Stats, https://llm-stats.com/models/compare/gpt-4o-2024-08-06-vs-deepseek-v3
DeepSeek Model V3: Architecture, Performance, Downloads & Use Cases - MuneebDev, https://muneebdev.com/deepseek-model-v3-guide/
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence - arXiv, https://arxiv.org/html/2406.11931v1
BigCodeBench Leaderboard, https://bigcode-bench.github.io/
DeepSeek: Models Intelligence, Performance & Price - Artificial Analysis, https://artificialanalysis.ai/providers/deepseek
Notes on Deepseek v3: Is it truly better than GPT-4o and 3.5 Sonnet? - Reddit, https://www.reddit.com/r/LocalLLaMA/comments/1hr56e3/notes_on_deepseek_v3_is_it_truly_better_than/
DeepSeek-R2: China’s Powerful New AI Model for 2025, https://deepseek.ai/blog/deepseek-r2-ai-model-launch-2025
DeepSeek Pricing: An Affordable AI Solution - Lark, https://www.larksuite.com/en_us/blog/deepseek-pricing
Build with DeepSeek R1 API, https://aimlapi.com/build-with-deepseek-r1-api
DeepSeek R1 0528 Hits 71% (+14.5 pts from R1) on Aider Polyglot Coding Leaderboard : r/LocalLLaMA - Reddit, https://www.reddit.com/r/LocalLLaMA/comments/1l76ab7/deepseek_r1_0528_hits_71_145_pts_from_r1_on_aider/
DeepSeek Use Cases: Exploring Advanced AI and Its Real-World Applications - Bitrue, https://www.bitrue.com/blog/deepseek-ai-use-cases-and-disclaimer
Deepseek API Complete Guide: Mastering the DeepSeek API for Developers | Zuplo Blog, https://zuplo.com/blog/2025/03/07/deepseek-api
Models & Pricing - DeepSeek API Docs, https://api-docs.deepseek.com/quick_start/pricing
DeepSeek API: A Guide With Examples and Cost Calculations - DataCamp, https://www.datacamp.com/tutorial/deepseek-api
Get Started with DeepSeek R1 API: Setup, Usage, and Pricing - Cody, https://meetcody.ai/blog/deepseek-r1-api-pricing/
DeepSeek model license review | Black Duck Blog, https://www.blackduck.com/blog/deepseek-license.html
DeepSeek License FAQ, https://deepseeklicense.github.io/
DeepSeek-V2/LICENSE-CODE at main - GitHub, https://github.com/deepseek-ai/DeepSeek-V2/blob/main/LICENSE-CODE
It seems like people need an explanation of what OpenSource, MIT License means : r/DeepSeek - Reddit, https://www.reddit.com/r/DeepSeek/comments/1ia28ts/it_seems_like_people_need_an_explanation_of_what/
DeepSeek-V2/LICENSE-MODEL at main - GitHub, https://github.com/deepseek-ai/DeepSeek-V2/blob/main/LICENSE-MODEL
Top 10 DeepSeek Use Cases to Explore, https://www.straive.com/blogs/top-10-deepseek-use-cases-to-explore/
7 Real-world Applications of DeepSeek V3 - Analytics Vidhya, https://www.analyticsvidhya.com/blog/2025/02/real-world-applications-of-deepseek-v3/
DeepSeek’s reasoning AI shows power of small models, efficiently trained | IBM, https://www.ibm.com/think/news/deepseek-r1-ai
deepseek-ai (DeepSeek) - Hugging Face, https://huggingface.co/deepseek-ai