What makes DeepSeek so special?

Mayank Pratap Singh

DeepSeek has developed a series of LLMs with major optimizations in efficiency and reasoning capabilities.

Let's first look at the different LLMs built by DeepSeek.

DeepSeek's website is deepseek.com; scroll down to the footer to see its various models.

The main model that caught attention is DeepSeek-R1.

  1. DeepSeek LLM (V1) – Focused on math and coding.
  2. DeepSeek-V2 – Optimized for coding.
  3. DeepSeek-V3 (671B parameters) – A large-scale model.
  4. DeepSeek-R1 – A reasoning model.

DeepSeek-R1 offers performance comparable to OpenAI's top models at a fraction of the cost while being open-source.

DeepSeek = Fraction of the cost + Open Source

DeepSeek-V3 has 671 billion parameters, and DeepSeek-R1 is derived from DeepSeek-V3.

OpenAI vs DeepSeek (Math Puzzle Test)

Let's compare DeepSeek with an OpenAI model on a math problem involving integrals.

We have the following integral question:

∫₀¹ x⁴(1 − x)⁴ / (1 + x²) dx

The interesting aspect of this problem is its answer: 22/7 − π.

Wikipedia: Proof that 22/7 exceeds π
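If you want to verify the closed form yourself, here's a minimal SymPy sketch (assuming sympy is installed):

```python
# Quick check of the integral's closed form with SymPy.
import sympy as sp

x = sp.symbols('x')
integrand = x**4 * (1 - x)**4 / (1 + x**2)

result = sp.integrate(integrand, (x, 0, 1))
print(result)  # expected: 22/7 - pi
```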

We provided this problem to both LLMs:

ChatGPT's answer: 2/105 (incorrect).

DeepSeek's answer: it solved the problem correctly.

This is just one example, and a single test is not enough for a full comparison, but it does show DeepSeek performing well here.

Here’s a comparison table between DeepSeek-V3 and GPT-4 based on key factors:

| Feature | DeepSeek-V3 | GPT-4 |
| --- | --- | --- |
| Performance | Similar or superior on many tasks | High performance across various benchmarks |
| Cost | Few cents per million tokens | Few dollars per million tokens |
| Hosting | Can be self-hosted | Closed-source API only |
| Token Pricing | Very low | Relatively high |
| Open/Closed Source | Open-source | Closed-source |
| Model Size | 671 billion parameters | Estimated 1 trillion+ parameters (exact size unknown) |
| Hardware Requirements | High computational resources needed for self-hosting | Not applicable (hosted by OpenAI) |

Evaluation Scores of different models

An analysis of evaluation scores vs pricing reveals that DeepSeek-V3 stands out as the only top-performing model with an affordable price. It offers high performance at a significantly lower cost compared to other models like GPT-4. Source

Here’s a comparison between DeepSeek-V3 and LLaMA 2:

| Feature | DeepSeek-V3 | LLaMA 2 |
| --- | --- | --- |
| Developer | DeepSeek AI | Meta |
| Open-Source | Yes | Yes |
| Scalability | Larger-scale model with Mixture of Experts (MoE) | Fixed model sizes (7B, 13B, 70B) |
| Innovation | Uses Reinforcement Learning (RL), Mixture of Experts (MoE), and advanced training techniques | Standard transformer-based training |

Both DeepSeek-V3 and LLaMA 2 contribute to the open AI ecosystem, but DeepSeek-V3 surpasses LLaMA 2's largest 70B variant in scale and performance. With innovations like Mixture of Experts and Reinforcement Learning, DeepSeek-V3 offers greater capability and efficiency, making it a more advanced option for AI applications.

DeepSeek: Strengths and Weaknesses

Strengths of DeepSeek

DeepSeek offers several advantages, making it a compelling choice for AI applications:

Open-Source Freedom

  • Provides full control and transparency, allowing customization and self-hosting.

Cost Efficiency

  • Significantly cheaper than proprietary models like GPT-4 and Claude.
  • Reduces operational costs, making it ideal for budget-conscious teams.

Competitive Performance

  • Excels in reasoning, math, and coding tasks.
  • Strong performance compared to closed-source alternatives.

Weaknesses of DeepSeek

While DeepSeek is powerful, it comes with certain challenges:

Relatively New Model

  • Less polished and refined than GPT-4 or Claude in default settings.
  • Requires users to implement their own safeguards for safety and responsible AI use.

High Infrastructure Requirements

  • Deploying a 671B-parameter model is complex and demands significant computing power.
  • Smaller DeepSeek variants are available, but hosting still requires robust infrastructure.

Considerations for Adoption

Organizations must carefully assess these factors before choosing DeepSeek:

  • Big enterprises may avoid DeepSeek due to concerns around guardrails, safety, and compliance.
  • Small, lean, fast-growing startups can significantly cut costs by leveraging DeepSeek’s open-source capabilities.

DeepSeek is a powerful yet evolving AI model, best suited for organizations willing to invest in customization and infrastructure.

What Is Special About DeepSeek?

DeepSeek delivers similar performance to expensive models at a significantly lower cost.

We have four major things to talk about:

  • Innovative Architecture
  • Training Methodology
  • GPU Optimization Tricks
  • Model Ecosystem

Innovative Architecture

  • Multi-Head Latent Attention
  • Mixture of Experts (MoE)
  • Multi-Token Prediction (MTP)
  • Quantization
  • Rotary Positional Encodings (RoPE)

We will cover each of these in its own blog, but for now let's take a surface-level overview.

Multi-Head Latent Attention

To put it simply, in Multi-Head Latent Attention the Key and Value representations are mapped into a compact latent space instead of being derived directly from the input tokens, so only that small latent needs to be cached.
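Here's a toy PyTorch sketch of that idea. The dimensions and layer names are illustrative, not DeepSeek's actual implementation, and details like causal masking and MLA's RoPE-decoupled path are omitted:

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Toy latent-KV attention: cache a small latent instead of full K/V."""
    def __init__(self, d_model=512, d_latent=64, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)  # only this output is cached
        self.k_up = nn.Linear(d_latent, d_model)     # K reconstructed from latent
        self.v_up = nn.Linear(d_latent, d_model)     # V reconstructed from latent
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, T, D = x.shape
        latent = self.kv_down(x)                     # (B, T, d_latent) -> KV cache
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_up(latent)), split(self.v_up(latent))
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        return self.out((att @ v).transpose(1, 2).reshape(B, T, D))

x = torch.randn(2, 10, 512)
print(LatentKVAttention()(x).shape)  # torch.Size([2, 10, 512])
```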

Mixture of Experts

Illustration Source: A Visual Guide to Mixture of Experts

In the Mixture of Experts (MoE) architecture, the model consists of multiple expert networks (e.g., four experts in this case). However, instead of activating the entire model for every input, only a subset of these expert networks is used at a given time.

Routing

Illustration Source: A Visual Guide to Mixture of Experts

A specialized routing mechanism determines which experts should be activated for a particular input. This dynamic activation strategy significantly reduces computational costs while maintaining high performance. By selectively utilizing only the most relevant experts, MoE improves efficiency and scalability, making it highly suitable for large-scale models.
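A minimal top-k routing sketch in PyTorch illustrates the mechanism. Real MoE layers add load-balancing objectives, shared experts, and capacity limits, so treat this as the bare idea only:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy MoE layer: route each token to its top-2 of 4 expert MLPs."""
    def __init__(self, d_model=512, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # routing score per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model)
        scores = self.router(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):               # only chosen experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(16, 512)
print(TinyMoE()(x).shape)  # torch.Size([16, 512])
```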

Multi-Token Prediction

Previously, in traditional LLMs, the model generated text one token at a time, where each predicted token was fed back into the model to generate the next one. While this approach ensured high accuracy, it was computationally expensive and slowed down inference.

With Multi-Token Prediction (MTP), instead of predicting a single token at each step, the model generates multiple tokens in parallel. This significantly accelerates text generation and improves efficiency without compromising quality.

By leveraging this technique, DeepSeek and similar models can enhance performance, reduce latency, and optimize resource usage, making large-scale language modeling more practical for real-world applications.
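As a rough illustration, here's a toy sketch in which extra output heads emit logits for several future positions at once. DeepSeek's actual MTP module is wired differently (it chains lightweight transformer blocks), so this only shows the core idea of producing multiple future-token distributions per step:

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Toy sketch: predict the next k tokens from one hidden state."""
    def __init__(self, d_model=512, vocab=32000, k=4):
        super().__init__()
        # One output head per future position t+1 ... t+k.
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(k))

    def forward(self, h):                 # h: (batch, d_model) final hidden state
        # Each head gives logits for a different future offset, in one pass.
        return torch.stack([head(h) for head in self.heads], dim=1)

h = torch.randn(2, 512)
print(MultiTokenHead()(h).shape)  # torch.Size([2, 4, 32000])
```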

Quantization

Instead of representing every parameter with high-precision floating-point numbers, quantization reduces the numerical precision by using fewer bits. This means storing and computing values with lower precision, such as int8 instead of float32, which significantly reduces memory usage and speeds up computations.

Although this process slightly reduces the model's precision, it retains almost the same quality of output while making the model more efficient and lightweight. This optimization is crucial for deploying large models on resource-constrained devices without sacrificing too much performance.
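A bare-bones int8 example in NumPy shows the mechanism. (DeepSeek-V3 actually trains with FP8 mixed precision; this is just the general idea the paragraph describes:)

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: float32 -> int8 + one scale."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())    # small rounding error, 4x less memory
```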

Rotary Positional Encoding

Illustration Source: karthick.ai

Traditional transformers use absolute positional encodings to help the model understand token order. Rotary Positional Encoding (RoPE), on the other hand, encodes positional information directly into the attention mechanism using rotational transformations. This allows the model to capture relative positional relationships more effectively, improving its ability to handle long-range dependencies and generalize better to unseen sequences.
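Here's a compact sketch of the rotation itself, using one common "half-split" formulation; real implementations apply this inside attention to the query and key vectors:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Toy RoPE: rotate feature pairs by a position-dependent angle."""
    seq, dim = x.shape                               # (positions, even feature dim)
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied pairwise; relative position falls out of q.k dot products.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)
print(rope(q).shape)                                 # torch.Size([8, 64])
```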

Training Methodology

  • The Rise of Reinforcement Learning (RL)
  • Rule-Based Reward System

The DeepSeek paper revitalized Reinforcement Learning (RL) by integrating it into model training at a large scale. Instead of relying solely on human-labeled data, DeepSeek utilized Large-Scale RL to teach the model complex reasoning skills.

Group Relative Policy Optimization (GRPO)

As part of this approach, DeepSeek introduced Group Relative Policy Optimization (GRPO), a novel framework that enhances reinforcement learning efficiency.

This is one of the key reasons why DeepSeek excels at reasoning tasks, making it a highly capable model for complex problem-solving.
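At its core, GRPO samples a group of answers per prompt and scores each answer against the group's own mean, rather than training a separate value model. A minimal sketch of that advantage computation (the clipped policy-ratio objective and KL penalty are omitted):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward within its own group.

    rewards: (num_prompts, group_size) -- one row per prompt, one column per
    sampled completion, scored e.g. by a rule-based reward.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)   # above-average answers get pushed up

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # e.g. correct/incorrect math answers
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```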


GPU Optimization Tricks

  • NVIDIA Parallel Thread Execution (PTX)

Instead of using traditional CUDA programming, DeepSeek leveraged Parallel Thread Execution (PTX) to optimize performance at a lower level.

Read more: Parallel Thread Execution (PTX) Explained

Understanding PTX: A Simple Analogy

Think of CUDA as writing high-level code in Python or Java, whereas PTX is like bytecode—an intermediate representation that runs closer to machine code.


By using PTX, DeepSeek achieved better low-level GPU optimizations, leading to faster and more efficient computations.

High-level programming languages do not operate directly at the machine level, making them more abstract but less efficient in execution. Machine-level code, on the other hand, runs at the fastest possible speed since it interacts directly with hardware.

Read more: DeepSeek’s AI Breakthrough: Bypassing CUDA with PTX

This approach played a crucial role in speeding up computations and enhancing architectural efficiency, making DeepSeek a standout model in AI development.


Model Ecosystem

One of DeepSeek’s key strengths is model distillation, where larger models are compressed into smaller versions without significant performance loss.

For example, DeepSeek has successfully distilled models down to just 1.5B parameters, making AI more accessible and deployable on limited hardware.
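For a rough feel of what distillation means, here is the textbook soft-label loss, where a small student mimics the big teacher's output distribution. (The R1 distilled models were actually produced by fine-tuning smaller models on R1-generated reasoning traces, so treat this purely as a sketch of the general compression idea:)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """Classic soft-label distillation: KL between softened distributions."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T

teacher = torch.randn(4, 32000)   # logits from the large model
student = torch.randn(4, 32000)   # logits from the small student
print(distillation_loss(student, teacher))
```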


Why Is DeepSeek a Turning Point in History?

DeepSeek proved that a small startup could reach parity with the best AI models by using novel techniques and fewer resources.

By drastically reducing the cost to develop and operate large AI models, DeepSeek is democratizing AI. Its low-cost development (~$5.6M for V3) raised concerns about the sustainability of big-tech companies pouring billions into AI research.

Global Market Impact

The global financial markets reacted swiftly:

  • News of DeepSeek’s breakthrough contributed to a significant drop in U.S. tech stocks in January 2025.
  • The Nasdaq fell by 3.4%, and Nvidia’s market cap plunged as investors reconsidered the AI hardware demand landscape.

Geopolitical Implications

DeepSeek's low-cost, open-source AI model posed a direct challenge to companies like OpenAI, Microsoft, and Google. It also raised concerns about AI supply chains and GPU markets, prompting:

  • Countries to invest in developing their own foundational models to reduce dependence on external AI technologies.
  • Strategic shifts in the AI and semiconductor industries as nations reassessed their AI infrastructure and funding strategies.

Source

This blog is based on my personal notes from a Vizuara video on DeepSeek. You can check it out for a more in-depth explanation.

Vizuara Video on DeepSeek

That’s it for now! See you soon.