DeepSeek-V3 & MindSpore: Training Deployment Made Easy

2025/02/20

Background

Exciting News for AI Developers: DeepSeek-V3 is Here!

The powerful DeepSeek-V3 model is now integrated with the MindSpore AI framework, running on Ascend AI hardware. This unlocks ready-to-use pre-training and inference, validated by large-scale cluster deployments.

Thanks to MindSpore's foundation model kits and its multi-dimensional hybrid distributed capabilities, automatic parallelism, and dryrun simulation, adapting DeepSeek-V3's advanced architectures was remarkably fast. Furthermore, MindSpore ensures efficient inference deployment, with optimizations for complex structures like MLA and DeepSeekMoE.

Start building with DeepSeek-V3 on MindSpore today! The code is open source:

- Pre-training code: https://github.com/mindspore-lab/mindformers/tree/dev/research/deepseek3

- Inference code: https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3

Let's now proceed with the tutorial on DeepSeek-V3 pre-training deployment.

Environment Setup

MindSpore Transformers supports DeepSeek-V3 pre-training. The repository includes a sample configuration file for 128 Atlas 800T A2 (64 GB) servers, using the WikiText-2 dataset as the example. Refer to the README for details.

Want to try it on a single server? We provide a modified configuration that reduces the DeepSeek-V3 parameter count so that pre-training can run on one Atlas 800T A2 (64 GB) server.

To prepare for training, an Atlas 800T A2 (64 GB) server is required. The environment dependencies for MindSpore Transformers are as follows:

- Python 3.10

- MindSpore 2.4.10

- CANN 8.0.RC3.beta1

- Firmware and driver 24.1.RC3
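Before launching anything, it can help to compare the versions installed on the server against the list above. The following is a minimal sketch and not part of MindSpore Transformers: the REQUIRED table copies the versions listed here, while find_mismatches and the installed dictionary are hypothetical names. How you collect the installed versions for MindSpore, CANN, and the driver is left to you.

```python
import sys

# Versions copied from the dependency list above.
REQUIRED = {
    "python": "3.10",
    "mindspore": "2.4.10",
    "cann": "8.0.RC3.beta1",
    "firmware_driver": "24.1.RC3",
}


def find_mismatches(installed, required=REQUIRED):
    """Return {name: (found, expected)} for every component that differs.

    A component missing from `installed` is reported with found=None.
    Python is matched on major.minor only, so 3.10.14 satisfies 3.10.
    """
    mismatches = {}
    for name, expected in required.items():
        found = installed.get(name)
        if name == "python" and found is not None:
            ok = ".".join(found.split(".")[:2]) == expected
        else:
            ok = found == expected
        if not ok:
            mismatches[name] = (found, expected)
    return mismatches


if __name__ == "__main__":
    # Only the Python version is detected automatically in this sketch;
    # add the other entries by hand for your environment.
    installed = {"python": ".".join(str(p) for p in sys.version_info[:3])}
    for name, (found, expected) in find_mismatches(installed).items():
        print(f"{name}: found {found}, expected {expected}")
```

Using the dedicated Docker image described next should make this check unnecessary, since the image name indicates it is built for MindSpore 2.4.10.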

MindSpore offers a dedicated Docker image for DeepSeek-V3 pre-training. Follow these steps for usage.

1. Download the Docker image.

Use the following command to download the Docker image.

docker pull swr.cn-central-221.ovaijisuan.com/mindformers/deepseek_v3_mindspore2.4.10-train:20250209

2. Create a container from the image.

image_name=swr.cn-central-221.ovaijisuan.com/mindformers/deepseek_v3_mindspore2.4.10-train:20250209
docker_name=deepseek_v3
docker run -itd -u root \
--ipc=host --net=host \
--privileged \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /etc/localtime:/etc/localtime \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/bin/hccn_tool \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /var/log/npu:/usr/slog \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /etc/hccn.conf:/etc/hccn.conf \
--name "$docker_name" \
"$image_name" \
/bin/bash

3. Access the container environment.

Enter the newly created container and navigate to the designated code directory using the command below.

docker exec -ti deepseek_v3 bash
cd /home/work/mindformers
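The docker run command in step 2 passes one --device=/dev/davinciN flag per NPU, eight in total for this server, followed by three management devices. As a minimal sketch, the helper below generates that part of the argument list for a given NPU count. device_flags is a hypothetical name, and the management device list is simply copied from the command above.

```python
# Management devices copied from the docker run command in step 2.
MANAGEMENT_DEVICES = ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]


def device_flags(npu_count=8):
    """Build the --device flags: one per NPU, then the management devices."""
    if npu_count < 1:
        raise ValueError("npu_count must be at least 1")
    npus = [f"/dev/davinci{i}" for i in range(npu_count)]
    return [f"--device={path}" for path in npus + MANAGEMENT_DEVICES]


if __name__ == "__main__":
    # Print the flags in the same backslash-continued layout as the command.
    print(" \\\n".join(device_flags(8)))
```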

Dataset Preparation

Taking the WikiText-2 dataset as an example, follow the steps below to convert the dataset into a Megatron BIN file.

1. Download the dataset and tokenizer model file.

Dataset: WikiText-2 dataset

Tokenizer model: DeepSeek-V3 tokenizer.json

2. Generate a Megatron BIN file.

Place the dataset file wiki.train.tokens and the tokenizer model file tokenizer.json under the /home/work/dataset directory.

Use the following command to convert the dataset file to Megatron BIN format.

cd /home/work/mindformers/research/deepseek3
python wikitext_to_bin.py \
--input /home/work/dataset/wiki.train.tokens \
--output-prefix /home/work/dataset/wiki_4096 \
--vocab-file /home/work/dataset/tokenizer.json \
--seq-length 4096 \
--worker 1
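After the conversion finishes, it is worth confirming that the output landed where the training configuration expects it. The sketch below rests on two assumptions: that the script follows the usual Megatron indexed-dataset convention of writing a .bin data file plus an .idx index file, and that it appends _text_document to the output prefix (inferred from the data_path value used later in this tutorial). expected_outputs and missing_outputs are hypothetical helpers.

```python
from pathlib import PurePosixPath, Path


def expected_outputs(output_prefix, suffix="_text_document"):
    """Return the (.bin, .idx) paths assumed to be produced for a prefix."""
    base = f"{output_prefix}{suffix}"
    return PurePosixPath(base + ".bin"), PurePosixPath(base + ".idx")


def missing_outputs(output_prefix):
    """List the expected files that do not exist on disk."""
    return [p for p in expected_outputs(output_prefix) if not Path(p).exists()]


if __name__ == "__main__":
    # The prefix matches --output-prefix in the conversion command above.
    for path in missing_outputs("/home/work/dataset/wiki_4096"):
        print(f"missing: {path}")
```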

Configuration Example

This procedure outlines the steps to configure a single-node environment for DeepSeek-V3 pre-training. Start with the pretrain_deepseek3_671b.yaml configuration file and save the modified version as pretrain_deepseek3_1b.yaml.

1. Adjust model configuration.

# model config
model:
  model_config:
    type: DeepseekV3Config
    auto_register: deepseek3_config.DeepseekV3Config
    seq_length: 4096
    hidden_size: 2048                               # Set the value to 2048.
    num_layers: &num_layers 3                       # Set the value to 3.
    num_heads: 8                                    # Set the value to 8.
    max_position_embeddings: 4096
    intermediate_size: 6144                         # Set the value to 6144.
    offset: 0                                       # Set the value to 0.
    ……

2. Adjust mixture-of-experts (MoE) configuration.

Update the MoE section of the configuration as follows.

#moe
moe_config:
  expert_num: &expert_num 16                      # Set the value to 16.
  first_k_dense_replace: 1                        # Set the value to 1.
  ……

3. Modify parallel configuration.

# parallel config for devices num=8
parallel_config:
  data_parallel: 2                                  # Set the value to 2.
  model_parallel: 2                                 # Set the value to 2.
  pipeline_stage: 2                                 # Set the value to 2.
  expert_parallel: 2                                # Set the value to 2.
  micro_batch_num: &micro_batch_num 4               # Set the value to 4.
parallel:
  parallel_optimizer_config:
    optimizer_weight_shard_size: 8                  # Set the value to 8.
  ……

4. Adjust learning rate configuration.

# lr schedule
lr_schedule:
  type: ConstantWarmUpLR
  warmup_steps: 20                                  # Set the value to 20.

5. Modify dataset configuration.

- Configure the dataset path:

# dataset
train_dataset: &train_dataset
  data_loader:
    type: BlendedMegatronDatasetDataLoader
    config:
      data_path:
        - 1
        - "/home/work/dataset/wiki_4096_text_document"  # Set the dataset path.
  ……

- Configure the path to the parallel communication configuration file:

# mindspore context init config
context:
  ascend_config:
    parallel_speed_up_json_path: "/home/work/mindformers/research/deepseek3/parallel_speed_up.json"  # Set the parallel communication configuration path.
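The parallel settings in step 3 have to agree with the hardware: the header comment targets 8 devices, and 2 (data) × 2 (model) × 2 (pipeline) = 8. The sketch below collects the single-node values from steps 1 to 3 and checks that product. The remaining checks (experts divisible by expert_parallel, micro_batch_num no smaller than pipeline_stage, at least one layer per pipeline stage) are common rules of thumb and are assumptions here, not constraints stated in this tutorial. check_parallel is a hypothetical helper.

```python
# Values copied from steps 1 to 3 of the configuration example.
SINGLE_NODE = {
    "num_layers": 3,
    "expert_num": 16,
    "data_parallel": 2,
    "model_parallel": 2,
    "pipeline_stage": 2,
    "expert_parallel": 2,
    "micro_batch_num": 4,
}


def check_parallel(cfg, device_num=8):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    product = cfg["data_parallel"] * cfg["model_parallel"] * cfg["pipeline_stage"]
    if product != device_num:
        problems.append(
            f"data_parallel * model_parallel * pipeline_stage = {product}, "
            f"expected {device_num}"
        )
    # Assumption: experts are split evenly across the expert-parallel group.
    if cfg["expert_num"] % cfg["expert_parallel"] != 0:
        problems.append("expert_num is not divisible by expert_parallel")
    # Assumption: the pipeline needs at least one micro-batch per stage.
    if cfg["micro_batch_num"] < cfg["pipeline_stage"]:
        problems.append("micro_batch_num is smaller than pipeline_stage")
    # Assumption: every pipeline stage holds at least one layer.
    if cfg["num_layers"] < cfg["pipeline_stage"]:
        problems.append("num_layers is smaller than pipeline_stage")
    return problems


if __name__ == "__main__":
    print(check_parallel(SINGLE_NODE) or "parallel configuration looks consistent")
```

Running a check like this before launching is cheaper than discovering a mismatch after the job has started on all eight devices.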

Training Task Initiation

Navigate to the code root directory and execute the following command to launch a single-node Atlas 800T A2 (64 GB) pre-training task.

cd /home/work/mindformers
bash scripts/msrun_launcher.sh "run_mindformer.py \
--register_path research/deepseek3 \
--config research/deepseek3/deepseek3_671b/pretrain_deepseek3_1b.yaml"

Once the startup script is executed, the task runs in the background. Training logs are saved under /home/work/mindformers/output/msrun_log. Because pipeline parallelism is enabled (pipeline_stage: 2), the loss is displayed only in the log of the last device, worker_7.log; the loss in the other logs shows as 0. Use the following command to view the training log.

tail -f /home/work/mindformers/output/msrun_log/worker_7.log
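The log file name above follows from the device count: with 8 devices the workers are numbered 0 to 7, so the last one writes worker_7.log. As a small sketch, the hypothetical helper last_worker_log derives that path for a given device count, using the log directory quoted above. Whether the last worker is the one that reports the loss in other parallel layouts is an assumption you would need to verify.

```python
from pathlib import PurePosixPath

# Log directory quoted in the text above.
LOG_DIR = "/home/work/mindformers/output/msrun_log"


def last_worker_log(device_num=8, log_dir=LOG_DIR):
    """Return the log path of the highest-numbered worker (0-based)."""
    if device_num < 1:
        raise ValueError("device_num must be at least 1")
    return str(PurePosixPath(log_dir) / f"worker_{device_num - 1}.log")


if __name__ == "__main__":
    print(last_worker_log())
```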

[Figure: training loss chart]

During the training process, weight checkpoints will be saved in the /home/work/mindformers/output/checkpoint directory.

That wraps up our tutorial on DeepSeek-V3 pre-training deployment with MindSpore!

Now it's your turn: get started today with the DeepSeek-V3 image for MindSpore, run the pre-training deployment yourself, and streamline your development workflow.

A sneak peek: next week, we will bring you a tutorial on DeepSeek-V3 inference deployment using MindSpore. Stay tuned!
