Model-Related

Q: How do I deal with the network runtime error “Out of Memory” (OOM)?

A: This error means that device memory is insufficient. It can have a variety of causes, and the following checks are recommended.

  1. Run the command npu-smi info to verify that the card is used exclusively, i.e. that no other process is occupying its memory.

  2. When running a network, start from the default YAML configuration provided for it.

  3. Increase the value of max_device_memory in the network's YAML configuration file (see the first sketch after this list). Note that some device memory must be reserved for inter-card communication, so raise the value in small increments.

  4. Adjust the hybrid parallelism strategy: increase pipeline parallelism (pp) and model parallelism (mp) as appropriate, and reduce data parallelism (dp) accordingly, keeping dp * mp * pp = device_num. Increase the number of NPUs if necessary.

  5. Reduce the batch size or sequence length.

  6. Enable selective or full recomputation, and enable optimizer parallelism (see the second sketch after this list).

  7. If further troubleshooting is still needed, feel free to raise an issue for feedback.

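The memory ceiling (step 3) and the parallelism strategy (step 4) are both set in the network's YAML file. Below is a minimal sketch assuming a typical MindSpore Transformers configuration layout; the key names (context.max_device_memory, parallel_config) follow common mindformers conventions and the values are illustrative only, so compare against the model's shipped default YAML before copying anything.

    context:
      # Upper bound on the memory the framework may claim per device.
      # Leave headroom for inter-card communication and raise it in
      # small steps, e.g. "54GB" -> "56GB" -> "58GB" on a 64 GB device.
      max_device_memory: "58GB"

    parallel_config:
      # Keep data_parallel * model_parallel * pipeline_stage == device_num
      # (here 2 * 2 * 2 = 8 NPUs). Raising mp/pp while lowering dp shards
      # more of the model per card and lowers per-card memory pressure.
      data_parallel: 2
      model_parallel: 2
      pipeline_stage: 2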

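Steps 5 and 6 are likewise YAML-level changes. The keys below (runner_config.batch_size, model.model_config.seq_length, recompute_config, parallel.enable_parallel_optimizer) are assumptions based on common mindformers configurations; verify them against your model's file before use.

    runner_config:
      batch_size: 1            # a smaller batch shrinks activation memory
    model:
      model_config:
        seq_length: 2048       # shorter sequences also reduce activations

    recompute_config:
      recompute: True          # full recomputation: re-derive activations
                               # in the backward pass instead of storing them
      select_recompute: False  # alternatively, recompute only selected layers

    parallel:
      enable_parallel_optimizer: True  # shard optimizer states across cards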