[{"data":1,"prerenderedAt":125},["ShallowReactive",2],{"content-query-nZ7Kvw4Cc3":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":119,"_id":120,"_source":121,"_file":122,"_stem":123,"_extension":124},"/news/zh/2025-12-12","zh",false,"","昇思人工智能框架峰会 | 昇思MindSpore MoE模型性能优化方案，提升训练性能15%+","系统性地解决了MoE架构在大规模分布式训练中面临的通信开销大、断流频率高、显存占用高等核心瓶颈","2025-12-12","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2025/07/25/199b735845bf4106b44b2035dc97bd39.png","news",{"type":14,"children":15,"toc":116},"root",[16,24,30,39,47,52,57,68,73,81,89,94,99,106,111],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"昇思人工智能框架峰会-昇思mindspore-moe模型性能优化方案提升训练性能15",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":23,"value":29},"昇思MindSpore MoE性能优化方案主要包含机间通信合并、零冗余通信、AlltoAllV收发异构复用3项关键技术。这些技术协同作用，系统性地解决了MoE架构在大规模分布式训练中面临的通信开销大、断流频率高、显存占用高等核心瓶颈。",{"type":17,"tag":25,"props":31,"children":32},{},[33],{"type":17,"tag":34,"props":35,"children":36},"strong",{},[37],{"type":23,"value":38},"# 01",{"type":17,"tag":25,"props":40,"children":41},{},[42],{"type":17,"tag":34,"props":43,"children":44},{},[45],{"type":23,"value":46},"机间通信合并特性",{"type":17,"tag":25,"props":48,"children":49},{},[50],{"type":23,"value":51},"当前的流行MoE架构存在着专家数多、单个专家计算量小的特点。如DeepSeek V3每个层的路由专家个数高达256个，在训练实践中为了减小显存压力往往开启专家并行(EP)，将专家切分到不同的卡上。然而，当EP数大于单个节点的的NPU/GPU数量时，专家会被切分到不同节点上，在token dispatch和combine阶段，需要进行AlltoAll的机间通信。因机间带宽远小于机内带宽，此时，机间通信不可避免地成为通信性能的瓶颈。",{"type":17,"tag":25,"props":53,"children":54},{},[55],{"type":23,"value":56},"昇思MindSpore团队针对这一问题，采用跨机AllGather通信与机内AlltoAll通信相结合的方式，解决AlltoAll机间通信性能差的问题。首先将所需的tokens通过跨机AllGather同步到机间，然后在机间进行tokens的排序与AlltoAll通信。基于这种分层的通信方式降低了跨机通信数据量，有效地提升了整体通信性能。经过在DeepSeek V3 671B实训测试，在EP=16时端到端吞吐性能提升15%。机间通信合并与原始通信方案的示意如图1。",{"type":17,"tag":58,"props":59,"children":61},"div",{"style":60},"text-align: center;",[62],{"type":17,"tag":63,"props":64,"children":67},"img",{"src":65,"style":66,"alt":7},"/category/information/news/banner/2025-12-12-1.jpg","display: block;margin: 0 auto;max-width:60%",[],{"type":17,"tag":25,"props":69,"children":70},{},[71],{"type":23,"value":72},"图1. 机间通信合并与原始通信方案的示意图",{"type":17,"tag":25,"props":74,"children":75},{},[76],{"type":17,"tag":34,"props":77,"children":78},{},[79],{"type":23,"value":80},"# 02",{"type":17,"tag":25,"props":82,"children":83},{},[84],{"type":17,"tag":34,"props":85,"children":86},{},[87],{"type":23,"value":88},"AlltoAllV收发异构复用特性",{"type":17,"tag":25,"props":90,"children":91},{},[92],{"type":23,"value":93},"在MoE的token dispatch 以及 token combine阶段各需要执行一次AlltoAllV的通信计算。在下发AlltoAllV算子时需要send_list/receive_list的参数信息，而这两个参数内存在device侧，需要对其进行一次device to host操作将其搬运至Host侧内存。因此在正向token dispatch及token combine阶段各存在1次因device to host引发的下发断流（即，下发流程需要等待device to host操作完成后，才能下发其余算子）。如果考虑反向计算，断流次数就变成4次，对性能造成严重影响。",{"type":17,"tag":25,"props":95,"children":96},{},[97],{"type":23,"value":98},"为此，昇思MindSpore采用AlltoAllV收发异构复用技术来减少断流次数，其核心思想在于在提前对token dispatch的send_list/receive_list进行device to host，将其缓存在Host，然后基于缓存的send_list/receive_list实现提前下发token combine阶段的AlltoAllV，其原理如图2所示。",{"type":17,"tag":58,"props":100,"children":101},{"style":60},[102],{"type":17,"tag":63,"props":103,"children":105},{"src":104,"style":66,"alt":7},"/category/information/news/banner/2025-12-12-2.jpg",[],{"type":17,"tag":25,"props":107,"children":108},{},[109],{"type":23,"value":110},"昇思MindSpore通过其异构能力实现AlltoAllV收发send_list/receive_list的异构复用，将断流次数从4次降低到1次。在DeepSeek V3 671B实训测试，端到端性能提升5%。",{"type":17,"tag":25,"props":112,"children":113},{},[114],{"type":23,"value":115},"昇思MindSpore针对MoE性能提升的业界难题，成体系地采用优化技术，包括但不限于上述2项技术，构筑了昇思MindSpore面向超大规模MoE训练的高效通信底座，更多的技术介绍与交流，请关注昇思人工智能框架峰会。",{"title":7,"searchDepth":117,"depth":117,"links":118},4,[],"markdown","content:news:zh:2025-12-12.md","content","news/zh/2025-12-12.md","news/zh/2025-12-12","md",1776506059737]