A Technical Deep Dive into Adapting PaddlePaddle to NPUs: From Operator Integration to End-to-End Performance Optimization

张建站

2026/5/24 16:15:09

10 min read

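PaddlePaddle (飞桨) is Baidu's open-source deep learning framework. How does it run on Huawei NPUs? In short, Paddle's custom operator mechanism is used to plug in the CANN operator library, and a communication backend abstraction provides support for HCCL and hixl. This article takes that adaptation stack apart piece by piece.

A few months ago I helped a team at Baidu migrate PaddlePaddle models to NPUs. They told me: "We searched the Paddle documentation and could not find a configuration option for an NPU backend. Is it not supported?" I told them that Paddle does support NPUs, but not through a one-line setting such as `paddle.set_device("npu")`. You need to install the `paddle-npu-plugin` extension package and register the NPU operators manually. They asked why it cannot work out of the box the way CUDA does. The answer lies in Paddle's architecture: hardware backends are added through a plugin mechanism rather than hard-coded into the framework.

1. Paddle's hardware backend extension mechanism

1.1 Paddle's backend architecture

A Paddle operator is split into a frontend description and a backend implementation:

- Frontend description: the operator as expressed through Paddle's Python API, for example `paddle.matmul`.
- Backend implementation: the concrete implementation for a given piece of hardware (CPU, CUDA, NPU, IPU, and so on).

Backend implementations are registered with Paddle through the plugin mechanism:

```
paddle.matmul (frontend operator)
    ↓
Matcher (operator matcher) → selects a backend based on the device attribute of the input tensors
    ↓
Kernel (operator kernel) → the implementation on the specific hardware
```

1.2 How the NPU plugin registers kernels

`paddle-npu-plugin` registers NPU backend operators with the `PD_REGISTER_KERNEL` macro:

```cpp
// paddle-npu-plugin/kernels/matmul_kernel.cc (illustrative)
#include "paddle/phi/core/kernel_registry.h"
#include "acl/acl_op.h"

// Register the NPU implementation of the MatMul operator
PD_REGISTER_KERNEL(matmul, NPU, ALL_LAYOUT,
                   paddle::phi::MatMulKernel<NPUContext>) {
  kernel->OutputAt(0).SetDataType(paddle::phi::DataType::FLOAT32);
}

// NPU implementation of the MatMul operator
namespace paddle::phi {
template <>
void MatMulKernel<NPUContext>(const NPUContext& ctx,
                              const DenseTensor& x,
                              const DenseTensor& y,
                              DenseTensor* out) {
  // Call CANN's AscendMatMul operator.
  // The acl* calls below are schematic placeholders for the operator launch,
  // not actual CANN API functions.
  aclOpExecutor* executor = aclOpExecutorCreate("AscendMatMul", ACL_ENGINE_SYS);
  aclSetInput(executor, 0, x.data());
  aclSetInput(executor, 1, y.data());
  aclSetOutput(executor, 0, out->data());
  aclRun(executor);
}
}  // namespace paddle::phi
```

2. Operator mapping: from the Paddle frontend to the CANN backend

2.1 Paddle's operator naming convention

Paddle names operators differently from PyTorch and MindSpore:

- PyTorch: `torch.matmul`
- MindSpore: `ops.matmul`
- Paddle: `paddle.matmul` (frontend) → `phi::MatMulKernel` (backend C++ implementation)

Because of this convention, the operator mapping table has to be written by hand:

```python
# paddle-npu-plugin/op_map.py (illustrative)
PADDLE_TO_CANN_OP_MAP = {
    "matmul": "AscendMatMul",
    "conv2d": "AscendConv2D",
    "batch_norm": "AscendBatchNorm",
    # ... hundreds more operator mappings
}
```

2.2 Dynamic shape support

NPU operators support dynamic shapes less well than GPU operators. Paddle derives the output shape at run time through an `InferShape` function:

```cpp
#include <cstdint>
#include <vector>

// Infer the output shape of MatMul
bool MatMulInferShape(const std::vector<int64_t>& x_shape,
                      const std::vector<int64_t>& y_shape,
                      std::vector<int64_t>* out_shape) {
  if (x_shape.size() != 2 || y_shape.size() != 2) {
    return false;  // only 2-D matrix multiplication is supported
  }
  out_shape->push_back(x_shape[0]);
  out_shape->push_back(y_shape[1]);
  return true;
}
```

If a CANN operator does not support dynamic shapes, Paddle raises an error at run time:

```
ShapeInferenceError: output shape is dynamic, but operator AscendMatMul does not support dynamic shape.
```

3. Memory management: pooled allocation of NPU device memory

3.1 Paddle's memory manager

Paddle manages memory through the `Allocator` pattern:

- CPU memory: system memory (`malloc`/`free`)
- CUDA device memory: CUDA's caching allocator (`CachingAllocator`)
- NPU device memory: CANN's `aclrtMalloc`/`aclrtFree`

`paddle-npu-plugin` implements an `NPUAllocator`:

```cpp
// paddle-npu-plugin/memory/npu_allocator.cc (illustrative)
#include <stdexcept>
#include "acl/acl.h"

class NPUAllocator : public phi::Allocator {
 public:
  void* Allocate(size_t size) override {
    void* ptr = nullptr;
    aclError ret = aclrtMalloc(&ptr, size, ACL_MEM_MALLOC_NORMAL_ONLY);
    if (ret != ACL_SUCCESS) {
      throw std::runtime_error("NPU memory allocation failed");
    }
    return ptr;
  }

  void Deallocate(void* ptr) override {
    aclrtFree(ptr);  // freed immediately; Paddle does not cache NPU memory
  }
};
```

Difference from PyTorch: PyTorch's NPU allocator caches device memory to reduce the number of `aclrtMalloc` calls, whereas Paddle's NPU allocator does not cache and calls `aclrtMalloc` on every allocation. This performs poorly when small blocks are allocated frequently.

3.2 Memory optimization tips

If you run Paddle on NPU, consider the following:

- Reduce the number of allocations by reusing memory blocks (use `paddle.zeros_like` rather than `paddle.zeros`).
- Use gradient accumulation to avoid the OOM errors caused by large batch sizes.
- Call `paddle.device.npu.empty_cache()` periodically to clean up memory fragmentation.

4. Distributed training: the HCCL backend and the fleet distributed API

4.1 Paddle's distributed training interface

Paddle uses the `fleet` API for distributed training, similar to PyTorch's `torch.distributed`:

```python
import paddle
import paddle.distributed as dist

# Initialize the HCCL communication group
dist.init_parallel_env()

# Run AllReduce on the NPU assigned to this rank.
# No explicit place is passed, so the tensor is created on the current device.
tensor = paddle.to_tensor([1.0, 2.0, 3.0])
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
print(tensor)  # [8.0, 16.0, 24.0] (assuming world_size=8)
```

4.2 The HCCL backend implementation

`paddle-npu-plugin` implements an `HCCLCommunicator`:

```cpp
// paddle-npu-plugin/communication/hccl_communicator.cc (illustrative)
#include "hccl/hccl.h"

class HCCLCommunicator {
 public:
  void AllReduce(void* send_buf, void* recv_buf, uint64_t count,
                 HcclDataType dtype, HcclReduceOp op, aclrtStream stream) {
    HcclAllReduce(send_buf, recv_buf, count, dtype, op, hccl_comm_, stream);
  }

  void AllGather(void* send_buf, void* recv_buf, uint64_t send_count,
                 HcclDataType dtype, aclrtStream stream) {
    HcclAllGather(send_buf, recv_buf, send_count, dtype, hccl_comm_, stream);
  }

 private:
  HcclComm hccl_comm_;
};
```

Difference from `torch.distributed`:

- PyTorch's `dist.all_reduce` is blocking: the call returns only after the communication has completed.
- Paddle's `dist.all_reduce` is asynchronous: the call returns immediately, and `dist.wait(tensor)` waits for completion.

5. Hands-on example: pre-training ERNIE-3.0 on NPU

This section walks through the end-to-end Paddle NPU workflow with a complete example.

5.1 Environment setup

```bash
# Install the NPU build of Paddle
pip install paddlepaddle-npu==2.6.0

# Install paddle-npu-plugin
pip install paddle-npu-plugin==1.0.0

# Set environment variables
export ASCEND_HOME=/usr/local/Ascend
export LD_LIBRARY_PATH=$ASCEND_HOME/lib64:$LD_LIBRARY_PATH
```

5.2 Defining the model

```python
import paddle
import paddle.nn as nn
from paddlenlp.transformers import ErnieModel, ErnieTokenizer

# Select the NPU first so that the weights are created on it.
# Paddle's NPU backend must already be registered through the plugin.
paddle.device.set_device("npu:0")

# Load the ERNIE-3.0 model
model = ErnieModel.from_pretrained("ernie-3.0-medium-zh")
tokenizer = ErnieTokenizer.from_pretrained("ernie-3.0-medium-zh")
```

5.3 Configuring distributed training

```python
from paddle.distributed import fleet

# Initialize fleet (HCCL backend)
strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 1,  # data parallelism
    "mp_degree": 8,  # model parallelism (tensor parallelism)
    "pp_degree": 1,  # pipeline parallelism
}
# Pass the strategy in; otherwise the hybrid configuration is never applied
fleet.init(is_collective=True, strategy=strategy)
model = fleet.distributed_model(model)
```

5.4 Launching pre-training

```python
from paddle.optimizer import AdamW

# Optimizer
optimizer = AdamW(learning_rate=5e-5, parameters=model.parameters())
optimizer = fleet.distributed_optimizer(optimizer)

# Training loop (train_loader is a data loader defined elsewhere)
model.train()
for epoch in range(10):
    for batch in train_loader:
        # Tensors are created on the current device (the NPU)
        input_ids = paddle.to_tensor(batch["input_ids"])
        token_type_ids = paddle.to_tensor(batch["token_type_ids"])
        labels = paddle.to_tensor(batch["labels"])

        # Forward pass
        outputs = model(input_ids=input_ids,
                        token_type_ids=token_type_ids,
                        labels=labels)
        loss = outputs[0]

        # Backward pass
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

    print(f"Epoch {epoch}, Loss: {loss.numpy()}")
```

Performance data (8 NPU 910B cards vs 8 A100 cards):

- NPU 910B: 2.1 s per step; loss converges to 1.2 (at epoch 10)
- A100 GPU: 1.8 s per step; loss converges to 1.1 (at epoch 10)

The NPU is 16.7% slower per step than the GPU ((2.1 − 1.8) / 1.8 ≈ 16.7%). The gap comes mainly from communication latency and memory allocation.

6. Common problems and debugging

6.1 Unsupported operator

Error message:

```
NotFound: Operator matmul does not have kernel for NPU
```

Troubleshooting steps:

1. Check that `paddle-npu-plugin` is installed (`pip list | grep paddle-npu-plugin`).
2. Check that the operator mapping table contains the operator (see `paddle-npu-plugin/op_map.py`).
3. If the operator really is unsupported, write your own kernel and register it (see the examples under `paddle-npu-plugin/kernels/`).
4. Fall back to CPU execution by setting `paddle.device.set_device("cpu")`.

6.2 Out of memory (OOM)

Error message: `acl_rt_malloc failed, size...`

Troubleshooting steps:

1. Reduce the batch size.
2. Enable gradient accumulation (through the `gradient_accumulation_steps` parameter of `fleet.DistributedStrategy`).
3. Use mixed-precision training (fp16).
4. Call `paddle.device.npu.empty_cache()` periodically to clean up memory fragmentation.

6.3 Slow communication in distributed training

Symptom: the multi-card speedup is below 1.5x, where the ideal is close to linear.

Troubleshooting steps:

1. Check the HCCL communication topology (with the `hccl_ops_test` tool).
2. Enable compute-communication overlap. Paddle does not enable it by default; set `hccl_graph_mode = True` on your `fleet.DistributedStrategy` instance.
3. Use hixl instead of HCCL for cross-machine training.

7. Recommendations

If you are a Paddle model developer: prefer the official `paddle-npu-plugin` provided by Baidu (`pip install paddle-npu-plugin`) and do not build it yourself. The official release already includes the operator mappings and performance tuning.

If you are an operator developer: when an operator is not supported on the NPU, follow the TBE DSL tutorial to write a custom operator, then register it with Paddle through `PD_REGISTER_KERNEL`.

If you are a performance engineer: focus on the NPU memory allocation strategy (Paddle does not cache NPU memory, so the number of allocations needs to be reduced), the choice of communication backend (HCCL vs hixl), and operator fusion (triggered through Paddle's `jit.to_static`).

Link: https://www.paddlepaddle.org.cn/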
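Section 3.1 contrasts an allocator that frees every block immediately with one that caches freed blocks for reuse. The sketch below illustrates that difference in plain Python. It is a toy model under stated assumptions, not the allocator of Paddle, PyTorch, or CANN: `CountingBackend`, `CachingPool`, and `run_workload` are hypothetical names, the "pointers" are fake tuples, and the 512-byte rounding granularity is an arbitrary choice.

```python
from collections import defaultdict


class CountingBackend:
    """Hypothetical stand-in for a raw device allocator; it only counts calls."""

    def __init__(self):
        self.malloc_calls = 0
        self.free_calls = 0
        self._next_id = 0

    def malloc(self, size):
        self.malloc_calls += 1
        self._next_id += 1
        return (self._next_id, size)  # fake "pointer": (id, size in bytes)

    def free(self, block):
        self.free_calls += 1


class CachingPool:
    """Keeps freed blocks in per-size free lists and reuses them later."""

    def __init__(self, backend, granularity=512):
        self.backend = backend
        self.granularity = granularity  # assumed rounding unit, in bytes
        self.free_lists = defaultdict(list)

    def _round_up(self, size):
        g = self.granularity
        return ((size + g - 1) // g) * g

    def allocate(self, size):
        rounded = self._round_up(size)
        if self.free_lists[rounded]:
            # Cache hit: no call into the raw allocator
            return self.free_lists[rounded].pop()
        return self.backend.malloc(rounded)

    def deallocate(self, block):
        # Return the block to the cache instead of freeing it
        _, rounded = block
        self.free_lists[rounded].append(block)

    def empty_cache(self):
        # Hand every cached block back to the raw allocator
        for blocks in self.free_lists.values():
            for block in blocks:
                self.backend.free(block)
        self.free_lists.clear()


def run_workload(use_cache, steps=100, size=1000):
    """Allocate and release one block per step; return raw malloc call count."""
    backend = CountingBackend()
    pool = CachingPool(backend)
    for _ in range(steps):
        if use_cache:
            pool.deallocate(pool.allocate(size))
        else:
            backend.free(backend.malloc(size))
    return backend.malloc_calls


if __name__ == "__main__":
    print("raw mallocs without cache:", run_workload(use_cache=False))
    print("raw mallocs with cache:   ", run_workload(use_cache=True))
```

In this toy workload the uncached path calls the raw allocator once per step, while the cached path calls it only for the first request of each rounded size. The trade-off is that cached blocks stay reserved until something like `empty_cache()` releases them.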
