# CANN/pto-isa Async Communication Demo
## Allgather Async Demo

**pto-isa**: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. This repository offers high-performance, cross-platform tile operations across Ascend platforms. Project URL: https://gitcode.com/cann/pto-isa

Demonstrates the allgather collective operation using PTO async instructions across multiple NPU devices.

- **A2/A3 build** (default `SOC_VERSION=Ascend910B1`): Demos 1–3 using the SDMA engine (`TPUT_ASYNC`/`TGET_ASYNC` via HCCL).
- **A5 build** (`SOC_VERSION=Ascend950PR_9599`): Demos 4–6 using the URMA engine (HCCP V2 Jetty RDMA).

The two engine paths use different host infrastructure (`a2a3/common.hpp` vs `a5/common.hpp`) with incompatible ACL/runtime initialization, so each build compiles and links only one set.

### Prerequisites

- CANN Toolkit (version 9.0.0 or above) installed (`ASCEND_HOME_PATH` set via `set_env.sh`)
- CANN Ops package (version 9.0.0 or above) installed
- MPICH installed
- Enough NPU devices for your MPI rank count (default `./run.sh` uses 8 ranks; `./run.sh 2 …` uses 2). Typically one rank maps to one device.

### Quick Start

```bash
source /path/to/set_env.sh
./run.sh                      # 8 ranks, default SoC Ascend910B1 (A2/A3, Demos 1–3)
./run.sh 4                    # 4 ranks
./run.sh 2 Ascend950PR_9599   # 2 ranks, A5 (Demos 4-6)
```

### What It Does

Each rank contributes 256 `int32_t` values. After allgather, every rank holds the data of all ranks.

#### SDMA Demos (A2/A3 build)

1. **TPUT_ASYNC Allgather (multi-core)**: Launched with `<<<nRanks, ...>>>`, so each AICORE handles communication with one target rank in parallel. The AICORE where `block_idx == myRank` performs a local copy; all others use `pto::comm::TPUT_ASYNC` to write data to the corresponding remote rank.
2. **TGET_ASYNC Allgather (multi-core)**: Launched with `<<<nRanks, ...>>>`, so each AICORE pulls data from one source rank in parallel. The AICORE where `block_idx == myRank` performs a local copy; all others use `pto::comm::TGET_ASYNC` to read data from the corresponding remote rank.
3. **Ring TPUT_ASYNC Allgather**: Ring algorithm with N-1 rounds for N ranks. In round 0, each rank copies its `sendBuf` locally and pushes it to the next rank via `TPUT_ASYNC`. In subsequent rounds, each rank forwards the chunk it received in the previous round to the next rank. Each round is a separate kernel launch with a host-side barrier in between.

#### URMA Demos (A5 build)

4. **URMA TPUT_ASYNC Allgather (multi-core)**: Same algorithm as Demo 1, using `TPUT_ASYNC<DmaEngine::URMA>` with `UrmaPeerMrBaseAddr` for remote addressing.
5. **URMA TGET_ASYNC Allgather (multi-core)**: Same algorithm as Demo 2, using `TGET_ASYNC<DmaEngine::URMA>`.
6. **URMA Ring TPUT_ASYNC Allgather**: Same ring algorithm as Demo 3, using `TPUT_ASYNC<DmaEngine::URMA>`. Runs N-1 rounds; on 2 ranks this is a single round that verifies basic AllGather correctness. The recv→forward path is exercised only when N≥3.

### Key PTO APIs

#### Demos 1–3 (SDMA / HCCL)

- `pto::comm::AsyncSession`, `BuildAsyncSession` (SDMA overload, used with `SdmaWorkspaceManager` and HCCL context)
- `pto::comm::TPUT_ASYNC`, `TGET_ASYNC` (default SDMA engine)
- `pto::comm::AsyncEvent`, `Wait`
- `SdmaWorkspaceManager`, `HcclRemotePtr` (host)

#### Demos 4–6 (URMA)

- `pto::comm::BuildAsyncSession<DmaEngine::URMA>`, `TPUT_ASYNC<DmaEngine::URMA>` / `TGET_ASYNC<DmaEngine::URMA>`
- `UrmaWorkspaceManager`, `UrmaPeerMrBaseAddr` (host)

### Project Structure

```
allgather_async/
├── CMakeLists.txt                      -- Build configuration (bisheng CCE)
├── csrc/
│   ├── kernel/
│   │   ├── allgather_kernel.cpp        -- SDMA kernels and host launchers (A2/A3)
│   │   ├── allgather_kernel.h          -- SDMA host-side function declarations
│   │   ├── allgather_urma_kernel.cpp   -- URMA kernels and host launchers (A5)
│   │   └── allgather_urma_kernel.h     -- URMA host-side function declarations
│   └── host/
│       └── main.cpp                    -- Entry point (MPI init, run demos, report)
├── run.sh                              -- One-click build and run
├── README.md                           -- English documentation
└── README_zh.md                        -- Chinese documentation
```

### Dependency Installation

#### 1. CANN Toolkit

CANN Toolkit version 9.0.0 or above is required.
It is available via two methods:

- Option 1: Download from the Ascend Community
- Option 2: Direct download (preview build): x86_64 / aarch64

For installation instructions, refer to Quick Install CANN.

After installation, set up the environment (default install path):

```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```

Custom install path:

```bash
source ${install_path}/ascend-toolkit/set_env.sh
```

#### 2. CANN Ops

The CANN Ops package (version 9.0.0 or above) is required. Download the ops-legacy package for your hardware platform:

| Hardware | x86_64   | aarch64  |
|----------|----------|----------|
| A2       | Download | Download |
| A3       | Download | Download |

Installation follows the same procedure as the Toolkit. Refer to Quick Install CANN.

#### 3. MPICH

The recommended version is 3.2.1. Build and install from source:

```bash
# Example with version 3.2.1
version=3.2.1
wget https://www.mpich.org/static/downloads/${version}/mpich-${version}.tar.gz
tar -xzf mpich-${version}.tar.gz
cd mpich-${version}
./configure --prefix=/usr/local/mpich --disable-fortran
make
make install
```

Set environment variables:

```bash
export MPI_HOME=/usr/local/mpich
export PATH=${MPI_HOME}/bin:${PATH}
```

Verify that `mpirun` is available:

```bash
mpirun --version
```

### Manual Build

```bash
# A2/A3 build (Demos 1-3)
mkdir -p build
cd build
cmake .. -DSOC_VERSION=Ascend910B1
make -j$(nproc)
cd ..
mpirun -n 8 ./build/bin/allgather_demo

# A5 build (Demos 4-6)
rm -rf build
mkdir -p build
cd build
cmake .. -DSOC_VERSION=Ascend950PR_9599
make -j$(nproc)
cd ..
mpirun -n 2 ./build/bin/allgather_demo
```

`SOC_VERSION` determines which kernel set is compiled: A2/A3 builds only the SDMA kernel; A5 builds only the URMA kernel. A clean rebuild (`rm -rf build`) is needed when switching between SoC targets.

### Expected Output

#### A5 (2 ranks, URMA Demos 4–6)

```
PTO Allgather Async Demo
Ranks: 2

--- Demo 4: URMA Multi-core TPUT_ASYNC ---
[URMA_TPUT_MC PASS] Rank 0: slot[0][0,1,2,...] slot[1][1000,1001,1002,...]
[URMA_TPUT_MC PASS] Rank 1: slot[0][0,1,2,...] slot[1][1000,1001,1002,...]

--- Demo 5: URMA Multi-core TGET_ASYNC ---
[URMA_TGET_MC PASS] Rank 0: slot[0][0,1,2,...] slot[1][1000,1001,1002,...]
[URMA_TGET_MC PASS] Rank 1: slot[0][0,1,2,...] slot[1][1000,1001,1002,...]

--- Demo 6: URMA Ring TPUT_ASYNC ---
[URMA_RING_TPUT PASS] Rank 0: slot[0][0,1,2,...] slot[1][1000,1001,1002,...]
[URMA_RING_TPUT PASS] Rank 1: slot[0][0,1,2,...] slot[1][1000,1001,1002,...]

All demos PASSED
```

#### A3 (8 ranks, SDMA Demos 1–3)

```
PTO Allgather Async Demo
Ranks: 8

--- Demo 1: Multi-core TPUT_ASYNC ---
[TPUT_ASYNC_MC PASS] Rank 0: slot[0][0,1,2,...] slot[1][1000,1001,1002,...] slot[2][2000,2001,2002,...] ...
...

--- Demo 2: Multi-core TGET_ASYNC ---
[TGET_ASYNC_MC PASS] Rank 0: slot[0][0,1,2,...] slot[1][1000,1001,1002,...] slot[2][2000,2001,2002,...] ...
...

--- Demo 3: Ring TPUT_ASYNC ---
[RING_TPUT_ASYNC PASS] Rank 0: slot[0][0,1,2,...] slot[1][1000,1001,1002,...] slot[2][2000,2001,2002,...] ...
...

All demos PASSED
```

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
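
The ring schedule described for Demos 3 and 6, and the slot values shown in the expected output, can be checked without any NPU by simulating the rounds on the host. The sketch below is not the demo's kernel code: it makes no PTO, HCCL, URMA, or MPI calls, and the names `makeSendBuf`, `simulateRingAllgather`, and `allSlotsCorrect` are hypothetical. The fill pattern `rank * 1000 + i` is an assumption inferred from the sample output, and plain memory copies stand in for `TPUT_ASYNC`.

```cpp
// Host-only simulation of the ring allgather schedule (illustrative sketch).
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kElemsPerRank = 256;  // each rank contributes 256 int32_t values

// recv[r] models rank r's receive buffer: nRanks slots of kElemsPerRank values.
using RecvBufs = std::vector<std::vector<std::int32_t>>;

// Assumed fill pattern (inferred from the sample output): rank r holds r*1000 + i.
std::vector<std::int32_t> makeSendBuf(int rank) {
    std::vector<std::int32_t> buf(kElemsPerRank);
    for (int i = 0; i < kElemsPerRank; ++i) {
        buf[i] = rank * 1000 + i;
    }
    return buf;
}

RecvBufs simulateRingAllgather(int nRanks) {
    assert(nRanks > 0);
    const std::size_t total = static_cast<std::size_t>(nRanks) * kElemsPerRank;
    RecvBufs recv(static_cast<std::size_t>(nRanks),
                  std::vector<std::int32_t>(total, std::int32_t{-1}));

    // Local copy: every rank places its own chunk in its own slot.
    for (int r = 0; r < nRanks; ++r) {
        const std::vector<std::int32_t> send = makeSendBuf(r);
        std::copy(send.begin(), send.end(), recv[r].begin() + r * kElemsPerRank);
    }

    // N-1 rounds. In round k, rank r pushes slot (r - k) mod N to rank r+1:
    // its own chunk in round 0, then the chunk received in the previous round.
    // The end of each outer iteration plays the role of the host-side barrier.
    for (int round = 0; round < nRanks - 1; ++round) {
        for (int r = 0; r < nRanks; ++r) {
            const int slot = (r - round + nRanks) % nRanks;
            const int next = (r + 1) % nRanks;
            const auto src = recv[r].begin() + slot * kElemsPerRank;
            std::copy(src, src + kElemsPerRank,
                      recv[next].begin() + slot * kElemsPerRank);
        }
    }
    return recv;
}

// Returns true when every rank holds every rank's chunk in the matching slot.
bool allSlotsCorrect(const RecvBufs& recv, int nRanks) {
    for (int r = 0; r < nRanks; ++r) {
        for (int s = 0; s < nRanks; ++s) {
            for (int i = 0; i < kElemsPerRank; ++i) {
                if (recv[r][s * kElemsPerRank + i] != s * 1000 + i) {
                    return false;
                }
            }
        }
    }
    return true;
}
```

Under these assumptions, `allSlotsCorrect(simulateRingAllgather(8), 8)` returns `true`. With `nRanks = 2` the outer loop runs a single round, which is why a 2-rank run never forwards a received chunk; the forwarding step first occurs with three or more ranks.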