# Vec Reduction on a2 (cmax + brcb Pattern)

> Part of cannbot-skills. CANNBot is a family of agents for improving CANN development efficiency; this repository provides its reusable Skills modules. Project: https://gitcode.com/cann/cannbot-skills

Read this file when implementing per-row reductions (max, sum) on a2 using the vec pipeline. On a2 there are no `Reg`/`RegList`, so reductions use UB-to-UB `cmax`/`cadd` + `brcb`.

## Goal

Get per-row max (or sum) correct on a2, including the broadcast step that is easy to forget.

## 1. The cmax output format

`cmax(dst, src)` reduces one repeat (64 float elements = 8 blocks of 8) to a single scalar. The scalar is stored at `dst[rep * dst_rep_stride]`: one float element per repeat.

With the default `dst_rep_stride = 1`, the scalars are packed densely:

```
dst[0]  = max of row 0
dst[1]  = max of row 1
...
dst[63] = max of row 63
```

This is **not** a C0 block layout. The 8-element block structure that `sub`/`vmax` expect is not satisfied.

## 2. The bug: using cmax output directly in sub

If you pass the cmax output to `sub` with `blk_stride = 0`:

- `sub` reads one C0 block (8 elements) and broadcasts it across all 8 blocks of each repeat
- but the 8 elements in that block are the maxes of **8 different rows**, not 8 copies of one row's max
- result: each row gets the wrong max subtracted, so `exp` produces huge or wrong values

Symptom: output values > 1.0 from `exp(score - max)`, where the max should be the row max.

## 3. The fix: brcb broadcast between cmax and sub

After cmax, use `brcb` to expand each scalar to fill a full C0 block:

```python
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)  # cmax scalars
ub_max   = Tensor(DT.float, [HALF_M, 8], Position.UB)  # broadcast result
cmax(ub_max_s, ub_tmp)
brcb(ub_max, ub_max_s, dst_blk_stride=1, dst_rep_stride=8)
```

How brcb works:

- `repeat = infer_repeat_brcb(src) = HALF_M * 1 // 8 = 8`
- for each repeat: reads 8 scalars from `src[rep*8 : rep*8+8]`
- for each of 8 blocks: fills `dst[block_begin : block_begin+C0]` with one scalar
- with `dst_blk_stride=1, dst_rep_stride=8`: blocks are contiguous, repeats advance by 8 blocks

Result: `ub_max[n*8 : n*8+8]` all contain `max_of_row_n` for n in 0..63.
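The repeat/block walk above can be sketched as a small pure-Python model. This is illustrative only: `brcb_model` and `C0` are names invented here, not the real brcb API, but the loop structure follows the rules just listed.

```python
# Illustrative pure-Python model of the brcb step described above.
# `brcb_model` is a hypothetical name; the real op is the easyasc brcb.
C0 = 8

def brcb_model(src, dst_blk_stride=1, dst_rep_stride=8):
    """Expand each dense scalar in `src` into a full C0 block."""
    repeat = len(src) // C0                       # infer_repeat_brcb: 64*1 // 8 = 8
    dst = [0.0] * (len(src) * C0)
    for rep in range(repeat):
        scalars = src[rep * C0 : rep * C0 + C0]   # 8 scalars per repeat
        for blk in range(C0):
            begin = (rep * dst_rep_stride + blk * dst_blk_stride) * C0
            dst[begin : begin + C0] = [scalars[blk]] * C0
    return dst

row_max = [float(n) for n in range(64)]           # stand-in for cmax output
out = brcb_model(row_max)
assert out[40:48] == [5.0] * 8                    # row 5's max fills its C0 block
```

With the default `dst_rep_stride = 1` instead, each repeat would overwrite part of the previous one, which is why the explicit `dst_rep_stride = 8` matters.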
### 3a. Dense row [1, 64] → broadcast [64, 8] also needs explicit brcb params

When the scalar statistics arrive as one dense row, such as:

```python
qkmaxbuf = Tensor(DT.float, [1, 64], Position.UB)
qksumbuf = Tensor(DT.float, [1, 64], Position.UB)
```

and the destination is the usual broadcast format:

```python
qkmaxbrcb = Tensor(DT.float, [64, 8], Position.UB)
```

do **not** rely on default `brcb(...)` parameter inference.

Validated pattern:

```python
qkmaxbuf = qkmax[bh:bh+1, row0:row0+64]
brcb(qkmaxbrcb, qkmaxbuf, repeat=64 // 8, dst_blk_stride=1, dst_rep_stride=8)
```

Why this matters:

- the source load into `[1, 64]` is fine
- the failure comes from the broadcast configuration, not from the GM → UB read itself
- with the validated explicit parameters, row `r` is expanded to `qkmaxbrcb[r, 0:8]`

Concrete reproducer: `tmp/validate_row64_brcb.py`

Practical rule:

- for row-stat broadcasts on a2, treat `brcb(..., dst_blk_stride=1, dst_rep_stride=8)` as mandatory
- when the source is `[1, 64]`, also pin `repeat=64 // 8` explicitly in validated kernels instead of trusting defaults

## 4. Complete row-max pattern for [HALF_M, 128] float data

```python
HALF_M = 64
HALF_N = 64
ub_data  = Tensor(DT.float, [HALF_M, 128], Position.UB)
ub_tmp   = Tensor(DT.float, [HALF_M, HALF_N], Position.UB)
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
ub_max   = Tensor(DT.float, [HALF_M, 8], Position.UB)

# Step 1: element-wise max of two 64-col halves → 64 values per row
vmax(ub_tmp, ub_data[0:HALF_M, 0:HALF_N], ub_data[0:HALF_M, HALF_N:128])
# Step 2: reduce 64 → 1 scalar per row
cmax(ub_max_s, ub_tmp)
# Step 3: broadcast each scalar to fill a C0 block (8 identical elements)
brcb(ub_max, ub_max_s, dst_blk_stride=1, dst_rep_stride=8)
# Step 4: subtract (sliced to align repeat with narrow max buf)
sub(ub_data[0:HALF_M, 0:HALF_N], ub_data[0:HALF_M, 0:HALF_N], ub_max)
sub(ub_data[0:HALF_M, HALF_N:128], ub_data[0:HALF_M, HALF_N:128], ub_max)
```

Why each step is needed:

- `vmax`: 128 columns exceed one repeat (64 elements). Must merge to 64 first.
- `cmax`: reduces 64 → 1 scalar per row.
  Output is dense, not block-aligned.
- `brcb`: fills C0 blocks so that `sub` with `blk_stride = 0` broadcasts correctly.
- `sub` with slicing: see `agent/references/constraints/vec-stride.md` for why.

## 5. Why [M, 8] broadcast format fails for binary ops between two narrow buffers

After `brcb`, the result tensor has shape `[HALF_M, 8]` with `span[1] = 8 = C0`. Stride inference for `[64, 8]` float gives: `blk_stride = 0, rep_stride = 1, repeat = 8`.

With `blk_stride = 0`, all 8 blocks within one repeat address the **same** 8 elements. So each repeat touches 8 unique elements, and 8 repeats touch 8 × 8 = 64 elements. But the buffer contains 64 × 8 = 512 elements. The remaining 448 are **never reached**.

This means `vmax(buf_a[64,8], buf_a[64,8], buf_b[64,8])` only computes the max for the first 8 rows. Rows 8–63 are left unchanged.

Root cause: `blk_stride = 0` is the broadcast stride designed for `sub(wide, wide, narrow)`, where the wide destination's repeat cadence drives iteration and the narrow source stays per-row. It was never intended for element-wise operations between two identically shaped narrow buffers.

Diagnostic method: before choosing a tensor format for any vec binary operation, manually trace:

- `infer_repeat(dst) = span[0] * span[1] / (256 // dtype.size)`
- `infer_strides(tensor)`: check whether `blk_stride` is 0 or 1
- total unique elements = repeat × (8 if blk_stride == 1 else 1) × elements_per_block
- compare against the actual element count (`shape[0] * shape[1]`)

If the totals disagree, the operation will silently skip elements.

Reference implementation: `easyasc/stub_functions/vec/vecutils.py` (`infer_strides`, `infer_repeat`).

## 6. Using [M, 1] scalar format for binary ops between reduction outputs

The `cmax` output `[HALF_M, 1]` has `span[1] = 1`.
Stride inference for `[64, 1]` float: `span[1] = 1` matches neither 64 nor 8, so defaults apply: `blk_stride = 1, rep_stride = 8, repeat = 1`.

With `blk_stride = 1` and 8 blocks per repeat:

- Block 0: elements `[0:8]`
- Block 1: elements `[8:16]`
- …
- Block 7: elements `[56:64]`

Total: 1 repeat × 8 blocks × 8 elements = 64 elements = all rows ✓

So `vmax(dst[64,1], src1[64,1], src2[64,1])` correctly computes the per-row element-wise max over all 64 dense scalars from `cmax` output. No rows are skipped.

Key insight: operate on the dense scalar `[M, 1]` format BEFORE the `brcb` broadcast. Only `brcb` to `[M, 8]` after the scalar-level operation is complete.

Validated pattern for running max across tiles:

```python
ub_max_s  = Tensor(DT.float, [HALF_M, 1], Position.UB)  # per-tile cmax output
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)  # running max (persistent)
ub_max    = Tensor(DT.float, [HALF_M, 8], Position.UB)  # broadcast for sub

# before inner loop: initialize running max
dup(ub_rmax_s, neg_large)

# inside each tile:
cmax(ub_max_s, ub_tmp)                # per-tile row max
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # update in [M,1] format
brcb(ub_max, ub_rmax_s, dst_blk_stride=1, dst_rep_stride=8)  # broadcast AFTER update
sub(ub_data[0:M, 0:64], ub_data[0:M, 0:64], ub_max)
sub(ub_data[0:M, 64:128], ub_data[0:M, 64:128], ub_max)
```

Here `neg_large` is a sufficiently large finite negative sentinel, not a literal `float(-inf)`.

UB overhead for running max: one extra `[64, 1]` float tensor = 0.25 KB.
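The stride contrast behind the two sections above can be checked with a small pure-Python walk. This is an illustrative model of the inference rules quoted in the diagnostic method, not the real easyasc code; the function name is invented here.

```python
# Illustrative stride-walk model of the vec binary-op iteration rules.
ELEMS_PER_BLOCK = 8    # C0 elements per block for float
BLOCKS_PER_REPEAT = 8

def touched_elements(blk_stride, rep_stride, repeat):
    """Count the unique element offsets a vec binary op would visit."""
    seen = set()
    for rep in range(repeat):
        for blk in range(BLOCKS_PER_REPEAT):
            base = (rep * rep_stride + blk * blk_stride) * ELEMS_PER_BLOCK
            seen.update(range(base, base + ELEMS_PER_BLOCK))
    return len(seen)

# [64, 8] broadcast format: blk_stride=0, rep_stride=1, repeat=8
# -> only 64 of the 512 elements are ever visited (rows 8..63 skipped)
assert touched_elements(0, 1, 8) == 64
# [64, 1] dense scalar format: blk_stride=1, rep_stride=8, repeat=1
# -> all 64 scalars are visited, no rows skipped
assert touched_elements(1, 8, 1) == 64
```

Both calls return 64, but against different totals: 64 of 64 for the `[64, 1]` format versus 64 of 512 for the `[64, 8]` format, which is exactly the silent-skip failure described above.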
### 6a. Copying [M,1] scalar state across iterations

The validated running-max pattern often needs a snapshot of the previous scalar state before updating it, for example to compute `exp(prev_m - curr_m)` in streamed attention.

Do **not** snapshot `[M,1]` buffers with `ub_to_ub`.

Why this fails:

- `ub_to_ub` works in **C0-sized blocks**
- for float `[64,1]`, that means an 8-element block copy per row
- the operation does not mean "copy one scalar per row"

Stable fix:

- allocate a zero buffer in the same `[M,1]` format
- use a vec binary op such as `add(dst, src, zero)` to make the copy

Example:

```python
ub_prev_s = DBuff(DT.float, [HALF_M, 1], Position.UB)
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
ub_zero_s = Tensor(DT.float, [HALF_M, 1], Position.UB)

dup(ub_zero_s, 0.0)
add(ub_prev_s[slot], ub_rmax_s, ub_zero_s)  # safe scalar-format copy
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)
sub(ub_prev_s[slot], ub_prev_s[slot], ub_rmax_s)
exp(ub_prev_s[slot], ub_prev_s[slot])
```

Study:

- `agent/example/kernels/a2/flash_attn_unnorm.py`
- `agent/references/patterns/a2-cube-vec-cube-vec.md`

## 7. Adapting for row sum (cadd)

Same pattern; replace `vmax` → `add` and `cmax` → `cadd`:

```python
add(ub_tmp, ub_data[0:M, 0:64], ub_data[0:M, 64:128])
cadd(ub_sum_s, ub_tmp)
brcb(ub_sum, ub_sum_s, dst_blk_stride=1, dst_rep_stride=8)
div(ub_data[0:M, 0:64], ub_data[0:M, 0:64], ub_sum)
div(ub_data[0:M, 64:128], ub_data[0:M, 64:128], ub_sum)
```

For streamed normalized attention on a2, the stable update order is:

1. compute `expdiff = exp(prev_max - curr_max)` in `[M,1]`
2. compute the float probability tile `p = exp(score - curr_max)`
3. reduce `sum_j` from that float tile with `add` + `cadd`
4. update `row_sum = row_sum * expdiff + sum_j` in `[M,1]`
5. cast `p` to half only after the sum update, if the downstream cube stage needs `p.half()` → `float()`
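The update order above can be sanity-checked with a scalar model: one row, plain Python floats instead of the vec pipeline. The function and variable names are invented here for illustration; the arithmetic follows steps 1–4.

```python
import math

# Illustrative scalar model of the streamed running max/sum update order.
def streamed_row(tiles):
    row_max, row_sum = -1e30, 0.0  # neg_large sentinel, not float(-inf)
    for tile in tiles:
        curr_max = max(row_max, max(tile))                  # running max update
        expdiff = math.exp(row_max - curr_max)              # exp(prev_m - curr_m)
        tile_sum = sum(math.exp(s - curr_max) for s in tile)  # p-tile reduction
        row_sum = row_sum * expdiff + tile_sum              # rescale, then accumulate
        row_max = curr_max
    return row_max, row_sum

scores = [0.5, 2.0, -1.0, 3.0]
m, s = streamed_row([scores[:2], scores[2:]])   # two tiles of the same row
assert m == 3.0
# matches the one-pass softmax denominator over the full row
assert abs(s - sum(math.exp(x - 3.0) for x in scores)) < 1e-12
```

Note that `expdiff` is computed from the previous max before it is overwritten, which is exactly why the `[M,1]` snapshot copy from Section 6a is needed in the real kernel.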
## 8. UB cost

| Buffer | Shape | Bytes (float) |
|---|---|---|
| `ub_tmp` | [64, 64] | 16 KB |
| `ub_max_s` | [64, 1] | 0.25 KB |
| `ub_max` | [64, 8] | 2 KB |
| **Total reduction overhead** | | ~18.25 KB |

## Files to study

- `agent/example/kernels/a2/flash_attn_score.py`: per-tile independent row max
- `agent/example/kernels/a2/flash_attn_score_iter.py`: running max across tiles using `[M,1]` scalar `vmax`
- `agent/example/kernels/a2/flash_attn_unnorm.py`: delayed `expdiff` computed from copied `[M,1]` running state
- `agent/example/kernels/a2/flash_attn_full.py`: running sum + final sliced `div` on top of the delayed numerator pipeline
- `easyasc/simulator_v2/ops/vec/v.py` and `easyasc/simulator_v2/ops/vec/_legacy_vpipe.py`: current vec runtime path for `cmax`, `brcb`, and `dup`
- `easyasc/stub_functions/vec/group.py`: cmax stub with the `dst_rep_stride` default
- `easyasc/stub_functions/vec/dupbrcb.py`: dup and brcb stubs
- `easyasc/stub_functions/vec/vecutils.py`: `infer_strides` and `infer_repeat` logic
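As a closing sanity reference, the full Section 4 pipeline (vmax halves → cmax → brcb → sub) can be modeled end to end in plain Python on the real shapes. This is a toy model for checking expected numerics, not the kernel itself.

```python
# Illustrative end-to-end model of the Section 4 row-max pattern.
C0 = 8
HALF_M, HALF_N = 64, 64

# deterministic [64, 128] float test data
data = [[((r * 37 + c * 11) % 101) / 10.0 for c in range(2 * HALF_N)]
        for r in range(HALF_M)]

# Step 1 (vmax): merge the two 64-column halves element-wise
tmp = [[max(row[c], row[c + HALF_N]) for c in range(HALF_N)] for row in data]
# Step 2 (cmax): one dense scalar per row
max_s = [max(row) for row in tmp]
# Step 3 (brcb): each scalar expanded to a C0 block of 8 identical copies
max_b = [[m] * C0 for m in max_s]
# Step 4 (sub with blk_stride=0 broadcast): subtract the row max everywhere
out = [[v - max_b[r][0] for v in row] for r, row in enumerate(data)]

# every row's maximum is now exactly 0, so exp(out) stays in (0, 1]
assert all(max(row) == 0.0 for row in out)
```

Skipping step 3 and indexing `max_s` block-wise instead reproduces the Section 2 bug: each row would be shifted by some other row's max.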