Open
Eliminates a full DRAM round-trip for the intermediate left_swished buffer in the SwiGLU pipeline by computing silu(x) * y in a single vectorized kernel loop. Reduces SwiGLU prefill from 5 to 4 runlist entries.

New operator: AIESiLUMul with fused C++ kernels for both AIE2 (LUT tanh) and AIE2+ (hardware tanh). Integrated into swiglu_prefill.

Also fixes a pre-existing bug in swiglu_prefill/test.py (errors_2 -> errors_3).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
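As a plain-Python reference for what the fused kernel computes (a sketch only, not the AIE C++ kernel): the unfused path builds an intermediate list that stands in for the left_swished buffer, while the fused path computes silu(x) * y in one pass. The function names are illustrative. SiLU is written via tanh using the identity sigmoid(x) = 0.5 * (1 + tanh(x / 2)), on the assumption that this is why the kernels depend on a tanh implementation.

```python
import math


def silu(x):
    # SiLU(x) = x * sigmoid(x), with sigmoid(x) = 0.5 * (1 + tanh(x / 2)).
    return x * 0.5 * (1.0 + math.tanh(0.5 * x))


def silu_then_mul(x, y):
    # Unfused: the intermediate list plays the role of the left_swished buffer.
    left_swished = [silu(v) for v in x]
    return [a * b for a, b in zip(left_swished, y)]


def silu_mul_fused(x, y):
    # Fused: a single pass over both inputs, no intermediate buffer.
    return [silu(a) * b for a, b in zip(x, y)]
```

Both paths perform the same arithmetic per element, so the fusion changes memory traffic rather than the result.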
Collapses three separate NPU designs (GEMV W1, GEMV W2, SiLU+Mul) into a single fused operator. Each AIE core loads vector x once, processes both W1 and W2 rows through a shared A FIFO with pre-interleaved weights, then computes silu(left)*right entirely in L1 via kernel-local static buffers. The intermediate vectors never touch DRAM.

Reduces SwiGLU decode from 4 to 2 runlist entries and eliminates the left/right buffer allocations. Uses 4 AIE columns (DMA channel limit: 2 input + 1 output per tile).

Note: swiglu_prefill is unchanged; it uses GEMM (not GEMV), so dual-GEMV fusion does not apply.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
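A minimal sketch of the dual-GEMV idea in plain Python, assuming a row-by-row interleaving (row i of W1 followed by row i of W2). The actual interleaving granularity used by the kernel is not stated here, so interleave_rows and dual_gemv_silu_mul are hypothetical helpers for illustration only.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def dot(row, x):
    return sum(a * b for a, b in zip(row, x))


def interleave_rows(w1, w2):
    # Assumed layout: W1 row 0, W2 row 0, W1 row 1, W2 row 1, ...
    out = []
    for r1, r2 in zip(w1, w2):
        out.append(r1)
        out.append(r2)
    return out


def dual_gemv_silu_mul(w_interleaved, x):
    # x is read once and reused for both the W1 and the W2 row of each pair;
    # left and right exist only as scalars, never as full output vectors.
    out = []
    for i in range(0, len(w_interleaved), 2):
        left = dot(w_interleaved[i], x)
        right = dot(w_interleaved[i + 1], x)
        out.append(silu(left) * right)
    return out
```

The weights would be interleaved once, ahead of time, so the per-token cost is a single stream over the combined matrix.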
…on, INT4 dequant-GEMV

Implements Phases 1-3 and 5 from the decode dataflow optimization plan:

Phase 1 - fused_qkv_proj: Concatenates Wq+Wk+Wv into single GEMV (M=3072, K=2048), eliminating 3 redundant input vector loads. Reuses existing mv.o.

Phase 2 - flowkv_decode: Streaming decode attention with online softmax. 2-tile pipeline per KV head group (score tile + value tile). Intermediates flow tile-to-tile via on-chip ObjectFIFOs. Uses aie::exp2 for safe exp.

Phase 3 - swiglu_fused_decode: Complete SwiGLU fusion with 2-stage tile pipeline. Dual-GEMV+SiLU+Mul feeds directly into down-projection GEMV via inter-tile ObjectFIFO. The 32 KB intermediate never touches DDR. Host reduces 4 column partials. Benchmarked 1.32x speedup at Llama dims (5410 us -> 4103 us, 24.5 GB/s effective bandwidth).

Phase 5 - fused_dequant_gemv: Fused INT4 weight dequantization + GEMV in single kernel pass. Uses proven aie::unpack chain from expand.cc for INT4->bf16 conversion. 4x DDR bandwidth reduction.

Also fixes:
- Static buffer overflow in dual_gemv_silu_mul (1024 -> 2048 elements)
- Production-scale test params for swiglu_decode (2048, 8192)

All operators verified on AMD Ryzen AI 9 HX 370 (RyzenAI-npu4) hardware.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
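For readers unfamiliar with the online softmax used in Phase 2, here is a reference sketch in plain Python, compared against an ordinary two-pass softmax. It is not the tile kernel: the chunk parameter, the 1/sqrt(d) scaling and all function names are assumptions for illustration. exp is expressed as a power of two to mirror an exp2-based implementation, and every exponent is at most zero because the running maximum is subtracted first.

```python
import math

LOG2E = math.log2(math.e)


def exp_via_exp2(x):
    # exp(x) written as 2 ** (x * log2(e)).
    return 2.0 ** (x * LOG2E)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def attention_reference(q, keys, values):
    # Two-pass softmax attention over the whole KV sequence.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_online(q, keys, values, chunk=2):
    # Streams the KV sequence chunk by chunk, keeping only a running max,
    # a running denominator and an unnormalised output accumulator.
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk):
        scores = [dot(q, k) * scale for k in keys[start:start + chunk]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results; the first chunk has nothing to rescale.
        if running_max == -math.inf:
            correction = 0.0
        else:
            correction = exp_via_exp2(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + chunk]):
            w = exp_via_exp2(s - new_max)  # exponent <= 0, so no overflow
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because only the running statistics and the accumulator persist between chunks, the full score vector never has to be stored.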
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
📊 Test Results for Small Benchmark/Test Suite 04bb5c6 (2026_03_07_18_40_14) IRONCLAD
📈 Trends (vs main branch) for Small Benchmark/Test Suite 04bb5c6 (2026_03_07_18_40_14) IRONCLAD
[Trend charts omitted. Benchmark families charted: axpy, dequant, dual_gemv_silu_mul, eltwise_add, eltwise_mul, flowkv_decode, fused_dequant_gemv, fused_qkv_proj, gelu, gemm, layer_norm, matrix_vector_mul, mem_copy, mha, relu, rms_norm, rope, sigmoid, silu, silu_mul, softmax, swiglu (no metrics available), swiglu_decode, swiglu_fused_decode, tanh, transpose, weighted_rms_norm. Entries for the operators touched by this PR: dual_gemv_silu_mul_2048x2048_4tsi_512tso_4col0, flowkv_decode_32h_8kv_64d_128s_32cs_4col0, fused_dequant_gemv_2048x2048_1tsi_512tso_4col_g32_0, fused_qkv_proj_3072x2048_4tsi_768tso_4col0, swiglu_decode_1x2048x2048, swiglu_fused_decode_2048x2048_0.]
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
swiglu_fused_decode: Replace DDR partial-sum reduction with on-chip MemTile join + dedicated reduction tile. Stage 2 partials flow through ObjectFIFO join into a single reduction tile that sums element-wise, producing the final output directly. Eliminates 12 KB of DDR output traffic (16 KB partials -> 4 KB single output) and removes host-side partials.sum(dim=0) call. 3-stage pipeline: dual-GEMV+SiLU+Mul -> down-proj GEMV -> on-chip reduce.

flowkv_decode: Fuse RoPE rotation into the score tile kernel. Q angles (cos/sin interleaved, 128 bytes) are packed into the Q FIFO buffer alongside query vectors, staying within the 2-input DMA channel limit. The score tile applies two-halves RoPE to Q in-register before the Q*K^T dot product. K is assumed pre-rotated in the KV cache (standard practice). Eliminates RoPE as a separate operator invocation.

Both changes verified on AMD Ryzen AI NPU hardware (10/10 tests passed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
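A reference sketch of two-halves RoPE applied to Q before the score dot product, in plain Python. It assumes the common convention in which element i is paired with element i + d/2, and an angle buffer laid out as [cos0, sin0, cos1, sin1, ...]. The exact packing inside the Q FIFO buffer is not shown in this PR text, so rope_two_halves and rope_score are illustrative names only. Under these assumptions a head dimension of 64 needs 32 cos/sin pairs, which is 128 bytes in bf16.

```python
def rope_two_halves(q, angles):
    # angles is interleaved: [cos_0, sin_0, cos_1, sin_1, ...], one pair per
    # rotated (q[i], q[i + half]) pair. This layout is an assumption.
    half = len(q) // 2
    out = [0.0] * len(q)
    for i in range(half):
        c, s = angles[2 * i], angles[2 * i + 1]
        out[i] = q[i] * c - q[i + half] * s
        out[i + half] = q[i + half] * c + q[i] * s
    return out


def rope_score(q, angles, k_rotated):
    # K is taken as already rotated (as stored in the KV cache), so only Q
    # is rotated here before the dot product.
    q_rot = rope_two_halves(q, angles)
    return sum(a * b for a, b in zip(q_rot, k_rotated))
```

Each pair undergoes a plain 2D rotation, so the norm of Q is unchanged, which makes a convenient sanity check for a kernel implementation.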
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On-chip MemTile reduction added ~400 us pipeline serialization to save only 12 KB DDR traffic (~0.5 us). DDR partials with host sum() is 1.32x faster than baseline vs 1.20x with on-chip reduce. Keep the faster approach.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
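The figures in this trade-off can be cross-checked with a few lines of arithmetic. The sketch below assumes bf16 (2-byte) elements, a 2048-element output vector and 4 column partials, and reuses the 5410 us baseline, the 4103 us fused time and the 24.5 GB/s effective bandwidth reported earlier in this PR as a rough proxy for DDR throughput. The variable names are illustrative.

```python
BYTES_PER_BF16 = 2
hidden = 2048            # output vector length (assumed)
columns = 4              # one partial result per AIE column
baseline_us = 5410.0
ddr_partials_us = 4103.0
effective_gb_per_s = 24.5

partials_bytes = columns * hidden * BYTES_PER_BF16  # 16 KB written with host reduce
single_bytes = hidden * BYTES_PER_BF16              # 4 KB written with on-chip reduce
saved_bytes = partials_bytes - single_bytes         # 12 KB

# Time to move the saved bytes at the effective bandwidth: about 0.5 us.
saved_us = saved_bytes / (effective_gb_per_s * 1e9) * 1e6

# A 1.20x speedup over the baseline corresponds to roughly 4508 us,
# about 405 us slower than the 4103 us DDR-partials variant.
on_chip_us = baseline_us / 1.20
serialization_us = on_chip_us - ddr_partials_us

speedup_ddr_partials = baseline_us / ddr_partials_us  # about 1.32
```

Under these assumptions the numbers are mutually consistent: roughly 0.5 us of DDR traffic saved against roughly 400 us of added latency.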
Contributor
📊 Test Results for Small Benchmark/Test Suite ae902f6 (2026_03_07_19_01_16) IRONCLAD
📈 Trends (vs main branch) for Small Benchmark/Test Suite ae902f6 (2026_03_07_19_01_16) IRONCLAD
[Trend charts omitted. Benchmark families charted: axpy, dequant, dual_gemv_silu_mul, eltwise_add, eltwise_mul, flowkv_decode, fused_dequant_gemv, fused_qkv_proj, gelu, gemm, layer_norm, matrix_vector_mul, mem_copy, mha, relu, rms_norm, rope, sigmoid, silu, silu_mul, softmax, swiglu (no metrics available), swiglu_decode, swiglu_fused_decode, tanh, transpose, weighted_rms_norm.]
Contributor
📊 Test Results for Small Benchmark/Test Suite 50e35ad (2026_03_07_19_17_53) IRONCLAD
📈 Trends (vs main branch) for Small Benchmark/Test Suite 50e35ad (2026_03_07_19_17_53) IRONCLAD
[Trend charts omitted. Benchmark families charted: axpy, dequant, dual_gemv_silu_mul, eltwise_add, eltwise_mul, flowkv_decode, fused_dequant_gemv, fused_qkv_proj, gelu, gemm, layer_norm, matrix_vector_mul, mem_copy, mha, relu, rms_norm, rope, sigmoid, silu, silu_mul, softmax, swiglu (no metrics available), swiglu_decode, swiglu_fused_decode, tanh, transpose, weighted_rms_norm.]
Contributor
📊 Test Results for Test Example Applications 50e35ad (2026_03_07_19_25_10) IRONCLAD
📈 Trends (vs main branch) for Test Example Applications 50e35ad (2026_03_07_19_25_10) IRONCLAD
[Trend charts omitted. Benchmarks charted: llama_3.2_1b, llama_3.2_1b_prompt_13_tokens_1, llama_3.2_1b_prompt_13_tokens_40, llama_3.2_1b_prompt_2048_tokens_1, llama_3.2_1b_prompt_2048_tokens_40.]
andrej reviewed Mar 9, 2026
Comment on lines +11 to +12:
This module re-exports the GEMV design function for documentation and
to maintain the 4-file operator convention.
Collaborator
I don't think this is really needed. In my opinion it would be clearer to import GEMV directly.
Summary
- Implements swiglu_fused_decode (Phase 3), fused_qkv_proj (Phase 1), flowkv_decode (Phase 2), and fused_dequant_gemv (Phase 5)
- Fixes the static buffer overflow in the dual_gemv_silu_mul kernels (1024 → 2048 elements) and adds production-scale test parameters

Test plan
- dual_gemv_silu_mul: 5/5 passed (buffer overflow fix verified on hardware)
- fused_qkv_proj: 5/5 passed (reuses mv.o, Llama dims M=3072 K=2048)
- flowkv_decode: 5/5 passed (online softmax, 2-tile pipeline per KV head group)
- fused_dequant_gemv: 5/5 passed (INT4→bf16 fused dequant+GEMV)
- swiglu_fused_decode: 5/5 passed (2-stage pipeline, 1.32x speedup at Llama dims)
- black --check . and reuse lint

Hardware: AMD Ryzen AI 9 HX 370 (RyzenAI-npu4), XRT 2.21.75
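To illustrate the fused_qkv_proj item, a small plain-Python sketch of the concatenation trick: stacking Wq, Wk and Wv along the output dimension lets one GEMV read the input vector once and produce all three projections, which are then split by row count. The helper names are hypothetical. The split 2048 + 512 + 512 = 3072 assumes 32 query heads, 8 KV heads and a head dimension of 64, which matches M=3072 above.

```python
def gemv(w, x):
    # Row-major matrix-vector product.
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def fused_qkv_proj(wq, wk, wv, x):
    # Concatenate along the output (row) dimension. A real implementation
    # would do this once at weight-load time rather than on every call.
    w_cat = wq + wk + wv
    y = gemv(w_cat, x)
    nq, nk = len(wq), len(wk)
    return y[:nq], y[nq:nq + nk], y[nq + nk:]


# Output rows at the assumed Llama 3.2 1B attention shape.
m_rows = 32 * 64 + 8 * 64 + 8 * 64
```

The fused result matches three separate GEMVs element for element, since each output row is still an independent dot product.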
Benchmark: swiglu_fused_decode vs swiglu_decode (Llama 3.2 1B dims)
🤖 Generated with Claude Code