How well a compiler stack maps tensor programs onto GPU execution is critical for deep learning throughput. That mapping covers thread/block schedules, memory placement, and instruction selection (e.g., Tensor Core MMA pipelines). This article examines four dominant stacks (CUDA, ROCm, Triton, and TensorRT) from the compiler's perspective and explains which optimizations move the needle in practice.
What drives performance on modern GPUs?
All four stacks pull on the same set of levers:
- Operator scheduling & fusion: reduce kernel launches and round-trips to HBM; expose longer producer→consumer chains for register/shared-memory reuse. TensorRT's layer fusion and cuDNN's runtime fusion engines are good examples for conv and attention blocks.
- Tiling & data layout: match tile shapes to the native fragment sizes of Tensor Core MMA/WMMA/WGMMA instructions, and pad layouts to avoid shared-memory bank conflicts and bad partitioning. CUTLASS documents warp-level GEMM tiling for both Tensor Cores and CUDA Cores.
- Precision & quantization: INT8/INT4 inference via post-training calibration or QAT. TensorRT calibrates automatically and selects kernels specialized for these precisions.
- Graph capture & runtime specialization: capture and replay whole graphs to amortize launch costs; fuse dynamic subgraphs (e.g., attention). cuDNN 9's attention fusion engines now integrate with CUDA Graphs.
- Autotuning: search over tile sizes, unroll factors, and pipeline depths per architecture/SKU. TensorRT's builder-time tactic selection plays the same role as the autotuners in Triton or CUTLASS.
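To make the autotuning lever concrete, here is a toy Python sketch: a small hypothetical tuning space and a made-up cost model standing in for on-device timing. Real autotuners benchmark each candidate on the hardware; everything here (the tile list, the shared-memory formula, the unroll reward) is illustrative, not any vendor's actual heuristic.

```python
import itertools

# Hypothetical tuning space: tile shape, unroll factor, pipeline depth.
TILES = [(64, 64), (128, 64), (128, 128), (256, 64)]
UNROLLS = [1, 2, 4]
STAGES = [2, 3, 4]

def toy_cost(m, n, tile, unroll, stages, smem_budget=164 * 1024):
    """Stand-in cost model: reject configs whose double-buffered fp16
    operand tiles (K-slice of 64) would overflow shared memory, penalize
    padding waste, and mildly reward deeper unrolling."""
    tm, tn = tile
    smem_bytes = stages * (tm + tn) * 64 * 2
    if smem_bytes > smem_budget:
        return float("inf")
    padded = (-(-m // tm) * tm) * (-(-n // tn) * tn)  # ceil-div, re-multiply
    waste = padded / (m * n)
    return waste / (1 + 0.1 * unroll)

def autotune(m, n):
    """Pick the cheapest config under the toy model (a real tuner would
    time each candidate on the device instead)."""
    return min(itertools.product(TILES, UNROLLS, STAGES),
               key=lambda cfg: toy_cost(m, n, *cfg))
```

The structure (enumerate a constrained space, score, pick the argmin) is the same whether the score comes from a model or a stopwatch; production tuners mostly differ in how they prune the space.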
Here is how that lens applies to each stack.
CUDA: nvcc/ptxas, CUTLASS, and cuDNN
Compiler path. CUDA code is compiled by nvcc to PTX; ptxas then lowers PTX to SASS. The key to controlling kernel optimization is feeding flags to both the host and device phases, e.g. -Xptxas -O3. Many developers overlook this: a plain -O3 affects host code only.
Kernel generation & libraries.
- CUTLASS provides parametric templates for GEMM/conv, implementing warp-level tiling, Tensor Core MMA pipelines, and smem iterators designed for conflict-free access—canonical references for writing peak kernels, including Hopper’s WGMMA path.
- cuDNN 9 introduced runtime fusion engines (notably for attention blocks), native CUDA Graph integration for those engines, and updates for new compute capabilities, materially reducing dispatch overhead and improving memory locality in Transformer workloads.
Performance wins
- Switching from unfused framework attention to cuDNN's attention fusion cuts kernel launches and global-memory traffic; CUDA Graphs further reduce CPU launch bottlenecks for short-sequence inference.
- CUTLASS tutorials show how poorly sized tiles can waste Tensor Core throughput.
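The tile-sizing effect is just quantization arithmetic, and can be sketched without any CUDA at all. This toy function (not CUTLASS code) computes what fraction of the launched tile grid's work lands on real output elements:

```python
import math

def tile_utilization(m, n, tile_m, tile_n):
    """Fraction of the tile grid's output elements that are real (not
    padding) when an M x N result is covered by tile_m x tile_n tiles."""
    grid_m = math.ceil(m / tile_m)
    grid_n = math.ceil(n / tile_n)
    return (m * n) / (grid_m * tile_m * grid_n * tile_n)

# A 1000 x 1000 output with 128 x 128 tiles pads up to 1024 x 1024, so
# roughly 4.6% of the Tensor Core work is thrown away; small or skinny
# problems with large tiles fare far worse.
```

Wave quantization (a partially filled last wave of thread blocks on the SMs) adds a second, similar penalty on top of this one.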
When CUDA is the right tool. You need maximum control over instruction selection, smem choreography, or occupancy while staying on NVIDIA GPUs.
ROCm: HIP/Clang, rocBLAS/MIOpen, and the 6.x series
Compiler path. ROCm uses Clang/LLVM to compile HIP (a CUDA-like dialect) down to GCN/RDNA ISA. The 6.x release notes cover component-level optimizations, HW/OS coverage, and performance work.
Library kernels.
- rocBLAS and MIOpen implement GEMM/conv primitives with arch-aware tiling and algorithm selection, analogous to cuBLAS/cuDNN. The changelog consolidates iterative performance improvements across the libraries.
- Recent ROCm workstreams include better Triton support on AMD GPUs, enabling Python-first kernel development while lowering through LLVM to AMD backends.
Performance wins
- Shared-memory bank alignment matters as much on AMD GPUs as on NVIDIA. Compiler-assisted fusion (e.g., attention) and library autotuning in rocBLAS/MIOpen can close a significant portion of the gap to handwritten kernels, depending on architecture and driver. Release documentation indicates continuous tuner improvements across 6.0–6.4.x.
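The bank-conflict rule itself fits in a few lines of Python. This is a simplified model (32 banks of 4-byte words, with identical words served by a broadcast), not vendor tooling, but it shows why the classic padded-row trick works on both vendors' hardware:

```python
from collections import defaultdict

def replay_factor(addresses, num_banks=32, bank_width=4):
    """Worst-case shared-memory replay factor for one warp/wavefront access:
    the max number of *distinct* words mapped to any single bank (lanes that
    read the same word are served by a broadcast, so duplicates don't count)."""
    banks = defaultdict(set)
    for addr in addresses:
        word = addr // bank_width
        banks[word % num_banks].add(word)
    return max(len(words) for words in banks.values())

# Column access into a 32 x 32 float tile (row stride 128 bytes): every lane
# lands in bank 0, a 32-way conflict. Padding rows to 33 floats (132 bytes)
# spreads the lanes across all 32 banks.
conflict = replay_factor([lane * 128 for lane in range(32)])  # 32-way
padded = replay_factor([lane * 132 for lane in range(32)])    # conflict-free
```

A replay factor of 32 means the access is serialized into 32 transactions, which is exactly the pathology smem padding and swizzled layouts are designed to avoid.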
When ROCm is the right tool. You need native, optimized support for AMD accelerators, portability of existing CUDA-style kernels via HIP, and a clean LLVM toolchain.
Triton: a DSL/compiler for custom kernels
Compiler path. Triton is a Python-embedded DSL that lowers through LLVM; the compiler handles vectorization and memory coalescing while giving explicit control over block sizes and program IDs. The build docs cover the LLVM requirements and custom builds. NVIDIA developer materials cover Triton tuning for newer architectures, such as FP16/FP8 GEMM on Blackwell.
Optimizations.
- Autotuning over tile sizes, num_warps, and the number of pipeline stages.
- Masking for boundary conditions instead of scalar tail loops.
- Software pipelining and shared-memory staging to overlap global loads with compute.
Triton's design aims to automate this division of labor; the separation is outlined in the original announcement.
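The masking idiom can be sketched in plain Python, mirroring the shape of a Triton kernel (tl.arange offsets, a boundary mask, a masked load with other=0.0) without Triton itself. This is an illustration of the pattern, not Triton code:

```python
def blocked_sum(x, block_size=8):
    """Sum a 1-D sequence in fixed-size blocks, Triton-style: every block
    issues the same full-width "load" with a boundary mask instead of a
    separate scalar tail loop."""
    total = 0.0
    n = len(x)
    for start in range(0, n, block_size):
        offsets = [start + i for i in range(block_size)]  # like tl.arange
        mask = [off < n for off in offsets]               # boundary mask
        vals = [x[off] if ok else 0.0                     # masked load,
                for off, ok in zip(offsets, mask)]        # other=0.0
        total += sum(vals)
    return total
```

Because every block executes the same full-width code path, the compiler can vectorize and pipeline it uniformly; the mask turns the ragged tail into data rather than control flow.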
Performance wins
- Triton shines when you need a fused, shape-specialized kernel outside library coverage (e.g., bespoke attention variants, normalization-activation-matmul chains). On modern NVIDIA parts, vendor collaborations report architecture-specific improvements in the Triton backend, reducing the penalty versus CUTLASS-style kernels for common GEMMs.
When Triton is the right tool. You value Python-first iteration and autotuning, and need near-CUDA performance for custom fused ops.
TensorRT: builder-time graph optimization for inference
Compiler path. TensorRT ingests a graph (from ONNX or frameworks) and produces a hardware-specific engine. At build time it performs layer/tensor fusion, precision calibration (INT8, FP8/FP16), and kernel tactic selection; these phases are described in the best-practices documentation. TensorRT-LLM extends this with LLM runtime optimizations.
Optimizations.
- Graph-level: constant folding, concat-slice canonicalization, conv-bias-activation fusion, attention fusion.
- Precision: post-training calibration (entropy/percentile/mse) and per-tensor quantization, plus smooth-quant/QAT workflows in TensorRT-LLM.
- Runtime: paged-KV cache, in-flight batching, and scheduling for multi-stream/multi-GPU deployments (TensorRT-LLM docs).
Performance wins
- The largest gains typically come from end-to-end INT8 (or FP8 on Hopper/Blackwell, where supported), from eliminating framework overhead through a single engine, and from aggressive attention fusion. TensorRT's engine builder generates plans per architecture to avoid generic kernels.
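As a simplified illustration of the percentile calibration mentioned above (not TensorRT's implementation), one can pick an INT8 scale by clipping at a high percentile of observed activation magnitudes, so that rare outliers don't blow up the quantization step for everything else:

```python
def percentile_scale(activations, percentile=99.9):
    """Pick an INT8 scale by clipping at the given percentile of observed
    magnitudes; values above the clip saturate rather than widening the
    quantization step for the whole tensor."""
    mags = sorted(abs(v) for v in activations)
    idx = min(len(mags) - 1, int(len(mags) * percentile / 100.0))
    return mags[idx] / 127.0

def quantize_int8(x, scale):
    """Round to the nearest quantization step and saturate to INT8 range."""
    return max(-128, min(127, round(x / scale)))
```

Entropy (KL-divergence) calibration makes the same trade, clip range versus step size, but chooses the clip by minimizing information loss instead of a fixed percentile.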
When TensorRT is the right tool. Production inference on NVIDIA GPUs, where quantization and whole-graph fusion pay off most.
Choosing and tuning your stack: Practical advice
- Training vs. inference.
- Training/experimental kernels → CUDA + CUTLASS (NVIDIA) or ROCm + rocBLAS/MIOpen (AMD); Triton for custom fused ops.
- Production inference on NVIDIA → TensorRT/TensorRT-LLM for global graph-level gains.
- Exploit architecture-native instructions.
- On NVIDIA Hopper/Blackwell, ensure tiles map to WGMMA/WMMA fragment shapes; the CUTLASS materials show the warp-level structure of GEMMs and smem layouts.
- Use Triton and ROCm autotuners to specialize kernels for your shapes.
- Quantize first and then fuse.
- Quantization increases math density and reduces bandwidth; kernel/graph fusion reduces memory traffic. TensorRT's builder-time fusions plus INT8/FP8 often yield multiplicative benefits.
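The bandwidth half of that claim is simple arithmetic. This sketch assumes each GEMM operand and the output cross the memory bus exactly once (ideal caching), which is an idealization, not a measurement:

```python
def arithmetic_intensity(m, n, k, bytes_per_element):
    """FLOPs per byte of DRAM traffic for an M x N x K GEMM, assuming each
    operand matrix and the output move through memory exactly once."""
    flops = 2 * m * n * k
    traffic_bytes = (m * k + k * n + m * n) * bytes_per_element
    return flops / traffic_bytes

fp16_ai = arithmetic_intensity(4096, 4096, 4096, 2)
int8_ai = arithmetic_intensity(4096, 4096, 4096, 1)
# Halving the element size doubles FLOPs-per-byte, so bandwidth-bound layers
# gain headroom from quantization before fusion removes traffic entirely.
```

Fusion then multiplies the effect by deleting whole intermediate tensors from the traffic term, which is why the two optimizations compound rather than merely add.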
- Graph-capture short-sequence workloads.
- CUDA Graphs combined with cuDNN attention fusions reduce launch costs in autoregressive inference.
- Treat compiler flags as first-class.
- For CUDA, remember device-side flags: e.g., -Xptxas -O3,-v (and -Xptxas -O0 to diagnose). Host-only -O3 isn't sufficient.