AI-trends.today

RightNow Releases AutoKernel: An Open-Source Framework for GPU Kernel Optimization Using Autonomous Agent Loops

Tech | By Gavin Wallace | 06/04/2026 | 9 Mins Read

Writing fast GPU code is one of machine learning's most difficult specializations. Researchers at RightNow AI are working to automate the entire process.

RightNow AI has developed AutoKernel, an open-source tool that applies an LLM agent to optimize the GPU kernels of any PyTorch model. The pitch is straightforward: point it at a model before you go to bed and wake up to faster Triton kernels, no GPU expertise required.

https://arxiv.org/pdf/2603.21331

Why GPU Kernels Are So Hard to Optimize

A GPU kernel is a small program that runs in parallel across thousands of GPU cores. When you run a transformer such as LLaMA or GPT-2, kernels account for the majority of computation time, performing operations such as matrix multiplication, softmax, layer normalization, and attention. They are either generated by PyTorch or shipped in libraries such as cuBLAS and cuDNN.

The problem is that squeezing maximum performance out of these kernels requires reasoning simultaneously about arithmetic intensity, memory coalescing, register pressure, tile sizes, warp-level synchronization, and tensor core instruction selection, a combination of skills that takes years to develop. A single high-performance matmul kernel can run to 200+ lines of CUDA or Triton code with dozens of interdependent parameters. Manual tuning is hard to scale and demands deep expertise.

The KernelBench benchmark, which tests frontier LLMs on 250 GPU kernel tasks, revealed that with one-shot generation the top models matched PyTorch's baseline performance less than 20% of the time. AutoKernel was developed in direct response to this gap.

The Core Loop: Edit, Benchmark, Keep or Revert

AutoKernel is built on the insight that a kernel engineer's workflow is simple: write a candidate, benchmark it, keep improvements, throw out regressions, repeat. The framework automates this loop. An LLM agent modifies a single file, kernel.py; a fixed benchmark harness verifies correctness and measures throughput; and the result determines whether the change persists. Importantly, each experiment corresponds to a single git commit. Kept experiments advance the branch; reverted ones are erased with git reset. The entire history is browsable with standard git tools, and experiment results are logged to a plain tab-separated results.tsv file: dependency-free, human-readable, and trivially parseable by the agent.
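The keep/revert loop described above can be sketched in a few lines. This is a minimal illustration, not AutoKernel's actual API: `propose_edit`, `passes_correctness`, and `benchmark` are hypothetical callables supplied by the caller, and the history list stands in for results.tsv.

```python
# Minimal sketch of a keep/revert optimization loop (illustrative names,
# not AutoKernel's actual API). Each experiment either persists or is
# discarded, mirroring one git commit per experiment.
def keep_revert_loop(kernel_src, propose_edit, passes_correctness,
                     benchmark, n_experiments):
    best_src = kernel_src
    best_time = benchmark(best_src)           # lower is better
    history = []                              # stands in for results.tsv
    for i in range(n_experiments):
        candidate = propose_edit(best_src)    # agent edits the kernel
        if passes_correctness(candidate):
            t = benchmark(candidate)
            kept = t < best_time              # keep improvements only
            if kept:
                best_src, best_time = candidate, t
        else:
            t, kept = float("inf"), False     # failed candidates never land
        history.append((i, t, kept))          # one row per experiment
    return best_src, best_time, history
```

The single evaluation function (correctness gate plus benchmark) is what makes the loop safe to run unattended: a candidate that fails validation can never become the new baseline.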

Each iteration takes approximately 90 seconds: 30 seconds for correctness checking, 30 seconds for performance benchmarking via Triton's do_bench, and 30 seconds for agent reasoning and code modification. At roughly 40 experiments an hour, a 10-hour overnight run yields 300 to 400 experiments.
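The budget arithmetic checks out directly:

```python
# Experiment-budget arithmetic for an overnight run, from the
# per-iteration timings stated above.
seconds_per_iter = 30 + 30 + 30          # correctness + benchmark + reasoning
iters_per_hour = 3600 // seconds_per_iter
overnight_hours = 10

print(iters_per_hour)                     # 40 experiments per hour
print(iters_per_hour * overnight_hours)   # 400 experiments upper bound
```

The 300-to-400 range reflects that some iterations (compile failures, longer agent turns) run over the 90-second average.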

The design is inspired by Andrej Karpathy's autoresearch experiments, in which an AI agent applying a keep/revert loop to LLM training code discovered 20 optimizations across 700 experiments in two days on a single GPU. AutoKernel transplants this loop into kernel code, a very different search space, and uses a correctness-gated benchmark as the evaluation function instead of a validation loss.

The agent reads program.md, a 909-line instruction document that encodes expert know-how into a six-tier optimization playbook. The tiers progress from block size tuning (sweeping tile dimensions through powers of 2, adjusting num_warps and num_stages) through memory access patterns (coalesced loads, software prefetching, L2 swizzling), compute optimizations (TF32 accumulation, epilogue fusion), advanced techniques (split-K, persistent kernels, Triton autotune, warp specialization), and architecture-specific strategies (TMA on Hopper, cp.async on Ampere, adjusted sizes for L4/RTX), to kernel-specific algorithms such as online softmax for attention and Welford's algorithm for normalization. The playbook is designed to sustain productive runs of more than 10 hours.
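Tier 1 of the playbook amounts to a grid sweep over tile dimensions and launch parameters. A simplified generator for such a sweep might look like the following; the exact ranges are illustrative, not AutoKernel's:

```python
from itertools import product

# Simplified Tier-1 sweep: powers-of-2 tile sizes plus launch parameters.
# Ranges here are an illustrative assumption, not AutoKernel's actual grid.
def tier1_configs():
    block_ms = [2**k for k in range(5, 8)]   # 32, 64, 128
    block_ns = [2**k for k in range(5, 8)]   # 32, 64, 128
    num_warps = [4, 8]
    num_stages = [2, 3, 4]
    return [dict(BLOCK_M=m, BLOCK_N=n, num_warps=w, num_stages=s)
            for m, n, w, s in product(block_ms, block_ns, num_warps, num_stages)]

configs = tier1_configs()
print(len(configs))   # 3 * 3 * 2 * 3 = 54 candidate configurations
```

Even this toy grid produces dozens of candidates, which is why the agent works through tiers incrementally rather than sweeping everything at once.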


Profile First, Prioritize Where It Matters

Unlike previous work that treats kernel problems in isolation, AutoKernel starts with a complete PyTorch model. It uses torch.profiler with shape recording to capture per-kernel GPU time, then ranks optimization targets using Amdahl's law, the principle that the overall speedup you can achieve is bounded by the fraction of total runtime the component represents. A 1.5× speedup on a kernel consuming 60% of total runtime yields a 1.25× end-to-end gain; the same speedup on a kernel consuming 5% of runtime yields only about 1.02×.
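The prioritization rule can be checked directly. If a kernel accounts for fraction f of total runtime and gets a local speedup s, the end-to-end speedup is 1 / ((1 - f) + f/s):

```python
# Amdahl's law: end-to-end speedup from a local kernel speedup.
def end_to_end_speedup(fraction, local_speedup):
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# 1.5x on a kernel that is 60% of runtime, vs. the same on a 5% kernel.
print(round(end_to_end_speedup(0.60, 1.5), 2))  # 1.25
print(round(end_to_end_speedup(0.05, 1.5), 2))  # ~1.02
```

This is why the profiler ranking matters: the 60% kernel is worth hours of agent time, while even an infinite speedup on the 5% kernel could never exceed about 1.05× end to end.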

It detects GPUs from a built-in database covering NVIDIA parts (H100, A100, L40S, L4, A10, RTX 3090/4080/3080) as well as AMD accelerators (MI300X, MI325X, MI350X, MI355X). For unknown GPUs, it estimates peak FP16 throughput from SM count, clock rate, and compute capability, making the system usable well beyond the latest NVIDIA offerings.
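A first-order peak estimate of this kind can be derived from SM count and clock alone. This is a hedged sketch, not AutoKernel's actual formula: the FP16 FLOPs-per-SM-per-clock figure varies by architecture, and 512 is an illustrative assumption.

```python
# First-order FP16 peak estimate for an unknown GPU. The
# fp16_flops_per_sm_per_clock value varies by architecture; 512 is an
# illustrative assumption here, not a vendor-specified constant.
def estimate_fp16_peak_tflops(sm_count, boost_clock_ghz,
                              fp16_flops_per_sm_per_clock=512):
    flops = sm_count * boost_clock_ghz * 1e9 * fp16_flops_per_sm_per_clock
    return flops / 1e12

# Example: a hypothetical 100-SM part boosting to 1.5 GHz.
print(round(estimate_fp16_peak_tflops(100, 1.5), 1))  # 76.8 TFLOPS
```

An estimate like this does not need to be exact; it only needs to be close enough for the 90%-of-peak stopping condition to be meaningful.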

The orchestrator (orchestrate.py) moves from one kernel to the next when any of four conditions is met: five consecutive reverts, 90% of GPU peak utilization reached, a two-hour elapsed time budget exhausted, or a 2× speedup already achieved on that kernel. This keeps the agent from wasting time on kernels with diminishing returns while other targets wait.
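The four move-on conditions reduce to a single short predicate; this sketch uses illustrative field names, not orchestrate.py's actual state:

```python
# Sketch of the four move-on conditions described above (names are
# illustrative, not orchestrate.py's actual state).
def should_move_on(consecutive_reverts, peak_utilization,
                   elapsed_hours, speedup):
    return (consecutive_reverts >= 5       # stuck: five straight reverts
            or peak_utilization >= 0.90    # already near the hardware roofline
            or elapsed_hours >= 2.0        # per-kernel time budget exhausted
            or speedup >= 2.0)             # target hit, diminishing returns

print(should_move_on(5, 0.50, 0.5, 1.2))  # True  (revert limit hit)
print(should_move_on(2, 0.70, 1.0, 1.4))  # False (keep optimizing)
```

Any single condition firing is enough, so a kernel that plateaus early frees its time budget for other targets.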

The Five-Stage Correctness Harness

AutoKernel takes correctness very seriously. Before any speedup is recorded, each candidate kernel passes through five validation stages. Stage 1 runs a quick smoke test that catches compilation errors and mismatched shapes in under a second. Stage 2 sweeps 8 to 10 input configurations across three data types (FP16, BF16, and FP32) to catch size-dependent bugs such as boundary handling and tile remainder logic. Stage 3 tests numerical stability on adversarial inputs; for softmax this means rows of large, identical values. Stage 4 verifies determinism by running the same input three times and demanding bitwise-identical outputs, which catches race conditions in parallel reductions and nondeterministic atomics. Stage 5 exposes masking bugs and tile remainder errors by testing non-power-of-2 dimensions such as 1023, 1537, and 4097.

Tolerances are dtype-specific: FP16 uses atol = 10⁻², BF16 uses 2 × 10⁻², and FP32 uses 10⁻⁴. In the full evaluation of 34 configurations on an NVIDIA H100, zero correctness failures were reported across eager, custom, and compiled kernels.
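The dtype-specific tolerances amount to a per-element comparison against the PyTorch reference; a minimal stand-in using plain Python lists (the real harness compares tensors) looks like:

```python
import math

# Dtype-specific absolute tolerances from the harness described above.
ATOL = {"fp16": 1e-2, "bf16": 2e-2, "fp32": 1e-4}

def outputs_match(reference, candidate, dtype):
    """Element-wise comparison against the reference at the dtype's atol."""
    atol = ATOL[dtype]
    return all(math.isclose(r, c, abs_tol=atol)
               for r, c in zip(reference, candidate))

# A 5e-3 error passes at FP16 tolerance but fails at FP32 tolerance.
print(outputs_match([1.0, 2.0], [1.0, 2.005], "fp16"))  # True
print(outputs_match([1.0, 2.0], [1.0, 2.005], "fp32"))  # False
```

The looser FP16/BF16 bounds reflect those formats' reduced mantissa precision; holding a half-precision kernel to FP32 tolerances would reject correct code.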

Dual Backends: Triton and CUDA C++

AutoKernel supports both Triton and CUDA C++ within the same framework. Triton is a Python-like domain-specific language that JIT-compiles in 1 to 5 seconds, making it ideal for rapid iteration: the agent can modify block sizes, warp counts, pipeline stages, accumulator precision, and loop structure, and Triton reaches up to 90% of cuBLAS performance on matmul. CUDA C++ is included for cases requiring direct access to warp-level primitives, WMMA tensor core instructions (using 16×16×16 fragments), vectorized loads via float4 and half2, bank-conflict-free shared memory layouts, and double buffering. Each backend exposes the same kernel_fn() entry point, so the benchmark infrastructure runs identically regardless of backend.
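The shared entry point means the harness never branches on backend. A minimal sketch of that contract, with stand-in "kernels" in place of real Triton or CUDA code (only the kernel_fn() name comes from the article; everything else is illustrative):

```python
import time

# Both backends expose the same callable, so one harness benchmarks
# either. These stand-in "kernels" are illustrative, not real GPU code.
def triton_backend_kernel_fn(x):
    return [v * 2 for v in x]       # pretend Triton implementation

def cuda_backend_kernel_fn(x):
    return [v + v for v in x]       # pretend CUDA C++ implementation

def bench(kernel_fn, x, iters=100):
    """Backend-agnostic harness: time repeated calls, return last output."""
    start = time.perf_counter()
    for _ in range(iters):
        out = kernel_fn(x)
    return out, time.perf_counter() - start

for fn in (triton_backend_kernel_fn, cuda_backend_kernel_fn):
    out, elapsed = bench(fn, [1, 2, 3])
    print(out)  # [2, 4, 6] from either backend
```

Keeping the interface uniform is what lets the agent switch backends mid-run without touching the benchmark or correctness machinery.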

The system ships with nine kernels covering the most important operations in modern transformer architectures, including cross_entropy, rotary_embedding, and reduce. Each has its own PyTorch-based reference implementation that serves as the correctness oracle, and reference.py computes throughput and roofline utilization in TFLOPS and GB/s against the detected GPU peak.

Benchmarking Results for H100

The results are most impressive for memory-bound kernels. RMSNorm achieves 5.29× over eager and 2.83× over torch.compile at the largest tested size, reaching 2,788 GB/s, or 83% of the H100's 3,352 GB/s peak bandwidth. Softmax reaches 2,800 GB/s with a 2.82× speedup over eager and 3.44× over torch.compile. Cross-entropy achieves 2.21× over eager and 2.94× over torch.compile, reaching 2,070 GB/s. The gains come from fusing multi-operation ATen decompositions into single-pass Triton kernels that minimize HBM (high-bandwidth memory) traffic.
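For memory-bound kernels, the relevant roofline metric is achieved bandwidth over peak bandwidth, which is where the 83% figure comes from:

```python
# Roofline-style bandwidth utilization for memory-bound kernels.
def bandwidth_utilization(achieved_gbps, peak_gbps):
    return achieved_gbps / peak_gbps

# RMSNorm: 2,788 GB/s achieved vs. the H100's 3,352 GB/s peak.
print(round(100 * bandwidth_utilization(2788, 3352)))  # 83 (percent)
# Softmax: 2,800 GB/s on the same peak.
print(round(100 * bandwidth_utilization(2800, 3352)))  # 84 (percent)
```

At 80-90% of peak bandwidth, further speedups must come from moving fewer bytes (more fusion), not from faster access patterns, which is exactly why the orchestrator's 90%-utilization condition moves the agent along.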

AutoKernel beats torch.compile in 12 of the 16 benchmarked configurations, even with torch.compile running in max-autotune mode with Triton autotuning enabled. TorchInductor's generic autotuning and fusion do not always exploit kernel-specific tiling or reduction strategies.

Matmul is notably harder: PyTorch's cuBLAS backend is extensively tuned per GPU architecture, and AutoKernel's Triton matmul achieves 278 TFLOPS, well below the cuBLAS baseline. However, at the 2048³ size AutoKernel beats torch.compile by 1.55×, showing that TorchInductor's matmul autotuning is not always optimal either. Closing the cuBLAS gap is a primary goal of continued agent development.

In community deployment, an AutoKernel-optimized kernel took first place on the vectorsum_v2 B200 leaderboard with a latency of 44.086 µs, ahead of the second-place entry at 44.249 µs and third place at 46.553 µs. A community user also reported that a single AutoKernel prompt, requiring approximately three minutes of agent interaction, produced a Triton FP4 matrix multiplication kernel that outperforms CUTLASS by 1.63× to 2.15× across multiple shapes on the H100. Given that CUTLASS consists of hand-optimized C++ templates designed specifically for NVIDIA Tensor Cores, the result is especially notable.

What You Need to Know

  • AutoKernel automates GPU kernel tuning, work that normally takes experts weeks. By mechanizing the write-benchmark-keep/revert loop that expert kernel engineers already follow, the system runs 300 to 400 experiments per overnight session on a single GPU without human intervention.
  • Correctness is verified before any speedup is recorded. Every candidate kernel must pass a five-stage harness covering smoke tests, shape sweeps across 10+ configurations, numerical stability under adversarial inputs, determinism verification, and non-power-of-two edge cases, eliminating the risk of the agent "optimizing" its way into wrong outputs.
  • Memory-bound kernels see the biggest gains over PyTorch and torch.compile. On an NVIDIA H100, AutoKernel's Triton kernels achieve 5.29× over eager on RMSNorm, 2.82× on softmax, and 2.21× on cross-entropy, with the gains coming from fusing multi-operation ATen decompositions into single-pass kernels that minimize HBM traffic.
  • Amdahl's law decides where the agent spends its time. Rather than optimizing kernels in isolation, AutoKernel profiles the entire PyTorch model and allocates effort proportionally to each kernel's share of total GPU runtime, ensuring that improvements compound at the model level, not just the kernel level.

Check out the Paper and Repo.
