
Tencent Hunyuan releases HPC-Ops, a High Performance LLM Operator Library

Tech · By Gavin Wallace · 28/01/2026 · 4 Mins Read

Tencent Hunyuan has open-sourced HPC-Ops, a production-grade operator library for large language model inference. HPC-Ops focuses on the low-level CUDA kernels of core operators such as Attention, Grouped GEMM, and Fused MoE, and exposes them through compact C++ and Python APIs so they can be integrated into existing inference systems.

HPC-Ops is already used in Tencent's internal large-scale services. On mainstream inference cards, these deployments report a throughput improvement of about 30% for Tencent-HY models and around 17% for DeepSeek models. These are service-level numbers, reflecting the overall effect of faster kernels in a real inference pipeline.

HPC-Ops: Scope, design, and implementation

HPC-Ops, created by the Tencent Hunyuan AI Infra team, is an easy-to-use, production-grade operator library that provides high-performance kernels for LLM inference. The project is not intended to replace existing serving frameworks. Instead, it exposes clean kernel APIs that can be called by systems that already handle scheduling, KV caching, batching, and transport.

These APIs slot into inference frameworks such as vLLM or SGLang: a framework team can swap HPC-Ops kernels in behind its existing abstractions without changing server behavior.
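For intuition, here is a minimal sketch of that swap pattern in Python. The `hpc_ops` module name and its `attention` symbol are assumptions for illustration, not the library's documented API; only the fallback path uses real PyTorch calls.

```python
# Hypothetical sketch: routing a framework's attention call through a
# drop-in kernel. `hpc_ops` and `hpc_ops.attention` are illustrative
# names, not HPC-Ops' documented Python API.
from typing import Callable
import torch

_ATTENTION_BACKENDS: dict[str, Callable] = {}

def register_backend(name: str, fn: Callable) -> None:
    _ATTENTION_BACKENDS[name] = fn

def reference_attention(q: torch.Tensor, k: torch.Tensor,
                        v: torch.Tensor) -> torch.Tensor:
    # The plain path the framework already ships.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

register_backend("reference", reference_attention)

try:
    import hpc_ops                                  # hypothetical binding name
    register_backend("hpc_ops", hpc_ops.attention)  # illustrative symbol
except ImportError:
    pass                                            # fall back silently

def attention(q, k, v, backend: str = "reference") -> torch.Tensor:
    # Scheduling, batching, and KV-cache code above this call never change.
    return _ATTENTION_BACKENDS[backend](q, k, v)
```

Because the call signature stays fixed, the backend choice becomes a configuration detail rather than a code change.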

HPC-Ops is built in C++ with CUDA, CuTe, and CUTLASS. The kernels are kept small and readable, so they double as a tutorial for modern CUDA.

Kernel performance characteristics

The project publishes the maximum speedup observed for each operator against strong baselines. These are operator-level microbenchmarks, not service-level numbers: the team stresses that performance varies with shape and workload, so the reported figures are best-case gains.

For Attention in bf16, HPC-Ops reports a speedup of up to 1.33x against baselines including FlashInfer, FlashAttention 2 and 3, and TensorRT. For Attention in fp8, it likewise reports gains over FlashInfer, FlashAttention 3, and TensorRT.

For Fused MoE, the maximum observed speedup is 1.49x in prefill and up to 1.14x in decode. For fp8 GroupGEMM, compared against DeepGEMM, the library reports gains of up to 1.1x in prefill and as much as 1.88x in decode.
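"Up to" figures of this kind are typically the best ratio over a sweep of shapes. A minimal sketch of how such a number is produced, assuming PyTorch on a CUDA device and stand-in kernel functions (nothing here is HPC-Ops' own benchmark code):

```python
# Time a baseline and a candidate kernel over a sweep of GEMM shapes
# and report the best ratio, i.e. the single "up to X times" number.
import time
import torch

def bench(fn, *args, iters: int = 50) -> float:
    for _ in range(5):                # warm-up
        fn(*args)
    torch.cuda.synchronize()          # count the GPU work, not just launches
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

def max_speedup(baseline, candidate, shapes) -> float:
    best = 0.0
    for m, k, n in shapes:
        a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
        b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
        best = max(best, bench(baseline, a, b) / bench(candidate, a, b))
    return best

# Usage (my_kernel is a stand-in for the kernel under test):
# max_speedup(torch.matmul, my_kernel, [(1, 4096, 4096), (512, 4096, 4096)])
```

Averages over the same sweep would look smaller, which is exactly the caveat the team raises.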

This matters because decode is usually the bottleneck in autoregressive generation, where batch sizes are small and memory traffic dominates. Attention and GroupGEMM show their largest relative improvements precisely in decode, which suggests HPC-Ops concentrates on the phase users actually feel as latency.
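A back-of-the-envelope calculation shows why small-batch decode is memory bound; the matrix sizes below are illustrative, not tied to any particular model:

```python
# For y = x @ W with W of shape (k, n), every decode step must stream the
# whole weight matrix, so arithmetic intensity scales with batch size.
def arithmetic_intensity(batch: int, k: int = 4096, n: int = 4096,
                         bytes_per_element: int = 2) -> float:
    flops = 2 * batch * k * n                 # one multiply-add per weight per row
    bytes_moved = k * n * bytes_per_element   # weight traffic dominates at small batch
    return flops / bytes_moved

print(arithmetic_intensity(1))     # 1.0 FLOP/byte: far too low to keep tensor cores busy
print(arithmetic_intensity(256))   # 256.0: large batches drift toward compute bound
```

At one FLOP per byte, the kernel's job is to saturate memory bandwidth, and fp8 weights halve the bytes moved, which is one reason the fp8 decode numbers stand out.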

Precision and supported kernels

The release is organized into three operator families:

  • Attention: both the prefill and decode kernels support paged attention, the memory layout frameworks such as vLLM use to arrange key and value blocks into fixed-size pages, which improves memory reuse for long sequences (a toy sketch follows this list).
  • Grouped GEMM: quantized GroupGEMM with fp8 weights. HPC-Ops offers both block-wise and per-tensor scaling, letting teams trade quantization granularity against parameter storage and calibration cost.
  • Fused MoE: the operator fuses expert routing and expert computation into a single quantized operation, again with fp8 expert weights and both per-block and per-tensor scaling.
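The toy sketch below illustrates the paged-KV idea in the abstract (vLLM-style block tables); the block and pool sizes are arbitrary, and none of this is HPC-Ops code:

```python
# A block table maps each sequence's logical token positions to fixed-size
# physical blocks in one shared pool, so KV memory grows one block at a time.
import torch

BLOCK, HEADS, DIM = 16, 8, 128
pool = torch.zeros(1024, BLOCK, HEADS, DIM)   # shared physical KV pool
free_blocks = list(range(1024))
block_tables: dict[int, list[int]] = {}       # sequence id -> physical block ids

def append_kv(seq: int, pos: int, kv: torch.Tensor) -> None:
    table = block_tables.setdefault(seq, [])
    if pos % BLOCK == 0:                      # sequence grew past its last block
        table.append(free_blocks.pop())       # allocate one block, not a whole buffer
    pool[table[pos // BLOCK], pos % BLOCK] = kv
```

A paged attention kernel then gathers keys and values through the block table instead of assuming one contiguous buffer per sequence, which is what makes long sequences cheap to grow and free.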

These kernels natively support bf16, fp8, and other data formats. That matches the production trend toward lower-precision formats that preserve accuracy while reducing bandwidth and increasing tensor-core utilization.
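To make the scaling options from the GroupGEMM and Fused MoE bullets concrete, here is a minimal sketch of per-tensor versus block-wise fp8 quantization, using PyTorch's float8_e4m3fn dtype for illustration (again, not HPC-Ops' own code; the 128 block size is an assumption):

```python
# Per-tensor scaling stores one scale per weight tensor; block-wise scaling
# stores one scale per tile, so outliers only distort their own tile.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_per_tensor(w: torch.Tensor):
    # Cheapest to store and calibrate: a single scale for the whole tensor.
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    return (w / scale).to(torch.float8_e4m3fn), scale

def quantize_blockwise(w: torch.Tensor, block: int = 128):
    # One scale per (block x block) tile; dims must divide evenly here.
    n, k = w.shape
    tiles = w.reshape(n // block, block, k // block, block)
    scales = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (tiles / scales).to(torch.float8_e4m3fn).reshape(n, k)
    return q, scales.squeeze(1).squeeze(-1)   # one scale per tile
```

Per-tensor scaling adds only one scalar per weight tensor, while block-wise scaling adds (n/128)·(k/128) scales but tracks local dynamic range far more closely.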

Key Takeaways

  • Tencent Hunyuan has open-sourced HPC-Ops, a production-grade operator library for NVIDIA SM90 GPUs, including the H20. The C++ and CUDA kernels are built with CuTe and CUTLASS.
  • In production deployments on mainstream inference cards, HPC-Ops reports about a 30 percent QPM increase for Tencent-HY and a 17 percent QPM gain for DeepSeek.
  • Operator microbenchmarks show maximum speedups of up to 1.33x for bf16 Attention, up to 2.22x for fp8 Attention, up to 1.49x for Fused MoE in prefill, and up to 1.88x for fp8 GroupGEMM in decode, against baselines as strong as FlashInfer, FlashAttention, TensorRT, and DeepGEMM.
  • The library focuses on three operator families: Attention with paged-attention support, quantized GroupGEMM, and Fused MoE with fp8 expert weights, offering both per-tensor and block-wise scaling.
  • HPC-Ops is designed as an operator layer that integrates with existing frameworks such as vLLM and SGLang. The roadmap targets long-context LLMs and further quantization strategies, including 8-bit and 4-bit.

Check out the Repo here. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, subscribe to our Newsletter, and join us on Telegram.


Michal Sutter is a data scientist with a master's degree in data science from the University of Padova. He excels at transforming large datasets into actionable insights and has a strong foundation in statistics, machine learning, and data engineering.
