AI-trends.today

Moonshot AI Researchers and Tsinghua Researchers propose PrfaaS, a cross-datacenter KVCache architecture that rethinks how LLMs can be served at scale.

Tech · By Gavin Wallace · 20/04/2026 · 7 Mins Read

For years, the way large language models handle inference has been stuck inside a box, literally. The high-bandwidth RDMA networking that modern LLM serving depends on has confined prefill and decode to the same datacenter, sometimes even to the same rack. A team of researchers at Moonshot AI and Tsinghua University is making the case that this constraint is about to break down, and that the right architecture can already exploit that shift.

Prefill as a Service (PrfaaS) lets datacenters serve each other by offloading long-context prefill to dedicated, high-density prefill clusters. In a case study using an internal 1T-parameter hybrid model, the result is 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous setup, while consuming only a fraction of available cross-datacenter bandwidth. The researchers note that at equal hardware cost, throughput improves by approximately 15%; the full 54% benefit comes from pairing the more powerful H200 GPUs for prefill with H20 GPUs for decode.

https://arxiv.org/pdf/2604.15039v1

The Existing Architecture Hits a Wall

Understanding why LLM inference is divided into two phases helps explain PrfaaS. Prefill is the step where the model processes all input tokens and generates the KVCache; it is compute-intensive. Decode is where the model generates output tokens one at a time; it is memory-bandwidth-bound. Prefill-decode disaggregation (PD) runs the two phases on different hardware, which improves utilization and allows each phase to be optimized independently.

Separating prefill and decode creates a transport problem. Prefill runs on one machine and decode on another, so the KVCache generated by prefill must be moved to the decode side before output can start. In conventional dense-attention models, those using Grouped Query Attention (GQA), this KVCache is enormous. The research team benchmarks MiniMax-M2.5, a representative dense GQA model, producing KVCache at roughly 60 Gbps for a 32K-token request on a single 8×H200 instance. Only RDMA interconnects can move that volume of data, which is why conventional PD has been restricted to a datacenter-scale network fabric. Prefilling and decoding in separate clusters was not possible, much less across multiple datacenters.
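
As a back-of-envelope check on these magnitudes, the KVCache volume and the resulting KV throughput for a dense GQA model can be sketched in a few lines. The layer count, head count, head dimension, and prefill latency below are illustrative placeholders, not MiniMax-M2.5's actual configuration:

```python
def kvcache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KVCache size for one request: a K and a V tensor per layer,
    each num_kv_heads * head_dim wide per token (FP16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

def kv_throughput_gbps(cache_bytes, prefill_latency_s):
    """KV throughput, as the paper defines it: KVCache size / prefill latency."""
    return cache_bytes * 8 / prefill_latency_s / 1e9

# Hypothetical dense GQA config at a 32K-token context.
cache = kvcache_bytes(num_layers=60, num_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"KVCache: {cache / 2**30:.1f} GiB")
print(f"KV throughput at 2 s prefill: {kv_throughput_gbps(cache, 2.0):.1f} Gbps")
```

Even with modest placeholder numbers the cache runs to several gigabytes per request and tens of Gbps of transfer demand, which is why Ethernet was never an option for dense models.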

The Hybrid Math is Different

PrfaaS is timely because of a shift at the model-architecture level. A growing class of models, including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T, adopt hybrid attention stacks that interleave a small number of full-attention layers with a larger number of linear-complexity or bounded-state layers such as Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA). In these architectures, only the full-attention layers produce KVCache that scales with sequence length. The linear-complexity layers maintain a fixed-size recurrent state whose footprint becomes negligible when the context is long.
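
The scaling difference can be illustrated with a toy footprint model: only the full-attention layers grow with sequence length, while the linear-attention layers contribute a constant term. The 7:1 layer ratio mirrors the paper's hybrid example, but the per-token and per-layer byte counts below are invented for illustration:

```python
def hybrid_kv_bytes(seq_len, full_layers, linear_layers,
                    kv_bytes_per_token, state_bytes_per_layer):
    """Total cache footprint of a hybrid stack: the full-attention part
    grows linearly with seq_len; the linear-attention state is fixed."""
    growing = full_layers * kv_bytes_per_token * seq_len
    fixed = linear_layers * state_bytes_per_layer
    return growing + fixed

# A 7:1 hybrid ratio, e.g. 56 linear-attention layers per 8 full-attention layers.
for n in (1_024, 32_768, 131_072):
    b = hybrid_kv_bytes(n, full_layers=8, linear_layers=56,
                        kv_bytes_per_token=4096, state_bytes_per_layer=64 * 2**20)
    print(f"{n:>7} tokens: {b / 2**30:.2f} GiB")
```

As the context grows, the fixed linear-state term is amortized away and the per-request cache is dominated by the handful of full-attention layers, which is exactly what pulls KV throughput down to Ethernet-friendly rates.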

The KV throughput numbers — defined as KVCache size divided by prefill latency — tell the story clearly. At 32K tokens, MiMo-V2-Flash produces KVCache at 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13× reduction. Qwen3.5-397B reaches 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4× reduction. For Ring-2.5-1T specifically, the paper decomposes the savings: MLA contributes roughly a 4.5× compression over GQA, and the 7:1 hybrid ratio contributes another approximately 8× reduction, yielding an overall KV memory saving of roughly 36×. For the internal 1T model used in the case study, KV throughput at 32K tokens is just 3.19 Gbps — a level that modern inter-datacenter Ethernet links can actually sustain.

The research team makes a crucial distinction for AI developers building actual systems. A smaller KVCache is necessary, but not sufficient, to enable cross-datacenter PD disaggregation. Real workloads are often bursty, request lengths and prefix-cache hits can be distributed unevenly across nodes, and inter-cluster bandwidth is variable. The simplest design, routing all prefills to remote clusters, still produces congestion and instability.

What Does PrfaaS Do?

The PrfaaS PD architecture is built on three subsystems: compute, network, and storage. The compute subsystem separates clusters into two types: local PD clusters that handle end-to-end inference for short requests, and PrfaaS clusters with high-compute-throughput accelerators dedicated to long-context prefill. The network subsystem uses intra-cluster RDMA for local transfers and commodity Ethernet for moving KVCache across clusters. The storage subsystem builds a distributed hybrid cache pool that handles both linear-attention states (request-level, fixed-size, exact-match-only) and full-attention KVCache blocks (block-level, growing linearly with input length, supporting partial matching).
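
The two cluster roles and their transports can be summarized as a minimal configuration sketch. The field names and cluster names are illustrative; the paper does not prescribe any configuration format:

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    role: str          # "local_pd" (end-to-end, short requests) or "prfaas" (long-context prefill)
    accelerator: str   # accelerator type deployed in this cluster
    intra_transport: str = "rdma"      # transfers within a cluster
    inter_transport: str = "ethernet"  # KVCache shipped across clusters

# A topology matching the case study's pairing of decode- and prefill-heavy hardware.
clusters = [
    Cluster("dc-east-pd", role="local_pd", accelerator="H20"),
    Cluster("dc-west-prefill", role="prfaas", accelerator="H200"),
]
print([c.name for c in clusters if c.role == "prfaas"])
```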

The key routing mechanism is length-based threshold routing. Let l be a request's effective prefill length after subtracting the cached prefix, and let t be the routing threshold. When l > t, the request is sent to the PrfaaS cluster and its KVCache is streamed over Ethernet to a decode node; when l ≤ t, the request is handled end-to-end by the local PD cluster. In the case study, the optimal threshold is t = 19.4K tokens, which routes approximately 50% of all requests, the longer ones, to the PrfaaS cluster.
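
The routing rule is simple enough to sketch directly. Only the 19.4K-token threshold comes from the paper's case study; the function itself is a hypothetical reimplementation:

```python
THRESHOLD_TOKENS = 19_400  # optimal t reported in the case study (~19.4K tokens)

def route_request(prompt_tokens: int, cached_prefix_tokens: int) -> str:
    """Length-based threshold routing over the uncached portion of the prompt."""
    effective_prefill = prompt_tokens - cached_prefix_tokens  # l
    if effective_prefill > THRESHOLD_TOKENS:                  # l > t
        return "prfaas"    # remote prefill; KVCache streams back over Ethernet
    return "local_pd"      # l <= t: handled end-to-end in the local PD cluster

print(route_request(64_000, 2_000))   # long uncached prefill -> prfaas
print(route_request(30_000, 16_000))  # mostly cached -> local_pd
```

Note that routing is on the *effective* prefill length: a long prompt with a big cached prefix can still stay local, which is why the prefix cache and the router have to be designed together.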

Making an Ethernet transport reliable takes more than low KV throughput. The research team specifies three transport mechanisms: layer-wise pipelining, which overlaps KVCache generation with transmission to maximize bandwidth utilization; multi-connection TCP transport, which exploits all available bandwidth; and congestion monitoring integrated with the scheduler, which detects loss or retransmission early and prevents congestion from accumulating.
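
Layer-wise pipelining can be sketched as a producer-consumer pair: each layer's KVCache block is handed to a sender as soon as it is computed, rather than after the whole prefill finishes. This toy model shows only the overlap structure, not the paper's actual transport:

```python
import queue
import threading

def prefill(num_layers, out_q):
    # Produce each layer's KVCache block and hand it off immediately,
    # instead of waiting for the whole cache to be built.
    for layer in range(num_layers):
        out_q.put(f"kv-layer-{layer}")  # stand-in for the real KV tensor
    out_q.put(None)                     # sentinel: prefill finished

def sender(in_q, sent):
    # Drain blocks as they arrive; a real system would push each block
    # over TCP while later layers are still being computed.
    while (block := in_q.get()) is not None:
        sent.append(block)

q, sent = queue.Queue(maxsize=4), []
t = threading.Thread(target=sender, args=(q, sent))
t.start()
prefill(num_layers=8, out_q=q)
t.join()
print(f"{len(sent)} layer blocks transmitted while prefill ran")
```

The bounded queue is the essential part: it lets transmission proceed in parallel with compute while applying backpressure if the network falls behind.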

The research team also developed a scheduler that operates on two timescales. On short timescales, it adjusts routing as a link's bandwidth approaches its limit: when bandwidth is scarce, it evaluates each cluster's cache independently; when bandwidth is plentiful, it considers prefixes cached across all clusters and performs cross-cluster cache fetches when that reduces redundant computation. On longer timescales, the scheduler rebalances prefill and decode node counts within the local PD group as traffic patterns change, keeping the system close to its throughput-optimal operating point.
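
The bandwidth-dependent cache policy can be sketched as a small decision function: under congestion, only the local cluster's prefix cache counts; with headroom, a larger remote hit is worth fetching across clusters. The 0.8 utilization cutoff and the function name are assumptions for illustration, not values from the paper:

```python
def choose_prefix_source(local_hit_tokens, remote_hit_tokens,
                         link_utilization, congestion_threshold=0.8):
    """Pick which cluster's cached prefix to reuse, given link pressure."""
    if link_utilization >= congestion_threshold:
        return "local", local_hit_tokens       # bandwidth-scarce: stay local
    if remote_hit_tokens > local_hit_tokens:
        return "remote", remote_hit_tokens     # plentiful: cross-cluster fetch pays off
    return "local", local_hit_tokens

print(choose_prefix_source(4_000, 20_000, link_utilization=0.3))  # ('remote', 20000)
print(choose_prefix_source(4_000, 20_000, link_utilization=0.9))  # ('local', 4000)
```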

Numbers

A PrfaaS cluster of 32 H200 GPUs is paired with a local PD cluster of 64 H20 GPUs, connected through a VPC providing 100 Gbps of inter-cluster bandwidth. The aggregate PrfaaS egress load under the optimal configuration is approximately 13 Gbps, just 13% of available Ethernet capacity, and the paper notes that the PrfaaS cluster remains compute-bound with substantial bandwidth headroom to spare. The projection also holds at larger scale: even for a 10,000-GPU datacenter, the aggregate bandwidth needed for KVCache transfers totals just 1.8 Tbps, well within the capacity of modern inter-datacenter links.

Mean Time To First Token (TTFT) and P90 TTFT both drop by 50% compared with the homogeneous baseline. The naive heterogeneous configuration, all prefill on H200 and all decode on H20 with no routing or scheduling logic, achieves only 1.16× throughput over the homogeneous baseline, versus 1.54× for the full PrfaaS-PD system. The gap between 1.16× and 1.54× isolates the contribution of the scheduling layer and shows it accounts for the majority of the practical gain.
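
The scheduler's share of the gain follows from simple division of the two reported speedups:

```python
# Both speedups are relative to the same homogeneous baseline, so their
# ratio isolates what routing and scheduling add on top of the hardware split.
naive_speedup, full_speedup = 1.16, 1.54
scheduler_gain = full_speedup / naive_speedup
print(f"scheduling layer adds a further {scheduler_gain:.2f}x on top of "
      f"the naive heterogeneous setup")
```

Roughly a third more throughput comes from the scheduling layer alone, which supports the paper's claim that hardware heterogeneity by itself captures only a minority of the available gain.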

The research team positions PrfaaS not as a near-future concept but as a design that is viable today for hybrid-architecture models — and argues that as context windows grow, KVCache compression techniques mature, and phase-specialized hardware such as NVIDIA’s Rubin CPX for prefill and LPU-style chips for decode become more widely available, the case for cross-datacenter PD disaggregation will only strengthen.


© 2026 AI-Trends.Today