Maia 200
Microsoft has developed a new AI accelerator for Azure datacenters. The accelerator targets the cost of token generation in large language models and other reasoning workloads by combining narrow-precision compute, a dense on-chip memory hierarchy, and an Ethernet-based scaling fabric.
Why Microsoft created a dedicated inference processor
Training and inference place different demands on hardware and software. Training requires all-to-all communication, long-running jobs, and very large clusters. Inference, by contrast, is measured in tokens per second, latency, and tokens per dollar. Microsoft positions Maia 200 as its most cost-effective inference system, delivering roughly 30 percent better performance per dollar than the latest hardware in its fleet.
Maia is part of Azure's heterogeneous hardware stack. Maia 200 will serve multiple models, including OpenAI's latest GPT 5.2 model, and will power Microsoft Foundry as well as Microsoft 365 Copilot workloads. The Microsoft Superintelligence team will also use the chip to improve in-house models with synthetic data and reinforcement learning.
Numeric and core silicon specifications
The Maia 200 die is manufactured on TSMC's 3-nanometer process and integrates approximately 140 billion transistors.
The compute pipeline is built on native FP8 and FP4 tensor cores. Within a 750 W TDP, a single chip delivers more than 10 PetaFLOPS of FP4 and 5 PetaFLOPS of FP8 compute.
Memory is split between stacked HBM and on-die SRAM. Maia 200 carries 216 GB of HBM3e at 7 TB/s of bandwidth, plus 272 MB of on-die SRAM. The SRAM is divided into tile-level and cluster-level pools and is software managed: compilers and runtimes place working sets explicitly to keep attention and GEMM kernels close to the compute.
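As a rough sanity check on these figures, the sketch below works out the roofline crossover intensity and a bandwidth-bound decode ceiling from the published compute and memory numbers. The 200B-parameter model size is my own illustrative assumption, not a Maia 200 benchmark.

```python
# Back-of-the-envelope roofline check using the published Maia 200 figures.
# The crossover arithmetic intensity is how many FLOPs a kernel must do per
# byte of HBM traffic before it becomes compute-bound rather than
# bandwidth-bound. Model size below is an illustrative assumption.

FP4_FLOPS = 10e15        # > 10 PFLOPS FP4
FP8_FLOPS = 5e15         # 5 PFLOPS FP8
HBM_BW    = 7e12         # 7 TB/s HBM3e bandwidth
HBM_CAP   = 216e9        # 216 GB capacity
TDP_W     = 750          # chip TDP

# FLOPs per HBM byte needed to saturate the tensor cores instead of the HBM.
print(f"FP4 crossover intensity: {FP4_FLOPS / HBM_BW:,.0f} FLOPs/byte")
print(f"FP8 crossover intensity: {FP8_FLOPS / HBM_BW:,.0f} FLOPs/byte")
print(f"FP4 efficiency:          {FP4_FLOPS / TDP_W / 1e12:.1f} TFLOPS/W")

# Bandwidth-bound decode estimate for a hypothetical 200B-parameter model in
# FP4 (~0.5 byte per parameter): each generated token streams the weights once.
params = 200e9
bytes_per_token = params * 0.5
print(f"weight footprint:        {bytes_per_token / 1e9:.0f} GB of {HBM_CAP / 1e9:.0f} GB HBM")
print(f"decode ceiling:          {HBM_BW / bytes_per_token:,.0f} tokens/s per chip (single stream)")
```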
Tile-based microarchitecture and memory hierarchy
The Maia 200 microarchitecture is hierarchical, and the tile is its base unit. Tiles are the smallest independent compute and memory units on the chip. Each tile includes a Tile Tensor Unit for high-throughput matrix operations and a SIMD Tile Vector Processor as a programmable engine. Both units are fed from tile SRAM, while a tile DMA engine moves data in and out of that SRAM without stalling computation. Tile Control Processors orchestrate the sequencing of DMA and compute work within each tile.
Multiple tiles are grouped into a cluster. Cluster SRAM is shared by all tiles in a cluster, and cluster DMA engines move data between cluster SRAM and the co-packaged HBM stacks. Cluster cores coordinate multi-tile execution and use redundancy schemes across tiles and SRAM to improve yield while preserving the programming model.
The software stack uses this hierarchy to pin different parts of a model to different memory levels. Attention kernels, for example, can place Q, K, and V tensors in tile SRAM, while collective-communication kernels can stage payloads in cluster SRAM to reduce pressure on HBM. The design goal is to keep utilization high as models grow in size and sequence length.
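The placement idea can be sketched as a simple plan that maps named tensors to memory levels. The MemoryLevel and Placement names below are hypothetical illustrations of the concept, not a published Maia SDK API.

```python
# Purely illustrative sketch of software-managed placement across the memory
# hierarchy described above; names are hypothetical, not a real Maia API.
from dataclasses import dataclass
from enum import Enum

class MemoryLevel(Enum):
    TILE_SRAM = "tile_sram"        # closest to the tensor/vector units
    CLUSTER_SRAM = "cluster_sram"  # shared across a cluster's tiles
    HBM = "hbm"                    # 216 GB capacity tier

@dataclass
class Placement:
    tensor: str
    level: MemoryLevel
    reason: str

# Attention working set stays in tile SRAM; collective payloads stage in
# cluster SRAM so they do not add pressure on HBM bandwidth.
plan = [
    Placement("q_proj_activations", MemoryLevel.TILE_SRAM, "attention inner loop"),
    Placement("k_cache_block",      MemoryLevel.TILE_SRAM, "attention inner loop"),
    Placement("v_cache_block",      MemoryLevel.TILE_SRAM, "attention inner loop"),
    Placement("all_reduce_payload", MemoryLevel.CLUSTER_SRAM, "collective staging"),
    Placement("expert_weights",     MemoryLevel.HBM, "streamed per layer"),
]

for p in plan:
    print(f"{p.tensor:22s} -> {p.level.value:12s} ({p.reason})")
```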
On-chip data movement and the Ethernet fabric
Data movement, more often than not, is the main limiter for inference. Maia 200 pairs its DMA engines with a multi-layer Network on Chip that spans the tiles, clusters, memory controllers, and I/O. Separate planes carry large tensor traffic and small control messages, so short messages and outputs are not blocked behind bulk transfers.
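A small back-of-the-envelope example of why the separate planes matter: with illustrative (non-Maia) numbers, a 1 KiB control message queued behind a 64 MiB tensor transfer on a shared plane waits for the whole bulk transfer to serialize, while on a dedicated plane it only pays its own serialization delay.

```python
# Minimal sketch of head-of-line blocking avoidance; the link rate and
# transfer sizes are my own illustrative numbers, not Maia 200 specifics.

LINK_BYTES_PER_S = 1.0e12        # assumed 1 TB/s per NoC plane (illustrative)

def completion_time_shared(bulk_bytes: int, msg_bytes: int) -> float:
    """Small message queued behind a bulk transfer on one shared plane."""
    return (bulk_bytes + msg_bytes) / LINK_BYTES_PER_S

def completion_time_split(msg_bytes: int) -> float:
    """Small message on its own control plane: only its own serialization delay."""
    return msg_bytes / LINK_BYTES_PER_S

bulk = 64 * 1024 * 1024          # 64 MiB activation tensor (illustrative)
msg = 1024                       # 1 KiB control/completion message

print(f"shared plane  : {completion_time_shared(bulk, msg) * 1e6:8.2f} us")
print(f"separate plane: {completion_time_split(msg) * 1e9:8.2f} ns")
```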
Beyond the chip boundary, Maia 200 runs the AI Transport Layer protocol over an Ethernet-based scale-up network. On-die NICs expose 1.4 TB/s of bandwidth in each direction, or 2.8 TB/s bidirectionally, and the fabric scales to 6,144 accelerators in a two-tier domain.
In each tray, four Maia accelerators form a fully connected quad: the four devices are linked directly, with no switch in between. The majority of tensor-parallel traffic stays within this group, while lighter collective traffic goes out over the switched fabric. This improves the latency of inference collectives and reduces the switch port count.
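The sketch below just runs the arithmetic implied by these figures; the 90 percent intra-quad traffic share is an assumption for illustration, not a published number.

```python
# Quick arithmetic on the scale-up fabric figures quoted above; the traffic
# split is an assumption for illustration, not a published number.

ACCELERATORS = 6144            # two-tier scale-up domain
QUAD_SIZE = 4                  # fully connected quad per tray, no switch hops
NIC_TB_PER_DIR = 1.4           # per-direction Ethernet bandwidth per accelerator

quads = ACCELERATORS // QUAD_SIZE
print(f"quads in the domain:          {quads}")
print(f"direct links per quad:        {QUAD_SIZE * (QUAD_SIZE - 1) // 2}")

# If, say, 90% of tensor-parallel bytes stay inside the quad (assumption),
# only the remaining 10% has to cross the switched Ethernet tiers.
intra_quad_fraction = 0.90
switched_tb_per_dir = NIC_TB_PER_DIR * (1 - intra_quad_fraction)
print(f"per-accelerator switched load: {switched_tb_per_dir:.2f} TB/s per direction")
```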
Azure system integration and cooling
At the system level, Maia 200 follows the same rack, power, and mechanical standards as Azure's GPU servers. It supports both air-cooled and liquid-cooled configurations and uses second-generation closed-loop liquid-cooling Heat Exchanger Units for high-density racks. GPUs and Maia accelerators can be deployed together in the same datacenter footprint.
The accelerator is integrated with Azure's control plane: firmware management, health monitoring, and telemetry use the same workflows as other Azure compute services. This enables fleet-wide rollouts and maintenance while AI workloads keep running without interruption.
Key takeaways
- Inference-first design: Maia 200 is Microsoft's first silicon and systems platform designed exclusively for AI inference, optimized for large-scale token generation in large language models and reasoning models.
- Memory hierarchy and numerics: The chip is fabricated on TSMC's 3 nm process and integrates approximately 140 billion transistors. It delivers more than 10 PFLOPS of FP4 (and more than 5 PFLOPS of FP8), with 216 GB of HBM3e at 7 TB/s and 272 MB of on-chip SRAM split between software-managed tile and cluster SRAMs.
- Comparison with other cloud accelerators: Microsoft says its latest Azure inference system delivers 30 percent better performance per dollar, claims a 3x performance advantage over Amazon's third-generation Trainium, and claims better FP8 performance than Google's TPU v7 at the accelerator level.
- Tile-based architecture and Ethernet fabric: Maia 200 divides the compute domain into tiles and clusters with local SRAM, DMA, and a Network on Chip, and exposes on-die NICs with about 1.4 TB/s of Ethernet bandwidth per direction. The fabric scales to 6,144 accelerators, using fully connected quad groups as the local tensor-parallel domain.

