Poolside AI has launched its first two Laguna models, Laguna M.1 and Laguna XS.2. The company is also releasing Pool, a lightweight terminal-based coding agent and dual Agent Client Protocol (ACP) client and server; it is the same environment Poolside uses internally for agent RL training and evaluation, now available as a research preview.
What Are These Models and Why Should You Care?
Both Laguna M.1 and Laguna XS.2 are Mixture-of-Experts (MoE) models. Instead of activating all parameters for every token, MoE models route each token through only a subset of specialized sub-networks called 'experts.' This allows a much larger total parameter count, and the capability headroom that comes with it, while paying the compute cost of only a small 'activated' parameter count during inference.
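To make the routing idea concrete, here is a minimal top-k MoE layer in PyTorch. This is an illustrative sketch only; the dimensions, expert count, and k below are placeholders, not Laguna's actual configuration.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer (illustrative sizes only)."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)      # scores every expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                # x: (tokens, d_model)
        scores = self.router(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)       # keep only k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                       # only the selected experts run
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(10, 64)
print(TinyMoE()(x).shape)  # torch.Size([10, 64])
```

All experts' parameters exist in memory, but each token pays the FLOPs of only k expert MLPs; that is the total-vs-activated parameter distinction the article describes.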
Laguna M.1 is a MoE with 225B total parameters and 23B activated parameters. It was trained from scratch on 30T tokens across 6,144 interconnected NVIDIA Hopper GPUs, with pre-training completed at the end of last year, and it serves as the base for all Laguna models. Its benchmark results are impressive: 72.5% on SWE-bench Verified, 67.3% on SWE-bench Multilingual, 46.9% on SWE-bench Pro, and 40% on Terminal-Bench 2.0.
Laguna XS.2 is Poolside's second-generation MoE and its first open-weights model, building on everything learned from training M.1. At 33B total parameters with 3B activated per token, it is designed for agentic coding and long-horizon work on a local machine; it is compact enough to run on a Mac with 36 GB of RAM via Ollama. It scores 68.2% on SWE-bench Verified, 62.4% on SWE-bench Multilingual, 44.5% on SWE-bench Pro, and 30% on Terminal-Bench 2.0. Poolside is also releasing Laguna XS.2-base soon for practitioners who want to post-train it themselves.
Architecture: The Efficiency Decisions Behind XS.2
XS.2 uses sigmoid gating with per-layer rotary scales, and a mixed Sliding Window Attention (SWA) and global attention layout in a 3:1 ratio across 40 total layers: 30 SWA layers and 10 global attention layers. Sliding window attention limits each token to a small local window (512 tokens) rather than the entire sequence, which dramatically reduces KV-cache memory. The 1-in-4 global attention layers preserve long-range dependencies without paying the full attention cost everywhere. XS.2 also quantizes the KV cache to FP8, further reducing memory usage per token.
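Back-of-the-envelope arithmetic shows why the 3:1 layout plus FP8 matters. In the sketch below, the KV head count and head dimension are assumed placeholder values (the article does not quote them); only the layer split, window size, and context length come from the text.

```python
# Rough KV-cache sizing for a 30-SWA / 10-global layout (illustrative sketch).
# kv_heads and head_dim are ASSUMED placeholders, not Laguna XS.2's real specs.
kv_heads, head_dim = 8, 128
bytes_fp8, bytes_fp16 = 1, 2

seq_len = 131_072          # XS.2's context window (from the article)
window = 512               # SWA window (from the article)
swa_layers, global_layers = 30, 10

def kv_bytes(layers, tokens_cached, bytes_per_el):
    # Factor of 2 accounts for storing both keys and values.
    return layers * tokens_cached * 2 * kv_heads * head_dim * bytes_per_el

full_fp16 = kv_bytes(swa_layers + global_layers, seq_len, bytes_fp16)
mixed_fp8 = (kv_bytes(swa_layers, window, bytes_fp8)
             + kv_bytes(global_layers, seq_len, bytes_fp8))

print(f"all-global FP16 cache: {full_fp16 / 2**30:.1f} GiB")   # ~20.0 GiB
print(f"3:1 SWA + FP8 cache:   {mixed_fp8 / 2**30:.2f} GiB")   # ~2.53 GiB
```

Under these assumed dimensions, the SWA layers' cache cost becomes nearly constant regardless of context length, so almost all remaining KV memory comes from the 10 global layers.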
Under the hood, XS.2 uses 256 experts plus a shared expert, supports a context window of 131,072 tokens, and features native reasoning support: interleaved thinking between tool calls, with per-request control over enabling or disabling thinking.
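If XS.2 is available through Ollama as the article suggests, the per-request thinking toggle would map naturally onto Ollama's `think` chat option (supported in recent Ollama releases for thinking-capable models). The model tag below is hypothetical; check Ollama's library for the real one.

```python
# Hypothetical local usage via the ollama Python client (pip install ollama).
# The tag "laguna-xs.2" is an ASSUMPTION, not a confirmed Ollama model name.
from ollama import chat

resp = chat(
    model="laguna-xs.2",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    think=True,  # enable interleaved reasoning for this request; False disables it
)
# Recent ollama client versions expose the reasoning trace separately:
print(resp.message.thinking)  # the model's reasoning (when thinking is enabled)
print(resp.message.content)   # the final answer
```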


Training: Three Areas Where Poolside Pushed Hard
Poolside trains its models on its own stack: a custom training codebase, a data pipeline (Titan), and agent RL infrastructure. For Laguna, the team invested in three areas.
AutoMixer: automatically optimizing the data mix. Data curation and the mix of training data have a huge impact on final model performance. Rather than relying on manual heuristics, Poolside developed an AutoMixer framework that trains a swarm of approximately 60 proxy models, each on a different data mix, and measures performance across key capability groups: code, math, STEM, and common sense. Surrogate regressors are then fit to estimate how changes in dataset proportions affect downstream evaluations, giving a map from data mix to performance that can be optimized. The approach was inspired by prior work, including OLMix, MDE, and RegMix, adapted to Poolside's setting with more data categories.
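Here is a minimal sketch of the surrogate-regression idea, not Poolside's AutoMixer code: fit a regressor from mix proportions to a downstream eval score over the proxy runs, then search the probability simplex for a predicted-better mix. The regressor choice and the synthetic "scores" are assumptions for illustration.

```python
# Toy surrogate-regression loop in the spirit of AutoMixer / RegMix (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_proxies, n_domains = 60, 4          # e.g. code, math, STEM, common sense

# Each proxy model is trained on a random mix (rows sum to 1).
mixes = rng.dirichlet(np.ones(n_domains), size=n_proxies)
# Stand-in for measured eval scores of each proxy (REPLACE with real measurements).
scores = mixes @ np.array([0.5, 0.2, 0.2, 0.1]) + rng.normal(0, 0.01, n_proxies)

# Fit the surrogate: data-mix proportions -> downstream performance.
surrogate = GradientBoostingRegressor().fit(mixes, scores)

# Search candidate mixes on the simplex and pick the predicted best.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
best = candidates[surrogate.predict(candidates).argmax()]
print("predicted-best mix (code, math, stem, commonsense):", best.round(3))
```

The payoff is that the expensive experiment (training a full model) is replaced by a cheap query against the surrogate, so many more candidate mixes can be evaluated than could ever be trained.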
On the data side, both models were trained on more than 30T tokens. Poolside's diversity-preserving curation approach, which retains portions of mid- and lower-quality buckets alongside top-quality data to avoid STEM bias, yields approximately 2× more unique tokens than precision-focused pipelines, and the gains persist over longer training runs. A separate deduplication analysis confirmed this, showing that global deduplication removes disproportionately more high-quality data. The team also built a synthetic data pipeline: synthetic data accounts for about 13% of the final training mix, and the Laguna series uses approximately 4.4T+ synthetic tokens in total.
Muon optimizer. Rather than AdamW, the most common optimizer in large-model training, Poolside used a distributed implementation of the Muon optimizer for the entire training run of both models. In initial pre-training ablations, the team matched the training loss of an AdamW baseline in roughly 15% fewer steps, saw large gains in absolute evaluations on the final model, and found that learning rates transferred across scales. Muon has an additional advantage: it keeps only one optimizer state per parameter rather than AdamW's two, reducing memory requirements during both training and checkpointing. During Laguna M.1 pre-training, Muon's overhead stayed below 1%.
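For intuition, here is a minimal single-device sketch of the textbook Muon update (momentum followed by Newton-Schulz orthogonalization of the update matrix), not Poolside's distributed implementation. The quintic coefficients follow the public Muon reference implementation; the learning rate and momentum values are illustrative.

```python
import torch

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize a 2-D matrix G via the quintic
    Newton-Schulz iteration (coefficients from the Muon reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)        # Frobenius norm bounds the spectral norm
    if G.size(0) > G.size(1):
        X = X.T                      # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon update for a 2-D weight matrix."""
    momentum_buf.mul_(beta).add_(grad)       # the ONE optimizer state Muon keeps
    update = newton_schulz5(beta * momentum_buf + grad)  # Nesterov-style lookahead
    param.add_(update, alpha=-lr)

W = torch.randn(256, 512)
buf = torch.zeros_like(W)                    # single state vs AdamW's two (m and v)
muon_step(W, torch.randn_like(W), buf)
```

The single momentum buffer is where the memory saving in the article comes from: AdamW stores both first- and second-moment estimates per parameter, Muon stores only the first.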
Poolside also runs periodic hash checks on model weights across training replicas to catch silent data corruption (SDC) from defective GPUs, specifically errors in arithmetic logic units and pipeline registers, which, unlike DRAM and SRAM, are not covered by ECC protection.
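A hash sweep like the one described can be sketched in a few lines. The gather-and-compare pattern below assumes an initialized torch.distributed data-parallel job and is illustrative, not Poolside's tooling.

```python
import hashlib
import torch
import torch.distributed as dist

def weights_digest(model: torch.nn.Module) -> str:
    """Deterministic SHA-256 over all parameters, in sorted state_dict order."""
    h = hashlib.sha256()
    for name, p in sorted(model.state_dict().items()):
        h.update(name.encode())
        # Reinterpret raw bytes so bf16/fp16 tensors hash without conversion.
        raw = p.detach().cpu().contiguous().reshape(-1).view(torch.uint8)
        h.update(raw.numpy().tobytes())
    return h.hexdigest()

def check_replicas(model: torch.nn.Module):
    """Raise if any data-parallel replica's weights diverge (possible SDC).
    Requires dist.init_process_group(...) to have been called."""
    digest = weights_digest(model)
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, digest)
    if len(set(gathered)) != 1:
        raise RuntimeError(f"silent data corruption suspected: {gathered}")
```

Since data-parallel replicas should hold bit-identical weights after every synchronized step, any digest mismatch points at a faulty device rather than at the training logic.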
Async on-policy agent RL. This is the most demanding piece of Laguna's training stack. Poolside built an asynchronous online RL platform in which actors pull tasks from a dataset, spin up sandboxed containers, and run production agents against those tasks using the freshest deployed model. The resulting trajectories are scored, filtered, and written to Iceberg tables, while the trainer continuously consumes those records and produces the next checkpoint; inference and training run asynchronously in parallel, with throughput tuned to balance off-policy staleness.
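The producer/consumer shape of that loop can be sketched with asyncio queues. Sandboxing, reward scoring, Iceberg writes, and the real agents are all stubbed out here; the staleness gate and the version counter are the point of the sketch.

```python
# Minimal async actor/trainer loop (illustrative; real systems use separate processes).
import asyncio, random

async def actor(tasks, trajectories, policy):
    """Pull a task, roll out the current policy in a (stubbed) sandbox, emit a trajectory."""
    while True:
        task = await tasks.get()
        await asyncio.sleep(random.random())   # stand-in for the sandboxed agent rollout
        traj = {"task": task, "policy_version": policy["version"],
                "reward": random.random()}     # stand-in for scoring + filtering
        await trajectories.put(traj)           # in production: written to Iceberg tables

async def trainer(trajectories, policy, batch_size=4, max_staleness=2):
    """Consume trajectories as they arrive; drop ones that are too off-policy."""
    batch = []
    while True:
        traj = await trajectories.get()
        if policy["version"] - traj["policy_version"] <= max_staleness:
            batch.append(traj)                 # staleness gate balances throughput
        if len(batch) >= batch_size:
            policy["version"] += 1             # stand-in for a gradient step + checkpoint
            print(f"trained -> policy v{policy['version']}")
            batch.clear()

async def main():
    tasks, trajs, policy = asyncio.Queue(), asyncio.Queue(), {"version": 0}
    for i in range(32):
        tasks.put_nowait(f"task-{i}")
    workers = [asyncio.create_task(actor(tasks, trajs, policy)) for _ in range(8)]
    workers.append(asyncio.create_task(trainer(trajs, policy)))
    await asyncio.sleep(3)                     # run briefly, then stop
    for w in workers:
        w.cancel()

asyncio.run(main())
```

Because actors never block on the trainer and vice versa, rollout throughput and training throughput can be tuned independently, which is exactly the staleness-versus-utilization trade-off the article mentions.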
What You Need to Know
- Poolside introduces its first open-weights model. Laguna XS.2 is a 33B total-parameter MoE with only 3B activated parameters per token, available under an Apache 2.0 license and compact enough to run locally on a Mac with 36 GB of RAM via Ollama.
- Strong benchmarks at small scale: Laguna XS.2 scores 68.2% on SWE-bench Verified and 44.5% on SWE-bench Pro, while the larger Laguna M.1 (225B total, 23B activated) reaches 72.5% on SWE-bench Verified and 46.9% on SWE-bench Pro; both were trained from scratch on 30T tokens.
- Muon beats AdamW on training efficiency: Poolside replaced AdamW with a distributed implementation of the Muon optimizer, matching the same training loss in roughly 15% fewer steps, with lower memory requirements as well, since Muon keeps only one optimizer state per parameter instead of two.
- AutoMixer replaces hand-tuned data recipes with learned optimization: Poolside trains a swarm of ~60 proxy models on different data mixes and fits surrogate regressors to optimize dataset proportions, with synthetic data making up ~13% of Laguna XS.2's final training mix out of 4.4T+ total synthetic tokens.
- Fully asynchronous agent RL with GPUDirect RDMA weight transfer: Poolside's RL stack runs inference and training in parallel, transfers hundreds of gigabytes of BF16 weights in under 5 seconds using GPUDirect RDMA, and uses the CISPO algorithm for off-policy training stability.
Check out the model weights and the technical details.

