UCSD and Together AI Research Introduces Parcae: A Stable Architecture for Looped Language Models That Achieves the Quality of a Transformer Twice the Size

AI-trends.today · Tech · By Gavin Wallace · 16/04/2026 · 7 Mins Read

It is still the same recipe that has been used since Chinchilla: spend more FLOPs, add more parameters, train on more tokens. But as inference deployments consume an ever-growing share of compute and model deployments push toward the edge, researchers are increasingly asking a harder question: can you scale quality without scaling memory footprint?

Researchers from UC San Diego and Together AI have developed Parcae, a stable looped transformer architecture that outperforms prior looped models and beats fixed-depth Transformer baselines at every scale tested, all while using the same parameter count and the same training data budget.

https://arxiv.org/pdf/2604.12946

What Is a Looped Language Model?

A standard Transformer is a fixed stack of layers, and activations are pushed through that stack exactly once. A looped architecture instead routes activations through the same layers multiple times. The loop multiplies the effective computation without adding any parameters. Think of it as running the same set of transformer blocks again and again rather than building an ever-taller model.

Parcae specifically uses a middle-looped design, dividing the architecture into functional blocks: a prelude (P) that embeds the input into a latent state, a recurrent block (R) that iteratively updates a hidden state hₜ across loop iterations (the input embedding e is injected at every iteration to preserve its influence), and a coda (C) that finalizes hₜ and produces the model's output. This keeps the model small in memory, a valuable property for on-device deployment, while providing far more compute per forward pass.
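To make the prelude/recurrent/coda structure concrete, here is a minimal NumPy sketch of a middle-looped forward pass. The dimensions, random weight matrices, and the tanh layer standing in for a full transformer block are all illustrative assumptions, not Parcae's actual implementation; the point is only that the same recurrent weights are reused on every loop iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden width; all weights below are random placeholders

# Prelude (P): embeds the input into a latent state.
W_p = rng.normal(0, 0.1, (d, d))
# Recurrent block (R): shared weights, reused on every loop iteration.
W_r = rng.normal(0, 0.1, (d, d))
W_e = rng.normal(0, 0.1, (d, d))  # input-injection weights
# Coda (C): finalizes the hidden state into the output.
W_c = rng.normal(0, 0.1, (d, d))

def looped_forward(x, n_loops):
    e = W_p @ x                         # prelude: latent embedding of the input
    h = np.zeros(d)                     # initial hidden state
    for _ in range(n_loops):            # the same block R, applied n_loops times
        h = np.tanh(W_r @ h + W_e @ e)  # input e re-injected at each iteration
    return W_c @ h                      # coda produces the output

x = rng.normal(size=d)
# More loops means more compute per forward pass, with an unchanged parameter count.
y4, y8 = looped_forward(x, 4), looped_forward(x, 8)
```

Note that `looped_forward(x, 8)` touches exactly the same four weight matrices as `looped_forward(x, 4)`; only the compute per forward pass changes.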

Looped transformers have historically been difficult to train. They were prone to residual state explosion, where the hidden state vector grows uncontrollably across loop iterations, and to frequent loss spikes. Hyperparameters had to be tuned with great care just to reach convergence.

The Root Cause: An Unconstrained Residual System

Parcae's key insight recasts the looped model's forward pass as a time-varying nonlinear dynamical system over the residual stream:

hₜ₊₁ = Ā hₜ + B̄ e + R̄(hₜ, e),

Here, Ā controls the balance between prior and current residual states, B̄ injects the input signal, and R̄ is the nonlinear contribution of the transformer blocks (attention and MLPs). Dropping R̄ yields a discrete linear time-invariant (LTI) system, and classical control theory immediately gives the stability condition: the system is stable when the spectral norm ρ(Ā) < 1, marginally stable when ρ(Ā) = 1, and unstable when ρ(Ā) > 1.
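The three stability regimes are easy to see numerically. The sketch below iterates just the linear part of the recurrence, hₜ₊₁ = Ā hₜ + e, with Ā chosen as scaled identity matrices so that ρ(Ā) is 0.9, 1.0, and 1.1; the dimensions and inputs are arbitrary toy values.

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
e = rng.normal(size=d)  # fixed input embedding, injected at every step

def iterate(A_bar, steps=300):
    # Linear part of the residual recurrence: h_{t+1} = A_bar @ h_t + e
    h = np.zeros(d)
    norms = []
    for _ in range(steps):
        h = A_bar @ h + e
        norms.append(float(np.linalg.norm(h)))
    return norms

stable   = iterate(0.9 * np.eye(d))  # rho(A_bar) = 0.9 < 1: converges to a fixed point
marginal = iterate(np.eye(d))        # rho(A_bar) = 1:   norm grows without bound (linearly)
unstable = iterate(1.1 * np.eye(d))  # rho(A_bar) = 1.1 > 1: exponential blow-up
```

The stable run settles at a fixed point, the marginal run (equivalent to addition-based injection, Ā = I) keeps drifting upward, and the unstable run explodes, mirroring the residual state explosion described above.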

Examining prior methods under this framework reveals the problem precisely. Addition-based input injection sets Ā = I (the identity matrix), meaning ρ(Ā) = 1 (marginally stable). The concatenation-with-projection approach used by RDMs leaves Ā entirely unconstrained, making ρ(Ā) potentially far greater than 1 (unstable). Empirical training curves confirm this directly: divergent training runs learn ρ(Ā) ≥ 1, while the few convergent runs maintain ρ(Ā) < 1.

How Parcae Enforces Stability by Design

Rather than parameterizing Ā directly, Parcae works in continuous form and discretizes using zero-order hold (ZOH) and Euler schemes, borrowing a standard technique from state space models like Mamba and S4, with a learned step size Δ ∈ ℝ^dh, giving Ā = exp(ΔA) and B̄ = ΔB. To guarantee ρ(Ā) < 1, A is parameterized as a negative diagonal matrix: A := Diag(−exp(logA)), where logA ∈ ℝ^dh is a learnable vector. Because the diagonal entries are always negative before exponentiation, the spectral norm constraint is satisfied at all times by construction.
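A small numerical check illustrates why this parameterization is stable by construction: whatever real values the learnable vector logA takes, and for any positive step size Δ, every diagonal entry of Ā = exp(ΔA) lands strictly inside (0, 1). The specific random values and the softplus used to keep Δ positive are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
dh = 32

# Learnable quantities (random stand-ins here): log_A may be ANY real vector,
# and the step size delta must be positive (softplus of a raw parameter).
log_A = rng.normal(0, 3.0, dh)
delta = np.log1p(np.exp(rng.normal(0, 1.0, dh)))  # softplus -> delta > 0

# A := Diag(-exp(log_A)) is negative by construction ...
A_diag = -np.exp(log_A)
# ... so the ZOH discretization A_bar = exp(delta * A) has entries in (0, 1).
A_bar_diag = np.exp(delta * A_diag)

# For a diagonal matrix, the spectral norm is the largest absolute diagonal entry.
rho = float(np.max(np.abs(A_bar_diag)))
```

No matter how the optimizer moves `log_A` during training, `rho` stays below 1, which is exactly the stability guarantee the control-theoretic analysis calls for.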

Results: Outperforming Models Twice the Size

Against parameter- and data-matched RDMs trained on the Huginn dataset, Parcae reduces validation perplexity by up to 6.3%, a figure that peaks at 350M scale (improving from 10.76 to 10.09 PPL) versus a 4.5% gain at 100M scale (14.23 to 13.59 PPL). WikiText perplexity improves by up to 9.1% at 350M scale. Average downstream zero-shot benchmark accuracy improves by up to 1.8 points.

Against standard fixed-depth Transformer baselines trained with a nanochat-inspired setup on FineWeb-Edu, Parcae outperforms at every scale. At 1.3B parameters trained on 104B tokens, Parcae beats the parameter-matched Transformer by 2.99 points on Core and 1.18 points on Core-Extended. The 770M Parcae model (25.07 Core) reaches quality comparable to the 1.3B Transformer (25.45 Core): roughly half the parameters for equivalent capability. The research team quantifies Parcae's parameter efficiency as achieving up to 87.5% of the quality of a Transformer twice its size, measured against the quality gap to the next larger model.

The First Scaling Laws for Looping

The second major contribution of this research is establishing the first predictable scaling laws for layer looping. Using isoFLOP experiments at 140M and 370M scales, the research team shows that compute-optimal training increases mean recurrence µrec and training tokens D in tandem, following power laws with consistent exponents across both scales: optimal µrec scales as C^0.40 and optimal tokens scale as C^0.78, where C is the training FLOP budget.
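A quick way to read these power laws: only the exponents (0.40 and 0.78) come from the paper, so the prefactors in this sketch are arbitrary, but the ratios they imply are not. Growing the FLOP budget 10x multiplies the compute-optimal recurrence by 10^0.40 ≈ 2.5 and the optimal token count by 10^0.78 ≈ 6.0.

```python
# mu_rec* ∝ C^0.40 and D* ∝ C^0.78. The exponents are from the paper;
# the prefactor k is an illustrative placeholder, not a fitted value.
def optimal_recurrence(C, k=1.0):
    return k * C ** 0.40

def optimal_tokens(C, k=1.0):
    return k * C ** 0.78

# A 10x FLOP budget multiplies the optimal settings by fixed ratios,
# independent of the prefactor:
ratio_mu = optimal_recurrence(10.0) / optimal_recurrence(1.0)  # 10^0.40 ≈ 2.51
ratio_D  = optimal_tokens(10.0) / optimal_tokens(1.0)          # 10^0.78 ≈ 6.03
```

Because both quantities grow with C, compute-optimal training scales loop depth and data together rather than pouring the whole budget into either one.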

When looped Parcae models trained at their optimal µrec are compared against fixed-depth Parcae models (µrec = 1) under identical FLOP and parameter budgets, looping achieves a strictly lower validation loss — translating into 1.2 to 2.0 points higher Core scores depending on the FLOP budget. Looping is a genuinely orthogonal axis for scaling compute, not a free lunch from weight sharing.

At test time, increasing the loop count T beyond training depth follows a saturating exponential decay: L(T) = L∞ + Z·e^(−z·T), where L∞ is an irreducible floor determined by training depth. Gains plateau near µrec, the mean recurrence used during training, meaning training depth sets a hard ceiling on test-time scaling. These dynamics unify into a single parametric law that predicts held-out model loss within 0.85–1.31% average error.
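The saturating-decay form makes the ceiling easy to see numerically. In this sketch the constants L∞, Z, and z are made-up illustrative values (the paper fits them per model); what matters is the shape: early extra loops buy a real loss reduction, while later ones buy almost nothing, and the loss never drops below the floor L∞.

```python
import math

def loss_at_test_loops(T, L_inf=2.0, Z=0.5, z=0.3):
    """L(T) = L_inf + Z * exp(-z * T); constants here are illustrative, not fitted."""
    return L_inf + Z * math.exp(-z * T)

# Doubling the loop count early on yields a meaningful gain ...
gain_early = loss_at_test_loops(4) - loss_at_test_loops(8)
# ... but the same doubling deep into the plateau yields essentially nothing.
gain_late = loss_at_test_loops(32) - loss_at_test_loops(64)
```

This is the "hard ceiling" in code form: past the plateau, additional test-time loops cannot push the loss below the floor set by training depth.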

Key Takeaways

  • Looped transformers can now be trained reliably at scale: Parcae is a looped architecture designed to solve the residual state explosion and loss spike problems that have plagued prior looped models, achieving stable training across a wide range of learning rates where previous approaches diverged.
  • A 770M Parcae model matches the quality of a 1.3B standard Transformer: By reusing the same layers across multiple loop iterations instead of adding more parameters, Parcae delivers equivalent downstream capability at roughly half the memory footprint.
  • Looping is a third orthogonal axis for scaling compute, alongside parameters and data: Under a fixed FLOP and parameter budget, compute-optimal training requires increasing mean recurrence and training tokens in tandem following predictable power laws — giving AI professionals a new lever to improve quality without buying more hardware.
  • Test-time looping has a hard ceiling set by training depth: Parcae can use additional loop iterations at inference to scale compute, but gains plateau near the mean recurrence used during training. You cannot infinitely loop your way to better performance without training the model at deeper recurrences first.

Check out the Paper, Model Weights and Technical Details.
