AI-trends.today

OpenMythos – A PyTorch Open Source Reconstruction of Claude Mythos, where 770M Parameters match a 1.3B Transformator

Tech · By Gavin Wallace · 19/04/2026 · 6 Mins Read

Anthropic never published a paper on Claude Mythos, yet the research community has continued to theorize about it. A new project called OpenMythos, published on GitHub by Kye Gomez, is an ambitious attempt to build a first-principles, PyTorch-based reconstruction of the Claude Mythos architecture, grounded entirely in peer-reviewed research.

It is not an unreleased model, refinement, or distillation. It is a hypothesis rendered in code — and the hypothesis is specific enough to be falsifiable, which is what makes it interesting.

The Claim: Claude Mythos is a Recurrent Depth Transformer

OpenMythos claims that Claude Mythos belongs to a category of architectures known as Recurrent-Depth Transformers (RDTs), also called Looped Transformers in the literature. The idea is fundamentally different from standard transformer stacks.

In a conventional transformer (GPT, LLaMA, Mistral), input passes through a sequence of distinct layers, each with its own independent weights; more capability usually means more layers and therefore more parameters. A Recurrent-Depth Transformer instead applies a single set of shared weights iteratively, across T loops, within one forward pass. Reasoning depth depends not on how many parameters you store but on how many iterations run at inference time.

Think of it as refinement rather than reading: the model returns to the same block repeatedly, improving its internal representation on each pass.

The Architecture

OpenMythos implements this structure in three parts: Prelude → Recurrent Block → Coda. The Prelude and Coda are standard transformer layers that run exactly once. The Recurrent Block is the computational core, looped T = 16 times by default.

The hidden state is updated at every loop step by the following rule:

h_{t+1} = A·h_t + B·e + Transformer(h_t, e)

Here, h_t is the hidden state after loop iteration t, and e is the encoded input from the Prelude, re-injected at every step. This re-injection is intentional: without it, the hidden state would drift away from the original input in deep loops. Learned matrices A and B determine how much of the hidden state and input are carried forward at each iteration.
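A minimal PyTorch sketch of this update rule (module and dimension names are illustrative assumptions, not taken from the OpenMythos codebase; a simple MLP stands in for the shared transformer block):

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """Illustrative sketch of h_{t+1} = A·h_t + B·e + f(h_t, e)."""
    def __init__(self, d_model: int, T: int = 16):
        super().__init__()
        self.T = T
        self.A = nn.Linear(d_model, d_model, bias=False)  # learned carry matrix
        self.B = nn.Linear(d_model, d_model, bias=False)  # learned input-injection matrix
        # Stand-in for the shared transformer block (self-attention + FFN in the
        # real design; a small MLP here keeps the sketch self-contained).
        self.f = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model))

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        h = torch.zeros_like(e)      # initial hidden state
        for _ in range(self.T):      # same weights applied T times
            h = self.A(h) + self.B(e) + self.f(torch.cat([h, e], dim=-1))
        return h
```

Note that T is a plain loop bound, not a weight: raising it at inference time deepens the computation without changing the parameter count.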

The Recurrent Block does not contain a standard feedforward layer. OpenMythos replaces it with a Mixture-of-Experts (MoE) layer modeled on DeepSeekMoE: a large pool of finely-grained experts, of which only the top-K are activated per token, plus a smaller set of always-active shared experts that absorb patterns common across domains. Because the router can select distinct expert subsets at every loop depth, each iteration performs different computation despite using identical weights. MoE provides domain breadth; looping provides reasoning depth.
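A toy version of the routing idea, assuming a DeepSeekMoE-style layout (expert counts, sizes, and the dense gather loop are illustrative; production implementations use fused sparse kernels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _ffn(d_model: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class TopKMoE(nn.Module):
    """Minimal top-K MoE FFN with always-active shared experts,
    loosely in the style of DeepSeekMoE."""
    def __init__(self, d_model: int, n_experts: int = 8, n_shared: int = 1, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(_ffn(d_model) for _ in range(n_experts))
        self.shared = nn.ModuleList(_ffn(d_model) for _ in range(n_shared))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) — flatten batch/sequence dims before calling.
        gate = F.softmax(self.router(x), dim=-1)       # routing probabilities
        topv, topi = gate.topk(self.k, dim=-1)         # top-K experts per token
        topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalize gate weights
        out = sum(s(x) for s in self.shared)           # shared experts: always on
        for j in range(self.k):
            for e_idx in range(len(self.experts)):
                mask = topi[:, j] == e_idx             # tokens routed to this expert
                if mask.any():
                    out[mask] += topv[mask, j].unsqueeze(-1) * self.experts[e_idx](x[mask])
        return out
```

Inside the Recurrent Block, the same router weights see a different hidden state at every loop depth, which is why the selected expert subset can change from iteration to iteration.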

Attention defaults to Multi-head Latent Attention (MLA) from DeepSeek-V2, which caches a compressed low-rank KV latent rather than full key/value tensors, yielding a 10–20× reduction in KV-cache memory at production scale.
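The saving comes from caching only a low-rank latent per token instead of full keys and values. A schematic sketch (dimensions are illustrative; the real MLA design's per-head structure and RoPE handling are omitted):

```python
import torch
import torch.nn as nn

class CompressedKVCache(nn.Module):
    """Schematic of the MLA idea: cache a small KV latent per token,
    reconstruct keys/values on the fly. Dimensions are illustrative."""
    def __init__(self, d_model: int = 512, d_latent: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress: cached
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # reconstruct values

    def forward(self, x: torch.Tensor):
        latent = self.down(x)  # only this (d_latent per token) is stored in the cache
        return self.up_k(latent), self.up_v(latent)
```

With these illustrative sizes, storing one 64-dim latent instead of a 512-dim key plus a 512-dim value is a 2·512/64 = 16× reduction, in the 10–20× range cited above.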

Reasoning in Continuous Latent Space

An important property of this architecture is that all reasoning takes place in continuous latent space. There is no intermediate token emission between loop steps: the model does not produce text mid-thought and then re-read it. This contrasts with chain-of-thought, which externalizes reasoning as token sequences, and aligns with latent-reasoning work such as COCONUT (2024).

Saunshi et al. (2025) formally show that every loop iteration of an RDT operates over real-valued vectors rather than discrete symbols. A continuous latent thought can encode several alternative next steps at once, enabling a breadth-first search of the reasoning space.

This also helps explain a tangible capability advantage. A standard transformer trained on 5-hop reasoning chains fails when tested on 10-hop chains at inference time: it has no mechanism to extend its depth beyond what it saw during training. A Recurrent-Depth Transformer handles this automatically: running additional loops at inference extends the reasoning chain without any retraining. Compute is allocated to harder problems, while simple ones exit early.

The Stability Problem

Training looped models has historically been brittle. The hidden state h_t can grow unboundedly across iterations, a failure mode called residual explosion. OpenMythos addresses this with a Linear Time-Invariant (LTI) injection constraint borrowed from the Parcae architecture (Prairie et al., 2026): the spectral radius of A, denoted ρ(A), is enforced to be less than 1 by construction, guaranteeing stability regardless of learning rate or gradient noise.
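One straightforward way to enforce ρ(A) < 1 by construction (an illustrative choice, not necessarily the Parcae parameterization) is to normalize A by its largest singular value, since ρ(A) ≤ σ_max(A):

```python
import torch
import torch.nn as nn

class ContractiveLinear(nn.Module):
    """Linear map with spectral radius < gamma by construction: the raw
    weight is divided by its largest singular value, then scaled by
    gamma < 1. Because rho(A) <= sigma_max(A), the result is contractive."""
    def __init__(self, d: int, gamma: float = 0.99):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d, d) / d ** 0.5)
        self.gamma = gamma

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        sigma_max = torch.linalg.matrix_norm(self.weight, ord=2)  # largest singular value
        A = self.gamma * self.weight / sigma_max                  # rho(A) <= gamma < 1
        return h @ A.T
```

Repeatedly applying a contractive A cannot blow up the carried-over hidden state, no matter how the raw weights move during training.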

A second failure mode exists at the other extreme: beyond a certain loop depth, additional recurrence degrades predictions, as the hidden state drifts past the solution and into noise. This is the 'overthinking' problem. OpenMythos addresses it with Adaptive Computation Time (ACT): a learned scalar halting signal per position decides dynamically when to stop the loop. Tokens that have already converged exit early, while harder positions receive more computation.
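An ACT-style loop with per-position early exit might look like this (a hypothetical sketch in the spirit of Graves-style adaptive computation; the halting head, threshold, and state-freezing scheme are assumptions):

```python
import torch
import torch.nn as nn

def act_loop(h, e, step_fn, halt_proj, max_steps=16, threshold=0.99):
    """Run step_fn repeatedly, halting each position once its accumulated
    halting probability crosses the threshold. h, e: (batch, seq, d).
    halt_proj maps the hidden state to one halting logit per position."""
    halted = torch.zeros(h.shape[:2], dtype=torch.bool)  # positions that stopped
    cum_p = torch.zeros(h.shape[:2])                     # accumulated halt prob
    for _ in range(max_steps):
        new_h = step_fn(h, e)
        active = ~halted
        h = torch.where(active.unsqueeze(-1), new_h, h)  # freeze halted positions
        cum_p = cum_p + torch.sigmoid(halt_proj(h)).squeeze(-1) * active
        halted = halted | (cum_p >= threshold)
        if halted.all():                                 # everyone converged: stop early
            break
    return h
```

The loop bound max_steps caps worst-case compute, while easy positions stop accumulating updates as soon as their halting mass crosses the threshold.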

Finally, Depth-Wise LoRA adapters introduce a small rank-r adaptation matrix at each iteration depth, giving each loop step slightly distinct behavior without adding substantial parameters — bridging the gap between pure weight-tying and fully distinct layers.
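A depth-wise LoRA step can be sketched as a shared base weight plus a rank-r correction selected by loop depth (shapes and initialization are illustrative):

```python
import torch
import torch.nn as nn

class DepthWiseLoRA(nn.Module):
    """Tied base weight W plus a per-depth rank-r correction B_t @ A_t.
    Each loop depth t gets its own small adapter; the big weight is shared."""
    def __init__(self, d: int, T: int = 16, r: int = 4):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)                # shared across all depths
        self.A = nn.Parameter(torch.randn(T, r, d) * 0.01)  # per-depth down-projection
        self.B = nn.Parameter(torch.zeros(T, d, r))         # per-depth up-projection (zero init)

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        # Base path is identical at every depth; only the LoRA path depends on t.
        return self.W(x) + x @ self.A[t].T @ self.B[t].T
```

Because B is initialized to zero, every depth starts out behaving exactly like the tied base layer and only gradually differentiates during training.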

Introducing OpenMythos

PyTorch implementation of an open-source first-principles reconstruction theory for Claude Mythos.

The architecture instantiates a looped transformer with a Mixture-of-Experts (MoE) routing mechanism, enabling iterative depth via weight sharing and… pic.twitter.com/YLvCid6CAr

— Kye Gomez (swarms) (@KyeGomezB) April 19, 2026

The Importance of Parameter Efficiency

The Parcae paper (Prairie et al., 2026) provides the basis for the efficiency claim. At 770M parameters, an RDT matches a 1.3B standard transformer trained on identical data: roughly half the parameters for equivalent downstream quality. Optimal token counts and recurrence counts follow the same power law, with constant exponents across scales. This is the first predictive scaling law for looped models.

Crucially, reasoning depth scales with inference-time compute rather than stored parameter count. That reframes a dominant assumption in the scaling debate: the relevant axis may not be parameters at training time but loop depth at inference time.

What OpenMythos Contributes

OpenMythos contains four research artifacts, chief among them a fully configurable PyTorch RDT implementation with the MoE feedforward network, Multi-head Latent Attention, and LTI-stable recurrent injection integrated as training primitives.

Whether or not Mythos is actually an RDT, OpenMythos gives the research community something concrete and runnable — an implementation of an architecture class the literature increasingly suggests is underexplored, and one that may represent a fundamentally different path to capable AI than simply training bigger models.

