AI-trends.today

OpenMythos Coding Tutorial: Recurrent-Depth Transformers, Depth Extrapolation and Mixture of Experts Routing

Tech · By Gavin Wallace · 24/04/2026 · 7 Mins Read

This tutorial walks through the implementation of OpenMythos, a theoretical reconstruction of the Claude Mythos architecture that scales reasoning through iterative computation rather than larger parameter counts. We build and analyze models with both GQA and MLA attention mechanisms, examine memory efficiency by comparing their KV caches, and confirm stability via the spectral characteristics of the recurrent updates. We then train the model on a structured parity task and investigate whether increasing the loop depth at inference time improves performance without retraining. Along the way, we explore adaptive computation with ACT halting.
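Before diving into the library code, the core recurrent-depth idea can be sketched in a few lines. This is an illustrative toy, not the OpenMythos internals (the weight names and shapes here are our own): a prelude embeds tokens, one weight-shared block is applied `n_loops` times, and a coda maps back to the vocabulary, so depth becomes a runtime knob instead of a parameter-count decision.

```python
import numpy as np

# Illustrative toy of recurrent depth (NOT OpenMythos internals):
# one shared block is reused T times between a prelude and a coda.
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(16, 32))    # prelude: token -> state
W_loop = rng.normal(scale=0.1, size=(32, 32))  # the single shared recurrent block
W_out = rng.normal(scale=0.1, size=(32, 16))   # coda: state -> vocab logits

def forward(tokens, n_loops=4):
    h = W_in[tokens]                  # embed: (seq, 32)
    for _ in range(n_loops):          # same weights reused every loop
        h = h + np.tanh(h @ W_loop)   # residual recurrent update
    return h @ W_out                  # logits: (seq, 16)

logits = forward(np.array([1, 2, 1, 2]), n_loops=6)
print(logits.shape)   # (4, 16)
```

Because the loop reuses one set of weights, `n_loops` can differ between training and inference, which is exactly what the depth-extrapolation experiment below exploits.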

import subprocess, sys
try:
    import open_mythos  # noqa: F401
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
                           "open-mythos"])


import math, time, copy
from collections import Counter, defaultdict


import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt


from open_mythos.main import (
    OpenMythos, MythosConfig,
    ACTHalting, MoEFFN,
)


torch.manual_seed(0); np.random.seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"▸ device = {device}   |   torch = {torch.__version__}")


def make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,
               max_loops=8, seq_len=128, vocab=256):
   base = dict(
       vocab_size=vocab, dim=dim, n_heads=n_heads,
       max_seq_len=seq_len, max_loop_iters=max_loops,
       prelude_layers=1, coda_layers=1,
       n_experts=n_experts, n_shared_experts=1,
       n_experts_per_tok=2, expert_dim=dim // 2,
       lora_rank=8, attn_type=attn_type,
   )
   if attn_type == "gqa":
       return MythosConfig(**base, n_kv_heads=2)
   return MythosConfig(
       **base, n_kv_heads=n_heads,
       kv_lora_rank=32, q_lora_rank=64,
       qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
   )


cfg_gqa = make_config("gqa")
cfg_mla = make_config("mla")
m_gqa = OpenMythos(cfg_gqa).to(device)
m_mla = OpenMythos(cfg_mla).to(device)


print("\n─── Part 1 ─ model sizes ──────────────────────────────")
print(f"GQA  params : {sum(p.numel() for p in m_gqa.parameters()):>10,}")
print(f"MLA  params : {sum(p.numel() for p in m_mla.parameters()):>10,}")

We install and import the required dependencies and initialize the environment for running OpenMythos. Both the GQA and MLA configurations are constructed and their models instantiated, and we compare parameter counts to see how the architectural differences affect model scale.
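The KV-cache comparison that follows can be estimated by hand first. The arithmetic below is a back-of-the-envelope sketch based on the config fields above; the head split (`head_dim = dim // n_heads`) and the assumption that MLA caches one compressed latent of width `kv_lora_rank` plus a decoupled RoPE key of width `qk_rope_head_dim` are our reading of those fields, not confirmed OpenMythos internals.

```python
# Back-of-the-envelope KV-cache bytes per token per layer (float32),
# using the tutorial's config values. head_dim = dim // n_heads and the
# MLA latent layout are ASSUMPTIONS, not confirmed internals.
dim, n_heads = 128, 4
head_dim = dim // n_heads               # 32
bytes_f32 = 4

# GQA caches full K and V tensors for n_kv_heads grouped heads.
n_kv_heads = 2
gqa_per_token = 2 * n_kv_heads * head_dim * bytes_f32   # K + V

# MLA caches one compressed latent (kv_lora_rank wide) plus a small
# decoupled RoPE key (qk_rope_head_dim wide).
kv_lora_rank, qk_rope_head_dim = 32, 16
mla_per_token = (kv_lora_rank + qk_rope_head_dim) * bytes_f32

print(gqa_per_token, mla_per_token)
print(f"MLA ≈ {gqa_per_token / mla_per_token:.2f}× smaller")
```

Under these assumptions, GQA spends 512 bytes per token per layer against MLA's 192, which is the kind of gap Part 2 then measures empirically.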

def cache_bytes(kv: dict) -> int:
    total = 0
    for entry in kv.values():
        for t in entry.values():
            total += t.element_size() * t.numel()
    return total


x = torch.randint(0, 256, (1, 64), device=device)
ck_gqa, ck_mla = {}, {}
with torch.no_grad():
   m_gqa(x, n_loops=4, kv_cache=ck_gqa)
   m_mla(x, n_loops=4, kv_cache=ck_mla)


gqa_kb = cache_bytes(ck_gqa) / 1024
mla_kb = cache_bytes(ck_mla) / 1024
print("\n─── Part 2 ─ KV-cache footprint (1×64 tokens, 4 loops) ─")
print(f"GQA cache : {gqa_kb:6.2f} KB   ({len(ck_gqa)} layer-keys)")
print(f"MLA cache : {mla_kb:6.2f} KB   ({len(ck_mla)} layer-keys)")
print(f"ratio      : MLA is ≈{gqa_kb / max(mla_kb, 1e-9):.2f}× smaller")


def show_stability(model, tag):
    A = model.recurrent.injection.get_A()
    print(f"{tag:3s}  ρ(A): min={A.min():.4f}  max={A.max():.4f}  "
          f"mean={A.mean():.4f}  stable={bool((A < 1).all())}")


print("\n─── Part 3 ─ spectral radius at init ──────────────────")
show_stability(m_gqa, "GQA")
show_stability(m_mla, "MLA")


opt = torch.optim.Adam(m_mla.parameters(), lr=1.0)
for _ in range(30):
   loss = m_mla(torch.randint(0, 256, (2, 16), device=device),
                n_loops=2).square().mean()
   opt.zero_grad(); loss.backward(); opt.step()
show_stability(m_mla, "MLA after abusive training (lr=1.0, 30 steps)")

We compute and compare the KV-cache memory footprint of a forward pass for the GQA and MLA attention variants. We then check the stability of the recurrent component by analysing the spectral radius of its update matrix A, and finally stress the model under extreme training conditions to verify that it remains stable.
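To see why the spectral radius is the right stability diagnostic for a recurrent update, a toy linear recurrence is enough. The snippet below is purely illustrative and independent of OpenMythos: iterating h ← A·h stays bounded when every eigenvalue magnitude of A is below 1 and diverges when any exceeds 1.

```python
import numpy as np

# Toy illustration (not OpenMythos code): iterate h <- A h and watch the
# norm. Diagonal A makes the eigenvalues explicit.
def iterate(A, steps=50):
    h = np.ones(A.shape[0])
    for _ in range(steps):
        h = A @ h
    return np.linalg.norm(h)

stable = np.diag([0.5, 0.9])      # spectral radius 0.9 < 1 -> decays
unstable = np.diag([0.5, 1.1])    # spectral radius 1.1 > 1 -> blows up

print(f"stable   norm after 50 steps: {iterate(stable):.3e}")
print(f"unstable norm after 50 steps: {iterate(unstable):.3e}")
```

The 0.9 mode shrinks to roughly 0.9^50 ≈ 0.005 of its starting value, while the 1.1 mode grows by about 1.1^50 ≈ 117×, which is why Part 3 checks that all entries of A stay below 1 even after abusive training.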

VOCAB = 64
SEQ_LEN = 24


def make_batch(batch=64, seq_len=SEQ_LEN):
    x = torch.randint(1, 3, (batch, seq_len), device=device)
    bits = x - 1
    parity = bits.cumsum(dim=1) % 2
    y = parity + 1
    return x, y


cfg = MythosConfig(
   vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,
   max_seq_len=SEQ_LEN + 4, max_loop_iters=16,
   prelude_layers=1, coda_layers=1,
   n_experts=4, n_shared_experts=1, n_experts_per_tok=2,
   expert_dim=32, lora_rank=4, attn_type="gqa",
   act_threshold=0.99,
)
model = OpenMythos(cfg).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
T_TRAIN = 3


print("\n─── Part 5 ─ training (T_train = 3) ───────────────────")
print(f"params: {sum(p.numel() for p in model.parameters()):,}")
losses = []
t0 = time.time()
for step in range(600):
   x, y = make_batch(64)
   logits = model(x, n_loops=T_TRAIN)
   loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
   opt.zero_grad(); loss.backward()
   opt.step()
   losses.append(loss.item())
   if step % 100 == 0 or step == 599:
        with torch.no_grad():
            acc = (logits.argmax(-1) == y).float().mean().item()
       print(f"step {step:3d}   loss={loss.item():.4f}   acc@T3={acc:.3f}")
print(f"training wallclock: {time.time() - t0:.1f}s")

To train the model, we define a cumulative-parity task. OpenMythos is initialized with a fixed loop depth and trained with a cross-entropy loss. We monitor loss and accuracy during training to see how the model performs at the constrained loop depth.
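The label construction is easy to verify by hand. This standalone sketch mirrors what `make_batch` computes, in plain Python: tokens are 1 or 2, the underlying bit is `token - 1`, and the target at each position is the running parity of all bits so far, shifted back into {1, 2} so targets share the token vocabulary.

```python
# The cumulative-parity task in plain Python, mirroring make_batch:
# bit = token - 1, target = (running XOR of bits) + 1.
def parity_targets(tokens):
    parity, out = 0, []
    for t in tokens:
        parity ^= (t - 1)      # running parity of the bit stream
        out.append(parity + 1)
    return out

x = [1, 2, 1, 1, 2, 2, 1, 2]
print(parity_targets(x))   # [1, 2, 2, 2, 1, 2, 2, 1]
```

The task is a good fit for recurrent depth because each target depends on the entire prefix, so more loop iterations give the model more chances to propagate parity information along the sequence.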

model.eval()
T_sweep = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]
accs = []
with torch.no_grad():
    x_eval, y_eval = make_batch(512)
    for T in T_sweep:
        logits = model(x_eval, n_loops=T)
       accs.append((logits.argmax(-1) == y_eval).float().mean().item())


print("\n─── Part 6 ─ depth extrapolation (T_train=3) ──────────")
for T, a in zip(T_sweep, accs):
   bar = "█" * int(a * 40)
   marker = "  ← trained here" if T == T_TRAIN else ""
   print(f"T={T:2d}  acc={a:.3f}  {bar}{marker}")


halt_trace: list[torch.Tensor] = []
orig_halt = model.recurrent.act.forward


def halt_hook(self, h):
   p = orig_halt(h)
   halt_trace.append(p.detach().cpu())
   return p
model.recurrent.act.forward = halt_hook.__get__(model.recurrent.act, ACTHalting)


with torch.no_grad():
   x_h, _ = make_batch(1)
   _ = model(x_h, n_loops=16)


model.recurrent.act.forward = orig_halt


halts = torch.stack(halt_trace, dim=0)[:, 0].numpy()
print("\n─── Part 7 ─ ACT halting matrix (loops × positions) ───")
print(f"shape: {halts.shape}  |  "
      f"mean halt-prob per loop: "
      f"{', '.join(f'{v:.2f}' for v in halts.mean(1))}")

To study depth extrapolation, we evaluate the trained model while varying the number of loops, and find that increasing the loop count improves accuracy without any retraining. We also hook the ACT module to record the halting probability at every sequence position.
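For intuition about what the halting matrix encodes, here is a deliberately simplified ACT-style accumulation rule (illustrative only, not the exact `ACTHalting` implementation): each loop emits a halt probability for a position, and the position stops looping once the accumulated probability crosses the threshold, matching the `act_threshold=0.99` set in the config above.

```python
# Simplified ACT-style halting sketch (NOT the exact ACTHalting logic):
# a position stops once its cumulative halt probability crosses the
# threshold; otherwise it runs the full loop budget.
def loops_to_halt(halt_probs, threshold=0.99):
    cum = 0.0
    for t, p in enumerate(halt_probs, start=1):
        cum += p
        if cum >= threshold:
            return t
    return len(halt_probs)    # threshold never reached: use all loops

print(loops_to_halt([0.2, 0.5, 0.4]))   # crosses 0.99 on loop 3
print(loops_to_halt([0.95, 0.9]))       # crosses 0.99 on loop 2
```

This is why the halting matrix is loops × positions: easy positions accumulate halt mass quickly and stop early, while hard positions keep spending loop iterations.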

expert_hits = Counter()
orig_moe = model.recurrent.block.ffn.forward


def moe_hook(self, x):
    flat = x.view(-1, x.shape[-1])
    logits = self.router(flat) + self.router_bias
    scores = F.softmax(logits, dim=-1)
    _, idx = scores.topk(self.topk, dim=-1)
    for e in idx.flatten().tolist():
        expert_hits[e] += 1
    return orig_moe(x)


model.recurrent.block.ffn.forward = moe_hook.__get__(
   model.recurrent.block.ffn, MoEFFN)


with torch.no_grad():
   x_m, _ = make_batch(32)
   _ = model(x_m, n_loops=T_TRAIN)


model.recurrent.block.ffn.forward = orig_moe


print("\n─── Part 8 ─ MoE expert utilization ───────────────────")
total = sum(expert_hits.values())
for eid in range(cfg.n_experts):
   share = expert_hits.get(eid, 0) / max(total, 1)
   print(f"expert {eid}: {share*100:5.2f}% of topk slots")
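The routing decision the hook intercepts can also be sketched standalone. This is an illustrative top-k softmax router in NumPy, not the `MoEFFN` internals: softmax over router logits, keep the top two experts per token, then renormalize their weights so each token's expert mixture sums to 1.

```python
import numpy as np

# Standalone top-k routing sketch (illustrative, not MoEFFN internals):
# softmax the router logits, keep the k best experts per token, and
# renormalize the kept weights to sum to 1.
def topk_route(logits, k=2):
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    scores = exp / exp.sum(axis=-1, keepdims=True)
    idx = np.argsort(scores, axis=-1)[:, ::-1][:, :k]   # top-k expert ids
    w = np.take_along_axis(scores, idx, axis=-1)
    w = w / w.sum(axis=-1, keepdims=True)               # renormalize
    return idx, w

logits = np.array([[2.0, 0.1, 1.0, -1.0],
                   [0.0, 3.0, 0.5,  2.5]])
idx, w = topk_route(logits)
print(idx)   # token 0 -> experts [0, 2]; token 1 -> experts [1, 3]
```

Counting how often each expert id lands in `idx` across a batch, as the hook does with a `Counter`, is what produces the utilization shares printed above.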


prompt = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]], device=device)
print("\n─── Part 9 ─ generation ───────────────────────────────")
print(f"prompt (parity pattern): {prompt.tolist()[0]}")
for T_gen in [1, 4, 12]:
   with torch.no_grad():
       out = model.generate(prompt, max_new_tokens=8,
                            n_loops=T_gen, temperature=0.1, top_k=2)
   print(f"T_gen={T_gen:2d}  → {out.tolist()[0]}")


fig, axes = plt.subplots(1, 3, figsize=(15, 4))


axes[0].plot(losses)
axes[0].set_title("Training loss (parity task)")
axes[0].set_xlabel("step"); axes[0].set_ylabel("cross-entropy")
axes[0].grid(alpha=0.3)


axes[1].plot(T_sweep, accs, "o-", linewidth=2, markersize=8)
axes[1].axvline(T_TRAIN, color="red", linestyle="--",
                label=f"T_train = {T_TRAIN}")
axes[1].set_title("Depth extrapolation: accuracy vs inference loops")
axes[1].set_xlabel("n_loops at inference"); axes[1].set_ylabel("accuracy")
axes[1].legend(); axes[1].set_ylim(0, 1.05)


im = axes[2].imshow(halts, aspect="auto", cmap="viridis",
                    vmin=0, vmax=halts.max())
axes[2].set_title("ACT halting probability\n(loop t × position)")
axes[2].set_xlabel("position"); axes[2].set_ylabel("loop iteration t")
plt.colorbar(im, ax=axes[2], fraction=0.046, pad=0.04)


plt.tight_layout()
plt.savefig("openmythos_tutorial.png", dpi=120, bbox_inches="tight")
plt.show()

Next, we track how tokens are routed between experts to analyze expert utilization in the MoE layer. We then generate sequences at different loop depths and observe how the depth affects the outputs. Finally, we plot the training loss, the depth-extrapolation curve, and the ACT halting behavior.

In conclusion, OpenMythos uses looped computation to achieve effective depth extrapolation, letting the model improve accuracy simply by increasing the number of inference-time loops. MLA attention reduced KV-cache memory consumption substantially compared with GQA, and the recurrent mechanism remained stable even under extreme training conditions. We also observed that ACT adapts the amount of computation per sequence position and that MoE routing spreads the workload across experts. Overall, this architecture points toward compute-adaptive reasoning, where extra computation can be traded for performance gains without changing any parameters.

