This tutorial explores the implementation of OpenMythos, a theoretical reconstruction of the Claude Mythos architecture that achieves stronger reasoning through iterative computation rather than larger parameter counts. We build and analyze models using both GQA and MLA attention mechanisms, examine memory efficiency by comparing KV caches, and confirm stability via the spectral characteristics of the recurrent updates. We then train the model on a structured parity task and investigate whether increasing the loop depth at inference time improves performance without retraining. Along the way, we examine adaptive computation with ACT halting.
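Before diving into the library itself, the core idea of looped computation can be sketched in a few lines. The toy module below (`TinyLoopedNet` is a hypothetical name, not part of OpenMythos) applies one weight-tied block T times between a "prelude" and a "coda", so compute depth is a runtime argument rather than a function of parameter count:

```python
import torch
import torch.nn as nn

class TinyLoopedNet(nn.Module):
    """Toy sketch of looped computation: one weight-tied block applied
    n_loops times, so depth is chosen at inference time."""

    def __init__(self, dim=32, vocab=16):
        super().__init__()
        self.prelude = nn.Embedding(vocab, dim)   # encode once
        self.block = nn.Linear(dim, dim)          # reused every loop
        self.coda = nn.Linear(dim, vocab)         # decode once

    def forward(self, x, n_loops=4):
        h = self.prelude(x)
        for _ in range(n_loops):                  # iterate instead of stacking layers
            h = torch.tanh(self.block(h))
        return self.coda(h)

net = TinyLoopedNet()
x = torch.randint(0, 16, (2, 8))
# The same parameters serve both shallow and deep inference.
shallow = net(x, n_loops=1)
deep = net(x, n_loops=12)
print(shallow.shape, deep.shape)  # both torch.Size([2, 8, 16])
```

OpenMythos follows the same pattern at scale, with attention, MoE feed-forward layers, and ACT halting inside the recurrent block.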
import subprocess, sys
try:
    import open_mythos  # noqa: F401
except ImportError:
    # Install the package on first run if it is missing.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
                           "open-mythos"])
import math, time, copy
from collections import Counter, defaultdict
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from open_mythos.main import (
    OpenMythos, MythosConfig,
    ACTHalting, MoEFFN,
)
torch.manual_seed(0); np.random.seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"▸ device = {device} | torch = {torch.__version__}")
def make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,
                max_loops=8, seq_len=128, vocab=256):
    base = dict(
        vocab_size=vocab, dim=dim, n_heads=n_heads,
        max_seq_len=seq_len, max_loop_iters=max_loops,
        prelude_layers=1, coda_layers=1,
        n_experts=n_experts, n_shared_experts=1,
        n_experts_per_tok=2, expert_dim=dim // 2,
        lora_rank=8, attn_type=attn_type,
    )
    if attn_type == "gqa":
        return MythosConfig(**base, n_kv_heads=2)
    return MythosConfig(
        **base, n_kv_heads=n_heads,
        kv_lora_rank=32, q_lora_rank=64,
        qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
    )
cfg_gqa = make_config("gqa")
cfg_mla = make_config("mla")
m_gqa = OpenMythos(cfg_gqa).to(device)
m_mla = OpenMythos(cfg_mla).to(device)
print("\n─── Part 1 ─ model sizes ──────────────────────────────")
print(f"GQA params : {sum(p.numel() for p in m_gqa.parameters()):>10,}")
print(f"MLA params : {sum(p.numel() for p in m_mla.parameters()):>10,}")
We install and import the required dependencies, initialize the environment, and instantiate both the GQA and MLA models from their configurations. Their parameter counts are then compared to see how the architectural differences influence model scale.
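The parameter-count gap can be estimated with a back-of-envelope formula. The sketch below (a simplification, not OpenMythos's exact layer layout) counts only the attention projection weights: Q and the output projection are full-width, while GQA shrinks K/V by the ratio of KV heads to query heads:

```python
def gqa_attn_params(dim, n_heads, n_kv_heads, head_dim=None):
    """Rough projection-parameter count for grouped-query attention
    (weights only, no biases)."""
    head_dim = head_dim or dim // n_heads
    q = dim * n_heads * head_dim            # query projection
    kv = 2 * dim * n_kv_heads * head_dim    # key + value projections
    o = n_heads * head_dim * dim            # output projection
    return q + kv + o

full = gqa_attn_params(128, 4, 4)   # MHA baseline: as many KV heads as query heads
gqa = gqa_attn_params(128, 4, 2)    # this tutorial's GQA config: 2 KV heads
print(full, gqa)  # → 65536 49152
```

MLA instead factorizes Q and KV through low-rank latents (`q_lora_rank=64`, `kv_lora_rank=32` above), which shifts parameters into small down/up projections; the exact count depends on the library's internal layout.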
def cache_bytes(kv: dict) -> int:
    total = 0
    for entry in kv.values():
        for t in entry:
            total += t.element_size() * t.numel()
    return total
x = torch.randint(0, 256, (1, 64), device=device)
ck_gqa, ck_mla = {}, {}
with torch.no_grad():
    m_gqa(x, n_loops=4, kv_cache=ck_gqa)
    m_mla(x, n_loops=4, kv_cache=ck_mla)
gqa_kb = cache_bytes(ck_gqa) / 1024
mla_kb = cache_bytes(ck_mla) / 1024
print("\n─── Part 2 ─ KV-cache footprint (1×64 tokens, 4 loops) ─")
print(f"GQA cache : {gqa_kb:6.2f} KB ({len(ck_gqa)} layer-keys)")
print(f"MLA cache : {mla_kb:6.2f} KB ({len(ck_mla)} layer-keys)")
print(f"ratio : MLA is ≈{gqa_kb / max(mla_kb, 1e-9):.2f}× smaller")
def show_stability(model, tag):
    A = model.recurrent.injection.get_A()
    print(f"{tag:3s} ρ(A): min={A.min():.4f} max={A.max():.4f} "
          f"mean={A.mean():.4f} stable={bool((A.abs() < 1).all())}")
print("\n─── Part 3 ─ spectral radius at init ──────────────────")
show_stability(m_gqa, "GQA")
show_stability(m_mla, "MLA")
opt = torch.optim.Adam(m_mla.parameters(), lr=1.0)
for _ in range(30):
    loss = m_mla(torch.randint(0, 256, (2, 16), device=device),
                 n_loops=2).square().mean()
    opt.zero_grad(); loss.backward(); opt.step()
show_stability(m_mla, "MLA after abusive training (lr=1.0, 30 steps)")
We compute and compare the KV-cache memory footprint of a forward pass for the GQA and MLA attention variants. The stability of the recurrent component is then checked by analyzing the spectral radius of its update matrix A, and the model is stressed under extreme training conditions to verify that it remains stable.
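The measured cache ratio can be sanity-checked with a back-of-envelope model. The sketch below is an approximation, not the library's actual cache layout: GQA stores full K and V tensors per KV head per token, while MLA stores one compressed latent (plus a shared RoPE key slice) per token. The numbers plugged in are the tutorial's own config values (`n_kv_heads=2`, head_dim = dim/n_heads = 32, `kv_lora_rank=32`, `qk_rope_head_dim=16`), assuming fp32:

```python
def gqa_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=4):
    # K and V are both cached: 2 tensors per layer per token.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per

def mla_cache_bytes(seq_len, n_layers, kv_lora_rank, rope_dim, bytes_per=4):
    # MLA caches one low-rank latent plus a small RoPE key slice per token
    # instead of full per-head K/V.
    return n_layers * seq_len * (kv_lora_rank + rope_dim) * bytes_per

g = gqa_cache_bytes(64, 1, 2, 32)       # 32768 bytes per recurrent layer
m = mla_cache_bytes(64, 1, 32, 16)      # 12288 bytes per recurrent layer
print(g / m)                            # → ratio ≈ 2.67
```

The exact ratio printed by the real models will differ with the number of cached layers and loop iterations, but the scaling argument is the same: MLA's footprint grows with the latent rank rather than with heads × head_dim.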
VOCAB = 64
SEQ_LEN = 24
def make_batch(batch=64, seq_len=SEQ_LEN):
    x = torch.randint(1, 3, (batch, seq_len), device=device)
    bits = x - 1
    parity = bits.cumsum(dim=1) % 2
    y = parity + 1
    return x, y
cfg = MythosConfig(
    vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,
    max_seq_len=SEQ_LEN + 4, max_loop_iters=16,
    prelude_layers=1, coda_layers=1,
    n_experts=4, n_shared_experts=1, n_experts_per_tok=2,
    expert_dim=32, lora_rank=4, attn_type="gqa",
    act_threshold=0.99,
)
model = OpenMythos(cfg).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
T_TRAIN = 3
print("\n─── Part 5 ─ training (T_train = 3) ───────────────────")
print(f"params: {sum(p.numel() for p in model.parameters()):,}")
losses = []
t0 = time.time()
for step in range(600):
    x, y = make_batch(64)
    logits = model(x, n_loops=T_TRAIN)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
    opt.zero_grad(); loss.backward()
    opt.step()
    losses.append(loss.item())
    if step % 100 == 0 or step == 599:
        with torch.no_grad():
            acc = (logits.argmax(-1) == y).float().mean().item()
        print(f"step {step:3d} loss={loss.item():.4f} acc@T3={acc:.3f}")
print(f"training wallclock: {time.time() - t0:.1f}s")
To train the model, we define a cumulative-parity task. OpenMythos is initialized with a fixed loop depth and trained with cross-entropy loss, and we monitor loss and accuracy during training to see how the model performs under this constrained loop depth.
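To make the task concrete, the labels produced by `make_batch` can be reproduced with a small reference function: token 1 encodes bit 0 and token 2 encodes bit 1, and the target at position i is the parity of the bits seen so far, shifted back into the {1, 2} alphabet:

```python
import torch

def parity_labels(x):
    """Reference labels for the cumulative-parity task."""
    bits = x - 1                       # map tokens {1, 2} to bits {0, 1}
    return bits.cumsum(dim=1) % 2 + 1  # running parity, back to {1, 2}

x = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]])
print(parity_labels(x).tolist()[0])  # → [1, 2, 2, 2, 1, 2, 2, 1]
```

Parity is a natural probe for looped models because each position's answer depends on the entire prefix, so deeper iteration gives the model more steps to propagate that running state.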
model.eval()
T_sweep = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]
accs = []
with torch.no_grad():
    x_eval, y_eval = make_batch(512)
    for T in T_sweep:
        logits = model(x_eval, n_loops=T)
        accs.append((logits.argmax(-1) == y_eval).float().mean().item())
print("\n─── Part 6 ─ depth extrapolation (T_train=3) ──────────")
for T, a in zip(T_sweep, accs):
    bar = "█" * int(a * 40)
    marker = " ← trained here" if T == T_TRAIN else ""
    print(f"T={T:2d} acc={a:.3f} {bar}{marker}")
halt_trace: list[torch.Tensor] = []
orig_halt = model.recurrent.act.forward
def halt_hook(self, h):
    p = orig_halt(h)
    halt_trace.append(p.detach().cpu())
    return p
model.recurrent.act.forward = halt_hook.__get__(model.recurrent.act, ACTHalting)
with torch.no_grad():
    x_h, _ = make_batch(1)
    _ = model(x_h, n_loops=16)
model.recurrent.act.forward = orig_halt
halts = torch.stack(halt_trace, dim=0)[:, 0].numpy()
print("\n─── Part 7 ─ ACT halting matrix (loops × positions) ───")
print(f"shape: {halts.shape} | "
      f"mean halt-prob per loop: "
      f"{', '.join(f'{v:.2f}' for v in halts.mean(1))}")
To study depth extrapolation, we evaluate the trained model while varying the number of loops, and find that increasing the loop count improves accuracy without any retraining. We also hook the ACT mechanism to record the halting probability at every sequence position.
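The intuition behind ACT halting can be shown in isolation. In this simplified sketch (the real `ACTHalting` module's internals may differ), each position accumulates its per-loop halt probabilities and stops iterating once the running sum crosses a threshold, here 0.99 to match the `act_threshold` used above:

```python
import torch

def act_depth(halt_probs, threshold=0.99):
    """Toy ACT-style halting: return the first loop (1-indexed) at which
    the cumulative halt probability crosses the threshold."""
    cum = torch.cumsum(halt_probs, dim=0)
    hit = (cum >= threshold).nonzero()
    return int(hit[0]) + 1 if len(hit) else len(halt_probs)

easy = torch.tensor([0.7, 0.3, 0.0, 0.0])   # confident early
hard = torch.tensor([0.1, 0.2, 0.3, 0.4])   # needs every loop
print(act_depth(easy), act_depth(hard))     # → 2 4
```

This is what the halting matrix above visualizes: "easy" positions concentrate their halt probability in early loops, while harder positions spread it across many iterations.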
expert_hits = Counter()
orig_moe = model.recurrent.block.ffn.forward
def moe_hook(self, x):
    flat = x.view(-1, x.shape[-1])
    logits = self.router(flat) + self.router_bias
    scores = F.softmax(logits, dim=-1)
    _, idx = scores.topk(self.topk, dim=-1)
    for e in idx.flatten().tolist():
        expert_hits[e] += 1
    return orig_moe(x)
model.recurrent.block.ffn.forward = moe_hook.__get__(
    model.recurrent.block.ffn, MoEFFN)
with torch.no_grad():
    x_m, _ = make_batch(32)
    _ = model(x_m, n_loops=T_TRAIN)
model.recurrent.block.ffn.forward = orig_moe
print("\n─── Part 8 ─ MoE expert utilization ───────────────────")
total = sum(expert_hits.values())
for eid in range(cfg.n_experts):
    share = expert_hits.get(eid, 0) / max(total, 1)
    print(f"expert {eid}: {share*100:5.2f}% of topk slots")
prompt = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]], device=device)
print("\n─── Part 9 ─ generation ───────────────────────────────")
print(f"prompt (parity pattern): {prompt.tolist()[0]}")
for T_gen in [1, 4, 12]:
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=8,
                             n_loops=T_gen, temperature=0.1, top_k=2)
    print(f"T_gen={T_gen:2d} → {out.tolist()[0]}")
fig, axes = plt.subplots(1, 3, figsize=(15.4, 4))
axes[0].plot(losses)
axes[0].set_title("Training loss (parity task)")
axes[0].set_xlabel("step"); axes[0].set_ylabel("cross-entropy")
axes[0].grid(alpha=0.3)
axes[1].plot(T_sweep, accs, "o-", linewidth=2, markersize=8)
axes[1].axvline(T_TRAIN, color="red", linestyle="--",
                label=f"T_train = {T_TRAIN}")
axes[1].set_title("Depth extrapolation: accuracy vs inference loops")
axes[1].set_xlabel("n_loops at inference"); axes[1].set_ylabel("accuracy")
axes[1].legend(); axes[1].set_ylim(0, 1.05)
im = axes[2].imshow(halts, aspect="auto", cmap="viridis",
                    vmin=0, vmax=halts.max())
axes[2].set_title("ACT halting probability\n(loop t × position)")
axes[2].set_xlabel("position"); axes[2].set_ylabel("loop iteration t")
plt.colorbar(im, ax=axes[2], fraction=0.046, pad=0.04)
plt.tight_layout()
plt.savefig("openmythos_tutorial.png", dpi=120, bbox_inches="tight")
plt.show()
Next, we track how tokens are routed among experts to analyze expert utilization in the MoE layer, then generate sequences at different loop depths and observe the effect on the outputs. Finally, we plot the training loss, the depth-extrapolation curve, and the ACT halting behavior.
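The routing statistic gathered by the hook above can be reproduced in isolation. This standalone sketch uses a hypothetical router (a plain linear layer on random features, not OpenMythos's trained one) to show how top-k selection turns per-token scores into an expert load distribution:

```python
import torch
import torch.nn.functional as F
from collections import Counter

torch.manual_seed(0)
n_tokens, dim, n_experts, topk = 256, 16, 4, 2

# Hypothetical router: one linear map from token features to expert logits.
router = torch.nn.Linear(dim, n_experts)
tokens = torch.randn(n_tokens, dim)

scores = F.softmax(router(tokens), dim=-1)
_, idx = scores.topk(topk, dim=-1)      # top-2 experts per token

hits = Counter(idx.flatten().tolist())  # how often each expert was picked
total = n_tokens * topk
for eid in range(n_experts):
    print(f"expert {eid}: {hits.get(eid, 0) / total * 100:5.2f}%")
```

A healthy MoE keeps these shares roughly balanced (the shared expert and router bias in OpenMythos exist partly to encourage this); a collapsed router would concentrate nearly all slots on one or two experts.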
We conclude that OpenMythos uses looped computation to achieve genuine depth extrapolation: accuracy improves simply by increasing the number of inference-time loops. MLA attention reduced KV-cache memory consumption significantly compared with GQA, and the recurrent mechanism remained stable even under extreme training conditions. ACT allocates computation adaptively across sequence positions, and the MoE routing spreads the workload across experts. Overall, this architecture points toward compute-adaptive reasoning, where extra computation can be traded for performance gains without changing any parameters.