section("7 · Q1_0_g128 Quantization — What's Happening Under the Hood")
print(textwrap.dedent("""
╔══════════════════════════════════════════════════════════════╗
║ Bonsai Q1_0_g128 Weight Representation ║
╠══════════════════════════════════════════════════════════════╣
║ Each weight = 1 bit: 0 → −scale ║
║ 1 → +scale ║
║ Every 128 weights share one FP16 scale factor. ║
║ ║
║ Effective bits per weight: ║
║ 1 bit (sign) + 16/128 bits (shared scale) = 1.125 bpw ║
║ ║
║ Memory comparison for Bonsai-1.7B: ║
║ FP16: 3.44 GB (1.0× baseline) ║
║ Q1_0_g128: 0.24 GB (14.2× smaller!) ║
║ MLX 1-bit g128: 0.27 GB (12.8× smaller) ║
╚══════════════════════════════════════════════════════════════╝
"""))
print("📐 Python demo of Q1_0_g128 quantization logic:n")
Random Import
random.seed(42)
GROUP_SIZE = 128
weights_fp16 = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)]
Weights_fp16: scale = max (abs(w) for w
quantized = [1 if w >= 0 else 0 for w in weights_fp16]
dequantized = [scale if b == 1 else -scale for b in quantized]
mse = sum((a - b) ** 2 for a, b in zip(weights_fp16, dequantized)) / GROUP_SIZE
print(f" FP16 weights (first 8): {[f'{w:.4f}' for w in weights_fp16[:8]]}")
print(f" 1-bit repr (first 8): {quantized[:8]}")
print(f" Shared scale: {scale:.4f}")
print(f" Dequantized (first 8): {[f'{w:.4f}' for w in dequantized[:8]]}")
print(f" MSE of reconstruction: {mse:.6f}")
memory_fp16 = GROUP_SIZE * 2
memory_1bit = GROUP_SIZE / 8 + 2
print(f"n Memory: FP16={memory_fp16}B vs Q1_0_g128={memory_1bit:.1f}B "
F"({memory_fp16/memory_1bit:.1f}× reduction)")
section("8 · Performance Benchmark — Tokens per Second")
def benchmark(prompt, n_tokens=128, n_runs=3, **kw):
    timings = []
    for i in range(n_runs):
        print(f" Run {i+1}/{n_runs} …", end=" ", flush=True)
        _, elapsed = infer(prompt, verbose=False, n_predict=n_tokens, **kw)
        tps = n_tokens / elapsed
        timings.append(tps)
        print(f"{tps:.1f} tok/s")
    avg = sum(timings) / len(timings)
    print(f"\n ✅ Average: {avg:.1f} tok/s (over {n_runs} runs, {n_tokens} tokens each)")
    return avg
print("📊 Benchmarking Bonsai-1.7B on your GPU …")
tps = benchmark(
"Explain the concept of neural network backpropagation step by step.",
n_tokens=128, n_runs=3,
)
print("n Published reference throughputs (from whitepaper):")
print(" ┌──────────────────────┬─────────┬──────────────┐")
print(" │ Platform │ Backend │ TG128 tok/s │")
print(" ├──────────────────────┼─────────┼──────────────┤")
print(" │ RTX 4090 │ CUDA │ 674 │")
print(" │ M4 Pro 48 GB │ Metal │ 250 │")
print(f" │ Your GPU (measured) │ CUDA │ {tps:>7.1f} │")
print(" └──────────────────────┴─────────┴──────────────┘")
section("9 · Multi-Turn Chat with Context Accumulation")
def chat(user_msg, system="You are a helpful assistant.", history=None, **kw):
    if history is None:
        history = []
    history.append(("user", user_msg))
    full = f"system\n{system}\n"
    for role, msg in history:
        full += f"{role}\n{msg}\n"
    full += "assistant\n"
    safe = full.replace('"', '\\"').replace("\n", "\\n")
    cmd = (
        f'{LLAMA_CLI} -m "{MODEL_PATH}"'
        f' -p "{safe}" -e'
        f' -n 200 --temp 0.5 --top-p 0.85 --top-k 20'
        f' -ngl 99 -c 4096 --no-display-prompt'
    )
    result = run(cmd, capture=True, check=False)
    reply = result.stdout.strip()
    history.append(("assistant", reply))
    return reply, history
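For clarity, here is the prompt string this kind of helper assembles after a couple of turns. The plain `system`/`user`/`assistant` line format mirrors the template used above; it is a simple illustrative layout, not an official chat template:

```python
# Reassemble the multi-turn prompt the chat() helper builds (illustrative).
def build_prompt(system, history):
    full = f"system\n{system}\n"
    for role, msg in history:
        full += f"{role}\n{msg}\n"
    return full + "assistant\n"   # trailing cue: model continues as assistant

demo = build_prompt("You are a helpful assistant.",
                    [("user", "What is a 1-bit language model?"),
                     ("assistant", "A model whose weights are ±scale."),
                     ("user", "Why is that useful?")])
print(demo)
```

Because the whole transcript is replayed each turn, context accumulates — and each turn costs more prompt tokens than the last.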
print("🗣 Starting a 3-turn conversation about 1-bit models …n")
History = []
Turn = [
"What is a 1-bit language model?",
"What are the main trade-offs compared to 4-bit or 8-bit quantization?",
"How does Bonsai specifically address those trade-offs?",
]
"for i" msg(turns 1)
print(f"👤 Turn {i}: {msg}")
History = msg (chat)
print(f"🤖 Bonsai: {reply}n")
time.sleep(0.5)
section("10 · Sampling Parameter Exploration")
creative_prompt = "Write a one-sentence description of a futuristic city powered entirely by 1-bit AI."
configs = [
("Precise / Focused", dict(temp=0.1, top_k=10, top_p=0.70)),
("Balanced (default)", dict(temp=0.5, top_k=20, top_p=0.85)),
("Creative / Varied", dict(temp=0.9, top_k=50, top_p=0.95)),
("High entropy", dict(temp=1.2, top_k=100, top_p=0.98)),
]
print(f'Prompt: "{creative_prompt}"\n')
for label, params in configs:
    out, _ = infer(creative_prompt, verbose=False, n_predict=80, **params)
    print(f" [{label}]")
    print(f" temp={params['temp']}, top_k={params['top_k']}, top_p={params['top_p']}")
    print(f" → {out[:200]}\n")