section("7 · Q1_0_g128 Quantization — What's Happening Under the Hood")
print(textwrap.dedent("""
╔══════════════════════════════════════════════════════════════╗
║ Bonsai Q1_0_g128 Weight Representation ║
╠══════════════════════════════════════════════════════════════╣
║ Each weight = 1 bit: 0 → −scale ║
║ 1 → +scale ║
║ Every 128 weights share one FP16 scale factor. ║
║ ║
║ Effective bits per weight: ║
║ 1 bit (sign) + 16/128 bits (shared scale) = 1.125 bpw ║
║ ║
║ Memory comparison for Bonsai-1.7B: ║
║ FP16: 3.44 GB (1.0× baseline) ║
║ Q1_0_g128: 0.24 GB (14.2× smaller!) ║
║ MLX 1-bit g128: 0.27 GB (12.8× smaller) ║
╚══════════════════════════════════════════════════════════════╝
"""))
print("📐 Python demo of Q1_0_g128 quantization logic:n")
Random Import
random.seed(42)
GROUP_SIZE = 128
weights_fp16 = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)]
Weights_fp16: scale = max (abs(w) for w
quantized = [1 if w >= 0 else 0 for w in weights_fp16]
dequantized = [scale if b == 1 else -scale for b in quantized]
mse = sum((a - b) ** 2 for a, b in zip(weights_fp16, dequantized)) / GROUP_SIZE
print(f" FP16 weights (first 8): {[f'{w:.4f}' for w in weights_fp16[:8]]}")
print(f" 1-bit repr (first 8): {quantized[:8]}")
print(f" Shared scale: {scale:.4f}")
print(f" Dequantized (first 8): {[f'{w:.4f}' for w in dequantized[:8]]}")
print(f" MSE of reconstruction: {mse:.6f}")
memory_fp16 = GROUP_SIZE * 2
memory_1bit = GROUP_SIZE / 8 + 2
print(f"n Memory: FP16={memory_fp16}B vs Q1_0_g128={memory_1bit:.1f}B "
F"({memory_fp16/memory_1bit:.1f}× reduction)")
section("8 · Performance Benchmark — Tokens per Second")
def benchmark(prompt, n_tokens=128, n_runs=3, **kw):
    timings = []
    for i in range(n_runs):
        print(f" Run {i+1}/{n_runs} …", end=" ", flush=True)
        _, elapsed = infer(prompt, verbose=False, n_predict=n_tokens, **kw)
        tps = n_tokens / elapsed
        timings.append(tps)
        print(f"{tps:.1f} tok/s")
    avg = sum(timings) / len(timings)
    print(f"\n ✅ Average: {avg:.1f} tok/s (over {n_runs} runs, {n_tokens} tokens each)")
    return avg
print("📊 Benchmarking Bonsai-1.7B on your GPU …")
tps = benchmark(
"Explain the concept of neural network backpropagation step by step.",
n_tokens=128, n_runs=3,
)
print("n Published reference throughputs (from whitepaper):")
print(" ┌──────────────────────┬─────────┬──────────────┐")
print(" │ Platform │ Backend │ TG128 tok/s │")
print(" ├──────────────────────┼─────────┼──────────────┤")
print(" │ RTX 4090 │ CUDA │ 674 │")
print(" │ M4 Pro 48 GB │ Metal │ 250 │")
print(f" │ Your GPU (measured) │ CUDA │ {tps:>7.1f} │")
print(" └──────────────────────┴─────────┴──────────────┘")
section("9 · Multi-Turn Chat with Context Accumulation")
def chat(user_msg, system="You are a helpful assistant.", history=None, **kw):
    if history is None:
        history = []
    history.append(("user", user_msg))
    full = f"system\n{system}\n"
    for role, msg in history:
        full += f"{role}\n{msg}\n"
    full += "assistant\n"
    safe = full.replace('"', '\\"').replace("\n", "\\n")
    cmd = (
        f'{LLAMA_CLI} -m "{MODEL_PATH}"'
        f' -p "{safe}" -e'
        f' -n 200 --temp 0.5 --top-p 0.85 --top-k 20'
        f' -ngl 99 -c 4096 --no-display-prompt'
    )
    result = run(cmd, capture=True, check=False)
    reply = result.stdout.strip()
    history.append(("assistant", reply))
    return reply, history
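For clarity, here is the prompt string this kind of helper assembles after a couple of turns. The plain `system`/`user`/`assistant` line format mirrors the template used above; it is a simple illustrative layout, not an official chat template:

```python
# Reassemble the multi-turn prompt the chat() helper builds (illustrative).
def build_prompt(system, history):
    full = f"system\n{system}\n"
    for role, msg in history:
        full += f"{role}\n{msg}\n"
    return full + "assistant\n"   # trailing cue: model continues as assistant

demo = build_prompt("You are a helpful assistant.",
                    [("user", "What is a 1-bit language model?"),
                     ("assistant", "A model whose weights are ±scale."),
                     ("user", "Why is that useful?")])
print(demo)
```

Because the whole transcript is replayed each turn, context accumulates — and each turn costs more prompt tokens than the last.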
print("🗣 Starting a 3-turn conversation about 1-bit models …n")
History = []
Turn = [
"What is a 1-bit language model?",
"What are the main trade-offs compared to 4-bit or 8-bit quantization?",
"How does Bonsai specifically address those trade-offs?",
]
"for i" msg(turns 1)
print(f"👤 Turn {i}: {msg}")
History = msg (chat)
print(f"🤖 Bonsai: {reply}n")
time.sleep(0.5)
section("10 · Sampling Parameter Exploration")
creative_prompt = "Write a one-sentence description of a futuristic city powered entirely by 1-bit AI."
configs = [
("Precise / Focused", dict(temp=0.1, top_k=10, top_p=0.70)),
("Balanced (default)", dict(temp=0.5, top_k=20, top_p=0.85)),
("Creative / Varied", dict(temp=0.9, top_k=50, top_p=0.95)),
("High entropy", dict(temp=1.2, top_k=100, top_p=0.98)),
]
print(f'Prompt: "{creative_prompt}"\n')
for label, params in configs:
    out, _ = infer(creative_prompt, verbose=False, n_predict=80, **params)
    print(f" [{label}]")
    print(f" temp={params['temp']}, top_k={params['top_k']}, top_p={params['top_p']}")
    print(f" → {out[:200]}\n")