This tutorial shows how to run the PrismML Bonsai 1-bit LLM with CUDA, covering benchmarking and chat with JSON, RAG, and GGUF, plus a look inside Q1_0_g128 quantization: every 128 weights share one FP16 scale factor, so 1 bit (sign) + 16/128 bits (shared scale) = 1.125 bpw, making Bonsai-1.7B 14.2× smaller than FP16.

Tech · By Gavin Wallace · 19/04/2026 · 3 Mins Read
section("7 · Q1_0_g128 Quantization — What's Happening Under the Hood")


print(textwrap.dedent("""
╔══════════════════════════════════════════════════════════════╗
║           Bonsai Q1_0_g128 Weight Representation            ║
╠══════════════════════════════════════════════════════════════╣
║  Each weight = 1 bit:  0  →  −scale                         ║
║                        1  →  +scale                         ║
║  Every 128 weights share one FP16 scale factor.             ║
║                                                              ║
║  Effective bits per weight:                                  ║
║    1 bit (sign) + 16/128 bits (shared scale) = 1.125 bpw    ║
║                                                              ║
║  Memory comparison for Bonsai-1.7B:                         ║
║    FP16:            3.44 GB  (1.0×  baseline)               ║
║    Q1_0_g128:       0.24 GB  (14.2× smaller!)               ║
║    MLX 1-bit g128:  0.27 GB  (12.8× smaller)                ║
╚══════════════════════════════════════════════════════════════╝
"""))
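The arithmetic in the box can be checked directly. A quick sanity check (assuming all ~1.7B parameters are quantized; real models keep some layers such as embeddings in higher precision, so the absolute sizes are approximate) reproduces the 1.125 bpw and the 14.2× ratio:

```python
# Sanity-check the Q1_0_g128 arithmetic from the box above.
GROUP_SIZE = 128
N_PARAMS   = 1.7e9   # approximate; assumes every weight is quantized

# 1 sign bit per weight + one 16-bit FP16 scale shared by 128 weights
bpw = 1 + 16 / GROUP_SIZE
print(f"Effective bits per weight: {bpw} bpw")        # 1.125

fp16_gb = N_PARAMS * 16 / 8 / 1e9                     # 16 bits per weight
q1_gb   = N_PARAMS * bpw / 8 / 1e9
print(f"FP16:      {fp16_gb:.2f} GB")
print(f"Q1_0_g128: {q1_gb:.2f} GB  ({fp16_gb / q1_gb:.1f}x smaller)")
```

The 16 / 1.125 ratio is exactly 14.2×, independent of parameter count.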


print("📐 Python demo of Q1_0_g128 quantization logic:\n")

import random

random.seed(42)
GROUP_SIZE   = 128
weights_fp16 = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)]
scale        = max(abs(w) for w in weights_fp16)   # one shared scale per group
quantized    = [1 if w >= 0 else 0 for w in weights_fp16]
dequantized  = [scale if b == 1 else -scale for b in quantized]
mse          = sum((a - b) ** 2 for a, b in zip(weights_fp16, dequantized)) / GROUP_SIZE


print(f"  FP16 weights (first 8): {[f'{w:.4f}' for w in weights_fp16[:8]]}")
print(f"  1-bit repr  (first 8): {quantized[:8]}")
print(f"  Shared scale:          {scale:.4f}")
print(f"  Dequantized (first 8): {[f'{w:.4f}' for w in dequantized[:8]]}")
print(f"  MSE of reconstruction: {mse:.6f}")
memory_fp16 = GROUP_SIZE * 2
memory_1bit = GROUP_SIZE / 8 + 2
print(f"\n  Memory: FP16={memory_fp16}B  vs  Q1_0_g128={memory_1bit:.1f}B  "
      f"({memory_fp16/memory_1bit:.1f}× reduction)")
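The demo above keeps the 1-bit values as a Python list of ints; on disk each group is actually packed. A minimal sketch of the idea (not the real GGUF byte layout, and the LSB-first bit order here is an assumption) packs the 128 sign bits into 16 bytes next to one FP16 scale, landing exactly on 1.125 bpw:

```python
import random, struct

random.seed(42)
GROUP_SIZE = 128
weights = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)]
scale   = max(abs(w) for w in weights)
bits    = [1 if w >= 0 else 0 for w in weights]

# Pack 128 sign bits into 16 bytes (8 per byte, LSB-first here;
# the actual on-disk bit order is an implementation detail).
packed = bytearray(GROUP_SIZE // 8)
for i, b in enumerate(bits):
    packed[i // 8] |= b << (i % 8)

blob = struct.pack("<e", scale) + bytes(packed)   # 2-byte FP16 scale + sign bits
print(f"Group size on disk: {len(blob)} bytes "
      f"({len(blob) * 8 / GROUP_SIZE:.3f} bits/weight)")   # 18 bytes -> 1.125 bpw

# Unpack to confirm the round trip is lossless for the sign bits
unpacked = [(packed[i // 8] >> (i % 8)) & 1 for i in range(GROUP_SIZE)]
assert unpacked == bits
```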


section("8 · Performance Benchmark — Tokens per Second")


def benchmark(prompt, n_tokens=128, n_runs=3, **kw):
    timings = []
    for i in range(n_runs):
        print(f"   Run {i+1}/{n_runs} …", end=" ", flush=True)
        _, elapsed = infer(prompt, verbose=False, n_predict=n_tokens, **kw)
        tps = n_tokens / elapsed
        timings.append(tps)
        print(f"{tps:.1f} tok/s")
    avg = sum(timings) / len(timings)
    print(f"\n  ✅ Average: {avg:.1f} tok/s  (over {n_runs} runs, {n_tokens} tokens each)")
    return avg


print("📊 Benchmarking Bonsai-1.7B on your GPU …")
tps = benchmark(
   "Explain the concept of neural network backpropagation step by step.",
   n_tokens=128, n_runs=3,
)


print("\n  Published reference throughputs (from the whitepaper):")
print("  ┌──────────────────────┬─────────┬──────────────┐")
print("  │ Platform             │ Backend │ TG128 tok/s  │")
print("  ├──────────────────────┼─────────┼──────────────┤")
print("  │ RTX 4090             │ CUDA    │     674      │")
print("  │ M4 Pro 48 GB         │ Metal   │     250      │")
print(f"  │ Your GPU (measured)  │ CUDA    │  {tps:>7.1f}    │")
print("  └──────────────────────┴─────────┴──────────────┘")
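Decode throughput for a model this small is usually memory-bandwidth-bound: every generated token streams the full weight set from VRAM. A back-of-the-envelope ceiling (assuming the RTX 4090's ~1008 GB/s public bandwidth spec and the 0.24 GB weight footprint from section 7) shows the published 674 tok/s is plausible but well below the bandwidth limit, which suggests per-token overheads like kernel launches and KV-cache reads dominate at this size:

```python
# Back-of-the-envelope: bandwidth-bound decode ceiling.
# Assumptions: ~1008 GB/s peak bandwidth (RTX 4090 spec), 0.24 GB of
# Q1_0_g128 weights, every token reads all weights exactly once.
BANDWIDTH_GBS = 1008
WEIGHTS_GB    = 0.24

ceiling_tps = BANDWIDTH_GBS / WEIGHTS_GB
print(f"Theoretical ceiling: {ceiling_tps:.0f} tok/s")
print(f"Published TG128:     674 tok/s ({674 / ceiling_tps:.0%} of ceiling)")
```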


section("9 · Multi-Turn Chat with Context Accumulation")


def chat(user_msg, system="You are a helpful assistant.", history=None, **kw):
    if history is None:
        history = []
    history.append(("user", user_msg))
    full = f"system\n{system}\n"
    for role, msg in history:
        full += f"{role}\n{msg}\n"
    full += "assistant\n"
    # Escape quotes and newlines so the prompt survives the shell;
    # -e below makes llama-cli re-expand the \n escapes.
    safe = full.replace('"', '\\"').replace("\n", "\\n")
    cmd = (
        f'{LLAMA_CLI} -m "{MODEL_PATH}"'
        f' -p "{safe}" -e'
        f' -n 200 --temp 0.5 --top-p 0.85 --top-k 20'
        f' -ngl 99 -c 4096 --no-display-prompt'
    )
    result = run(cmd, capture=True, check=False)
    reply = result.stdout.strip()
    history.append(("assistant", reply))
    return reply, history


print("🗣  Starting a 3-turn conversation about 1-bit models …\n")
history = []
turns = [
    "What is a 1-bit language model?",
    "What are the main trade-offs compared to 4-bit or 8-bit quantization?",
    "How does Bonsai specifically address those trade-offs?",
]
for i, msg in enumerate(turns, 1):
    print(f"👤 Turn {i}: {msg}")
    reply, history = chat(msg, history=history)
    print(f"🤖 Bonsai: {reply}\n")
    time.sleep(0.5)
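Because `chat()` re-sends the entire history every turn, a long conversation will eventually overflow the `-c 4096` context window. One common mitigation, sketched here with a crude 4-chars-per-token heuristic (a real version would count tokens with the model's tokenizer; `trim_history` and its parameters are illustrative, not part of the tutorial's code), is to drop the oldest turn pairs while keeping the system prompt:

```python
# Sketch: drop oldest (user, assistant) pairs so the prompt fits a token
# budget. The 4-chars-per-token ratio is a rough heuristic, not exact.
def trim_history(history, system, max_tokens=4096, reserve=256):
    budget = (max_tokens - reserve) * 4          # rough character budget
    trimmed = list(history)
    def cost(h):
        return len(system) + sum(len(m) for _, m in h)
    while trimmed and cost(trimmed) > budget:
        trimmed = trimmed[2:]                    # drop oldest user+assistant pair
    return trimmed

demo_history = [("user", "x" * 9000), ("assistant", "y" * 9000),
                ("user", "short question"), ("assistant", "short answer")]
kept = trim_history(demo_history, system="You are a helpful assistant.")
print(f"Kept {len(kept)} of {len(demo_history)} messages")  # oversized pair dropped
```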


section("10 · Sampling Parameter Exploration")


creative_prompt = "Write a one-sentence description of a futuristic city powered entirely by 1-bit AI."
configs = [
   ("Precise / Focused",  dict(temp=0.1, top_k=10,  top_p=0.70)),
   ("Balanced (default)", dict(temp=0.5, top_k=20,  top_p=0.85)),
   ("Creative / Varied",  dict(temp=0.9, top_k=50,  top_p=0.95)),
   ("High entropy",       dict(temp=1.2, top_k=100, top_p=0.98)),
]


print(f'Prompt: "{creative_prompt}"\n')

for label, params in configs:
    out, _ = infer(creative_prompt, verbose=False, n_predict=80, **params)
    print(f"  [{label}]")
    print(f"    temp={params['temp']}, top_k={params['top_k']}, top_p={params['top_p']}")
    print(f"    → {out[:200]}\n")
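For readers unfamiliar with these flags, they map to standard sampling transforms: temperature rescales the logits, top-k keeps only the k most likely tokens, and top-p keeps the smallest high-probability prefix. A minimal pure-Python sketch of that chain (illustrative only; llama.cpp's real sampler has more stages and different tie-breaking):

```python
import math, random

def sample(logits, temp=0.5, top_k=20, top_p=0.85, rng=random):
    """Temperature -> softmax -> top-k -> top-p (nucleus) -> multinomial draw."""
    scaled = [l / temp for l in logits]          # <1 sharpens, >1 flattens
    m = max(scaled)                              # stabilized softmax
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    probs = probs[:top_k]                        # top-k: keep k most likely
    kept, cum = [], 0.0
    for p, i in probs:                           # top-p: smallest prefix with
        kept.append((p, i))                      # cumulative mass >= top_p
        cum += p
        if cum >= top_p:
            break
    z = sum(p for p, _ in kept)                  # renormalise and draw
    r = rng.random() * z
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]

random.seed(0)
logits = [2.0, 1.0, 0.2, -1.0, -3.0]
print(sample(logits, temp=0.1))  # low temp is effectively greedy: prints 0
```

At `temp=0.1` the top token holds almost all the probability mass, so nucleus filtering keeps only that one candidate and the draw is deterministic; raising `temp` lets the lower-ranked tokens back into the pool.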