
A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence

Tech · By Gavin Wallace · 21/04/2026 · 13 Mins Read

In this tutorial, we build an end-to-end implementation around Qwen 3.6-35B-A3B and explore how a modern multimodal MoE model can be used in practical workflows. We begin by setting up the environment, loading the model adaptively based on available GPU memory, and creating a reusable chat framework that supports both standard responses and explicit thinking traces. From there, we work through important capabilities such as thinking-budget control, streamed generation with separated reasoning and answers, vision input handling, tool calling, structured JSON generation, MoE routing inspection, benchmarking, retrieval-augmented generation, and session persistence. Through this process, we run the model for inference and also learn how to design a robust application layer on top of Qwen 3.6 for real experimentation and advanced prototyping.

import subprocess, sys
def _pip(*a): subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *a])
_pip("--upgrade", "pip")
_pip("--upgrade",
    "transformers>=4.48.0", "accelerate>=1.2.0", "bitsandbytes>=0.44.0",
    "pillow", "requests", "sentencepiece",
    "qwen-vl-utils[decord]", "sentence-transformers", "jsonschema")


import torch, os, json, time, re, gc, io, threading, textwrap, warnings
from collections import Counter
from typing import Any, Optional
warnings.filterwarnings("ignore")


assert torch.cuda.is_available(), "GPU required. Switch runtime to A100 / L4."
p = torch.cuda.get_device_properties(0)
VRAM_GB = p.total_memory / 1e9
print(f"GPU: {p.name} | VRAM: {VRAM_GB:.1f} GB | CUDA {torch.version.cuda} | torch {torch.__version__}")


if VRAM_GB >= 75:   LOAD_MODE = "bf16"
elif VRAM_GB >= 40: LOAD_MODE = "int8"
else:               LOAD_MODE = "int4"


try:
    import flash_attn
    ATTN_IMPL = "flash_attention_2"
except Exception:
    ATTN_IMPL = "sdpa"
print(f"-> mode={LOAD_MODE}  attn={ATTN_IMPL}")


from transformers import (
   AutoModelForImageTextToText, AutoProcessor,
   BitsAndBytesConfig, TextIteratorStreamer,
   StoppingCriteria, StoppingCriteriaList,
)


MODEL_ID = "Qwen/Qwen3.6-35B-A3B"
kwargs = dict(device_map="auto", trust_remote_code=True,
             low_cpu_mem_usage=True, attn_implementation=ATTN_IMPL,
             torch_dtype=torch.bfloat16)
if LOAD_MODE == "int8":
   kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
elif LOAD_MODE == "int4":
   kwargs["quantization_config"] = BitsAndBytesConfig(
       load_in_4bit=True, bnb_4bit_quant_type="nf4",
       bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)


print("Loading processor...")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
print(f"Loading model in {LOAD_MODE} (first run downloads ~70GB) ...")
t0 = time.time()
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, **kwargs); model.eval()
print(f"Loaded in {time.time()-t0:.0f}s  |  VRAM used: {torch.cuda.memory_allocated()/1e9:.1f} GB")


SAMPLING = {
   "thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5),
   "thinking_coding":  dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
   "instruct_general": dict(temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5),
   "instruct_reason":  dict(temperature=1.0, top_p=1.00, top_k=40, presence_penalty=2.0),
}
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def split_thinking(text: str):
    if THINK_OPEN in text and THINK_CLOSE in text:
        a = text.index(THINK_OPEN) + len(THINK_OPEN); b = text.index(THINK_CLOSE)
        return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
    if THINK_CLOSE in text:
        b = text.index(THINK_CLOSE)
        return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
    return "", text.strip()

We set up the full environment required to run Qwen 3.6-35B-A3B in Google Colab and install all supporting libraries for quantization, multimodal processing, retrieval, and schema validation. We then probe the available GPU, dynamically select the loading mode based on VRAM, and configure the attention backend so the model runs as efficiently as possible on the given hardware. After that, we load the processor and model from Hugging Face and define the core sampling presets and the thinking-splitting utility, which lay the foundation for all later interactions.
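As a quick sanity check, the thinking-splitting helper can be exercised without loading the model at all. The snippet below is a standalone copy of the function with made-up example strings:

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_thinking(text: str):
    # Separate the <think>...</think> trace from the final answer.
    if THINK_OPEN in text and THINK_CLOSE in text:
        a = text.index(THINK_OPEN) + len(THINK_OPEN); b = text.index(THINK_CLOSE)
        return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
    if THINK_CLOSE in text:
        # Some chat templates emit only the closing tag in the completion.
        b = text.index(THINK_CLOSE)
        return text[:b].strip(), text[b + len(THINK_CLOSE):].strip()
    return "", text.strip()

raw = "<think>2+2 is trivially 4.</think>The answer is 4."
think, answer = split_thinking(raw)
print(think)   # -> 2+2 is trivially 4.
print(answer)  # -> The answer is 4.
```

Note the second branch: it matters because, depending on the template, the opening tag can be part of the prompt rather than the generated text.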

class QwenChat:
    def __init__(self, model, processor, system=None, tools=None):
        self.model, self.processor = model, processor
        self.tokenizer = processor.tokenizer
        self.history: list[dict] = []
        if system: self.history.append({"role": "system", "content": system})
        self.tools = tools


    def user(self, content):      self.history.append({"role":"user","content":content}); return self
    def assistant(self, content, reasoning=""):
        m = {"role":"assistant","content":content}
        if reasoning: m["reasoning_content"] = reasoning
        self.history.append(m); return self
    def tool_result(self, name, result):
        self.history.append({"role":"tool","name":name,
            "content": result if isinstance(result, str) else json.dumps(result)})
        return self


    def _inputs(self, enable_thinking, preserve_thinking):
        return self.processor.apply_chat_template(
            self.history, tools=self.tools, tokenize=True,
            add_generation_prompt=True, return_dict=True, return_tensors="pt",
            enable_thinking=enable_thinking, preserve_thinking=preserve_thinking,
        ).to(self.model.device)


    def generate(self, *, enable_thinking=True, preserve_thinking=False,
                 max_new_tokens=2048, preset="thinking_general",
                 stopping_criteria=None, append_to_history=True):
        inp = self._inputs(enable_thinking, preserve_thinking)
        cfg = SAMPLING[preset]
        gk = dict(**inp, max_new_tokens=max_new_tokens, do_sample=True,
                  temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
                  repetition_penalty=1.0,
                  pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
        if stopping_criteria is not None: gk["stopping_criteria"] = stopping_criteria
        with torch.inference_mode(): out = self.model.generate(**gk)
        raw = self.tokenizer.decode(out[0, inp["input_ids"].shape[-1]:], skip_special_tokens=True)
        think, ans = split_thinking(raw)
        if append_to_history: self.assistant(ans, reasoning=think)
        return think, ans


    def stream(self, *, enable_thinking=True, preserve_thinking=False,
               max_new_tokens=2048, preset="thinking_general",
               on_thinking=None, on_answer=None):
        inp = self._inputs(enable_thinking, preserve_thinking)
        cfg = SAMPLING[preset]
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
        gk = dict(**inp, streamer=streamer, max_new_tokens=max_new_tokens, do_sample=True,
                  temperature=cfg["temperature"], top_p=cfg["top_p"], top_k=cfg["top_k"],
                  pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id)
        t = threading.Thread(target=self.model.generate, kwargs=gk); t.start()
        buf, in_think = "", enable_thinking
        think_text, answer_text = "", ""
        for piece in streamer:
            buf += piece
            if in_think:
                if THINK_CLOSE in buf:
                    close_at = buf.index(THINK_CLOSE)
                    resid = buf[:close_at]
                    if on_thinking: on_thinking(resid[len(think_text):])
                    think_text = resid
                    buf = buf[close_at + len(THINK_CLOSE):]
                    in_think = False
                    if buf and on_answer: on_answer(buf)
                    answer_text = buf; buf = ""
                else:
                    if on_thinking: on_thinking(piece)
                    think_text += piece
            else:
                if on_answer: on_answer(piece)
                answer_text += piece
        t.join()
        self.assistant(answer_text.strip(), reasoning=think_text.strip())
        return think_text.strip(), answer_text.strip()


    def save(self, path):
        with open(path, "w") as f:
            json.dump({"history": self.history, "tools": self.tools}, f, indent=2)
    @classmethod
    def load(cls, model, processor, path):
        with open(path) as f: data = json.load(f)
        c = cls(model, processor, tools=data.get("tools"))
        c.history = data["history"]; return c


class ThinkingBudget(StoppingCriteria):
    def __init__(self, tokenizer, budget: int):
        self.budget = budget
        self.open_ids  = tokenizer.encode(THINK_OPEN,  add_special_tokens=False)
        self.close_ids = tokenizer.encode(THINK_CLOSE, add_special_tokens=False)
        self.start = None
    def _find(self, seq, needle):
        n = len(needle)
        for i in range(len(seq)-n+1):
            if seq[i:i+n] == needle: return i
        return None
    def __call__(self, input_ids, scores, **kwargs):
        seq = input_ids[0].tolist()
        if self.start is None:
            idx = self._find(seq, self.open_ids)
            if idx is not None: self.start = idx + len(self.open_ids)
            return False
        if self._find(seq[self.start:], self.close_ids) is not None: return False
        return (len(seq) - self.start) >= self.budget


TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)


def run_calculate(expr: str) -> str:
    if any(c not in "0123456789+-*/().% " for c in expr):
        return json.dumps({"error":"illegal chars"})
    try:    return json.dumps({"result": eval(expr, {"__builtins__": {}}, {})})
    except Exception as e: return json.dumps({"error": str(e)})


_DOCS = {
   "qwen3.6":  "Qwen3.6-35B-A3B is a 35B MoE with 3B active params and 262k native context.",
   "deltanet": "Gated DeltaNet is a linear-attention variant used in Qwen3.6's hybrid layers.",
   "moe":      "Qwen3.6 uses 256 experts with 8 routed + 1 shared per token.",
}
def run_search_docs(q):
   hits = [v for k,v in _DOCS.items() if k in q.lower()]
   return json.dumps({"results": hits or ["no hits"]})
def run_get_time():
   import datetime as dt
   return json.dumps({"iso": dt.datetime.utcnow().isoformat()+"Z"})


TOOL_FNS = {
   "calculate":   lambda a: run_calculate(a["expression"]),
   "search_docs": lambda a: run_search_docs(a["query"]),
   "get_time":    lambda a: run_get_time(),
}
TOOLS_SCHEMA = [
   {"type":"function","function":{"name":"calculate","description":"Evaluate arithmetic.",
     "parameters":{"type":"object","properties":{"expression":{"type":"string"}},"required":["expression"]}}},
   {"type":"function","function":{"name":"search_docs","description":"Search internal docs.",
     "parameters":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}},
   {"type":"function","function":{"name":"get_time","description":"Get current UTC time.",
     "parameters":{"type":"object","properties":{}}}},
]

We build the main QwenChat conversation manager, which handles message history, tool messages, chat template formatting, standard generation, streaming generation, and session persistence. We also define the ThinkingBudget stopping criterion to control how much reasoning the model is allowed to produce before continuing or stopping generation. In addition, we create the tool-calling support layer, including arithmetic, lightweight document search, time lookup, and the tool schema that allows the model to interact with external functions in an agent-style loop.
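The tool-call plumbing can be verified without the model by feeding a hand-written `<tool_call>` block through the same regex and dispatcher. This is a standalone sketch that mirrors the definitions above, with `fake_reply` being an invented stand-in for model output:

```python
import json, re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.S)

def run_calculate(expr: str) -> str:
    # Whitelist arithmetic characters before eval'ing in an empty namespace.
    if any(c not in "0123456789+-*/().% " for c in expr):
        return json.dumps({"error": "illegal chars"})
    try:
        return json.dumps({"result": eval(expr, {"__builtins__": {}}, {})})
    except Exception as e:
        return json.dumps({"error": str(e)})

TOOL_FNS = {"calculate": lambda a: run_calculate(a["expression"])}

# A synthetic assistant turn containing one tool call.
fake_reply = ('Sure.\n<tool_call>\n'
              '{"name": "calculate", "arguments": {"expression": "842*0.15"}}\n'
              '</tool_call>')
for payload in TOOL_CALL_RE.findall(fake_reply):
    call = json.loads(payload)
    result = TOOL_FNS[call["name"]](call["arguments"])
    print(call["name"], "->", result)   # calculate -> {"result": 126.3}
```

The non-greedy `{.*?}` still captures the full nested JSON object because the match must be followed by `</tool_call>`, forcing the regex engine to backtrack past the inner braces.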

def run_agent(user_msg, *, max_steps=5, verbose=True):
    chat = QwenChat(model, processor,
        system="You are a helpful assistant. Call tools when helpful, then answer.",
        tools=TOOLS_SCHEMA)
    chat.user(user_msg)
    for step in range(max_steps):
        think, raw = chat.generate(enable_thinking=True, preserve_thinking=True,
                                   preset="thinking_general", max_new_tokens=1024,
                                   append_to_history=False)
        calls = TOOL_CALL_RE.findall(raw)
        if verbose:
            print(f"\n=== step {step+1} ===")
            print("reasoning:", textwrap.shorten(think, 200))
            print("raw     :", textwrap.shorten(raw, 300))
        if not calls:
            chat.assistant(raw, reasoning=think); return chat, raw
        chat.assistant(raw, reasoning=think)
        for payload in calls:
            try: parsed = json.loads(payload)
            except json.JSONDecodeError:
                chat.tool_result("error", {"error":"bad json"}); continue
            fn = TOOL_FNS.get(parsed.get("name"))
            res = fn(parsed.get("arguments", {})) if fn else json.dumps({"error":"unknown"})
            if verbose: print(f" -> {parsed.get('name')}({parsed.get('arguments',{})}) = {res}")
            chat.tool_result(parsed.get("name"), res)
    return chat, "(max_steps reached)"


import jsonschema


MOVIE_SCHEMA = {
   "type":"object",
   "required":["title","year","rating","genres","runtime_minutes"],
   "additionalProperties": False,
   "properties":{
       "title":{"type":"string"},
       "year":{"type":"integer","minimum":1900,"maximum":2030},
       "rating":{"type":"number","minimum":0,"maximum":10},
       "genres":{"type":"array","items":{"type":"string"},"minItems":1},
       "runtime_minutes":{"type":"integer","minimum":1,"maximum":500},
   },
}
def extract_json(text):
    text = re.sub(r"^```(?:json)?", "", text.strip())
    text = re.sub(r"```$", "", text.strip())
    s = text.find("{")
    if s < 0: raise ValueError("no object")
    d, e = 0, -1
    for i in range(s, len(text)):
        if text[i] == "{": d += 1
        elif text[i] == "}":
            d -= 1
            if d == 0: e = i; break
    if e < 0: raise ValueError("unbalanced braces")
    return json.loads(text[s:e+1])


def json_with_retry(prompt, schema, *, max_tries=3):
    sys_m = ("You reply with ONLY a single JSON object matching the user's schema. "
             "No markdown fences. No commentary. No <think> blocks.")
    chat = QwenChat(model, processor, system=sys_m)
    chat.user(f"{prompt}\n\nRespond as JSON matching this schema:\n{json.dumps(schema, indent=2)}")
    last = None
    for i in range(max_tries):
        _, raw = chat.generate(enable_thinking=False, preset="instruct_general",
                               max_new_tokens=512, append_to_history=False)
        try:
            obj = extract_json(raw); jsonschema.validate(obj, schema)
            return obj, i+1
        except Exception as e:
            last = str(e); chat.assistant(raw)
            chat.user(f"That failed validation: {last}. Produce ONLY valid JSON.")
    raise RuntimeError(f"gave up after {max_tries}: {last}")


def benchmark(prompt, *, batch_sizes=(1,2,4), max_new_tokens=64):
    print(f"{'batch':>6} {'tok/s':>10} {'total_s':>10} {'VRAM_GB':>10}")
    print("-"*40)
    for bs in batch_sizes:
        gc.collect(); torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
        msgs = [[{"role":"user","content":prompt}] for _ in range(bs)]
        texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True,
                                                enable_thinking=False) for m in msgs]
        processor.tokenizer.padding_side = "left"
        inp = processor.tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
        torch.cuda.synchronize(); t0 = time.time()
        with torch.inference_mode():
            out = model.generate(**inp, max_new_tokens=max_new_tokens, do_sample=False,
                pad_token_id=processor.tokenizer.pad_token_id or processor.tokenizer.eos_token_id)
        torch.cuda.synchronize(); dt = time.time()-t0
        new_toks = (out.shape[1] - inp["input_ids"].shape[1]) * bs
        vram = torch.cuda.max_memory_allocated()/1e9
        print(f"{bs:>6d} {new_toks/dt:>10.1f} {dt:>10.2f} {vram:>10.1f}")


def build_rag():
   from sentence_transformers import SentenceTransformer
   import numpy as np
   embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
   KB = [
       "Qwen3.6-35B-A3B has 35B total params and 3B activated via MoE.",
       "Context length is 262,144 tokens natively, up to ~1M with YaRN.",
       "The MoE layer uses 256 experts with 8 routed and 1 shared per token.",
       "Thinking mode wraps internal reasoning in <think>...</think> blocks.",
       "preserve_thinking=True keeps prior reasoning across turns for agents.",
       "Gated DeltaNet is a linear-attention variant in the hybrid layers.",
       "The model accepts image, video, and text input natively.",
       "Sampling for coding tasks uses temperature=0.6 rather than 1.0.",
   ]
   KB_EMB = embedder.encode(KB, normalize_embeddings=True)
    def retrieve(q, k=3):
        qv = embedder.encode([q], normalize_embeddings=True)[0]
        import numpy as _np
        return [KB[i] for i in _np.argsort(-(KB_EMB @ qv))[:k]]
   return retrieve


def rag_answer(query, retrieve, k=3):
    ctx = retrieve(query, k)
    sys_m = "Answer using ONLY the provided context. If insufficient, say so."
    user = "Context:\n" + "\n".join(f"- {c}" for c in ctx) + f"\n\nQuestion: {query}"
    chat = QwenChat(model, processor, system=sys_m); chat.user(user)
    _, ans = chat.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
    return ans, ctx

We define higher-level utility functions that turn the model into a more complete application framework for agentic, structured workflows. We implement the agent loop for iterative tool use, add JSON extraction and validation with retry logic, create a benchmarking function to measure generation throughput, and build a lightweight semantic retrieval pipeline for mini-RAG. Together, these functions help us move from basic prompting to more robust workflows in which the model can reason, validate outputs, retrieve supporting context, and be systematically tested.
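The brace-balancing extractor behind the retry logic is worth testing on its own, since models frequently wrap JSON in fences or chatter. Below is a standalone copy of the function exercised on a fabricated messy reply:

```python
import json, re

def extract_json(text):
    # Strip leading/trailing markdown fences, then scan for the
    # first balanced {...} object by tracking brace depth.
    text = re.sub(r"^```(?:json)?", "", text.strip())
    text = re.sub(r"```$", "", text.strip())
    s = text.find("{")
    if s < 0: raise ValueError("no object")
    depth, end = 0, -1
    for i in range(s, len(text)):
        if text[i] == "{": depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0: end = i; break
    if end < 0: raise ValueError("unbalanced braces")
    return json.loads(text[s:end + 1])

messy = '```json\n{"title": "Inception", "meta": {"year": 2010}}\n```'
print(extract_json(messy))   # -> {'title': 'Inception', 'meta': {'year': 2010}}
```

Depth tracking is what lets this survive nested objects, which a naive regex from `{` to the first `}` would truncate.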

print("\n" + "="*20, "§4 thinking-budget", "="*20)
c = QwenChat(model, processor)
c.user("A frog is at the bottom of a 30m well. It climbs 3m/day, slips 2m/night. "
       "How many days until it escapes? Explain.")
budget = ThinkingBudget(processor.tokenizer, budget=150)
think, ans = c.generate(enable_thinking=True, max_new_tokens=1200,
                        stopping_criteria=StoppingCriteriaList([budget]))
print(f"Thinking ~{len(processor.tokenizer.encode(think))} tok | Answer:\n{ans or '(truncated)'}")


print("\n" + "="*20, "§5 streaming split", "="*20)
c = QwenChat(model, processor)
c.user("Explain why transformers scale better than RNNs, in two short paragraphs.")
print("[THINKING >>] ", end="", flush=True)
first = [True]
def _ot(x): print(x, end="", flush=True)
def _oa(x):
    if first[0]: print("\n\n[ANSWER >>] ", end="", flush=True); first[0] = False
    print(x, end="", flush=True)
c.stream(enable_thinking=True, preset="thinking_general", max_new_tokens=700,
         on_thinking=_ot, on_answer=_oa); print()


print("\n" + "="*20, "§6 vision", "="*20)
IMG = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
c = QwenChat(model, processor)
c.history.append({"role":"user","content":[
    {"type":"image","image":IMG},
    {"type":"text","text":"Describe this figure in one sentence, then state what it's asking."}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=300)
print("Describe:", ans)


GRD = "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/demo/RealWorld/RealWorld-04.png"
c = QwenChat(model, processor)
c.history.append({"role":"user","content":[
    {"type":"image","image":GRD},
    {"type":"text","text": "Locate every distinct object. Reply ONLY with JSON "
     '[{"label":...,"bbox_2d":[x1,y1,x2,y2]}, ...] in pixel coords.'}]})
_, ans = c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=800)
print("Grounding:", ans[:600])


print("\n" + "="*20, "§7 YaRN override", "="*20)
YARN = {"text_config": {"rope_parameters": {
   "mrope_interleaved": True, "mrope_section": [11,11,10],
   "rope_type": "yarn", "rope_theta": 10_000_000,
   "partial_rotary_factor": 0.25, "factor": 4.0,
   "original_max_position_embeddings": 262_144}}}
print(json.dumps(YARN, indent=2))

We begin running the advanced demonstrations by testing thinking-budget control, split streaming, multimodal vision prompting, and a YaRN configuration example for extended context handling. We first observe how the model reasons under a limited thinking budget, then stream its thinking and answer separately so that we can inspect both parts of the response flow. We also send image-based prompts for description and grounding tasks, and finally print a YaRN rope-configuration override that shows how long-context settings can be prepared for model reloading.
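Since the YaRN dict only gets printed in this section, actually applying it would mean overlaying its nested keys onto the model's existing config before a reload. A generic recursive merge like the one below handles the nesting; the `base_cfg` values here are invented for illustration and are not the real Qwen config:

```python
def deep_merge(base: dict, override: dict) -> dict:
    # Recursively overlay `override` onto a copy of `base`,
    # descending into sub-dicts instead of replacing them wholesale.
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)
        else:
            out[key] = val
    return out

base_cfg = {"text_config": {"rope_parameters": {"rope_type": "default"},
                            "hidden_size": 4096}}   # hypothetical values
yarn = {"text_config": {"rope_parameters": {"rope_type": "yarn", "factor": 4.0}}}
merged = deep_merge(base_cfg, yarn)
print(merged["text_config"]["rope_parameters"])  # rope_type is now "yarn", factor 4.0
print(merged["text_config"]["hidden_size"])      # untouched sibling key: 4096
```

A plain `dict.update` would clobber the whole `rope_parameters` sub-dict; the recursive version preserves sibling keys the override does not mention.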

print("\n" + "="*20, "§8 agent loop", "="*20)
chat, final = run_agent(
    "What's 15% of 842 to 2 decimals? Also briefly explain gated DeltaNet per the docs.",
    max_steps=4)
print("\nFINAL:", final)


print("\n" + "="*20, "§9 structured JSON", "="*20)
obj, tries = json_with_retry("Summarize the movie Inception as structured metadata.",
                            MOVIE_SCHEMA)
print(f"({tries} tries)", json.dumps(obj, indent=2))


print("\n" + "="*20, "§10 MoE routing", "="*20)
routers = []
for name, m in model.named_modules():
    low = name.lower()
    if (("gate" in low and ("moe" in low or "expert" in low)) or
        low.endswith(".router") or low.endswith(".gate")) and hasattr(m, "weight"):
        routers.append((name, m))
print(f"found {len(routers)} router-like modules")


TOP_K = 8
counts = [Counter() for _ in routers]
handles = []
def _mkhook(i):
    def h(_m, _i, out):
        lg = out[0] if isinstance(out, tuple) else out
        if lg.dim() != 2: return
        try:
            for eid in lg.topk(TOP_K, dim=-1).indices.flatten().tolist():
                counts[i][eid] += 1
        except Exception: pass
    return h
for i,(_,m) in enumerate(routers): handles.append(m.register_forward_hook(_mkhook(i)))
try:
    c = QwenChat(model, processor); c.user("Write one short sentence about sunset.")
    c.generate(enable_thinking=False, preset="instruct_general", max_new_tokens=40)
finally:
    for h in handles: h.remove()
total = Counter()
for c_ in counts: total.update(c_)
print(f"distinct experts activated: {len(total)}")
for eid, n in total.most_common(10): print(f"  expert #{eid:>3}  {n} fires")


print("\n" + "="*20, "§11 benchmark", "="*20)
benchmark("In one sentence, what is entropy?", batch_sizes=(1,2,4), max_new_tokens=48)


print("\n" + "="*20, "§12 mini-RAG", "="*20)
retrieve = build_rag()
ans, ctx = rag_answer("How many experts are active per token, and why does that matter?", retrieve)
print("retrieved:"); [print(" -", c) for c in ctx]
print("answer:", ans)


print("\n" + "="*20, "§13 save/resume", "="*20)
c = QwenChat(model, processor); c.user("Give me a unique 5-letter codeword. Just the word.")
_, a1 = c.generate(enable_thinking=True, max_new_tokens=256); print("T1:", a1)
c.save("/content/session.json")
del c; gc.collect()
r = QwenChat.load(model, processor, "/content/session.json")
r.user("Reverse the letters of that codeword.")
_, a2 = r.generate(enable_thinking=True, preserve_thinking=True, max_new_tokens=256)
print("T2:", a2)


print("\n✓ tutorial complete")

We continue with the remaining demonstrations that showcase tool-augmented reasoning, schema-constrained JSON generation, MoE routing introspection, throughput benchmarking, retrieval-augmented answering, and save-resume session handling. We let the model solve a tool-using task, generate structured movie metadata with validation, inspect which expert-like router modules activate during inference, and measure tokens-per-second across different batch sizes. Finally, we test mini-RAG for context-grounded answering and verify conversational persistence by saving a session, reloading it, and continuing the interaction from the saved history.
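The save/resume format is plain JSON, so the round-trip can be verified without the model by writing a toy history in the same shape `QwenChat.save()` produces (the history contents here are invented):

```python
import json, os, tempfile

history = [{"role": "user", "content": "Give me a codeword."},
           {"role": "assistant", "content": "ZEBRA"}]

# Save in the same {"history": ..., "tools": ...} shape QwenChat.save() writes.
path = os.path.join(tempfile.mkdtemp(), "session.json")
with open(path, "w") as f:
    json.dump({"history": history, "tools": None}, f, indent=2)

# Resume: reload the file and continue appending turns to the history.
with open(path) as f:
    data = json.load(f)
restored = data["history"]
restored.append({"role": "user", "content": "Reverse the letters of that codeword."})
print(len(restored), restored[-2]["content"])   # 3 ZEBRA
```

Because only the message history is persisted, the follow-up turn can reference the earlier codeword, which is exactly what the §13 demo relies on.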

In conclusion, we created a practical and detailed workflow for using Qwen 3.6-35B-A3B beyond simple text generation. We showed how to combine adaptive loading, multimodal prompting, controlled reasoning, tool-augmented interaction, schema-constrained outputs, lightweight RAG, and session save-resume patterns into one integrated system. We also inspected expert routing behavior and measured throughput to understand the model's usability and performance. Finally, we turned Qwen 3.6 into a working experimental playground where we can study its capabilities, test advanced interaction patterns, and build a strong foundation for more serious research or product-oriented applications.


Check out the full code with the notebook here.


The publish A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence appeared first on MarkTechPost.
