In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agentic models think, use tools, and produce responses across multi-turn conversations. We begin by loading the dataset and inspecting its structure, categories, and conversational format. We then build simple parsers that extract the key components of each trajectory, such as reasoning traces, tool calls, and tool responses, so we can separate internal thinking from external actions. Next, we analyze patterns such as tool usage, conversation length, and error rates to characterize agent behavior, and we create visualizations that highlight the trends. Finally, we convert the data into a model-friendly format, making it ready for tasks such as supervised fine-tuning.
!pip install -q -U pandas matplotlib seaborn transformers accelerate trl

import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets
random.seed(0)
CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))
COMPARE_BOTH = False
if COMPARE_BOTH:
    ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
    ds_glm = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
    ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
    ds_glm = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
    ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
    print("Combined:", ds, "→ counts:", Counter(ds["source"]))
sample = ds[0]
print("\n=== Sample 0 ===")
print("id        :", sample["id"])
print("category  :", sample["category"], "/", sample["subcategory"])
print("task      :", sample["task"])
print("turns     :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")
We set up our environment by installing the required libraries and importing the necessary modules. We then load the lambda/hermes-agent-reasoning-traces dataset and inspect its structure, fields, and categories. We can optionally combine the two dataset configurations, and we look at a sample to better understand the conversational format.
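As a quick optional check (a small sketch using only the fields shown above), we can summarize the role sequence of a single trajectory to see how system, human, gpt, and tool turns alternate:

# Optional sketch: summarize the turn roles of one trajectory
roles = [t["from"] for t in sample["conversations"]]
print("Role sequence:", " → ".join(roles))
print("Role counts  :", Counter(roles))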
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)
def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "", "arguments": {}})
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}
def parse_tool(value: str) -> dict:
    raw = TOOL_RESP_RE.search(value)
    if not raw: return {"raw": value}
    body = raw.group(1)
    try: return json.loads(body)
    except json.JSONDecodeError: return {"raw": body}
first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls      :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])
We define regex-based parsers to extract reasoning traces and tool calls from the dataset. They process each assistant message and separate thoughts, tool calls, and final outputs in a structured way. We then test the parsers on a sample conversation to confirm that they work.
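Before applying the parsers at scale, it is worth sanity-checking them on a tiny synthetic message. The string below is a hypothetical example written in the Hermes tag format, not a record from the dataset:

# Sanity check on a synthetic (hypothetical) assistant message
demo = (
    "<think>I should look up the weather first.</think>\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
out = parse_assistant(demo)
assert out["thoughts"] == ["I should look up the weather first."]
assert out["tool_calls"] == [{"name": "get_weather", "arguments": {"city": "Paris"}}]
assert out["final"] == ""  # nothing left after stripping think + tool_call spans
print("Parser sanity check passed.")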
N = 3000
sub = ds.select(range(min(N, len(ds))))

tool_calls = Counter()
parallel_widths = Counter()
thoughts_per_turn = []
calls_per_traj = []
errors_per_traj = []
turns_per_traj = []
cat_counts = Counter()
for ex in sub:
    cat_counts[ex["category"]] += 1
    n_calls = n_err = 0
    turns_per_traj.append(len(ex["conversations"]))
    for t in ex["conversations"]:
        if t["from"] == "gpt":
            p = parse_assistant(t["value"])
            thoughts_per_turn.append(len(p["thoughts"]))
            if p["tool_calls"]:
                parallel_widths[len(p["tool_calls"])] += 1
                for c in p["tool_calls"]:
                    tool_calls[c.get("name", "")] += 1
                n_calls += len(p["tool_calls"])
        elif t["from"] == "tool":
            r = parse_tool(t["value"])
            blob = json.dumps(r).lower()
            if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
                n_err += 1
    calls_per_traj.append(n_calls)
    errors_per_traj.append(n_err)
print(f"nScanned {len(sub)} trajectories")
print(f"Avg turns/traj : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools :", tool_calls.most_common(10))
fig, axes = plt.subplots(2, 2, figsize=(13, 9))
top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")
ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool-calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in one turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")
axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")
cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")
plt.tight_layout(); plt.show()
We run dataset-wide analytics to measure tool usage, parallel tool calling, error rates, and conversation length. By aggregating statistics across thousands of samples, we get a clearer picture of agent behavior. We then create visualizations that show trends such as the most-used tools, parallel call widths, and the category distribution.
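The same aggregates can be sliced further. As a sketch, here is one way to compute a per-category error rate, relying on the fact that the sub and errors_per_traj built above are aligned by construction:

# Sketch: per-category error rate from the aggregates collected above
cat_err = defaultdict(list)
for ex, e in zip(sub, errors_per_traj):
    cat_err[ex["category"]].append(e > 0)
for cat, flags in sorted(cat_err.items(), key=lambda kv: -len(kv[1])):
    print(f"{cat:<28} error rate: {100*np.mean(flags):5.1f}%  (n={len(flags)})")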
def render_trace(ex, max_chars=350):
    print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
    for t in ex["conversations"]:
        role = t["from"]
        if role == "system":
            continue
        if role == "human":
            print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
        elif role == "gpt":
            p = parse_assistant(t["value"])
            for th in p["thoughts"]:
                print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
            for c in p["tool_calls"]:
                args = json.dumps(c.get("arguments", {}))[:200]
                print(f"[CALL] {c.get('name')}({args})")
            if p["final"]:
                print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
        elif role == "tool":
            print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
    print("="*72)
idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])
def get_tool_schemas(ex):
    try: return json.loads(ex["tools"])
    except Exception: return []

schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
    fn = s.get("function", {})
    print("  -", fn.get("name"), "—", (fn.get("description") or "")[:80])
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}
def to_openai_messages(conv):
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]

example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
    print("  ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")
For deeper analysis, we build a renderer that pretty-prints full conversation traces. We also extract the tool schemas attached to each sample and convert conversations into the OpenAI messages format, which standardizes them for downstream pipelines and helps us understand how the tools are structured.
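One optional defensive check (a sketch) before converting the whole dataset: confirm that every 'from' value in the scanned subset is covered by ROLE_MAP, so the conversion never raises a KeyError:

# Sketch: verify ROLE_MAP covers every role present in the scanned subset
seen_roles = Counter(t["from"] for ex in sub for t in ex["conversations"])
unmapped = [r for r in seen_roles if r not in ROLE_MAP]
print("Roles seen:", dict(seen_roles), "| unmapped:", unmapped or "none")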
from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)
def build_masked(conv, tokenizer, max_len=2048):
    msgs = to_openai_messages(conv)
    for m in msgs:
        if m["role"] == "tool":
            m["role"] = "user"
            m["content"] = "[TOOL OUTPUT]\n" + m["content"]
    input_ids, labels = [], []
    for m in msgs:
        text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
    return input_ids[:max_len], labels[:max_len]

ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")
think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
    for t in ex["conversations"]:
        if t["from"] != "gpt": continue
        p = parse_assistant(t["value"])
        for th in p["thoughts"]: think_lens.append(len(th))
        for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
        if p["final"]: ans_lens.append(len(p["final"]))
plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
label=["", "", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()
class TraceReplayer:
    def __init__(self, ex):
        self.ex = ex
        self.steps = []
        pending = None
        for t in ex["conversations"]:
            if t["from"] == "gpt":
                if pending: self.steps.append(pending)
                pending = {"think": parse_assistant(t["value"]), "responses": []}
            elif t["from"] == "tool" and pending:
                pending["responses"].append(parse_tool(t["value"]))
        if pending: self.steps.append(pending)

    def __len__(self): return len(self.steps)

    def play(self, i):
        s = self.steps[i]
        print(f"\n── Step {i+1}/{len(self)} ──")
        for th in s["think"]["thoughts"]:
            print(f"💭 {textwrap.shorten(th, 280)}")
        for c in s["think"]["tool_calls"]:
            print(f"⚙️ {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
        for r in s["responses"]:
            print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
        if s["think"]["final"]:
            print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")
rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
rp.play(i)
TRAIN = False
if TRAIN:
    import torch
    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer, SFTConfig

    train_subset = ds.select(range(200))

    def to_text(batch):
        msgs = to_openai_messages(batch["conversations"])
        for m in msgs:
            if m["role"] == "tool":
                m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
        batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        return batch

    train_subset = train_subset.map(to_text)
    model = AutoModelForCausalLM.from_pretrained(
        TOK_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
    )
    cfg = SFTConfig(
        output_dir="hermes-sft-demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=20,
        learning_rate=2e-5,
        logging_steps=2,
        max_seq_length=1024,
        dataset_text_field="text",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
    print("Fine-tune demo finished.")
print("n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
"tokenized + label-masked SFT examples, and an optional training hook.")
We tokenize the conversations with label masking so that only the assistant's responses contribute to the training loss. To gain further insight, we analyze the length distributions of reasoning traces, tool calls, and final answers. We also implement a trace replayer that lets us step through an agent's behavior turn by turn, and we add an optional fine-tuning hook.
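A quick way to verify the masking (a small sketch using the ids and lbls produced above): decode only the positions whose label is not -100 and confirm that the recovered text comes from assistant turns:

# Sketch: decode only the trainable (assistant) tokens from the masked example
assistant_ids = [i for i, l in zip(ids, lbls) if l != -100]
print("Trainable-token preview:", tok.decode(assistant_ids)[:200], "...")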
We have built a complete workflow for loading, analyzing, and working with agent reasoning traces. We broke conversations down into their meaningful parts, thoughts, tool calls, tool responses, and final answers, to examine how agents solve problems step by step. Analytics and visualizations gave us insight into common behavioral patterns across the dataset. We also converted the data into a format suitable for language-model training, including tokenization and label masking so that only assistant responses are learned. This workflow provides a practical foundation for studying and improving tool-using AI agents.