In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agentic models think, use tools, and produce responses across multi-turn conversations. We begin by loading the dataset and inspecting its structure, categories, and conversational format. We then build simple parsers that extract the key components of each trajectory, such as reasoning traces, tool calls, and tool responses, so we can separate internal thinking from external actions. Next, we analyze patterns such as tool usage, conversation length, and error rates to characterize agent behavior, and we create visualizations that highlight the trends. Finally, we convert the data into a model-friendly format, making it ready for tasks such as supervised fine-tuning.
!pip install -q -U pandas matplotlib seaborn transformers accelerate trl

import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets
random.seed(0)
CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))
COMPARE_BOTH = False
if COMPARE_BOTH:
    ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
    ds_glm = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
    ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
    ds_glm = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
    ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
    print("Combined:", ds, "→ counts:", Counter(ds["source"]))
sample = ds[0]
print("\n=== Sample 0 ===")
print("id        :", sample["id"])
print("category  :", sample["category"], "/", sample["subcategory"])
print("task      :", sample["task"])
print("turns     :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")
We set up our environment by installing the required libraries and importing the necessary modules. We then load the lambda/hermes-agent-reasoning-traces dataset and inspect its structure, fields, and categories. We can optionally combine the two dataset configurations, and we look at a sample to better understand the conversational format.
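As a quick optional check (a small sketch using only the fields shown above), we can summarize the role sequence of a single trajectory to see how system, human, gpt, and tool turns alternate:

# Optional sketch: summarize the turn roles of one trajectory
roles = [t["from"] for t in sample["conversations"]]
print("Role sequence:", " → ".join(roles))
print("Role counts  :", Counter(roles))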
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)
def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "", "arguments": {}})
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}
def parse_tool(value: str) -> dict:
    raw = TOOL_RESP_RE.search(value)
    if not raw: return {"raw": value}
    body = raw.group(1)
    try: return json.loads(body)
    except json.JSONDecodeError: return {"raw": body}
first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls      :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])
We define regex-based parsers to extract reasoning traces and tool calls from the dataset. They process each assistant message and separate thoughts, tool calls, and final outputs in a structured way. We then test the parsers on a sample conversation to confirm that they work.
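Before applying the parsers at scale, it is worth sanity-checking them on a tiny synthetic message. The string below is a hypothetical example written in the Hermes tag format, not a record from the dataset:

# Sanity check on a synthetic (hypothetical) assistant message
demo = (
    "<think>I should look up the weather first.</think>\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
out = parse_assistant(demo)
assert out["thoughts"] == ["I should look up the weather first."]
assert out["tool_calls"] == [{"name": "get_weather", "arguments": {"city": "Paris"}}]
assert out["final"] == ""  # nothing left after stripping think + tool_call spans
print("Parser sanity check passed.")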
N = 3000
sub = ds.select(range(min(N, len(ds))))

tool_calls = Counter()
parallel_widths = Counter()
thoughts_per_turn = []
calls_per_traj = []
errors_per_traj = []
turns_per_traj = []
cat_counts = Counter()
for ex in sub:
    cat_counts[ex["category"]] += 1
    n_calls = n_err = 0
    turns_per_traj.append(len(ex["conversations"]))
    for t in ex["conversations"]:
        if t["from"] == "gpt":
            p = parse_assistant(t["value"])
            thoughts_per_turn.append(len(p["thoughts"]))
            if p["tool_calls"]:
                parallel_widths[len(p["tool_calls"])] += 1
                for c in p["tool_calls"]:
                    tool_calls[c.get("name", "")] += 1
                n_calls += len(p["tool_calls"])
        elif t["from"] == "tool":
            r = parse_tool(t["value"])
            blob = json.dumps(r).lower()
            if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
                n_err += 1
    calls_per_traj.append(n_calls)
    errors_per_traj.append(n_err)
print(f"nScanned {len(sub)} trajectories")
print(f"Avg turns/traj : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools :", tool_calls.most_common(10))
fig, axes = plt.subplots(2, 2, figsize=(13, 9))
top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")
ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool-calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in one turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")
axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")
cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")
plt.tight_layout(); plt.show()
We run dataset-wide analytics to measure tool usage, parallel tool calling, error rates, and conversation length. By aggregating statistics across thousands of samples, we get a clearer picture of agent behavior. We then create visualizations that show trends such as the most-used tools, parallel call widths, and the category distribution.
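The same aggregates can be sliced further. As a sketch, here is one way to compute a per-category error rate, relying on the fact that the sub and errors_per_traj built above are aligned by construction:

# Sketch: per-category error rate from the aggregates collected above
cat_err = defaultdict(list)
for ex, e in zip(sub, errors_per_traj):
    cat_err[ex["category"]].append(e > 0)
for cat, flags in sorted(cat_err.items(), key=lambda kv: -len(kv[1])):
    print(f"{cat:<28} error rate: {100*np.mean(flags):5.1f}%  (n={len(flags)})")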
def render_trace(ex, max_chars=350):
    print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
    for t in ex["conversations"]:
        role = t["from"]
        if role == "system":
            continue
        if role == "human":
            print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
        elif role == "gpt":
            p = parse_assistant(t["value"])
            for th in p["thoughts"]:
                print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
            for c in p["tool_calls"]:
                args = json.dumps(c.get("arguments", {}))[:200]
                print(f"[CALL] {c.get('name')}({args})")
            if p["final"]:
                print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
        elif role == "tool":
            print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
    print("="*72)
idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])
def get_tool_schemas(ex):
    try: return json.loads(ex["tools"])
    except Exception: return []

schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
    fn = s.get("function", {})
    print("  -", fn.get("name"), "—", (fn.get("description") or "")[:80])
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}
def to_openai_messages(conv):
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]

example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
    print("  ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")
For deeper analysis, we build a renderer that pretty-prints full conversation traces. We also extract the tool schemas attached to each sample and convert conversations into the OpenAI messages format, which standardizes them for downstream pipelines and helps us understand how the tools are structured.
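One optional defensive check (a sketch) before converting the whole dataset: confirm that every 'from' value in the scanned subset is covered by ROLE_MAP, so the conversion never raises a KeyError:

# Sketch: verify ROLE_MAP covers every role present in the scanned subset
seen_roles = Counter(t["from"] for ex in sub for t in ex["conversations"])
unmapped = [r for r in seen_roles if r not in ROLE_MAP]
print("Roles seen:", dict(seen_roles), "| unmapped:", unmapped or "none")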
from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)
def build_masked(conv, tokenizer, max_len=2048):
    msgs = to_openai_messages(conv)
    for m in msgs:
        if m["role"] == "tool":
            m["role"] = "user"
            m["content"] = "[TOOL OUTPUT]\n" + m["content"]
    input_ids, labels = [], []
    for m in msgs:
        text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
    return input_ids[:max_len], labels[:max_len]

ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")
think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
    for t in ex["conversations"]:
        if t["from"] != "gpt": continue
        p = parse_assistant(t["value"])
        for th in p["thoughts"]: think_lens.append(len(th))
        for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
        if p["final"]: ans_lens.append(len(p["final"]))
plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
label=["", "", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()
class TraceReplayer:
    def __init__(self, ex):
        self.ex = ex
        self.steps = []
        pending = None
        for t in ex["conversations"]:
            if t["from"] == "gpt":
                if pending: self.steps.append(pending)
                pending = {"think": parse_assistant(t["value"]), "responses": []}
            elif t["from"] == "tool" and pending:
                pending["responses"].append(parse_tool(t["value"]))
        if pending: self.steps.append(pending)

    def __len__(self): return len(self.steps)

    def play(self, i):
        s = self.steps[i]
        print(f"\n── Step {i+1}/{len(self)} ──")
        for th in s["think"]["thoughts"]:
            print(f"💭 {textwrap.shorten(th, 280)}")
        for c in s["think"]["tool_calls"]:
            print(f"⚙️ {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
        for r in s["responses"]:
            print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
        if s["think"]["final"]:
            print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")
rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
rp.play(i)
TRAIN = False
if TRAIN:
    import torch
    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer, SFTConfig

    train_subset = ds.select(range(200))

    def to_text(batch):
        msgs = to_openai_messages(batch["conversations"])
        for m in msgs:
            if m["role"] == "tool":
                m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
        batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        return batch

    train_subset = train_subset.map(to_text)
    model = AutoModelForCausalLM.from_pretrained(
        TOK_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
    )
    cfg = SFTConfig(
        output_dir="hermes-sft-demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=20,
        learning_rate=2e-5,
        logging_steps=2,
        max_seq_length=1024,
        dataset_text_field="text",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
    print("Fine-tune demo finished.")
print("n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
"tokenized + label-masked SFT examples, and an optional training hook.")
We tokenize the conversations with label masking so that only the assistant's responses contribute to the training loss. To gain further insight, we analyze the length distributions of reasoning traces, tool calls, and final answers. We also implement a trace replayer that lets us step through an agent's behavior turn by turn, and we add an optional fine-tuning hook.
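A quick way to verify the masking (a small sketch using the ids and lbls produced above): decode only the positions whose label is not -100 and confirm that the recovered text comes from assistant turns:

# Sketch: decode only the trainable (assistant) tokens from the masked example
assistant_ids = [i for i, l in zip(ids, lbls) if l != -100]
print("Trainable-token preview:", tok.decode(assistant_ids)[:200], "...")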
We have built a complete workflow for loading, analyzing, and working with agent reasoning traces. We broke conversations down into their meaningful parts, thoughts, tool calls, tool responses, and final answers, to examine how agents solve problems step by step. Analytics and visualizations gave us insight into common behavioral patterns across the dataset. We also converted the data into a format suitable for language-model training, including tokenization and label masking so that only assistant responses are learned. This workflow provides a practical foundation for studying and improving tool-using AI agents.