The Release of Rogue by Qualifire: A Framework for Agentic Testing, Evaluating AI Agents' Performance

The agentic system is stochastic and context-dependent. Conventional QA—unit tests, static prompts, or scalar “LLM-as-a-judge” scores—fails to expose multi-turn vulnerabilities and provides weak audit trails. The developer teams must have protocol-accurate discussions, policy checks that are explicit, and evidence which can be read by machines.

Qualifire AI is open-sourced RogueA Python Framework that Evaluates AI Agents over Agent-to Agent (A2A). protocol. Rogue turns business policies into scenarios that can be executed, it drives interactions in multiple directions against the target agent. Rogue also produces reports for CI/CD, compliance, and other reviews.

Quick Start

Prerequisites

uvx – If not installed, follow uv installation guide
Python 3.10+
An API Key for an LLM Provider (e.g. OpenAI, Google or Anthropic).

Installation

Choose Option 1 for Quick Installation

Install quickly using our scripted installation:

TUI
uvx rogue-ai
# Web UI
uvx rogue-ai ui
# CLI/CI/CD
uvx rogue-ai cli

Option 2: Manual Installation

The repository can be cloned:

git clone https://github.com/qualifire-dev/rogue.git
cd rogue

(b) Install dependencies:

Use uv:

You can use the pip command to find out if your system is using this.

Setup your environment variables – Optionally: Create a file called.env and place your API key in it. Rogue makes use of LiteLLM. You can therefore set different keys to suit various providers.

OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="sk-..."
GOOGLE_API_KEY="..."

Running Rogue

Rogue works on behalf of a customer.server Architecture where core evaluation logic is run on a server in the backend, with various clients connecting to it via different interfaces.

The default behavior

If you do not specify a mode when running uvx-rogue AI, the following happens:

Rogue starts in the background
Launches TUI client (Terminal User Interface).

There are several modes of operation.

Standard (Server plus TUI): uvx rogue-ai – Starts server in background + TUI client
Can you please explain?: uvx rogue-ai server – Runs only the backend server
TUI: uvx rogue-ai tui – Runs only the TUI client (requires server running)
Web Interface: uvx rogue-ai ui – Runs only the Gradio web interface client (requires server running)
CLI: uvx rogue-ai cli – Runs non-interactive command-line evaluation (requires server running, ideal for CI/CD)

Modal Arguments

Server Mode

uvx rogue-ai server [OPTIONS]

Options:

–host HOST – Host to run the server on (default: 127.0.0.1 or HOST env var)
–port PORT – Port to run the server on (default: 8000 or PORT env var)
–debug – Enable debug logging

TUI Style

uvx rogue-ai tui [OPTIONS]
Web UI mode
uvx rogue-ai ui [OPTIONS]

Options:

–rogue-server-url URL – Rogue server URL (default: http://localhost:8000)
–port PORT – Port to run the UI on
–workdir WORKDIR – Working directory (default: ./.rogue)
–debug – Enable debug logging

Examples: Test of the T-Shirt Agent

The repository contains a simple agent example that sells t-shirts. Use it to view Rogue’s actions.

Install the following example dependency:

You are using the uv

If you’re using Pip

Pip Install -e[examples]

Launch the agent in an additional terminal.

You are using the uv

uv run examples/tshirt_store_agent

Then:

python examples/tshirt_store_agent

This will start the agent on http://localhost:10001.

b) Configuration Rogue In the user interface, you can point out the agent as an example:

Agent URL: http://localhost:10001
Authentication: no-auth

Run the evaluation, and then watch Rogue Check the policy of your T-Shirt Agent!

Use either TUI or Web UI.

How Rogue fits into Practical Use Cases

Safety & Compliance HardeningTranscripts can provide evidence to support policies on PII/PHI, refusing behavior, preventing leaks of secrets, and the regulated domain.
E-Commerce & Support AgentsUnder adverse and failure circumstances, ensure that discounts are only available with an OTP, rules for refunds, escalation based on SLAs, and tools (order check, tickets) have the correct use.
Agents for Developer/DevOpsTest code-mod copilots and their CLIs to determine if they are confined in a workspace, if rollback is semantically correct, based on rate limit/backoff behaviors, or preventing unsafe commands.
Multi-Agent Systems: Verify planner↔executor contracts, capability negotiation, and schema conformance over A2A; evaluate interoperability across heterogeneous frameworks.
Regression & Drift MonitoringNightly Suites to detect new models or changes in the model; detect behavior drift and enforce critical pass criteria for policy before release.

What Exactly Is Rogue—and Why Should Agent Dev Teams Care?

Rogue This is a comprehensive testing framework that evaluates the reliability, performance and compliance of AI agents. Rogue Synthesizes context, risk and business into tests that have clear goals, tactics and criteria for success. EvaluatorAgent can run protocol-correct conversations either in a fast one turn mode or a deep adversarial mode. You can bring your own model or have it made. Rogue Tests can be driven by the SLM judge of Qualifire. Streaming observability and deterministic artifacts: live transcripts,pass/fail verdicts, rationales tied to transcript spans, timing and model/version lineage.

Rogue: The Inside Story

Rogue uses a client/server architecture.

Rogue Server: Includes the core evaluation logic
Client InterfacesMultiple interfaces connecting to the server
- TUI Terminal UI: A modern interface for terminals built with Bubble Tea and Go
- Web UIGradio-based Web interface
- CLI: Command line interface for automated evaluation, CI/CD

The architecture is flexible and allows multiple users to connect simultaneously and run the server independently.

You can read more about it here:

Rogue Helps developer teams to test agent behavior in the real-world environment. Written policies are turned into scenarios that can be tested over A2A. The transcripts of the tests show what actually happened. It produces a signal which you can repeat in CI/CD and use to identify policy breaches or regressions.

Thanks to the Qualifire team for the thought leadership/ Resources for this article. This article/content has been supported by the Qualifire Team.

Asif Razzaq serves as the CEO at Marktechpost Media Inc. As an entrepreneur, Asif has a passion for harnessing Artificial Intelligence’s potential to benefit society. Marktechpost was his most recent venture. This platform, dedicated to Artificial Intelligence, is known for the in-depth reporting of news on machine learning and deep understanding that is both technical and understandable. Over 2 million views per month are a testament to the platform’s popularity.

🙌 Follow MARKTECHPOST: Add us as a preferred source on Google.

The Release of Rogue by Qualifire: A Framework for Agentic Testing, Evaluating AI Agents’ Performance

DeepSeek AI releases DeepSeek V4: Sparse attention and heavily compressed attention enable one-million-token contexts.

OpenMythos Coding Tutorial: Recurrent-Depth Transformers, Depth Extrapolation and Mixture of Experts Routing

OpenAI Releases GPT-5.5, a Absolutely Retrained Agentic Mannequin That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval

Mend Releases AI Safety Governance Framework: Masking Asset Stock, Danger Tiering, AI Provide Chain Safety, and Maturity Mannequin

US Tech Giants race to spend Billions in UK Artificial Intelligence Push

The ‘Cofounder of my AI agent’ conquered LinkedIn. After that, it was banned.

The Vibes-Based Pricing of ‘Pro’ AI Software

Elon Musk’s Grok ‘Undressing’ Problem Isn’t Fixed

Feeld was a dating app for freaks. Now Some People Call It ‘Normie Hell’

Top Insights

The right way to Run Fb Adverts: 2026 Newbie’s Information

Mamba-3 is a new state space model frontier with two times smaller states and enhanced MIMO decoding hardware efficiency

Latest News