Anthropic’s Bloom is an open-source agentic framework for automating behavioral evaluations. Starting from a behavior specified by a researcher, the system builds targeted evaluations that measure the frequency and strength of that behavior in realistic scenarios.
Why Bloom?
Designing and maintaining behavioral evaluations for safety and alignment is expensive. Teams must create scenarios, run many interactions, and read transcripts. Static benchmarks become outdated or leak into training data as models change. Anthropic frames this as a scaling problem: researchers need to generate new evaluations of misaligned behaviors quickly while keeping the metrics relevant.
Bloom fills this gap. Starting from a single seed configuration, it generates an entire evaluation suite. The seed determines which behavior is studied, how many distinct scenarios are generated, and the interaction style. Each run of the framework produces new, behavior-consistent scenarios that remain reproducible through a recorded seed.
Design of the seed configuration
Bloom was released on GitHub under the MIT License. At the heart of Bloom is the evaluation “seed”, defined in seed.yaml. This file references a behavior in behaviors/behaviors.json, optional example transcripts, and global parameters that shape the entire run.
Key configuration elements include:
- The behavior to target, a unique identifier defined in behaviors.json (for example, sycophancy or self-preservation).
- Zero or more few-shot example transcripts, stored under behaviors/examples/.
- total_evals, the number of rollouts to generate for the suite.
- rollout.target, the model being evaluated, such as claude-sonnet-4.
- Controls such as diversity, max_turns, modality, and additional reasoning and judging settings.
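Shown as a Python dict for illustration (the real file is YAML), a minimal seed might carry the fields listed above. The key names below are assumptions inferred from those parameters, not the exact seed.yaml schema:

```python
# Hypothetical sketch of a Bloom seed configuration; key names are
# illustrative and may differ from the real seed.yaml schema.
seed = {
    "behavior": "sycophancy",         # identifier defined in behaviors/behaviors.json
    "examples": [],                   # optional few-shot transcripts from behaviors/examples/
    "total_evals": 100,               # number of rollouts to generate for the suite
    "rollout": {
        "target": "claude-sonnet-4",  # the model being evaluated
        "max_turns": 10,              # cap on conversation length
        "modality": "conversation",   # e.g. conversation vs. simulated environment
    },
    "diversity": 0.5,                 # distinct scenarios vs. variations per scenario
}

# Basic sanity checks a loader might perform.
assert seed["total_evals"] > 0
assert 0.0 <= seed["diversity"] <= 1.0
```

Keeping the whole run keyed off one small record like this is what makes each suite reproducible from its seed.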
Bloom can talk to both Anthropic and OpenAI models through a unified interface. It integrates with Weights & Biases for large sweeps and can export Inspect-compatible transcripts.
A four-stage agentic pipeline
Bloom’s evaluation process is divided into four stages:
- Understanding agent: reads the behavior description and examples, then produces a structured overview of what the behavior is, why it matters, and how to recognize a successful elicitation, so that later stages know what to look for.
- Ideation agent: produces candidate evaluation scenarios. Each scenario describes a realistic situation, the user persona, and how the target model could plausibly exhibit the behavior. Bloom generates scenarios in batches under token budgets, and the diversity parameter trades off distinct scenarios against variations within each scenario.
- Rollout agent: instantiates the scenarios against the chosen target model. The agent can simulate environments or run multi-turn conversations, recording all messages and tool calls. Configuration parameters such as max_turns, modality, and no_user_mode control how much autonomy the target model has during this phase.
- Judgment and meta-judgment agents: a judge model scores each transcript on a numeric scale for the presence of the behavior and rates additional attributes such as realism and evaluator forcefulness. A meta-judge then reviews summaries across all rollouts to produce a suite-level report highlighting important cases and patterns. The main metric is the elicitation rate: the percentage of rollouts scoring at least 7 out of 10 for behavior presence.
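The elicitation-rate metric described above can be sketched in a few lines. The 7-out-of-10 threshold comes from the report; the function name and score format are illustrative assumptions:

```python
def elicitation_rate(scores, threshold=7):
    """Percentage of rollouts whose behavior-presence score (1-10)
    meets or exceeds the threshold. Illustrative, not Bloom's code."""
    if not scores:
        return 0.0
    hits = sum(1 for s in scores if s >= threshold)
    return 100.0 * hits / len(scores)

# Example: 10 judged rollouts, five of which score >= 7.
judge_scores = [2, 7, 9, 4, 8, 6, 10, 3, 5, 7]
print(elicitation_rate(judge_scores))  # → 50.0
```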
Validation using frontier models
Anthropic used Bloom to build four alignment-relevant evaluation suites, covering behaviors such as delusional sycophancy, long-horizon sabotage, and bias toward self. Each suite contains 100 rollouts, run against 16 models and repeated three times. All plots report elicitation rates with standard-deviation error bars, with Claude Opus 4.1 used across all pipeline stages.
Bloom was also tested on intentionally misaligned “model organisms” from earlier alignment work. Across 10 implanted quirks, Bloom distinguished the organism from a baseline production model in nine cases. Manual inspection of the remaining self-promotion quirk showed that the baseline model displays similar behavior, which explains the overlapping scores. In a separate validation, 40 transcripts were scored by 11 candidate judge models and compared against human labels: Claude Sonnet 4.5 reached a Spearman correlation of 0.75 with the human scores, and Claude Opus 4.1 reached 0.8.
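As a rough sketch of this judge-agreement check, a Spearman rank correlation between judge and human scores can be computed with the standard library alone (no tie correction); the scores below are made up for illustration and are not Anthropic's data:

```python
def spearman(xs, ys):
    """Spearman rank correlation of two equal-length score lists
    (simple formula, assumes no tied values)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical human vs. judge scores for six transcripts.
human = [3, 7, 8, 2, 9, 5]
judge = [5, 6, 9, 1, 10, 4]
print(round(spearman(human, judge), 3))  # → 0.943
```

Identical rankings give 1.0, fully reversed rankings give -1.0; a production pipeline would also need tie handling (e.g. average ranks).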

Positioning in relation to Petri
Anthropic positions Bloom as a complement to Petri. Petri is a broad-coverage auditing tool whose agents probe models through multi-turn interactions. Bloom instead starts from a single behavior definition and automates the engineering required to turn it into a large evaluation suite.
What you need to know
- Bloom is an open-source agentic framework that transforms a behavior specification into a behavioral evaluation suite for large models via a four-stage pipeline: understanding, ideation, rollout, and judgment.
- It is driven by a seed configuration, seed.yaml, which references behaviors/behaviors.json; researchers specify the target behavior plus controls such as diversity, max_turns, and modality.
- Bloom is built on LiteLLM, which provides unified access to Anthropic and OpenAI models, and supports experiment tracking with Weights & Biases.
- Anthropic validated Bloom on four alignment-focused behaviors across 16 frontier models, with 100 rollouts repeated three times per suite, and on 10 model-organism quirks, where Bloom separated intentionally misaligned models from baselines in 9 of 10 cases, with judge models matching human labels at Spearman correlations as high as 0.86.
Check out the GitHub Repo, Technical Report, and Blog.
Asif Razzaq is the CEO of Marktechpost Media Inc. An entrepreneur with a passion for harnessing Artificial Intelligence for social good, he runs Marktechpost, a media platform known for in-depth, accessible coverage of machine learning and deep learning news.

