
Cloudflare Releases the Agents SDK v0.5.0, with a Rewritten @cloudflare/ai-chat Package and a Rust-Powered Engine to Optimize Edge Inference Performance

Tech · By Gavin Wallace · 18/02/2026 · 6 Mins Read

Cloudflare has released Agents SDK v0.5.0, addressing the limitations of stateless serverless functions for AI development. In a standard serverless architecture, every LLM call must rebuild the session context from scratch, which increases latency. The latest version offers a vertically integrated execution layer in which compute, inference, and state coexist at the network edge.

The SDK lets developers build stateful agents that persist well beyond a single request-response cycle. It does so with two technologies: Durable Objects, which provide persistent identity and state, and Infire, a Rust-based inference engine optimized for edge resources. The architecture is designed to spare developers from managing external database connections and WebSocket server synchronization.

State Management via Durable Objects

The Agents SDK uses Durable Objects to maintain persistent memory and identity for each agent instance. In the traditional serverless model, functions have no history of past events until they query an external database such as RDS or DynamoDB, and those queries often add 50 ms to 200 ms of latency.

A Durable Object (DO) is a small micro-server with private storage that runs on Cloudflare's network. The Agents SDK assigns each agent a stable ID and routes all future requests from a given user to the same instance, so the agent retains its memory across sessions. Each agent embeds a SQLite database with a 1 GB storage limit per instance, allowing zero-latency reads and writes to conversation logs and tasks.
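The routing model can be sketched in plain TypeScript. This is an illustrative stand-in, not the SDK's API: `AgentRegistry` and `AgentInstance` are hypothetical names, and the array here merely stands in for the embedded SQLite store.

```typescript
// Illustrative sketch: stable-ID routing pins each agent ID to one
// long-lived instance, the way Durable Objects do.
class AgentInstance {
  // Private per-agent state, standing in for the embedded SQLite store.
  private log: string[] = [];
  constructor(public readonly id: string) {}
  handle(message: string): number {
    this.log.push(message); // zero-latency local write
    return this.log.length; // state survives across requests
  }
}

class AgentRegistry {
  private instances = new Map<string, AgentInstance>();
  // The same ID always resolves to the same instance.
  get(id: string): AgentInstance {
    let inst = this.instances.get(id);
    if (!inst) {
      inst = new AgentInstance(id);
      this.instances.set(id, inst);
    }
    return inst;
  }
}

const registry = new AgentRegistry();
registry.get("user-42").handle("hello");
const count = registry.get("user-42").handle("again"); // same instance, so 2
```

Because every request for `user-42` reaches the same instance, no external database round trip is needed to recover conversation history.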

Durable Objects are single-threaded, which simplifies concurrency control. The design eliminates race conditions by ensuring that only one event is processed per agent instance at any given time. When an agent receives multiple inputs simultaneously, they are queued and processed atomically, preserving state consistency during complex operations.
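That queueing behavior can be approximated with a promise chain. The sketch below is illustrative (the `SerialAgent` class is invented for this example); it shows why serialized execution makes a read-modify-write safe even when two inputs arrive at once.

```typescript
// Illustrative sketch: serializing concurrent inputs the way a
// single-threaded Durable Object does, so state updates stay atomic.
class SerialAgent {
  private queue: Promise<void> = Promise.resolve();
  private counter = 0;

  // Each task is appended to the chain; task bodies never interleave.
  enqueue(task: () => Promise<void>): Promise<void> {
    this.queue = this.queue.then(task);
    return this.queue;
  }

  async increment(): Promise<void> {
    const before = this.counter;
    await new Promise((r) => setTimeout(r, 1)); // simulated async work
    this.counter = before + 1; // safe: no other task ran in between
  }

  get value(): number {
    return this.counter;
  }
}

const agent = new SerialAgent();
await Promise.all([
  agent.enqueue(() => agent.increment()),
  agent.enqueue(() => agent.increment()),
]);
// With serialization, both read-modify-writes land: agent.value is 2.
// Without it, both tasks could read `before = 0` and the result would be 1.
```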

Infire: Optimizing Inference with Rust

For the inference layer, Cloudflare developed Infire, an LLM serving engine written in Rust to replace Python-based stacks such as vLLM. Python engines are often slowed by the Global Interpreter Lock and garbage-collection pauses; Infire was designed to reduce CPU overhead and maximize GPU utilization on H100 hardware.

The engine uses granular CUDA Graphs together with Just-In-Time compilation. Instead of launching GPU kernels sequentially, Infire builds a CUDA graph on the fly for each possible batch size, letting the driver execute the work as one monolithic unit and cutting CPU overhead by 82%. Benchmarks show Infire running 7% faster than vLLM 0.10.0 on unloaded machines while using only 25% CPU compared with vLLM's >140%.
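The per-batch-size graph idea is essentially memoized compilation. The sketch below is a conceptual model in TypeScript, not Infire's Rust internals: `compileGraph` stands in for JIT-building a CUDA graph, and "replaying" it models executing the fused pipeline in one driver call.

```typescript
// Illustrative sketch: compile one reusable "graph" per batch size and
// replay it, instead of issuing each kernel launch individually.
type Graph = { batchSize: number; run(inputs: number[]): number[] };

const graphCache = new Map<number, Graph>();

// Stand-in for JIT-compiling a CUDA graph for one batch size.
function compileGraph(batchSize: number): Graph {
  return {
    batchSize,
    // One replay runs the whole fused pipeline, not N separate launches.
    run: (inputs) => inputs.map((x) => x * 2 + 1),
  };
}

function infer(batch: number[]): number[] {
  let graph = graphCache.get(batch.length);
  if (!graph) {
    graph = compileGraph(batch.length); // pay compile cost once
    graphCache.set(batch.length, graph);
  }
  return graph.run(batch); // later calls replay with near-zero CPU cost
}

infer([1, 2, 3]);             // compiles the size-3 graph
const out = infer([4, 5, 6]); // reuses the cached graph
```

The CPU saving comes from the second path: once a graph exists for a batch size, the host only triggers a replay rather than orchestrating every kernel.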

Metric           vLLM 0.10 (Python)   Infire (Rust)        Improvement
Throughput       Baseline             7% faster            +7%
CPU overhead     >140% CPU usage      15% CPU usage        -82%
Cold-start lag   High                 Considerably lower

Infire uses paged KV caching, breaking memory into non-contiguous chunks to prevent fragmentation. This enables continuous batching: the engine admits new prompts while still finishing earlier generations, with no performance drop. With this architecture, Cloudflare achieves a warm-request rate of 99.99% for inference.
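A minimal model of paged KV allocation, sketched in TypeScript under assumed parameters (the 16-token page size and the `PagedKvCache` class are illustrative, not Infire's actual values): sequences grow by whole pages, pages need not be adjacent, and a finished sequence's pages are immediately reusable by new prompts, which is what makes continuous batching cheap.

```typescript
// Illustrative sketch of paged KV caching: token slots are allocated in
// fixed-size pages, so a sequence's cache need not be contiguous.
const PAGE_SIZE = 16; // tokens per page (assumed for illustration)

class PagedKvCache {
  private freePages: number[];
  private pageTable = new Map<string, number[]>(); // sequence -> its pages
  private tokenCount = new Map<string, number>();

  constructor(totalPages: number) {
    this.freePages = Array.from({ length: totalPages }, (_, i) => i);
  }

  // Grow a sequence by `tokens`, grabbing whole pages only as needed.
  append(seq: string, tokens: number): void {
    const used = this.tokenCount.get(seq) ?? 0;
    const pages = this.pageTable.get(seq) ?? [];
    const needed = Math.ceil((used + tokens) / PAGE_SIZE) - pages.length;
    for (let i = 0; i < needed; i++) {
      const page = this.freePages.pop();
      if (page === undefined) throw new Error("cache full");
      pages.push(page); // pages need not be adjacent in memory
    }
    this.pageTable.set(seq, pages);
    this.tokenCount.set(seq, used + tokens);
  }

  // A finished sequence returns its pages for new prompts
  // (the mechanism behind continuous batching).
  release(seq: string): void {
    this.freePages.push(...(this.pageTable.get(seq) ?? []));
    this.pageTable.delete(seq);
    this.tokenCount.delete(seq);
  }

  free(): number {
    return this.freePages.length;
  }
}

const cache = new PagedKvCache(8);
cache.append("A", 20); // 20 tokens -> 2 pages
cache.append("B", 5);  // 5 tokens  -> 1 page
cache.release("A");    // A's 2 pages are instantly available again
```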

Code Mode and Token Efficiency

Standard AI agents typically rely on 'tool calling', where the LLM outputs a JSON object to trigger a function, forcing repeated round trips between the LLM, the execution environment, and each tool. Cloudflare's 'Code Mode' changes this by asking the LLM to write a single TypeScript program that orchestrates multiple tools at once.

This code runs in an isolated V8 sandbox. For complex tasks such as searching ten different files, Code Mode reduces token consumption by 87.5%. Because intermediate results remain inside the sandbox and never have to be sent back to the model, the process is both faster and cheaper.
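A program of the kind Code Mode might generate could look like the sketch below. The `env.search` binding and the `Env` type are hypothetical, invented for this example; the point is that ten tool invocations collapse into one program, and only the final aggregate crosses back to the LLM.

```typescript
// Illustrative sketch: instead of N separate tool-call round trips,
// the model emits one program that fans the work out itself.
// `env.search` is a hypothetical binding, not a real SDK API.
type Env = { search(file: string): Promise<number> };

async function searchAll(env: Env, files: string[]): Promise<number> {
  // Intermediate per-file results stay inside the sandbox...
  const hits = await Promise.all(files.map((f) => env.search(f)));
  // ...and only this final aggregate is returned to the LLM.
  return hits.reduce((a, b) => a + b, 0);
}

// Mock binding for demonstration: "hit count" is just the filename length.
const env: Env = { search: async (f) => f.length };
const total = await searchAll(env, ["a.ts", "bb.ts"]);
```

With classic tool calling, each of those `search` results would be serialized into the conversation and re-tokenized on every turn; here they never leave the sandbox.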

Code Mode also improves security through 'secure bindings'. The sandbox has no direct internet access, so generated code cannot reach MCP servers on its own; instead, it calls bindings exposed on the environment object. These bindings conceal sensitive API keys from the LLM and prevent the model from accidentally leaking credentials in generated code.
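The credential-hiding property can be shown with a closure. This is a conceptual sketch, not Cloudflare's implementation: `makeBinding` and `callService` are invented names, and the authenticated network call is stubbed out.

```typescript
// Illustrative sketch: a binding closes over the API key, so generated
// code can invoke the service without ever being able to read the key.
function makeBinding(apiKey: string) {
  return {
    // Sandbox code sees only this method; `apiKey` stays in the closure.
    async callService(query: string): Promise<string> {
      // A real binding would perform an authenticated fetch here,
      // attaching `apiKey` as a header outside the sandbox's view.
      return `ok:${query}`;
    },
  };
}

const env = { service: makeBinding("secret-key-123") };

// Generated code can use the binding...
const reply = await env.service.callService("ping");
// ...but serializing the binding reveals nothing: methods and closed-over
// variables do not survive JSON.stringify.
const leaked = JSON.stringify(env.service).includes("secret-key-123");
```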

The v0.5.0 Release (February 2026)

The Agents SDK has reached version 0.5.0. Highlights include:

  • this.retry(): a new method for retrying asynchronous operations with exponential backoff.
  • Protocol suppression: developers can now suppress JSON frames per connection via the shouldSendProtocolMessages hook, which is especially useful for IoT or MQTT clients that cannot handle JSON.
  • Stable AI chat: the @cloudflare/ai-chat package has reached version 0.1.0, adding message persistence in SQLite and a "Row Size Guard" that automatically compacts messages approaching SQLite's 2 MB limit.
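The semantics of `this.retry()` can be approximated with a standalone helper. This sketch is not the SDK's source; it assumes exponential backoff with full jitter (delay drawn uniformly from 0 to base·2^attempt), which is one common formulation.

```typescript
// Illustrative sketch of retry with exponential backoff and jitter,
// approximating what this.retry() provides (not the SDK's code).
async function retry<T>(
  fn: () => Promise<T>,
  attempts = 5,
  baseMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Full jitter: wait a random 0..base*2^i ms before the next try.
      const delay = Math.random() * baseMs * 2 ** i;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw lastError;
}

// Demo: a flaky call that fails twice, then succeeds on attempt three.
let calls = 0;
const result = await retry(async () => {
  calls++;
  if (calls < 3) throw new Error("transient failure");
  return "ok";
}, 5, 1);
```

The jitter matters in practice: without it, many agents retrying a shared upstream would synchronize their waves of requests.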
Feature               Description
this.retry()          Automated retries of external API calls.
Data Parts            Typed JSON blobs attached to chat messages.
Tool Approval         Approval status persists across hibernation.
Synchronous Getters   getQueue() and schedule() no longer require Promises.

What you need to know

  • Stateful persistence at the edge: unlike traditional stateless serverless functions, the Agents SDK uses Durable Objects to give agents permanent memory and identity. Each agent keeps its state in an embedded SQLite database with 1 GB of storage, enabling zero-latency access without external database calls.
  • High-efficiency Rust inference: Cloudflare's Infire, a Rust-based inference engine, optimizes GPU use by cutting CPU overhead by 82%. Benchmarks show it running 7% faster than Python-based vLLM. It also uses paged KV caching, maintaining a 99.99% warm-request rate and reducing cold-start delays.
  • Token optimization via Code Mode: 'Code Mode' lets agents write and execute TypeScript programs in a secure V8 isolate rather than making many individual tool calls. This deterministic approach cuts token usage by 87.5% for complex tasks and keeps intermediate data inside the sandbox, improving both security and speed.
  • Universal tool integration: the platform fully supports the Model Context Protocol (MCP), a standard that acts as a universal translator for AI tools. Cloudflare has deployed thirteen official MCP servers, letting agents manage infrastructure such as DNS, Workers KV, and R2 storage using natural language.
  • Production-ready utilities in v0.5.0: the February 2026 release includes critical reliability features such as this.retry(), a utility for retrying asynchronous operations with exponential backoff and jitter. It also adds protocol suppression so agents can interact with binary IoT devices and lightweight embedded systems that cannot process JSON text frames.


