Cloudflare has released Agents SDK v0.5.0, addressing the limitations of stateless serverless functions for AI development. In a standard serverless architecture, every LLM call must rebuild the session context from scratch, which increases latency. The latest release offers a vertically integrated execution layer in which compute, inference, and state coexist at the network edge.
The SDK lets developers build stateful agents whose lifetimes extend beyond individual request-response cycles. It relies on two technologies: Durable Objects, which provide persistent identity and state, and Infire, a Rust-based inference engine optimized for edge resources. The architecture is designed to spare developers from managing external database connections and WebSocket server synchronization.
State Management via Durable Objects
The Agents SDK uses Durable Objects to maintain persistent memory and identity for each agent instance. In the traditional serverless model, functions have no history of past events until they query an external database such as RDS or DynamoDB, and those queries often add 50-200 ms of latency.
A Durable Object (DO) is a small micro-server with its own private storage, running on Cloudflare's network. The Agents SDK assigns each agent a stable ID and routes all future requests from a given user to the same instance, so the agent retains its memory. Each agent embeds a SQLite database with a 1 GB storage limit per instance, enabling zero-latency reads and writes of conversation logs and tasks.
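The routing behavior can be illustrated with a minimal in-memory sketch. The `AgentRegistry` and `ChatAgent` classes below are invented for illustration and are not the Agents SDK API; the real SDK backs each instance with a Durable Object and an embedded SQLite database rather than an in-memory array.

```typescript
// Minimal illustration of stable-ID routing: each name resolves to one
// long-lived instance that keeps private state across requests.
class ChatAgent {
  private log: string[] = []; // stands in for the agent's SQLite storage

  handle(message: string): number {
    this.log.push(message); // zero-latency local write
    return this.log.length; // state survives between calls
  }
}

class AgentRegistry {
  private instances = new Map<string, ChatAgent>();

  // Same stable ID -> same instance on every lookup.
  get(name: string): ChatAgent {
    let agent = this.instances.get(name);
    if (!agent) {
      agent = new ChatAgent();
      this.instances.set(name, agent);
    }
    return agent;
  }
}

const registry = new AgentRegistry();
registry.get("user-42").handle("hello");
const count = registry.get("user-42").handle("how are you?");
console.log(count); // 2 — the second request reached the same instance
```

Because the registry always returns the same object for `"user-42"`, the conversation log accumulates without any external database round trip.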
Durable Objects are single-threaded, which simplifies concurrency control. The design eliminates race conditions by ensuring that only one event is processed per agent instance at any given time. When an agent receives multiple inputs simultaneously, they are queued and processed atomically, preserving state consistency during complex operations.
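The queuing model can be sketched as a promise chain that serializes handlers per instance. This is a conceptual illustration, not the runtime's internal implementation:

```typescript
// Serializes async events so only one runs at a time per instance,
// mirroring how a Durable Object processes one event at a time.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Chain onto the tail so tasks run strictly one after another.
    const result = this.tail.then(task, task);
    this.tail = result.catch(() => undefined); // keep chain alive on errors
    return result;
  }
}

// A read-modify-write that would race if two events interleaved.
let balance = 0;
async function deposit(amount: number): Promise<number> {
  const current = balance;                     // read
  await new Promise((r) => setTimeout(r, 10)); // simulated async work
  balance = current + amount;                  // write
  return balance;
}

(async () => {
  const queue = new SerialQueue();
  await Promise.all([
    queue.enqueue(() => deposit(5)),
    queue.enqueue(() => deposit(7)),
  ]);
  console.log(balance); // 12 — without the queue, both reads see 0
})();
```

Run unqueued, both deposits would read the initial balance and the result would be 7; serialization makes the combined update atomic.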
Infire – Optimizing Inference Using Rust
Cloudflare developed Infire for the inference layer: an LLM engine written in Rust to replace Python-based stacks such as vLLM. Python engines are often slowed by the Global Interpreter Lock and garbage-collection pauses. Infire was designed to reduce CPU overhead and maximize GPU utilization on H100 hardware.
The engine uses granular CUDA graphs together with just-in-time compilation. Instead of launching GPU kernels sequentially, Infire builds a CUDA graph on the fly for each possible batch size, letting the driver execute the work as one monolithic unit and cutting CPU overhead by 82%. Benchmarks show Infire is 7% faster than vLLM 0.10.0 on unloaded machines while using only 25% of a CPU core compared to vLLM's >140%.
| Metric | vLLM 0.10 (Python) | Infire (Rust) | Improvement |
|---|---|---|---|
| Throughput | Baseline | 7% faster | +7% |
| CPU overhead | >140% CPU usage | 15% CPU usage | -82% |
| Startup lag | High (cold start) | Considerably reduced | — |
Infire uses paged KV caching to break memory into non-contiguous chunks and prevent fragmentation. This enables 'continuous batching,' where the engine admits new prompts while simultaneously finishing previous generations, without a performance drop. With this architecture, Cloudflare reports a 99.99% warm-request rate for inference.
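The idea behind paged KV caching can be sketched as a block allocator: sequences claim fixed-size pages from a shared pool rather than one contiguous region, so pages freed by a finished generation are immediately reusable by a new prompt. This is a conceptual TypeScript sketch, not Infire's Rust implementation:

```typescript
// Toy paged allocator: KV memory is split into fixed-size pages that any
// sequence can claim, avoiding contiguous-allocation fragmentation.
class PagedKvCache {
  private freePages: number[];
  private pageTable = new Map<string, number[]>(); // sequence -> page ids

  constructor(totalPages: number) {
    this.freePages = Array.from({ length: totalPages }, (_, i) => i);
  }

  // Grow a sequence by one page, e.g. as generation produces more tokens.
  appendPage(seqId: string): boolean {
    const page = this.freePages.pop();
    if (page === undefined) return false; // pool exhausted
    const pages = this.pageTable.get(seqId) ?? [];
    pages.push(page);
    this.pageTable.set(seqId, pages);
    return true;
  }

  // When a generation finishes, its pages return to the pool at once,
  // letting continuous batching admit a new prompt immediately.
  release(seqId: string): void {
    this.freePages.push(...(this.pageTable.get(seqId) ?? []));
    this.pageTable.delete(seqId);
  }

  get available(): number {
    return this.freePages.length;
  }
}

const cache = new PagedKvCache(4);
cache.appendPage("prompt-a");
cache.appendPage("prompt-a");
cache.appendPage("prompt-b");
cache.release("prompt-a"); // finished generation frees its pages
console.log(cache.available); // 3 — reclaimed pages serve new prompts
```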
Code Mode and Token Efficiency
Standard AI agents typically use 'tool calling,' where the LLM outputs a JSON object to trigger a function. This requires round trips between the LLM, the execution environment, and each tool. Cloudflare's 'Code Mode' changes this by asking the LLM to write a TypeScript program that orchestrates multiple tools at once.
This code runs in an isolated V8 sandbox. Code Mode reduces token consumption by 87.5% for complex tasks such as searching ten different files. Because intermediate results stay inside the sandbox and never have to be sent back to the model, the process is faster and cheaper.
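The difference can be sketched by contrasting per-call round trips with a single generated program run against local tool stubs. The `tools` object and file contents below are invented for illustration; in real Code Mode the tools would be bindings provided to the sandbox:

```typescript
// Hypothetical tool surface exposed to sandboxed code (local stubs here).
const files: Record<string, string> = {
  "a.txt": "alpha needle",
  "b.txt": "beta",
  "c.txt": "needle gamma",
};

const tools = {
  listFiles: (): string[] => Object.keys(files),
  readFile: (name: string): string => files[name] ?? "",
};

// With classic tool calling, every readFile result would travel back
// through the LLM as tokens. With Code Mode, the model emits one program
// and only the final answer leaves the sandbox.
function searchProgram(needle: string): string[] {
  return tools
    .listFiles()
    .filter((name) => tools.readFile(name).includes(needle));
}

console.log(searchProgram("needle")); // ["a.txt", "c.txt"]
```

Searching ten files this way costs one generation instead of ten tool-call exchanges, which is where the token savings come from.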
Code Mode also improves security through 'secure bindings.' The sandbox has no internet access, so generated code cannot contact MCP servers directly; instead, it calls bindings exposed on the environment object. These bindings conceal sensitive API keys from the LLM and prevent the model from accidentally leaking credentials in generated code.
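A secure binding can be sketched as a closure that holds the credential and exposes only a narrow method, so generated code can invoke the tool but can never read or print the key. The names below are illustrative, not the platform's actual binding API:

```typescript
// The binding closes over the secret; sandboxed code receives only the
// callable, never the credential itself.
type WeatherBinding = { getForecast: (city: string) => string };

function makeWeatherBinding(apiKey: string): WeatherBinding {
  return {
    getForecast(city: string): string {
      // A real binding would attach the key to an outbound request made
      // outside the sandbox; stubbed here for illustration.
      return `forecast for ${city} (auth key length ${apiKey.length})`;
    },
  };
}

// The environment handed to generated code contains only the binding.
const env = { WEATHER: makeWeatherBinding("sk-secret-123") };

const report = env.WEATHER.getForecast("Lisbon");
console.log(report.includes("sk-secret-123")); // false — key never leaks
```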
The v0.5.0 Release, February 2026
The Agents SDK has reached version 0.5.0. Key updates include:
- `this.retry()`: a new method to retry asynchronous operations with exponential backoff.
- Protocol suppression: developers can now suppress JSON frames per connection using the `shouldSendProtocolMessages` hook, which is especially useful when IoT or MQTT clients cannot handle JSON.
- Stable AI Chat: the `@cloudflare/ai-chat` package has reached version 0.1.0, adding message persistence in SQLite and a "Row Size Guard" that automatically compacts messages approaching SQLite's 2 MB limit.
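A retry helper with exponential backoff and jitter can be sketched in plain TypeScript. The actual signature of `this.retry()` may differ, so this is a standalone illustration of the pattern rather than the SDK method:

```typescript
// Retries an async operation, doubling the delay each attempt and adding
// random jitter so many agents don't retry in lockstep.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break;
      const backoff = baseDelayMs * 2 ** attempt; // 100, 200, 400, ...
      const jitter = Math.random() * backoff;     // de-synchronize peers
      await new Promise((r) => setTimeout(r, backoff + jitter));
    }
  }
  throw lastError;
}

// Usage: a flaky call that succeeds on the third attempt.
(async () => {
  let calls = 0;
  const flaky = async (): Promise<string> => {
    calls++;
    if (calls < 3) throw new Error("transient failure");
    return "ok";
  };
  const result = await retryWithBackoff(flaky, 5, 1);
  console.log(result, calls); // "ok" 3
})();
```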
| Feature | Description |
|---|---|
| `this.retry()` | Automated retries of external API calls. |
| Data Parts | Attach typed JSON blobs to chat messages. |
| Tool Approval | Approval status persists across hibernation. |
| Synchronous Getters | `getQueue()` and `schedule()` no longer require awaiting Promises. |
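The "Row Size Guard" behavior can be sketched as a compaction step that evicts the oldest messages once the serialized history nears a byte budget. The function below is illustrative, not the `@cloudflare/ai-chat` API, and the 2 MB figure mirrors the SQLite limit mentioned above:

```typescript
// Drops oldest messages until the serialized history fits the budget,
// mimicking automatic compaction near SQLite's 2 MB row limit.
function compactHistory(messages: string[], maxBytes: number): string[] {
  const size = (msgs: string[]) =>
    new TextEncoder().encode(JSON.stringify(msgs)).length;

  const compacted = [...messages];
  while (compacted.length > 1 && size(compacted) > maxBytes) {
    compacted.shift(); // evict the oldest message first
  }
  return compacted;
}

const history = ["old ".repeat(100), "recent question", "recent answer"];
console.log(compactHistory(history, 64)); // bulky old message evicted
```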
What you need to know
- Stateful Persistence at the Edge: unlike traditional stateless serverless functions, the Agents SDK uses Durable Objects to give each agent a permanent memory and identity. Every agent keeps its state in an embedded SQLite database with 1 GB of storage, enabling zero-latency access without external database calls.
- High-Efficiency Rust Inference: Cloudflare's Infire, a Rust-based inference engine, optimizes GPU utilization while cutting CPU overhead by 82%. Benchmarks show it is 7% faster than the Python-based vLLM. It also uses paged KV caching, maintaining a 99.99% warm-request rate and reducing cold-start delays.
- Token Optimization via Code Mode: 'Code Mode' lets agents write and execute TypeScript programs in a secure V8 isolate rather than making multiple individual tool calls. This deterministic approach reduces token usage by 87.5% for complex tasks, and intermediate data stays within the sandbox, improving both security and speed.
- Universal Tool Integration: the platform fully supports the Model Context Protocol (MCP), a standard that acts as a universal translator for AI tools. Cloudflare has deployed thirteen official MCP servers, allowing agents to manage infrastructure such as DNS, Workers KV, and R2 storage using natural language.
- Production-Ready Utilities: the February 2026 v0.5.0 release includes critical reliability features, notably `this.retry()`, a utility for retrying asynchronous operations with exponential backoff and jitter. It also adds protocol suppression so agents can interact with binary IoT devices and lightweight embedded systems that cannot process JSON text frames.