The game proved a challenge for Claude 3.7 Sonnet: the model spent “dozens of hours” stuck in one city and had difficulty identifying other characters, which severely hindered its progress. With Claude 4 Opus, Hershey noticed an improvement in Claude’s long-term memory and planning capabilities as he watched it navigate a complex Pokémon quest. When the model realized it needed a specific power to continue, it spent two full days honing its skills before moving on. That kind of multi-step reasoning without immediate feedback, Hershey says, shows a new level of coherence, meaning the model is better able to stay on track.
“This is one of my favorite ways to get to know a model. Like, this is how I understand what its strengths are, what its weaknesses are,” Hershey says. “It’s my way of just coming to grips with this new model that we’re about to put out, and how to work with it.”
All Agents Wanted
Anthropic’s Pokémon research is a novel approach to tackling a preexisting problem: How do we understand what decisions an AI makes when approaching complex tasks, and how do we nudge it in the right direction?
The answer to that question is integral to advancing the industry’s much-hyped AI agents: AI that can tackle complex tasks with relative independence. In Pokémon, it’s important that the model doesn’t lose context or “forget” the task at hand. The same is true for an AI agent asked to automate a workflow, even one that takes hundreds of hours.
“As a task goes from being a five-minute task to a 30-minute task, you can see the model’s ability to keep coherent, to remember all of the things it needs to accomplish [the task] successfully, get worse over time,” Hershey says.
Anthropic, like many other AI labs, hopes to develop powerful agents it can sell to consumers as products. Krieger says Anthropic’s “top objective” this year is Claude “doing hours of work for you.”
“This model is now delivering on it—we saw one of our early-access customers have the model go off for seven hours and do a big refactor,” Krieger says, referring to the process of restructuring large amounts of code to make it better organized and more efficient.
Google and OpenAI are chasing the same future. Earlier this week, Google rolled out Mariner, an AI agent built into Chrome that can handle simple tasks such as buying groceries, to subscribers paying $249.99 per month. OpenAI recently released a coding agent, and a couple of months earlier it launched Operator, an agent that can browse the internet on a user’s behalf.
Anthropic is often seen as moving more slowly than its competitors, preferring to research its models deeply before deploying them. That caution matters with AI this powerful: Plenty could go wrong with agents that have access to sensitive user information, such as inboxes or banking logins. In a blog post published Thursday, Anthropic says, “We’ve significantly reduced behavior where the models use shortcuts or loopholes to complete tasks.” The company says both Claude 4 Opus and Claude Sonnet 4 are 65 percent less likely than prior models to engage in this behavior, known as reward hacking, at least on certain coding tasks.

