This tutorial demonstrates how to code a small multi-agent reinforcement learning system in which agents learn to navigate a grid world through feedback and layered decision making. We build three agent roles, an Action Agent, a Tool Agent, and a Supervisor Agent, which lets us combine simple heuristics and analysis with oversight and supervision. We also observe how the agents work together, develop strategies, and overcome obstacles as they learn.
```python
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import time
from collections import defaultdict

class GridWorld:
    def __init__(self, size=8):
        self.size = size
        self.agent_pos = [0, 0]
        self.goal_pos = [size - 1, size - 1]
        self.obstacles = self._generate_obstacles()
        self.visited = set()
        self.step_count = 0
        self.max_steps = size * size * 2

    def _generate_obstacles(self):
        # Scatter random obstacles, keeping the start and goal cells free
        obstacles = set()
        n_obstacles = self.size
        while len(obstacles) < n_obstacles:
            pos = (np.random.randint(self.size), np.random.randint(self.size))
            if pos != (0, 0) and pos != (self.size - 1, self.size - 1):
                obstacles.add(pos)
        return obstacles
```
We create the GridWorld environment and define the goal, obstacles, and agent that exist within it. Alongside the framework for valid moves and state representation, this prepares the grid world for dynamic interaction. We see the world taking shape, ready for agents to explore.
```python
class GridWorld(GridWorld):
    def reset(self):
        # Start a fresh episode from the top-left corner
        self.agent_pos = [0, 0]
        self.visited = {(0, 0)}
        self.step_count = 0
        return self._get_state()

    def _get_state(self):
        # Observation: current position, goal position, and the valid moves from here
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}
        can_move = [name for name, d in moves.items()
                    if 0 <= self.agent_pos[0] + d[0] < self.size
                    and 0 <= self.agent_pos[1] + d[1] < self.size
                    and (self.agent_pos[0] + d[0], self.agent_pos[1] + d[1]) not in self.obstacles]
        return {'position': tuple(self.agent_pos),
                'goal': tuple(self.goal_pos),
                'can_move': can_move}

    def step(self, action):
        self.step_count += 1
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}
        if action not in moves:
            return self._get_state(), -1, False, "Invalid action"
        delta = moves[action]
        new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]
        if not (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size):
            return self._get_state(), -1, False, "Hit wall"
        if tuple(new_pos) in self.obstacles:
            return self._get_state(), -1, False, "Hit obstacle"
        self.agent_pos = new_pos
        self.visited.add(tuple(new_pos))
        if self.agent_pos == self.goal_pos:
            return self._get_state(), 10, True, "Goal reached!"
        done = self.step_count >= self.max_steps
        info = "Max steps reached" if done else ""
        return self._get_state(), -0.1, done, info  # small per-step penalty

    def render(self, agent_thoughts=None):
        grid = np.zeros((self.size, self.size, 3))
        for pos in self.visited:
            grid[pos[0], pos[1]] = [0.7, 0.9, 1.0]   # visited: light blue
        for obs in self.obstacles:
            grid[obs[0], obs[1]] = [0.2, 0.2, 0.2]   # obstacle: dark gray
        grid[self.goal_pos[0], self.goal_pos[1]] = [0, 1, 0]    # goal: green
        grid[self.agent_pos[0], self.agent_pos[1]] = [1, 0, 0]  # agent: red
        plt.figure(figsize=(10, 8))
        plt.imshow(grid, interpolation='nearest')
        plt.title(f"Step: {self.step_count} | Visited: {len(self.visited)}/{self.size*self.size}")
        for i in range(self.size + 1):
            plt.axhline(i - 0.5, color="gray", linewidth=0.5)
            plt.axvline(i - 0.5, color="gray", linewidth=0.5)
        if agent_thoughts:
            plt.text(0.5, -1.5, agent_thoughts, ha="center", fontsize=9,
                     bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.8),
                     wrap=True, transform=plt.gca().transData)
        plt.axis('off')
        plt.tight_layout()
        plt.show()
```
We define the world and its visual rendering. We track progress, detect collisions, and display all of this in a clean grid, so we can watch the agent's journey in real time as this logic runs.
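To see the rendering's color scheme in isolation, here is a minimal standalone sketch of how `render()` paints the grid as an RGB array; the color values are the ones used in the code above, applied here to a tiny 3×3 example grid.

```python
import numpy as np

# Color a 3x3 RGB grid the way render() does (values are RGB in [0, 1]):
grid = np.zeros((3, 3, 3))
grid[0, 1] = [0.7, 0.9, 1.0]   # a visited cell: light blue
grid[1, 1] = [0.2, 0.2, 0.2]   # an obstacle: dark gray
grid[2, 2] = [0, 1, 0]         # the goal: green
grid[0, 0] = [1, 0, 0]         # the agent: red

print(grid[2, 2])  # [0. 1. 0.]
```

Passing an array shaped (rows, cols, 3) to `plt.imshow` draws each cell in its RGB color, which is why later cell assignments (goal, agent) overwrite earlier ones (visited).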
```python
class ActionAgent:
    def __init__(self):
        self.q_values = defaultdict(lambda: defaultdict(float))
        self.epsilon = 0.3
        self.learning_rate = 0.1
        self.discount = 0.95

    def choose_action(self, state):
        valid_actions = state['can_move']
        if not valid_actions:
            return None
        pos = state['position']
        if np.random.random() < self.epsilon:
            action = np.random.choice(valid_actions)
            return action, f"Exploring randomly: '{action}'"
        q_vals = {a: self.q_values[pos][a] for a in valid_actions}
        action = max(q_vals, key=q_vals.get)
        return action, f"Exploiting best known action: '{action}'"

    def learn(self, state, action, reward, next_state):
        # Tabular Q-learning update toward the bootstrapped target
        pos, next_pos = state['position'], next_state['position']
        best_next_q = max(self.q_values[next_pos].values(), default=0.0)
        current_q = self.q_values[pos][action]
        self.q_values[pos][action] = current_q + self.learning_rate * (
            reward + self.discount * best_next_q - current_q)

class ToolAgent:
    def analyze(self, state, proposed_action, total_reward, history):
        suggestions = []
        # Flag low exploration when recent history keeps revisiting the same cells
        if len(history) >= 10:
            recent_positions = [h[0]['position'] for h in history[-10:]]
            if len(set(recent_positions)) <= 5:
                suggestions.append("🔍 Low exploration rate. Consider exploring more.")
        if len(history) >= 5:
            recent_rewards = [h[2] for h in history[-5:]]
            if np.mean(recent_rewards) > 0.3:
                suggestions.append("✅ Good progress! Current strategy working.")
        # Nudge the supervisor when the goal is within a couple of cells
        pos, goal = state['position'], state['goal']
        if abs(goal[0] - pos[0]) + abs(goal[1] - pos[1]) <= 2:
            suggestions.append("🎯 Close to goal! Prioritize reaching it.")
        if len(state['can_move']) <= 1:
            suggestions.append("⚠️ Limited movement options ahead.")
        return suggestions
```
We implement the Action Agent, and the Tool Agent provides the system with analytical and learning feedback. The Action Agent selects actions by balancing exploration with exploitation, while the Tool Agent assesses the system's performance and suggests improvements. Together, the two agents create a learning loop that grows with each experience.
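The exploration–exploitation balance described above is plain epsilon-greedy selection. A minimal standalone sketch (the `epsilon_greedy` helper and its arguments are illustrative, not part of the tutorial's classes):

```python
import random
from collections import defaultdict

def epsilon_greedy(q_values, valid_actions, epsilon, rng=random):
    """With probability epsilon pick a random valid action; otherwise the highest-Q one."""
    if rng.random() < epsilon:
        return rng.choice(valid_actions)
    return max(valid_actions, key=lambda a: q_values[a])

q = defaultdict(float, {'right': 0.8, 'down': 0.2})
print(epsilon_greedy(q, ['right', 'down'], epsilon=0.0))  # 'right' (purely greedy)
```

With `epsilon=0.3`, as in the `ActionAgent` above, roughly 30% of steps are random, which keeps the agent discovering new cells even after it has a decent policy.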
```python
class SupervisorAgent:
    def decide(self, state, proposed_action, tool_suggestions):
        if not proposed_action:
            return None, "No valid actions available"
        decision = proposed_action
        reasoning = f"Approved action '{proposed_action}'"
        for suggestion in tool_suggestions:
            # Override toward the goal when the Tool Agent says it is close
            if "goal" in suggestion.lower() or "close" in suggestion.lower():
                goal_direction = self._get_goal_direction(state)
                if goal_direction in state['can_move']:
                    decision = goal_direction
                    reasoning = f"Override: Moving '{goal_direction}' toward goal"
                break
        return decision, reasoning

    def _get_goal_direction(self, state):
        pos = state['position']
        goal = state['goal']
        if goal[0] > pos[0]:
            return 'down'
        elif goal[0] < pos[0]:
            return 'up'
        elif goal[1] > pos[1]:
            return 'right'
        else:
            return 'left'
```
We introduce the Supervisor Agent, which acts as the ultimate decision maker in the system. It interprets the Tool Agent's suggestions, filters out risky options, and ensures that every action stays aligned with the overall goal. Working with this component, we experience multi-agent coordination directly.
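The Supervisor's goal-seeking override reduces to a simple direction heuristic: close the row gap first, then the column gap. A standalone sketch (the `goal_direction` function here is illustrative, mirroring `_get_goal_direction` above):

```python
def goal_direction(pos, goal):
    """Greedy heading toward the goal: resolve the vertical gap first, then the horizontal."""
    if goal[0] > pos[0]:
        return 'down'
    if goal[0] < pos[0]:
        return 'up'
    return 'right' if goal[1] > pos[1] else 'left'

print(goal_direction((0, 0), (7, 7)))  # 'down': the goal is below, so close the row gap first
print(goal_direction((7, 0), (7, 7)))  # 'right': rows match, so move along the columns
```

Note this heuristic ignores obstacles, which is why the Supervisor only applies it when the suggested direction is in `state['can_move']`.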
```python
def train_multi_agent(episodes=5, visualize=True):
    env = GridWorld(size=8)
    action_agent = ActionAgent()
    tool_agent = ToolAgent()
    supervisor = SupervisorAgent()
    episode_rewards = []
    episode_steps = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False
        history = []
        print(f"\n{'='*60}")
        print(f"EPISODE {episode + 1}/{episodes}")
        print(f"{'='*60}")
        while not done:
            action_result = action_agent.choose_action(state)
            if action_result is None:
                break
            proposed_action, action_reasoning = action_result
            suggestions = tool_agent.analyze(state, proposed_action, total_reward, history)
            final_action, supervisor_reasoning = supervisor.decide(state, proposed_action, suggestions)
            if final_action is None:
                break
            next_state, reward, done, info = env.step(final_action)
            total_reward += reward
            action_agent.learn(state, final_action, reward, next_state)
            history.append((state, final_action, reward, next_state))
            if visualize:
                clear_output(wait=True)
                thoughts = (f"Action Agent: {action_reasoning}\n"
                            f"Supervisor: {supervisor_reasoning}\n"
                            f"Tool Agent: {', '.join(suggestions[:2]) if suggestions else 'No suggestions'}\n"
                            f"Reward: {reward:.2f} | Total: {total_reward:.2f}")
                env.render(thoughts)
                time.sleep(0.3)
            state = next_state
        episode_rewards.append(total_reward)
        episode_steps.append(env.step_count)
        print(f"\nEpisode {episode+1} Complete!")
        print(f"Total Reward: {total_reward:.2f}")
        print(f"Steps Taken: {env.step_count}")
        print(f"Cells Visited: {len(env.visited)}/{env.size**2}")
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(episode_rewards, marker="o")
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.grid(True, alpha=0.3)
    plt.subplot(1, 2, 2)
    plt.plot(episode_steps, marker="s", color="orange")
    plt.title('Episode Steps')
    plt.xlabel('Episode')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    return action_agent, tool_agent, supervisor

if __name__ == "__main__":
    print("🤖 Multi-Agent RL System: Grid World Navigation")
    print("=" * 60)
    print("Components:")
    print("  • Action Agent: Proposes actions using Q-learning")
    print("  • Tool Agent: Analyzes performance and suggests improvements")
    print("  • Supervisor Agent: Makes final decisions")
    print("=" * 60)
    trained_agents = train_multi_agent(episodes=5, visualize=True)
```
We perform a full training loop in which the agents work together in the same environment over multiple episodes. In each episode, we track rewards, monitor movement patterns, and visualize learning progress, watching the system improve as the loop runs.
In conclusion, we see how a multi-agent RL system can be built from simple components, with each layer contributing to smarter navigation: the Action Agent gains knowledge via Q-updates, the Tool Agent guides improvements, and the Supervisor ensures safe, goal-oriented actions. The simple yet dynamic grid world lets us watch learning, exploration, and decision-making unfold in real time.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur, Asif is committed to harnessing Artificial Intelligence for social good. His most recent endeavor, Marktechpost, is an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform's popularity is reflected in over 2 million monthly views.

