This tutorial demonstrates how to code a small multi-agent reinforcement learning system in which agents learn to navigate a grid world through feedback and layered decision making. We build three agent roles, an Action Agent, a Tool Agent, and a Supervisor Agent, which lets us combine simple heuristics and analysis with oversight and supervision. We also observe how the agents work together, develop strategies, and overcome obstacles as they learn.
```python
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import time
from collections import defaultdict

class GridWorld:
    def __init__(self, size=8):
        self.size = size
        self.agent_pos = [0, 0]
        self.goal_pos = [size - 1, size - 1]
        self.obstacles = self._generate_obstacles()
        self.visited = set()
        self.step_count = 0
        self.max_steps = size * size * 2

    def _generate_obstacles(self):
        # Scatter random obstacles, keeping the start and goal cells free
        obstacles = set()
        n_obstacles = self.size
        while len(obstacles) < n_obstacles:
            pos = (np.random.randint(self.size), np.random.randint(self.size))
            if pos != (0, 0) and pos != (self.size - 1, self.size - 1):
                obstacles.add(pos)
        return obstacles
```
We create the GridWorld environment and define the goal, obstacles, and agent that exist within it. Alongside the framework for valid moves and state representation, this prepares the grid world for dynamic interaction. We see the world taking shape, ready for agents to explore.
```python
class GridWorld(GridWorld):
    def reset(self):
        # Start a fresh episode from the top-left corner
        self.agent_pos = [0, 0]
        self.visited = {(0, 0)}
        self.step_count = 0
        return self._get_state()

    def _get_state(self):
        # Observation: current position, goal position, and the valid moves from here
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}
        can_move = [name for name, d in moves.items()
                    if 0 <= self.agent_pos[0] + d[0] < self.size
                    and 0 <= self.agent_pos[1] + d[1] < self.size
                    and (self.agent_pos[0] + d[0], self.agent_pos[1] + d[1]) not in self.obstacles]
        return {'position': tuple(self.agent_pos),
                'goal': tuple(self.goal_pos),
                'can_move': can_move}

    def step(self, action):
        self.step_count += 1
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}
        if action not in moves:
            return self._get_state(), -1, False, "Invalid action"
        delta = moves[action]
        new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]
        if not (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size):
            return self._get_state(), -1, False, "Hit wall"
        if tuple(new_pos) in self.obstacles:
            return self._get_state(), -1, False, "Hit obstacle"
        self.agent_pos = new_pos
        self.visited.add(tuple(new_pos))
        if self.agent_pos == self.goal_pos:
            return self._get_state(), 10, True, "Goal reached!"
        done = self.step_count >= self.max_steps
        info = "Max steps reached" if done else ""
        return self._get_state(), -0.1, done, info  # small per-step penalty

    def render(self, agent_thoughts=None):
        grid = np.zeros((self.size, self.size, 3))
        for pos in self.visited:
            grid[pos[0], pos[1]] = [0.7, 0.9, 1.0]   # visited: light blue
        for obs in self.obstacles:
            grid[obs[0], obs[1]] = [0.2, 0.2, 0.2]   # obstacle: dark gray
        grid[self.goal_pos[0], self.goal_pos[1]] = [0, 1, 0]    # goal: green
        grid[self.agent_pos[0], self.agent_pos[1]] = [1, 0, 0]  # agent: red
        plt.figure(figsize=(10, 8))
        plt.imshow(grid, interpolation='nearest')
        plt.title(f"Step: {self.step_count} | Visited: {len(self.visited)}/{self.size*self.size}")
        for i in range(self.size + 1):
            plt.axhline(i - 0.5, color="gray", linewidth=0.5)
            plt.axvline(i - 0.5, color="gray", linewidth=0.5)
        if agent_thoughts:
            plt.text(0.5, -1.5, agent_thoughts, ha="center", fontsize=9,
                     bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.8),
                     wrap=True, transform=plt.gca().transData)
        plt.axis('off')
        plt.tight_layout()
        plt.show()
```
We define the world and its visual rendering. We track progress, detect collisions, and display all of this in a clean grid, so we can watch the agent's journey in real time as this logic runs.
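To see the rendering's color scheme in isolation, here is a minimal standalone sketch of how `render()` paints the grid as an RGB array; the color values are the ones used in the code above, applied here to a tiny 3×3 example grid.

```python
import numpy as np

# Color a 3x3 RGB grid the way render() does (values are RGB in [0, 1]):
grid = np.zeros((3, 3, 3))
grid[0, 1] = [0.7, 0.9, 1.0]   # a visited cell: light blue
grid[1, 1] = [0.2, 0.2, 0.2]   # an obstacle: dark gray
grid[2, 2] = [0, 1, 0]         # the goal: green
grid[0, 0] = [1, 0, 0]         # the agent: red

print(grid[2, 2])  # [0. 1. 0.]
```

Passing an array shaped (rows, cols, 3) to `plt.imshow` draws each cell in its RGB color, which is why later cell assignments (goal, agent) overwrite earlier ones (visited).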
```python
class ActionAgent:
    def __init__(self):
        self.q_values = defaultdict(lambda: defaultdict(float))
        self.epsilon = 0.3
        self.learning_rate = 0.1
        self.discount = 0.95

    def choose_action(self, state):
        valid_actions = state['can_move']
        if not valid_actions:
            return None
        pos = state['position']
        if np.random.random() < self.epsilon:
            action = np.random.choice(valid_actions)
            return action, f"Exploring randomly: '{action}'"
        q_vals = {a: self.q_values[pos][a] for a in valid_actions}
        action = max(q_vals, key=q_vals.get)
        return action, f"Exploiting best known action: '{action}'"

    def learn(self, state, action, reward, next_state):
        # Tabular Q-learning update toward the bootstrapped target
        pos, next_pos = state['position'], next_state['position']
        best_next_q = max(self.q_values[next_pos].values(), default=0.0)
        current_q = self.q_values[pos][action]
        self.q_values[pos][action] = current_q + self.learning_rate * (
            reward + self.discount * best_next_q - current_q)

class ToolAgent:
    def analyze(self, state, proposed_action, total_reward, history):
        suggestions = []
        # Flag low exploration when recent history keeps revisiting the same cells
        if len(history) >= 10:
            recent_positions = [h[0]['position'] for h in history[-10:]]
            if len(set(recent_positions)) <= 5:
                suggestions.append("🔍 Low exploration rate. Consider exploring more.")
        if len(history) >= 5:
            recent_rewards = [h[2] for h in history[-5:]]
            if np.mean(recent_rewards) > 0.3:
                suggestions.append("✅ Good progress! Current strategy working.")
        # Nudge the supervisor when the goal is within a couple of cells
        pos, goal = state['position'], state['goal']
        if abs(goal[0] - pos[0]) + abs(goal[1] - pos[1]) <= 2:
            suggestions.append("🎯 Close to goal! Prioritize reaching it.")
        if len(state['can_move']) <= 1:
            suggestions.append("⚠️ Limited movement options ahead.")
        return suggestions
```
We implement the Action Agent, and the Tool Agent provides the system with analytical and learning feedback. The Action Agent selects actions by balancing exploration with exploitation, while the Tool Agent assesses the system's performance and suggests improvements. Together, the two agents create a learning loop that grows with each experience.
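The exploration–exploitation balance described above is plain epsilon-greedy selection. A minimal standalone sketch (the `epsilon_greedy` helper and its arguments are illustrative, not part of the tutorial's classes):

```python
import random
from collections import defaultdict

def epsilon_greedy(q_values, valid_actions, epsilon, rng=random):
    """With probability epsilon pick a random valid action; otherwise the highest-Q one."""
    if rng.random() < epsilon:
        return rng.choice(valid_actions)
    return max(valid_actions, key=lambda a: q_values[a])

q = defaultdict(float, {'right': 0.8, 'down': 0.2})
print(epsilon_greedy(q, ['right', 'down'], epsilon=0.0))  # 'right' (purely greedy)
```

With `epsilon=0.3`, as in the `ActionAgent` above, roughly 30% of steps are random, which keeps the agent discovering new cells even after it has a decent policy.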
```python
class SupervisorAgent:
    def decide(self, state, proposed_action, tool_suggestions):
        if not proposed_action:
            return None, "No valid actions available"
        decision = proposed_action
        reasoning = f"Approved action '{proposed_action}'"
        for suggestion in tool_suggestions:
            # Override toward the goal when the Tool Agent says it is close
            if "goal" in suggestion.lower() or "close" in suggestion.lower():
                goal_direction = self._get_goal_direction(state)
                if goal_direction in state['can_move']:
                    decision = goal_direction
                    reasoning = f"Override: Moving '{goal_direction}' toward goal"
                break
        return decision, reasoning

    def _get_goal_direction(self, state):
        pos = state['position']
        goal = state['goal']
        if goal[0] > pos[0]:
            return 'down'
        elif goal[0] < pos[0]:
            return 'up'
        elif goal[1] > pos[1]:
            return 'right'
        else:
            return 'left'
```
We introduce the Supervisor Agent, which acts as the ultimate decision maker in the system. It interprets the Tool Agent's suggestions, filters out risky options, and ensures that every action stays aligned with the overall goal. Working with this component, we experience multi-agent coordination directly.
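The Supervisor's goal-seeking override reduces to a simple direction heuristic: close the row gap first, then the column gap. A standalone sketch (the `goal_direction` function here is illustrative, mirroring `_get_goal_direction` above):

```python
def goal_direction(pos, goal):
    """Greedy heading toward the goal: resolve the vertical gap first, then the horizontal."""
    if goal[0] > pos[0]:
        return 'down'
    if goal[0] < pos[0]:
        return 'up'
    return 'right' if goal[1] > pos[1] else 'left'

print(goal_direction((0, 0), (7, 7)))  # 'down': the goal is below, so close the row gap first
print(goal_direction((7, 0), (7, 7)))  # 'right': rows match, so move along the columns
```

Note this heuristic ignores obstacles, which is why the Supervisor only applies it when the suggested direction is in `state['can_move']`.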
```python
def train_multi_agent(episodes=5, visualize=True):
    env = GridWorld(size=8)
    action_agent = ActionAgent()
    tool_agent = ToolAgent()
    supervisor = SupervisorAgent()
    episode_rewards = []
    episode_steps = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False
        history = []
        print(f"\n{'='*60}")
        print(f"EPISODE {episode + 1}/{episodes}")
        print(f"{'='*60}")
        while not done:
            action_result = action_agent.choose_action(state)
            if action_result is None:
                break
            proposed_action, action_reasoning = action_result
            suggestions = tool_agent.analyze(state, proposed_action, total_reward, history)
            final_action, supervisor_reasoning = supervisor.decide(state, proposed_action, suggestions)
            if final_action is None:
                break
            next_state, reward, done, info = env.step(final_action)
            total_reward += reward
            action_agent.learn(state, final_action, reward, next_state)
            history.append((state, final_action, reward, next_state))
            if visualize:
                clear_output(wait=True)
                thoughts = (f"Action Agent: {action_reasoning}\n"
                            f"Supervisor: {supervisor_reasoning}\n"
                            f"Tool Agent: {', '.join(suggestions[:2]) if suggestions else 'No suggestions'}\n"
                            f"Reward: {reward:.2f} | Total: {total_reward:.2f}")
                env.render(thoughts)
                time.sleep(0.3)
            state = next_state
        episode_rewards.append(total_reward)
        episode_steps.append(env.step_count)
        print(f"\nEpisode {episode+1} Complete!")
        print(f"Total Reward: {total_reward:.2f}")
        print(f"Steps Taken: {env.step_count}")
        print(f"Cells Visited: {len(env.visited)}/{env.size**2}")
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(episode_rewards, marker="o")
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.grid(True, alpha=0.3)
    plt.subplot(1, 2, 2)
    plt.plot(episode_steps, marker="s", color="orange")
    plt.title('Episode Steps')
    plt.xlabel('Episode')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    return action_agent, tool_agent, supervisor

if __name__ == "__main__":
    print("🤖 Multi-Agent RL System: Grid World Navigation")
    print("=" * 60)
    print("Components:")
    print("  • Action Agent: Proposes actions using Q-learning")
    print("  • Tool Agent: Analyzes performance and suggests improvements")
    print("  • Supervisor Agent: Makes final decisions")
    print("=" * 60)
    trained_agents = train_multi_agent(episodes=5, visualize=True)
```
We perform a full training loop in which the agents work together in the same environment over multiple episodes. In each episode, we track rewards, monitor movement patterns, and visualize learning progress, watching the system improve as the loop runs.
In conclusion, we see how a multi-agent RL system can be built from simple components, with each layer contributing to smarter navigation: the Action Agent gains knowledge via Q-updates, the Tool Agent guides improvements, and the Supervisor ensures safe, goal-oriented actions. The simple yet dynamic grid world lets us watch learning, exploration, and decision-making unfold in real time.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur, Asif is committed to harnessing Artificial Intelligence for social good. His most recent endeavor, Marktechpost, is an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform's popularity is reflected in over 2 million monthly views.

