10. Headless Training with PPO
By Killian Trouillet
Starting GAMA Headless
Training uses GAMA in headless mode — no GUI, just a WebSocket server. This is much faster than running with the display.
Windows
gama-headless.bat -socket 1001
Linux / macOS
./gama-headless.sh -socket 1001
Wait for the message indicating the server is ready before running the Python script.
Port choice: Any port except 1000 (reserved for GUI). Common choices: 1001, 6868, 8080.
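If you are not sure whether the server is up yet, a quick check from Python is to try opening a TCP connection to the chosen port. This is a minimal sketch (it only confirms that something is listening on localhost, not that GAMA itself has finished loading):
import socket

def port_is_open(host: str = "localhost", port: int = 1001) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False

print("GAMA headless reachable:", port_is_open())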
Understanding PPO
PPO (Proximal Policy Optimization) is a policy gradient algorithm. Unlike Q-Learning which learns values for state-action pairs, PPO directly learns a policy — a neural network that maps observations to actions.
PPO vs Q-Learning
| Aspect | Q-Learning (Part 1) | PPO (Part 2) |
|---|---|---|
| What it learns | A table of Q-values | A neural network policy |
| Action selection | Pick max Q-value | Sample from policy distribution |
| Exploration | ε-greedy (random with decay) | Entropy bonus (natural noise) |
| State space | Finite (needs a table) | Infinite (network generalizes) |
| Action space | Discrete only | Discrete or continuous |
| Update rule | Bellman equation | Gradient ascent on policy |
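To make the action-selection row concrete, here is a small illustrative sketch (not taken from the tutorial code, and using a discrete action space for simplicity): Q-Learning takes the argmax of a row in its table, while a policy-gradient method samples from the distribution produced by the network.
import numpy as np

# Q-Learning: look up the state's row in the table and take the best action.
q_table = np.zeros((25, 4))      # hypothetical sizes: 25 states, 4 actions
state = 7
q_action = int(np.argmax(q_table[state]))

# Policy gradient (PPO-style): the network outputs action probabilities; sample one.
action_probs = np.array([0.1, 0.6, 0.2, 0.1])   # would come from the policy network
pg_action = int(np.random.choice(len(action_probs), p=action_probs))

print(q_action, pg_action)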
Key PPO Hyperparameters
| Parameter | Value | Meaning |
|---|---|---|
| learning_rate | 3e-4 | How fast the network updates (default PPO value) |
| n_steps | 2048 | Collect 2048 steps before each network update |
| batch_size | 64 | Process 64 steps at a time during optimization |
| n_epochs | 10 | Iterate 10 times over the collected data |
| gamma | 0.99 | Discount factor (same concept as Part 1) |
| gae_lambda | 0.95 | GAE smoothing for advantage estimation |
| clip_range | 0.2 | PPO's key innovation: limits how much the policy can change per update |
| ent_coef | 0.01 | Entropy bonus — encourages exploration |
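The clip_range row deserves a closer look. PPO maximizes a clipped surrogate objective: the probability ratio between the new and old policy is clipped to [1 - clip_range, 1 + clip_range], so a single update cannot move the policy too far. A rough NumPy sketch of the idea (Stable Baselines3 implements this internally, you never write it yourself):
import numpy as np

def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO objective for one sample: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    return np.minimum(unclipped, clipped)

# A large ratio with a positive advantage is capped at 1.2 * advantage.
print(clipped_surrogate(ratio=1.8, advantage=2.0))   # 2.4 instead of 3.6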
MlpPolicy
This stands for Multi-Layer Perceptron Policy. It's a fully connected neural network with:
- Input layer: 13 neurons (one per observation value)
- Hidden layer 1: 64 neurons (ReLU activation)
- Hidden layer 2: 64 neurons (ReLU activation)
- Output layer: 2 neurons (dx, dy actions)
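The two 64-unit hidden layers are Stable Baselines3's default for MlpPolicy. If you want to inspect or change the architecture, pass policy_kwargs when creating the model (a sketch assuming env is the GAMA environment created in the next section; the [128, 128] sizes are just an example):
from stable_baselines3 import PPO

# Default architecture: two hidden layers of 64 units each.
model = PPO("MlpPolicy", env, verbose=0)
print(model.policy)   # prints the actor/critic network layers

# Custom architecture: two hidden layers of 128 units each.
model = PPO("MlpPolicy", env, policy_kwargs=dict(net_arch=[128, 128]))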
The Training Script
Connecting to GAMA
import time

import gymnasium as gym
import gama_gymnasium  # importing gama_gymnasium registers the GamaEnv-v0 environment ID
from stable_baselines3 import PPO

# GAMA may still be starting up, so retry the connection a few times.
env = None
for attempt in range(5):
    try:
        env = gym.make(
            "gama_gymnasium_env/GamaEnv-v0",
            gaml_experiment_path="path/to/forager_gym.gaml",
            gaml_experiment_name="gym_env",
            gama_ip_address="localhost",
            gama_port=1001,
        )
        break
    except Exception:
        print(f"Connection attempt {attempt + 1} failed. Retrying in 5s...")
        time.sleep(5)

if env is None:
    raise RuntimeError("Failed to connect to GAMA after several attempts.")
The gym.make() call:
- Connects to the GAMA WebSocket server
- Loads and starts the experiment
- Reads the action/observation spaces from the GymAgent
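Before creating the model, it can be worth sanity-checking the connection by stepping the environment with a few random actions. This is optional and uses only the standard Gymnasium API:
# Optional sanity check: reset the environment and take a few random steps.
obs, info = env.reset()
for _ in range(5):
    action = env.action_space.sample()            # random action, e.g. (dx, dy)
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"reward={reward:.2f} terminated={terminated} truncated={truncated}")
    if terminated or truncated:
        obs, info = env.reset()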
Creating the PPO Model
model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    tensorboard_log="logs/",
)
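Because tensorboard_log is set, training curves (losses, entropy, and other training statistics) can be watched live while training runs:
tensorboard --logdir logs/
Then open http://localhost:6006 in a browser.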
Training Loop
model.learn(total_timesteps=20_000, progress_bar=True)
model.save("saved_models/ppo_forager")
env.close()
That's it! Stable Baselines3 handles the entire training loop internally:
- Collecting rollouts (2048 steps each)
- Computing advantages using GAE
- Optimizing the policy with clipped surrogate objective
- Logging statistics
Why asyncio? The gama-gymnasium library uses asynchronous I/O internally to communicate with GAMA's WebSocket server, so the train() function must be async and launched with asyncio.run().
Monitoring with a Callback
To track episode rewards, we use a custom callback:
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np

class RewardLoggerCallback(BaseCallback):
    def __init__(self):
        super().__init__()
        self.episode_rewards = []   # one total reward per finished episode
        self.current_reward = 0.0   # running total for the episode in progress

    def _on_step(self) -> bool:
        # Called by Stable Baselines3 after every environment step.
        self.current_reward += self.locals["rewards"][0]
        if self.locals["dones"][0]:
            self.episode_rewards.append(self.current_reward)
            if len(self.episode_rewards) % 50 == 0:
                avg = np.mean(self.episode_rewards[-50:])
                print(f"Episode {len(self.episode_rewards)} | Avg Reward: {avg:.1f}")
            self.current_reward = 0.0
        return True
callback = RewardLoggerCallback()
model.learn(total_timesteps=100_000, callback=callback, progress_bar=True)
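For long runs it can also be useful to keep intermediate snapshots. Stable Baselines3's built-in CheckpointCallback can be combined with the custom logger by passing a list of callbacks (a sketch; the save_freq value here is arbitrary):
from stable_baselines3.common.callbacks import CheckpointCallback

# Save a snapshot of the model every 10,000 environment steps.
checkpoint_callback = CheckpointCallback(
    save_freq=10_000,
    save_path="saved_models/checkpoints/",
    name_prefix="ppo_forager",
)

model.learn(
    total_timesteps=100_000,
    callback=[callback, checkpoint_callback],   # both callbacks run during training
    progress_bar=True,
)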
Complete Training Script
"""
Smart Forager - PPO Training Script (Headless)
Author: Killian Trouillet
Usage:
1. Start GAMA headless: gama-headless.bat -socket 1001
2. Run this script: python train_forager.py
"""
import asyncio
import os
import time
from pathlib import Path

import gymnasium as gym
import gama_gymnasium
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np
import matplotlib.pyplot as plt
GAML_FILE = str(Path(__file__).parent / "forager_gym.gaml")
EXPERIMENT_NAME = "gym_env"
GAMA_PORT = 1001
TOTAL_TIMESTEPS = 50_000
SAVE_DIR = str(Path(__file__).parent / "saved_models")
LOG_DIR = str(Path(__file__).parent / "logs")
class RewardLoggerCallback(BaseCallback):
    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.episode_rewards = []
        self.current_reward = 0.0

    def _on_step(self) -> bool:
        self.current_reward += self.locals["rewards"][0]
        if self.locals["dones"][0]:
            self.episode_rewards.append(self.current_reward)
            if len(self.episode_rewards) % 50 == 0:
                avg = np.mean(self.episode_rewards[-50:])
                print(f" Episode {len(self.episode_rewards)} | "
                      f"Avg Reward (last 50): {avg:.1f}")
            self.current_reward = 0.0
        return True
def plot_training(rewards, save_path):
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.plot(rewards, alpha=0.3, color="blue", label="Episode Reward")
    if len(rewards) >= 20:
        window = min(50, len(rewards))
        avg = np.convolve(rewards, np.ones(window)/window, mode='valid')
        plt.plot(range(window-1, len(rewards)), avg,
                 color="red", linewidth=2, label=f"Moving Avg ({window} ep)")
    plt.xlabel("Episode")
    plt.ylabel("Total Reward")
    plt.title("PPO Training Progress - Smart Forager")
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 2, 2)
    if len(rewards) >= 20:
        plt.hist(rewards[-100:], bins=30, color="steelblue", edgecolor="white")
    plt.xlabel("Episode Reward")
    plt.ylabel("Count")
    plt.title("Reward Distribution (last 100 episodes)")
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig(save_path, dpi=150)
    print(f"Training plot saved to: {save_path}")
    plt.show()
async def train():
    os.makedirs(SAVE_DIR, exist_ok=True)
    os.makedirs(LOG_DIR, exist_ok=True)

    print("=" * 50)
    print(" Smart Forager - PPO Training (Headless)")
    print("=" * 50)
    print(f" GAML model: {GAML_FILE}")
    print(f" GAMA port: {GAMA_PORT}")
    print(f" Timesteps: {TOTAL_TIMESTEPS:,}")
    print("=" * 50)

    # Create the environment.
    # We use a retry loop because GAMA can take time to be 'ready'.
    env = None
    for attempt in range(5):
        try:
            print(f"Connecting to GAMA (attempt {attempt + 1}/5)...")
            env = gym.make(
                "gama_gymnasium_env/GamaEnv-v0",
                gaml_experiment_path=GAML_FILE,
                gaml_experiment_name=EXPERIMENT_NAME,
                gama_ip_address="localhost",
                gama_port=GAMA_PORT,
            )
            break
        except Exception as e:
            if "Unable to find" in str(e) or "NOTREADY" in str(e):
                print("GAMA not ready yet. Waiting 5s...")
                time.sleep(5)
            else:
                raise e

    if env is None:
        print("Failed to connect to GAMA after several attempts.")
        return

    print(f"\nObservation space: {env.observation_space}")
    print(f"Action space: {env.action_space}")

    model = PPO(
        "MlpPolicy", env, verbose=1,
        learning_rate=3e-4, n_steps=2048, batch_size=64, n_epochs=10,
        gamma=0.99, gae_lambda=0.95, clip_range=0.2, ent_coef=0.01,
        tensorboard_log=LOG_DIR,
    )

    print("\n--- Training started ---\n")
    reward_callback = RewardLoggerCallback()
    model.learn(total_timesteps=TOTAL_TIMESTEPS, callback=reward_callback,
                progress_bar=True)

    model_path = os.path.join(SAVE_DIR, "ppo_forager")
    model.save(model_path)
    print(f"\nModel saved to: {model_path}")

    plot_path = os.path.join(SAVE_DIR, "training_rewards.png")
    plot_training(reward_callback.episode_rewards, plot_path)

    env.close()

    print("\n" + "=" * 50)
    print(" Training complete!")
    print(f" Total episodes: {len(reward_callback.episode_rewards)}")
    if reward_callback.episode_rewards:
        print(f" Final avg reward (last 50): "
              f"{np.mean(reward_callback.episode_rewards[-50:]):.1f}")
    print("=" * 50)
if __name__ == "__main__":
    asyncio.run(train())
Running the Training
cd models/gym
python train_forager.py
What to Expect
- 0-10k steps: The forager moves randomly. Most episodes time out (reward ≈ -5).
- 10k-30k steps: The forager starts approaching the food. Reward improves gradually.
- 30k-50k steps: The forager reliably reaches the food. Reward ≈ 90+.
A reward plot is saved automatically to saved_models/training_rewards.png.
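Once training has finished, the saved policy can be loaded and run deterministically to watch how the forager behaves. A minimal evaluation sketch, assuming the headless server is still running and env was created with gym.make() as in the training script:
from stable_baselines3 import PPO

model = PPO.load("saved_models/ppo_forager")

obs, info = env.reset()
episode_reward = 0.0
done = False
while not done:
    # deterministic=True uses the most likely action instead of sampling
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
    done = terminated or truncated

print(f"Episode reward: {episode_reward:.1f}")
env.close()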