
10. Headless Training with PPO

By Killian Trouillet


Starting GAMA Headless​

Training uses GAMA in headless mode — no GUI, just a WebSocket server. This is much faster than running with the display.

Windows​

gama-headless.bat -socket 1001

Linux / MacOS​

./gama-headless.sh -socket 1001

Wait for the message indicating the server is ready before running the Python script.

Port choice: Any port except 1000 (reserved for GUI). Common choices: 1001, 6868, 8080.
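If you want to check programmatically that the server is up before launching training, a quick TCP probe is enough. A minimal sketch using Python's standard socket module; the port is whatever you passed to -socket:

```python
import socket

def gama_server_ready(host: str = "localhost", port: int = 1001) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    status = "ready" if gama_server_ready() else "not reachable"
    print(f"GAMA server on port 1001: {status}")
```

This only confirms the port is open, not that the experiment compiled; the retry loop in the training script below handles the rest.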


Understanding PPO​

PPO (Proximal Policy Optimization) is a policy gradient algorithm. Unlike Q-Learning which learns values for state-action pairs, PPO directly learns a policy — a neural network that maps observations to actions.
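Concretely, for a continuous action space the network outputs the parameters of a distribution (here, a Gaussian mean) and the action is sampled from it rather than picked by argmax. A toy NumPy sketch, with randomly initialized weights standing in for a trained policy and sizes matching the forager setup (13 observations, 2 actions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes matching the forager: 13 observations, 2 actions (dx, dy).
OBS_DIM, HID, ACT_DIM = 13, 64, 2

# Random weights stand in for a trained network.
W1 = rng.normal(0, 0.1, (OBS_DIM, HID))
W2 = rng.normal(0, 0.1, (HID, HID))
W_out = rng.normal(0, 0.1, (HID, ACT_DIM))
log_std = np.zeros(ACT_DIM)  # learned alongside the weights in real PPO

def policy(obs: np.ndarray) -> np.ndarray:
    """Map one observation vector to a sampled continuous action."""
    h = np.maximum(obs @ W1, 0)                # ReLU hidden layer 1
    h = np.maximum(h @ W2, 0)                  # ReLU hidden layer 2
    mean = h @ W_out                           # Gaussian mean for (dx, dy)
    return rng.normal(mean, np.exp(log_std))   # sample, not argmax

action = policy(rng.normal(size=OBS_DIM))
print(action.shape)  # (2,)
```

The sampling step is what replaces ε-greedy exploration: the noise is part of the policy itself.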

PPO vs Q-Learning​

| Aspect | Q-Learning (Part 1) | PPO (Part 2) |
|---|---|---|
| What it learns | A table of Q-values | A neural network policy |
| Action selection | Pick max Q-value | Sample from policy distribution |
| Exploration | ε-greedy (random with decay) | Entropy bonus (natural noise) |
| State space | Finite (needs a table) | Infinite (network generalizes) |
| Action space | Discrete only | Discrete or continuous |
| Update rule | Bellman equation | Gradient ascent on policy |

Key PPO Hyperparameters​

| Parameter | Value | Meaning |
|---|---|---|
| learning_rate | 3e-4 | How fast the network updates (default PPO value) |
| n_steps | 2048 | Collect 2048 steps before each network update |
| batch_size | 64 | Process 64 steps at a time during optimization |
| n_epochs | 10 | Iterate 10 times over the collected data |
| gamma | 0.99 | Discount factor (same concept as Part 1) |
| gae_lambda | 0.95 | GAE smoothing for advantage estimation |
| clip_range | 0.2 | PPO's key innovation: limits how much the policy can change per update |
| ent_coef | 0.01 | Entropy bonus — encourages exploration |
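These values interact: each update consumes n_steps transitions, splits them into minibatches of batch_size, and replays them n_epochs times. Quick arithmetic for a hypothetical 50,000-step run:

```python
n_steps, batch_size, n_epochs, total_timesteps = 2048, 64, 10, 50_000

updates = total_timesteps // n_steps            # policy updates over the run
minibatches_per_epoch = n_steps // batch_size   # 2048 / 64 = 32
grad_steps = updates * n_epochs * minibatches_per_epoch

print(f"{updates} updates, {grad_steps} gradient steps")  # 24 updates, 7680 gradient steps
```

So a 50k-step run gives only 24 policy updates; if learning looks noisy, raising total_timesteps is usually more effective than tweaking the learning rate.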

MlpPolicy​

This stands for Multi-Layer Perceptron Policy. It's a fully connected neural network with:

  • Input layer: 13 neurons (one per observation value)
  • Hidden layer 1: 64 neurons (ReLU activation)
  • Hidden layer 2: 64 neurons (ReLU activation)
  • Output layer: 2 neurons (dx, dy actions)
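For a sense of scale, the policy trunk above is tiny. A quick parameter count (weights plus biases per layer; SB3 additionally builds a separate value-function head, not counted here):

```python
# Each fully connected layer has in*out weights plus out biases.
layers = [(13, 64), (64, 64), (64, 2)]  # input -> hidden1 -> hidden2 -> output
params = sum(i * o + o for i, o in layers)
print(params)  # 5186
```

The 64×64 layout is Stable Baselines3's default; it can be overridden by passing policy_kwargs (e.g. net_arch=[64, 64]) to the PPO constructor if you want to experiment.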

The Training Script​

Connecting to GAMA​

import time

import gymnasium as gym
import gama_gymnasium  # registers the GAMA environment with Gymnasium
from stable_baselines3 import PPO

env = None
for attempt in range(5):
    try:
        env = gym.make(
            "gama_gymnasium_env/GamaEnv-v0",
            gaml_experiment_path="path/to/forager_gym.gaml",
            gaml_experiment_name="gym_env",
            gama_ip_address="localhost",
            gama_port=1001,
        )
        break
    except Exception as e:
        print(f"Connection attempt {attempt + 1} failed ({e}). Retrying in 5s...")
        time.sleep(5)

if env is None:
    raise RuntimeError("Failed to connect to GAMA after several attempts.")

The gym.make() call:

  1. Connects to the GAMA WebSocket server
  2. Loads and starts the experiment
  3. Reads the action/observation spaces from the GymAgent

Creating the PPO Model​

model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    tensorboard_log="logs/",
)

Training Loop​

model.learn(total_timesteps=20_000, progress_bar=True)
model.save("saved_models/ppo_forager")
env.close()

That's it! Stable Baselines3 handles the entire training loop internally:

  • Collecting rollouts (2048 steps each)
  • Computing advantages using GAE
  • Optimizing the policy with clipped surrogate objective
  • Logging statistics
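The clipped surrogate objective in the last step can be sketched in a few lines of NumPy. Here ratio is the probability ratio between the new and old policy for each collected action, and advantage comes from GAE; both names are illustrative, not SB3 internals:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_range=0.2):
    """PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_range, 1 + clip_range) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# A step with ratio 1.5 gets no credit beyond the 1.2 clip boundary,
# so large policy jumps earn no extra reward signal.
adv = np.array([1.0, 1.0])
print(ppo_clip_objective(np.array([1.1, 1.5]), adv))  # 1.15
```

This is what clip_range=0.2 controls: the policy can only benefit from moving within ±20% of its old action probabilities per update.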

Why asyncio? The gama-gymnasium library uses asynchronous I/O internally to communicate with GAMA's WebSocket server, so the train() function must be async and launched with asyncio.run().

Monitoring with a Callback​

To track episode rewards, we use a custom callback:

from stable_baselines3.common.callbacks import BaseCallback
import numpy as np

class RewardLoggerCallback(BaseCallback):
    def __init__(self):
        super().__init__()
        self.episode_rewards = []
        self.current_reward = 0.0

    def _on_step(self) -> bool:
        self.current_reward += self.locals["rewards"][0]
        if self.locals["dones"][0]:
            self.episode_rewards.append(self.current_reward)
            if len(self.episode_rewards) % 50 == 0:
                avg = np.mean(self.episode_rewards[-50:])
                print(f"Episode {len(self.episode_rewards)} | Avg Reward: {avg:.1f}")
            self.current_reward = 0.0
        return True

callback = RewardLoggerCallback()
model.learn(total_timesteps=100_000, callback=callback, progress_bar=True)

Complete Training Script​

"""
Smart Forager - PPO Training Script (Headless)
Author: Killian Trouillet

Usage:
1. Start GAMA headless: gama-headless.bat -socket 1001
2. Run this script: python train_forager.py
"""

import asyncio
import os
import time
from pathlib import Path

import gymnasium as gym
import gama_gymnasium
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np
import matplotlib.pyplot as plt


GAML_FILE = str(Path(__file__).parent / "forager_gym.gaml")
EXPERIMENT_NAME = "gym_env"
GAMA_PORT = 1001
TOTAL_TIMESTEPS = 50_000
SAVE_DIR = str(Path(__file__).parent / "saved_models")
LOG_DIR = str(Path(__file__).parent / "logs")


class RewardLoggerCallback(BaseCallback):
    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.episode_rewards = []
        self.current_reward = 0.0

    def _on_step(self) -> bool:
        self.current_reward += self.locals["rewards"][0]
        if self.locals["dones"][0]:
            self.episode_rewards.append(self.current_reward)
            if len(self.episode_rewards) % 50 == 0:
                avg = np.mean(self.episode_rewards[-50:])
                print(f" Episode {len(self.episode_rewards)} | "
                      f"Avg Reward (last 50): {avg:.1f}")
            self.current_reward = 0.0
        return True


def plot_training(rewards, save_path):
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(rewards, alpha=0.3, color="blue", label="Episode Reward")
    if len(rewards) >= 20:
        window = min(50, len(rewards))
        avg = np.convolve(rewards, np.ones(window) / window, mode="valid")
        plt.plot(range(window - 1, len(rewards)), avg,
                 color="red", linewidth=2, label=f"Moving Avg ({window} ep)")
    plt.xlabel("Episode")
    plt.ylabel("Total Reward")
    plt.title("PPO Training Progress - Smart Forager")
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 2, 2)
    if len(rewards) >= 20:
        plt.hist(rewards[-100:], bins=30, color="steelblue", edgecolor="white")
        plt.xlabel("Episode Reward")
        plt.ylabel("Count")
        plt.title("Reward Distribution (last 100 episodes)")
        plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig(save_path, dpi=150)
    print(f"Training plot saved to: {save_path}")
    plt.show()


async def train():
    os.makedirs(SAVE_DIR, exist_ok=True)
    os.makedirs(LOG_DIR, exist_ok=True)

    print("=" * 50)
    print(" Smart Forager - PPO Training (Headless)")
    print("=" * 50)
    print(f" GAML model: {GAML_FILE}")
    print(f" GAMA port: {GAMA_PORT}")
    print(f" Timesteps: {TOTAL_TIMESTEPS:,}")
    print("=" * 50)

    # Create the environment.
    # We use a retry loop because GAMA can take time to be 'ready'.
    env = None
    for attempt in range(5):
        try:
            print(f"Connecting to GAMA (attempt {attempt + 1}/5)...")
            env = gym.make(
                "gama_gymnasium_env/GamaEnv-v0",
                gaml_experiment_path=GAML_FILE,
                gaml_experiment_name=EXPERIMENT_NAME,
                gama_ip_address="localhost",
                gama_port=GAMA_PORT,
            )
            break
        except Exception as e:
            if "Unable to find" in str(e) or "NOTREADY" in str(e):
                print("GAMA not ready yet. Waiting 5s...")
                time.sleep(5)
            else:
                raise

    if env is None:
        print("Failed to connect to GAMA after several attempts.")
        return

    print(f"\nObservation space: {env.observation_space}")
    print(f"Action space: {env.action_space}")

    model = PPO(
        "MlpPolicy", env, verbose=1,
        learning_rate=3e-4, n_steps=2048, batch_size=64, n_epochs=10,
        gamma=0.99, gae_lambda=0.95, clip_range=0.2, ent_coef=0.01,
        tensorboard_log=LOG_DIR,
    )

    print("\n--- Training started ---\n")
    reward_callback = RewardLoggerCallback()
    model.learn(total_timesteps=TOTAL_TIMESTEPS, callback=reward_callback,
                progress_bar=True)

    model_path = os.path.join(SAVE_DIR, "ppo_forager")
    model.save(model_path)
    print(f"\nModel saved to: {model_path}")

    plot_path = os.path.join(SAVE_DIR, "training_rewards.png")
    plot_training(reward_callback.episode_rewards, plot_path)

    env.close()

    print("\n" + "=" * 50)
    print(" Training complete!")
    print(f" Total episodes: {len(reward_callback.episode_rewards)}")
    if reward_callback.episode_rewards:
        print(f" Final avg reward (last 50): "
              f"{np.mean(reward_callback.episode_rewards[-50:]):.1f}")
    print("=" * 50)


if __name__ == "__main__":
    asyncio.run(train())

Running the Training​

cd models/gym
python train_forager.py

What to Expect​

  1. 0-10k steps: The forager moves randomly. Most episodes time out (reward ≈ -5).
  2. 10k-30k steps: The forager starts approaching the food. Reward improves gradually.
  3. 30k-50k steps: The forager reliably reaches the food. Reward ≈ 90+.

A reward plot is saved automatically to saved_models/training_rewards.png.
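Once training finishes, the saved policy can be reloaded and run greedily to see how it behaves. A small evaluation helper, sketched against the standard Gymnasium step API; the commented usage assumes the saved-model path from the script above and a running GAMA server:

```python
def evaluate(model, env, n_episodes=5):
    """Run the deterministic (greedy) policy and return per-episode rewards."""
    totals = []
    for _ in range(n_episodes):
        obs, _info = env.reset()
        done, total = False, 0.0
        while not done:
            action, _state = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _info = env.step(action)
            total += reward
            done = terminated or truncated
        totals.append(total)
    return totals

# Typical usage (requires the trained model and a GAMA headless server):
#   from stable_baselines3 import PPO
#   model = PPO.load("saved_models/ppo_forager")
#   print(evaluate(model, env))
```

Setting deterministic=True disables the Gaussian sampling used during training, so evaluation reflects the learned mean policy rather than its exploration noise.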