Version: 🚧 Alpha 🚧

10. Headless Training with PPO

By Killian Trouillet


Starting GAMA Headless

Training uses GAMA in headless mode — no GUI, just a WebSocket server. This is much faster than running with the display.

Windows

gama-headless.bat -socket 1001

Linux / macOS

./gama-headless.sh -socket 1001

Wait for the message indicating the server is ready before running the Python script.
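
If training is launched from a script, one way to avoid starting too early is to poll the port until it accepts TCP connections. The helper below is a minimal sketch using only the standard library. `wait_for_port` is a hypothetical name, not part of GAMA or gama-gymnasium, and an open port only shows that something is listening, not that GAMA has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("localhost", 1001)` before creating the environment would block for up to 30 seconds and return False if nothing answers on the port passed to `-socket`.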

Port choice: Any port except 1000 (reserved for GUI). Common choices: 1001, 6868, 8080.


Understanding PPO

PPO (Proximal Policy Optimization) is a policy gradient algorithm. Unlike Q-Learning, which learns a value for each state-action pair, PPO directly learns a policy: a neural network that maps observations to actions.

PPO vs Q-Learning

| Aspect | Q-Learning (Part 1) | PPO (Part 2) |
| --- | --- | --- |
| What it learns | A table of Q-values | A neural network policy |
| Action selection | Pick max Q-value | Sample from policy distribution |
| Exploration | ε-greedy (random with decay) | Entropy bonus (natural noise) |
| State space | Finite (needs a table) | Infinite (network generalizes) |
| Action space | Discrete only | Discrete or continuous |
| Update rule | Bellman equation | Gradient ascent on policy |
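
The "Action selection" row can be illustrated with a small sketch in plain Python. The Q-values, means and standard deviations are made-up numbers, and both function names are hypothetical. The point is only the contrast between picking the argmax (with occasional random moves) and drawing from a distribution.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Q-Learning style: random action with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def sample_gaussian_policy(mean, std, rng):
    """PPO style (continuous): draw each action component from Normal(mean, std)."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


rng = random.Random(0)
q_row = [0.1, 0.7, 0.3]  # Q-values of one state, one entry per discrete action
greedy = epsilon_greedy(q_row, epsilon=0.0, rng=rng)  # epsilon = 0: always the argmax
action = sample_gaussian_policy([0.2, -0.4], [0.6, 0.6], rng)
print(greedy, action)
```

With ε-greedy, exploration is switched on and off by a coin flip. With a Gaussian policy, every action carries some noise, and the amount of noise is set by the standard deviation the network learns.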

Key PPO Hyperparameters

| Parameter | Value | Meaning |
| --- | --- | --- |
| lr | 3e-4 | How fast the network updates |
| gamma | 0.99 | Discount factor (same concept as Part 1) |
| K_epochs | 10 | Iterate 10 times over the collected data |
| eps_clip | 0.2 | PPO's key innovation: limits how much the policy can change per update |
| ent_coef | 0.01 | Entropy bonus: encourages exploration |
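
One convenient way to keep these values together is a small frozen dataclass. This is a sketch, not how train_forager.py necessarily stores them. `PPOConfig` is a hypothetical name, and only the five values from the table are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    lr: float = 3e-4        # optimizer step size
    gamma: float = 0.99     # discount factor
    K_epochs: int = 10      # gradient passes over each batch of collected data
    eps_clip: float = 0.2   # ratio is clipped to [1 - eps_clip, 1 + eps_clip]
    ent_coef: float = 0.01  # weight of the entropy bonus in the loss


config = PPOConfig()
low, high = 1 - config.eps_clip, 1 + config.eps_clip
print(f"probability ratio is clipped to [{low}, {high}]")
```

Freezing the dataclass means an accidental assignment such as `config.lr = 1.0` raises an error instead of silently changing a run.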

The ActorCritic Network

We build a small neural network in PyTorch that outputs both an action distribution (Actor) and a value estimate (Critic):

import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    def __init__(self, state_dim=13, action_dim=2, hidden=64):
        super().__init__()
        # Shared backbone: two Tanh layers feed both heads
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor_mean = nn.Linear(hidden, action_dim)
        # State-independent log std, learned as a free parameter
        self.actor_log_std = nn.Parameter(torch.full((action_dim,), -0.5))
        self.critic = nn.Linear(hidden, 1)

        # Small init: prevents tanh saturation early in training
        nn.init.orthogonal_(self.actor_mean.weight, gain=0.01)
        nn.init.zeros_(self.actor_mean.bias)

    def forward(self, x):
        h = self.shared(x)
        mean = torch.tanh(self.actor_mean(h))  # action means in [-1, 1]
        std = self.actor_log_std.exp().expand_as(mean)
        return mean, std, self.critic(h)

  • Shared backbone: Two hidden layers (64 units each, Tanh) process the observation
  • Actor head: Outputs a mean action vector bounded to [-1, 1] by tanh. The small orthogonal initialization (gain 0.01) keeps the outputs near 0 early in training, so the tanh does not saturate and gradients keep flowing
  • Critic head: Outputs a single value estimate (how good is this state?)
  • Normal distribution: Actions are sampled from Normal(mean, std), where std = exp(actor_log_std) is a learned parameter that does not depend on the state, giving smooth continuous control
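
To make the sampling step concrete, the sketch below reproduces in plain Python what a Normal(mean, std) policy provides: a sampled action, its log-probability, and the entropy used for the exploration bonus. It assumes near-zero means (as right after initialization) and the initial log std of -0.5. The helper names are hypothetical, and the real script presumably relies on PyTorch distributions instead.

```python
import math
import random


def normal_log_prob(x, mean, std):
    """Log-density of Normal(mean, std) at x."""
    return -((x - mean) ** 2) / (2 * std ** 2) - math.log(std) - 0.5 * math.log(2 * math.pi)


def normal_entropy(std):
    """Entropy of a one-dimensional Normal; it depends only on std."""
    return 0.5 + 0.5 * math.log(2 * math.pi) + math.log(std)


std = math.exp(-0.5)  # initial actor_log_std of -0.5
mean = [0.0, 0.0]     # assumed near-zero means, one per action dimension
rng = random.Random(0)
action = [rng.gauss(m, std) for m in mean]
# Independent dimensions: the joint log-probability is the sum over dimensions
log_prob = sum(normal_log_prob(a, m, std) for a, m in zip(action, mean))
entropy = len(mean) * normal_entropy(std)
print(f"std={std:.3f} action={action} log_prob={log_prob:.3f} entropy={entropy:.3f}")
```

Only the mean is bounded by tanh. A sampled action can still fall outside [-1, 1], and the excerpt above does not show whether the script clips it before sending it to GAMA.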

The PPO Update

We collect trajectories (states, actions, rewards) over one or more episodes. Each update then:

  1. Compute discounted returns: future reward from each step
  2. Compute advantages: returns − value estimates (how much better was reality vs prediction)
  3. Run K gradient epochs: update the network using the PPO clipped objective
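ratio = torch.exp(new_log_prob - old_log_prob)  # pi_new(a|s) / pi_old(a|s)
surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages
# Minimize: negated clipped objective + value loss - entropy bonus, averaged over the batch
loss = (-torch.min(surr1, surr2) + vf_coef * value_loss - ent_coef * entropy).mean()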

The clipping prevents the policy from changing too much in one update — that's what makes PPO stable.
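
The three steps can be traced by hand on a tiny batch. The sketch below uses plain Python floats and made-up rewards and value estimates, with gamma = 0.5 rather than 0.99 so the arithmetic is easy to follow. Restarting the running sum at each `done` flag is an assumption about how the buffer's episode boundaries are handled, and both function names are hypothetical.

```python
import math


def discounted_returns(rewards, dones, gamma=0.99):
    """Step 1: walk backwards, restarting the running sum at each episode end."""
    returns, running = [], 0.0
    for reward, done in zip(reversed(rewards), reversed(dones)):
        if done:
            running = 0.0
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def clipped_surrogate(new_log_prob, old_log_prob, advantage, eps_clip=0.2):
    """Step 3, per sample: min(ratio * A, clamp(ratio) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clamped = max(1 - eps_clip, min(ratio, 1 + eps_clip))
    return min(ratio * advantage, clamped * advantage)


# Two short episodes stored back to back (hypothetical numbers)
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]
dones = [False, False, True, False, True]
values = [0.5, 0.5, 0.5, 1.0, 1.0]  # critic estimates, also made up

returns = discounted_returns(rewards, dones, gamma=0.5)
advantages = [g - v for g, v in zip(returns, values)]  # step 2
print(returns)
print(advantages)

# Ratio 1.5 with a positive advantage is cut back to 1.2 * A
print(clipped_surrogate(math.log(1.5), 0.0, advantage=2.0))
# Ratio 1.5 with a negative advantage is not clipped: min keeps the lower value
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-2.0))
```

The last two calls show the asymmetry of the objective: the policy gains nothing from pushing the ratio past 1 + eps_clip on a good action, but it is still fully penalized for making a bad action more likely.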


The Training Script

Connecting to GAMA

import gymnasium as gym
import gama_gymnasium  # registers the environment

env = gym.make(
    "gama_gymnasium_env/GamaEnv-v0",
    gaml_experiment_path="path/to/forager_gym.gaml",
    gaml_experiment_name="gym_env",
    gama_ip_address="localhost",
    gama_port=1001,  # must match the -socket value used to start GAMA
)

Training Loop

import torch

# PPOAgent, RolloutBuffer and NUM_EPISODES come from the full script
agent = PPOAgent(state_dim=13, action_dim=2)
buffer = RolloutBuffer()
UPDATE_EVERY = 2048  # environment steps collected between PPO updates

total_steps = 0
for ep in range(1, NUM_EPISODES + 1):
    obs, _ = env.reset()
    done = False
    step = 0

    while not done and step < 300:  # cap each episode at 300 steps
        action, log_prob, value = agent.select_action(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Store one transition; the lists stay aligned index by index
        buffer.states.append(torch.FloatTensor(obs))
        buffer.actions.append(torch.FloatTensor(action))
        buffer.logprobs.append(torch.tensor(log_prob))
        buffer.values.append(torch.tensor(value))
        buffer.rewards.append(reward)
        buffer.dones.append(done)

        obs = next_obs
        step += 1
        total_steps += 1

    # PPO update once at least UPDATE_EVERY steps are buffered (data accumulates across episodes)
    if total_steps >= UPDATE_EVERY:
        agent.update(buffer)
        buffer.clear()
        total_steps = 0

agent.save("saved_models/ppo_forager.pth")
env.close()

Why asyncio? The gama-gymnasium library uses asynchronous I/O internally to communicate with GAMA's WebSocket server. In the full script, the training code therefore lives in an async train() function that is launched with asyncio.run().
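
The launch pattern itself is standard Python. The sketch below shows only its shape: the body of `train()` is a placeholder that awaits a no-op instead of talking to GAMA, and it is not taken from train_forager.py.

```python
import asyncio


async def train() -> int:
    """Placeholder for the training loop; the real one lives in train_forager.py."""
    episodes_run = 0
    for _ in range(3):
        # Stand-in for awaiting a reply from the simulation server
        await asyncio.sleep(0)
        episodes_run += 1
    return episodes_run


# asyncio.run creates an event loop, runs the coroutine to completion, then closes the loop
result = asyncio.run(train())
print(result)
```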


Running the Training

cd models/gym
python train_forager.py

What to Expect

  1. Ep 0-100: The forager moves randomly. Most episodes time out (reward ≈ -5).
  2. Ep 100-300: The forager starts approaching the food. Reward improves gradually.
  3. Ep 300-500: The forager reliably reaches the food. Reward ≈ 90+.

A reward plot is saved automatically to saved_models/training_rewards.png.


Complete Training Script

See models/gym/train_forager.py for the full implementation. Key components:

| Component | Purpose |
| --- | --- |
| ActorCritic | Neural network (shared backbone + actor/critic heads) |
| RolloutBuffer | Stores trajectory data (states, actions, rewards, etc.) |
| PPOAgent | Wraps the network with action selection + PPO update |
| plot_training() | Saves reward curves after training |
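
As an illustration of the simplest of these components, here is one way a RolloutBuffer compatible with the training loop could look. The field names and `clear()` come from the loop shown earlier. The dataclass layout and `__len__` are assumptions, and the version in train_forager.py may differ.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutBuffer:
    """Parallel lists, one entry per environment step."""
    states: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    logprobs: list = field(default_factory=list)
    values: list = field(default_factory=list)
    rewards: list = field(default_factory=list)
    dones: list = field(default_factory=list)

    def clear(self) -> None:
        # Empty every list in place after an update so old data is not reused
        for name in ("states", "actions", "logprobs", "values", "rewards", "dones"):
            getattr(self, name).clear()

    def __len__(self) -> int:
        return len(self.rewards)


buffer = RolloutBuffer()
buffer.rewards.append(1.0)
buffer.dones.append(False)
print(len(buffer))
buffer.clear()
print(len(buffer))
```

Clearing after every update matters because PPO is on-policy: data gathered under an older policy would make the probability ratios in the clipped objective misleading.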