Version: 🚧 Alpha 🚧

10. Headless Training with PPO

By Killian Trouillet

Starting GAMA Headless

Training uses GAMA in headless mode — no GUI, just a WebSocket server. This is much faster than running with the display.

Windows

gama-headless.bat -socket 1001

Linux / MacOS

./gama-headless.sh -socket 1001

Wait for the message indicating the server is ready before running the Python script.

Port choice: Any port except 1000 (reserved for GUI). Common choices: 1001, 6868, 8080.
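Instead of watching the console, a script can poll the port until it accepts TCP connections. The sketch below uses only the Python standard library. `wait_for_port` is a hypothetical helper, not part of GAMA or gama-gymnasium, and an open port only shows that something is listening, not that GAMA has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("localhost", 1001)` before creating the environment would then block until the headless server's port is reachable, or give up after 30 seconds.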

Understanding PPO

PPO (Proximal Policy Optimization) is a policy gradient algorithm. Unlike Q-Learning, which learns values for state-action pairs, PPO directly learns a policy: a neural network that maps observations to actions.

PPO vs Q-Learning

| Aspect | Q-Learning (Part 1) | PPO (Part 2) |
|---|---|---|
| What it learns | A table of Q-values | A neural network policy |
| Action selection | Pick the action with the highest Q-value | Sample from the policy distribution |
| Exploration | ε-greedy (random actions, with ε decaying over time) | Stochastic policy plus an entropy bonus |
| State space | Finite and small (needs a table) | Continuous or very large (the network generalizes) |
| Action space | Discrete only | Discrete or continuous |
| Update rule | Bellman equation | Gradient ascent on the policy objective |
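The "Action selection" row can be made concrete with a small sketch for a discrete action space. Both helpers are hypothetical and use only the standard library. The forager in this tutorial uses continuous actions, so the softmax sampling here is for contrast only.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Q-Learning style: random action with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sample_from_policy(logits, rng):
    """Policy-gradient style: draw an action index from softmax(logits)."""
    top = max(logits)
    # Subtracting the max keeps exp() from overflowing
    weights = [math.exp(value - top) for value in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
print(epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0, rng=rng))  # greedy choice: index 1
print(sample_from_policy([0.1, 0.7, 0.2], rng))               # any index, 1 is most likely
```

With ε-greedy, exploration is a separate switch on top of the value table. With a sampled policy, the randomness is part of the policy itself and shrinks as the distribution sharpens.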

Key PPO Hyperparameters

| Parameter | Value | Meaning |
|---|---|---|
| `lr` | `3e-4` | Learning rate: the optimizer's step size |
| `gamma` | `0.99` | Discount factor (same concept as Part 1) |
| `K_epochs` | `10` | Number of optimization passes over each collected batch |
| `eps_clip` | `0.2` | Clipping range for the probability ratio. PPO's key innovation: it limits how much the policy can change per update |
| `ent_coef` | `0.01` | Weight of the entropy bonus, which encourages exploration |
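One way to keep these values together is a small frozen dataclass. `PPOConfig` is a hypothetical name used for illustration; the tutorial's script may organize its hyperparameters differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    lr: float = 3e-4        # optimizer learning rate
    gamma: float = 0.99     # discount factor
    K_epochs: int = 10      # optimization passes per collected batch
    eps_clip: float = 0.2   # clipping range for the probability ratio
    ent_coef: float = 0.01  # weight of the entropy bonus


cfg = PPOConfig()
# With gamma = 0.99, a reward 100 steps ahead is weighted by 0.99**100 (about 0.37)
print(cfg.gamma ** 100)
```

The printed weight gives a feel for the planning horizon: rewards a few hundred steps away still count, but far less than immediate ones.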

The ActorCritic Network

We build a small neural network in PyTorch that outputs both an action (Actor) and a value estimate (Critic):

import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    def __init__(self, state_dim=13, action_dim=2, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor_mean = nn.Linear(hidden, action_dim)
        self.actor_log_std = nn.Parameter(torch.full((action_dim,), -0.5))
        self.critic = nn.Linear(hidden, 1)

        # Small init — prevents tanh saturation early in training
        nn.init.orthogonal_(self.actor_mean.weight, gain=0.01)
        nn.init.zeros_(self.actor_mean.bias)

    def forward(self, x):
        h = self.shared(x)
        mean = torch.tanh(self.actor_mean(h))   # action mean bounded to [-1, 1]
        std = self.actor_log_std.exp().expand_as(mean)
        return mean, std, self.critic(h)

Shared backbone: Two hidden layers (64 neurons each, Tanh) process the observation.
Actor head: Outputs a mean action vector bounded to [-1, 1] by tanh. The small orthogonal initialization keeps outputs near 0 early in training, which avoids tanh saturation and the vanishing gradients that come with it.
Critic head: Outputs a single value estimate (how good is this state?).
Normal distribution: Actions are sampled from Normal(mean, std), which gives smooth continuous control. Only the mean is bounded, so a sampled action can fall outside [-1, 1].
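To show what sampling from Normal(mean, std) involves, here is a standard-library sketch of drawing an action and computing its log-probability, which PPO stores for the later ratio computation. `gaussian_log_prob` is a hypothetical helper. A PyTorch implementation would more likely rely on `torch.distributions.Normal`, but that is an assumption about the full script.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Sum of per-dimension Normal log-densities (dimensions are independent)."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


rng = random.Random(0)
mean = [0.0, 0.0]            # near zero early on, thanks to the small init
std = [math.exp(-0.5)] * 2   # actor_log_std starts at -0.5, so std is about 0.61
action = [rng.gauss(m, s) for m, s in zip(mean, std)]
log_prob = gaussian_log_prob(action, mean, std)
print(action, log_prob)
```

Because `actor_log_std` is a learned parameter, the spread of sampled actions can narrow as training progresses.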

The PPO Update

We collect transitions (states, actions, rewards) until the rollout buffer holds UPDATE_EVERY steps, which usually spans several episodes, then:

1. Compute discounted returns: the discounted sum of future rewards from each step
2. Compute advantages: returns − value estimates (how much better the outcome was than the critic predicted)
3. Run K gradient epochs: update the network using the PPO clipped objective
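The first two steps can be sketched in plain Python. This is a minimal illustration with made-up numbers, not the code from the tutorial's script. It assumes the buffer stores a `done` flag per step so that returns do not leak across episode boundaries.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Walk the trajectory backwards; reset the running return at episode ends."""
    returns = []
    running = 0.0
    for reward, done in zip(reversed(rewards), reversed(dones)):
        if done:
            running = 0.0
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


rewards = [0.0, 0.0, 1.0]
dones = [False, False, True]
values = [0.5, 0.6, 0.9]   # made-up critic predictions

returns = discounted_returns(rewards, dones, gamma=0.5)
advantages = [g - v for g, v in zip(returns, values)]
print(returns)      # [0.25, 0.5, 1.0]
print(advantages)   # negative where the critic was too optimistic
```

Many PPO implementations also normalize the advantages or use generalized advantage estimation. Whether the tutorial's script does so is not stated here.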

ratio = exp(new_log_prob - old_log_prob)
surr1 = ratio * advantages
surr2 = clamp(ratio, 1 - eps, 1 + eps) * advantages
loss = -min(surr1, surr2) + vf_coef * value_loss - ent_coef * entropy

The clipping prevents the policy from changing too much in one update — that's what makes PPO stable.
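A scalar example shows the clipping at work. This is a per-sample sketch of the surrogate term only (no value loss or entropy term), written with the standard library; `clipped_surrogate` is a hypothetical helper.

```python
import math


def clipped_surrogate(new_log_prob, old_log_prob, advantage, eps_clip=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A (about 2.4)
print(clipped_surrogate(math.log(1.5), 0.0, advantage=2.0))
# Ratio 1.5 with a negative advantage: no cap, the full penalty applies (about -3.0)
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-2.0))
```

Once the ratio leaves the range [1 - eps, 1 + eps] in the direction that would improve the objective, the clipped term is constant, so that sample stops contributing gradient.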

The Training Script

Connecting to GAMA

import gymnasium as gym
import gama_gymnasium  # registers the environment

env = gym.make(
    "gama_gymnasium_env/GamaEnv-v0",
    gaml_experiment_path="path/to/forager_gym.gaml",
    gaml_experiment_name="gym_env",
    gama_ip_address="localhost",
    gama_port=1001,
)

Training Loop

import torch

NUM_EPISODES = 500    # assumed value; covers the episode range described under "What to Expect"
MAX_STEPS = 300       # per-episode step cap
UPDATE_EVERY = 2048   # environment steps collected between PPO updates

agent = PPOAgent(state_dim=13, action_dim=2)
buffer = RolloutBuffer()

steps_since_update = 0
for ep in range(1, NUM_EPISODES + 1):
    obs, _ = env.reset()
    done = False
    step = 0

    while not done and step < MAX_STEPS:
        action, log_prob, value = agent.select_action(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Store the transition for the next PPO update
        buffer.states.append(torch.FloatTensor(obs))
        buffer.actions.append(torch.FloatTensor(action))
        buffer.logprobs.append(torch.tensor(log_prob))
        buffer.values.append(torch.tensor(value))
        buffer.rewards.append(reward)
        buffer.dones.append(done)

        obs = next_obs
        step += 1
        steps_since_update += 1

        # PPO update every UPDATE_EVERY steps (data accumulates across episodes)
        if steps_since_update >= UPDATE_EVERY:
            agent.update(buffer)
            buffer.clear()
            steps_since_update = 0

agent.save("saved_models/ppo_forager.pth")
env.close()
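The loop only needs the buffer to expose six lists and a `clear()` method. A minimal sketch consistent with that usage is shown below; the actual class in train_forager.py may differ, and `__len__` is an addition for convenience.

```python
class RolloutBuffer:
    """Plain lists for one batch of transitions; cleared after each PPO update."""

    def __init__(self):
        self.states = []
        self.actions = []
        self.logprobs = []
        self.values = []
        self.rewards = []
        self.dones = []

    def __len__(self):
        return len(self.rewards)

    def clear(self):
        # Empty every list in place so the next rollout starts fresh
        for field in (self.states, self.actions, self.logprobs,
                      self.values, self.rewards, self.dones):
            field.clear()
```

Clearing after every update matters because PPO is on-policy: transitions collected under an older policy should not be reused in later updates.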

Why asyncio? The gama-gymnasium library uses asynchronous I/O internally to communicate with GAMA's WebSocket server. The snippets above show only the core logic; in the full script the train() function must be async and launched with asyncio.run().

Running the Training

cd models/gym
python train_forager.py

What to Expect

Ep 0-100: The forager moves randomly. Most episodes time out (reward ≈ -5).
Ep 100-300: The forager starts approaching the food. Reward improves gradually.
Ep 300-500: The forager reliably reaches the food. Reward ≈ 90+.

A reward plot is saved automatically to saved_models/training_rewards.png.

Complete Training Script

See models/gym/train_forager.py for the full implementation. Key components:

| Component | Purpose |
|---|---|
| `ActorCritic` | Neural network (shared backbone + actor/critic heads) |
| `RolloutBuffer` | Stores trajectory data (states, actions, rewards, etc.) |
| `PPOAgent` | Wraps the network with action selection + PPO update |
| `plot_training()` | Saves reward curves after training |
