9. Sensors, Movement & Rewards
By Killian Trouillet
In Step 8 we defined the GymAgent bridge with stub actions. Now we implement the three forager actions that make the bridge actually work: movement, observation, and reward computation.
The Forager Species
The forager agent has three responsibilities:
- Apply the action received from Python (movement)
- Compute its observation (what it "sees")
- Compute the reward (how well it did)
species forager {
    float sensor_range <- 30.0;
    float previous_distance <- 0.0;
    GymAgent gym_agent <- GymAgent[0];

    init {
        previous_distance <- location distance_to food_location;
        do compute_observation();
    }
}
Continuous Movement
Action Format
Python sends a 2D velocity vector [dx, dy], where each component is in [-1, 1].
action apply_action() {
    if (empty(gym_agent.next_action)) { return; }
    list act <- list(gym_agent.next_action);
    if (length(act) < 2) { return; }
    float dx <- float(act[0]); // normalized x velocity [-1, 1]
    float dy <- float(act[1]); // normalized y velocity [-1, 1]

    // Scale: max movement = 3.0 units per step
    float move_x <- dx * 3.0;
    float move_y <- dy * 3.0;
    float new_x <- location.x + move_x;
    float new_y <- location.y + move_y;

    // Clamp to world bounds
    new_x <- max([1.0, min([world_size - 1.0, new_x])]);
    new_y <- max([1.0, min([world_size - 1.0, new_y])]);
    point new_loc <- {new_x, new_y};

    // Check obstacle collision
    bool blocked <- false;
    loop obs over: obstacles {
        if (new_loc intersects obs) {
            blocked <- true;
            break;
        }
    }
    if (!blocked) {
        location <- new_loc;
    }
    current_step <- current_step + 1;
}
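On the Python side, this action format corresponds to a two-dimensional Box space. A minimal sketch using the standard Gymnasium API (the space mirrors the GAML action_space map declared in the complete model below; the sampled values are illustrative):

from gymnasium import spaces
import numpy as np

# Mirrors the GAML action_space map: a 2D Box with components in [-1, 1]
action_space = spaces.Box(low=np.array([-1.0, -1.0]),
                          high=np.array([1.0, 1.0]),
                          dtype=np.float32)

action = action_space.sample()  # e.g. array([0.42, -0.87], dtype=float32)
# The four Part 1 moves are special points of this space (y grows downward in GAMA):
# right = [1, 0], down = [0, 1], left = [-1, 0], up = [0, -1]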
Compared to Part 1
| Part 1 | Part 2 |
|---|---|
| switch direction { match 0 { new_y <- new_y - 1; } ... } | location <- location + {dx * 3, dy * 3} |
| 4 possible actions | Infinite actions (any direction + speed) |
| Always moves exactly 1 cell | Moves 0 to 3 units per step |
The Observation Vector (13 values)
The forager builds a rich observation using its position, food direction, and obstacle sensors.
Position & Food (5 values)
float dist <- location distance_to food_location;
float angle <- location towards food_location;
gym_agent.state <- [
    location.x / world_size,                  // [0] Normalized X
    location.y / world_size,                  // [1] Normalized Y
    min([1.0, dist / (world_size * 1.414)]),  // [2] Normalized distance to food (max = diagonal)
    (cos(angle) + 1.0) / 2.0,                 // [3] Food direction X (normalized)
    (sin(angle) + 1.0) / 2.0                  // [4] Food direction Y (normalized)
];
Why normalize? Neural networks learn much better when all inputs are in the same range [0, 1]. Raw values like x = 85.0 and dist = 120.0 would put features on very different scales and create an imbalance.
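As a quick sanity check, here are the same five values written out in Python with hypothetical inputs (note that GAMA's cos and sin take degrees, hence the radians conversion):

import math

world_size = 100.0
x, y = 85.0, 40.0  # hypothetical agent position
dist = 120.0       # hypothetical distance to food
angle = 60.0       # hypothetical angle toward food, in degrees

obs_head = [
    x / world_size,                                # 0.85
    y / world_size,                                # 0.40
    min(1.0, dist / (world_size * math.sqrt(2))),  # ~0.85 (the diagonal is the largest possible distance)
    (math.cos(math.radians(angle)) + 1.0) / 2.0,   # 0.75
    (math.sin(math.radians(angle)) + 1.0) / 2.0,   # ~0.93
]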
Ray-Cast Sensors (8 values)
The forager casts 8 rays around itself (every 45°) to detect obstacles:
         270°
   225°  \ | /  315°
180° ──────●────── 0°
   135°  / | \  45°
         90°

Angles follow GAMA's screen convention: 0° points right and 90° points down, since the y axis grows downward.
list<float> sensors;
loop i from: 0 to: 7 {
    float ray_angle <- float(i) * 45.0;
    point ray_end <- {
        location.x + sensor_range * cos(ray_angle),
        location.y + sensor_range * sin(ray_angle)
    };
    geometry ray <- line([location, ray_end]);
    float closest <- 1.0; // 1.0 = nothing detected
    loop obs over: obstacles {
        if (ray intersects obs) {
            geometry inter <- ray inter obs;
            if (inter != nil) {
                float d <- (location distance_to inter.location) / sensor_range;
                closest <- min([closest, d]);
            }
        }
    }
    add closest to: sensors;
}
Each sensor returns a value between 0 and 1:
- 0.0 = obstacle is touching the agent
- 0.5 = obstacle is at half sensor range
- 1.0 = no obstacle detected in this direction
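For intuition, here is the same single-ray computation in Python using shapely (an assumption for illustration only; the bridge does not use shapely, and GAMA's intersects and inter operators play the role of intersection here):

from shapely.geometry import LineString, Point, box
import math

sensor_range = 30.0
origin = Point(10.0, 25.0)              # hypothetical agent position
obstacle = box(20.0, 20.0, 30.0, 30.0)  # a 10x10 square, like the model's obstacles

ray_angle = 0.0                         # sensor 0 looks along +x
end = (origin.x + sensor_range * math.cos(math.radians(ray_angle)),
       origin.y + sensor_range * math.sin(math.radians(ray_angle)))
ray = LineString([(origin.x, origin.y), end])

closest = 1.0                           # 1.0 = nothing detected
inter = ray.intersection(obstacle)
if not inter.is_empty:
    # like the GAML code, measure to the intersection's centroid (inter.location)
    closest = min(closest, origin.distance(inter.centroid) / sensor_range)

print(closest)  # 0.5 -> the obstacle sits at half the sensor range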
Complete Observation Table
| Index | Value | Range | Description |
|---|---|---|---|
| 0 | x / world_size | [0, 1] | Agent's X position |
| 1 | y / world_size | [0, 1] | Agent's Y position |
| 2 | dist / max_dist | [0, 1] | Distance to food |
| 3 | (cos(angle)+1)/2 | [0, 1] | Food direction (X) |
| 4 | (sin(angle)+1)/2 | [0, 1] | Food direction (Y) |
| 5 | sensor 0° | [0, 1] | Obstacle right |
| 6 | sensor 45° | [0, 1] | Obstacle down-right |
| 7 | sensor 90° | [0, 1] | Obstacle down |
| 8 | sensor 135° | [0, 1] | Obstacle down-left |
| 9 | sensor 180° | [0, 1] | Obstacle left |
| 10 | sensor 225° | [0, 1] | Obstacle up-left |
| 11 | sensor 270° | [0, 1] | Obstacle up |
| 12 | sensor 315° | [0, 1] | Obstacle up-right |
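On the Python side this table corresponds to a single 13-dimensional Box, which is exactly what the GAML observation_space map below declares (a sketch with the standard Gymnasium API):

from gymnasium import spaces
import numpy as np

# 13 values, each normalized to [0, 1]
observation_space = spaces.Box(low=0.0, high=1.0, shape=(13,), dtype=np.float32)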
Reward Shaping
A well-designed reward function is critical for learning. We use reward shaping to guide the agent.
action compute_reward() {
    float dist <- location distance_to food_location;

    // Goal reached!
    if (dist < food_radius) {
        gym_agent.reward <- 100.0;
        gym_agent.terminated <- true;
        gym_agent.truncated <- false;
        gym_agent.info <- [];
        ask gym_agent { do update_data(); }
        return;
    }

    // Timeout
    if (current_step >= max_steps) {
        gym_agent.reward <- -1.0;
        gym_agent.terminated <- false;
        gym_agent.truncated <- true;
        gym_agent.info <- [];
        ask gym_agent { do update_data(); }
        return;
    }

    // Shaping: encourage getting closer
    float delta <- previous_distance - dist;
    gym_agent.reward <- delta * 2.0 - 0.01;
    previous_distance <- dist;
    gym_agent.terminated <- false;
    gym_agent.truncated <- false;
    gym_agent.info <- [];
    ask gym_agent { do update_data(); }
}
Reward Breakdown
| Event | Reward | Why? |
|---|---|---|
| Reach food | +100.0 | Large positive reward for completing the task |
| Get closer | +delta × 2.0 | Encourages approach. Moving 1.5 units closer gives delta = 1.5, so reward = +2.99 after the step penalty |
| Get farther | delta × 2.0 (negative) | Penalizes moving away from food |
| Each step | −0.01 | Small living penalty to encourage speed |
| Timeout | −1.0 | Mild penalty for not finding food in 300 steps |
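A worked example of the per-step computation, with hypothetical distances:

previous_distance = 50.0          # hypothetical distance at the previous step
dist = 48.5                       # the agent moved 1.5 units closer
delta = previous_distance - dist  # +1.5
reward = delta * 2.0 - 0.01       # +2.99

dist = 51.0                       # ...or 1.0 unit farther away
delta = previous_distance - dist  # -1.0
reward = delta * 2.0 - 0.01       # -2.01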
terminated vs truncated
This is a Gymnasium convention:
- terminated = true: the episode ended because of the environment logic (food found).
- truncated = true: the episode ended because of a time limit (max steps reached).
Python handles them identically (both trigger env.reset()), but they mean different things for logging.
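A minimal sketch of the standard Gymnasium loop, where env is assumed to be the bridged environment instance; both flags funnel into the same reset:

obs, info = env.reset()
for _ in range(10_000):
    action = env.action_space.sample()  # or: a trained policy's output
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:         # food found OR 300-step timeout
        obs, info = env.reset()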
Forager Aspect
aspect default {
    draw circle(3) color: #blue;
    draw circle(sensor_range) color: rgb(0, 0, 255, 30) border: rgb(0, 0, 255, 80);
}
The forager is drawn as a blue circle with a semi-transparent sensor range indicator.
Complete Model
/**
 * Name: SmartForagerGym - Continuous Forager for Gymnasium
 * Author: Killian Trouillet
 * Description: A continuous-world version of the Smart Forager that exposes
 *              itself as a Gymnasium environment via the gama-gymnasium bridge.
 * Tags: reinforcement-learning, gymnasium, ppo, continuous, tutorial
 */
model SmartForagerGym

global {
    float world_size <- 100.0;
    point food_location <- {95.0, 95.0}; // Native cell (9,9)
    float food_radius <- 5.0;
    list<geometry> obstacles <- [];
    int max_steps <- 300;
    int current_step <- 0;
    int gama_server_port <- 0;

    init {
        // Same 6 obstacle cells as native, each = 10×10 square
        obstacles << square(10) at_location {25.0, 25.0}; // cell (2,2)
        obstacles << square(10) at_location {35.0, 25.0}; // cell (3,2)
        obstacles << square(10) at_location {25.0, 35.0}; // cell (2,3)
        obstacles << square(10) at_location {65.0, 45.0}; // cell (6,4)
        obstacles << square(10) at_location {75.0, 45.0}; // cell (7,4)
        obstacles << square(10) at_location {75.0, 55.0}; // cell (7,5)

        create GymAgent;
        GymAgent[0].action_space <- [
            "type"::"Box",
            "low"::[-1.0, -1.0],
            "high"::[1.0, 1.0],
            "dtype"::"float"
        ];
        GymAgent[0].observation_space <- [
            "type"::"Box",
            "low"::list_with(13, 0.0),
            "high"::list_with(13, 1.0),
            "dtype"::"float"
        ];

        create forager number: 1 {
            location <- {5.0, 5.0}; // Native cell (0,0)
        }
    }

    reflex main_loop {
        if (empty(GymAgent[0].next_action)) { return; }
        ask forager[0] {
            do apply_action();
            do compute_observation();
            do compute_reward();
        }
    }
}
species GymAgent {
    map<string, unknown> action_space;
    map<string, unknown> observation_space;
    list<float> state;
    float reward;
    bool terminated;
    bool truncated;
    map<string, unknown> info;
    list<float> next_action;
    map<string, unknown> data;

    // Action to receive action values from Python (avoids list literal bug in expression API)
    action set_action(float dx, float dy) {
        next_action <- [dx, dy];
        return "ok";
    }

    action update_data() {
        data <- [
            "State"::state,
            "Reward"::reward,
            "Terminated"::terminated,
            "Truncated"::truncated,
            "Info"::info
        ];
    }
}
species forager {
    float sensor_range <- 30.0;
    float previous_distance <- 0.0;
    GymAgent gym_agent <- GymAgent[0];

    init {
        previous_distance <- location distance_to food_location;
        do compute_observation();
    }

    action apply_action() {
        if (empty(gym_agent.next_action)) { return; }
        list act <- list(gym_agent.next_action);
        if (length(act) < 2) { return; }
        float dx <- float(act[0]);
        float dy <- float(act[1]);
        float move_x <- dx * 3.0;
        float move_y <- dy * 3.0;
        float new_x <- location.x + move_x;
        float new_y <- location.y + move_y;
        new_x <- max([1.0, min([world_size - 1.0, new_x])]);
        new_y <- max([1.0, min([world_size - 1.0, new_y])]);
        point new_loc <- {new_x, new_y};
        bool blocked <- false;
        loop obs over: obstacles {
            if (new_loc intersects obs) {
                blocked <- true;
                break;
            }
        }
        if (!blocked) {
            location <- new_loc;
        }
        current_step <- current_step + 1;
    }

    action compute_observation() {
        float dist <- location distance_to food_location;
        float angle <- location towards food_location;
        list<float> sensors;
        loop i from: 0 to: 7 {
            float ray_angle <- float(i) * 45.0;
            point ray_end <- {
                location.x + sensor_range * cos(ray_angle),
                location.y + sensor_range * sin(ray_angle)
            };
            geometry ray <- line([location, ray_end]);
            float closest <- 1.0;
            loop obs over: obstacles {
                if (ray intersects obs) {
                    geometry inter <- ray inter obs;
                    if (inter != nil) {
                        float d <- (location distance_to inter.location) / sensor_range;
                        closest <- min([closest, d]);
                    }
                }
            }
            add closest to: sensors;
        }
        gym_agent.state <- [
            location.x / world_size,
            location.y / world_size,
            min([1.0, dist / (world_size * 1.414)]),
            (cos(angle) + 1.0) / 2.0,
            (sin(angle) + 1.0) / 2.0
        ] + sensors;
    }

    action compute_reward() {
        float dist <- location distance_to food_location;
        if (dist < food_radius) {
            gym_agent.reward <- 100.0;
            gym_agent.terminated <- true;
            gym_agent.truncated <- false;
            gym_agent.info <- [];
            ask gym_agent { do update_data(); }
            return;
        }
        if (current_step >= max_steps) {
            gym_agent.reward <- -1.0;
            gym_agent.terminated <- false;
            gym_agent.truncated <- true;
            gym_agent.info <- [];
            ask gym_agent { do update_data(); }
            return;
        }
        float delta <- previous_distance - dist;
        gym_agent.reward <- delta * 2.0 - 0.01;
        previous_distance <- dist;
        gym_agent.terminated <- false;
        gym_agent.truncated <- false;
        gym_agent.info <- [];
        ask gym_agent { do update_data(); }
    }

    aspect default {
        draw circle(3) color: #blue;
        draw circle(sensor_range) color: rgb(0, 0, 255, 30) border: rgb(0, 0, 255, 80);
    }
}
experiment gym_env {
    parameter "communication_port" var: gama_server_port;

    output {
        display "Continuous World" type: 2d {
            graphics "background" {
                draw rectangle(world_size, world_size) at: {world_size/2, world_size/2} color: rgb(240, 240, 240);
            }
            graphics "obstacles" {
                loop obs over: obstacles {
                    draw obs color: rgb(80, 80, 80);
                }
            }
            graphics "food" {
                draw circle(food_radius) at: food_location color: rgb(50, 180, 50);
            }
            species forager;
        }
    }
}