PPO Agent playing Pyramids
This is a trained model of a PPO agent playing Pyramids using the Unity ML-Agents Library.
Usage (with ML-Agents)
The Documentation: https://unity-technologies.github.io/ml-agents/ML-Agents-Toolkit-Documentation/
We wrote a complete tutorial to learn to train your first agent using ML-Agents and publish it to the Hub:
- A short tutorial where you teach Huggy the Dog 🐶 to fetch the stick and then play with him directly in your browser: https://huggingface.co/learn/deep-rl-course/unitbonus1/introduction
- A longer tutorial to understand how ML-Agents works: https://huggingface.co/learn/deep-rl-course/unit5/introduction
Resume the training
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
Watch your Agent play
You can watch your agent playing directly in your browser
- If the environment is part of ML-Agents official environments, go to https://huggingface.co/unity
- Step 1: Find your model_id: jetfan-xin/ppo-Pyramids
- Step 2: Select your .nn /.onnx file
- Click on Watch the agent play 👀
PPO Agent Trained on Unity Pyramids Environment
This repository contains a reinforcement learning agent trained using Proximal Policy Optimization (PPO) on Unity's Pyramids environment via ML-Agents.
Model Overview
- Algorithm: PPO with RND (Random Network Distillation)
- Environment: Unity Pyramids (3D sparse-reward maze)
- Framework: ML-Agents v1.2.0.dev0
- Backend: PyTorch 2.7.1 (CUDA-enabled)
The agent learns to navigate a 3D maze and reach the goal area by combining extrinsic and intrinsic rewards.
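As a rough illustration of the intrinsic-reward idea, the sketch below pairs a fixed, randomly initialized target network with a trained predictor network; the predictor's error on an observation acts as a novelty bonus that is added to the environment reward with a small weight (0.01, matching the RND strength in the config below). This is a simplified PyTorch sketch, not ML-Agents' actual RND implementation, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

# Minimal RND sketch (illustrative only, not ML-Agents' code).
# Observations the predictor has not yet learned to match look "novel"
# and receive a larger intrinsic reward.
obs_dim, embed_dim = 172, 64  # placeholder sizes, not the real Pyramids spec

target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))
for p in target.parameters():
    p.requires_grad_(False)  # the target network stays fixed

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    # Prediction error between predictor and fixed target = novelty signal.
    return (predictor(obs) - target(obs)).pow(2).mean(dim=-1)

def combined_reward(extrinsic: torch.Tensor, obs: torch.Tensor,
                    rnd_strength: float = 0.01) -> torch.Tensor:
    # PPO is then trained on extrinsic + strength * intrinsic.
    return extrinsic + rnd_strength * intrinsic_reward(obs).detach()

# The predictor is trained to shrink its error on visited observations:
obs_batch = torch.randn(32, obs_dim)
loss = intrinsic_reward(obs_batch).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```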
How to Use This Model
You can use the .onnx model directly in Unity.
Steps:
1. Download the model: clone the repository (or just download Pyramids.onnx):
   git lfs install
   git clone https://huggingface.co/jetfan-xin/ppo-Pyramids
2. Place the model file in your Unity project, under:
   Assets/ML-Agents/Examples/Pyramids/Pyramids.onnx
3. Assign it in the Unity Editor:
   - Select your agent GameObject.
   - In Behavior Parameters, assign Pyramids.onnx as the model.
   - Make sure the Behavior Name matches your training config.
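Outside Unity, the exported network can also be sanity-checked with onnxruntime. This is a minimal sketch that assumes the onnxruntime package is installed; the exact input/output names depend on the ML-Agents exporter, so the script simply lists them.

```python
import onnxruntime as ort

# Load the exported policy and list its inputs and outputs.
session = ort.InferenceSession("Pyramids.onnx")

for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```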
Training Configuration
Key settings from configuration.yaml:
- trainer_type: ppo
- max_steps: 1000000
- batch_size: 128, buffer_size: 2048
- learning_rate: 3e-4
- reward_signals:
  - extrinsic: γ=0.99, strength=1.0
  - rnd: γ=0.99, strength=0.01
- hidden_units: 512, num_layers: 2
- summary_freq: 30000

See configuration.yaml for full details.
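To cross-check these values against the file itself, the config can be loaded with PyYAML. This is a sketch that assumes PyYAML is installed and that the file follows the standard ML-Agents layout with a top-level behaviors section; adjust the keys if the file differs.

```python
import yaml

# Print the PPO hyperparameters and reward signals for the Pyramids behavior.
# Assumes the standard ML-Agents config layout (top-level "behaviors" key).
with open("configuration.yaml") as f:
    cfg = yaml.safe_load(f)

behavior = cfg.get("behaviors", {}).get("Pyramids", {})
print("trainer_type:   ", behavior.get("trainer_type"))
print("hyperparameters:", behavior.get("hyperparameters"))
print("reward_signals: ", behavior.get("reward_signals"))
```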
Training Performance
Sample rewards from training log:
| Step | Mean Reward |
|---|---|
| 300,000 | -0.22 |
| 480,000 | 0.35 |
| 660,000 | 1.14 |
| 840,000 | 1.47 |
| 990,000 | 1.54 |
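For a quick visual check, the sampled rewards above can be plotted. This is a small sketch that assumes matplotlib is installed; the values are taken directly from the table.

```python
import matplotlib.pyplot as plt

# Sampled mean rewards from the training log below.
steps = [300_000, 480_000, 660_000, 840_000, 990_000]
mean_rewards = [-0.22, 0.35, 1.14, 1.47, 1.54]

plt.plot(steps, mean_rewards, marker="o")
plt.xlabel("Training step")
plt.ylabel("Mean reward")
plt.title("PPO + RND on Pyramids")
plt.show()
```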
Full training log:
(rl_py310) 4xin@ltgpu3:~/deep_rl/unit5/ml-agents$ CUDA_VISIBLE_DEVICES=3 mlagents-learn ./config/ppo/PyramidsRND.yaml \
--env=./training-envs-executables/linux/Pyramids/Pyramids.x86_64 \
--run-id="PyramidsGPUTest" \
--no-graphics
Version information:
ml-agents: 1.2.0.dev0,
ml-agents-envs: 1.2.0.dev0,
Communicator API: 1.5.0,
PyTorch: 2.7.1+cu126
[INFO] Connected to Unity environment with package version 2.2.1-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Pyramids?team=0
[INFO] Hyperparameters for behavior name Pyramids:
trainer_type: ppo
hyperparameters:
batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01
epsilon: 0.2
lambd: 0.95
num_epoch: 3
shared_critic: False
learning_rate_schedule: linear
beta_schedule: linear
epsilon_schedule: linear
checkpoint_interval: 500000
network_settings:
normalize: False
hidden_units: 512
num_layers: 2
vis_encode_type: simple
memory: None
goal_conditioning_type: hyper
deterministic: False
reward_signals:
extrinsic:
gamma: 0.99
strength: 1.0
network_settings:
normalize: False
hidden_units: 128
num_layers: 2
vis_encode_type: simple
memory: None
goal_conditioning_type: hyper
deterministic: False
rnd:
gamma: 0.99
strength: 0.01
network_settings:
normalize: False
hidden_units: 64
num_layers: 3
vis_encode_type: simple
memory: None
goal_conditioning_type: hyper
deterministic: False
learning_rate: 0.0001
encoding_size: None
init_path: None
keep_checkpoints: 5
even_checkpoints: False
max_steps: 1000000
time_horizon: 128
summary_freq: 30000
threaded: False
self_play: None
behavioral_cloning: None
[INFO] Pyramids. Step: 30000. Time Elapsed: 45.356 s. Mean Reward: -1.000. Std of Reward: 0.000. Training.
[INFO] Pyramids. Step: 60000. Time Elapsed: 90.519 s. Mean Reward: -0.853. Std of Reward: 0.588. Training.
[INFO] Pyramids. Step: 90000. Time Elapsed: 136.319 s. Mean Reward: -0.797. Std of Reward: 0.646. Training.
[INFO] Pyramids. Step: 120000. Time Elapsed: 182.893 s. Mean Reward: -0.831. Std of Reward: 0.654. Training.
[INFO] Pyramids. Step: 150000. Time Elapsed: 227.995 s. Mean Reward: -0.715. Std of Reward: 0.760. Training.
[INFO] Pyramids. Step: 180000. Time Elapsed: 270.527 s. Mean Reward: -0.731. Std of Reward: 0.712. Training.
[INFO] Pyramids. Step: 210000. Time Elapsed: 316.617 s. Mean Reward: -0.699. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 240000. Time Elapsed: 361.434 s. Mean Reward: -0.640. Std of Reward: 0.822. Training.
[INFO] Pyramids. Step: 270000. Time Elapsed: 407.787 s. Mean Reward: -0.520. Std of Reward: 0.969. Training.
[INFO] Pyramids. Step: 300000. Time Elapsed: 451.612 s. Mean Reward: -0.222. Std of Reward: 1.135. Training.
[INFO] Pyramids. Step: 330000. Time Elapsed: 496.996 s. Mean Reward: -0.328. Std of Reward: 1.124. Training.
[INFO] Pyramids. Step: 360000. Time Elapsed: 541.248 s. Mean Reward: -0.452. Std of Reward: 0.995. Training.
[INFO] Pyramids. Step: 390000. Time Elapsed: 587.186 s. Mean Reward: -0.411. Std of Reward: 1.044. Training.
[INFO] Pyramids. Step: 420000. Time Elapsed: 630.923 s. Mean Reward: -0.042. Std of Reward: 1.228. Training.
[INFO] Pyramids. Step: 450000. Time Elapsed: 675.866 s. Mean Reward: 0.009. Std of Reward: 1.237. Training.
[INFO] Pyramids. Step: 480000. Time Elapsed: 721.391 s. Mean Reward: 0.351. Std of Reward: 1.271. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-499992.onnx
[INFO] Pyramids. Step: 510000. Time Elapsed: 767.344 s. Mean Reward: 0.647. Std of Reward: 1.140. Training.
[INFO] Pyramids. Step: 540000. Time Elapsed: 812.656 s. Mean Reward: 0.526. Std of Reward: 1.178. Training.
[INFO] Pyramids. Step: 570000. Time Elapsed: 857.156 s. Mean Reward: 0.525. Std of Reward: 1.236. Training.
[INFO] Pyramids. Step: 600000. Time Elapsed: 900.647 s. Mean Reward: 0.979. Std of Reward: 0.977. Training.
[INFO] Pyramids. Step: 630000. Time Elapsed: 949.947 s. Mean Reward: 1.044. Std of Reward: 1.040. Training.
[INFO] Pyramids. Step: 660000. Time Elapsed: 1006.810 s. Mean Reward: 1.143. Std of Reward: 0.937. Training.
[INFO] Pyramids. Step: 690000. Time Elapsed: 1062.833 s. Mean Reward: 1.151. Std of Reward: 0.997. Training.
[INFO] Pyramids. Step: 720000. Time Elapsed: 1119.948 s. Mean Reward: 1.499. Std of Reward: 0.563. Training.
[INFO] Pyramids. Step: 750000. Time Elapsed: 1178.547 s. Mean Reward: 1.308. Std of Reward: 0.835. Training.
[INFO] Pyramids. Step: 780000. Time Elapsed: 1226.204 s. Mean Reward: 1.278. Std of Reward: 0.866. Training.
[INFO] Pyramids. Step: 810000. Time Elapsed: 1275.499 s. Mean Reward: 1.318. Std of Reward: 0.856. Training.
[INFO] Pyramids. Step: 840000. Time Elapsed: 1322.302 s. Mean Reward: 1.477. Std of Reward: 0.641. Training.
[INFO] Pyramids. Step: 870000. Time Elapsed: 1370.429 s. Mean Reward: 1.367. Std of Reward: 0.816. Training.
[INFO] Pyramids. Step: 900000. Time Elapsed: 1418.228 s. Mean Reward: 1.471. Std of Reward: 0.689. Training.
[INFO] Pyramids. Step: 930000. Time Elapsed: 1465.721 s. Mean Reward: 1.514. Std of Reward: 0.619. Training.
[INFO] Pyramids. Step: 960000. Time Elapsed: 1513.116 s. Mean Reward: 1.403. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 990000. Time Elapsed: 1563.057 s. Mean Reward: 1.544. Std of Reward: 0.666. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-999909.onnx
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx
[INFO] Copied results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx to results/PyramidsGPUTest/Pyramids.onnx.
Model exported to Pyramids.onnx after reaching max steps.
Training Setup
- Run ID: PyramidsGPUTest
- GPU: NVIDIA A100 80GB PCIe
- Training time: ~26 minutes
- ML-Agents Envs: v1.2.0.dev0
- Communicator API: v1.5.0
Repository Contents
| File / Folder | Description |
|---|---|
| Pyramids.onnx | Exported trained PPO agent |
| configuration.yaml | Full PPO + RND training config |
| run_logs/ | Training logs from ML-Agents |
| Pyramids/ | Environment-specific output folder |
| config.json | Metadata for Hugging Face model card |
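As an alternative to cloning with git lfs, individual files can be fetched with the huggingface_hub client. This is a sketch that assumes the huggingface_hub package is installed.

```python
from huggingface_hub import hf_hub_download

# Download just the exported policy and the training config from this repo.
model_path = hf_hub_download(repo_id="jetfan-xin/ppo-Pyramids", filename="Pyramids.onnx")
config_path = hf_hub_download(repo_id="jetfan-xin/ppo-Pyramids", filename="configuration.yaml")
print(model_path, config_path)
```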
Citation
If you use this model, please consider citing:
@misc{ppoPyramidsJetfan,
author = {Jingfan Xin},
title = {PPO Agent Trained on Unity Pyramids Environment},
year = {2025},
howpublished = {\url{https://huggingface.co/jetfan-xin/ppo-Pyramids}},
}