PPO Agent Playing Pyramids

This is a trained model of a PPO agent playing Pyramids using the Unity ML-Agents library.

Usage (with ML-Agents)

Documentation: https://unity-technologies.github.io/ml-agents/ML-Agents-Toolkit-Documentation/

We wrote a complete tutorial on training your first agent with ML-Agents and publishing it to the Hub.

Resume the training

mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
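
For example, to resume the exact run from this repository, reusing the config path and run ID shown in the training log below:

mlagents-learn ./config/ppo/PyramidsRND.yaml --env=./training-envs-executables/linux/Pyramids/Pyramids.x86_64 --run-id=PyramidsGPUTest --resume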

Watch your Agent play

You can watch your agent playing directly in your browser:

  1. If the environment is part of ML-Agents official environments, go to https://huggingface.co/unity
  2. Find your model_id: jetfan-xin/ppo-Pyramids
  3. Select your .nn or .onnx file
  4. Click on Watch the agent play 👀

🧠 PPO Agent Trained on Unity Pyramids Environment

This repository contains a reinforcement learning agent trained using Proximal Policy Optimization (PPO) on Unity’s Pyramids environment via ML-Agents.

πŸ“Œ Model Overview

  • Algorithm: PPO with RND (Random Network Distillation)
  • Environment: Unity Pyramids (3D sparse-reward maze)
  • Framework: ML-Agents v1.2.0.dev0
  • Backend: PyTorch 2.7.1 (CUDA-enabled)

The agent learns to navigate a 3D maze and reach the goal area by combining extrinsic and intrinsic rewards.
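
Conceptually, each reward signal contributes a weighted stream to the PPO update. The sketch below is illustrative only, not ML-Agents internals (ML-Agents actually keeps a separate value estimate per reward signal and sums the resulting advantages); the strengths match this run's configuration, and all names are hypothetical:

# Illustrative sketch of how the two reward streams are weighted.
# Names are hypothetical, not ML-Agents API.
EXTRINSIC_STRENGTH = 1.0  # reward_signals.extrinsic.strength
RND_STRENGTH = 0.01       # reward_signals.rnd.strength

def combined_reward(extrinsic: float, rnd_bonus: float) -> float:
    """Weighted sum of the environment reward and the RND curiosity bonus."""
    return EXTRINSIC_STRENGTH * extrinsic + RND_STRENGTH * rnd_bonus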


πŸš€ How to Use This Model

You can use the .onnx model directly in Unity.

βœ… Steps:

  1. Download the model

    Clone the repository or download Pyramids.onnx:

    git lfs install
    git clone https://huggingface.co/jetfan-xin/ppo-Pyramids
    
  2. Place in Unity project

    Put the model file in your Unity project under:

    Assets/ML-Agents/Examples/Pyramids/Pyramids.onnx
    
  3. Assign in Unity Editor

    • Select your agent GameObject.
    • In Behavior Parameters, assign Pyramids.onnx as the model.
    • Make sure the Behavior Name matches your training config.
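
To sanity-check the exported network outside Unity, you can open it with onnxruntime. This is a minimal sketch, assuming onnxruntime is installed; the input and output names depend on the export, so the script discovers them instead of hard-coding any:

# Quick inspection of the exported policy, outside Unity.
# pip install onnxruntime
import onnxruntime as ort

session = ort.InferenceSession("Pyramids.onnx")

# Print the observation/action interface the exporter produced.
for inp in session.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)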

βš™οΈ Training Configuration

Key settings from configuration.yaml:

  • trainer_type: ppo
  • max_steps: 1000000
  • batch_size: 128, buffer_size: 2048
  • learning_rate: 3e-4
  • reward_signals:
    • extrinsic: Ξ³=0.99, strength=1.0
    • rnd: Ξ³=0.99, strength=0.01
  • hidden_units: 512, num_layers: 2
  • summary_freq: 30000

See configuration.yaml for full details.
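
The hyperparameters echoed in the training log below correspond to a config of roughly this shape (a sketch reconstructed from that log, not a verbatim copy of configuration.yaml):

behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 3.0e-4
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          hidden_units: 64
          num_layers: 3
        learning_rate: 1.0e-4
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 30000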


πŸ“ˆ Training Performance

Sampled mean rewards from the training log:

Step       Mean Reward
300,000    -0.22
480,000     0.35
660,000     1.14
840,000     1.47
990,000     1.54
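
To visualize the trend, a minimal matplotlib sketch using the values above:

# Plot the sampled mean rewards listed above (copied from the log).
import matplotlib.pyplot as plt

steps = [300_000, 480_000, 660_000, 840_000, 990_000]
mean_rewards = [-0.22, 0.35, 1.14, 1.47, 1.54]

plt.plot(steps, mean_rewards, marker="o")
plt.xlabel("Training step")
plt.ylabel("Mean reward")
plt.title("PPO + RND on Unity Pyramids")
plt.show()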

Full training log:

(rl_py310) 4xin@ltgpu3:~/deep_rl/unit5/ml-agents$ CUDA_VISIBLE_DEVICES=3 mlagents-learn ./config/ppo/PyramidsRND.yaml \
  --env=./training-envs-executables/linux/Pyramids/Pyramids.x86_64 \
  --run-id="PyramidsGPUTest" \
  --no-graphics

 Version information:
  ml-agents: 1.2.0.dev0,
  ml-agents-envs: 1.2.0.dev0,
  Communicator API: 1.5.0,
  PyTorch: 2.7.1+cu126
[INFO] Connected to Unity environment with package version 2.2.1-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Pyramids?team=0
[INFO] Hyperparameters for behavior name Pyramids: 
        trainer_type:   ppo
        hyperparameters:        
          batch_size:   128
          buffer_size:  2048
          learning_rate:        0.0003
          beta: 0.01
          epsilon:      0.2
          lambd:        0.95
          num_epoch:    3
          shared_critic:        False
          learning_rate_schedule:       linear
          beta_schedule:        linear
          epsilon_schedule:     linear
        checkpoint_interval:    500000
        network_settings:       
          normalize:    False
          hidden_units: 512
          num_layers:   2
          vis_encode_type:      simple
          memory:       None
          goal_conditioning_type:       hyper
          deterministic:        False
        reward_signals: 
          extrinsic:    
            gamma:      0.99
            strength:   1.0
            network_settings:   
              normalize:        False
              hidden_units:     128
              num_layers:       2
              vis_encode_type:  simple
              memory:   None
              goal_conditioning_type:   hyper
              deterministic:    False
          rnd:  
            gamma:      0.99
            strength:   0.01
            network_settings:   
              normalize:        False
              hidden_units:     64
              num_layers:       3
              vis_encode_type:  simple
              memory:   None
              goal_conditioning_type:   hyper
              deterministic:    False
            learning_rate:      0.0001
            encoding_size:      None
        init_path:      None
        keep_checkpoints:       5
        even_checkpoints:       False
        max_steps:      1000000
        time_horizon:   128
        summary_freq:   30000
        threaded:       False
        self_play:      None
        behavioral_cloning:     None
[INFO] Pyramids. Step: 30000. Time Elapsed: 45.356 s. Mean Reward: -1.000. Std of Reward: 0.000. Training.
[INFO] Pyramids. Step: 60000. Time Elapsed: 90.519 s. Mean Reward: -0.853. Std of Reward: 0.588. Training.
[INFO] Pyramids. Step: 90000. Time Elapsed: 136.319 s. Mean Reward: -0.797. Std of Reward: 0.646. Training.
[INFO] Pyramids. Step: 120000. Time Elapsed: 182.893 s. Mean Reward: -0.831. Std of Reward: 0.654. Training.
[INFO] Pyramids. Step: 150000. Time Elapsed: 227.995 s. Mean Reward: -0.715. Std of Reward: 0.760. Training.
[INFO] Pyramids. Step: 180000. Time Elapsed: 270.527 s. Mean Reward: -0.731. Std of Reward: 0.712. Training.
[INFO] Pyramids. Step: 210000. Time Elapsed: 316.617 s. Mean Reward: -0.699. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 240000. Time Elapsed: 361.434 s. Mean Reward: -0.640. Std of Reward: 0.822. Training.
[INFO] Pyramids. Step: 270000. Time Elapsed: 407.787 s. Mean Reward: -0.520. Std of Reward: 0.969. Training.
[INFO] Pyramids. Step: 300000. Time Elapsed: 451.612 s. Mean Reward: -0.222. Std of Reward: 1.135. Training.
[INFO] Pyramids. Step: 330000. Time Elapsed: 496.996 s. Mean Reward: -0.328. Std of Reward: 1.124. Training.
[INFO] Pyramids. Step: 360000. Time Elapsed: 541.248 s. Mean Reward: -0.452. Std of Reward: 0.995. Training.
[INFO] Pyramids. Step: 390000. Time Elapsed: 587.186 s. Mean Reward: -0.411. Std of Reward: 1.044. Training.
[INFO] Pyramids. Step: 420000. Time Elapsed: 630.923 s. Mean Reward: -0.042. Std of Reward: 1.228. Training.
[INFO] Pyramids. Step: 450000. Time Elapsed: 675.866 s. Mean Reward: 0.009. Std of Reward: 1.237. Training.
[INFO] Pyramids. Step: 480000. Time Elapsed: 721.391 s. Mean Reward: 0.351. Std of Reward: 1.271. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-499992.onnx
[INFO] Pyramids. Step: 510000. Time Elapsed: 767.344 s. Mean Reward: 0.647. Std of Reward: 1.140. Training.
[INFO] Pyramids. Step: 540000. Time Elapsed: 812.656 s. Mean Reward: 0.526. Std of Reward: 1.178. Training.
[INFO] Pyramids. Step: 570000. Time Elapsed: 857.156 s. Mean Reward: 0.525. Std of Reward: 1.236. Training.
[INFO] Pyramids. Step: 600000. Time Elapsed: 900.647 s. Mean Reward: 0.979. Std of Reward: 0.977. Training.
[INFO] Pyramids. Step: 630000. Time Elapsed: 949.947 s. Mean Reward: 1.044. Std of Reward: 1.040. Training.
[INFO] Pyramids. Step: 660000. Time Elapsed: 1006.810 s. Mean Reward: 1.143. Std of Reward: 0.937. Training.
[INFO] Pyramids. Step: 690000. Time Elapsed: 1062.833 s. Mean Reward: 1.151. Std of Reward: 0.997. Training.
[INFO] Pyramids. Step: 720000. Time Elapsed: 1119.948 s. Mean Reward: 1.499. Std of Reward: 0.563. Training.
[INFO] Pyramids. Step: 750000. Time Elapsed: 1178.547 s. Mean Reward: 1.308. Std of Reward: 0.835. Training.
[INFO] Pyramids. Step: 780000. Time Elapsed: 1226.204 s. Mean Reward: 1.278. Std of Reward: 0.866. Training.
[INFO] Pyramids. Step: 810000. Time Elapsed: 1275.499 s. Mean Reward: 1.318. Std of Reward: 0.856. Training.
[INFO] Pyramids. Step: 840000. Time Elapsed: 1322.302 s. Mean Reward: 1.477. Std of Reward: 0.641. Training.
[INFO] Pyramids. Step: 870000. Time Elapsed: 1370.429 s. Mean Reward: 1.367. Std of Reward: 0.816. Training.
[INFO] Pyramids. Step: 900000. Time Elapsed: 1418.228 s. Mean Reward: 1.471. Std of Reward: 0.689. Training.
[INFO] Pyramids. Step: 930000. Time Elapsed: 1465.721 s. Mean Reward: 1.514. Std of Reward: 0.619. Training.
[INFO] Pyramids. Step: 960000. Time Elapsed: 1513.116 s. Mean Reward: 1.403. Std of Reward: 0.810. Training.
[INFO] Pyramids. Step: 990000. Time Elapsed: 1563.057 s. Mean Reward: 1.544. Std of Reward: 0.666. Training.
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-999909.onnx
[INFO] Exported results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx
[INFO] Copied results/PyramidsGPUTest/Pyramids/Pyramids-1000037.onnx to results/PyramidsGPUTest/Pyramids.onnx.

βœ… Model exported to Pyramids.onnx after reaching max steps.


πŸ–₯️ Training Setup

  • Run ID: PyramidsGPUTest
  • GPU: NVIDIA A100 80GB PCIe
  • Training time: ~26 minutes
  • ML-Agents Envs: v1.2.0.dev0
  • Communicator API: v1.5.0

πŸ“ Repository Contents

File / Folder        Description
Pyramids.onnx        Exported trained PPO agent
configuration.yaml   Full PPO + RND training config
run_logs/            Training logs from ML-Agents
Pyramids/            Environment-specific output folder
config.json          Metadata for the Hugging Face model card

πŸ“š Citation

If you use this model, please consider citing:

@misc{ppoPyramidsJetfan,
  author = {Jingfan Xin},
  title = {PPO Agent Trained on Unity Pyramids Environment},
  year = {2025},
  howpublished = {\url{https://huggingface.co/jetfan-xin/ppo-Pyramids}},
}