Truly Deterministic Policy Optimization

Truly Deterministic Policy Optimization (TDPO) is a model-free policy gradient method that trains reinforcement learning agents without requiring Gaussian stochasticity in the policy's actions. This is why we call it truly deterministic: not only does it use deterministic policy gradients, it also performs deterministic policy search. Note the distinction between deterministic search and deterministic gradients; DDPG and TD3 use deterministic policy gradients, but they still inject Ornstein-Uhlenbeck or Gaussian noise and thus perform stochastic policy search. To the best of our knowledge, our work is the first to practically implement such a deterministic policy search strategy.
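
To make the distinction concrete, here is a minimal, hypothetical sketch (not taken from this code base) of how actions are chosen during data collection under the two search strategies. Here, policy is assumed to be any function mapping an observation to a NumPy action vector:

import numpy as np

def ddpg_style_action(policy, obs, noise_std=0.1):
    # Stochastic policy search (DDPG/TD3 style): the gradients are
    # deterministic, but Gaussian (or Ornstein-Uhlenbeck) exploration
    # noise is still added to the executed action.
    action = policy(obs)
    return action + noise_std * np.random.randn(*action.shape)

def tdpo_style_action(policy, obs):
    # Deterministic policy search: the executed action is exactly the
    # policy's output; no exploration noise is injected during training.
    return policy(obs)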

Deterministic policy search has many potential merits.

  • First, it can reduce the estimation variance of existing policy gradient methods; injecting less noise during training reduces the variance and inconsistency of the gradient estimates.
  • Furthermore, it can make training on longer episodes more practical. The curse of horizon has been under-stressed in modern reinforcement learning, and deterministic policy search may unlock the ability to train with considerably larger discount factors and episode lengths.
  • It may also be more resilient to non-MDP artifacts, such as non-local rewards and observation or action delays.
  • Finally, it can be a valuable asset for safe RL, since it allows training in noise-sensitive environments where injecting Gaussian or Ornstein-Uhlenbeck noise could damage the device.

To show the practicality of our approach, we tested TDPO agents on hardware, controlling a leg of the MIT Cheetah robot. The MIT Cheetah is a high-performance device that is quite challenging to control. To achieve higher power, the robot has no spring dampers, and it can exert more than 50 N·m of torque at velocities of 30 rad/s. This is more than enough to break a human hand in a split second if the controller is not precise enough. None of the existing model-free RL methods could perform global control on this device, even in simulation, despite years of systematic hyper-parameter optimization and even code-level optimizations. This lack of practical methods for challenging environments motivated the design of TDPO.

Here is a one-minute demo of the TDPO agent performing physical drop-and-catch tests on this leg at a 4 kHz control frequency, smoothly recovering from 70 cm drops:

(Video: large_sm_coeff_22901760k_clips_540p_sd.mp4)

Training Details

  • We used Python 3.6, and the exact library versions are pinned in the requirements.txt file:
python -m pip install -r requirements.txt
  • You also need to install MuJoCo, which is now open-source (see the sanity-check sketch after this list).

  • To train the TDPO agent on the leg environment, run

./train.sh
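
As a quick way to confirm that the MuJoCo installation works, here is a minimal sanity-check sketch. It is not part of this repository, and it assumes the open-source mujoco Python bindings; requirements.txt pins the exact packages this code base expects (older setups may rely on mujoco-py instead):

import mujoco  # open-source bindings; older setups may use mujoco_py

# Build a trivial one-body model, step it once, and print the simulated time.
model = mujoco.MjModel.from_xml_string(
    "<mujoco><worldbody><body><geom size='0.1'/></body></worldbody></mujoco>"
)
data = mujoco.MjData(model)
mujoco.mj_step(model, data)
print("MuJoCo is working; simulated time:", data.time)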

This is the bare minimum information needed to use the repository. Of course, we will be updating the repository with better documentation and more user-friendly scripts. If you have trouble setting up the environment and libraries, or run into issues running the code, please don't hesitate to reach out to us at [email protected] or open an issue here.

References

You can find our paper at https://arxiv.org/abs/2205.15379.

Here is the BibTeX citation for our work:

@misc{saleh2022truly,
      title={Truly Deterministic Policy Optimization},
      author={Ehsan Saleh and Saba Ghaffari and Timothy Bretl and Matthew West},
      year={2022},
      eprint={2205.15379},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}

About

This repository contains the experiments conducted in the NeurIPS 2022 paper "Truly Deterministic Policy Optimization".
