[Feature Request] Preferred DDPG Actor model #1993

Open · 1 task done
jensbreitung opened this issue Mar 5, 2024 · 2 comments
Labels: enhancement (New feature or request)

@jensbreitung
Motivation

I'm trying out DDPG on an RL task, and while looking at this repo and its docs I came across two different solutions for the actor implementation. I would like to know why they are so different, or, if they aren't, why they aren't actually any different.

  1. The implementation provided in the examples directory. The Actor is an MLP combined with a TanhModule.
  2. The implementation provided in the tutorial "Coding a DDPG Loss" in the docs. Here, the Actor is also an MLP, but its output is fed into a ProbabilisticActor which (from my understanding) tries to fit the outputs of the MLP onto a TanhDelta distribution, from which it then samples actions. A rough sketch of both variants is below.
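
To make the comparison concrete, here is a rough sketch of the two variants as I understand them (the dimensions, key names and layer sizes are made up for illustration; this is not the exact code from the examples or the tutorial):

```python
from tensordict.nn import TensorDictModule, TensorDictSequential
from torchrl.modules import MLP, ProbabilisticActor, TanhDelta, TanhModule

obs_dim, action_dim = 11, 3  # hypothetical dimensions, for illustration only

# Variant 1 (examples directory): a deterministic MLP whose output is squashed
# into the action bounds by a TanhModule.
actor_v1 = TensorDictSequential(
    TensorDictModule(
        MLP(in_features=obs_dim, out_features=action_dim, num_cells=[256, 256]),
        in_keys=["observation"],
        out_keys=["param"],
    ),
    TanhModule(in_keys=["param"], out_keys=["action"]),
)

# Variant 2 ("Coding a DDPG Loss" tutorial): the same kind of MLP, but wrapped
# in a ProbabilisticActor that builds a TanhDelta distribution from the MLP
# output and samples the "action" from it.
actor_v2 = ProbabilisticActor(
    TensorDictModule(
        MLP(in_features=obs_dim, out_features=action_dim, num_cells=[256, 256]),
        in_keys=["observation"],
        out_keys=["param"],
    ),
    in_keys=["param"],
    distribution_class=TanhDelta,
)
```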

I'm mostly confused about the appearance of the ProbabilisticActor.
In my understanding, the first one simply maps the MLP outputs into the valid action space using tanh.
The second one appears to gather statistics from the MLP and to sample actions according to those statistics. I'm wondering how this is sensible. Here is a "counter-example", if you will:
Suppose that during training, for whatever reason, the MLP initially produces an almost identical output for 1000 iterations. From my understanding, this "fills up" the ProbabilisticActor with a distribution that, when sampled from, returns actions that are all relatively "close" to each other.
Now, if in the 1001st iteration the MLP were to produce a completely different output, the action in that step should correspond to that new output. However, if we again sample using the ProbabilisticActor, we will, with high probability, sample an action similar to the previous 1000, even though it should perhaps be a vastly different one.

In a different tutorial in the docs (https://pytorch.org/rl/tutorials/getting-started-1.html#probabilistic-policies) the ProbabilisticActor is only associated with probabilistic policies.
Right below that paragraph, it says that the ProbabilisticActor is used for exploration in probabilistic policies; for deterministic policies, it instead introduces modules such as EGreedyModule and OrnsteinUhlenbeckProcessWrapper for exploration.
However, both implementations listed above already use such exploration modules, so the ProbabilisticActor in 2) surely isn't there for that purpose?
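
For reference, this is roughly how I understand such an exploration module gets attached to a deterministic actor (reusing actor_v1 from the sketch above, with all arguments left at their defaults):

```python
from torchrl.modules import OrnsteinUhlenbeckProcessWrapper

# Wrap the deterministic actor; during collection the wrapper adds
# Ornstein-Uhlenbeck noise to the "action" entry the actor writes.
exploration_policy = OrnsteinUhlenbeckProcessWrapper(actor_v1)
```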

I would appreciate it if you could clarify why the two implementations differ and why the second one uses a paradigm typically associated with probabilistic policies in this context.
What would your preferred way of a DDPG Actor look like?

Checklist

  • I have checked that there is no similar issue in the repo (required)
@jensbreitung jensbreitung added the enhancement New feature or request label Mar 5, 2024
@jensbreitung jensbreitung changed the title [Feature Request] Preferred DDPG model setup [Feature Request] Preferred DDPG Actor model Mar 5, 2024
@vmoens (Contributor) commented Mar 5, 2024

The implementation provided in the tutorial "Coding a DDPG Loss" in the docs. Here, the Actor is an MLP, but its output is fed into a ProbabilisticActor which (from my understanding) tries to fit the outputs of the MLP onto a TanhDelta distribution, from which it then samples actions.

The Delta distribution is a collapsed distribution: it can only sample one value. For the sake of clarity, we should refactor that into a simple network.
We're using this because it makes it easy to define the action space using upper and lower boundaries.

Per se, there is nothing probabilistic about the actor in DDPG aside from the explicit exploration strategy that you attach to it (OU or Gaussian).

What would your preferred way of a DDPG Actor look like?

In practice, any deterministic neural network would do. If your action space is [-1, 1], just append an nn.Tanh() at the end of the network and don't bother using a ProbabilisticActor.
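
Something along these lines would be enough (a rough sketch with made-up dimensions, wrapped in a TensorDictModule so it plugs into the rest of the stack):

```python
import torch.nn as nn
from tensordict.nn import TensorDictModule

obs_dim, action_dim = 11, 3  # illustrative only

actor_net = nn.Sequential(
    nn.Linear(obs_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, action_dim),
    nn.Tanh(),  # maps the output into [-1, 1]
)
# Read "observation" from the tensordict and write the deterministic "action".
actor = TensorDictModule(actor_net, in_keys=["observation"], out_keys=["action"])
```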

Thanks for pointing this out though!

My two action items here are:

  1. Clarify why we use a probabilistic actor.
  2. Consider another, clearer option where no probabilistic module is involved (something like an nn.Tanh() with low and high arguments, perhaps?).

@jensbreitung (Author)
Thanks for the quick reply :)
I think for your item 2) TanhModule is the best choice because it can take the environment's action_spec directly as input.
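
Something like this sketch, for example (assuming env is a TorchRL environment with a bounded action_spec, and that TanhModule picks up the bounds from its spec argument):

```python
from tensordict.nn import TensorDictModule, TensorDictSequential
from torchrl.modules import MLP, TanhModule

actor = TensorDictSequential(
    TensorDictModule(
        MLP(out_features=env.action_spec.shape[-1], num_cells=[256, 256]),
        in_keys=["observation"],
        out_keys=["param"],
    ),
    # TanhModule takes the bounds from the action_spec, so the output lands in
    # the valid action range without involving a ProbabilisticActor.
    TanhModule(in_keys=["param"], out_keys=["action"], spec=env.action_spec),
)
```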
