[Feature Request] Preferred DDPG Actor model #1993

Open · 1 task done
jensbreitung opened this issue Mar 5, 2024 · 2 comments
Labels: enhancement (New feature or request)

@jensbreitung
Motivation

I'm trying out DDPG on an RL task, and while looking at this repo and its docs I came across two different solutions for the actor implementation. I would like to know why they are so different, or, if they aren't, why they aren't actually any different.

  1. The implementation provided in the examples directory. The Actor is an MLP combined with a TanhModule.
  2. The implementation provided in the tutorial "Coding a DDPG Loss" in the docs. Here, the Actor is also an MLP, but its output is fed into a ProbabilisticActor which (from my understanding) tries to fit the outputs of the MLP onto a TanhDelta distribution, from which it then samples actions. A rough sketch of both variants is below.
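
To make the comparison concrete, here is a rough sketch of the two variants as I understand them (the dimensions, key names and layer sizes are made up for illustration; this is not the exact code from the examples or the tutorial):

```python
from tensordict.nn import TensorDictModule, TensorDictSequential
from torchrl.modules import MLP, ProbabilisticActor, TanhDelta, TanhModule

obs_dim, action_dim = 11, 3  # hypothetical dimensions, for illustration only

# Variant 1 (examples directory): a deterministic MLP whose output is squashed
# into the action bounds by a TanhModule.
actor_v1 = TensorDictSequential(
    TensorDictModule(
        MLP(in_features=obs_dim, out_features=action_dim, num_cells=[256, 256]),
        in_keys=["observation"],
        out_keys=["param"],
    ),
    TanhModule(in_keys=["param"], out_keys=["action"]),
)

# Variant 2 ("Coding a DDPG Loss" tutorial): the same kind of MLP, but wrapped
# in a ProbabilisticActor that builds a TanhDelta distribution from the MLP
# output and samples the "action" from it.
actor_v2 = ProbabilisticActor(
    TensorDictModule(
        MLP(in_features=obs_dim, out_features=action_dim, num_cells=[256, 256]),
        in_keys=["observation"],
        out_keys=["param"],
    ),
    in_keys=["param"],
    distribution_class=TanhDelta,
)
```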

I'm mostly confused about the appearance of the ProbabilisticActor.
In my understanding, the first one simply maps the MLP outputs into the valid action space using tanh.
The second one appears to gather statistics from the MLP and to sample actions according to those statistics. I'm wondering how this is sensible. Here is a "counter-example", if you will:
Suppose that during training, for whatever reason, the MLP initially produces an almost identical output for 1000 iterations. From my understanding, this "fills up" the ProbabilisticActor with a distribution that, when sampled from, returns actions that are all relatively "close" to each other.
Now, if in the 1001st iteration the MLP were to produce a completely different output, the action in that step should correspond to that new output. However, if we again sample using the ProbabilisticActor, we will, with high probability, sample an action similar to the previous 1000, even though it should perhaps be a vastly different one.

In a different tutorial in the docs (https://pytorch.org/rl/tutorials/getting-started-1.html#probabilistic-policies) the ProbabilisticActor is only associated with probabilistic policies.
Right below that paragraph, it says that the ProbabilisticActor is used for exploration in probabilistic policies; for deterministic policies, it instead introduces modules such as EGreedyModule and OrnsteinUhlenbeckProcessWrapper for exploration.
However, both implementations listed above already use such exploration modules, so the ProbabilisticActor in 2) surely isn't there for that purpose?
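
For reference, this is roughly how I understand such an exploration module gets attached to a deterministic actor (reusing actor_v1 from the sketch above, with all arguments left at their defaults):

```python
from torchrl.modules import OrnsteinUhlenbeckProcessWrapper

# Wrap the deterministic actor; during collection the wrapper adds
# Ornstein-Uhlenbeck noise to the "action" entry the actor writes.
exploration_policy = OrnsteinUhlenbeckProcessWrapper(actor_v1)
```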

I would appreciate it if you could clarify why the two implementations differ and why the second one uses a paradigm typically associated with probabilistic policies in this context.
What would your preferred way of a DDPG Actor look like?

Checklist

  • I have checked that there is no similar issue in the repo (required)
@jensbreitung jensbreitung added the enhancement New feature or request label Mar 5, 2024
@jensbreitung jensbreitung changed the title [Feature Request] Preferred DDPG model setup [Feature Request] Preferred DDPG Actor model Mar 5, 2024
@vmoens (Contributor) commented Mar 5, 2024

The implementation provided in the tutorial "Coding a DDPG Loss" in the docs. Here, the Actor is an MLP, but its output is fed into a ProbabilisticActor which (from my understanding) tries to fit the outputs of the MLP onto a TanhDelta distribution, from which it then samples actions.

The Delta distribution is a collapsed distribution: it can only sample one value. For the sake of clarity, we should refactor that into a simple network.
We're using this because it makes it easy to define the action space using upper and lower boundaries.

Per se, there is nothing probabilistic about the actor in DDPG aside from the explicit exploration strategy that you attach to it (OU or Gaussian).

What would your preferred way of a DDPG Actor look like?

In practice, any deterministic neural network would do. If your action space is [-1, 1], just append an nn.Tanh() at the end of the network and don't bother using a ProbabilisticActor.
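
Something along these lines would be enough (a rough sketch with made-up dimensions, wrapped in a TensorDictModule so it plugs into the rest of the stack):

```python
import torch.nn as nn
from tensordict.nn import TensorDictModule

obs_dim, action_dim = 11, 3  # illustrative only

actor_net = nn.Sequential(
    nn.Linear(obs_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, action_dim),
    nn.Tanh(),  # maps the output into [-1, 1]
)
# Read "observation" from the tensordict and write the deterministic "action".
actor = TensorDictModule(actor_net, in_keys=["observation"], out_keys=["action"])
```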

Thanks for pointing this out though!

My two action items here are:

  1. Clarify why we use a probabilistic actor.
  2. Consider another, clearer option where no probabilistic module is involved (something like an nn.Tanh() with low and high arguments, perhaps?).

@jensbreitung (Author)
Thanks for the quick reply :)
I think for your item 2) TanhModule is the best choice because it can take the environment's action_spec directly as input.
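
Something like this sketch, for example (assuming env is a TorchRL environment with a bounded action_spec, and that TanhModule picks up the bounds from its spec argument):

```python
from tensordict.nn import TensorDictModule, TensorDictSequential
from torchrl.modules import MLP, TanhModule

actor = TensorDictSequential(
    TensorDictModule(
        MLP(out_features=env.action_spec.shape[-1], num_cells=[256, 256]),
        in_keys=["observation"],
        out_keys=["param"],
    ),
    # TanhModule takes the bounds from the action_spec, so the output lands in
    # the valid action range without involving a ProbabilisticActor.
    TanhModule(in_keys=["param"], out_keys=["action"], spec=env.action_spec),
)
```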
