[Feature Request] Preferred DDPG Actor model #1993
Comments
Per se, there is nothing probabilistic about the actor in DDPG aside from the explicit exploration strategy that you attach to it (OU or Gaussian). In practice, any deterministic neural network would do. If your action space is [-1, 1], just append a `Tanh`. Thanks for pointing this out though! My two action items here are:
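To illustrate the point above, a minimal deterministic actor in plain PyTorch might look like the following sketch (layer sizes and dimensions are arbitrary, not taken from the repo):

```python
import torch
from torch import nn

class DDPGActor(nn.Module):
    """Deterministic DDPG actor: an MLP whose output is squashed by tanh."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Tanh(),  # maps the output into the action space [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Deterministic: the same observation always yields the same action.
        # Exploration noise (OU or Gaussian) is added on top of this output.
        return self.net(obs)
```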
Thanks for the quick reply :)
Motivation
I'm trying out DDPG on an RL task and, while looking at this repo and its docs, came across different solutions for the actor implementation. I would like to know why they are so different or, if they aren't, why they aren't actually any different. The two implementations I found are:
1. The DDPG example in the `examples` directory. The Actor is an `MLP` combined with a `TanhModule`.
2. The DDPG tutorial in the docs. Here the Actor is also an `MLP`, but its output is fed into a `ProbabilisticActor`, which (from my understanding) tries to fit the outputs of the MLP onto a `TanhDelta` distribution, from which it then samples actions.

I'm mostly confused about the appearance of the `ProbabilisticActor`.
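For reference, the two constructions might be sketched roughly as follows, assuming TorchRL's `MLP`, `TanhModule`, `ProbabilisticActor`, and `TanhDelta`; the keys, sizes, and dimensions are illustrative, not copied from the example or tutorial:

```python
import torch.nn as nn
from tensordict.nn import TensorDictModule, TensorDictSequential
from torchrl.modules import MLP, ProbabilisticActor, TanhDelta, TanhModule

obs_dim, action_dim = 11, 3  # illustrative dimensions
actor_net = MLP(
    in_features=obs_dim,
    out_features=action_dim,
    num_cells=[256, 256],
    activation_class=nn.ReLU,
)
backbone = TensorDictModule(actor_net, in_keys=["observation"], out_keys=["param"])

# 1) Deterministic variant: squash the network output with a TanhModule.
actor_deterministic = TensorDictSequential(
    backbone,
    TanhModule(in_keys=["param"], out_keys=["action"]),
)

# 2) Tutorial-style variant: wrap the same backbone in a ProbabilisticActor
#    whose TanhDelta distribution is a point mass at tanh(param).
actor_probabilistic = ProbabilisticActor(
    module=backbone,
    in_keys=["param"],
    out_keys=["action"],
    distribution_class=TanhDelta,
)
```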
In my understanding, the first one simply maps the MLP outputs into the valid action space using `tanh`. The second one gathers statistics of the MLP and uses these to sample actions according to these statistics. I'm wondering how this is sensible. I have the following "counter-example", if you will:
Suppose during training, for whatever reason, the MLP initially produces an almost identical output for 1000 iterations. From my understanding, this "fills up" the `ProbabilisticActor` with a distribution that, when sampled from, will return actions that are all relatively "close" to each other.

Now if, in the 1001st iteration, the MLP were to produce a completely different output, then the action in that step should correspond to that completely new output of the MLP. However, if we again sample using the `ProbabilisticActor`, with high probability we will sample an action similar to the previous 1000, even though it should now perhaps be a vastly different one.
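As a sanity check on this scenario, here is what sampling from a `TanhDelta` actually computes, assuming TorchRL's implementation (the parameter values below are made up): the distribution is rebuilt from the network's current output on every call, with no statistics accumulated across iterations.

```python
import torch
from torchrl.modules import TanhDelta

# A fresh TanhDelta is constructed from the "param" entry at every forward
# call of the ProbabilisticActor; nothing carries over between iterations.
old_param = torch.tensor([0.5, -0.5])  # stand-in for the 1000 similar outputs
new_param = torch.tensor([3.0, -3.0])  # the "completely different" output

print(TanhDelta(old_param).sample())  # tanh(old_param): a point mass
print(TanhDelta(new_param).sample())  # tanh(new_param): tracks the new output
```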
In a different tutorial in the docs (https://pytorch.org/rl/tutorials/getting-started-1.html#probabilistic-policies), the `ProbabilisticActor` is only associated with probabilistic policies. Right below this paragraph, it says that the `ProbabilisticActor` is used for exploration in probabilistic policies; for deterministic policies, it then introduces modules such as `EGreedyModule` and `OrnsteinUhlenbeckProcessWrapper` that should be used for exploration.
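For concreteness, attaching such an exploration module to a deterministic actor might look like this sketch, assuming TorchRL's `OrnsteinUhlenbeckProcessWrapper` with its default noise parameters (`actor_deterministic` is the hypothetical actor from the sketch above):

```python
from torchrl.modules import OrnsteinUhlenbeckProcessWrapper

# Wrap the deterministic policy: during collection, OU noise is added to the
# action the policy outputs; the underlying actor itself stays deterministic.
exploration_policy = OrnsteinUhlenbeckProcessWrapper(
    actor_deterministic,
    annealing_num_steps=1_000,  # steps over which the noise scale anneals
)
```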
However, both implementations listed above already use such exploration modules. Hence the `ProbabilisticActor` in 2) surely isn't used for that? I would appreciate it if you could clarify why the two implementations differ, and why the second one uses a paradigm typically associated with probabilistic policies in this context.
What would your preferred implementation of a DDPG Actor look like?