
PPO on Ant-v3 giving unexpected results #874

Open
arnavc1712 opened this issue Jun 17, 2022 · 0 comments

arnavc1712 commented Jun 17, 2022

Hi,
I am trying to recreate OpenAI's PPO training using Tensorforce. I have tried to replicate the hyperparameters used in https://spinningup.openai.com/en/latest/spinningup/bench.html, but:

  1. I am not able to reach the reward reported at that URL (~2000).
  2. The policy's behaviour during training and during evaluation looks very different (when visualizing episodes from both).
  3. During validation the reward is around 950, but all the ant does is stay in one place.

Agent:

from tensorforce import Agent, Environment

# Environment creation is not shown in the original snippet; assumed to be
# Ant-v3 via Tensorforce's Gym wrapper.
environment = Environment.create(environment='gym', level='Ant-v3')

# Policy network: two dense tanh layers.
network_spec = [
    dict(type='dense', size=64, activation='tanh'),
    dict(type='dense', size=32, activation='tanh'),
]
# Baseline (value) network with the same architecture.
bs_network_spec = [
    dict(type='dense', size=64, activation='tanh'),
    dict(type='dense', size=32, activation='tanh'),
]

agent = Agent.create(
    agent='ppo',
    environment=environment,
    batch_size=4,                      # episodes per policy update
    learning_rate=3e-4,
    exploration=0.05,
    likelihood_ratio_clipping=0.2,     # PPO clipping epsilon
    network=network_spec,
    subsampling_fraction=1.0,
    discount=0.99,
    multi_step=80,                     # optimizer steps per update
    baseline=bs_network_spec,
    baseline_optimizer=dict(
        optimizer='adam', learning_rate=1e-3,
        multi_step=80, subsampling_fraction=1.0,
    ),
)

I am not sure what is going wrong here. Is the reward calculation wrong? Is the agent behaving differently during training and testing?
Any help would be appreciated!
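
For reference, my evaluation loop follows the act/observe pattern from the Tensorforce docs. This is only a sketch assuming the 0.6-style API, where independent=True keeps acting separate from training and deterministic=True disables exploration noise (which may explain part of the train/eval difference):

# Sketch of a deterministic evaluation loop (Tensorforce 0.6-style API assumed).
num_eval_episodes = 10
sum_rewards = 0.0
for _ in range(num_eval_episodes):
    states = environment.reset()
    internals = agent.initial_internals()
    terminal = False
    while not terminal:
        # independent=True: do not feed experience back into training;
        # deterministic=True: act without exploration noise.
        actions, internals = agent.act(
            states=states, internals=internals,
            independent=True, deterministic=True,
        )
        states, terminal, reward = environment.execute(actions=actions)
        sum_rewards += reward
print('Mean evaluation reward:', sum_rewards / num_eval_episodes)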

Here is the code to my colab link: https://colab.research.google.com/drive/1Hpi_DFyxMfTDoQ1iUtwz1epHAziQo1HI?usp=sharing

Video of trained Agent interacting with Ant-v3: Video1

Episodic rewards (X-axis: number of episodes, Y-axis: reward):
[reward plot]

UPDATE:
I trained this for ~3000 epochs with a batch size of 4, and the resulting policy works much better (the ant moves to the right instead of staying in one place): Video2

However, I cannot explain this improvement from the training episodic rewards themselves, which look like this:
[reward plot]

Based on the reward definition of Ant-v3, the agent gets a reward of +1 per step just for surviving, so it makes sense for it to stay in one place for the entire episode (as in Video1). I do not understand what incentivized it to learn anything else, and if it is learning, how can I measure it (since the reward plot above makes it look like the episodic rewards are not increasing)?
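
One way I plan to check whether the policy is actually learning locomotion rather than just collecting the survival bonus is to log the reward components separately. This is a sketch assuming the Ant-v3 info dict exposes 'reward_forward' and 'reward_survive' (as in the Gym MuJoCo v3 environments); the random action is a placeholder for the trained agent's action:

import gym

# Sketch: split the episodic return into forward-progress and survival components.
# Assumes Ant-v3's step() info dict contains 'reward_forward' and 'reward_survive'.
env = gym.make('Ant-v3')
obs = env.reset()
forward_total, survive_total, terminal = 0.0, 0.0, False
while not terminal:
    action = env.action_space.sample()  # placeholder: substitute the trained agent's action
    obs, reward, terminal, info = env.step(action)
    forward_total += info.get('reward_forward', 0.0)
    survive_total += info.get('reward_survive', 0.0)
print('forward component:', forward_total, 'survival component:', survive_total)

If the forward component grows over training while the total return stays flat (because control and contact costs offset it), that would explain why the behaviour improves even though the episodic reward plot does not.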
