
PPO on Ant-v3 giving unexpected results #874

Open
arnavc1712 opened this issue Jun 17, 2022 · 0 comments

arnavc1712 commented Jun 17, 2022

Hi,
I am trying to recreate OpenAI's PPO training using Tensorforce. I have tried to replicate the hyperparameters used in https://spinningup.openai.com/en/latest/spinningup/bench.html, but:

  1. I am not able to reach the reward reported at that URL (~2000).
  2. The policy's behaviour during training and during evaluation looks very different (when visualizing episodes from both).
  3. During validation the reward is around 950, but all the ant does is stay in one place.

Agent:

from tensorforce import Agent, Environment

# Environment creation is not shown in the original snippet; assumed to be
# Ant-v3 via Tensorforce's Gym wrapper.
environment = Environment.create(environment='gym', level='Ant-v3')

# Policy network: two dense tanh layers.
network_spec = [
    dict(type='dense', size=64, activation='tanh'),
    dict(type='dense', size=32, activation='tanh'),
]
# Baseline (value) network with the same architecture.
bs_network_spec = [
    dict(type='dense', size=64, activation='tanh'),
    dict(type='dense', size=32, activation='tanh'),
]

agent = Agent.create(
    agent='ppo',
    environment=environment,
    batch_size=4,                      # episodes per policy update
    learning_rate=3e-4,
    exploration=0.05,
    likelihood_ratio_clipping=0.2,     # PPO clipping epsilon
    network=network_spec,
    subsampling_fraction=1.0,
    discount=0.99,
    multi_step=80,                     # optimizer steps per update
    baseline=bs_network_spec,
    baseline_optimizer=dict(
        optimizer='adam', learning_rate=1e-3,
        multi_step=80, subsampling_fraction=1.0,
    ),
)

I am not sure what is going wrong here. Is the reward calculation wrong? Is the agent behaving differently during training and testing?
Any help would be appreciated!
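
For reference, my evaluation loop follows the act/observe pattern from the Tensorforce docs. This is only a sketch assuming the 0.6-style API, where independent=True keeps acting separate from training and deterministic=True disables exploration noise (which may explain part of the train/eval difference):

# Sketch of a deterministic evaluation loop (Tensorforce 0.6-style API assumed).
num_eval_episodes = 10
sum_rewards = 0.0
for _ in range(num_eval_episodes):
    states = environment.reset()
    internals = agent.initial_internals()
    terminal = False
    while not terminal:
        # independent=True: do not feed experience back into training;
        # deterministic=True: act without exploration noise.
        actions, internals = agent.act(
            states=states, internals=internals,
            independent=True, deterministic=True,
        )
        states, terminal, reward = environment.execute(actions=actions)
        sum_rewards += reward
print('Mean evaluation reward:', sum_rewards / num_eval_episodes)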

Here is the code to my colab link: https://colab.research.google.com/drive/1Hpi_DFyxMfTDoQ1iUtwz1epHAziQo1HI?usp=sharing

Video of trained Agent interacting with Ant-v3: Video1

Episodic rewards (X-axis: number of episodes, Y-axis: reward):
[reward plot]

UPDATE:
I trained this for ~3000 epochs with a batch size of 4, and the resulting policy works much better (the ant moves to the right instead of staying in one place): Video2

However, I cannot explain this improvement from the training episodic rewards themselves, which look like this:
[reward plot]

Based on the reward definition of Ant-v3, the agent gets a reward of +1 per step just for surviving, so it makes sense for it to stay in one place for the entire episode (as in Video1). I do not understand what incentivized it to learn anything else, and if it is learning, how can I measure it (since the reward plot above makes it look like the episodic rewards are not increasing)?
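
One way I plan to check whether the policy is actually learning locomotion rather than just collecting the survival bonus is to log the reward components separately. This is a sketch assuming the Ant-v3 info dict exposes 'reward_forward' and 'reward_survive' (as in the Gym MuJoCo v3 environments); the random action is a placeholder for the trained agent's action:

import gym

# Sketch: split the episodic return into forward-progress and survival components.
# Assumes Ant-v3's step() info dict contains 'reward_forward' and 'reward_survive'.
env = gym.make('Ant-v3')
obs = env.reset()
forward_total, survive_total, terminal = 0.0, 0.0, False
while not terminal:
    action = env.action_space.sample()  # placeholder: substitute the trained agent's action
    obs, reward, terminal, info = env.step(action)
    forward_total += info.get('reward_forward', 0.0)
    survive_total += info.get('reward_survive', 0.0)
print('forward component:', forward_total, 'survival component:', survive_total)

If the forward component grows over training while the total return stays flat (because control and contact costs offset it), that would explain why the behaviour improves even though the episodic reward plot does not.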
