Hi,
I am trying to recreate OpenAI's PPO training using Tensorforce. I have tried to replicate the hyperparameters used in https://spinningup.openai.com/en/latest/spinningup/bench.html but:
Agent:
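For context, here is roughly the hyperparameter set I was aiming to mirror. This is a sketch based on the documented Spinning Up PPO defaults, not the exact agent spec from my notebook (see the colab link below for that):

```python
# Hypothetical sketch of the Spinning Up-style PPO settings I tried to
# reproduce in Tensorforce. Values are the Spinning Up library defaults;
# the benchmark runs may differ.
ppo_hyperparams = {
    "network": [64, 64],       # two hidden layers (tanh activations)
    "steps_per_epoch": 4000,   # environment steps collected per update
    "discount": 0.99,          # gamma
    "gae_lambda": 0.97,        # lambda for GAE advantage estimation
    "clip_ratio": 0.2,         # PPO likelihood-ratio clipping
    "policy_lr": 3e-4,         # policy learning rate
    "value_lr": 1e-3,          # value-function learning rate
}
print(ppo_hyperparams["clip_ratio"])
```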
I am not sure what is going wrong here. Is the reward calculation wrong? Is the agent behaving differently during training vs. evaluation?
Any help would be appreciated!
Here is the code to my colab link: https://colab.research.google.com/drive/1Hpi_DFyxMfTDoQ1iUtwz1epHAziQo1HI?usp=sharing
Video of trained Agent interacting with Ant-v3: Video1
Episodic Rewards (X-axis: Number of episodes, Y-axis: Reward):
![image](https://user-images.githubusercontent.com/19833834/174219645-8c5d5ef8-5e32-42ef-8841-3e7bb8e5fbd8.png)
UPDATE:
I trained this for ~3000 epochs with a batch size of 4, and the resulting policy works much better (the ant moves right instead of staying in one place): Video2
However, I cannot explain this improvement from the training episodic rewards alone, which look like this:
![image](https://user-images.githubusercontent.com/19833834/174239370-a56a0203-cc03-4073-afb0-31c1004e9878.png)
Based on the reward definition of Ant-v3, the agent gets a reward of +1 per timestep just for surviving, so it makes sense for it to stay in one place for the entire episode (as in Video1). I do not understand what incentivized it to learn anything else, and if it is learning, how do I measure that (since the reward plot above makes it look like the episodic rewards are not increasing)?
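One way I could check for a learning trend despite the episode-to-episode noise is to smooth the episodic rewards with a moving average and compare early vs. late values. A minimal sketch, assuming the per-episode returns are stored in a plain list (`rewards` below is synthetic data for illustration):

```python
import random


def moving_average(rewards, window=100):
    """Smooth noisy episodic rewards with a sliding-window mean."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    for i in range(len(rewards)):
        chunk = rewards[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Synthetic example: a noisy but slowly improving reward curve.
random.seed(0)
rewards = [100 + 0.5 * ep + random.uniform(-200, 200) for ep in range(1000)]
smoothed = moving_average(rewards, window=100)

# If learning is happening, the smoothed tail sits clearly above the head,
# even when the raw curve looks flat.
print(smoothed[-1] > smoothed[99])
```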