
Problem switching from a discrete to a continuous action space in a custom environment #217

Open
shengqie opened this issue Jan 26, 2024 · 1 comment

Comments

@shengqie

   Hello developers, I am trying to customize the aircombat environment included in MARLlib, but I ran into some problems during training after the customization. Specifically:
   First, I reduced the 2v2 scenario defined by the environment to a competitive multi-agent 1v1 air-combat scenario. Training this with IPPO produced reasonable results. Then I replaced the MultiDiscrete action space defined by the environment with a continuous action space, as follows:

        self.action_space = spaces.Box(low=-10., high=10., shape=(4,))

    However, after defining the action space as continuous, every algorithm I tried in MARLlib (IPPO, MADDPG, MAPPO, and others) trained extremely poorly: none of them produced an effective strategy, and the reward curves showed no upward trend and did not converge.
    I have built similar setups with MAPPO and other algorithms before, so I do not think the environment is what makes the algorithms fail. Do you have any thoughts on this? Could it be that MARLlib has specific code-writing conventions for continuous action spaces that I am not aware of? Thank you for your answer.
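    For reference, a common convention for continuous-control policies (this is an assumption on my part, not something documented for MARLlib or the aircombat environment) is to keep the Box bounds normalized to [-1, 1] and rescale the action inside the environment's step(), so the Gaussian policy does not have to produce large raw magnitudes. A minimal sketch of that idea, with a hypothetical class and helper name:

        # Illustrative sketch only -- the class and rescale() helper are hypothetical,
        # not part of MARLlib or the aircombat environment.
        import numpy as np
        from gym import spaces

        class NormalizedActionSketch:
            def __init__(self, low=-10.0, high=10.0, dim=4):
                self.low, self.high = low, high
                # The policy samples from a normalized [-1, 1] Box
                self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(dim,), dtype=np.float32)

            def rescale(self, action):
                # Map the normalized action back to the environment's physical range
                action = np.clip(action, -1.0, 1.0)
                return self.low + (action + 1.0) * 0.5 * (self.high - self.low)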


@shengqie
Author

I would like to add one more thing: after converting the action space to a continuous action space, the algorithm hit an error after a few thousand iterations:


Failure # 1 (occurred at 2024-01-26_02-47-08)
Traceback (most recent call last):
File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 890, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 788, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/worker.py", line 1625, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::IPPOTrainer.train_buffered() (pid=1138520, ip=10.31.22.121, repr=IPPOTrainer)
File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/rllib/agents/ppo/ppo_torch_policy.py", line 46, in ppo_surrogate_loss
curr_action_dist = dist_class(logits, model)
File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 186, in __init__
self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/torch/distributions/normal.py", line 50, in __init__
super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "/home/user/miniconda3/envs/marllib/lib/python3.8/site-packages/torch/distributions/distribution.py", line 53, in __init__
raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter loc has invalid values
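For context, this ValueError is raised when the mean ("loc") tensor passed to torch.distributions.Normal contains NaN or inf values, i.e. the policy's mean output diverged at some point during training. A minimal sketch that reproduces the same validation failure (the values are made up for illustration):

    # Feeding a NaN mean into Normal triggers the same parameter validation error.
    import torch

    mean = torch.tensor([float("nan"), 0.0])   # stands in for a diverged policy output
    log_std = torch.tensor([0.0, 0.0])
    torch.distributions.Normal(mean, torch.exp(log_std), validate_args=True)
    # -> ValueError: the "loc" parameter has invalid values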
