
value(policy, b, a) for AlphaVectorPolicy #504

Open
zsunberg opened this issue Jun 10, 2023 · 7 comments
Labels
Contribution Opportunity: This would be something that would be very useful to the community and a good modular addition.
decision
Learning Opportunity: Fixing this would be a good straightforward exercise to improve your julia coding skills.
quick: This task shouldn't take too long.

Comments

@zsunberg (Member)

This should be implemented, but currently is not. One question is what to return for an action that doesn't have an alpha vector: should it be missing or -Inf? Probably missing. If we make that decision, actionvalues should probably also be updated to match.

Relevant to #492 (reply in thread)
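
For concreteness, a sketch of what this could look like (assuming the policy stores its alpha vectors in fields like alphas and action_map and that b supports pdf; the missing fallback reflects the open question above):

using POMDPs, POMDPTools, LinearAlgebra

# Sketch only: alphas, action_map, and the pdf-based belief conversion are assumptions
# about the internals, not the actual implementation.
function POMDPs.value(p::AlphaVectorPolicy, b, a)
    bvec = [pdf(b, s) for s in ordered_states(p.pomdp)]  # belief as a probability vector
    best = missing
    for (α, act) in zip(p.alphas, p.action_map)
        act == a || continue              # only alpha vectors associated with action a
        v = dot(α, bvec)
        best = ismissing(best) ? v : max(best, v)
    end
    return best                           # missing when a has no alpha vector
end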

zsunberg added the Contribution Opportunity, quick, and Learning Opportunity labels on Jun 10, 2023
@zsunberg (Member, Author)

missing and argmax do not play well together...

julia> argmax([1.0, missing, 3.0])
2

julia> argmin([1.0, missing, 3.0])
2

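(coalesce can paper over this, but it feels like a workaround:)

julia> argmax(coalesce.([1.0, missing, 3.0], -Inf))
3
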
@zsunberg (Member, Author) commented Jun 10, 2023

We could use nothing:

julia> argmax([1.0, nothing, 3.0])
ERROR: MethodError: no method matching isless(::Float64, ::Nothing)

julia> argmax(something.([1.0, nothing, 3.0], -Inf))
3

but returning nothing from value does not seem right.

@zsunberg (Member, Author)

I guess -Inf is actually the right choice, because this policy considers any action without an alpha vector to be infinitely bad?

@johannes-fischer (Contributor)

I think this is problematic even in the case where every action has an alpha vector: value(policy, b, a) will only be correct for the optimal action, but might be wrong for dominated actions if you are just maximizing over the alpha vectors corresponding to a.

For example, in TigerPOMDP, consider a belief b for which it is optimal to open a door. If you want to compute value(policy, b, :listen), the alpha vectors corresponding to :listen give a suboptimal action-value estimate for :listen at this belief b. The alpha vector for listening at b is plotted in red below; it is globally dominated (on reachable beliefs) by other alpha vectors, but not dominated if you only consider :listen. So value(policy, b, :listen) should be approximately 25.4, not 24.7.

[Figures: tiger_action_values and tiger_action_values_zoom, plotting the TigerPOMDP alpha vectors described above with the :listen vector at b highlighted in red]

I think to implement this correctly a 1-step lookahead for a is necessary, which would also solve the issue of actions that don't have alpha vectors.
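
Roughly what I mean, as a sketch only (it assumes discrete state and observation spaces; lookahead_value and the updater argument up are illustrative names):

using POMDPs, POMDPTools

function lookahead_value(policy, pomdp, up, b, a)
    S = ordered_states(pomdp)
    r = sum(pdf(b, s) * reward(pomdp, s, a) for s in S)       # expected immediate reward
    v = 0.0
    for o in observations(pomdp)
        # P(o | b, a) = sum_s b(s) * sum_s' T(s'|s,a) * Z(o|a,s')
        po = sum(pdf(b, s) *
                 sum(pdf(transition(pomdp, s, a), sp) * pdf(observation(pomdp, a, sp), o)
                     for sp in S)
                 for s in S)
        po > 0.0 || continue
        bp = update(up, b, a, o)       # posterior belief after taking a and observing o
        v += po * value(policy, bp)    # bootstrap with the policy's own value estimate
    end
    return r + discount(pomdp) * v
end

For the tiger example above this would be called as something like lookahead_value(policy, pomdp, DiscreteUpdater(pomdp), b, :listen).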

@zsunberg (Member, Author)

value is not required to return the true value of the policy; it is just what the policy thinks the value is. For example, QMDP returns an AlphaVectorPolicy, but its alpha vectors are significant over-estimates.

A single-step look-ahead might be a viable way to solve the no-alpha-vector-for-this-action problem, but it would not make value always return the true optimal value for that action, since we may not have the optimal alpha vectors in the part of the belief space that the look-ahead explores.

@johannes-fischer (Contributor)

I agree with everything you say: value needs to return neither the optimal value nor the policy's true value. But I still think that, without a one-step look-ahead, an AlphaVectorPolicy is only theoretically justified in computing value(policy, b), not value(policy, b, a) by maximizing only over the alpha vectors whose associated action is a. Otherwise, policies that kept dominated alpha vectors would give better estimates of value(policy, b, a) for suboptimal actions.

@zsunberg (Member, Author)

I concede your point that there is no objective theoretical definition of value(p, b, a), but some difficulties would come up if we tried to do a backup in every call to value(p, b, a), including:

  1. It would involve iterating over the observation space. Some algorithms (e.g. QMDP) avoid or approximate the observation space so they can produce alpha vectors even when the observation space is large or continuous. value(p, b, a) with a backup would be inefficient if the observation space is large and impossible if it is continuous.
  2. If the alpha vectors are approximate, a backup could reveal an action with a higher value, so maximum(actionvalues(p, b)) might be greater than value(p, b).

Also, when Maxime originally implemented actionvalues, he interpreted it as not including a backup. Overall, these reasons convince me that we should not do a backup in value(p, b, a).
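
Concretely, my current leaning would be something like this sketch (same assumed field names as in the earlier sketch, no backup, -Inf fallback; actionvalues_sketch is just an illustrative name, and the real method would extend actionvalues from POMDPTools):

using POMDPs, POMDPTools, LinearAlgebra

# No backup; -Inf for an action that has no alpha vector.
function POMDPs.value(p::AlphaVectorPolicy, b, a)
    bvec = [pdf(b, s) for s in ordered_states(p.pomdp)]
    return maximum((dot(α, bvec) for (α, act) in zip(p.alphas, p.action_map) if act == a);
                   init = -Inf)
end

# actionvalues then just collects these values for every action.
actionvalues_sketch(p::AlphaVectorPolicy, b) = [value(p, b, a) for a in actions(p.pomdp)]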
