
value(policy, b, a) for AlphaVectorPolicy #504

Open
zsunberg opened this issue Jun 10, 2023 · 7 comments
Labels
Contribution Opportunity: This would be something that would be very useful to the community and a good modular addition.
decision
Learning Opportunity: Fixing this would be a good straightforward exercise to improve your julia coding skills.
quick: This task shouldn't take too long.

Comments

@zsunberg (Member)

This should be implemented, but currently is not. One question is what to return for an action that doesn't have an alpha vector: should it be missing or -Inf? Probably missing. If we make that decision, actionvalues should probably also be updated to match.

Relevant to #492 (reply in thread)
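
For concreteness, a sketch of what this could look like (assuming the policy stores its alpha vectors in fields like alphas and action_map and that b supports pdf; the missing fallback reflects the open question above):

using POMDPs, POMDPTools, LinearAlgebra

# Sketch only: alphas, action_map, and the pdf-based belief conversion are assumptions
# about the internals, not the actual implementation.
function POMDPs.value(p::AlphaVectorPolicy, b, a)
    bvec = [pdf(b, s) for s in ordered_states(p.pomdp)]  # belief as a probability vector
    best = missing
    for (α, act) in zip(p.alphas, p.action_map)
        act == a || continue              # only alpha vectors associated with action a
        v = dot(α, bvec)
        best = ismissing(best) ? v : max(best, v)
    end
    return best                           # missing when a has no alpha vector
end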

zsunberg added the Contribution Opportunity, quick, and Learning Opportunity labels on Jun 10, 2023
@zsunberg (Member, Author)

missing and argmax do not play well together...

julia> argmax([1.0, missing, 3.0])
2

julia> argmin([1.0, missing, 3.0])
2

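(coalesce can paper over this, but it feels like a workaround:)

julia> argmax(coalesce.([1.0, missing, 3.0], -Inf))
3
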
@zsunberg (Member, Author) commented Jun 10, 2023

We could use nothing:

julia> argmax([1.0, nothing, 3.0])
ERROR: MethodError: no method matching isless(::Float64, ::Nothing)

julia> argmax(something.([1.0, nothing, 3.0], -Inf))
3

but returning nothing from value does not seem right.

@zsunberg (Member, Author)

I guess -Inf is actually the right choice, because this policy considers any action without an alpha vector to be infinitely bad?

@johannes-fischer (Contributor)

I think this is problematic even in the case where every action has an alpha vector: value(policy, b, a) will only be correct for the optimal action, but might be wrong for dominated actions if you are just maximizing over the alpha vectors corresponding to a.

For example, in TigerPOMDP, consider a belief b for which it is optimal to open a door. If you want to compute value(policy, b, :listen), the alpha vectors corresponding to :listen give a suboptimal action-value estimate for :listen at this belief b. The alpha vector for listening at b is plotted in red below; it is globally dominated (on reachable beliefs) by other alpha vectors, but not dominated if you only consider :listen. So value(policy, b, :listen) should be approximately 25.4, not 24.7.

[Figures: tiger_action_values and tiger_action_values_zoom, plotting the TigerPOMDP alpha vectors described above with the :listen vector at b highlighted in red]

I think to implement this correctly a 1-step lookahead for a is necessary, which would also solve the issue of actions that don't have alpha vectors.
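
Roughly what I mean, as a sketch only (it assumes discrete state and observation spaces; lookahead_value and the updater argument up are illustrative names):

using POMDPs, POMDPTools

function lookahead_value(policy, pomdp, up, b, a)
    S = ordered_states(pomdp)
    r = sum(pdf(b, s) * reward(pomdp, s, a) for s in S)       # expected immediate reward
    v = 0.0
    for o in observations(pomdp)
        # P(o | b, a) = sum_s b(s) * sum_s' T(s'|s,a) * Z(o|a,s')
        po = sum(pdf(b, s) *
                 sum(pdf(transition(pomdp, s, a), sp) * pdf(observation(pomdp, a, sp), o)
                     for sp in S)
                 for s in S)
        po > 0.0 || continue
        bp = update(up, b, a, o)       # posterior belief after taking a and observing o
        v += po * value(policy, bp)    # bootstrap with the policy's own value estimate
    end
    return r + discount(pomdp) * v
end

For the tiger example above this would be called as something like lookahead_value(policy, pomdp, DiscreteUpdater(pomdp), b, :listen).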

@zsunberg (Member, Author)

value is not required to return the true value of the policy; it is just what the policy thinks the value is. For example, QMDP returns an AlphaVectorPolicy, but its alpha vectors are significant over-estimates.

A single-step look-ahead might be a viable way to solve the no-alpha-vector-for-this-action problem, but it would not make value always return the true optimal value for that action, since we may not have the optimal alpha vectors in the part of the belief space that the look-ahead explores.

@johannes-fischer (Contributor)

I agree with everything you say: value needs to return neither the optimal value nor the policy's true value. But I still think that, without a one-step look-ahead, an AlphaVectorPolicy is only theoretically justified in computing value(policy, b), not value(policy, b, a) by maximizing only over the alpha vectors whose associated action is a. Otherwise, policies that kept dominated alpha vectors would give better estimates of value(policy, b, a) for suboptimal actions.

@zsunberg (Member, Author)

I concede your point that there is no objective theoretical definition of value(p, b, a), but some difficulties would come up if we tried to do a backup in every call to value(p, b, a), including:

  1. It would involve iterating over the observation space. Some algorithms (e.g. QMDP) avoid or approximate the observation space so they can produce alpha vectors even when the observation space is large or continuous. value(p, b, a) with a backup would be inefficient if the observation space is large and impossible if it is continuous.
  2. If the alpha vectors are approximate, a backup could reveal an action with a higher value, so maximum(actionvalues(p, b)) might be greater than value(p, b).

Also, when Maxime originally implemented actionvalues, he interpreted it as not including a backup. Overall, these reasons convince me that we should not do a backup in value(p, b, a).
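
Concretely, my current leaning would be something like this sketch (same assumed field names as in the earlier sketch, no backup, -Inf fallback; actionvalues_sketch is just an illustrative name, and the real method would extend actionvalues from POMDPTools):

using POMDPs, POMDPTools, LinearAlgebra

# No backup; -Inf for an action that has no alpha vector.
function POMDPs.value(p::AlphaVectorPolicy, b, a)
    bvec = [pdf(b, s) for s in ordered_states(p.pomdp)]
    return maximum((dot(α, bvec) for (α, act) in zip(p.alphas, p.action_map) if act == a);
                   init = -Inf)
end

# actionvalues then just collects these values for every action.
actionvalues_sketch(p::AlphaVectorPolicy, b) = [value(p, b, a) for a in actions(p.pomdp)]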
