Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A Response and Response model #61

Closed
wants to merge 5 commits into from
Closed

Conversation

troiwill
Copy link

@troiwill troiwill commented Mar 25, 2024

This pull request is the first stage of implementing the cost-constrained POMCP (CC-POMCP) algorithm. This algorithm cannot be added to the repository directly because it has additional variables and operations not present in the PO-UCT and POMCP algorithms. One example is cost constraints and their corresponding operations.

To accommodate costs and other future variables, I propose to use a generic model, called a Response model, and a corresponding output, called a response. The name "response" comes from the notion of independent and dependent variables, where a response (reward, cost, etc.) depends on the interaction with the real or simulated environment. Thus, a response model is a wrapper for more specific models, such as reward and cost models (and any others that will follow in the future). By extension, a response is a wrapper for the reward, cost, etc.

The pull request has the following:

  1. Implementation of the ResponseModel and Response classes in the basics files,
  2. Updates to classes in basics to use ResponseModel instead of RewardModel directly,
  3. Updates to the appropriate pre-existing algorithms, including PO-UCT and POMCP,
  4. Updates to the code for the tiger, rocksample, mos, load_unload, and tag problems,
  5. Added test script for response arithmetic operations in test_response.py,
  6. Passes for the test_all.py script,
  7. Passes for the tiger, rocksample, mos, load_unload, and tag problems.
  8. Comments to the ResponseModel and Response classes.


cdef class Response:
cdef float _reward

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @troiwill. What is the significance of wrapping a float inside Response? Why does this change need to happen in basic.pyx? This affects all other programs currently using pomdp-py. This would not be acceptable.

"""
A Response class that only handles a scalar reward. Subclasses of Response can add
more (scalar or vector) variables. But the subclasses must implement how to handle
arithmetic and comparison operations.
Copy link
Collaborator

@zkytony zkytony Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The idea of vector reward is interesting, and I value the contribution. But currently the changes are not backwards compatible. I suggest the following changes:

  • Create a file pomdp_py/framework/generalization.pyx in which:
    • Define a class called ResponseAgent that inherits Agent but it takes in a response model instead of a reward model.
    • Define Response, ResponseModel and other necessary components there.
    • Define a sample_generative_function specific for ResponseAgent.
  • Create a file pomdp_py/algorithms/cc_pomcp.pyx in which you can implement the CC-POMCP algorithm.

That way you don't need to make changes to the existing code using float-based reward.
Please keep the changes to the original code in framework and algorithms as little as possible. In addition, you shouldn't need to modify all the examples to achieve your contribution.

Also, please provide a meaningful example that demonstrates CC-POMCP. Currently the test in test_response.py is rather trivial.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @zkytony, thanks for the notes. Your suggestion is helpful.

Ideally, I would have liked to reuse methods from your PO-UCT and POMCP implementations in the CC-POMCP algorithm. However, I cannot do so via subclassing because PO-UCT and POMCP return a single value in rollout and simulate, while CC-POMCP returns two values.

To avoid a backwards compatibility issue, I could reproduce most of the PO-UCT and POMCP code in CC-POMCP. Please let me know how this approach sounds.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you subclass and override rollout and simulate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@troiwill does that make sense? You can still let CC-POMCP inherit PO-UCT, but override rollout and simulate so they can internally work with multi-valued rewards. You might want to create custom VNode and/or QNode classes, since the value stored in the nodes are also not just floats, I suppose.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @zkytony, that makes sense. Once I've made all the changes, I will create another pull request. Thanks for the note.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Will close this PR for now.

@zkytony zkytony closed this Apr 3, 2024
@troiwill troiwill deleted the response-model branch April 19, 2024 05:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants