Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove systems if their parent entity is removed #2232

Open
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

arjo129
Copy link
Contributor

@arjo129 arjo129 commented Nov 7, 2023

🦟 Bug fix

Fixes #2217

Summary

In particular if a user despawns an entity, the associated plugin gets removed. This should prevent issues like #2165. TBH I'm not sure if this is the right way forward as a system should technically be able to access any entity in a traditional ECS.

no_crash

To better understand the changes and review this, the PR consists of the following:

  • Add a Drop function to Barrier.cc & Barrier.hh. This is similar to C++20's std::barrier::await_and_drop
  • Add a SystemContainer structure (and tests). This is a masked vector that allows systems to be removed safely while retaining their original addresses.
  • Add a function to SystemManager to facilitate removing systems from a SystemManager Interface (and tests).
  • Update SimulationRunner (and tests) to use the new barrier and removal processing routines.
  • Update PosePublisher (cause I found it crashing with this new PR due to an unrelated optimization).

Checklist

  • Signed all commits for DCO
  • Added tests
  • Updated documentation (as needed)
  • Updated migration guide (as needed)
  • Consider updating Python bindings (if the library has them)
  • codecheck passed (See contributing)
  • All tests passed (See test coverage)
  • While waiting for a review on your PR, please help review another open pull request to support the maintainers

Note to maintainers: Remember to use Squash-Merge and edit the commit message to match the pull request summary while retaining Signed-off-by messages.

🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸

This commit tries to address #2217.

In particular if a user despawns an entity, the associated plugin gets
removed. This should prevent issues like #2165. TBH I'm not sure if this
is the right way forward as a system should technically be able to
access any entity in a traditional ECS. I also recognize that there may
be some performance impact. I will need to quantify this.

Signed-off-by: Arjo Chakravarty <[email protected]>
@github-actions github-actions bot added the 🌱 garden Ignition Garden label Nov 7, 2023
Signed-off-by: Arjo Chakravarty <[email protected]>
remove.

This makes sense because for the most part, we will not be removing
entities very often so it does not make sense to iterate unless there
are entities which do need removing.

Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Copy link

codecov bot commented Nov 8, 2023

Codecov Report

Attention: Patch coverage is 92.55319% with 7 lines in your changes missing coverage. Please review.

Project coverage is 64.78%. Comparing base (0f27ce0) to head (89a0f7d).
Report is 119 commits behind head on main.

Current head 89a0f7d differs from pull request most recent head f3c57d0

Please upload reports for the commit f3c57d0 to get more accurate results.

Files Patch % Lines
src/SystemManager.cc 89.83% 6 Missing ⚠️
src/systems/pose_publisher/PosePublisher.cc 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2232      +/-   ##
==========================================
- Coverage   65.64%   64.78%   -0.87%     
==========================================
  Files         324      358      +34     
  Lines       30937    29213    -1724     
==========================================
- Hits        20308    18925    -1383     
+ Misses      10629    10288     -341     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jrutgeer
Copy link
Contributor

jrutgeer commented Nov 9, 2023

@arjo129

I thought of this possible alternative implementation:

  • Only the system plugin itself knows which entities are necessary for its functioning,
    • This could e.g. be the model entity it is attached to, but it could as well be any other entity,
    • Hence it's the system plugin's responsibility to identify those entities,
  • To identify these entities, the system plugin adds a component used_by_plugin to each entity it acts upon,
    • A used_by_plugin component of an entity identifies a plugin which acts on that entity,
  • Upon removal of entities, the ECM checks whether they have used_by_plugin components,
    • If so, the SystemManager is triggered to remove the corresponding plugin.

A possible extension could be to also have a kind of severity flag in the component, to allow for different reactions from the SystemManager depending on which entities are removed.

E.g.: consider a system attached to a model and which acts on some subentities of that model:

  • If the model entity is removed, the system should also be removed,
  • But if a subentity is removed, maybe the system should not be removed yet, but rather perform some 'reconfigure' action instead.

@arjo129
Copy link
Contributor Author

arjo129 commented Nov 9, 2023

I like the thought process behind your suggested API addition. However, in my mind that would be an added feature. We need to be able to remove systems safely. That is something that will take a fair deal of work right now and what this PR hopes to (eventually achieve). Your proposed API could be bolted on top in Ionic once we are able to perform safe removal. The current state of this PR is we can safely remove some plugins, but it causes weird breakages in the PosePublisher plugin which I need to investigate prior to making any changes. It is also entirely possible that I'm doing something wrong with the barriers. Either ways I need to revisit this another day when my head is clearer and I have more time.

image

Signed-off-by: Arjo Chakravarty <[email protected]>
This commit removes an "optimization" in the PosePublisher that was
causing the system to segfault when an entity was removed. In reality
this commit is unrelated to the previous few commits. However, without this change
the previous change would cause removal of an entity to crash demos that
used pose publishers. I think the crux of the issue is that protobufs takeover
memory management of objects leading to a scenario where if a reference
was freed elsewhere, the protobuf would continue to hold said reference.
When we try to do operations on the protobuf we end up double-freeing
the memory and hence there is a 💥.

Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
@arjo129
Copy link
Contributor Author

arjo129 commented Nov 24, 2023

I think this works now... Will have to write some unit tests to verify. But it does not go 💥 on the warehouse world.

@azeey
Copy link
Contributor

azeey commented Dec 1, 2023

I haven't had a chance to look into this deeply, but one thing to keep is how this interacts with the levels feature. If a model that contains a system is unloaded by the levels feature and gets loaded back again, what would be the behavior? Would the system get configured and start anew? If that works properly, that would cover most cases, but we might also want the ability for a system to stay loaded even when its parent entity is removed so that if the parent entity is loaded back in, the system might continue from where it left off.

@arjo129
Copy link
Contributor Author

arjo129 commented Dec 1, 2023

Thanks for pointing that out. I'll go check out levels and see how this PR interacts with it. A cursory reading of the levels code though makes it seem like we despawn entities and then respawn them which makes me wonder if holding a reference "entity" in a system is correct. I've not looked into this deeply enough to say anything. Will update this PR when I try it. I'm currently out on my anual firefighting reservist.

@Bogdanov-am
Copy link
Contributor

Is any work ongoing in this PR? I'm currently working on a simulation that requires frequent loading and unloading of models. The problem of unloaded plugins really worries me.

@arjo129
Copy link
Contributor Author

arjo129 commented Mar 6, 2024

I just need to write some tests for this.

Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
@arjo129 arjo129 marked this pull request as ready for review March 18, 2024 08:25
@arjo129 arjo129 requested a review from mjcarroll as a code owner March 18, 2024 08:25
@arjo129 arjo129 changed the base branch from gz-sim7 to main March 18, 2024 08:26
@arjo129
Copy link
Contributor Author

arjo129 commented May 2, 2024

If you'd like to get this checked in faster you could help with reviewing it. One thing to do is to checkout the relevant branch and run it on example worlds to see that things work as expected.

@arjo129 arjo129 requested a review from azeey May 2, 2024 02:10
@landersson
Copy link

If you'd like to get this checked in faster you could help with reviewing it. One thing to do is to checkout the relevant branch and run it on example worlds to see that things work as expected.

Unfortunately, I don't have any experience with the gz-sim codebase and can't really comment on whether this is a suitable solution or not.

Just need a way to remove models without plugins like JointStatePublisher crashing.

@arjo129
Copy link
Contributor Author

arjo129 commented May 2, 2024

Unfortunately, I don't have any experience with the gz-sim codebase and can't really comment on whether this is a suitable solution or not.

You can try it out and tell me if it crashes! There may be plugins I've not tested on!

src/SystemManager.hh Outdated Show resolved Hide resolved
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Copy link
Member

@scpeters scpeters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fix spelling in header file name

src/SystemCointainer.hpp Outdated Show resolved Hide resolved
src/SystemContainer_TEST.cc Outdated Show resolved Hide resolved
src/SystemManager.cc Outdated Show resolved Hide resolved
src/SystemManager.hh Outdated Show resolved Hide resolved
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
src/SystemContainer.hpp Outdated Show resolved Hide resolved
Signed-off-by: Arjo Chakravarty <[email protected]>
src/SimulationRunner.cc Outdated Show resolved Hide resolved
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
@azeey azeey requested a review from scpeters June 10, 2024 18:38
Copy link
Member

@scpeters scpeters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me! I just noted two places where comments could be improved, and then I will be ready to approve

src/SimulationRunner.hh Outdated Show resolved Hide resolved
src/SystemManager.cc Outdated Show resolved Hide resolved
Co-authored-by: Steve Peters <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
src/SystemContainer.hpp Outdated Show resolved Hide resolved
src/SystemContainer.hpp Outdated Show resolved Hide resolved
src/SystemContainer.hpp Outdated Show resolved Hide resolved
src/SystemManager.cc Outdated Show resolved Hide resolved
/// \brief A variable that gets populated with poses. This also here as a
/// member variable to avoid repeated memory allocations and improve
/// performance.
public: msgs::Pose_V poseVMsg;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely concerning. Do we know the root cause? I would be hesitant to merge this if we have to put this kind of condition on system authors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to reproduce the failure by reverting the changes to PosePublisher on this branch, running the pose_publisher.sdf example world, echoing the double pendulum pose topic, and then removing the pendulum, and I haven't seen any problems in testing on macOS. I suppose I could try to reproduce on Ubuntu

Copy link
Contributor Author

@arjo129 arjo129 Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a4c89e3:

This commit removes an "optimization" in the PosePublisher that was
causing the system to segfault when an entity was removed. In reality
this commit is unrelated to the previous few commits. However, without this change
the previous change would cause removal of an entity to crash demos that
used pose publishers. I think the crux of the issue is that protobufs takeover
memory management of objects leading to a scenario where if a reference
was freed elsewhere, the protobuf would continue to hold said reference.
When we try to do operations on the protobuf we end up double-freeing
the memory and hence there is a 💥.

Any form of removal will require that system authors think about lifetimes (hence the reason I target ionic and do not recommend backports). I could get the warehouse world to segfault on ubuntu when I made this change. Its been several months now so I havent tried recently.

Copy link
Contributor Author

@arjo129 arjo129 Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep can confirm without this change I get a crash in PosePublisher:

#11   Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in 
#10   Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc6e332684f, in 
#9    Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc6e3294ac2, in 
#8    Object "/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fc6df4dc252, in 
#7    Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-sim9.so.9", at 0x7fc6de77203e, in 
#6    Object "/home/arjoc/ws/gz/ionic/install/lib/gz-sim-9/plugins/libgz-sim-pose-publisher-system.so", at 0x7fc6aabfa3b7, in 
#5    Object "/home/arjoc/ws/gz/ionic/install/lib/gz-sim-9/plugins/libgz-sim-pose-publisher-system.so", at 0x7fc6aabf9ef0, in 
#4    Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-transport14.so.14", at 0x7fc6de8ed6b0, in gz::transport::v14::Node::Publisher::Publish(google::protobuf::Message const&)
#3    Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-msgs11.so.11", at 0x7fc6dde8e156, in gz::msgs::Pose::ByteSizeLong() const
#2    Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-msgs11.so.11", at 0x7fc6dddf11e8, in gz::msgs::Header::ByteSizeLong() const
#1    Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-msgs11.so.11", at 0x7fc6dddeec5a, in gz::msgs::Header_Map::ByteSizeLong() const
#0    Object "/lib/x86_64-linux-gnu/libprotobuf.so.23", at 0x7fc6dd8f3922, in google::protobuf::RepeatedPtrField<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::Get(int) const
Segmentation fault (Address not mapped to object [0x8])

I can remove this change into a seperate PR if you'd like me to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stack trace indicates that the PosePublisher actually tried to publish after it was removed. To me, this looks like we haven't cleared the PostUpdate thread queue. Somehow we're calling the PostUpdate ona deleted object?

The PosePublisher system is not doing anything special. It's simply publishing in the PostUpdate call, so I don't think we can justify requiring system authors to think this deeply about lifetimes. I would understand having this requirement if it had its own thread like the Sensors system. But even then, the destructor of the system should be called, which should ideally terminate any internal threads and join them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is SystemContainer does not call the destructor in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to reproduce the crash with the following world, but my patch appears to fix it:

gz sim -v 4 "https://fuel.gazebosim.org/1.0/OpenRobotics/worlds/Tugbot in Warehouse"

Copy link
Contributor Author

@arjo129 arjo129 Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@azeey what do you think of @scpeters solution?

Also do I need to call destructors upon deletion? Right now I just defer calling the destructor until the entire application terminates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Destructors are now called in a764498.

Copy link
Contributor Author

@arjo129 arjo129 Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on print statements I can guarantee that the pose publisher is actually being terminated. @scpeters patch causes a crash still. I am more or less sure this has to do with some spooky action at a distance within PosePublisher itself and not my new code. Specifically I ran the warehouse world. I despawned entity 16. The threads belonging to 16 cleanly exit:
Screenshot from 2024-06-20 09-02-48

The thread that crashes belongs to entity 216 (which incidentally has a pose publisher as well and was not de-spawned and so is supposed to publish stuff):

Screenshot from 2024-06-20 08-53-18

Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
@arjo129 arjo129 requested a review from azeey June 19, 2024 02:14
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏛️ ionic Gazebo Ionic
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

System plugins keep executing after their model is removed.
6 participants