-
Notifications
You must be signed in to change notification settings - Fork 252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove systems if their parent entity is removed #2232
base: main
Are you sure you want to change the base?
Conversation
This commit tries to address #2217. In particular if a user despawns an entity, the associated plugin gets removed. This should prevent issues like #2165. TBH I'm not sure if this is the right way forward as a system should technically be able to access any entity in a traditional ECS. I also recognize that there may be some performance impact. I will need to quantify this. Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
remove. This makes sense because for the most part, we will not be removing entities very often so it does not make sense to iterate unless there are entities which do need removing. Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2232 +/- ##
==========================================
- Coverage 65.64% 64.78% -0.87%
==========================================
Files 324 358 +34
Lines 30937 29213 -1724
==========================================
- Hits 20308 18925 -1383
+ Misses 10629 10288 -341 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
I thought of this possible alternative implementation:
A possible extension could be to also have a kind of E.g.: consider a system attached to a model and which acts on some subentities of that model:
|
I like the thought process behind your suggested API addition. However, in my mind that would be an added feature. We need to be able to remove systems safely. That is something that will take a fair deal of work right now and what this PR hopes to (eventually achieve). Your proposed API could be bolted on top in Ionic once we are able to perform safe removal. The current state of this PR is we can safely remove some plugins, but it causes weird breakages in the |
Signed-off-by: Arjo Chakravarty <[email protected]>
This commit removes an "optimization" in the PosePublisher that was causing the system to segfault when an entity was removed. In reality this commit is unrelated to the previous few commits. However, without this change the previous change would cause removal of an entity to crash demos that used pose publishers. I think the crux of the issue is that protobufs takeover memory management of objects leading to a scenario where if a reference was freed elsewhere, the protobuf would continue to hold said reference. When we try to do operations on the protobuf we end up double-freeing the memory and hence there is a 💥. Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
I think this works now... Will have to write some unit tests to verify. But it does not go 💥 on the warehouse world. |
I haven't had a chance to look into this deeply, but one thing to keep is how this interacts with the levels feature. If a model that contains a system is unloaded by the levels feature and gets loaded back again, what would be the behavior? Would the system get configured and start anew? If that works properly, that would cover most cases, but we might also want the ability for a system to stay loaded even when its parent entity is removed so that if the parent entity is loaded back in, the system might continue from where it left off. |
Thanks for pointing that out. I'll go check out levels and see how this PR interacts with it. A cursory reading of the levels code though makes it seem like we despawn entities and then respawn them which makes me wonder if holding a reference "entity" in a system is correct. I've not looked into this deeply enough to say anything. Will update this PR when I try it. I'm currently out on my anual firefighting reservist. |
Is any work ongoing in this PR? I'm currently working on a simulation that requires frequent loading and unloading of models. The problem of unloaded plugins really worries me. |
I just need to write some tests for this. |
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
If you'd like to get this checked in faster you could help with reviewing it. One thing to do is to checkout the relevant branch and run it on example worlds to see that things work as expected. |
Unfortunately, I don't have any experience with the gz-sim codebase and can't really comment on whether this is a suitable solution or not. Just need a way to remove models without plugins like JointStatePublisher crashing. |
You can try it out and tell me if it crashes! There may be plugins I've not tested on! |
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: fix spelling in header file name
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good to me! I just noted two places where comments could be improved, and then I will be ready to approve
Co-authored-by: Steve Peters <[email protected]> Signed-off-by: Arjo Chakravarty <[email protected]>
/// \brief A variable that gets populated with poses. This also here as a | ||
/// member variable to avoid repeated memory allocations and improve | ||
/// performance. | ||
public: msgs::Pose_V poseVMsg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely concerning. Do we know the root cause? I would be hesitant to merge this if we have to put this kind of condition on system authors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to reproduce the failure by reverting the changes to PosePublisher
on this branch, running the pose_publisher.sdf example world, echoing the double pendulum pose topic, and then removing the pendulum, and I haven't seen any problems in testing on macOS. I suppose I could try to reproduce on Ubuntu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a4c89e3:
This commit removes an "optimization" in the PosePublisher that was
causing the system to segfault when an entity was removed. In reality
this commit is unrelated to the previous few commits. However, without this change
the previous change would cause removal of an entity to crash demos that
used pose publishers. I think the crux of the issue is that protobufs takeover
memory management of objects leading to a scenario where if a reference
was freed elsewhere, the protobuf would continue to hold said reference.
When we try to do operations on the protobuf we end up double-freeing
the memory and hence there is a 💥.
Any form of removal will require that system authors think about lifetimes (hence the reason I target ionic and do not recommend backports). I could get the warehouse world to segfault on ubuntu when I made this change. Its been several months now so I havent tried recently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep can confirm without this change I get a crash in PosePublisher
:
#11 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#10 Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc6e332684f, in
#9 Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc6e3294ac2, in
#8 Object "/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fc6df4dc252, in
#7 Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-sim9.so.9", at 0x7fc6de77203e, in
#6 Object "/home/arjoc/ws/gz/ionic/install/lib/gz-sim-9/plugins/libgz-sim-pose-publisher-system.so", at 0x7fc6aabfa3b7, in
#5 Object "/home/arjoc/ws/gz/ionic/install/lib/gz-sim-9/plugins/libgz-sim-pose-publisher-system.so", at 0x7fc6aabf9ef0, in
#4 Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-transport14.so.14", at 0x7fc6de8ed6b0, in gz::transport::v14::Node::Publisher::Publish(google::protobuf::Message const&)
#3 Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-msgs11.so.11", at 0x7fc6dde8e156, in gz::msgs::Pose::ByteSizeLong() const
#2 Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-msgs11.so.11", at 0x7fc6dddf11e8, in gz::msgs::Header::ByteSizeLong() const
#1 Object "/home/arjoc/ws/gz/ionic/install/lib/libgz-msgs11.so.11", at 0x7fc6dddeec5a, in gz::msgs::Header_Map::ByteSizeLong() const
#0 Object "/lib/x86_64-linux-gnu/libprotobuf.so.23", at 0x7fc6dd8f3922, in google::protobuf::RepeatedPtrField<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::Get(int) const
Segmentation fault (Address not mapped to object [0x8])
I can remove this change into a seperate PR if you'd like me to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stack trace indicates that the PosePublisher actually tried to publish after it was removed. To me, this looks like we haven't cleared the PostUpdate
thread queue. Somehow we're calling the PostUpdate
ona deleted object?
The PosePublisher
system is not doing anything special. It's simply publishing in the PostUpdate
call, so I don't think we can justify requiring system authors to think this deeply about lifetimes. I would understand having this requirement if it had its own thread like the Sensors
system. But even then, the destructor of the system should be called, which should ideally terminate any internal threads and join them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is SystemContainer does not call the destructor in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to reproduce the crash with the following world, but my patch appears to fix it:
gz sim -v 4 "https://fuel.gazebosim.org/1.0/OpenRobotics/worlds/Tugbot in Warehouse"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Destructors are now called in a764498.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on print statements I can guarantee that the pose publisher is actually being terminated. @scpeters patch causes a crash still. I am more or less sure this has to do with some spooky action at a distance within PosePublisher itself and not my new code. Specifically I ran the warehouse world. I despawned entity 16. The threads belonging to 16 cleanly exit:
The thread that crashes belongs to entity 216 (which incidentally has a pose publisher as well and was not de-spawned and so is supposed to publish stuff):
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
Signed-off-by: Arjo Chakravarty <[email protected]>
🦟 Bug fix
Fixes #2217
Summary
In particular if a user despawns an entity, the associated plugin gets removed. This should prevent issues like #2165. TBH I'm not sure if this is the right way forward as a system should technically be able to access any entity in a traditional ECS.
To better understand the changes and review this, the PR consists of the following:
Drop
function toBarrier.cc
&Barrier.hh
. This is similar to C++20'sstd::barrier::await_and_drop
SystemContainer
structure (and tests). This is a masked vector that allows systems to be removed safely while retaining their original addresses.SystemManager
to facilitate removing systems from aSystemManager
Interface (and tests).SimulationRunner
(and tests) to use the new barrier and removal processing routines.PosePublisher
(cause I found it crashing with this new PR due to an unrelated optimization).Checklist
codecheck
passed (See contributing)Note to maintainers: Remember to use Squash-Merge and edit the commit message to match the pull request summary while retaining
Signed-off-by
messages.🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸🔸