-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subscriber is being held when another subscriber callback is being called #2678
Comments
The implementation of Executor has been refactored. When there are some "wait" ready to handle, rclcpp/rclcpp/include/rclcpp/wait_set_template.hpp Lines 698 to 702 in 37e3688
A WaitResult will be created. storage_acquire_ownerships() will be called in wait_result_acquire(). It will get ownership for all waitable entities. rclcpp/rclcpp/include/rclcpp/wait_result.hpp Lines 300 to 308 in 37e3688
When handling each ready wait (call callback function), the ownerships for all waitable entities are hold (I understand that this is probably done for efficiency reasons). Is this something that needs to be changed? Let's hear what others think. |
I also ran into this while debugging something related. It seems like an easy pitfall that's quite hard to debug, without address sanitizer hinting about the use after free. The thing I'm actually doing is resetting an object containing a subscriber in a service callback, but in my minimal test setup, I switched to subscribers and hit this bug. In my case this seems to be a red herring though, as switching my minimal example to a service does not trigger the crash. |
Update: I think I managed to reproduce it with service calls resetting subscribers as well. |
I've also noticed that some combination works fine. I think timers resetting subscribers is fine (or the other way around I don't remember). I think the executor might run timer and subscriber in a specific order so it's fine for timer to reset subscribers since subscriber are executed first so the timer can reset them without risking it being executed after the reset. Even if some combination work I don't think it's something we should rely on. Even if one order is safe I don't think it's specified anywhere so it could change in the future. For our use case we opted to declare both subscribers in the main Node class and have it check if I've also experimented with #include <rclcpp/rclcpp.hpp>
#include <memory>
#include <std_msgs/msg/empty.hpp>
class MyClass : public std::enable_shared_from_this<MyClass> {
private:
struct Private{ explicit Private() = default; };
public:
MyClass() = delete;
MyClass(Private) {
std::cout << "Constructor: " << this << std::endl;
}
~MyClass() {
std::cout << "Destructor: " << this << ", Count: " << _sub.use_count() << std::endl;
}
void initialize(rclcpp::Node& node) {
std::cout << "initialization: " << this << std::endl;
_sub = node.create_subscription<std_msgs::msg::Empty>("test", 10, [weak_ptr = weak_from_this()](const std_msgs::msg::Empty& msg) { auto ptr = weak_ptr.lock(); if(ptr) ptr->callback(msg); });
}
static std::shared_ptr<MyClass> create(rclcpp::Node& node) {
auto ptr = std::make_shared<MyClass>(Private{});
ptr->initialize(node);
return ptr;
}
void callback(const std_msgs::msg::Empty&) {
std::cout << "Callback: " << this << std::endl;
my_data += 1.0;
}
private:
rclcpp::Subscription<std_msgs::msg::Empty>::SharedPtr _sub;
double my_data;
};
int main(int argc, char* argv[]) {
rclcpp::init(argc, argv);
auto node = std::make_shared<rclcpp::Node>("test_node");
std::shared_ptr<MyClass> object;
auto sub = node->create_subscription<std_msgs::msg::Empty>("sub", 10, [&object, &node](const std_msgs::msg::Empty&){
if(object){
object.reset();
}
else {
object = MyClass::create(*node);
}
});
rclcpp::spin(node);
rclcpp::shutdown();
return 0;
} We haven't used this solution in our codebase yet because I'm not fully confident it completely works. The sanitizer doesn't complain and from what I understand about |
I'm using subscribers with event callbacks and looking at
so in that case I will get:
Deleting a timer that has captures from a service call would also always have a high chance of producing a use after free. I guess it's also not really possible to have a "remove this timer" interface to explicitly remove it, as it would mean modifying the collected executables while looping over them. |
I had a look at the code, and this looks indeed broken. The rclcpp::WaitResultrclcpp::WaitSet holds strong references to ready entities, which is a bug. |
mmm yes, normally we should use weak references and try to lock each of them when we are about to execute the work |
We discussed this in the client library working group today. This does seem to be a bug, but there are some limits on what we can do to address it. I'm going to try an implementation where we demote all of the entities to We discussed that this should make the cases a lot more rare, but does not completely protect it from happening again. In order to be completely safe, anyone who captures a reference in a subscription/timer/etc callback should also be checking that the reference is still valid before executing. It was discussed that there could be some documentation and best practices around this added. |
Here is a draft pull request to bondcpp, ros/bond_core#108, showing how one might use a helper function to "capture a reference in a subscription/timer/etc callback should also be checking that the reference is still valid before executing." Maybe something like createSafeSubscriptionMemFuncCallback() could make its way into the rclcpp namespace. |
The weak pointer message things puts a lot of boilerplate code on the user side. I wonder if it would be better to add something like 'stop_callbacks' to entities. This method would have a mutex inside, that we also hold during the callback itself. This would be a race free way to disable the callbacks. As the mutex is per entities, it should also be mostly dead lock free. I guess you could still deadlock if you had two callbacks that would try to stop each other at the same time... After calling this method it would basically not matter any more, that there are still references to the entities around, as they would not access any 'outside' memory any more. |
Sure it does, which is why a helper function at least minimizes it.
@jmachowinski Do you envisage the user code calling this? If not, what triggers the call to stop_callbacks? (BTW I will need to study rclcpp code more to get a grip on what you are suggesting, any pointers(links to parts of the code you mean would be welcome.) Spitballing/Brain Storming below IOW take with grain of salt/ignore as taking notes as I explore the details.
|
I found this video, "WaitSet Presentation at RTWG Nov 2020" about the design of the WaitSet which was a good introduction. |
See PR #2725 for exploration of option 4.
|
Option 4 has a race. The executor runs asynchronous in respect to the user code, therefore you can't figure out if the executor is currently handling the callback, while you are dropping the pointer.
This is an optional call that you need in certain cases to avoid races, like the one in this bug report. In most cases users would not need to call it, as dropping off the shared ptr to the entitie is good enough for their use case. |
Won't collecting the entities as weak_ptrs instead of shared_ptrs and only locking just for the call solve this for any single-threaded cases (I think this is what @mjcarroll suggested)? That way if subscriber A runs first and deletes subscriber B, locking B will return null and the callback can be skipped. If it's the other way round, then there's no lifetime issue in the first place as Subscriber B is already gone the next time work is collected. |
Yes, for only for the singe threaded executor use case, this would solve the issue. There would still be a race, if you drop a entitiy from a second thread etc. As having other threads and using the multi threaded executor is a 'nominal' usage, we should aim at a solution, that works for everything. |
I agree that it would be nice to safely do this, but Is that even solvable for the general case? For callbacks in the same mutually exclusive callback group it may be reasonable to expect this to work. But can there be a race between one executor thread being in subscriber A while another thread has already locked subscriber B in that case? If the callbacks are in different, concurrent callback groups, then you will need to be very careful when sharing data between them anyway, and arguably if you're opting into that you may be aware of that. Is opting into a "I promise to set this flag to 0 when I've deleted my shared data, please don't call the callback if I've done so" good enough? It will still crash if subscriber A runs after B has already checked the "was I deleted before the callback should be called" flag and deletes B's shared data during the callback. So you would need a shared ptr for your shared data either way. That basically brings you back to the "capture self as weak ptr" or have the subscriber not inside the class and capture the class as shared_ptr, keeping it alive. |
I have the feeling we are not talking about the same problem. The problem I am seeing here is around the dropping of a shared_ptr to an entitiy, and that you can't be sure weather its still in use by the executor or not. Currently there is no API available that would give you a guarantee, that the executor will not call the callback of any entitiy any more. Therefore the only possible workaround in the current API is: rclcpp::SubscriberBase::SharedPtr sub;
std::shared_ptr<Lib> myLibProcessingMsgsFromSub;
void onConfigure()
{
create_subscription([weak_ptr = std::weak_ptr<Lib> = myLibProcessingMsgsFromSub](auto msg)
{
auto shr_lib_ptr = weak_ptr.lock();
if(!shr_lib_ptr)
{
return;
}
}
} The use case I am primary concerned about is the lifecycle node usecase. For this usecase, it it pretty normal, that some event cycles your node state. And if going through 'cleanup' you discard all your entities and the objects were the callbacks of the entities point to. And this is exactly the critical point, were the race might hit you. rclcpp::SubscriberBase::SharedPtr sub;
std::shared_ptr<Lib> myLibProcessingMsgsFromSub;
void onCleanup()
{
sub.reset();
myLibProcessingMsgsFromA.reset();
// Lots of sadness, as sub might still be held and used by the executor
} my proposal: rclcpp::SubscriberBase::SharedPtr sub;
std::shared_ptr<Lib> myLibProcessingMsgsFromSub;
void onCleanup()
{
// if the executor is currently executing sub, the mutex inside the
// entity will delay this call until the execution is done.
sub.stop_callbacks();
// 100% safe, as we now have a guarantee that the callbacks from sub
// will not be called any more
myLibProcessingMsgsFromSzb.reset();
sub.reset();
} |
I was thinking about what you would have to do if there's no having an explicit API for stopping the callbacks is much nicer though. It may still be nice to fix it for the singlethreaded case just to avoid the pitfall. |
Late to chime in - but I'd suggest we simply remove the option to subscribe to smart pointer references? In my opinion, we really shouldn't be having
@jmachowinski Why not have the |
The problem is that calling To resolve this you need some way to explicitly block until the executor drops and unlocks its reference and prevents it from creating any in the future, so that |
Bug report
Required Info:
Steps to reproduce issue
You should be able to build the executable with just
rclcpp
andstd_msgs
linked.I recommend compiling with the address sanitizer to make sure it crashes once the issue occurs. Without it, it might not crash but it's definitely UB.
In three terminals
Expected behavior
My understanding was that the executor had weak ownership of the subscriber and would only get a shared_ptr when executing the callback. Hence, in a single threaded context when a callback from one subscriber is called it should be possible to destroy another subscriber and clean related objects.
Even in a multi threaded context I would expect the same if the two subscribers interacting with each other are part a mutually exclusive callback group.
Actual behavior
Here when the callback for
/sub
is calledMyClass
is destroyed but the subscriber to/test
it contains is held by someone else (the count of the share_ptr is 2).This worked fine with ROS2 humble, the subscriber count was 1 when it's callback wasn't being called.
Assuming this is the intended behavior it becomes difficult cleanup objects used by subscribers (or containing subscribers as in the example). I'm sure it still feasible by keeping a weak_ptr before resting our shared_ptr, then the executor will hopefully release its shared_ptr and only after that destroy the object. But what was as simple as destroying the object becomes more convoluted.
Additional information
I found these issues that are somewhat related but I think it's a different issue since they had callbacks called in parallel which is not the case here:
initial_pose_sub_
innav2_amcl
may cause use-after-free bug ros-navigation/navigation2#4166 (comment)The text was updated successfully, but these errors were encountered: