Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subscriber is being held when another subscriber callback is being called #2678

Open
TonyWelte opened this issue Nov 20, 2024 · 21 comments
Open
Labels
bug Something isn't working

Comments

@TonyWelte
Copy link

Bug report

Required Info:

  • Operating System:
    • Ubuntu 24.04 (docker and host)
  • Installation type:
    • binaries
  • Version or commit hash:
    • jazzy
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

#include <rclcpp/rclcpp.hpp>

#include <std_msgs/msg/empty.hpp>

class MyClass {
  public:
    MyClass(rclcpp::Node& node) {
        std::cout << "Constructor: " << this << std::endl;

        _sub = node.create_subscription<std_msgs::msg::Empty>("test", 10, std::bind(&MyClass::callback, this, std::placeholders::_1));
    }

    ~MyClass() {
        std::cout << "Destructor:  " << this << ", Count: " << _sub.use_count() << std::endl;
    }

    void callback(const std_msgs::msg::Empty&) {
        std::cout << "Callback:    " << this << std::endl;
        my_data += 1.0;
    } 

  private:
  rclcpp::Subscription<std_msgs::msg::Empty>::SharedPtr _sub;

  double my_data;
};

int main(int argc, char* argv[]) {
    rclcpp::init(argc, argv);
    
    auto node = std::make_shared<rclcpp::Node>("test_node");

    std::unique_ptr<MyClass> object;
    
    auto sub = node->create_subscription<std_msgs::msg::Empty>("sub", 10, [&object, &node](const std_msgs::msg::Empty&){
        if(object){
            object.reset();
        }
        else {
            object = std::make_unique<MyClass>(*node);
        }
    });

    rclcpp::spin(node);

    rclcpp::shutdown();
    
    return 0;
}

You should be able to build the executable with just rclcpp and std_msgs linked.

I recommend compiling with the address sanitizer to make sure it crashes once the issue occurs. Without it, it might not crash but it's definitely UB.

colcon build --cmake-args -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS=-fsanitize=address

In three terminals

# Term1
ros2 topic pub --rate 10.0 /sub std_msgs/msg/Empty '{}'
# Term2
ros2 topic pub --rate 100.1 /test std_msgs/msg/Empty '{}'
# Term3
ASAN_OPTIONS=new_delete_type_mismatch=0 ros2 run YOUR_TEST_PACKAGE YOU_TEST_EXECUTABLE

Expected behavior

My understanding was that the executor had weak ownership of the subscriber and would only get a shared_ptr when executing the callback. Hence, in a single threaded context when a callback from one subscriber is called it should be possible to destroy another subscriber and clean related objects.

Even in a multi threaded context I would expect the same if the two subscribers interacting with each other are part a mutually exclusive callback group.

Actual behavior

Here when the callback for /sub is called MyClass is destroyed but the subscriber to /test it contains is held by someone else (the count of the share_ptr is 2).

This worked fine with ROS2 humble, the subscriber count was 1 when it's callback wasn't being called.

Assuming this is the intended behavior it becomes difficult cleanup objects used by subscribers (or containing subscribers as in the example). I'm sure it still feasible by keeping a weak_ptr before resting our shared_ptr, then the executor will hopefully release its shared_ptr and only after that destroy the object. But what was as simple as destroying the object becomes more convoluted.

Additional information

I found these issues that are somewhat related but I think it's a different issue since they had callbacks called in parallel which is not the case here:

@Barry-Xu-2018
Copy link
Collaborator

The implementation of Executor has been refactored.

When there are some "wait" ready to handle,

[this](WaitResultKind wait_result_kind) -> WaitResult<WaitSetTemplate> {
// convert the result into a WaitResult
switch (wait_result_kind) {
case WaitResultKind::Ready:
return WaitResult<WaitSetTemplate>::from_ready_wait_result_kind(*this);

A WaitResult will be created. storage_acquire_ownerships() will be called in wait_result_acquire(). It will get ownership for all waitable entities.

WaitResult(WaitResultKind wait_result_kind, WaitSetT & wait_set)
: wait_result_kind_(wait_result_kind),
wait_set_pointer_(&wait_set)
{
// Should be enforced by the static factory methods on this class.
assert(WaitResultKind::Ready == wait_result_kind);
// Secure thread-safety (if provided) and shared ownership (if needed).
this->get_wait_set().wait_result_acquire();
}

When handling each ready wait (call callback function), the ownerships for all waitable entities are hold (I understand that this is probably done for efficiency reasons).

Is this something that needs to be changed? Let's hear what others think.

@ipa-fez
Copy link

ipa-fez commented Jan 7, 2025

I also ran into this while debugging something related. It seems like an easy pitfall that's quite hard to debug, without address sanitizer hinting about the use after free.

The thing I'm actually doing is resetting an object containing a subscriber in a service callback, but in my minimal test setup, I switched to subscribers and hit this bug. In my case this seems to be a red herring though, as switching my minimal example to a service does not trigger the crash.

@ipa-fez
Copy link

ipa-fez commented Jan 8, 2025

Update: I think I managed to reproduce it with service calls resetting subscribers as well.

@TonyWelte
Copy link
Author

I've also noticed that some combination works fine. I think timers resetting subscribers is fine (or the other way around I don't remember). I think the executor might run timer and subscriber in a specific order so it's fine for timer to reset subscribers since subscriber are executed first so the timer can reset them without risking it being executed after the reset.

Even if some combination work I don't think it's something we should rely on. Even if one order is safe I don't think it's specified anywhere so it could change in the future.

For our use case we opted to declare both subscribers in the main Node class and have it check if MyClass still exist before dispatching the call to its callback. It's not the cleanest but at least I'm confident it works.

I've also experimented with enable_shared_from_this so that the /test subscriber can hold a weak_ptr to the object it needs. It prevents MyClass::callback from being called once the object has been destroyed and doesn't require special treatment when destroying the object.

#include <rclcpp/rclcpp.hpp>

#include <memory>

#include <std_msgs/msg/empty.hpp>

class MyClass : public std::enable_shared_from_this<MyClass> {
  private:
    struct Private{ explicit Private() = default; };

  public:
    MyClass() = delete;
    MyClass(Private) {
        std::cout << "Constructor: " << this << std::endl;
    }

    ~MyClass() {
        std::cout << "Destructor:  " << this << ", Count: " << _sub.use_count() << std::endl;
    }

    void initialize(rclcpp::Node& node) {
        std::cout << "initialization: " << this << std::endl;

        _sub = node.create_subscription<std_msgs::msg::Empty>("test", 10, [weak_ptr = weak_from_this()](const std_msgs::msg::Empty& msg) { auto ptr = weak_ptr.lock(); if(ptr) ptr->callback(msg); });
    }

    static std::shared_ptr<MyClass> create(rclcpp::Node& node) {
        auto ptr = std::make_shared<MyClass>(Private{});
        ptr->initialize(node);

        return ptr;
    }

    void callback(const std_msgs::msg::Empty&) {
        std::cout << "Callback:    " << this << std::endl;
        my_data += 1.0;
    } 

  private:
  rclcpp::Subscription<std_msgs::msg::Empty>::SharedPtr _sub;

  double my_data;
};

int main(int argc, char* argv[]) {
    rclcpp::init(argc, argv);
    
    auto node = std::make_shared<rclcpp::Node>("test_node");

    std::shared_ptr<MyClass> object;
    
    auto sub = node->create_subscription<std_msgs::msg::Empty>("sub", 10, [&object, &node](const std_msgs::msg::Empty&){
        if(object){
            object.reset();
        }
        else {
            object = MyClass::create(*node);
        }
    });

    rclcpp::spin(node);

    rclcpp::shutdown();
    
    return 0;
}

We haven't used this solution in our codebase yet because I'm not fully confident it completely works. The sanitizer doesn't complain and from what I understand about enable_shared_from_this and the lifetime of lambdas it should work but I might be missing something.

@ipa-fez
Copy link

ipa-fez commented Jan 9, 2025

I'm using subscribers with event callbacks and looking at ready_executables() in executor_entities_collection.cpp it seems that the order is:

subscribers -> services -> actions -> waitables

so in that case I will get:

subscription callback
service call
free subscription and captured data
subscription event callback <- use after free of captured data

Deleting a timer that has captures from a service call would also always have a high chance of producing a use after free.
Unless I'm missing something I think you effectively can't stop and free any timers from any event source safely, if it has captures that aren't also behind a shared_ptr/some other lock, because there's no guarantee that there isn't a stray event queued up that will fire after the user data is freed.
From a user perspective, my assumption would be that that should just work.
But I guess technically a shared_ptr puts this all into the world of reference semantics, so technically a class that has a timer as a member and some data that gets touched by a callback doesn't actually owns the timer.
This means that the lifetime of the timer is decoupled from the class, and RAII destruction order can't be relied on.
I think this is a pretty easy trap to fall into that's very hard to debug, so I'm not sure if that's a good tradeoff for performance.

I guess it's also not really possible to have a "remove this timer" interface to explicitly remove it, as it would mean modifying the collected executables while looping over them.
At the very least this probably needs good documentation.

@jmachowinski
Copy link
Contributor

I had a look at the code, and this looks indeed broken. The rclcpp::WaitResultrclcpp::WaitSet holds strong references to ready entities, which is a bug.
@alsora @mjcarroll @wjwwood thoughts ?

@alsora
Copy link
Collaborator

alsora commented Jan 10, 2025

mmm yes, normally we should use weak references and try to lock each of them when we are about to execute the work

@mjcarroll
Copy link
Member

We discussed this in the client library working group today.

This does seem to be a bug, but there are some limits on what we can do to address it.

I'm going to try an implementation where we demote all of the entities to weak_ptr after the wait is completed and when the result is returned. Before each entity is executed, the weak_ptr will get promoted back to a shared_ptr for the duration of the execution.

We discussed that this should make the cases a lot more rare, but does not completely protect it from happening again. In order to be completely safe, anyone who captures a reference in a subscription/timer/etc callback should also be checking that the reference is still valid before executing. It was discussed that there could be some documentation and best practices around this added.

@ewak
Copy link

ewak commented Jan 11, 2025

Here is a draft pull request to bondcpp, ros/bond_core#108, showing how one might use a helper function to "capture a reference in a subscription/timer/etc callback should also be checking that the reference is still valid before executing."

Maybe something like createSafeSubscriptionMemFuncCallback() could make its way into the rclcpp namespace.
https://github.com/ros/bond_core/pull/108/files#diff-53fc58d73ea8d104280e6d3fb43e83706760d598fd25dbd2d1176e0c8ccf56d4

@jmachowinski
Copy link
Contributor

The weak pointer message things puts a lot of boilerplate code on the user side. I wonder if it would be better to add something like 'stop_callbacks' to entities. This method would have a mutex inside, that we also hold during the callback itself. This would be a race free way to disable the callbacks. As the mutex is per entities, it should also be mostly dead lock free. I guess you could still deadlock if you had two callbacks that would try to stop each other at the same time...

After calling this method it would basically not matter any more, that there are still references to the entities around, as they would not access any 'outside' memory any more.

@ewak
Copy link

ewak commented Jan 12, 2025

The weak pointer message things puts a lot of boilerplate code on the user side.

Sure it does, which is why a helper function at least minimizes it.

I wonder if it would be better to add something like 'stop_callbacks' to entities.

@jmachowinski Do you envisage the user code calling this? If not, what triggers the call to stop_callbacks? (BTW I will need to study rclcpp code more to get a grip on what you are suggesting, any pointers(links to parts of the code you mean would be welcome.)

Spitballing/Brain Storming below IOW take with grain of salt/ignore as taking notes as I explore the details.

  1. I briefly looked at adding a ConstRefLifetimeCheckedCallback signature to AnySubscriptionCallbackPossibleTypes. Although it would still require work in user code to change the callback function signature and probably could not be made to detect a lifetime expiry anyway.

  2. Another way might be to be have a function signature (overload) for create_subscription, or even a new one create_cancellable_subscription, that allows passing in a shared_ptr to be held by as a weak_ptr in AnySubscriptionCallback, along with the callback function pointer. The weak_ptr can then be checked for lifetime expiry before invoking the callback in the various applicable dispatch functions in

  3. This is an opportunity to create a rclcpp_lifecycle::LifeCycleSubscription that includes a lifetime shared_ptr in its create_subscription function signature.

  4. Another possibility is to store a callback_lifetime pointer, maybe in SubscriptionOptionsBase. std::weak_ptr<void> callback_lifetime. This would allow a user to manage the subscription callback lifetime using the existing create_subscription API. Note its void so that SubscriptionOptionsBase doesn't need to know about the type. The intended usage would be to prevent the dispatch of the callback if the callback_lifetime weak_ptr had expired.

@ewak
Copy link

ewak commented Jan 12, 2025

I found this video, "WaitSet Presentation at RTWG Nov 2020" about the design of the WaitSet which was a good introduction.
#1047 (comment)

@ewak
Copy link

ewak commented Jan 12, 2025

See PR #2725 for exploration of option 4.

Another possibility is to store a callback_lifetime pointer, maybe in SubscriptionOptionsBase. std::weak_ptr callback_lifetime. This would allow a user to manage the subscription callback lifetime using the existing create_subscription API. Note its void so that SubscriptionOptionsBase doesn't need to know about the type. The intended usage would be to prevent the dispatch of the callback if the callback_lifetime weak_ptr had expired.

@jmachowinski
Copy link
Contributor

Option 4 has a race. The executor runs asynchronous in respect to the user code, therefore you can't figure out if the executor is currently handling the callback, while you are dropping the pointer.
If you think about it, it's basically what I proposed just in different...

Do you envisage the user code calling this? If not, what triggers the call to stop_callbacks?

This is an optional call that you need in certain cases to avoid races, like the one in this bug report. In most cases users would not need to call it, as dropping off the shared ptr to the entitie is good enough for their use case.

@ipa-fez
Copy link

ipa-fez commented Jan 13, 2025

Won't collecting the entities as weak_ptrs instead of shared_ptrs and only locking just for the call solve this for any single-threaded cases (I think this is what @mjcarroll suggested)?

That way if subscriber A runs first and deletes subscriber B, locking B will return null and the callback can be skipped. If it's the other way round, then there's no lifetime issue in the first place as Subscriber B is already gone the next time work is collected.

@jmachowinski
Copy link
Contributor

Won't collecting the entities as weak_ptrs instead of shared_ptrs and only locking just for the call solve this for any single-threaded cases (I think this is what @mjcarroll suggested)?

Yes, for only for the singe threaded executor use case, this would solve the issue. There would still be a race, if you drop a entitiy from a second thread etc. As having other threads and using the multi threaded executor is a 'nominal' usage, we should aim at a solution, that works for everything.

@ipa-fez
Copy link

ipa-fez commented Jan 13, 2025

I agree that it would be nice to safely do this, but Is that even solvable for the general case?

For callbacks in the same mutually exclusive callback group it may be reasonable to expect this to work. But can there be a race between one executor thread being in subscriber A while another thread has already locked subscriber B in that case?

If the callbacks are in different, concurrent callback groups, then you will need to be very careful when sharing data between them anyway, and arguably if you're opting into that you may be aware of that. Is opting into a "I promise to set this flag to 0 when I've deleted my shared data, please don't call the callback if I've done so" good enough? It will still crash if subscriber A runs after B has already checked the "was I deleted before the callback should be called" flag and deletes B's shared data during the callback. So you would need a shared ptr for your shared data either way. That basically brings you back to the "capture self as weak ptr" or have the subscriber not inside the class and capture the class as shared_ptr, keeping it alive.

@jmachowinski
Copy link
Contributor

I have the feeling we are not talking about the same problem. The problem I am seeing here is around the dropping of a shared_ptr to an entitiy, and that you can't be sure weather its still in use by the executor or not. Currently there is no API available that would give you a guarantee, that the executor will not call the callback of any entitiy any more. Therefore the only possible workaround in the current API is:

rclcpp::SubscriberBase::SharedPtr sub;
std::shared_ptr<Lib> myLibProcessingMsgsFromSub;

void onConfigure()
{
     create_subscription([weak_ptr = std::weak_ptr<Lib> = myLibProcessingMsgsFromSub](auto msg)
     {
          auto shr_lib_ptr = weak_ptr.lock();
          if(!shr_lib_ptr)
          {
                return;
          }
     }
}

The use case I am primary concerned about is the lifecycle node usecase. For this usecase, it it pretty normal, that some event cycles your node state. And if going through 'cleanup' you discard all your entities and the objects were the callbacks of the entities point to. And this is exactly the critical point, were the race might hit you.
E.g.

rclcpp::SubscriberBase::SharedPtr sub;
std::shared_ptr<Lib> myLibProcessingMsgsFromSub;

void onCleanup()
{
     sub.reset();
     myLibProcessingMsgsFromA.reset();
     // Lots of sadness, as sub might still be held and used by the executor
}

my proposal:

rclcpp::SubscriberBase::SharedPtr sub;
std::shared_ptr<Lib> myLibProcessingMsgsFromSub;

void onCleanup()
{
    // if the executor is currently executing sub, the mutex inside the
    // entity will delay this call until the execution is done.
     sub.stop_callbacks();
     // 100% safe, as we now have a guarantee that the callbacks from sub
    // will not be called any more
     myLibProcessingMsgsFromSzb.reset();
     sub.reset();
}

@ipa-fez
Copy link

ipa-fez commented Jan 13, 2025

I was thinking about what you would have to do if there's no stop_callbacks() method added and just changes done in for how long the executor keeps the shared_ptr locked.

having an explicit API for stopping the callbacks is much nicer though. It may still be nice to fix it for the singlethreaded case just to avoid the pitfall.

@SteveMacenski
Copy link
Collaborator

We discussed that this should make the cases a lot more rare, but does not completely protect it from happening again. In order to be completely safe, anyone who captures a reference in a subscription/timer/etc callback should also be checking that the reference is still valid before executing. It was discussed that there could be some documentation and best practices around this added.

Late to chime in - but I'd suggest we simply remove the option to subscribe to smart pointer references? In my opinion, we really shouldn't be having shared_ptr<MstT> & callback types anyway since it doesn't properly increment the counter. Raw objects I understand. That one may be unavoidable without forcing users to using smart pointers -- which might not be the worst thing to do in a future release!

This is an optional call that you need in certain cases to avoid races, like the one in this bug report. In most cases users would not need to call it, as dropping off the shared ptr to the entitie is good enough for their use case.

@jmachowinski Why not have the rclcpp::Subscriber destructor call stop_callbacks so its always called behind the scenes that users never need to concern themselves with?

@ipa-fez
Copy link

ipa-fez commented Jan 14, 2025

Why not have the rclcpp::Subscriber destructor call stop_callbacks so its always called behind the scenes that users never need to concern themselves with?

The problem is that calling subscriber.reset(); can't guarantee that the refcount drops to 0, since the executor may still hold a shared reference. In that case the subscriber's destructor won't run until after the user code's destructor if the executor has already locked its weak reference at that point. And the user has no way of telling if its reset() on the shared ptr was the last use, so there's no way to "safely" destroy a subscriber from a callback, you can only work around the issue by making sure any captured data's lifetime is independent of the user code that the subscriber lives in (i.e. the workarounds mentioned above).

To resolve this you need some way to explicitly block until the executor drops and unlocks its reference and prevents it from creating any in the future, so that subscriber.reset() decreases the refcount to 0 and guarantees that the subscriber is actually deleted.

@fujitatomoya fujitatomoya added the bug Something isn't working label Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

9 participants