-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race condition between ProbeProcessAliveWithSigTerm in RouDi and SignalWatcher singleton in applications #2308
Comments
@gpalmer-latai there is a reason I don't like singletons. I think I have a workaround and a solution. Since the destruction order of static variables is in reverse order of their construction, you could just call Unfortunately, we need to reset the signal handler in the dtor, else if a signal was raised after the dtor it would access a destroyed semaphore. But it should be possible to just register an empty function as signal handler. AFAIK, this should prevent immediate termination and since the dtor of the There is a third solution but it is not yet fully workable for you. The experimental posh API does not use a singleton for the runtime but a local variable. With this, it is also guaranteed that the unregistering from RouDi is finished before the dtor of the static |
In one translation unit you are right, but when you have multiple ones this is no longer guaranteed. In this use case it should work but I wouldn't recommend this for production code. However, it is perfect for a quick intermediate solution to remove the annoying failures and gives us time to solve this issue cleanly.
It would render the SignalWatcher useless in all other cases, but I agree that something like this is the only viable solution, when we have a singleton runtime. I am pointing this out because I think I used the SignalWatcher before to capture signals like SIGABRT when the shared memory is set to zero - to provide a useful error output to the user. Also, I would not fix a design mistake we made - with the singleton - by adding another mistake in the SignalWatcher, which would then have surprising behavior. The contract is clear: register a signal, and when it goes out of scope, it restores the previous signal catching calls - maybe another instance has overridden some kind of signal handler, and with this change, we would break it.
I would recommend this path. As far as I remember @elBoberido you mentioned that less than a week has to be invested to finish the featureset. Otherwise, it feels like that we try to reduce bugs/technical debt by introducing even more technical debt. |
@elfenpiff when my memories from creating the API for the |
Thanks for the feedback. I'm going to try the suggested workaround for now, and I agree the proper thing to do is to de-singleton the runtime. If I ever get the resources to do so I might help contribute to this (after I get through my backlog of other things I want to upstream). |
In this case I could at least guide you a little bit. |
Required information
When shutting down applications, RouDi periodically sends SIGTERM signals. Usually these are all eaten by the SignalWatcher in applications which simply sets a flag that a SIGTERM was received.
But there is a very small window of time where right after the destructor for the SignalWatcher has run and reset the signal handler, but before the process is actually dead, that the SIGTERM triggers the default signal handler which terminates the program with a non-zero exit code.
This is pretty annoying when monitoring applications for non-zero exit codes because every once in awhile you'll get the SIGTERM exit code (usually 128 + 15 = 143 though in my environment it seems to be just 15).
Operating system:
NVidia DriveOS 6.0.6
Compiler version:
Can't recall off the top of my head - some version of gcc for arm64 CPU
Eclipse iceoryx version:
Fork based off of 8a5e083 (Upstream mean as of 2024.11.1)
Observed result or behaviour:
Occasionally applications have exit code 15, but all signs from RouDi and the application are that it exited gracefully
Expected result or behaviour:
Exit code 0 always. Preferably we monitor for processes being alive or not some other way besides spamming SIGTERM. I believe the idea of a file lock was mentioned w.r.t. Iceoryx2?
Conditions where it occurred / Performed steps:
Have an application which connects to RouDi and uses the
waitForTermination()
signal handler method. Take a non-trivial amount of time to shutdown such that RouDi sends several followup SIGTERM's after the first. Be unlucky and have the signal watcher partially destroyed when receiving one of those SIGTERMSAdditional helpful information
The text was updated successfully, but these errors were encountered: