-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Freeze fixes and v1 kludges #2545
Conversation
2d3fb7b
to
1f23a7e
Compare
Addressed review comments; rebased. |
1f23a7e
to
2e5b4b5
Compare
2e5b4b5
to
3bf115c
Compare
Done using clang-format 19.1.5 with .clang-format obtained via scripts/fetch-clang-format.sh. Signed-off-by: Kir Kolyshkin <[email protected]>
There are a few issues with the freeze_processes logic: 1. Commit 9fae23f grossly (by 1000x) miscalculated the number of attempts required, as a result, we are seeing something like this: > (00.000340) freezing processes: 100000 attempts with 100 ms steps > (00.000351) freezer.state=THAWED > (00.000358) freezer.state=FREEZING > (00.100446) freezer.state=FREEZING > ...close to 100 lines skipped... > (09.915110) freezer.state=FREEZING > (10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0 > (10.000563) freezer.state=FREEZING For 10s with 100ms steps we only need 100 attempts, not 100000. 2. When the timeout is hit, the "failed to freeze cgroup" error is not printed, and the log_unfrozen_stacks is not called either. 3. The nanosleep at the last iteration is useless (this was hidden by issue 1 above, as the timeout was hit first). Fix all these. While at it, 4. Amend the error message with the number of attempts, sleep duration, and timeout. 5. Modify the "freezing cgroup" debug message to be in sync with the above error. Was: > freezing processes: 100000 attempts with 100 ms steps Now: > freezing cgroup some/name: 100 x 100ms attempts, timeout: 10s Signed-off-by: Kir Kolyshkin <[email protected]>
Cgroup v1 freezer has always been problematic, failing to freeze a cgroup. In runc, we have implemented a few kludges to increase the chance of succeeding, but those are used when runc freezes a cgroup for its own purposes (for "runc pause" and to modify device properties for cgroup v1). When criu is used, it fails to freeze a cgroup from time to time (see [1], [2]). Let's try adding kludges similar to ones in runc. Alas, I have absolutely no way to test this, so please review carefully. [1]: opencontainers/runc#4273 [2]: opencontainers/runc#4457 Signed-off-by: Kir Kolyshkin <[email protected]>
3bf115c
to
868e9fa
Compare
Refactored to fix some more issues. |
As this is now fixed, we may find additional issues with the |
Testing this in opencontainers/runc#4559, no luck so far (can't reproduce the issue). |
I have concluded the testing, these kludges definitely help (i.e. can easily reproduce the issue without the fix, and can not reproduce it with the fix). See opencontainers/runc#4559 for more details. |
@avagin PTAL |
1. freeze_processes: fix logic
There are a few issues with the freeze_processes logic:
attempts required, as a result, we are seeing something like this:
For 10s with 100ms steps we only need 100 attempts, not 100000.
When the timeout is hit, the "failed to freeze cgroup" error is not
printed, and the log_unfrozen_stacks is not called either.
The nanosleep at the last iteration is useless (this was hidden by
issue 1 above, as the timeout was hit first).
Fix all these.
While at it,
Amend the error message with the number of attempts, sleep duration,
and timeout.
Modify the "freezing cgroup" debug message to be in sync with the
above error.
Was:
Now:
2. freeze_processes: implement kludges for cgroup v1
Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.
In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).
When criu is used, it fails to freeze a cgroup from time to time
(see 1, 2). Let's try adding kludges similar to ones in runc.
Alas, I have absolutely no way to test this, so please review carefully.