Fix session JobOrderFn to be predictable #3960

Open
wants to merge 2 commits into base: master

@raravena80 (Contributor) commented Jan 8, 2025

What type of PR is this?

/kind bug
/area scheduling

What this PR does / why we need it:

  • Fixes an issue where JobOrderFn does not return a consistent order when jobs with the same creation timestamp are compared by UID. Jobs are now ordered by job name instead.
  • Also ensures that when two nodes have the same score, the list is always returned in the same order, using the node name as the tie-breaker.
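
The tie-breaking idea in the second bullet can be sketched as follows. This is a minimal illustration, not Volcano's actual code: the `nodeScore` type and the `sortNodes` helper are hypothetical names. Sorting by score alone leaves the relative order of equal-score nodes dependent on input order, while adding the node name as a secondary key gives the same result on every run.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeScore is a hypothetical pair of node name and score (not a Volcano type).
type nodeScore struct {
	Name  string
	Score float64
}

// sortNodes orders nodes by descending score and breaks ties by ascending name,
// so equal-score nodes always end up in the same relative order.
func sortNodes(nodes []nodeScore) {
	sort.Slice(nodes, func(i, j int) bool {
		if nodes[i].Score != nodes[j].Score {
			return nodes[i].Score > nodes[j].Score
		}
		return nodes[i].Name < nodes[j].Name
	})
}

func main() {
	nodes := []nodeScore{{"node-c", 10}, {"node-a", 10}, {"node-b", 20}}
	sortNodes(nodes)
	fmt.Println(nodes) // [{node-b 20} {node-a 10} {node-c 10}]
}
```

Because the comparator never reports two distinct names as equal, the output does not depend on whether the underlying sort is stable.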

Which issue(s) this PR fixes:

Fixes #3962

Special notes for your reviewer:

Do we need to open an issue?

Does this PR introduce a user-facing change?

- Fixes an issue where JobOrderFn does not return a consistent order when jobs with the same creation timestamp are compared by UID. Jobs are now ordered by job name instead.
- Ensures that when two nodes have the same score, the list is always returned in the same order, using the node name as the tie-breaker.

- This fixes an issue we've been seeing where the JobOrder doesn't
  return the same consistent list order when compared by job UUID. We
  are ordering by Job Name now
- Additionally it makes sure that when two jobs have the same score the
  list is always in the consistent order. We are also ordering by Job
  Name.

Signed-off-by: Ricardo Aravena <[email protected]>
@volcano-sh-bot added kind/bug, kind/cleanup labels Jan 8, 2025
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign monokaix
You can assign the PR to them by writing /assign @monokaix in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot added kind/documentation, kind/feature, kind/failing-test, kind/flake, area/scheduling, area/controllers, area/cli, area/dependency, area/webhooks, area/deploy, area/documentation, size/S, area/performance, area/test labels Jan 8, 2025
@Monokaix removed kind/bug, kind/feature, kind/cleanup, area/controllers, area/cli, area/dependency, kind/failing-test, kind/flake, kind/documentation, area/test labels Jan 9, 2025
@Monokaix added kind/bug and removed area/webhooks, area/deploy, area/documentation, area/performance labels Jan 9, 2025
@JesseStutler
Member

> Additionally it makes sure that when two jobs have the same score the list is always in the consistent order. We are also ordering by Job Name.

  • two jobs --> two nodes
  • Job Name --> Node Name

@JesseStutler
Member

Yes, we should create an issue so that other contributors can trace why this was changed.
Also, what happens in your cluster when UID sorting is used? It would help to see some screenshots of the behavior in the issue :)

@JesseStutler
Member

cc @lowang-bh @Monokaix @hwdef

@raravena80
Contributor Author

> two jobs --> two nodes
> Job Name --> Node Name

Thanks!
Fixed these in the description.

Will get back to you on the exact issue, but I believe with UID we were getting different orders.

- return lv.UID < rv.UID
+ // Use the Name of the Job instead of UID to get deterministic order by Job name
+ klog.V(3).Infof("Creation timestamps are the same for job %v and %v: %v . using name to order", lv.Name, rv.Name, lv.CreationTimestamp)
+ return lv.Name < rv.Name

Member

If two vcjobs have the same name but are in different namespaces, will JobOrder still fail to return a consistent order?

Contributor Author

Good question. So Job Names are namespaced?

@raravena80
Contributor Author

Created ticket #3962 with more details

@hwdef mentioned this pull request Jan 19, 2025
@JesseStutler
Member

Hi, I left some comments on your issue description. Please take a look.

Labels: area/scheduling, kind/bug, size/S
5 participants