Preempt ordering issue #3962

Open
raravena80 opened this issue Jan 9, 2025 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@raravena80
Contributor

raravena80 commented Jan 9, 2025

Description

We have 2 nodes with 4 GPUs each, and the following jobs are deployed:

  • 1 Job (Job 2) with a high priority class and 4 tasks that need 1 GPU each; it takes up 1 whole node
  • 1 Job (Job 1) with a low priority class and 2 replicas that need 1 GPU each
  • 1 Job (Job 3) with a low priority class and 2 replicas that need 1 GPU each
    • The low-priority jobs take up all the GPU capacity of the 2nd node
  • A new Job (Job 4) with 1 replica that needs 1 GPU each is deployed
    • For each task in the new Job (Job 4), the following preemption flow takes place:
      • For the first replica, the eligible Node is the 2nd node where the low priority jobs are deployed
      • The victim preemptable tasks are the tasks that are part of the low priority jobs
      • Since the low-priority jobs share the same PriorityClass, the ordering falls back to either the creationTimestamp or the task UID, which makes the ranking nondeterministic.
        Victims list 
        Task (dev/preempt-dev-job1-low-priority-nginx-0)
        Task (dev/preempt-dev-job3-low-nginx-1)
        Task (dev/preempt-dev-job1-low-priority-nginx-1)
        Task (dev/preempt-dev-job3-low-nginx-0)
        
      • The task selected for eviction, ranked first among the victims, is preempt-dev-job3-low-nginx-1, which is part of a gang job.
        Preemptor task <dev/preempt-dev-job4-high-nginx-0> on Node <ip-10-64-x-x.ec2.internal>. 
        Preemptee task preempt-dev-job3-low-nginx-1
        
      • When the preemption flow runs again for the second replica of the high-priority job, the victim list is again prioritized in a nondeterministic manner, without considering the gang semantics of the task that was already pipelined for eviction:
        Preemptor <mltraining-dev/preempt-dev-job4-high-nginx-1> on Node <ip-10-64-x-x.ec2.internal> victims
        Task (dev/preempt-dev-job1-low-priority-nginx-0)
        Task (dev/preempt-dev-job1-low-priority-nginx-1)
        Task (dev/preempt-dev-job3-low-nginx-0)
        
      • This leads to a task from another gang job being evicted:
        Preemptor <dev/preempt-dev-job4-high-nginx-1> on Node <ip-10-64-x-x.ec2.internal>. Preemptee 
        preempt-dev-job1-low-priority-nginx-1
        
      • This leads to 2 gang jobs being terminated and 4 GPUs being freed instead of 2
      • Notice that the preemptees preempt-dev-job3-low-nginx-1 and preempt-dev-job1-low-priority-nginx-1 always appear second in the victims list (those are the ones picked as victims)
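
One way to picture the tie-breaking problem described above is a comparator that defines a total order over victims. The following is a minimal sketch, not Volcano's implementation: the `Task` type, its fields, and the choice to rank older tasks first are all assumptions made for illustration. The point is only that ending the comparison on a unique key (namespace plus name) yields the same ranking regardless of the order in which the tasks arrive.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Task is a hypothetical, simplified stand-in for a scheduler task.
// It is not Volcano's TaskInfo type.
type Task struct {
	Namespace string
	Name      string
	Priority  int32
	Created   time.Time
}

// less defines a total order over victims: lower priority first, then
// older creation time (an assumed direction), then namespace, then name.
// Namespace plus name is unique, so two distinct tasks never tie.
func less(a, b Task) bool {
	if a.Priority != b.Priority {
		return a.Priority < b.Priority
	}
	if !a.Created.Equal(b.Created) {
		return a.Created.Before(b.Created)
	}
	if a.Namespace != b.Namespace {
		return a.Namespace < b.Namespace
	}
	return a.Name < b.Name
}

// SortVictims returns a sorted copy and leaves the input untouched.
func SortVictims(in []Task) []Task {
	out := append([]Task(nil), in...)
	sort.Slice(out, func(i, j int) bool { return less(out[i], out[j]) })
	return out
}

// SampleVictims mirrors the victims list from the issue, with an assumed
// shared priority value and an assumed shared creation time.
func SampleVictims() []Task {
	t0 := time.Date(2025, 1, 9, 0, 0, 0, 0, time.UTC)
	return []Task{
		{"dev", "preempt-dev-job1-low-priority-nginx-0", 10, t0},
		{"dev", "preempt-dev-job3-low-nginx-1", 10, t0},
		{"dev", "preempt-dev-job1-low-priority-nginx-1", 10, t0},
		{"dev", "preempt-dev-job3-low-nginx-0", 10, t0},
	}
}

func main() {
	for _, t := range SortVictims(SampleVictims()) {
		fmt.Println(t.Namespace + "/" + t.Name)
	}
}
```

With a comparator like this, tasks of the same job also end up adjacent in the list, because their names share the job prefix.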

Steps to reproduce the issue

  1. See description

Describe the results you received and expected

We expect the ordering of preemption victims to always be consistent.

What version of Volcano are you using?

1.10

Any other relevant information

No response

raravena80 added the kind/bug label Jan 9, 2025
@hwdef
Member

hwdef commented Jan 13, 2025

I will check this later
/cc

@hwdef
Member

hwdef commented Jan 19, 2025

Thank you for describing the scenario in such detail.
I think the modification you made in PR #3960 can solve the problem, but we have to consider the namespace.

@hwdef
Member

hwdef commented Jan 19, 2025

@lowang-bh @Monokaix @JesseStutler
PTAL. Do you have any suggestions?

@JesseStutler
Member

"This leads 2 gang jobs being terminated and 4 GPU being freed instead of 2"

After checking your scenario: preempt-dev-job3-low-nginx-1 and preempt-dev-job1-low-priority-nginx-1 will be evicted to free 2 GPUs, so why would 4 GPUs be freed?

I don't understand why your PR fixes the problem you're describing. In your scenario it makes no difference whether the Job is sorted by UUID or by Name: if MinAvailable is 1 and the gang plugin is turned on, neither job-1 nor job-3 can be preempted below 1 replica.
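
The MinAvailable rule mentioned here can be sketched as a simple guard. This is only an illustration of the rule as stated in this comment, not the gang plugin's actual code: the `Job` type and its field names are hypothetical.

```go
package main

import "fmt"

// Job is a hypothetical, simplified view of a gang job. The field names
// are illustrative and are not Volcano's JobInfo fields.
type Job struct {
	Name         string
	Running      int // tasks currently holding resources
	MinAvailable int // minimum number of tasks the gang must keep
}

// CanEvictOne reports whether one more task can be evicted without
// dropping the job below its MinAvailable threshold.
func CanEvictOne(j Job) bool {
	return j.Running-1 >= j.MinAvailable
}

// EvictableCount returns how many tasks can be evicted in total before
// the job would fall below MinAvailable.
func EvictableCount(j Job) int {
	n := j.Running - j.MinAvailable
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	jobs := []Job{
		{"job-1", 2, 1},
		{"job-3", 2, 1},
		{"job-1 with MinAvailable 0", 2, 0},
	}
	for _, j := range jobs {
		fmt.Printf("%s: can evict one = %v, evictable = %d\n",
			j.Name, CanEvictOne(j), EvictableCount(j))
	}
}
```

Under this assumed rule, a 2-replica job with MinAvailable 1 can lose one task but not both, while the same job with MinAvailable 0 can lose both.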

@raravena80
Contributor Author

preempt-dev-job3-low-nginx-1 and preempt-dev-job1-low-priority-nginx-1 both have and need 4 GPUs, split across 2 separate nodes.

preempt-dev-job3-low-nginx-1 -> node 1 (2 GPUs), node 2 (2 GPUs)
preempt-dev-job1-low-priority-nginx-1 -> node 1 (2 GPUs), node 2 (2 GPUs)

"I didn't get why your PR fixes the problem you're talking about, in the problem you're describing it doesn't make a difference if the Job is sorted by UUID or by Name, if MinAvailable is 1 and the gang plugin is turned on, both job-1 and job-3 can't be preempted below 1 replica"

I believe we have MinAvailable 0 (or the default). What we found is that when the victim list (tasks) on a node was sorted by UUID rather than by name, the sort order wasn't consistent across all nodes. Does that make sense?
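
A small sketch of what "not consistent across all nodes" can look like, under stated assumptions: the `Task` type, the two node layouts, and the UID values below are all made up for illustration. Each task has its own UID, so the relative order of two jobs under a UID sort can differ from node to node, whereas names that share a job prefix keep the jobs in the same relative order everywhere.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Task is a hypothetical stand-in for a scheduled task. The UIDs used
// below are invented values, not taken from the issue.
type Task struct {
	Name string
	UID  string
	Job  string
}

// JobOrder sorts a copy of tasks by the given key and returns the
// resulting sequence of job names, joined with commas.
func JobOrder(tasks []Task, key func(Task) string) string {
	sorted := append([]Task(nil), tasks...)
	sort.Slice(sorted, func(i, j int) bool { return key(sorted[i]) < key(sorted[j]) })
	jobs := make([]string, len(sorted))
	for i, t := range sorted {
		jobs[i] = t.Job
	}
	return strings.Join(jobs, ",")
}

// ByUID and ByName are the two sort keys being compared.
func ByUID(t Task) string  { return t.UID }
func ByName(t Task) string { return t.Name }

// NodeA and NodeB each host one task from job1 and one from job3.
func NodeA() []Task {
	return []Task{
		{"preempt-dev-job1-low-priority-nginx-0", "1a6f", "job1"},
		{"preempt-dev-job3-low-nginx-0", "9c02", "job3"},
	}
}

func NodeB() []Task {
	return []Task{
		{"preempt-dev-job1-low-priority-nginx-1", "f3d1", "job1"},
		{"preempt-dev-job3-low-nginx-1", "2b7e", "job3"},
	}
}

func main() {
	fmt.Println("by UID,  node A:", JobOrder(NodeA(), ByUID))
	fmt.Println("by UID,  node B:", JobOrder(NodeB(), ByUID))
	fmt.Println("by name, node A:", JobOrder(NodeA(), ByName))
	fmt.Println("by name, node B:", JobOrder(NodeB(), ByName))
}
```

With these made-up UIDs, the UID sort ranks job1 first on one node and job3 first on the other, while the name sort ranks job1 first on both.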
