PyTorchJob worker replica set to 0 still creates the worker pod #256

Open
DanNiESh opened this issue Jan 22, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@DanNiESh
Contributor

DanNiESh commented Jan 22, 2025

When train_nnodes is set to 1 and train_nproc_per_node is set to 4, the PyTorchJob should not create a worker pod (replicas=0). In practice, it still tries to create one, and the worker pod gets stuck in the Pending and Terminating states. Specifically, when the phase1 PyTorchJob finishes, the phase1-worker pod transitions to Init:0/1 and unnecessarily occupies GPUs, which prevents phase2 from starting.

@tumido identified that this bug has already been reported upstream: kubeflow/training-operator#1709. We need to implement a workaround while waiting for the upstream fix.

Environment: RHOAI 2.16.0, OpenShift 4.16.19, NVIDIA CUDA driver 550.90.07, cluster: console-openshift-console.apps.barcelona.nerc.mghpcc.org/

@leseb
Collaborator

leseb commented Jan 22, 2025

@DanNiESh is the bug targeting standalone or KFP? Or both? In which environment did you experience this issue?

@tumido
Member

tumido commented Jan 22, 2025

It was discovered in the KFP version, but I suspect it would affect both.

It seems KFTO completely disregards Workers.replicas: 0. As a workaround, upstream suggests NOT passing the Workers spec at all.
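
A minimal sketch of that workaround in Python, assuming the job manifest is assembled as a plain dict before submission. The function name build_replica_specs, the image argument, and the restart policy are hypothetical and only for illustration; the Master/Worker keys under pytorchReplicaSpecs follow the PyTorchJob CRD layout. The point is that the Worker entry is left out entirely when train_nnodes is 1, rather than being passed with replicas: 0.

```python
from typing import Any, Dict


def build_replica_specs(nnodes: int, nproc_per_node: int, image: str) -> Dict[str, Any]:
    """Build a pytorchReplicaSpecs mapping, omitting Worker when none are needed.

    Hypothetical helper: nnodes and nproc_per_node mirror train_nnodes and
    train_nproc_per_node from the issue; everything else is an assumption.
    """
    if nnodes < 1:
        raise ValueError("nnodes must be at least 1")

    def replica(count: int) -> Dict[str, Any]:
        # One replica spec; the training container is named "pytorch".
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",  # assumed policy, adjust as needed
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "pytorch",
                            "image": image,
                            "resources": {
                                "limits": {"nvidia.com/gpu": nproc_per_node}
                            },
                        }
                    ]
                }
            },
        }

    specs: Dict[str, Any] = {"Master": replica(1)}

    # The master counts as one node, so the remaining nodes are workers.
    worker_count = nnodes - 1
    if worker_count > 0:
        # Only add the Worker key when at least one worker is required;
        # a Worker entry with replicas: 0 is what triggers the bug.
        specs["Worker"] = replica(worker_count)
    return specs
```

With nnodes=1 the returned mapping contains only the Master key, so the operator never sees a Worker spec; with nnodes=3 it contains a Worker entry with replicas set to 2.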

@tumido added the bug label Jan 22, 2025
@DanNiESh
Contributor Author

I updated the issue with the environment in which I experienced it.
