When setting `train_nnodes` to 1 and `train_nproc_per_node` to 4, the PyTorchJob should not create a worker pod (replicas=0). However, it still tries to create one, and the worker pod gets stuck in Pending and Terminating states. Specifically, when the phase1 PyTorchJob finishes, the phase1-worker pod transitions to Init:0/1, which unnecessarily occupies GPUs and prevents phase2 from starting.
@tumido identified that this bug has already been reported upstream: kubeflow/training-operator#1709. We need to implement a workaround while waiting for the upstream fix. One option is sketched below.
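One possible workaround, while the upstream fix is pending, is to omit the Worker replica spec from the PyTorchJob manifest entirely when only a single node is requested, rather than setting its replicas to 0. The sketch below illustrates the idea in Python; the function and parameter names (`build_pytorchjob`, `train_nnodes`, `train_nproc_per_node`, `image`) are illustrative assumptions and not the project's actual API.

```python
# Sketch of a possible workaround, assuming the PyTorchJob manifest is built
# as a Python dict before submission. All names here are hypothetical.

def build_pytorchjob(name: str, image: str, train_nnodes: int,
                     train_nproc_per_node: int) -> dict:
    """Build a PyTorchJob manifest, omitting the Worker replica spec entirely
    when only a single node is requested, instead of setting replicas: 0
    (which still produces a stray worker pod, see
    kubeflow/training-operator#1709)."""
    master_spec = {
        "replicas": 1,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [{
                    "name": "pytorch",
                    "image": image,
                    "resources": {
                        "limits": {"nvidia.com/gpu": train_nproc_per_node},
                    },
                }]
            }
        },
    }

    replica_specs = {"Master": master_spec}

    # Workaround: only add a Worker spec when more than one node is needed.
    if train_nnodes > 1:
        replica_specs["Worker"] = {
            "replicas": train_nnodes - 1,
            "restartPolicy": "OnFailure",
            "template": master_spec["template"],
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": replica_specs},
    }
```

With this approach, a single-node run (`train_nnodes=1`) produces a manifest that contains only the Master replica spec, so the training operator never schedules a worker pod that could hold GPUs and block phase2.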
Environment: RHOAI 2.16.0, OpenShift 4.16.19, NVIDIA CUDA driver 550.90.07, cluster: console-openshift-console.apps.barcelona.nerc.mghpcc.org/