When setting `train_nnodes` to 1 and `train_nproc_per_node` to 4, the PyTorchJob should not create a worker pod (replicas=0). However, it still tries to create one, and the worker pod gets stuck in Pending and Terminating states. Specifically, when the phase1 PyTorchJob finishes, the phase1-worker pod transitions to Init:0/1, which unnecessarily occupies GPUs and prevents phase2 from starting.
@tumido identified that this bug has already been reported upstream: kubeflow/training-operator#1709. We need to implement a workaround while waiting for the upstream fix. One option is sketched below.
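One possible workaround, while the upstream fix is pending, is to omit the Worker replica spec from the PyTorchJob manifest entirely when only a single node is requested, rather than setting its replicas to 0. The sketch below illustrates the idea in Python; the function and parameter names (`build_pytorchjob`, `train_nnodes`, `train_nproc_per_node`, `image`) are illustrative assumptions and not the project's actual API.

```python
# Sketch of a possible workaround, assuming the PyTorchJob manifest is built
# as a Python dict before submission. All names here are hypothetical.

def build_pytorchjob(name: str, image: str, train_nnodes: int,
                     train_nproc_per_node: int) -> dict:
    """Build a PyTorchJob manifest, omitting the Worker replica spec entirely
    when only a single node is requested, instead of setting replicas: 0
    (which still produces a stray worker pod, see
    kubeflow/training-operator#1709)."""
    master_spec = {
        "replicas": 1,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [{
                    "name": "pytorch",
                    "image": image,
                    "resources": {
                        "limits": {"nvidia.com/gpu": train_nproc_per_node},
                    },
                }]
            }
        },
    }

    replica_specs = {"Master": master_spec}

    # Workaround: only add a Worker spec when more than one node is needed.
    if train_nnodes > 1:
        replica_specs["Worker"] = {
            "replicas": train_nnodes - 1,
            "restartPolicy": "OnFailure",
            "template": master_spec["template"],
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": replica_specs},
    }
```

With this approach, a single-node run (`train_nnodes=1`) produces a manifest that contains only the Master replica spec, so the training operator never schedules a worker pod that could hold GPUs and block phase2.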
Environment: RHOAI 2.16.0, OpenShift 4.16.19, NVIDIA CUDA driver 550.90.07, cluster: console-openshift-console.apps.barcelona.nerc.mghpcc.org/