Update KFTO MNIST multi-node/multi-gpu test to utilise multiple GPUs c… #301

abhijeet-dhumal · 2025-01-10T13:12:45Z

Description

Updated the training script so that it properly utilises multiple GPUs in the multi-node/multi-GPU scenario.

How Has This Been Tested?

The following tests were executed on a cluster with NVIDIA GPUs:

TestPyTorchJobMnistMultiNodeSingleCpu - 3m 33s (time taken to execute)

TestPyTorchJobMnistMultiNodeMultiCpu - 2m 36s

TestPyTorchJobMnistMultiNodeSingleGpuWithCuda - 2m 35s

TestPyTorchJobMnistMultiNodeMultiGpuWithCuda - 2m 16s

The following tests were executed on a cluster with AMD GPUs:

TestPyTorchJobMnistMultiNodeSingleCpu - 3m 45s

TestPyTorchJobMnistMultiNodeMultiCpu - 2m 42s

TestPyTorchJobMnistMultiNodeSingleGpuWithROCm - 4m 18s

TestPyTorchJobMnistMultiNodeMultiGpuWithROCm - 3m 8s

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work

abhijeet-dhumal · 2025-01-10T13:53:47Z

Multi-Node / Multi-GPU scenario verified:

sutaakar · 2025-01-10T14:36:53Z

tests/kfto/kfto_mnist_training_test.go

 }

-func runKFTOPyTorchMnistJob(t *testing.T, numGpus int, workerReplicas int, gpuLabel string, image string, requirementsFile string) {
+func runKFTOPyTorchMnistJob(t *testing.T, totalNumGpus int, workerReplicas int, numCPUsOrGPUsCountPerNode int, gpuLabel string, image string, requirementsFile string) {


IMHO it would make more sense to rename numCPUsOrGPUsCountPerNode to numProcPerNode and keep the CPU count hardcoded.
numCPUsOrGPUsCountPerNode looks confusing to me, as it is not clear what it represents.

abhijeet-dhumal · 2025-01-13

Actually, by this variable I meant the number of devices (GPUs/CPUs) to be utilised per cluster node, but I agree that the naming was quite confusing 😅
With this approach I wanted to add test coverage for the multi-node use cases:

single CPU/GPU per node

multiple CPUs/GPUs per node

This is similar to the torchrun command's --nproc_per_node argument, which sets the number of processes to launch per node (typically one per device, whether CPU or GPU).
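To make the per-node process count concrete, here is a minimal Python sketch of how the number of nodes and the processes per node combine into a world size and per-process ranks. It assumes the common static, homogeneous case where every node runs the same number of processes and global ranks are assigned contiguously per node. The names ProcessSlot, layout and world_size are hypothetical and do not come from this PR or from PyTorch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProcessSlot:
    node_rank: int    # index of the node (worker replica)
    local_rank: int   # index of the process within its node
    global_rank: int  # index of the process across the whole job


def world_size(num_nodes: int, nproc_per_node: int) -> int:
    """Total number of processes in the job."""
    return num_nodes * nproc_per_node


def layout(num_nodes: int, nproc_per_node: int) -> List[ProcessSlot]:
    """Enumerate all processes, assuming contiguous ranks per node."""
    if num_nodes < 1 or nproc_per_node < 1:
        raise ValueError("num_nodes and nproc_per_node must be positive")
    return [
        ProcessSlot(node, local, node * nproc_per_node + local)
        for node in range(num_nodes)
        for local in range(nproc_per_node)
    ]


if __name__ == "__main__":
    # Two nodes with two processes each: four processes in total.
    for slot in layout(num_nodes=2, nproc_per_node=2):
        print(slot)
```

Under these assumptions, a "single device per node" test corresponds to nproc_per_node=1 and a "multiple devices per node" test to nproc_per_node greater than 1, with the node count unchanged.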

astefanutti · 2025-01-13T09:00:34Z

/lgtm

Great work!

astefanutti

Some leftover comments :)

tests/kfto/resources/mnist.py

sutaakar

/lgtm
good job

sutaakar · 2025-01-13T15:53:02Z

@abhijeet-dhumal I ran TestPyTorchJobMnistMultiNodeMultiGpuWithROCm and got this error:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/mnt/files/mnist.py", line 175, in <module>
[rank2]:     main(
[rank2]:   File "/mnt/files/mnist.py", line 157, in main
[rank2]:     dataset, model, optimizer = load_train_objs(lr)
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/mnt/files/mnist.py", line 135, in load_train_objs
[rank2]:     train_set = torchvision.datasets.MNIST("../data",
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/tmp/lib/torchvision/datasets/mnist.py", line 100, in __init__
[rank2]:     self.download()
[rank2]:   File "/tmp/lib/torchvision/datasets/mnist.py", line 188, in download
[rank2]:     download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
[rank2]:   File "/tmp/lib/torchvision/datasets/utils.py", line 395, in download_and_extract_archive
[rank2]:     download_url(url, download_root, filename, md5)
[rank2]:   File "/tmp/lib/torchvision/datasets/utils.py", line 143, in download_url
[rank2]:     raise RuntimeError("File not found or corrupted.")
[rank2]: RuntimeError: File not found or corrupted.

I guess the issue is caused by concurrent downloading of the dataset?

abhijeet-dhumal · 2025-01-15T06:11:47Z

I couldn't reliably reproduce this error, but I have seen it before. I think you're right: each process is trying to download the dataset concurrently. This can be avoided by ensuring that only one process (for example, the rank 0 process) downloads the dataset, while all other processes wait until the download is complete. 🤔
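The "one process downloads, the others wait" idea can be sketched as follows. This is only an illustration, not the change made in this PR, which (per the later commit message) switched to a pre-downloaded dataset. The helper load_with_single_download and its parameters are hypothetical. The barrier is passed in as a plain callable so the sketch runs without PyTorch. In a real training script it would presumably be torch.distributed.barrier, and local_rank would come from the LOCAL_RANK environment variable that torchrun sets. Gating on the local rank assumes each node has its own storage. With shared storage, gating on the global rank would be enough.

```python
import threading
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def load_with_single_download(
    local_rank: int,
    download: Callable[[], None],
    load: Callable[[], T],
    barrier: Callable[[], object],
) -> T:
    """Only local rank 0 downloads; every rank loads after the barrier."""
    if local_rank == 0:
        download()
    # All ranks block here until rank 0 has finished downloading.
    barrier()
    return load()


def simulate(num_ranks: int = 4) -> Tuple[int, List[Optional[str]]]:
    """Simulate the ranks of one node with threads and a thread barrier."""
    sync = threading.Barrier(num_ranks)
    store = {}
    downloads = []
    results: List[Optional[str]] = [None] * num_ranks

    def download() -> None:
        downloads.append(1)
        store["mnist"] = "dataset"

    def load() -> str:
        return store["mnist"]

    def worker(rank: int) -> None:
        results[rank] = load_with_single_download(rank, download, load, sync.wait)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(downloads), results


if __name__ == "__main__":
    print(simulate(4))
```

The barrier comes after the download and before the load, so no rank reads a partially written file and only one download request is made per node.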

sutaakar

/lgtm

astefanutti · 2025-01-20T10:11:52Z

/approve

Awesome work!

openshift-ci · 2025-01-20T10:11:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [astefanutti]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abhijeet-dhumal requested a review from sutaakar January 10, 2025 13:12

openshift-ci bot added the do-not-merge/work-in-progress label Jan 10, 2025

abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from bd6930b to 24405d8 Compare January 10, 2025 13:58

abhijeet-dhumal requested a review from astefanutti January 10, 2025 13:58

abhijeet-dhumal marked this pull request as ready for review January 10, 2025 13:59

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 10, 2025

openshift-ci bot requested review from KPostOffice and varshaprasad96 January 10, 2025 13:59

abhijeet-dhumal requested review from ChughShilpa and removed request for KPostOffice and varshaprasad96 January 10, 2025 13:59

astefanutti reviewed Jan 10, 2025

tests/kfto/kfto_mnist_training_test.go (outdated)

tests/kfto/kfto_mnist_training_test.go (outdated)

sutaakar reviewed Jan 10, 2025

abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from 24405d8 to b19447b Compare January 13, 2025 08:55

abhijeet-dhumal requested review from astefanutti and sutaakar January 13, 2025 08:55

astefanutti reviewed Jan 13, 2025

tests/kfto/resources/mnist.py (outdated)

tests/kfto/resources/mnist.py (outdated)

tests/kfto/resources/mnist.py (outdated)

astefanutti reviewed Jan 13, 2025

tests/kfto/resources/mnist.py (outdated)

abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from b19447b to e7edc7b Compare January 13, 2025 11:51

abhijeet-dhumal requested a review from astefanutti January 13, 2025 11:51

sutaakar reviewed Jan 13, 2025

openshift-ci bot assigned sutaakar Jan 13, 2025

openshift-ci bot added the lgtm label Jan 13, 2025

abhijeet-dhumal added 2 commits January 17, 2025 16:21

Update KFTO MNIST multi-node test script to add multi-gpu training scenario using DDP example

42e8f17

Update MNIST training script to avoid downloading datasets on each node concurrently by using pre-downloaded dataset

c3f99cf

abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from e7edc7b to c3f99cf Compare January 17, 2025 10:52

openshift-ci bot removed the lgtm label Jan 17, 2025

abhijeet-dhumal requested a review from sutaakar January 17, 2025 10:53

sutaakar reviewed Jan 17, 2025

tests/kfto/support.go (outdated)

sutaakar reviewed Jan 17, 2025

tests/kfto/kfto_mnist_training_test.go (outdated)

sutaakar reviewed Jan 17, 2025

tests/kfto/kfto_mnist_training_test.go

sutaakar reviewed Jan 20, 2025

tests/kfto/support.go (outdated)

sutaakar reviewed Jan 20, 2025

tests/kfto/kfto_mnist_training_test.go (outdated)

sutaakar reviewed Jan 20, 2025

tests/kfto/kfto_mnist_training_test.go (outdated)

Update Gpu struct to Accelerator and add isGpu method to be used for KFTO tests

95a6f5c

abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from 9d990ee to 95a6f5c Compare January 20, 2025 08:43

sutaakar reviewed Jan 20, 2025

openshift-ci bot added the lgtm label Jan 20, 2025

openshift-ci bot added the approved label Jan 20, 2025

openshift-merge-bot bot merged commit 92575c8 into opendatahub-io:main Jan 20, 2025
2 checks passed
