Update KFTO MNIST multi-node/multi-gpu test to utilise multiple GPUs c… #301

abhijeet-dhumal
Contributor

@abhijeet-dhumal abhijeet-dhumal commented Jan 10, 2025

Description

  • Updated the training script to properly utilise the multi-node/multi-GPU scenario

How Has This Been Tested?

image


The following tests were executed on a cluster with NVIDIA GPUs:

TestPyTorchJobMnistMultiNodeSingleCpu - 3m 33s (time taken to execute)

TestPyTorchJobMnistMultiNodeMultiCpu - 2m 36s

TestPyTorchJobMnistMultiNodeSingleGpuWithCuda - 2m 35s

TestPyTorchJobMnistMultiNodeMultiGpuWithCuda - 2m 16s


The following tests were executed on a cluster with AMD GPUs:

image

TestPyTorchJobMnistMultiNodeSingleCpu - 3m 45s

TestPyTorchJobMnistMultiNodeMultiCpu - 2m 42s

TestPyTorchJobMnistMultiNodeSingleGpuWithROCm - 4m 18s

TestPyTorchJobMnistMultiNodeMultiGpuWithROCm - 3m 8s


Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@abhijeet-dhumal
Contributor Author

Multi-Node / Multi-GPU scenario verified :

image-7
image-8
image-9

@abhijeet-dhumal abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from bd6930b to 24405d8 Compare January 10, 2025 13:58
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review January 10, 2025 13:59
@abhijeet-dhumal abhijeet-dhumal requested review from ChughShilpa and removed request for KPostOffice and varshaprasad96 January 10, 2025 13:59
tests/kfto/kfto_mnist_training_test.go
 }
 
-func runKFTOPyTorchMnistJob(t *testing.T, numGpus int, workerReplicas int, gpuLabel string, image string, requirementsFile string) {
+func runKFTOPyTorchMnistJob(t *testing.T, totalNumGpus int, workerReplicas int, numCPUsOrGPUsCountPerNode int, gpuLabel string, image string, requirementsFile string) {
Contributor

IMHO it would make more sense to rename numCPUsOrGPUsCountPerNode to numProcPerNode and keep the CPU count hardcoded.
numCPUsOrGPUsCountPerNode looks confusing to me, as it is not clear what it represents.

Contributor Author

@abhijeet-dhumal abhijeet-dhumal Jan 13, 2025

Actually, by this variable I meant the number of devices (GPUs/CPUs) to be utilised per cluster node, but I agree that the wording was quite confusing 😅
With this approach I wanted to add test coverage for the multi-node use cases:

  1. a single CPU/GPU per node
  2. multiple CPUs/GPUs per node

This is similar to the torchrun command's --nproc_per_node argument, which sets the number of worker processes per node (typically one per device to be utilised, whether CPU or GPU).
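
As a rough illustration of the torchrun analogy above, the following sketch shows how a per-node process count relates to the total number of processes and to each process's global rank. The helper names (`world_size`, `global_rank`, `rank_layout`) are hypothetical and are not taken from this PR or from torchrun itself; the arithmetic assumes every node runs the same number of processes.

```python
from typing import List, Tuple


def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of processes across all nodes (assumes homogeneous nodes)."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of the process with index `local_rank` on node `node_rank`."""
    return node_rank * nproc_per_node + local_rank


def rank_layout(nnodes: int, nproc_per_node: int) -> List[Tuple[int, int, int]]:
    """List of (node_rank, local_rank, global_rank) for every process."""
    return [
        (node, local, global_rank(node, local, nproc_per_node))
        for node in range(nnodes)
        for local in range(nproc_per_node)
    ]


if __name__ == "__main__":
    # Hypothetical example: 2 nodes with 2 processes (devices) each.
    for node, local, rank in rank_layout(nnodes=2, nproc_per_node=2):
        print(f"node={node} local_rank={local} global_rank={rank}")
```

With one process per node the global rank equals the node rank, which corresponds to the single-device case; raising the per-node count covers the multi-device case without changing the number of nodes.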

@abhijeet-dhumal abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from 24405d8 to b19447b Compare January 13, 2025 08:55
@astefanutti
Contributor

/lgtm

Great work!

Contributor

@astefanutti astefanutti left a comment

Some leftover comments :)

tests/kfto/resources/mnist.py (three outdated, resolved review threads)
@abhijeet-dhumal abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from b19447b to e7edc7b Compare January 13, 2025 11:51
Contributor

@sutaakar sutaakar left a comment

/lgtm
good job

@sutaakar
Contributor

@abhijeet-dhumal I ran TestPyTorchJobMnistMultiNodeMultiGpuWithROCm and got this error:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/mnt/files/mnist.py", line 175, in <module>
[rank2]:     main(
[rank2]:   File "/mnt/files/mnist.py", line 157, in main
[rank2]:     dataset, model, optimizer = load_train_objs(lr)
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/mnt/files/mnist.py", line 135, in load_train_objs
[rank2]:     train_set = torchvision.datasets.MNIST("../data",
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/tmp/lib/torchvision/datasets/mnist.py", line 100, in __init__
[rank2]:     self.download()
[rank2]:   File "/tmp/lib/torchvision/datasets/mnist.py", line 188, in download
[rank2]:     download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
[rank2]:   File "/tmp/lib/torchvision/datasets/utils.py", line 395, in download_and_extract_archive
[rank2]:     download_url(url, download_root, filename, md5)
[rank2]:   File "/tmp/lib/torchvision/datasets/utils.py", line 143, in download_url
[rank2]:     raise RuntimeError("File not found or corrupted.")
[rank2]: RuntimeError: File not found or corrupted.

I guess the issue is caused by concurrent downloading of the dataset?

@abhijeet-dhumal
Contributor Author

I couldn't reliably reproduce this error, but I have seen it before. I think you're right: each process is trying to download the dataset concurrently. This can be avoided by ensuring that only one process (for example, the rank 0 process) downloads the dataset, while all other processes wait until the download is complete. 🤔
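
The idea described above could be sketched roughly as follows. This is a minimal, framework-agnostic illustration, not the change that was actually made in this PR: `load_after_single_download` and `simulate` are hypothetical names, and the barrier is passed in as a callable so the pattern can be exercised with plain threads. In a script launched by torchrun, the barrier could be `torch.distributed.barrier` and the downloader could be chosen from the `RANK` or `LOCAL_RANK` environment variable; gating on `LOCAL_RANK` is the safer assumption when each node has its own filesystem.

```python
import threading
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def load_after_single_download(
    is_downloader: bool,
    download: Callable[[], None],
    load: Callable[[], T],
    barrier: Callable[[], object],
) -> T:
    """Let one process download, make everyone wait, then load the data."""
    if is_downloader:
        download()  # only the designated process touches the network
    barrier()  # all processes block here until the downloader arrives
    return load()  # the data is now fully on disk for every process


def simulate(world_size: int) -> Tuple[int, List[str]]:
    """Exercise the pattern with threads standing in for ranks."""
    barrier = threading.Barrier(world_size)
    store = {}
    download_calls = []
    results = [""] * world_size

    def download() -> None:
        download_calls.append(1)
        store["dataset"] = "mnist"

    def load() -> str:
        return store["dataset"]

    def worker(rank: int) -> None:
        results[rank] = load_after_single_download(
            is_downloader=(rank == 0),
            download=download,
            load=load,
            barrier=barrier.wait,
        )

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(download_calls), results


if __name__ == "__main__":
    calls, loaded = simulate(world_size=4)
    print(f"download calls: {calls}, loaded by {len(loaded)} ranks")
```

The barrier matters as much as the rank check: without it, the other processes could try to read the dataset while it is still partially written.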

@abhijeet-dhumal abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from e7edc7b to c3f99cf Compare January 17, 2025 10:52
@openshift-ci openshift-ci bot removed the lgtm label Jan 17, 2025
tests/kfto/support.go (two outdated, resolved review threads)
@abhijeet-dhumal abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from 9d990ee to 95a6f5c Compare January 20, 2025 08:43
Contributor

@sutaakar sutaakar left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 20, 2025
@astefanutti
Contributor

/approve

Awesome work!

openshift-ci bot commented Jan 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

@openshift-merge-bot openshift-merge-bot bot merged commit 92575c8 into opendatahub-io:main Jan 20, 2025
2 checks passed