Update KFTO MNIST multi-node/multi-gpu test to utilise multiple GPUs c… #301
Conversation
Force-pushed bd6930b to 24405d8 (Compare)
  }

- func runKFTOPyTorchMnistJob(t *testing.T, numGpus int, workerReplicas int, gpuLabel string, image string, requirementsFile string) {
+ func runKFTOPyTorchMnistJob(t *testing.T, totalNumGpus int, workerReplicas int, numCPUsOrGPUsCountPerNode int, gpuLabel string, image string, requirementsFile string) {
IMHO it would make more sense to rename numCPUsOrGPUsCountPerNode to numProcPerNode and keep the CPU number hardcoded. numCPUsOrGPUsCountPerNode looks confusing to me, as it is not clear what it represents.
Actually, by this variable I meant the number of devices (GPUs/CPUs) to be utilised per cluster node, but I agree the wording was quite confusing 😅
With this approach I wanted to add test coverage for these multi-node use cases:
- single CPU/GPU per node
- multiple CPUs/GPUs per node
This is similar to the torchrun command's --nproc_per_node argument, which specifies the number of devices, whether CPUs or GPUs, to be utilised per node.
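As an illustrative sketch of that semantics (the names and defaults below are assumptions for illustration, not code from this PR): a script launched with `torchrun --nproc_per_node=N` gets N worker processes on each node, and torchrun sets per-process environment variables that the training script uses to bind each process to one device:

```python
import os

# torchrun sets these for every worker it spawns; the string defaults let
# the sketch also run standalone as a single process.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # worker index on this node
rank = int(os.environ.get("RANK", "0"))              # global worker index
world_size = int(os.environ.get("WORLD_SIZE", "1"))  # total workers across all nodes


def pick_device(local_rank: int, use_gpu: bool) -> str:
    # One device per process: GPU number `local_rank` when GPUs are
    # requested, otherwise the CPU.
    return f"cuda:{local_rank}" if use_gpu else "cpu"


print(pick_device(local_rank, use_gpu=False))
```

So `--nproc_per_node` covers both test dimensions above: with GPUs it is the GPU count per node, and on CPU-only nodes it is simply the number of worker processes per node.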
Force-pushed 24405d8 to b19447b (Compare)
/lgtm Great work!
Some leftover comments :)
Force-pushed b19447b to e7edc7b (Compare)
/lgtm
good job
@abhijeet-dhumal I ran
I guess the issue is caused by concurrent downloading of the dataset?
I couldn't reliably reproduce this error but have seen it before. I think you're right: each process is trying to download the dataset concurrently. This can be avoided by ensuring that only one process downloads the dataset, for example the rank 0 process, while all other processes wait until the download is complete. 🤔
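One way to structure that is the usual rank-0-downloads pattern. The sketch below is framework-agnostic so it stays self-contained: `download_dataset_once`, `download_fn`, and `barrier_fn` are hypothetical names; in a real PyTorchJob the barrier would be `torch.distributed.barrier()` and the download would be the torchvision MNIST fetch.

```python
import os


def download_dataset_once(rank: int, data_dir: str, download_fn, barrier_fn):
    """Hypothetical sketch: only rank 0 downloads; every other rank waits.

    In a real job, `barrier_fn` would be `torch.distributed.barrier()` and
    `download_fn` something like a torchvision MNIST download.
    """
    if rank == 0:
        os.makedirs(data_dir, exist_ok=True)
        download_fn(data_dir)      # only one process touches the network
    barrier_fn()                   # all ranks block until rank 0 finishes
    # Every rank can now safely read the locally available files.
    return sorted(os.listdir(data_dir))
```

With a pre-downloaded dataset baked into the image, as the later commit does, `download_fn` effectively becomes a no-op and the race disappears entirely.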
…enario using DDP example
…de concurrently by using pre-downloaded dataset
Force-pushed e7edc7b to c3f99cf (Compare)
Force-pushed 9d990ee to 95a6f5c (Compare)
/lgtm
/approve Awesome work!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: astefanutti. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Description
How Has This Been Tested?
The following tests were executed on a cluster with NVIDIA GPUs:
TestPyTorchJobMnistMultiNodeSingleCpu - 3m 33s (time taken to execute)
TestPyTorchJobMnistMultiNodeMultiCpu - 2m 36s
TestPyTorchJobMnistMultiNodeSingleGpuWithCuda - 2m 35s
TestPyTorchJobMnistMultiNodeMultiGpuWithCuda - 2m 16s
The following tests were executed on a cluster with AMD GPUs:
TestPyTorchJobMnistMultiNodeSingleCpu - 3m 45s
TestPyTorchJobMnistMultiNodeMultiCpu - 2m 42s
TestPyTorchJobMnistMultiNodeSingleGpuWithROCm - 4m 18s
TestPyTorchJobMnistMultiNodeMultiGpuWithROCm - 3m 8s
Merge criteria: