[Patch Fix] Support arbitrary symmetric allocations and fix MFC time log in workers #60

Merged
merged 13 commits into main from patch20240902
Sep 3, 2024

garrett4wade
Contributor

@garrett4wade garrett4wade commented Sep 2, 2024

New Features

  • Support arbitrary symmetric parallelization via regex. Now the system can parse strings like "d4m1p2" or "m1p2d5" and apply this parallel strategy to all MFCs.
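
A minimal sketch of how such a string could be parsed with a regex, for illustration only. The function name `parse_symmetric_allocation` is hypothetical and is not taken from this PR, and the sketch assumes that `d`, `m`, and `p` stand for the data, model (tensor), and pipeline parallel degrees and that each must appear exactly once, in any order.

```python
import re
from typing import Dict


def parse_symmetric_allocation(spec: str) -> Dict[str, int]:
    """Parse a string such as "d4m1p2" into {"d": 4, "m": 1, "p": 2}.

    Hypothetical helper: the letters may appear in any order, but each of
    d, m, and p is assumed to be required exactly once.
    """
    # Each token is one of the letters d/m/p followed by a positive integer.
    tokens = re.findall(r"([dmp])(\d+)", spec)

    # Reject strings with stray characters between or around the tokens.
    if "".join(key + value for key, value in tokens) != spec:
        raise ValueError(f"Malformed allocation string: {spec!r}")

    degrees: Dict[str, int] = {}
    for key, value in tokens:
        if key in degrees:
            raise ValueError(f"Duplicate dimension {key!r} in {spec!r}")
        if int(value) < 1:
            raise ValueError(f"Degree for {key!r} must be positive in {spec!r}")
        degrees[key] = int(value)

    if set(degrees) != {"d", "m", "p"}:
        raise ValueError(f"Expected exactly d, m, and p in {spec!r}")
    return degrees


print(parse_symmetric_allocation("d4m1p2"))  # {'d': 4, 'm': 1, 'p': 2}
print(parse_symmetric_allocation("m1p2d5"))  # {'m': 1, 'p': 2, 'd': 5}
```

Under these assumptions, the product of the three degrees (8 for "d4m1p2", 10 for "m1p2d5") would be the number of GPUs the strategy occupies.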

Patch Fixes

  • Increase the default timeout of NCCL process groups to 1800 secs. Transferring extremely large batches for the first time may consume over 600 secs, which is the default timeout of PyTorch.

  • Offload models upon initialization to save the memory for the first MFC.

  • A more accurate record of the MFC completion time. The model worker will return responses immediately to the master worker upon an MFC's completion.

  • Modify the naming of helper functions in the master worker to avoid confusion.
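
For the NCCL timeout fix above, a longer timeout can be passed when a process group is created. The sketch below is an illustration and not the code from this PR: the helper names `process_group_kwargs` and `init_nccl_process_group` are hypothetical, and it assumes the usual `torch.distributed` environment variables (rank, world size, master address) are already set by the launcher.

```python
import datetime

# Timeout value taken from the PR description above.
NCCL_TIMEOUT_SECS = 1800


def process_group_kwargs(timeout_secs: int = NCCL_TIMEOUT_SECS) -> dict:
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: it only packages the backend name and the timeout.
    """
    return {
        "backend": "nccl",
        "timeout": datetime.timedelta(seconds=timeout_secs),
    }


def init_nccl_process_group(timeout_secs: int = NCCL_TIMEOUT_SECS) -> None:
    """Initialize the default process group with the extended timeout.

    Assumes a GPU node where RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
    are set; torch is imported lazily so the helper above works without it.
    """
    import torch.distributed as dist

    dist.init_process_group(**process_group_kwargs(timeout_secs))
```

`torch.distributed.new_group` accepts the same `timeout` argument, so subgroups created for individual MFCs could be given the same value.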

@garrett4wade garrett4wade changed the title from "[WIP] [Patch Fix] Support arbitrary symmetric allocations and fix MFC time log in workers" to "[Patch Fix] Support arbitrary symmetric allocations and fix MFC time log in workers" Sep 2, 2024
@garrett4wade garrett4wade requested a review from nuzant September 2, 2024 09:48
@garrett4wade garrett4wade merged commit cec417b into main Sep 3, 2024
3 checks passed
@garrett4wade garrett4wade deleted the patch20240902 branch September 3, 2024 02:51