Sudden data error during training #766
🐛 Describe the bug
I'm trying to run a tiny OLMo 2 training run, and it worked for some steps of training, but then I suddenly get errors like this:
Thanks!
Versions
Python 3.10.9
Comments
Hey @faresobeid, I tried to recreate the issue, but it works fine for me.
Oh, thanks for replying quickly. I was wondering what's the easiest way to solve this issue.
Are you able to save global_indices.npy? When you run …
Oh interesting, I didn't realize that. But what would I do with it if I could save it (and how large is it)? To clarify, I'm trying to run the OLMo 2 7B config with a smaller model and fewer steps, so I was also wondering if I should edit the data in the config to support this.
global_indices.npy is the train data. I am not sure at this point why it is throwing the error. Can you provide more details on what exactly you are implementing, so that I can help you in the best possible way?
Right now I'm not at the machine, but I just ran the OLMo 2 7B stage 1 config, modifying only d_model, n_layers, n_heads, and mlp_hidden_size, as well as switching from FSDP to DDP. If more details are needed I will send them over tomorrow. Thanks a lot!
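As an illustration of that kind of edit, here is a minimal PyYAML sketch. The four model fields are the ones named above; the distributed_strategy key and the concrete values are assumptions for illustration, not the poster's actual settings:

```python
# Hypothetical sketch: shrink the official stage-1 config into a tiny model.
# The four model fields are the ones mentioned in this thread; treat
# "distributed_strategy" as an assumption about where the FSDP/DDP switch lives.
import yaml

with open("configs/official-1124/OLMo2-7B-stage1.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["model"]["d_model"] = 512            # tiny width (illustrative value)
cfg["model"]["n_layers"] = 8             # tiny depth
cfg["model"]["n_heads"] = 8              # must divide d_model evenly
cfg["model"]["mlp_hidden_size"] = 2048   # keep roughly 4x d_model
cfg["distributed_strategy"] = "ddp"      # assumed key for the FSDP -> DDP switch

with open("configs/tiny-olmo2-stage1.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```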
OK, an update: it seems to work now, but it throws a new error after around 80 steps of training; you can see my config here: https://github.com/faresobeid/OLMo/blob/main/configs/official-1124/OLMo2-7B-stage1.yaml. Also, an unrelated question: if I only want to train on a subset of the data, is the easiest way just to change the max duration rather than touch any of the data sources? Thanks once again!
Hey @faresobeid, it seems like you have an unstable network. If you want to train on a subset of the data, the easiest way is to edit the list of training data paths in your config.
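For illustration, a minimal sketch of that edit, assuming the training files are listed under data.paths in the YAML (check your config for the exact key); this keeps roughly 1% of them:

```python
# Minimal sketch: keep only a small subset of the pre-tokenized data files.
# Assumes the config lists them under data.paths; adjust the key if yours differs.
import yaml

with open("configs/official-1124/OLMo2-7B-stage1.yaml") as f:
    cfg = yaml.safe_load(f)

paths = cfg["data"]["paths"]
cfg["data"]["paths"] = paths[: max(1, len(paths) // 100)]  # keep ~1% of the files

with open("configs/tiny-olmo2-subset.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```

Slicing the front of the list keeps whole files from whichever sources happen to come first; taking every 100th path instead (paths[::100]) would sample more evenly across sources.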
Oh OK. For editing the training data, is there a recommended way, i.e. how many links from each source to get rid of? I was also wondering whether, to get past the unstable network issues, I could pre-download and tokenize the dataset, for example; what would be the easiest way to do so?
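On pre-downloading: the official configs point at already-tokenized .npy shards via plain HTTP(S) URLs, so pre-downloading reduces to fetching those files once and pointing the config at local copies. A rough sketch, assuming the same data.paths key as above and an illustrative local directory layout:

```python
# Sketch: mirror the remote .npy shards locally, then rewrite the config to use them.
import os
import urllib.request

import yaml

with open("configs/official-1124/OLMo2-7B-stage1.yaml") as f:
    cfg = yaml.safe_load(f)

local_dir = "local_data"
local_paths = []
for url in cfg["data"]["paths"]:
    rel = url.split("://", 1)[-1]            # keep URL path to avoid name collisions
    dest = os.path.join(local_dir, rel)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):             # re-runs skip files already fetched
        urllib.request.urlretrieve(url, dest)
    local_paths.append(dest)

cfg["data"]["paths"] = local_paths
with open("configs/tiny-olmo2-local.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```

Note that the full dataset is terabytes of traffic, so a downloader with retries and checksum verification would be more robust than this bare urlretrieve loop.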
Hi! I am having the same problem. However, my global_indices.npy is 3.48 GB, while my entire train dataset (Dolma v1.5) should be > 2 TB. I'm guessing this is because the iterable dataset streams the download links. Is there a way to get one global_indices.npy for the entire dataset?
global_indices.npy does not contain the whole dataset. It just contains offsets into the whole dataset. That size for global_indices seems OK. |
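That is easy to sanity-check: each entry is a small integer index (4 or 8 bytes), so a 3.48 GB file corresponds to hundreds of millions of training instances rather than the tokens themselves. A small inspection sketch; the uint32 fallback dtype is a guess for the case where the file was written as a raw memmap without a .npy header:

```python
# Quick check that global_indices.npy holds per-instance indices, not data.
import numpy as np

path = "global_indices.npy"
try:
    indices = np.load(path, mmap_mode="r")   # standard .npy with a header
except ValueError:
    # Assumption: some versions write this file as a raw memmap with no header,
    # in which case the dtype must be supplied by hand (uint32 is a guess).
    indices = np.memmap(path, dtype=np.uint32, mode="r")

print(indices.shape, indices.dtype)          # ~one entry per training instance
print(indices[:10])                          # the first few shuffled indices
```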