
Sudden data error during training #766

Open
faresobeid opened this issue Dec 16, 2024 · 11 comments
Labels: type/bug (An issue about a bug)

Comments

@faresobeid

🐛 Describe the bug

I'm trying to run a tiny OLMo 2 training run. It has worked successfully for some number of steps, but then I suddenly get errors like this:

AssertionError: Caught AssertionError in DataLoader worker process 23.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
    data.append(next(self.dataset_iter))
  File "/root/OLMo/olmo/data/iterable_dataset.py", line 181, in <genexpr>
    return (self._get_dataset_item(int(idx)) for idx in indices)
  File "/root/OLMo/olmo/data/iterable_dataset.py", line 184, in _get_dataset_item
    item = self.dataset[idx]
  File "/root/OLMo/olmo/data/memmap_dataset.py", line 196, in __getitem__
    input_ids = self._read_chunk_from_memmap(self._memmap_paths[memmap_index], memmap_local_index)
  File "/root/OLMo/olmo/data/memmap_dataset.py", line 162, in _read_chunk_from_memmap
    buffer = get_bytes_range(path, bytes_start, num_bytes)
  File "/root/OLMo/olmo/util.py", line 380, in get_bytes_range
    return _http_get_bytes_range(
  File "/root/OLMo/olmo/util.py", line 712, in _http_get_bytes_range
    len(result) == num_bytes
AssertionError: expected 16384 bytes, got 7170

Thanks!

Versions

Python 3.10.9

faresobeid added the type/bug label on Dec 16, 2024
@aman-17
Member

aman-17 commented Dec 16, 2024

Hey @faresobeid, I tried to recreate the issue, but it works fine for me.
I suspect the issue is caused by accessing the data over the network, i.e. fetching from an HTTP-based source during training. A network interruption during data retrieval, or a truncated download, would mean that instead of receiving the expected data tokens (16384 bytes), you receive only 7170 bytes (for example an error response), which causes training to fail after some time.
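For context, the read in your traceback boils down to an HTTP Range request whose body length is asserted against the expected chunk size. Here is a rough sketch of that pattern with a simple retry loop added; the function name and backoff are illustrative, not the actual code in olmo/util.py:

```python
import time
import requests

def http_get_bytes_range(url: str, start: int, num_bytes: int, max_attempts: int = 3) -> bytes:
    """Fetch `num_bytes` starting at `start` via an HTTP Range request,
    retrying on short reads or connection errors (illustrative sketch)."""
    headers = {"Range": f"bytes={start}-{start + num_bytes - 1}"}
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            result = response.content
            # A truncated response (dropped connection, HTML error page, etc.)
            # shows up here as a length mismatch -- the AssertionError above.
            if len(result) == num_bytes:
                return result
        except requests.exceptions.RequestException:
            pass
        time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    raise RuntimeError(f"failed to fetch {num_bytes} bytes from {url} after {max_attempts} attempts")
```

Adding retries like this, or pre-downloading the data (discussed further below), is the usual way around transient network failures.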

@faresobeid
Author

Oh, thanks for replying so quickly. I was wondering what the easiest way to solve this issue is.

@aman-17
Member

aman-17 commented Dec 16, 2024

Are you able to save global_indices.npy? When you run torchrun --nproc_per_node=2 scripts/train.py configs/tiny/OLMo-20M.yaml --save_overwrite, iterable_dataset.py will save global_indices.npy to your workspace.

@faresobeid
Author

Oh interesting, I didn't realize that. But what would I do with that file if I could save it (and how large is it)? To clarify, I'm trying to run the OLMo 2 7B config with a smaller model and fewer steps, so I was also wondering whether I should edit the data section of the config to support this.

@aman-17
Member

aman-17 commented Dec 16, 2024

global_indices.npy is the train data. I am not sure at this point why it is throwing the error. Can you provide more details on exactly what you are implementing, so that I can help you in the best possible way?

@faresobeid
Author

I'm not at the machine right now, but I just ran the OLMo 2 7B stage 1 config with modified d_model, n_layers, n_heads, and mlp_hidden_size, and switched from FSDP to DDP. If more details are needed I will send them over tomorrow. Thanks a lot!

@faresobeid
Author

Update: it seems to work now, but it produces a new error after around 80 steps of training. You can see my config here: https://github.com/faresobeid/OLMo/blob/main/configs/official-1124/OLMo2-7B-stage1.yaml. Also, an unrelated question: if I only want to train on a subset of the data, is the easiest way just to change the max duration rather than touching any of the data sources? Thanks once again!

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
    data.append(next(self.dataset_iter))
  File "/root/OLMo/olmo/data/iterable_dataset.py", line 181, in <genexpr>
    return (self._get_dataset_item(int(idx)) for idx in indices)
  File "/root/OLMo/olmo/data/iterable_dataset.py", line 184, in _get_dataset_item
    item = self.dataset[idx]
  File "/root/OLMo/olmo/data/memmap_dataset.py", line 196, in __getitem__
    input_ids = self._read_chunk_from_memmap(self._memmap_paths[memmap_index], memmap_local_index)
  File "/root/OLMo/olmo/data/memmap_dataset.py", line 162, in _read_chunk_from_memmap
    buffer = get_bytes_range(path, bytes_start, num_bytes)
  File "/root/OLMo/olmo/util.py", line 380, in get_bytes_range
    return _http_get_bytes_range(
  File "/root/OLMo/olmo/util.py", line 707, in _http_get_bytes_range
    response = requests.get(
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='olmo-data.org', port=80): Max retries exceeded with url: /preprocessed/dclm/text_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/allenai/dolma2-tokenizer/part-015-00001.npy (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffa3242d240>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

@aman-17
Member

aman-17 commented Dec 17, 2024

Hey @faresobeid, it looks like you have an unstable network. If you want to train on a subset of the data, the easiest way is to edit the .yaml file. Don't forget to inspect the train data using inspect_train_data.py after editing your .yaml; a sketch of one way to trim the config follows.
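For example, you could trim the list of data paths in the config programmatically before training. This is only a rough sketch: it assumes the shard URLs live under a data.paths list in the .yaml (double-check the key names against your config) and uses PyYAML for convenience:

```python
import yaml  # pip install pyyaml

CONFIG = "configs/official-1124/OLMo2-7B-stage1.yaml"
SUBSET = "configs/official-1124/OLMo2-7B-stage1-subset.yaml"

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

# Keep only the first few shards; adjust the slice (or filter by source) to taste.
cfg["data"]["paths"] = cfg["data"]["paths"][:8]

with open(SUBSET, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```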

@faresobeid
Author

Ok, and for editing the training data, is there a recommended way to do it, i.e. how many links from each source should I remove? To get past the unstable network issues, I was also wondering whether I could pre-download and tokenize the dataset, for example; what would be the easiest way to do so? To make the question concrete, I was imagining something like the sketch below.
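This is just a rough sketch of what I had in mind: download each .npy shard listed in the config to local disk, then point the config at the local copies. The local directory here is a placeholder, and I've only listed the one shard URL from the traceback above:

```python
import os
import requests

# Shard URLs copied from the config's data paths; in practice this would be the full list.
urls = [
    "http://olmo-data.org/preprocessed/dclm/text_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/allenai/dolma2-tokenizer/part-015-00001.npy",
]
local_dir = "/data/olmo-shards"  # placeholder local directory
os.makedirs(local_dir, exist_ok=True)

for url in urls:
    dest = os.path.join(local_dir, os.path.basename(url))
    if os.path.exists(dest):
        continue  # already downloaded
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

# Then point the paths in the .yaml at the local copies instead of the HTTP URLs.
```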

@ethanlshen

> global_indices.npy is the train data. I am not sure at this point why it is throwing the error. Can you provide more details on exactly what you are implementing, so that I can help you in the best possible way?

Hi! I am having the same problem. However, my global_indices.npy is 3.48 GB, while my entire train dataset (Dolma v1.5) should be > 2 TB. I'm guessing this is because the iterable dataset streams from the download links. Is there a way to get one global_indices.npy for the entire dataset?

@dirkgr
Member

dirkgr commented Jan 7, 2025

global_indices.npy does not contain the whole dataset. It just contains offsets into the whole dataset. That size for global_indices seems OK.
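As a rough sanity check: if each offset is a 32- or 64-bit integer, a ~3.48 GB file corresponds to on the order of 400-900 million instance indices, which is consistent with a multi-TB token dataset. Something like this will tell you what is in the file (the uint32 dtype and the raw-memmap layout are assumptions; check iterable_dataset.py for the exact format):

```python
import os
import numpy as np

path = "global_indices.npy"  # wherever your run saved it

# The file holds one integer offset per training instance, not the tokens
# themselves, so its size is (number of instances) * (bytes per index).
size = os.path.getsize(path)
for dtype in (np.uint32, np.uint64):
    itemsize = np.dtype(dtype).itemsize
    if size % itemsize == 0:
        print(f"{np.dtype(dtype).name}: {size // itemsize:,} instance indices")

# If the file was written as a raw memmap (no .npy header), it can be opened
# like this; again, the dtype is an assumption.
indices = np.memmap(path, dtype=np.uint32, mode="r")
print(indices[:10])
```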
