
Sudden data error during training #766

Open
faresobeid opened this issue Dec 16, 2024 · 11 comments
Labels: type/bug (An issue about a bug)

Comments

@faresobeid

🐛 Describe the bug

I'm trying to run a tiny OLMo 2 training run. It has worked successfully for some number of steps, but then I suddenly get errors like this:

AssertionError: Caught AssertionError in DataLoader worker process 23.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
    data.append(next(self.dataset_iter))
  File "/root/OLMo/olmo/data/iterable_dataset.py", line 181, in <genexpr>
    return (self._get_dataset_item(int(idx)) for idx in indices)
  File "/root/OLMo/olmo/data/iterable_dataset.py", line 184, in _get_dataset_item
    item = self.dataset[idx]
  File "/root/OLMo/olmo/data/memmap_dataset.py", line 196, in __getitem__
    input_ids = self._read_chunk_from_memmap(self._memmap_paths[memmap_index], memmap_local_index)
  File "/root/OLMo/olmo/data/memmap_dataset.py", line 162, in _read_chunk_from_memmap
    buffer = get_bytes_range(path, bytes_start, num_bytes)
  File "/root/OLMo/olmo/util.py", line 380, in get_bytes_range
    return _http_get_bytes_range(
  File "/root/OLMo/olmo/util.py", line 712, in _http_get_bytes_range
    len(result) == num_bytes
AssertionError: expected 16384 bytes, got 7170

Thanks!

Versions

Python 3.10.9

faresobeid added the type/bug label on Dec 16, 2024
@aman-17
Member

aman-17 commented Dec 16, 2024

Hey @faresobeid, I tried to recreate the issue, but it works fine for me.
I suspect the issue is caused by accessing the data over the network, i.e. fetching from an HTTP-based source during training. A network interruption during data retrieval, or a truncated download, would mean that instead of receiving the expected data tokens (16384 bytes), you receive only 7170 bytes (for example an error response), which causes training to fail after some time.
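For context, the read in your traceback boils down to an HTTP Range request whose body length is asserted against the expected chunk size. Here is a rough sketch of that pattern with a simple retry loop added; the function name and backoff are illustrative, not the actual code in olmo/util.py:

```python
import time
import requests

def http_get_bytes_range(url: str, start: int, num_bytes: int, max_attempts: int = 3) -> bytes:
    """Fetch `num_bytes` starting at `start` via an HTTP Range request,
    retrying on short reads or connection errors (illustrative sketch)."""
    headers = {"Range": f"bytes={start}-{start + num_bytes - 1}"}
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            result = response.content
            # A truncated response (dropped connection, HTML error page, etc.)
            # shows up here as a length mismatch -- the AssertionError above.
            if len(result) == num_bytes:
                return result
        except requests.exceptions.RequestException:
            pass
        time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    raise RuntimeError(f"failed to fetch {num_bytes} bytes from {url} after {max_attempts} attempts")
```

Adding retries like this, or pre-downloading the data (discussed further below), is the usual way around transient network failures.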

@faresobeid
Author

Oh, thanks for replying so quickly. I was wondering what the easiest way to solve this issue is.

@aman-17
Member

aman-17 commented Dec 16, 2024

Are you able to save global_indices.npy? When you run torchrun --nproc_per_node=2 scripts/train.py configs/tiny/OLMo-20M.yaml --save_overwrite, iterable_dataset.py will save global_indices.npy to your workspace.

@faresobeid
Author

Oh interesting, I didn't realize that. But what would I do with that file if I could save it (and how large is it)? To clarify, I'm trying to run the OLMo 2 7B config with a smaller model and fewer steps, so I was also wondering whether I should edit the data section of the config to support this.

@aman-17
Member

aman-17 commented Dec 16, 2024

global_indices.npy is the train data. I am not sure at this point why it is throwing the error. Can you provide more details on exactly what you are implementing, so that I can help you in the best possible way?

@faresobeid
Author

I'm not at the machine right now, but I just ran the OLMo 2 7B stage 1 config with modified d_model, n_layers, n_heads, and mlp_hidden_size, and switched from FSDP to DDP. If more details are needed I will send them over tomorrow. Thanks a lot!

@faresobeid
Author

Update: it seems to work now, but it produces a new error after around 80 steps of training. You can see my config here: https://github.com/faresobeid/OLMo/blob/main/configs/official-1124/OLMo2-7B-stage1.yaml. Also, an unrelated question: if I only want to train on a subset of the data, is the easiest way just to change the max duration rather than touching any of the data sources? Thanks once again!

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
    data.append(next(self.dataset_iter))
  File "/root/OLMo/olmo/data/iterable_dataset.py", line 181, in <genexpr>
    return (self._get_dataset_item(int(idx)) for idx in indices)
  File "/root/OLMo/olmo/data/iterable_dataset.py", line 184, in _get_dataset_item
    item = self.dataset[idx]
  File "/root/OLMo/olmo/data/memmap_dataset.py", line 196, in __getitem__
    input_ids = self._read_chunk_from_memmap(self._memmap_paths[memmap_index], memmap_local_index)
  File "/root/OLMo/olmo/data/memmap_dataset.py", line 162, in _read_chunk_from_memmap
    buffer = get_bytes_range(path, bytes_start, num_bytes)
  File "/root/OLMo/olmo/util.py", line 380, in get_bytes_range
    return _http_get_bytes_range(
  File "/root/OLMo/olmo/util.py", line 707, in _http_get_bytes_range
    response = requests.get(
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='olmo-data.org', port=80): Max retries exceeded with url: /preprocessed/dclm/text_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/allenai/dolma2-tokenizer/part-015-00001.npy (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffa3242d240>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

@aman-17
Member

aman-17 commented Dec 17, 2024

Hey @faresobeid, it looks like you have an unstable network. If you want to train on a subset of the data, the easiest way is to edit the .yaml file. Don't forget to inspect the train data using inspect_train_data.py after editing your .yaml; a sketch of one way to trim the config follows.
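For example, you could trim the list of data paths in the config programmatically before training. This is only a rough sketch: it assumes the shard URLs live under a data.paths list in the .yaml (double-check the key names against your config) and uses PyYAML for convenience:

```python
import yaml  # pip install pyyaml

CONFIG = "configs/official-1124/OLMo2-7B-stage1.yaml"
SUBSET = "configs/official-1124/OLMo2-7B-stage1-subset.yaml"

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

# Keep only the first few shards; adjust the slice (or filter by source) to taste.
cfg["data"]["paths"] = cfg["data"]["paths"][:8]

with open(SUBSET, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```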

@faresobeid
Author

Ok, and for editing the training data, is there a recommended way to do it, i.e. how many links from each source should I remove? To get past the unstable network issues, I was also wondering whether I could pre-download and tokenize the dataset, for example; what would be the easiest way to do so? To make the question concrete, I was imagining something like the sketch below.
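This is just a rough sketch of what I had in mind: download each .npy shard listed in the config to local disk, then point the config at the local copies. The local directory here is a placeholder, and I've only listed the one shard URL from the traceback above:

```python
import os
import requests

# Shard URLs copied from the config's data paths; in practice this would be the full list.
urls = [
    "http://olmo-data.org/preprocessed/dclm/text_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/allenai/dolma2-tokenizer/part-015-00001.npy",
]
local_dir = "/data/olmo-shards"  # placeholder local directory
os.makedirs(local_dir, exist_ok=True)

for url in urls:
    dest = os.path.join(local_dir, os.path.basename(url))
    if os.path.exists(dest):
        continue  # already downloaded
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

# Then point the paths in the .yaml at the local copies instead of the HTTP URLs.
```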

@ethanlshen

> global_indices.npy is the train data. I am not sure at this point why it is throwing the error. Can you provide more details on exactly what you are implementing, so that I can help you in the best possible way?

Hi! I am having the same problem. However, my global_indices.npy is 3.48 GB, while my entire train dataset (Dolma v1.5) should be > 2 TB. I'm guessing this is because the iterable dataset streams from the download links. Is there a way to get one global_indices.npy for the entire dataset?

@dirkgr
Member

dirkgr commented Jan 7, 2025

global_indices.npy does not contain the whole dataset. It just contains offsets into the whole dataset. That size for global_indices seems OK.
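As a rough sanity check: if each offset is a 32- or 64-bit integer, a ~3.48 GB file corresponds to on the order of 400-900 million instance indices, which is consistent with a multi-TB token dataset. Something like this will tell you what is in the file (the uint32 dtype and the raw-memmap layout are assumptions; check iterable_dataset.py for the exact format):

```python
import os
import numpy as np

path = "global_indices.npy"  # wherever your run saved it

# The file holds one integer offset per training instance, not the tokens
# themselves, so its size is (number of instances) * (bytes per index).
size = os.path.getsize(path)
for dtype in (np.uint32, np.uint64):
    itemsize = np.dtype(dtype).itemsize
    if size % itemsize == 0:
        print(f"{np.dtype(dtype).name}: {size // itemsize:,} instance indices")

# If the file was written as a raw memmap (no .npy header), it can be opened
# like this; again, the dtype is an assumption.
indices = np.memmap(path, dtype=np.uint32, mode="r")
print(indices[:10])
```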
