
Generating training mix of OLMo2 from dolmino-mix #775

Open
Cy-47 opened this issue Jan 5, 2025 · 1 comment

Labels
type/question An issue that's a question

Comments

Cy-47 commented Jan 5, 2025

❓ The question

Hi! I'm trying to create a dataset that resembles the training data of OLMo2. I saw that the proportion of each source in the mix has been published, but I haven't found a script for generating the mix, so I'm wondering if one is available. Thank you!

Cy-47 added the type/question label Jan 5, 2025

Cy-47 (Author) commented Jan 5, 2025

Currently, I'm using this script that I wrote:

import datasets
from datasets import Dataset
from transformers import AutoTokenizer

num_sequence_wanted = 20000
max_seq_len = 4096
# Target token proportions per source, from the published dolmino-mix composition.
portions = {
    "dclm": 0.472,
    "flan": 0.166,
    "pes2o": 0.0585,
    "wiki": 0.0711,
    "stackexchange": 0.0245,
    "math": 0.208,
}
seed = 42
save_path = "data/dolmino-mix-1124-50-20000"
tokenizer = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")

sources = ["dclm", "flan", "math", "pes2o", "stackexchange", "wiki"]

# Stream each source split separately, shuffle it (within the streaming
# buffer), and wrap it in an iterator.
iterators = {}
for source in sources:
    ds = datasets.load_dataset(
        "allenai/dolmino-mix-1124",
        streaming=True,
        data_dir="data/" + source,
        split="train",
    )
    iterators[source] = iter(ds.shuffle(seed=seed))

result_dataset = {"id": [], "text": [], "added": [], "created": []}

# Running token counts, used to track each source's share of the mix.
total_len_added = {source: 0 for source in sources}
total_len_added["total"] = 0


def add_sample(sample, source):
    """Append a sample to the result and update the token counts."""
    result_dataset["id"].append(sample["id"])
    result_dataset["text"].append(sample["text"])
    result_dataset["added"].append(sample["added"])
    result_dataset["created"].append(sample["created"])
    # Count at most max_seq_len tokens per document; note that only the
    # token count is capped here, the stored text itself is not truncated.
    sample_len = len(
        tokenizer(sample["text"], truncation=True, max_length=max_seq_len)["input_ids"]
    )
    total_len_added["total"] += sample_len
    total_len_added[source] += sample_len


for i in range(num_sequence_wanted):
    # Seed the mix with one dclm sample so the total below is never zero.
    if i == 0:
        add_sample(next(iterators["dclm"]), "dclm")
        continue
    # Greedily draw one sample from the first source whose current token
    # share is still below its target portion.
    for source in sources:
        if total_len_added[source] / total_len_added["total"] < portions[source]:
            sample = next(iterators[source], None)  # None once a source is exhausted
            if sample is not None:
                add_sample(sample, source)
                break

# Shuffle the assembled mix and save it to disk.
result_dataset_hf = Dataset.from_dict(result_dataset)
result_dataset_hf = result_dataset_hf.shuffle(seed=seed)
result_dataset_hf.save_to_disk(save_path)
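As a quick sanity check (a small addition using the script's own variables), the realized token shares can be printed after the loop and compared against the targets:

# Compare each source's realized token share against its target portion.
for source in sources:
    realized = total_len_added[source] / total_len_added["total"]
    print(f"{source}: target={portions[source]:.4f} realized={realized:.4f}")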

The resulting (shuffled) dataset looks like this:
[screenshot: preview of the resulting shuffled dataset]

It looks like most sequences start with "Question: " (i.e., they come from the FLAN portion), probably because those documents are shorter. This happens even though I limited the maximum sequence length to 4096. Is this distribution accurate?
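A rough back-of-the-envelope check suggests this is expected when portions are enforced on tokens rather than documents: a source with shorter documents needs more documents to reach its token share. The average lengths below are hypothetical placeholders, not measured values:

# Hypothetical average tokenized document lengths per source (placeholders,
# not measured); shorter documents inflate a source's share by count.
avg_len = {"dclm": 1500, "flan": 150, "math": 400,
           "pes2o": 3000, "stackexchange": 800, "wiki": 600}

# Expected share of documents per source when token shares match `portions`:
# doc_share is proportional to token_portion / avg_doc_length.
weights = {s: portions[s] / avg_len[s] for s in portions}
total = sum(weights.values())
doc_share = {s: round(w / total, 3) for s, w in weights.items()}
print(doc_share)  # with these placeholders, flan exceeds half the documents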

Cy-47 changed the title from "Generating training mix from dolmino-mix" to "Generating training mix of OLMo2 from dolmino-mix" on Jan 5, 2025