
Generating training mix of OLMo2 from dolmino-mix #775

Open
Cy-47 opened this issue Jan 5, 2025 · 1 comment

Labels
type/question An issue that's a question

Comments

Cy-47 commented Jan 5, 2025

❓ The question

Hi! I'm trying to create a dataset that resembles the training data of OLMo2. I saw that the proportion of each source in the mix has been published, but I haven't found a script for generating the mix, so I'm wondering if one is available. Thank you!

Cy-47 added the type/question label Jan 5, 2025

Cy-47 (Author) commented Jan 5, 2025

Currently, I'm using this script that I wrote:

import datasets
from datasets import Dataset
from transformers import AutoTokenizer

num_sequence_wanted = 20000
max_seq_len = 4096
# Target token proportions per source, from the published dolmino-mix composition.
portions = {
    "dclm": 0.472,
    "flan": 0.166,
    "pes2o": 0.0585,
    "wiki": 0.0711,
    "stackexchange": 0.0245,
    "math": 0.208,
}
seed = 42
save_path = "data/dolmino-mix-1124-50-20000"
tokenizer = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")

sources = ["dclm", "flan", "math", "pes2o", "stackexchange", "wiki"]

# Stream each source split separately, shuffle it (within the streaming
# buffer), and wrap it in an iterator.
iterators = {}
for source in sources:
    ds = datasets.load_dataset(
        "allenai/dolmino-mix-1124",
        streaming=True,
        data_dir="data/" + source,
        split="train",
    )
    iterators[source] = iter(ds.shuffle(seed=seed))

result_dataset = {"id": [], "text": [], "added": [], "created": []}

# Running token counts, used to track each source's share of the mix.
total_len_added = {source: 0 for source in sources}
total_len_added["total"] = 0


def add_sample(sample, source):
    """Append a sample to the result and update the token counts."""
    result_dataset["id"].append(sample["id"])
    result_dataset["text"].append(sample["text"])
    result_dataset["added"].append(sample["added"])
    result_dataset["created"].append(sample["created"])
    # Count at most max_seq_len tokens per document; note that only the
    # token count is capped here, the stored text itself is not truncated.
    sample_len = len(
        tokenizer(sample["text"], truncation=True, max_length=max_seq_len)["input_ids"]
    )
    total_len_added["total"] += sample_len
    total_len_added[source] += sample_len


for i in range(num_sequence_wanted):
    # Seed the mix with one dclm sample so the total below is never zero.
    if i == 0:
        add_sample(next(iterators["dclm"]), "dclm")
        continue
    # Greedily draw one sample from the first source whose current token
    # share is still below its target portion.
    for source in sources:
        if total_len_added[source] / total_len_added["total"] < portions[source]:
            sample = next(iterators[source], None)  # None once a source is exhausted
            if sample is not None:
                add_sample(sample, source)
                break

# Shuffle the assembled mix and save it to disk.
result_dataset_hf = Dataset.from_dict(result_dataset)
result_dataset_hf = result_dataset_hf.shuffle(seed=seed)
result_dataset_hf.save_to_disk(save_path)
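As a quick sanity check (a small addition using the script's own variables), the realized token shares can be printed after the loop and compared against the targets:

# Compare each source's realized token share against its target portion.
for source in sources:
    realized = total_len_added[source] / total_len_added["total"]
    print(f"{source}: target={portions[source]:.4f} realized={realized:.4f}")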

The resulting (shuffled) dataset looks like this:
[screenshot: preview of the resulting shuffled dataset]

It looks like most sequences start with "Question: " (i.e., they come from the FLAN portion), probably because those documents are shorter. This happens even though I limited the maximum sequence length to 4096. Is this distribution accurate?
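A rough back-of-the-envelope check suggests this is expected when portions are enforced on tokens rather than documents: a source with shorter documents needs more documents to reach its token share. The average lengths below are hypothetical placeholders, not measured values:

# Hypothetical average tokenized document lengths per source (placeholders,
# not measured); shorter documents inflate a source's share by count.
avg_len = {"dclm": 1500, "flan": 150, "math": 400,
           "pes2o": 3000, "stackexchange": 800, "wiki": 600}

# Expected share of documents per source when token shares match `portions`:
# doc_share is proportional to token_portion / avg_doc_length.
weights = {s: portions[s] / avg_len[s] for s in portions}
total = sum(weights.values())
doc_share = {s: round(w / total, 3) for s, w in weights.items()}
print(doc_share)  # with these placeholders, flan exceeds half the documents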

Cy-47 changed the title from "Generating training mix from dolmino-mix" to "Generating training mix of OLMo2 from dolmino-mix" on Jan 5, 2025