Hi! I'm trying to create a dataset that resembles the training data of OLMo2. I saw that the portion of each source in the mix has been given, but I haven't found a script for generating the mix, so I'm wondering if one is available. Thank you!
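For reference, here is a minimal sketch of what such a mixing script could look like. It is not an official OLMo2 script. The source names, the weights, the `build_mix` and `count_tokens` helpers, and the whitespace token count are all assumptions made for illustration; a real run would use the published proportions and the model's tokenizer.

```python
import random
from typing import Dict, List


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (assumption): whitespace-separated words.
    return len(text.split())


def build_mix(
    sources: Dict[str, List[str]],
    weights: Dict[str, float],
    token_budget: int,
    seed: int = 0,
) -> List[str]:
    """Sample documents so each source fills roughly its share of the token budget."""
    rng = random.Random(seed)
    total_weight = sum(weights.values())
    mix: List[str] = []
    for name, docs in sources.items():
        # Token target for this source, proportional to its weight.
        target = token_budget * weights[name] / total_weight
        pool = list(docs)
        rng.shuffle(pool)
        used = 0
        for doc in pool:
            if used >= target:
                break
            mix.append(doc)
            used += count_tokens(doc)
        # A source with too few documents is simply exhausted here (no upsampling).
    rng.shuffle(mix)  # interleave the sources
    return mix


# Hypothetical toy sources and weights, for illustration only.
sources = {
    "web": ["a b c d"] * 100,
    "math": ["Question: x y"] * 100,
}
weights = {"web": 0.75, "math": 0.25}
mix = build_mix(sources, weights, token_budget=400)
```

Note that the proportions here are applied to tokens, not documents, so a source made of short documents contributes more documents than its token share alone would suggest.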
It looks like most sequences start with "Question: ", probably because these sentences are shorter. This happens even though I limited the max sequence length to 4096. Is this distribution accurate?
Cy-47 changed the title from "Generating training mix from dolmino-mix" to "Generating training mix of OLMo2 from dolmino-mix" on Jan 5, 2025