Add GSM8K in-loop #777

Open · wants to merge 3 commits into `main`
Conversation


@davidheineman davidheineman commented Jan 8, 2025

This PR adds the bits-per-byte (BPB) over the correct continuation in GSM8K to the in-loop evals.

It uses the setup implemented in https://github.com/allenai/oe-eval-internal/pull/374.

You would add this eval with:

```yaml
 - label: gsm8k_gold_bpb_5shot
   type: downstream
```

Evaluation Setup

The "gold continuation" appears as follows:

```
Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Answer: Janet sells 16 - 3 - 4 = 9 duck eggs a day. She makes 9 * 2 = $18 every day at the farmer’s market. So the answer is 18.
```
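Concretely, scoring the gold continuation takes a single forward pass: run the model once on the prompt plus the gold answer and sum the log-probabilities over the answer tokens only. A minimal sketch of that accounting (the function name and the list-of-logprobs interface are illustrative, not the actual oe-eval API):

```python
def gold_continuation_logprob(token_logprobs, prompt_len):
    """Sum log-probabilities over the gold-continuation tokens only.

    token_logprobs: per-token log-probs (natural log) for the full
    sequence (prompt + gold continuation), as a single forward pass
    would produce them.
    prompt_len: number of prompt tokens to skip.
    Hypothetical helper; the real eval wires this into the in-loop
    evaluator rather than taking a plain list.
    """
    return sum(token_logprobs[prompt_len:])

# Toy example: 3 prompt tokens, 2 continuation tokens.
logprobs = [-2.0, -1.5, -0.5, -0.7, -0.3]
ll = gold_continuation_logprob(logprobs, prompt_len=3)  # sums only the last two
```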

Why we should use BPB (better BPB ≈ higher exact-match accuracy)

There are two reasons to use this instead of exact match:

  • Each GSM8K question requires only one forward pass
  • Models under 3B parameters do not score above chance on exact match, but their BPB still improves
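For reference, BPB normalizes the summed negative log-likelihood (in nats) of the gold continuation by the byte length of its UTF-8 encoding: BPB = NLL / (ln 2 · n_bytes). A small sketch of that conversion (the helper name is illustrative):

```python
import math

def bits_per_byte(nll_nats, num_bytes):
    """Convert total negative log-likelihood (nats) over the gold
    continuation into bits per byte of its UTF-8 encoding."""
    return nll_nats / (math.log(2) * num_bytes)

# Hypothetical NLL value for the gold continuation above.
continuation = "Janet sells 16 - 3 - 4 = 9 duck eggs a day."
n_bytes = len(continuation.encode("utf-8"))
bpb = bits_per_byte(30.0, n_bytes)
```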

The figures below support two claims:

  • BPB improves monotonically with model size
  • For large models, a better BPB generally predicts better eventual exact-match performance (although the mapping isn't perfect)
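The step-2 fit maps in-loop BPB to downstream exact match. As a toy illustration only (the actual ladder fits use a more involved functional form than this plain least-squares line, and the data here is made up):

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ≈ a*x + b, in pure Python.
    Toy stand-in for the step-2 BPB -> exact-match fit."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b = my - a * mx
    return a, b

# Hypothetical (BPB, exact-match) pairs from a few model sizes.
bpbs = [1.4, 1.2, 1.0, 0.9]
accs = [0.02, 0.10, 0.25, 0.33]
slope, intercept = fit_linear(bpbs, accs)
```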

Two step prediction using model ladder up to 3B-10xC:

[figure]

Step 2 prediction with external models:

[figure]

Smoothness

This is the BPB for GSM8K when training a 1B 5xC model (shaded region is sample variance):

[figure]

Sanity Check

I verified this works using:

```
torchrun --nproc_per_node=1 scripts/train.py configs/tiny/OLMo-20M.yaml --save_overwrite
```
