Add GSM8K in-loop #777

Open · wants to merge 3 commits into `main`
Conversation


@davidheineman davidheineman commented Jan 8, 2025

This PR adds the bits-per-byte (BPB) over the correct continuation in GSM8K to the in-loop evals.

It uses the setup implemented in https://github.com/allenai/oe-eval-internal/pull/374.

You would add this eval with:

```yaml
 - label: gsm8k_gold_bpb_5shot
   type: downstream
```

Evaluation Setup

The "gold continuation" appears as follows:

```
Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Answer: Janet sells 16 - 3 - 4 = 9 duck eggs a day. She makes 9 * 2 = $18 every day at the farmer’s market. So the answer is 18.
```
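Concretely, scoring the gold continuation takes a single forward pass: run the model once on the prompt plus the gold answer and sum the log-probabilities over the answer tokens only. A minimal sketch of that accounting (the function name and the list-of-logprobs interface are illustrative, not the actual oe-eval API):

```python
def gold_continuation_logprob(token_logprobs, prompt_len):
    """Sum log-probabilities over the gold-continuation tokens only.

    token_logprobs: per-token log-probs (natural log) for the full
    sequence (prompt + gold continuation), as a single forward pass
    would produce them.
    prompt_len: number of prompt tokens to skip.
    Hypothetical helper; the real eval wires this into the in-loop
    evaluator rather than taking a plain list.
    """
    return sum(token_logprobs[prompt_len:])

# Toy example: 3 prompt tokens, 2 continuation tokens.
logprobs = [-2.0, -1.5, -0.5, -0.7, -0.3]
ll = gold_continuation_logprob(logprobs, prompt_len=3)  # sums only the last two
```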

Why we should use BPB (better BPB ≈ higher exact-match accuracy)

There are two reasons to use this instead of exact match:

  • Each GSM8K question requires only one forward pass
  • Models under 3B parameters do not score above chance on exact match, but their BPB still improves
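For reference, BPB normalizes the summed negative log-likelihood (in nats) of the gold continuation by the byte length of its UTF-8 encoding: BPB = NLL / (ln 2 · n_bytes). A small sketch of that conversion (the helper name is illustrative):

```python
import math

def bits_per_byte(nll_nats, num_bytes):
    """Convert total negative log-likelihood (nats) over the gold
    continuation into bits per byte of its UTF-8 encoding."""
    return nll_nats / (math.log(2) * num_bytes)

# Hypothetical NLL value for the gold continuation above.
continuation = "Janet sells 16 - 3 - 4 = 9 duck eggs a day."
n_bytes = len(continuation.encode("utf-8"))
bpb = bits_per_byte(30.0, n_bytes)
```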

The figures below support two claims:

  • BPB improves monotonically with model size
  • For large models, a better BPB generally predicts better eventual exact-match performance (although the mapping isn't perfect)
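The step-2 fit maps in-loop BPB to downstream exact match. As a toy illustration only (the actual ladder fits use a more involved functional form than this plain least-squares line, and the data here is made up):

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ≈ a*x + b, in pure Python.
    Toy stand-in for the step-2 BPB -> exact-match fit."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b = my - a * mx
    return a, b

# Hypothetical (BPB, exact-match) pairs from a few model sizes.
bpbs = [1.4, 1.2, 1.0, 0.9]
accs = [0.02, 0.10, 0.25, 0.33]
slope, intercept = fit_linear(bpbs, accs)
```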

Two step prediction using model ladder up to 3B-10xC:

[figure]

Step 2 prediction with external models:

[figure]

Smoothness

This is the BPB for GSM8K when training a 1B 5xC model (shaded region is sample variance):

[figure]

Sanity Check

I verified this works using:

```
torchrun --nproc_per_node=1 scripts/train.py configs/tiny/OLMo-20M.yaml --save_overwrite
```
