Update Paloma readme
IanMagnusson authored Jun 6, 2024
1 parent f108d53 commit 9b3a246
Showing 1 changed file with 6 additions and 14 deletions.
paloma/README.md

# Paloma

In addition to the dataset hosted here, Paloma introduces guidelines for making perplexity results comparable across models, along with code that implements these guidelines with specific experimental controls. This page will walk you through how to apply these standards to your experiments.

Whether you are just evaluating an off-the-shelf model or preparing to conduct your own pretraining experiment from scratch, we recommend that you employ as much of our standardized code as possible to ensure the greatest level of comparability with existing results.

Links:

[Data](https://huggingface.co/datasets/allenai/paloma)
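
If you just want to look at the evaluation data itself, it can be loaded with the `datasets` library. The sketch below is a minimal example: the config name `c4_en`, the `val` split, and the `text` field are assumptions for illustration, so check the dataset card for the actual source names and layout (access may also require accepting the terms on the Hub).

```python
# Minimal sketch: load one Paloma source from the Hugging Face Hub.
# The config name "c4_en", the "val" split, and the "text" field are assumed
# for illustration; see the dataset card for the actual layout.
from datasets import load_dataset

paloma_c4 = load_dataset("allenai/paloma", "c4_en", split="val")
print(paloma_c4[0]["text"][:200])
```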

## Getting existing results from the benchmark
Paloma is first and foremost a suite of results from the research community, organized by comparability. These are formatted as *.jsonl.gz files recording perplexity over each of our 585 domains, as well as additional metrics discussed in our paper. These files are the same type of results that are output by running the code in this repo for a given model.
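
As a rough illustration of how such a results file can be inspected, the sketch below reads a `*.jsonl.gz` file line by line; the file path and the field names `domain` and `ppl_primary` are assumptions for illustration rather than the exact output schema.

```python
# Minimal sketch: read a Paloma-style results file (*.jsonl.gz), one JSON record per line.
# The path and the field names ("domain", "ppl_primary") are assumed for illustration.
import gzip
import json

with gzip.open("results/example-model.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record.get("domain"), record.get("ppl_primary"))
```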

We are also building out a website to allow interactive inspection of these multi-dimensional results. Until then please contact us by emailing the first author of Paloma if you would like access to the raw benchmark results.

So far the models evaluated by the benchmark are the 6 baseline 1B parameter models that we release with Paloma as well as `EleutherAI/pythia-160m`, `EleutherAI/pythia-1B`, and `EleutherAI/pythia-6.9b`.

## Setup
Start by following the installation instructions for this repo in this [readme](../README.md).

```
tango --settings tango.yml run configs/example_paloma_config.jsonnet --workspace
```

## Pretraining your model
If you are pretraining from scratch, we recommend you adopt several experimental controls that will allow the greatest level of comparability for your results. In this section we detail how you can accomplish these experimental controls.

### Decontaminating your pretraining data
Our decontamination approach is implemented in the Dolma Tooling repo. This will allow you to remove any document from your pretraining data that is contaminated with respect to Paloma.

To do this please follow the instructions [here](https://github.com/allenai/dolma/blob/decon-instructions/docs/paloma_decontamination.md) to decontaminate your own pretraining data.
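
The Dolma tooling linked above is the supported way to do this. Purely to illustrate the idea behind contamination removal, here is a toy sketch that drops training documents sharing a paragraph with evaluation text; it is a simplification, not the actual Dolma implementation.

```python
# Toy illustration only: drop any training document that shares a full paragraph
# with the evaluation data. Real decontamination should use the Dolma tooling;
# this just sketches the concept of document-level removal based on overlap.
def decontaminate(train_docs: list[str], eval_docs: list[str]) -> list[str]:
    eval_paragraphs = {p.strip() for doc in eval_docs for p in doc.split("\n") if p.strip()}
    kept = []
    for doc in train_docs:
        paragraphs = {p.strip() for p in doc.split("\n") if p.strip()}
        if paragraphs & eval_paragraphs:  # any shared paragraph => treat as contaminated
            continue
        kept.append(doc)
    return kept
```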

### Fixing the training data order
Our approach for fixing the training data order requires the use of [the same OLMo training code](https://github.com/allenai/OLMo/tree/1f2f02052d2a5ecba82ff45bbfc731651b1e7d29) that we employ to train our 1B parameter baselines. Contemporary LMs train on instances that are maximum-sequence-length concatenations of training documents, so we must fix the order of concatenated instances. We do this by fixing the tokenization, maximum sequence length, and random seed, as well as providing dataloading code where order is invariant to the number of devices.
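
To make the last point concrete, here is a toy sketch of device-count-invariant ordering (an illustration of the idea only, not the OLMo dataloader): the global order of concatenated instances depends only on a fixed seed, and each rank reads a strided slice of that one global order.

```python
# Toy sketch: a single global instance order determined only by the seed.
# Each rank reads a strided slice, so changing the number of devices changes
# which rank consumes which instance, but never the fixed global order.
# This illustrates the idea; it is not the OLMo dataloading code.
import numpy as np

def instance_indices_for_rank(num_instances: int, seed: int, rank: int, world_size: int) -> np.ndarray:
    global_order = np.random.RandomState(seed).permutation(num_instances)
    return global_order[rank::world_size]

# e.g. with 4 devices, rank 0 trains on every 4th instance of the fixed global order
print(instance_indices_for_rank(num_instances=1_000, seed=0, rank=0, world_size=4)[:5])
```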

### Fixing the vocabulary
We ask that submissions that do not investigate changes in vocabulary opt in to our standardized vocabulary to enable the greatest level of comparability. That vocabulary is available from the tokenizer hosted on the Hugging Face Hub as `allenai/gpt-neox-olmo-dolma-v1_5`.
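
For example, the standardized tokenizer can be loaded directly with the `transformers` library (a minimal sketch; the sample sentence is arbitrary):

```python
# Load the standardized Paloma vocabulary via its tokenizer on the Hugging Face Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/gpt-neox-olmo-dolma-v1_5")
print(tokenizer("Paloma fixes the vocabulary across submissions.")["input_ids"])
```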

## Making a submission
At present we are building out an automatic submission process that will soon be available. Until then, please reach out to us by emailing `[email protected]` if you would like to submit results to the benchmark.

## Citation

```bibtex
@article{Magnusson2023Paloma,
  title={Paloma: A Benchmark for Evaluating Language Model Fit},
  author={Ian Magnusson and others},
  journal={ArXiv},
  year={2023},
  volume={abs/2312.10523},
  url={https://api.semanticscholar.org/CorpusID:266348815}
}
```
