Friends of OLMo and their links. Built for Ai2's 2024 NeurIPS tutorial on opening the language modeling pipeline (slides here).
Language models (LMs) have become a critical technology for tackling a wide range of natural language processing tasks, making them ubiquitous in both AI research and commercial products. As their commercial importance has surged, the most powerful models have become more secretive, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details for scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. In this tutorial, we provide a detailed walkthrough of the language model development pipeline, including pretraining data, model architecture and training, and adaptation (e.g., instruction tuning, RLHF). For each of these development stages, we provide examples using open software and data, and discuss tips, tricks, pitfalls, and otherwise often inaccessible details about the full language model pipeline that we've uncovered in our own efforts to develop open models. We have opted not to have the optional panel given the extensive technical details and examples we need to include to cover this topic exhaustively.
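To make the "open software and data" part of the pipeline concrete, here is a minimal sketch of loading one of the fully open checkpoints listed below with Hugging Face `transformers` and generating a completion. The repo ID is an assumption used for illustration; substitute any base or instruct model from the lists in this README.

```python
# Minimal sketch: load an open checkpoint and generate a completion.
# The repo ID below is assumed for illustration -- swap in any open model
# from the lists (requires `pip install transformers accelerate torch`).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed Hub ID; any open base model works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer("Fully open language models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```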
This list focuses on language models where more than just the model weights are open -- we're looking for training code, data, and more! The best case is a fully open-source language model with the entire pipeline released, but individual open pieces are super valuable too.
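Since open pretraining data is one of the artifacts this list tracks, here is a similarly minimal sketch of peeking at an open corpus without downloading it in full, using the `datasets` library in streaming mode. The dataset ID is an assumption; any of the pretraining corpora linked below can be streamed the same way (field names may differ per dataset).

```python
# Minimal sketch: stream a few documents from an open pretraining corpus.
# The repo ID is assumed for illustration; field names vary by dataset.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"][:200])  # raw document text
    if i >= 2:
        break
```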
🚧 Missed something? Open a PR to add it! 🚧
- Collection
- 7B base model
- 13B base model
- 7B instruct
- 13B instruct
- Annealing dataset
- Training Code (1st gen.)
- Training Code (2nd gen.)
- Post-train Code
- Eval Code
- Data Processing Toolkit
- Demo
- SmolLM 2 collection
- SmolLM 2 pretraining data: TBD
- SmolLM instruction mix
- SmolLM collection
- SmolLM pretraining data
- Synthetic pretrain corpus
- FineWeb pretrain corpus
- SmolLM repo
- Blogposts:
- Analysis360
- K2-65B:
- CrystalCoder-7B:
- Amber-7B:
- Paper
- Pythia
- GPT-NeoX-20B
- Llemma-7B
- Cerebras-GPT
- Zamba 2 Models:
- Zyda 2 Dataset
- Pretraining Code
- Data Processing
- Post-training Code