Friends of OLMo and their links. Built for Ai2's 2024 NeurIPS tutorial on opening the language modeling pipeline (slides here).
Language models (LMs) have become a critical technology for tackling a wide range of natural language processing tasks, making them ubiquitous in both AI research and commercial products. As their commercial importance has surged, the most powerful models have become more secretive, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details for scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. In this tutorial, we provide a detailed walkthrough of the language model development pipeline, including pretraining data, model architecture and training, and adaptation (e.g., instruction tuning, RLHF). For each of these development stages, we provide examples using open software and data, and discuss tips, tricks, pitfalls, and otherwise often inaccessible details about the full language model pipeline that we've uncovered in our own efforts to develop open models. We have opted not to have the optional panel given the extensive technical details and examples we need to include to cover this topic exhaustively.
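To make the "open software and data" part of the pipeline concrete, here is a minimal sketch of loading one of the fully open checkpoints listed below with Hugging Face `transformers` and generating a completion. The repo ID is an assumption used for illustration; substitute any base or instruct model from the lists in this README.

```python
# Minimal sketch: load an open checkpoint and generate a completion.
# The repo ID below is assumed for illustration -- swap in any open model
# from the lists (requires `pip install transformers accelerate torch`).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed Hub ID; any open base model works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer("Fully open language models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```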
This list focuses on language models where more than just the model weights are open -- we're looking for training code, data, and more! The best case is a fully open-source language model with the entire pipeline released, but individual open pieces are super valuable too.
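Since open pretraining data is one of the artifacts this list tracks, here is a similarly minimal sketch of peeking at an open corpus without downloading it in full, using the `datasets` library in streaming mode. The dataset ID is an assumption; any of the pretraining corpora linked below can be streamed the same way (field names may differ per dataset).

```python
# Minimal sketch: stream a few documents from an open pretraining corpus.
# The repo ID is assumed for illustration; field names vary by dataset.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"][:200])  # raw document text
    if i >= 2:
        break
```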
🚧 Missed something? Open a PR to add it! 🚧
- Collection
- 7B base model
- 13B base model
- 7B instruct
- 13B instruct
- Annealing dataset
- Training Code (1st gen.)
- Training Code (2nd gen.)
- Post-train Code
- Eval Code
- Data Processing Toolkit
- Demo
- SmolLM 2 collection
- SmolLM 2 pretraining data: TBD
- SmolLM instruction mix
- SmolLM collection
- SmolLM pretraining data
- Synthetic pretrain corpus
- FineWeb pretrain corpus
- SmolLM repo
- Blogposts:
- Analysis360
- K2-65B:
- CrystalCoder-7B:
- Amber-7B:
- Paper
- Pythia
- GPT-NeoX-20B
- Llemma-7B
- Cerebras-GPT
- Zamba 2 Models:
- Zyda 2 Dataset
- Pretraining Code
- Data Processing
- Post-training Code