tokenizers

C++ implementations of various tokenizers (SentencePiece, Tiktoken, etc.). Useful for other PyTorch repos, such as torchchat and ExecuTorch, that build LLM runners on the ExecuTorch stack or the AOT Inductor stack.

SentencePiece tokenizer

Depends on https://github.com/google/sentencepiece from Google.

Tiktoken tokenizer

Adopted from https://github.com/sewenew/tokenizer.
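
For readers unfamiliar with how a Tiktoken-style tokenizer works, here is a minimal sketch of the rank-based byte-pair merge at its core. It is an illustration only and not this repository's API: the function name `bpe_encode` and the toy rank table are hypothetical, and the regex pre-splitting and special-token handling that a real Tiktoken implementation performs are omitted.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper for illustration; not part of this repository.
// Splits `text` into single-byte pieces, then repeatedly merges the adjacent
// pair whose concatenation has the lowest rank in `ranks`, until no adjacent
// pair appears in the table. Returns the rank (token id) of each final piece.
std::vector<int> bpe_encode(const std::string& text,
                            const std::map<std::string, int>& ranks) {
  std::vector<std::string> pieces;
  for (char c : text) pieces.emplace_back(1, c);

  while (pieces.size() > 1) {
    int best_rank = std::numeric_limits<int>::max();
    std::size_t best_pos = pieces.size();  // sentinel: no mergeable pair found
    for (std::size_t i = 0; i + 1 < pieces.size(); ++i) {
      auto it = ranks.find(pieces[i] + pieces[i + 1]);
      if (it != ranks.end() && it->second < best_rank) {
        best_rank = it->second;
        best_pos = i;
      }
    }
    if (best_pos == pieces.size()) break;  // nothing left to merge

    // Merge the winning pair into a single piece.
    pieces[best_pos] += pieces[best_pos + 1];
    pieces.erase(pieces.begin() + static_cast<std::ptrdiff_t>(best_pos + 1));
  }

  std::vector<int> ids;
  ids.reserve(pieces.size());
  // Assumes every remaining piece is in the table; std::map::at throws otherwise.
  for (const auto& p : pieces) ids.push_back(ranks.at(p));
  return ids;
}
```

With a toy table of `a`=0, `b`=1, `c`=2, `ab`=3, `abc`=4, encoding `"abcab"` merges both `ab` pairs first (rank 3), then `ab` + `c` (rank 4), and returns the ids 4 and 3. The linear rescan on every merge keeps the sketch short; a production tokenizer would typically use a more efficient structure.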

License

tokenizers is released under the BSD 3 license. (Additional code in this distribution is covered by the MIT and Apache open source licenses.) However, you may have other legal obligations that govern your use of content, such as the terms of service for third-party models.