I noticed that `PackedDatasetBuilder` does not separate the tokens with `sep_token`.
To illustrate, referencing `lit-llama/scripts/prepare_redpajama.py`, line 71 in da71ade:

```python
builder = packed_dataset.PackedDatasetBuilder(
    outdir=destination_path,
    prefix=prefix,
    chunk_size=chunk_size,
    sep_token=tokenizer.bos_id,
    dtype="auto",
    vocab_size=tokenizer.vocab_size,
)
```
and line 85 in da71ade:

```python
text_ids = tokenizer.encode(text)
```
The minimal reproducible code is as follows:

```python
from pathlib import Path

import numpy as np

from lit_gpt.tokenizer import Tokenizer
from lit_gpt.packed_dataset import PackedDatasetBuilder

tokenizer = Tokenizer(Path('tokenizer'))
content = 'foo'
tokenized = tokenizer.encode(content)
print(tokenized)
# prints:
# tensor([7953, 2], dtype=torch.int32)

training_dataset_builder = PackedDatasetBuilder(
    outdir='FOO',
    # Use process_id to differentiate builders
    prefix='BAR',
    chunk_size=6,
    sep_token=tokenizer.bos_id,
    dtype="auto",
    vocab_size=tokenizer.vocab_size,
)

training_dataset_builder.add_array(np.array(tokenized))
print(training_dataset_builder._arr)
# prints:
# [7953    2    1    1    1    1]

training_dataset_builder.add_array(np.array(tokenized))
print(training_dataset_builder._arr)
# prints:
# [7953    2 7953    2    1    1]
```
`1` represents the bos token; `2` represents the eos token.
As you can see, this translates to:

```
foo</s>foo</s><s><s>
```
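To make the observed behavior concrete, here is a minimal sketch of what the builder appears to do (a hypothetical `TinyPackedBuilder`, not the real class): the chunk buffer is pre-filled with `sep_token`, and each token array is copied in back-to-back, so `sep_token` only ever shows up as trailing fill, never as a separator between documents.

```python
import numpy as np

class TinyPackedBuilder:
    """Hypothetical stand-in mimicking the observed packing behavior."""

    def __init__(self, chunk_size, sep_token):
        # Buffer is pre-filled with sep_token; that is the only place it appears.
        self._arr = np.full(chunk_size, sep_token, dtype=np.int64)
        self._idx = 0

    def add_array(self, arr):
        # Tokens are copied in back-to-back, with no separator inserted between calls.
        n = len(arr)
        self._arr[self._idx:self._idx + n] = arr
        self._idx += n

builder = TinyPackedBuilder(chunk_size=6, sep_token=1)
builder.add_array(np.array([7953, 2]))  # "foo" + eos, as in the repro above
builder.add_array(np.array([7953, 2]))
print(builder._arr)  # -> [7953    2 7953    2    1    1]
```

This reproduces the `[7953 2 7953 2 1 1]` buffer from the repro: the two `1`s at the end are leftover fill, not document separators.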
Shouldn't each `foo` be wrapped in bos and eos tokens, like this?

```
# Tensor
[1 7953 2 1 7953 2]
# Plain text
<s>foo</s><s>foo</s>
```
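As a possible workaround (a sketch, not a confirmed fix), one could prepend the bos id manually before each `add_array` call, so every document is framed as `<s> ... </s>`. The `bos_id = 1` and the token ids below are taken from the example above.

```python
import numpy as np

bos_id = 1
tokenized = np.array([7953, 2])  # "foo" + eos, as printed in the repro

# Frame the document with bos before handing it to the builder.
framed = np.concatenate(([bos_id], tokenized))
print(framed)  # -> [   1 7953    2]

# training_dataset_builder.add_array(framed) called twice with chunk_size=6
# would then yield [1 7953 2 1 7953 2], i.e. <s>foo</s><s>foo</s>.
```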