run_pretrain_bart.sh returns IndexError #82
Comments
Hi, |
Thank you for the prompt reply. The training command of
For the MEGATRON pre-process data command, it is:
which the sample contents of
Please note that I intentionally use vocabulary entries longer than one character, which might be what causes the problem. For the data structure of
and I checked the content of each
For the
For the
Hope the above information helps. |
I noticed that you have a custom tokenizer. Have you changed the vocab_size in the model config and the tokenizer config accordingly? You can try running the pre-training on the original Chinese BART model to see whether it still works. |
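One way to check the vocab_size question is to compare the tokenizer's vocabulary size with the vocab_size in the model config. The sketch below is a minimal, library-free illustration: `check_vocab_size` and its arguments are hypothetical names, and it assumes the two numbers are obtained separately (for example from `len(tokenizer)` and the `vocab_size` field of the model config).

```python
def check_vocab_size(tokenizer_size: int, config_vocab_size: int) -> str:
    """Compare the tokenizer's vocabulary size with the model config's vocab_size.

    Both arguments are plain integers. How they are obtained (for example
    len(tokenizer) and the "vocab_size" field of the config) is an assumption.
    """
    if tokenizer_size > config_vocab_size:
        # Token ids >= config_vocab_size would index past the embedding table.
        return (f"mismatch: tokenizer has {tokenizer_size} tokens but "
                f"the config allows only {config_vocab_size}")
    if tokenizer_size < config_vocab_size:
        # Usually harmless: some embedding rows are simply never used.
        return (f"ok (padded): tokenizer has {tokenizer_size} tokens, "
                f"config has {config_vocab_size}")
    return "ok: sizes match"
```

If the check reports a mismatch, the model's embedding table is smaller than the tokenizer's vocabulary, which is one plausible source of an out-of-range index.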
Oh wow, how did you notice that I use a custom tokenizer? I use the following code to add tokens:
I think the You mentioned the tokenizer config, but the |
The full output is here:
|
After replacing the dataset files in the
Which the
Is this vocab generation approach correct? |
It seems that everything is fine with your configuration. The cause may be a version mismatch in some package, such as torch. The code was written 3 years ago, and we no longer have an environment to run it on our platform. Could you please add some print statements at this line:
to show the size and dtype of mask_random, indices and source. |
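A small helper can keep such debug prints uniform. This is only a sketch: `describe` is a hypothetical name, and it assumes the three variables expose `.shape` and `.dtype` attributes, as torch tensors and NumPy arrays do.

```python
def describe(name, tensor):
    """Return a one-line summary of a tensor-like object's shape and dtype."""
    # Fall back gracefully if the object is not a tensor (e.g. a plain int).
    shape = tuple(getattr(tensor, "shape", ()))
    dtype = getattr(tensor, "dtype", type(tensor).__name__)
    return f"{name}: shape={shape}, dtype={dtype}"

# Intended use just before the failing line (variable names from the thread):
#   print(describe("mask_random", mask_random))
#   print(describe("indices", indices))
#   print(describe("source", source))
```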
Here is the output just before the
|
It looks like the dataset works and can generate samples. You may want to check the dataset and the vocab to see whether some special samples trigger the error. |
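A scan for such special samples could look like the sketch below. It assumes each sample is available as a sequence of integer token ids and treats two cases as suspect: empty samples and ids outside the vocabulary range. `find_suspect_samples` is a hypothetical helper, not part of the repository.

```python
def find_suspect_samples(samples, vocab_size):
    """Return a list of (index, reason) for samples that look problematic."""
    suspects = []
    for i, ids in enumerate(samples):
        ids = list(ids)
        if not ids:
            # An empty sample leaves nothing to mask or index into.
            suspects.append((i, "empty sample"))
            continue
        bad = [t for t in ids if t < 0 or t >= vocab_size]
        if bad:
            # Show at most five offending ids to keep the report short.
            suspects.append((i, f"ids out of range: {bad[:5]}"))
    return suspects
```

Running it over the first few thousand samples is usually enough to tell whether the problem is systematic or limited to a handful of entries.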
If I use
Each iteration took 2500 ms, so 100,000 iterations will take about 69 hours to complete. |
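The estimate follows directly from the two figures quoted, as this short calculation shows:

```python
ms_per_iteration = 2500
iterations = 100_000

# milliseconds -> seconds -> hours
total_hours = ms_per_iteration * iterations / 1000 / 3600
print(f"{total_hours:.1f} hours")  # prints "69.4 hours"
```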
After a long run, the pre-processing has finally completed. It generated a For example, how can I use the files with
Run the script “pretrain/tools/convert_ckpt.py”, passing the folder path. It will convert the checkpoint into one that can be used by the Hugging Face modules. |
Thanks. I successfully converted the How can I use this generated In the |
Here is the stack trace of the run_pretrain_bart.sh error:
How can I debug it? It seems the dataset is problematic, but it was generated by the MEGATRON pre-processing scripts in the tools/ folder, which ran without an error. The .bin and .idx files were generated properly (and I put them in the dataset/ folder).