Adding new vocabulary into the model

#6
by raptorkwok - opened

I am curious how I can add more vocabulary to the model.

Currently, I use bart-base-chinese as the pre-trained model, but I have found that its current vocabulary of 51,271 tokens is insufficient: unknown tokens appear from time to time. Say I have 50,000 more tokens to add to the model; can you share some ideas on how I can achieve this?

Thanks.

Fudan NLP org

You can add new tokens to the vocabulary by following the Hugging Face docs, like this:

# assuming the tokenizer and model have been loaded with from_pretrained()
tokenizer.add_tokens(["JU", "AZ"])
model.resize_token_embeddings(len(tokenizer))

Note that the embeddings of the added tokens are untrained, so the model needs to be further pre-trained or fine-tuned on additional data.
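
As a minimal sketch of one common heuristic (not a required step), you can initialize each new token's embedding as the mean of the embeddings of the sub-tokens it was split into under the old vocabulary, so the new rows start from something better than random before further training. The token list below is just a placeholder.

import torch
from transformers import BertTokenizer, BartForConditionalGeneration

# sketch only: fnlp/bart-base-chinese with a placeholder list of new tokens
tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")

new_tokens = ["JU", "AZ"]  # replace with your ~50,000 new tokens
# how each new token splits under the OLD vocabulary (must be computed before add_tokens)
old_ids = [tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens]

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# heuristic init: mean of the old sub-token embeddings instead of random rows
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    for token, ids in zip(new_tokens, old_ids):
        if ids:
            emb[tokenizer.convert_tokens_to_ids(token)] = emb[torch.tensor(ids)].mean(dim=0)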

May I know how to further pre-train / fine-tune?

Specifically, I saw the following in the pre-training README:

  • dataset: Place the .bin and .idx files preprocessed from the raw text. I figured this out by reading the Megatron README > Preprocess Data section.

  • vocab: Place the vocab files and the model config file. I saved the tokenizer using the .save_pretrained() function, which generated the following files: added_tokens.json, special_tokens_map.json, tokenizer_config.json and vocab.txt. Are these files okay?

  • roberta_zh: Place the checkpoint of Chinese RoBERTa, as CPT initializes its encoder from that checkpoint. How do I do that? Load it with .from_pretrained() and then save it with .save_pretrained(), as in the sketch after this list?
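
For the last item, what I have in mind is something like the sketch below (hfl/chinese-roberta-wwm-ext is my assumption for the Chinese RoBERTa checkpoint; please correct me if the pre-training scripts expect a different checkpoint or format):

from transformers import BertTokenizer, BertModel

# assumed checkpoint name; save it locally so the pre-training scripts can read it from roberta_zh/
roberta_tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
roberta_model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
roberta_tokenizer.save_pretrained("./roberta_zh")
roberta_model.save_pretrained("./roberta_zh")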

Thanks in advance.

Fudan NLP org

We provide code for fine-tuning on our GitHub: https://github.com/fastnlp/CPT/finetune

You can use it for further pre-training or fine-tuning.
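
If you prefer to stay with plain Hugging Face Transformers instead of the Megatron-based pre-training scripts, a generic seq2seq fine-tuning loop can also train the new embeddings. Below is a rough sketch; the tokenizer path, toy data, and hyper-parameters are placeholders, not values from our scripts.

from transformers import (BertTokenizer, BartForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments)

# placeholder path to a tokenizer saved after add_tokens(); replace with your own
tokenizer = BertTokenizer.from_pretrained("./tokenizer_with_new_vocab")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")
model.resize_token_embeddings(len(tokenizer))

# toy source/target pairs; replace with your real corpus
pairs = [{"src": "输入文本", "tgt": "目标文本"}]

def encode(example):
    enc = tokenizer(example["src"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(example["tgt"], truncation=True, max_length=512)["input_ids"]
    return enc

train_data = [encode(p) for p in pairs]

args = Seq2SeqTrainingArguments(output_dir="./ft-out", num_train_epochs=1,
                                per_device_train_batch_size=2, learning_rate=3e-5)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_data,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()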

Thanks for the reply. If I add more vocabulary to the model, I should first pre-train it and then fine-tune it, am I correct? Thanks.

When running ./run_pretrain_bart.sh, it shows an error at an early stage:

[rank0]: IndexError: Caught IndexError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/blendable_dataset.py", line 83, in __getitem__
[rank0]:     return self.datasets[dataset_idx][sample_idx]
[rank0]:   File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/bart_dataset.py", line 106, in __getitem__
[rank0]:     return self.build_training_sample(sample, self.max_seq_length, np_rng)
[rank0]:   File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/bart_dataset.py", line 148, in build_training_sample
[rank0]:     source = self.add_whole_word_mask(source, mask_ratio, replace_length)
[rank0]:   File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/bart_dataset.py", line 360, in add_whole_word_mask
[rank0]:     source[indices[mask_random]] = torch.randint(
[rank0]: IndexError: The shape of the mask [2] at index 0 does not match the shape of the indexed tensor [1] at index 0

The dataset was generated using Megatron's Preprocess Data method.
