Is the BOS token id of 128000 hardcoded into the llama 3.2 tokenizer?
#17 opened by rasyosef
I trained the llama 3.2 tokenizer using an Amharic language corpus and a vocab size of 28k, but when I use it to tokenize text, the BOS token id is still 128000.
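For context, the training looked roughly like this (a sketch, not my exact script; the corpus file name and loading code are illustrative):

```python
from transformers import AutoTokenizer

# Start from the original Llama 3.2 tokenizer
# (the meta-llama repo is gated, so this assumes access).
base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Illustrative corpus iterator; "amharic_corpus.txt" is a placeholder name.
def corpus_iterator(path="amharic_corpus.txt", batch_size=1000):
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Retrain the BPE model on the Amharic corpus with a 28k vocabulary.
new_tokenizer = base_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=28000
)
new_tokenizer.save_pretrained("llama-3.2-amharic-tokenizer-28k")
```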
Here are the first few lines of the `tokenizer_config.json` file of the newly trained tokenizer:
```json
{
  "added_tokens_decoder": {
    "0": {
      "content": "<|begin_of_text|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|end_of_text|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
```
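Checking the vocabulary directly confirms it agrees with this config (a quick check; the expected values follow from the entries above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")

# Both lookups should match tokenizer_config.json: BOS is 0, EOS is 1.
print(tokenizer.convert_tokens_to_ids("<|begin_of_text|>"))  # expected: 0
print(tokenizer.convert_tokens_to_ids("<|end_of_text|>"))    # expected: 1
```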
And here's a tokenization of an example text. As can be seen, the first token id is 128000 when it should have been 0:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")

text = "ሁሉም ነገር"  # Amharic for "everything"
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])
```
Output:

```
tensor([[128000, 1704, 802]])
```
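I suspect the prepended id comes from the post-processor that the fast tokenizer serializes into `tokenizer.json`, which `train_new_from_iterator` may carry over from the base tokenizer together with the original special token ids. A sketch for inspecting it (assumes the fast/Rust backend):

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")

# The fast tokenizer serializes its whole pipeline, including the
# post-processor that prepends special tokens, into tokenizer.json.
backend = json.loads(tokenizer.backend_tokenizer.to_str())

# If the post-processor still stores the original Llama 3.2 id for
# <|begin_of_text|>, 128000 will show up here rather than 0.
print(json.dumps(backend["post_processor"], indent=2))
```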