Is the BOS token id of 128000 hardcoded into the llama 3.2 tokenizer?

#17
by rasyosef

I trained the Llama 3.2 tokenizer on an Amharic corpus with a vocab size of 28k, but when I use it to tokenize text, the BOS token id is still 128000.
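For context, retraining a tokenizer with a new vocabulary is normally done with train_new_from_iterator; the snippet below is only a simplified sketch of that pattern (the base checkpoint name and the Amharic corpus iterator are placeholders, not the exact code used).

from transformers import AutoTokenizer

# Start from the original Llama 3.2 tokenizer (placeholder checkpoint name)
base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Placeholder: any iterator or list of Amharic text strings
amharic_corpus = ["ሁሉም ነገር"]

# Retrain on the new corpus with a 28k vocabulary and save the result
new_tokenizer = base_tokenizer.train_new_from_iterator(amharic_corpus, vocab_size=28000)
new_tokenizer.save_pretrained("llama-3.2-amharic-tokenizer-28k")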

Here are the first few lines of the tokenizer_config.json file of the newly trained tokenizer.

{
  "added_tokens_decoder": {
    "0": {
      "content": "<|begin_of_text|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|end_of_text|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },

And here's the tokenization of an example text. As can be seen, the first token id is 128000, when it should have been 0, the id assigned to <|begin_of_text|> in the new tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")

# Tokenize a short Amharic phrase; the first id should be the new BOS id (0)
text = "ሁሉም ነገር"
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])

Output:

tensor([[128000,   1704,    802]])
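To see where the 128000 comes from, it may help to compare the id the tokenizer object reports for the BOS token with what is stored in the serialized post-processor of the fast tokenizer, since for Llama-3-style tokenizers the BOS id added during encoding typically comes from that template rather than from tokenizer_config.json. A minimal diagnostic sketch (assuming the fast, Rust-backed tokenizer is loaded):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")

# Id the tokenizer object reports for the BOS token
print(tokenizer.bos_token, tokenizer.bos_token_id)

# Id actually stored in the vocabulary for the same string
print(tokenizer.convert_tokens_to_ids("<|begin_of_text|>"))

# Post-processor serialized inside tokenizer.json; for Llama-3-style
# tokenizers the BOS id is written into this template
post_processor = json.loads(tokenizer.backend_tokenizer.to_str())["post_processor"]
print(json.dumps(post_processor, indent=2))

If the template printed at the end still contains 128000 for <|begin_of_text|>, that would explain why the encoded output starts with that id even though the new tokenizer_config.json maps the token to 0.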
