Tokenizer Issue

#2
by Saptarshi7 - opened

Hi,

I'm trying to use LUKE on SQuAD. So, when I try to process the training examples with the tokenizer, I get this error:

NotImplementedError: return_offset_mapping is not available when using Python tokenizers. To use this feature, change your tokenizer to one deriving from transformers.PreTrainedTokenizerFast.

Could you tell me what the issue might be? I am running the latest version of Hugging Face Transformers.

Studio Ousia org
β€’
edited Feb 24, 2023

The problem stems from the fact that a fast tokenizer is not implemented for LukeTokenizer, so it cannot be used with return_offsets_mapping=True.
As a workaround, you can use roberta-base for tokenizing text: it has a fast tokenizer implementation and shares the same subword vocabulary as luke-base.

Example:

>>> from transformers import AutoTokenizer
>>> model_name = "roberta-base"
>>> tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
>>> tokenizer.encode_plus("this is test", return_offsets_mapping=True)
{'input_ids': [0, 9226, 16, 1296, 2], 'attention_mask': [1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 4), (5, 7), (8, 12), (0, 0)]}
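For SQuAD-style preprocessing, the point of the offset mapping is to convert an answer's character span into token indices. A minimal sketch (not LUKE-specific; `char_span_to_token_span` is a hypothetical helper, shown here with the offsets from the output above):

```python
def char_span_to_token_span(offset_mapping, char_start, char_end):
    """Return the (first, last) token indices overlapping a character span."""
    token_start = token_end = None
    for i, (start, end) in enumerate(offset_mapping):
        # Special tokens like <s> and </s> are mapped to (0, 0); skip them.
        if start == end == 0:
            continue
        # A token overlaps the span if their character ranges intersect.
        if start < char_end and end > char_start:
            if token_start is None:
                token_start = i
            token_end = i
    return token_start, token_end

# Offsets produced by the roberta-base fast tokenizer for "this is test".
offsets = [(0, 0), (0, 4), (5, 7), (8, 12), (0, 0)]

# The answer "test" occupies characters 8..12 -> token index 3.
print(char_span_to_token_span(offsets, 8, 12))  # (3, 3)
```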

Yes, thank you. This worked!

Saptarshi7 changed discussion status to closed
