
Extremely slow inference

#9
by TZ20 - opened

Hi, I'm loading this model from Hugging Face using 4-bit quantization. I'm using 4 T4 GPUs:

import torch
from transformers import LlamaForCausalLM

# Load the 13B model in 4-bit, sharded across the available GPUs
model = LlamaForCausalLM.from_pretrained(
    'Open-Orca/OpenOrca-Platypus2-13B',
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map='auto',
)

However, when I call model.generate, it is extremely slow compared to the base Llama-2-13b-chat model. For example, where the original Llama-2 model might take 2 minutes, this one takes 30 minutes.
Any reason for this?
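
For reference, roughly what my generation call looks like (the prompt and max_new_tokens here are just placeholders, not my actual inputs):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Open-Orca/OpenOrca-Platypus2-13B')
inputs = tokenizer("Hello, how are you?", return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))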

OpenOrca org

Try replacing your current configs with the updated config.json and generation_config.json. Looks like the cache was disabled, which usually leads to extreme slowdowns.
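
If you want to confirm it before pulling the new files, here is a minimal sketch (assuming the standard transformers use_cache attribute on the loaded config objects) of how to check and re-enable the cache in place:

# Assuming `model` was loaded as in the snippet above; check whether the
# KV cache ended up disabled in the loaded configs:
print(model.config.use_cache)
print(model.generation_config.use_cache)

# Either replace the config files or re-enable the cache in place:
model.config.use_cache = True
model.generation_config.use_cache = True

# use_cache can also be passed per call:
# outputs = model.generate(**inputs, use_cache=True, max_new_tokens=256)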

Thanks, seemed to do the trick

TZ20 changed discussion status to closed
