
Extremely slow inference

#9
by TZ20 - opened

Hi, I'm loading this model from Hugging Face using 4-bit quantization. I'm using 4 T4 GPUs:

import torch
from transformers import LlamaForCausalLM

# Load the 13B model in 4-bit, sharded across the available GPUs
model = LlamaForCausalLM.from_pretrained(
    'Open-Orca/OpenOrca-Platypus2-13B',
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map='auto',
)

However, when I call model.generate, it is extremely slow compared to the base Llama-2-13b-chat model. For example, where the original Llama-2 model might take 2 minutes, this one takes 30 minutes.
Any reason for this?
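
For reference, roughly what my generation call looks like (the prompt and max_new_tokens here are just placeholders, not my actual inputs):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Open-Orca/OpenOrca-Platypus2-13B')
inputs = tokenizer("Hello, how are you?", return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))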

OpenOrca org

Try replacing your current configs with the updated config.json and generation_config.json. Looks like the cache was disabled, which usually leads to extreme slowdowns.
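
If you want to confirm it before pulling the new files, here is a minimal sketch (assuming the standard transformers use_cache attribute on the loaded config objects) of how to check and re-enable the cache in place:

# Assuming `model` was loaded as in the snippet above; check whether the
# KV cache ended up disabled in the loaded configs:
print(model.config.use_cache)
print(model.generation_config.use_cache)

# Either replace the config files or re-enable the cache in place:
model.config.use_cache = True
model.generation_config.use_cache = True

# use_cache can also be passed per call:
# outputs = model.generate(**inputs, use_cache=True, max_new_tokens=256)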

Thanks, seemed to do the trick

TZ20 changed discussion status to closed
