Problem with VRAM usage with a quantized model

#5
by mlaszlo - opened

Hello,

When I run inference with this model (the 1B one) loaded in 4-bit from Hugging Face, I don't see any speed improvement, and when I run nvidia-smi I can see my VRAM usage go from 1 GB to 4 GB while running inference on just one image. I'm running inference on a Tesla T4 without FlashAttention, since it doesn't support Turing GPUs. I don't understand why the memory goes up this way.
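
For reference, a minimal sketch of the kind of 4-bit loading described above, assuming the model is loaded through transformers + bitsandbytes (the model id and generation details are assumptions, substitute your actual checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute still runs in fp16
)

model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL2-1B",        # assumed model id
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL2-1B", trust_remote_code=True
)

# Weights are stored in 4-bit, but activations, image features and the
# KV cache are still allocated in fp16 during inference, which accounts
# for the extra VRAM reported by nvidia-smi.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```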
Many thanks in advance

OpenGVLab org

Hello,

4-bit quantization primarily reduces the VRAM needed to store the model weights; it usually doesn't make inference faster, and the extra memory you see during inference mostly comes from activations, image features, and cached states, which still run in half precision. If you're looking for better speed and memory efficiency, I recommend trying the lmdeploy inference framework. It could offer more efficient performance on your setup.
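
A minimal sketch of the lmdeploy path, assuming the OpenGVLab/InternVL2-1B checkpoint (the exact model id is an assumption) and a recent lmdeploy release; adjust the model id, image path, and backend config to your setup:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Build a vision-language pipeline around the quantizable checkpoint.
pipe = pipeline(
    "OpenGVLab/InternVL2-1B",                        # assumed model id
    backend_config=TurbomindEngineConfig(session_len=8192),
)

image = load_image("path/to/your/image.jpg")         # placeholder path
response = pipe(("Describe this image.", image))
print(response.text)
```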

I hope this helps!

czczup changed discussion status to closed
