Problem with VRAM usage with a quantized model

#5
by mlaszlo - opened

Hello,

When I run inference with this model (the 1B one) loaded in 4-bit from Hugging Face, I don't see any speed improvement, and when I run nvidia-smi I can see my VRAM usage go from 1 GB to 4 GB while running inference on just one image. I'm running inference on a Tesla T4 without FlashAttention, since it doesn't support Turing GPUs. I don't understand why the memory goes up this way.
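
For reference, a minimal sketch of the kind of 4-bit loading described above, assuming the model is loaded through transformers + bitsandbytes (the model id and generation details are assumptions, substitute your actual checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute still runs in fp16
)

model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL2-1B",        # assumed model id
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL2-1B", trust_remote_code=True
)

# Weights are stored in 4-bit, but activations, image features and the
# KV cache are still allocated in fp16 during inference, which accounts
# for the extra VRAM reported by nvidia-smi.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```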
Many thanks in advance

OpenGVLab org

Hello,

4-bit quantization primarily reduces the VRAM needed to store the model weights; it usually doesn't make inference faster, and the extra memory you see during inference mostly comes from activations, image features, and cached states, which still run in half precision. If you're looking for better speed and memory efficiency, I recommend trying the lmdeploy inference framework. It could offer more efficient performance on your setup.
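
A minimal sketch of the lmdeploy path, assuming the OpenGVLab/InternVL2-1B checkpoint (the exact model id is an assumption) and a recent lmdeploy release; adjust the model id, image path, and backend config to your setup:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Build a vision-language pipeline around the quantizable checkpoint.
pipe = pipeline(
    "OpenGVLab/InternVL2-1B",                        # assumed model id
    backend_config=TurbomindEngineConfig(session_len=8192),
)

image = load_image("path/to/your/image.jpg")         # placeholder path
response = pipe(("Describe this image.", image))
print(response.text)
```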

I hope this helps!

czczup changed discussion status to closed
