Molmo-7B-GPTQ-4bit 🚀

Overview

The Molmo-7B-GPTQ-4bit model is a transformer-based vision-language model that has been quantized to 4-bit precision for efficient deployment. Despite the "GPTQ" in its name, it was quantized with bitsandbytes rather than AutoGPTQ, which does not natively support this model format as of now. The quantization uses BitsAndBytesConfig from the transformers library, which reduces GPU memory usage at inference time.

Model Information

Model Name: Molmo-7B-GPTQ-4bit
Base Model: allenai/Molmo-7B-D-0924
Quantization: 4-bit quantization using bitsandbytes instead of AutoGPTQ
Repository URL: zamal/Molmo-7B-GPTQ-4bit

Technical Details

This model is quantized using bitsandbytes (not AutoGPTQ), because AutoGPTQ does not provide NF4 4-bit quantization. This approach allows 4-bit inference with reduced memory overhead and only a small loss in output quality.

Key Quantization Configurations:

bnb_4bit_use_double_quant: Enabled. The quantization constants are themselves quantized, which saves additional memory.
bnb_4bit_quant_type: NF4 (4-bit NormalFloat), a data type designed for weights that are approximately normally distributed.
bnb_4bit_compute_dtype: FP16 (float16). Weights are stored in 4-bit and dequantized to float16 for computation on the GPU.
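
As a minimal sketch, the three settings above could be passed to BitsAndBytesConfig as follows. The load_in_4bit flag and the helper name build_quant_config are assumptions added for illustration; they are not taken from this model card, and the exact configuration used to produce this checkpoint may differ.

```python
# The settings listed above, kept as a plain dict so they can be inspected
# without importing torch or transformers.
QUANT_SETTINGS = {
    "load_in_4bit": True,                 # assumption: standard flag for 4-bit loading
    "bnb_4bit_use_double_quant": True,    # quantize the quantization constants too
    "bnb_4bit_quant_type": "nf4",         # 4-bit NormalFloat
    "bnb_4bit_compute_dtype": "float16",  # resolved to torch.float16 below
}


def build_quant_config():
    """Hypothetical helper: build a BitsAndBytesConfig from QUANT_SETTINGS.

    Requires torch, transformers and bitsandbytes to be installed.
    """
    import torch
    from transformers import BitsAndBytesConfig

    settings = dict(QUANT_SETTINGS)
    # Turn the dtype name into the actual torch dtype object.
    settings["bnb_4bit_compute_dtype"] = getattr(
        torch, settings["bnb_4bit_compute_dtype"]
    )
    return BitsAndBytesConfig(**settings)
```

The resulting object is typically passed as quantization_config to a from_pretrained call.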

Device Compatibility:

With the device_map="auto" parameter, transformers (through accelerate) automatically places the model's layers on the available GPUs.
4-bit models are ideal for GPUs with limited VRAM, allowing inference on larger models without exceeding hardware memory limits.
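
To give a rough sense of the savings, the sketch below estimates the memory needed for the weights alone. The parameter count of 7 billion is a nominal figure taken from the model name, and the per-parameter overhead values (block size 64, about 0.5 extra bits without double quantization and about 0.127 with it) are assumptions based on the usual NF4 setup. Activations, the KV cache and any layers left unquantized are ignored, so real usage will be higher.

```python
def weight_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: nominal size implied by "7B"

fp16_gb = weight_gb(N_PARAMS, 16)               # full float16 weights
nf4_gb = weight_gb(N_PARAMS, 4 + 0.5)           # NF4 plus 32-bit constants per 64-weight block
nf4_dq_gb = weight_gb(N_PARAMS, 4 + 0.127)      # NF4 with double quantization

print(f"float16:            {fp16_gb:.2f} GB")
print(f"NF4:                {nf4_gb:.2f} GB")
print(f"NF4 + double quant: {nf4_dq_gb:.2f} GB")
```

Under these assumptions the quantized weights take roughly a quarter of the float16 footprint.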

Limitations

Precision Loss: 4-bit quantization trades some numerical precision for efficiency, which may slightly reduce output quality compared to the original full-precision model.
AutoGPTQ Limitation: As mentioned, AutoGPTQ does not natively support this kind of quantization, and this has been achieved through bitsandbytes and the transformers library.

Usage

Installation

Make sure you have the necessary dependencies installed:

pip install transformers torch Pillow torchvision einops accelerate tensorflow bitsandbytes
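
After installation, a quick check like the following can confirm that the packages are importable. This is only a sketch: the helper name missing_modules is hypothetical, and the module names are the usual import names for the packages above (for example, Pillow is imported as PIL).

```python
import importlib.util

# Import names corresponding to the pip packages listed above.
REQUIRED_MODULES = [
    "transformers",
    "torch",
    "PIL",          # installed by the Pillow package
    "torchvision",
    "einops",
    "accelerate",
    "tensorflow",
    "bitsandbytes",
]


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```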