davidxmle committed
Commit
fd499b3
Parent: 071ea51

Update README.md

Files changed (1)
  1. README.md +6 -0
README.md CHANGED
@@ -62,6 +62,10 @@ datasets:
 
 This repo contains 8 Bit quantized GPTQ model files for [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
 
+This model can be loaded with just over 10GB of VRAM and can be served quickly on some of the cheapest Nvidia GPUs available (Nvidia T4, Nvidia K80, RTX 4070, etc.).
+
+The 8-bit GPTQ quant has minimal quality degradation from the original `bfloat16` model due to its higher bitrate.
+
 <!-- description end -->
 
 ## GPTQ Quantization Method
@@ -74,6 +78,8 @@ This repo contains 8 Bit quantized GPTQ model files for [meta-llama/Meta-Llama-3
 | More variants to come | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD | May upload additional variants of GPTQ 8 bit models in the future using different parameters such as 128g group size and etc. |
 
 ## Serving this GPTQ model using vLLM
+Tested serving this model via vLLM using an Nvidia T4 (16GB VRAM).
+
 Tested with the below command
 ```
 python -m vllm.entrypoints.openai.api_server --model Llama-3-8B-Instruct-GPTQ-8-Bit --port 8123 --max-model-len 8192 --dtype float16
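
vLLM's `api_server` exposes an OpenAI-compatible API, so the served model can be queried with the `openai` Python client. The sketch below is illustrative only: it assumes the serving command above is running locally on port 8123, and the model name mirrors the `--model` value from that command.

```
# Minimal sketch: query the vLLM OpenAI-compatible server started by the command above.
# Assumes `pip install openai` (v1+); vLLM does not validate the API key, so any placeholder works.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8123/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-GPTQ-8-Bit",  # must match the --model value passed to vLLM
    messages=[{"role": "user", "content": "Summarize GPTQ quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```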