davidxmle committed
Commit
fd499b3
Parent: 071ea51

Update README.md

Files changed (1)
  1. README.md +6 -0
README.md CHANGED
@@ -62,6 +62,10 @@ datasets:
 
 This repo contains 8 Bit quantized GPTQ model files for [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
 
+This model can be loaded with just over 10GB of VRAM and can be served quickly on some of the cheapest Nvidia GPUs available (Nvidia T4, Nvidia K80, RTX 4070, etc.).
+
+The 8-bit GPTQ quant has minimal quality degradation from the original `bfloat16` model due to its higher bitrate.
+
 <!-- description end -->
 
 ## GPTQ Quantization Method
@@ -74,6 +78,8 @@ This repo contains 8 Bit quantized GPTQ model files for [meta-llama/Meta-Llama-3
 | More variants to come | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD | May upload additional variants of GPTQ 8 bit models in the future using different parameters such as 128g group size and etc. |
 
 ## Serving this GPTQ model using vLLM
+Tested serving this model via vLLM using an Nvidia T4 (16GB VRAM).
+
 Tested with the below command
 ```
 python -m vllm.entrypoints.openai.api_server --model Llama-3-8B-Instruct-GPTQ-8-Bit --port 8123 --max-model-len 8192 --dtype float16
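
vLLM's `api_server` exposes an OpenAI-compatible API, so the served model can be queried with the `openai` Python client. The sketch below is illustrative only: it assumes the serving command above is running locally on port 8123, and the model name mirrors the `--model` value from that command.

```
# Minimal sketch: query the vLLM OpenAI-compatible server started by the command above.
# Assumes `pip install openai` (v1+); vLLM does not validate the API key, so any placeholder works.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8123/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-GPTQ-8-Bit",  # must match the --model value passed to vLLM
    messages=[{"role": "user", "content": "Summarize GPTQ quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```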