davidxmle committed on
Commit fc35a4d
1 Parent(s): aa09b14

Update README.md

Files changed (1)
  1. README.md +45 -10
README.md CHANGED
@@ -31,13 +31,29 @@ tags:
  datasets:
  - wikitext
  ---
-
- # Important Note
- - Two files are modified to address a current issue regarding Llama-3s keeps on generating additional tokens non-stop until hitting max token limit.
- - `generation_config.json`'s `eos_token_id` have been modified to add the other EOS token that Llama-3 uses
- - `tokenizer_config.json`'s `chat_template` has been modified to only add start generation token at the end of a prompt if `add_generation_prompt` is selected
-
-

  # Llama-3-8B-Instruct-GPTQ-8-Bit
  - Model creator: [Meta Llama from Meta](https://huggingface.co/meta-llama)
@@ -52,7 +68,26 @@ This repo contains 8 Bit quantized GPTQ model files for [meta-llama/Meta-Llama-3
  <!-- description end -->

  ## GPTQ Quantization Method
- | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
  | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
- | [main](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ/tree/main) | 8 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 32768 | 4.16 GB | No | 8-bit, with Act Order and group size 32g. Minimum accuracy loss with decent VRAM usage reduction. |
- | More variants to come | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD | May upload additional variants of GPTQ 8 bit models in the future using different parameters such as 128g group size and etc. |
  datasets:
  - wikitext
  ---
+ <!-- header start -->
+ <!-- 200823 -->
+ <div style="width: auto; margin-left: auto; margin-right: auto">
+ <img src="https://www.astronomer.io/logo/astronomer-logo-RGB-standard-1200px.png" alt="Astronomer" style="width: 60%; min-width: 400px; display: block; margin: auto;">
+ </div>
+ <div style="display: flex; justify-content: space-between; width: 100%;">
+ <div style="display: flex; flex-direction: column; align-items: flex-start;">
+ <p style="margin-top: 0.5em; margin-bottom: 0em;"></p>
+ </div>
+ <div style="display: flex; flex-direction: column; align-items: flex-end;">
+ <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://www.linkedin.com/in/david-xue-uva/">Quantized by David Xue, ML Engineer @ Astronomer</a></p>
+ </div>
+ </div>
+ <div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">This model is generously created and made open source by <a href="https://astronomer.io">Astronomer</a></p></div>
+ <hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
+ <!-- header end -->
+ # Important Note Regarding a Known Bug in Llama 3
+ - Two files have been modified to address a current issue where Llama 3 models keep generating additional tokens non-stop until hitting the max token limit.
+ - `generation_config.json`'s `eos_token_id` has been modified to add the other EOS token that Llama 3 uses (see the sketch after this list).
+ - `tokenizer_config.json`'s `chat_template` has been modified to only add the start-of-generation token at the end of a prompt if `add_generation_prompt` is selected.
+ - When serving this model with vLLM, make sure all requests include `"stop_token_ids": [128001, 128009]` to temporarily address the non-stop generation issue.
+ - vLLM does not yet respect `generation_config.json`.
+ - The vLLM team is working on a fix: https://github.com/vllm-project/vllm/issues/4180
 
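As a rough illustration of the fix described in the note above (not part of this commit), here is a minimal sketch of loading the quantized model with `transformers` and passing both Llama 3 EOS token ids when generating. The repo id and the token ids `128001`/`128009` come from this README; the installed packages and hardware are assumptions.

```
# Minimal sketch (assumption: transformers, optimum, and auto-gptq are installed
# and a CUDA GPU is available). Loads the GPTQ model and stops generation on both
# Llama 3 EOS tokens, mirroring the modified generation_config.json.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "astronomer-io/Llama-3-8B-Instruct-GPTQ-8-Bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who created Llama 3?"},
]
# The patched chat_template only appends the assistant header when
# add_generation_prompt=True is passed.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Pass both EOS token ids (128001 and 128009) so generation stops correctly.
outputs = model.generate(input_ids, max_new_tokens=256, eos_token_id=[128001, 128009])
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```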
  # Llama-3-8B-Instruct-GPTQ-8-Bit
  - Model creator: [Meta Llama from Meta](https://huggingface.co/meta-llama)
 
  <!-- description end -->

  ## GPTQ Quantization Method
+ | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | VRAM Size | ExLlama | Desc |
  | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
+ | [main](https://huggingface.co/astronomer-io/Llama-3-8B-Instruct-GPTQ-8-Bit/tree/main) | 8 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 8192 | 9.09 GB | No | 8-bit, with Act Order and group size 32g. Minimal accuracy loss with a decent reduction in VRAM usage. |
+ | More variants to come | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD | Additional 8-bit GPTQ variants using different parameters, such as a 128g group size, may be uploaded in the future. |
+
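For reference, a quantization with the parameters listed in the table above could be reproduced roughly as sketched below using `transformers`' `GPTQConfig`. This is an illustrative sketch only, not the exact script used to produce this repo; the base model id and calibration settings are taken from the table or assumed.

```
# Illustrative sketch only: an 8-bit GPTQ quantization with group size 32, act order,
# and 0.1 damping, calibrated on wikitext, roughly matching the table above.
# Not the exact script used to create this repository.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

gptq_config = GPTQConfig(
    bits=8,               # Bits
    group_size=32,        # GS
    desc_act=True,        # Act Order = Yes
    damp_percent=0.1,     # Damp %
    dataset="wikitext2",  # GPTQ calibration dataset
    tokenizer=tokenizer,
)

# Quantization runs during from_pretrained when a GPTQConfig is supplied.
model = AutoModelForCausalLM.from_pretrained(
    base_model, device_map="auto", quantization_config=gptq_config
)
model.save_pretrained("Llama-3-8B-Instruct-GPTQ-8-Bit")
tokenizer.save_pretrained("Llama-3-8B-Instruct-GPTQ-8-Bit")
```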
+ ## Serving this GPTQ model using vLLM
+ Tested with the following command:
+ ```
+ python -m vllm.entrypoints.openai.api_server --model Llama-3-8B-Instruct-GPTQ-8-Bit --port 8123 --max-model-len 8192 --dtype float16
+ ```
+ For the non-stop token generation bug, make sure to send requests with `"stop_token_ids": [128001, 128009]` to the vLLM endpoint.
+ Example:
+ ```
+ {
+   "model": "Llama-3-8B-Instruct-GPTQ-8-Bit",
+   "messages": [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "Who created Llama 3?"}
+   ],
+   "max_tokens": 2000,
+   "stop_token_ids": [128001, 128009]
+ }
+ ```
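One way to send that request to the vLLM server started above is with the OpenAI Python client, as sketched below. The base URL follows the `--port 8123` flag from the serving command, `extra_body` is how vLLM's OpenAI-compatible server accepts the extra `stop_token_ids` field, and the API key value is a placeholder assumption.

```
# Sketch: call the vLLM OpenAI-compatible endpoint started above, passing
# stop_token_ids via extra_body so the non-stop generation workaround is applied.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8123/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-GPTQ-8-Bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who created Llama 3?"},
    ],
    max_tokens=2000,
    extra_body={"stop_token_ids": [128001, 128009]},
)
print(response.choices[0].message.content)
```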