---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
inference: false
model_creator: astronomer-io
model_name: Meta-Llama-3-8B-Instruct
model_type: llama
pipeline_tag: text-generation
prompt_template: >-
  {% set loop_messages = messages %}{% for message in loop_messages %}{% set
  content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>


  '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set
  content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if
  add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>


  ' }}{% endif %}
quantized_by: davidxmle
license: other
license_name: llama-3-community-license
license_link: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/LICENSE
tags:
  - llama
  - llama-3
  - facebook
  - meta
  - astronomer
  - gptq
  - pretrained
datasets:
  - wikitext
---
This model is generously created and made open source by Astronomer.

# Important Note Regarding a Known Bug in Llama 3

Two files have been modified to address a known issue where Llama 3 models keep generating additional tokens non-stop until they hit the max token limit.
- `generation_config.json`'s `eos_token_id` has been modified to add the second EOS token that Llama 3 uses.
- `tokenizer_config.json`'s `chat_template` has been modified to only append the generation prompt tokens to the end of a prompt when `add_generation_prompt` is selected.
- When loading this model onto vLLM, make sure all requests include `"stop_token_ids": [128001, 128009]` to temporarily work around the non-stop generation issue (see the sketch after this list).
  - vLLM does not yet respect `generation_config.json`.
  - The vLLM team is working on a fix: https://github.com/vllm-project/vllm/issues/4180
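For illustration, here is a minimal sketch of the same workaround when using vLLM's offline Python API instead of the API server. The local model path and the prompt are assumptions; `stop_token_ids` carries both Llama 3 EOS tokens.

```python
# Hypothetical sketch: offline vLLM inference with both Llama 3 EOS tokens
# passed explicitly, since vLLM does not yet read generation_config.json.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Llama-3-8B-Instruct-GPTQ-8-Bit"  # assumed local path or repo id

tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model=model_path, max_model_len=8192, dtype="float16")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who created Llama 3?"},
]
# The patched chat_template only appends the assistant header because
# add_generation_prompt=True is passed here.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling = SamplingParams(max_tokens=2000, stop_token_ids=[128001, 128009])
print(llm.generate([prompt], sampling)[0].outputs[0].text)
```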

# Llama-3-8B-Instruct-GPTQ-8-Bit

## Description

This repo contains 8 Bit quantized GPTQ model files for meta-llama/Meta-Llama-3-8B-Instruct.
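As a rough usage sketch (not an official recipe from this repo), the quantized weights can be loaded with `transformers`, which requires a GPTQ backend such as `auto-gptq` plus `optimum` to be installed. The repo id below is an assumption, and both Llama 3 EOS tokens are passed explicitly because of the known bug described above.

```python
# Hypothetical sketch: loading the GPTQ weights with transformers
# (requires a GPTQ backend such as auto-gptq + optimum to be installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who created Llama 3?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Stop on either of Llama 3's EOS tokens (see the note on the known bug above).
output = model.generate(input_ids, max_new_tokens=512, eos_token_id=[128001, 128009])
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```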

## GPTQ Quantization Method

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | VRAM Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | --------- | ------- | ---- |
| main | 8 | 32 | Yes | 0.1 | wikitext | 8192 | 9.09 GB | No | 8-bit, with Act Order and group size 32g. Minimum accuracy loss with decent VRAM usage reduction. |
| More variants to come | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD | Additional GPTQ 8-bit variants may be uploaded in the future using different parameters, such as a 128g group size. |
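For reference, here is a hedged sketch of how the `main` branch parameters above (8 bits, group size 32, act order, 0.1 damp, wikitext calibration) map onto a `transformers` `GPTQConfig`. This is not necessarily the exact script used to produce this repo.

```python
# Hypothetical sketch of a quantization run matching the "main" branch
# parameters above; not necessarily the exact procedure used for this repo.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

quant_config = GPTQConfig(
    bits=8,               # 8-bit quantization
    group_size=32,        # GS = 32
    desc_act=True,        # Act Order = Yes
    damp_percent=0.1,     # Damp % = 0.1
    dataset="wikitext2",  # wikitext calibration data
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model, device_map="auto", quantization_config=quant_config
)
model.save_pretrained("Llama-3-8B-Instruct-GPTQ-8-Bit")
tokenizer.save_pretrained("Llama-3-8B-Instruct-GPTQ-8-Bit")
```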

## Serving this GPTQ model using vLLM

Tested with the command below:

```shell
python -m vllm.entrypoints.openai.api_server --model Llama-3-8B-Instruct-GPTQ-8-Bit --port 8123 --max-model-len 8192 --dtype float16
```

To work around the non-stop token generation bug, make sure to send requests with `"stop_token_ids": [128001, 128009]` to the vLLM endpoint. Example:

```json
{
    "model": "Llama-3-8B-Instruct-GPTQ-8-Bit",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who created Llama 3?"}
    ],
    "max_tokens": 2000,
    "stop_token_ids": [128001, 128009]
}
```
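Equivalently, here is a minimal Python sketch using the `openai` client against the server started above. The base URL matches the `--port 8123` example, the API key is a placeholder, and `stop_token_ids` is passed through the client's `extra_body` mechanism so it ends up in the JSON request body shown above.

```python
# Hypothetical sketch: querying the vLLM OpenAI-compatible endpoint from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8123/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-GPTQ-8-Bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who created Llama 3?"},
    ],
    max_tokens=2000,
    # vLLM-specific field, forwarded in the JSON request body as shown above.
    extra_body={"stop_token_ids": [128001, 128009]},
)
print(response.choices[0].message.content)
```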