---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
inference: false
model_creator: astronomer-io
model_name: Meta-Llama-3-8B-Instruct
model_type: llama
pipeline_tag: text-generation
prompt_template: >-
  {% set loop_messages = messages %}{% for message in loop_messages %}{% set
  content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>


  '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set
  content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if
  add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>


  ' }}{% endif %}
quantized_by: davidxmle
license: other
license_name: llama-3-community-license
license_link: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/LICENSE
tags:
  - llama
  - llama-3
  - facebook
  - meta
  - astronomer
  - gptq
  - pretrained
datasets:
  - wikitext
---
This model is generously created and made open source by Astronomer.

# Important Note Regarding a Known Bug in Llama 3

Two files have been modified to address a known issue where Llama 3 models keep generating additional tokens non-stop until they hit the max token limit.
- `generation_config.json`'s `eos_token_id` has been modified to add the second EOS token that Llama 3 uses.
- `tokenizer_config.json`'s `chat_template` has been modified to only append the generation prompt tokens to the end of a prompt when `add_generation_prompt` is selected.
- When loading this model onto vLLM, make sure all requests include `"stop_token_ids": [128001, 128009]` to temporarily work around the non-stop generation issue (see the sketch after this list).
  - vLLM does not yet respect `generation_config.json`.
  - The vLLM team is working on a fix: https://github.com/vllm-project/vllm/issues/4180
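For illustration, here is a minimal sketch of the same workaround when using vLLM's offline Python API instead of the API server. The local model path and the prompt are assumptions; `stop_token_ids` carries both Llama 3 EOS tokens.

```python
# Hypothetical sketch: offline vLLM inference with both Llama 3 EOS tokens
# passed explicitly, since vLLM does not yet read generation_config.json.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Llama-3-8B-Instruct-GPTQ-8-Bit"  # assumed local path or repo id

tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model=model_path, max_model_len=8192, dtype="float16")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who created Llama 3?"},
]
# The patched chat_template only appends the assistant header because
# add_generation_prompt=True is passed here.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling = SamplingParams(max_tokens=2000, stop_token_ids=[128001, 128009])
print(llm.generate([prompt], sampling)[0].outputs[0].text)
```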

# Llama-3-8B-Instruct-GPTQ-8-Bit

## Description

This repo contains 8 Bit quantized GPTQ model files for meta-llama/Meta-Llama-3-8B-Instruct.
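As a rough usage sketch (not an official recipe from this repo), the quantized weights can be loaded with `transformers`, which requires a GPTQ backend such as `auto-gptq` plus `optimum` to be installed. The repo id below is an assumption, and both Llama 3 EOS tokens are passed explicitly because of the known bug described above.

```python
# Hypothetical sketch: loading the GPTQ weights with transformers
# (requires a GPTQ backend such as auto-gptq + optimum to be installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who created Llama 3?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Stop on either of Llama 3's EOS tokens (see the note on the known bug above).
output = model.generate(input_ids, max_new_tokens=512, eos_token_id=[128001, 128009])
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```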

## GPTQ Quantization Method

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | VRAM Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | --------- | ------- | ---- |
| main | 8 | 32 | Yes | 0.1 | wikitext | 8192 | 9.09 GB | No | 8-bit, with Act Order and group size 32g. Minimum accuracy loss with decent VRAM usage reduction. |
| More variants to come | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD | Additional GPTQ 8-bit variants may be uploaded in the future using different parameters, such as a 128g group size. |
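For reference, here is a hedged sketch of how the `main` branch parameters above (8 bits, group size 32, act order, 0.1 damp, wikitext calibration) map onto a `transformers` `GPTQConfig`. This is not necessarily the exact script used to produce this repo.

```python
# Hypothetical sketch of a quantization run matching the "main" branch
# parameters above; not necessarily the exact procedure used for this repo.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

quant_config = GPTQConfig(
    bits=8,               # 8-bit quantization
    group_size=32,        # GS = 32
    desc_act=True,        # Act Order = Yes
    damp_percent=0.1,     # Damp % = 0.1
    dataset="wikitext2",  # wikitext calibration data
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model, device_map="auto", quantization_config=quant_config
)
model.save_pretrained("Llama-3-8B-Instruct-GPTQ-8-Bit")
tokenizer.save_pretrained("Llama-3-8B-Instruct-GPTQ-8-Bit")
```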

## Serving this GPTQ model using vLLM

Tested with the command below:

```shell
python -m vllm.entrypoints.openai.api_server --model Llama-3-8B-Instruct-GPTQ-8-Bit --port 8123 --max-model-len 8192 --dtype float16
```

To work around the non-stop token generation bug, make sure to send requests with `"stop_token_ids": [128001, 128009]` to the vLLM endpoint. Example:

```json
{
    "model": "Llama-3-8B-Instruct-GPTQ-8-Bit",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who created Llama 3?"}
    ],
    "max_tokens": 2000,
    "stop_token_ids": [128001, 128009]
}
```
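Equivalently, here is a minimal Python sketch using the `openai` client against the server started above. The base URL matches the `--port 8123` example, the API key is a placeholder, and `stop_token_ids` is passed through the client's `extra_body` mechanism so it ends up in the JSON request body shown above.

```python
# Hypothetical sketch: querying the vLLM OpenAI-compatible endpoint from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8123/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-GPTQ-8-Bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who created Llama 3?"},
    ],
    max_tokens=2000,
    # vLLM-specific field, forwarded in the JSON request body as shown above.
    extra_body={"stop_token_ids": [128001, 128009]},
)
print(response.choices[0].message.content)
```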