
AI Model Name: Llama 3 8B "Built with Meta Llama 3" https://llama.meta.com/llama3/license/

This is the result of running AutoAWQ to quantize the Llama 3 8B model weights to ~4 bits/parameter.
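As a rough illustration of what "~4 bits/parameter" means for memory, here is a back-of-the-envelope sketch. It assumes that "q128" and "w4" in the repo name stand for a quantization group size of 128 and 4-bit weights, and that each group stores one FP16 scale and one 4-bit zero point; none of this is stated in the card itself. The 8e9 parameter count is a round figure, and layers that are typically left unquantized (such as embeddings) are ignored, so the real checkpoint will be somewhat larger.

```python
def awq_bits_per_weight(w_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per quantized weight, including per-group metadata.

    All defaults are assumptions inferred from the repo name, not documented values.
    """
    return w_bits + (scale_bits + zero_bits) / group_size


def weight_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB for n_params weights."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 8e9  # round figure for an "8B" model (assumption)

fp16_gib = weight_gib(N_PARAMS, 16)
awq_gib = weight_gib(N_PARAMS, awq_bits_per_weight())
print(f"FP16: {fp16_gib:.1f} GiB, 4-bit AWQ: {awq_gib:.1f} GiB")
```

Under these assumptions the per-group metadata adds about 0.16 bits per weight, which is why the figure is quoted as approximately rather than exactly 4 bits.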

To launch an OpenAI-compatible API endpoint on your Linux server:

git lfs install
git clone https://huggingface.co/catid/cat-llama-3-8b-awq-q128-w4-gemm

conda create -n vllm8 python=3.10 -y && conda activate vllm8

pip install -U git+https://github.com/vllm-project/vllm.git@a134ef6

python -m vllm.entrypoints.openai.api_server --model cat-llama-3-8b-awq-q128-w4-gemm

To use 2 GPUs, add --tensor-parallel-size 2 --gpu-memory-utilization 0.95:

python -m vllm.entrypoints.openai.api_server --model cat-llama-3-8b-awq-q128-w4-gemm --tensor-parallel-size 2 --gpu-memory-utilization 0.95
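Once the server is up, it can be queried like any OpenAI-compatible endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is listening on localhost port 8000 (the vLLM default when no --host/--port is given) and that the served model name equals the path passed to --model; adjust BASE_URL and MODEL if your setup differs. The helper names are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumptions: vLLM default port, and model name equal to the --model argument.
BASE_URL = "http://localhost:8000/v1"
MODEL = "cat-llama-3-8b-awq-q128-w4-gemm"


def build_completion_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the /v1/completions route."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to the running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("The capital of France is"))` should print the model's continuation. The same request works unchanged for the single-GPU and 2-GPU launch commands, since tensor parallelism is transparent to clients.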

My personal TextWorld common-sense reasoning benchmark (https://github.com/catid/textworld_llm_benchmark) results for this model:

cat-llama-3-8b-awq-q128-w4-gemm : Average Score: 2.02 ± 0.29
Mixtral 8x7B : Average Score: 2.22 ± 0.33
GPT 3.5 : Average Score: 2.8 ± 1.69

The quantized 8B model scores within the stated error range of Mixtral 8x7B, which is very respectable for a relatively small model!
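To make the comparison above concrete, here is a small sketch that checks whether the reported score ranges overlap. It treats each "±" value as a symmetric interval around the mean; the card does not say whether this is a standard deviation, standard error, or confidence interval, so overlap here is only an informal indication, not a significance test.

```python
# Scores copied from the results above as (mean, plus-or-minus value).
scores = {
    "cat-llama-3-8b-awq-q128-w4-gemm": (2.02, 0.29),
    "Mixtral 8x7B": (2.22, 0.33),
    "GPT 3.5": (2.8, 1.69),
}


def interval(mean, err):
    """Return the (low, high) range implied by mean ± err."""
    return (mean - err, mean + err)


def overlaps(a, b):
    """True if the two mean ± err ranges share at least one point."""
    lo_a, hi_a = interval(*a)
    lo_b, hi_b = interval(*b)
    return lo_a <= hi_b and lo_b <= hi_a


awq = scores["cat-llama-3-8b-awq-q128-w4-gemm"]
for name, score in scores.items():
    if score is not awq:
        print(f"{name}: overlaps with AWQ model = {overlaps(awq, score)}")
```

Both comparisons report an overlap, which is consistent with reading the quantized model as roughly on par with Mixtral 8x7B on this benchmark, while the wide GPT 3.5 range makes that comparison less informative.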
