---
library_name: transformers
license: llama3
language:
- ja
- en
---

# Llama-3-ELYZA-JP-8B-AWQ

![Llama-3-ELYZA-JP-8B-image](./key_visual.png)

## Model Description

**Llama-3-ELYZA-JP-8B** is a large language model trained by [ELYZA, Inc](https://elyza.ai/). Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), it has been enhanced for Japanese usage through additional pre-training and instruction tuning.

For more details, please refer to [our blog post](https://note.com/elyza/n/n360b6084fdbd).

## Quantization

We provide two quantized variants, GGUF and AWQ; this repository contains the [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) model. The table below shows the performance degradation caused by quantization, measured by GPT-4 scores on ELYZA-tasks-100.

| Model                             | ELYZA-tasks-100 GPT-4 score |
| :-------------------------------- | ---: |
| Llama-3-ELYZA-JP-8B               | 3.655 |
| Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M) | 3.57 |
| Llama-3-ELYZA-JP-8B-AWQ           | 3.39 |

## Use with vLLM

Install vLLM:

```bash
pip install vllm
```

### vLLM Offline Batched Inference

```python
from vllm import LLM, SamplingParams

llm = LLM(model="elyza/Llama-3-ELYZA-JP-8B-AWQ", quantization="awq")
tokenizer = llm.get_tokenizer()

# "You are a sincere and excellent Japanese assistant. Unless otherwise
# instructed, always respond in Japanese."
DEFAULT_SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=1000)

messages_batch = [
    [
        {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
        # "What are the key points to know when studying ancient Greece?"
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"}
    ],
    [
        {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
        # "Write a short story with the plot of a bear going to the seaside,
        # befriending a seal, and finally returning home."
        {"role": "user", "content": "クマが海辺に行ってアザラシと友達になり、最終的には家に帰るというプロットの短編小説を書いてください。"}
    ]
]

# Render each conversation into a prompt string using the model's chat template.
prompts = [
    tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    for messages in messages_batch
]

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    print(output.outputs[0].text)
    print("=" * 50)
```

### vLLM OpenAI-Compatible Server

Start the API server:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model elyza/Llama-3-ELYZA-JP-8B-AWQ \
    --port 8000 \
    --host localhost \
    --quantization awq
```

Call the API using curl:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "elyza/Llama-3-ELYZA-JP-8B-AWQ",
    "messages": [
        { "role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。" },
        { "role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?" }
    ],
    "temperature": 0.6,
    "max_tokens": 1000,
    "stream": false
    }'
```

Call the API using Python:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy_api_key",  # the local server does not validate the key
)

completion = client.chat.completions.create(
    model="elyza/Llama-3-ELYZA-JP-8B-AWQ",
    messages=[
        {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"},
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"}
    ]
)
print(completion.choices[0].message.content)
```
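## Use with Transformers

As an alternative to vLLM, the AWQ checkpoint can also be loaded directly with transformers. The snippet below is a minimal sketch, not part of the original card: it assumes the `autoawq` package is installed alongside a recent transformers release and that a CUDA GPU is available.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "elyza/Llama-3-ELYZA-JP-8B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers detects the AWQ quantization config stored in the checkpoint
# and dispatches to the autoawq kernels; fp16 is the expected compute dtype.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"},
    {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings mirror the vLLM examples above.
outputs = model.generate(
    input_ids,
    max_new_tokens=1000,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```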
## Developers

Listed in alphabetical order.

- [Masato Hirakawa](https://huggingface.co/m-hirakawa)
- [Shintaro Horie](https://huggingface.co/e-mon)
- [Tomoaki Nakamura](https://huggingface.co/tyoyo)
- [Daisuke Oba](https://huggingface.co/daisuk30ba)
- [Sam Passaglia](https://huggingface.co/passaglia)
- [Akira Sasaki](https://huggingface.co/akirasasaki)

## License

[Meta Llama 3 Community License](https://llama.meta.com/llama3/license/)

## How to Cite

```tex
@misc{elyzallama2024,
      title={elyza/Llama-3-ELYZA-JP-8B},
      url={https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B},
      author={Masato Hirakawa and Shintaro Horie and Tomoaki Nakamura and Daisuke Oba and Sam Passaglia and Akira Sasaki},
      year={2024},
}
```

## Citations

```tex
@article{llama3modelcard,
    title={Llama 3 Model Card},
    author={AI@Meta},
    year={2024},
    url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
```