TheBloke/alpaca-lora-65B-GGML


TheBloke committed on Apr 29, 2023

Commit 0ad53da • 1 Parent(s): ee0e806

Update README.md


README.md +43 -21


@@ -5,47 +5,47 @@ inference: false
 # Quantised GGMLs of alpaca-lora-65B
-Quantised 4bit and 2bit GGMLs of [chansung's alpaca-lora-65B](https://huggingface.co/chansung/alpaca-lora-65b) for CPU inference with [llama.cpp](https://github.com/ggerganov/llama.cpp).
 I also have 4bit GPTQ files for GPU inference available here: [TheBloke/alpaca-lora-65B-GPTQ-4bit](https://huggingface.co/TheBloke/alpaca-lora-65B-GPTQ-4bit).
 ## Provided files
 | Name | Quant method | Bits | Size | RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
-`alpaca-lora-65B.GGML.q2_0.bin` | q2_0 | 2bit | 23GB | 26GB | Lowest RAM requirements, minimum quality |
-`alpaca-lora-65B.GGML.q4_0.bin` | q4_0 | 4bit | 39GB | 41GB | Superseded and not recommended |
-`alpaca-lora-65B.GGML.q4_2.bin` | q4_2 | 4bit | 39GB | 41GB | Best compromise between resources, speed and quality |
-`alpaca-lora-65B.GGML.q4_3.bin` | q4_3 | 4bit | 47GB | 49GB | Maximum quality, high RAM requirements and slow inference |
 * The q2_0 file requires the least resources, but does not have great quality compared to the others.
   * It's likely to be better to use a 30B model at 4bit vs a 65B model at 2bit.
-* The q4_0 file was using an experimental quantisation method which has been superseded and is no longer recommended.
-* The q4_2 file offers the best combination of performance and quality.
-* The q4_3 file offers the highest quality, at the cost of increased RAM usage and slower inference speed.
-## Creation method, and requirements
-### 4bit q4_0
-This file was created using an alternative q4_0 quantisation method being trialled in [llama.cpp PR 896](https://github.com/ggerganov/llama.cpp/pull/896)
-This quantisation method has now been deprecated, as it's been replaced by better methods.
-This file is no longer recommended.
-### 4bit q4_2 and q4_3
-These files were created using the new q4_2 and q4_3 quantisation methods that have now become standard and recommended.
-They will work with any recent version of [llama.cpp](https://github.com/ggerganov/llama.cpp). If they fail to open for you, try upgrading llama.cpp.
-The q4_2 file is recommended for most users.  The q4_3 file offers the maximum possible quality, but requires more RAM and will provide slower inference.
-### 2bit q2_0
-This file was created using an even newer and more experimental 2bit method being trialled in [llama.cpp PR 1004](https://github.com/ggerganov/llama.cpp/pull/1004).
-This code is not yet merged into the main `llama.cpp` repo.
 To run this file you need to compile and run the same `llama.cpp` code that was used to create it.
@@ -57,6 +57,28 @@ git checkout q2q3
 make
 ```
 # Original model card not provided
 No model card was provided in [chansung's original repository](https://huggingface.co/chansung/alpaca-lora-65b).

 # Quantised GGMLs of alpaca-lora-65B
+Quantised 2bit, 4bit and 5bit GGMLs of [chansung's alpaca-lora-65B](https://huggingface.co/chansung/alpaca-lora-65b) for CPU inference with [llama.cpp](https://github.com/ggerganov/llama.cpp).
 I also have 4bit GPTQ files for GPU inference available here: [TheBloke/alpaca-lora-65B-GPTQ-4bit](https://huggingface.co/TheBloke/alpaca-lora-65B-GPTQ-4bit).
 ## Provided files
 | Name | Quant method | Bits | Size | RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
+| `alpaca-lora-65B.ggml.q2_0.bin` | q2_0 | 2bit | 23GB | 26GB | Lowest RAM requirements, minimum quality |
+| `alpaca-lora-65B.ggml.q4_0.bin` | q4_0 | 4bit | 39GB | 41GB | Maximum compatibility |
+| `alpaca-lora-65B.ggml.q4_2.bin` | q4_2 | 4bit | 39GB | 41GB | Best compromise between resources, speed and quality |
+| `alpaca-lora-65B.ggml.q5_0.bin` | q5_0 | 5bit | 39GB | 41GB | Best compromise between resources, speed and quality |
+| `alpaca-lora-65B.ggml.q5_1.bin` | q5_1 | 5bit | 39GB | 41GB | Best compromise between resources, speed and quality |
 * The q2_0 file requires the least resources, but does not have great quality compared to the others.
   * It's likely to be better to use a 30B model at 4bit vs a 65B model at 2bit.
+* The q4_0 file provides lower quality, but maximal compatibility. It will work with past and future versions of llama.cpp.
+* The q4_2 file offers the best combination of performance and quality. This format is still subject to change and there may be compatibility issues; see below.
+* The q5_0 file uses the brand new 5bit method released 26th April. This is the 5bit equivalent of q4_0.
+* The q5_1 file uses the brand new 5bit method released 26th April. This is the 5bit equivalent of q4_1.
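The "RAM required" column can be checked against a machine before downloading a file. The following is a minimal sketch for Linux only. The helper names `total_ram_gb` and `fits_in_ram` are hypothetical, the 41GB figure is the q4_2 value from the table above, and treating the table's GB figures as GiB is an assumption.

```shell
# Hypothetical helpers, not part of llama.cpp.

# Read /proc/meminfo-style text on stdin and print total RAM in whole GiB.
# MemTotal is reported in kB, so divide by 1024 twice.
total_ram_gb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }'
}

# Succeed if available RAM ($2) is at least the required RAM ($1), both in GB.
fits_in_ram() {
  [ "$2" -ge "$1" ]
}

# Example: check the q4_2 file (41GB in the table) on a Linux host.
if [ -r /proc/meminfo ]; then
  ram_gb=$(total_ram_gb < /proc/meminfo)
  if fits_in_ram 41 "$ram_gb"; then
    echo "q4_2 should fit: ${ram_gb}GB total RAM"
  else
    echo "q4_2 needs about 41GB; this machine has ${ram_gb}GB"
  fi
fi
```

Total RAM is only an upper bound: other running processes reduce what is actually available to llama.cpp.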
+## q4_2 compatibility
+q4_2 is a relatively new 4bit quantisation method offering improved quality. However, it is still under development and its format is subject to change.
+To use this file you will need recent llama.cpp code, and future updates to llama.cpp may require the file to be re-generated.
+If and when the q4_2 file no longer works with recent versions of llama.cpp, I will endeavour to update it.
+If you want guaranteed compatibility with a wide range of llama.cpp versions, use the q4_0 file.
+## q5_0 and q5_1 compatibility
+These new methods were released to llama.cpp on 26th April. You will need to pull the latest llama.cpp code and rebuild to be able to use them.
+Don't expect any third-party UIs/tools to support them yet.
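Pulling and rebuilding might look like the following sketch. It assumes you already have a git checkout of llama.cpp that builds with `make`; the function name `update_llama_cpp` and the `$HOME/llama.cpp` path are hypothetical.

```shell
# Hypothetical helper: update an existing llama.cpp checkout and rebuild it.
update_llama_cpp() {
  dir="$1"
  # Refuse to run on anything that is not a git checkout.
  if [ ! -e "$dir/.git" ]; then
    echo "not a git checkout: $dir" >&2
    return 1
  fi
  # Fetch the latest code, then rebuild only if the pull succeeded.
  git -C "$dir" pull && make -C "$dir"
}

# Usage (path is an assumption; adjust to where you cloned llama.cpp):
# update_llama_cpp "$HOME/llama.cpp"
```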
+## 2bit q2_0 compatibility
+This file was created using an experimental 2bit method being trialled in [llama.cpp PR 1004](https://github.com/ggerganov/llama.cpp/pull/1004).
+This code is not yet merged into the main `llama.cpp` repo and it is not clear if it ever will be.
 To run this file you need to compile and run the same `llama.cpp` code that was used to create it.
 make
 ```
+## How to run in `llama.cpp`
+I use the following command line; adjust for your tastes and needs:
+```
+./main -t 18 -m alpaca-lora-65B.ggml.q4_2.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.
+### Instruction:
+Write a story about llamas
+### Response:"
+```
+Change `-t 18` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
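If you are unsure how many physical cores a Linux machine has, one way to find out is to count the unique core/socket pairs reported by `lscpu`. This is a sketch, assuming `lscpu` from util-linux is installed; the helper name `count_physical_cores` is hypothetical.

```shell
# Hypothetical helper: count unique Core,Socket pairs in `lscpu -p` output.
# Lines starting with '#' are header comments and are skipped.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# On Linux with lscpu available, print a suggested -t value.
if command -v lscpu >/dev/null 2>&1; then
  cores=$(lscpu -p=Core,Socket | count_physical_cores)
  echo "suggested: -t $cores"
fi
```

Hardware threads on the same core share one Core,Socket pair, so an 8 core/16 thread CPU yields 16 rows but only 8 unique pairs.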
+If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
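To reuse the prompt layout from the command above with different instructions, it can be wrapped in a small shell function. This is a sketch only: `alpaca_prompt` is a hypothetical name, and the line layout simply mirrors the prompt as it is shown in the command above.

```shell
# Hypothetical helper: wrap an instruction in the prompt layout shown above.
alpaca_prompt() {
  printf '%s\n### Instruction:\n%s\n### Response:' \
    'Below is an instruction that describes a task. Write a response that appropriately completes the request.' \
    "$1"
}

# Usage sketch, reusing the flags from the command above:
# ./main -t 18 -m alpaca-lora-65B.ggml.q4_2.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "$(alpaca_prompt 'Write a story about llamas')"
```

Passing the instruction through `%s` rather than embedding it in the format string keeps any `%` or backslash characters in the instruction from being interpreted by `printf`.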
+## How to run in `text-generation-webui`
+Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).
+Note: at this time text-generation-webui does not support the new q5 quantisation methods.
+**Thireus** has written a [great guide on how to update it to the latest llama.cpp code](https://huggingface.co/TheBloke/wizardLM-7B-GGML/discussions/5) so that these files can be used in the UI.
 # Original model card not provided
 No model card was provided in [chansung's original repository](https://huggingface.co/chansung/alpaca-lora-65b).