TheBloke committed on
Commit
0ad53da
1 Parent(s): ee0e806

Update README.md

  1. README.md +43 -21
README.md CHANGED
@@ -5,47 +5,47 @@ inference: false
5
 
6
  # Quantised GGMLs of alpaca-lora-65B
7
 
8
- Quantised 4bit and 2bit GGMLs of [chansung's alpaca-lora-65B](https://huggingface.co/chansung/alpaca-lora-65b) for CPU inference with [llama.cpp](https://github.com/ggerganov/llama.cpp).
9
 
10
  I also have 4bit GPTQ files for GPU inference available here: [TheBloke/alpaca-lora-65B-GPTQ-4bit](https://huggingface.co/TheBloke/alpaca-lora-65B-GPTQ-4bit).
11
 
12
  ## Provided files
13
  | Name | Quant method | Bits | Size | RAM required | Use case |
14
  | ---- | ---- | ---- | ---- | ---- | ----- |
15
- `alpaca-lora-65B.GGML.q2_0.bin` | q2_0 | 2bit | 23GB | 26GB | Lowest RAM requirements, minimum quality |
16
- `alpaca-lora-65B.GGML.q4_0.bin` | q4_0 | 4bit | 39GB | 41GB | Superseded and not recommended |
17
- `alpaca-lora-65B.GGML.q4_2.bin` | q4_2 | 4bit | 39GB | 41GB | Best compromise between resources, speed and quality |
18
- `alpaca-lora-65B.GGML.q4_3.bin` | q4_3 | 4bit | 47GB | 49GB | Maximum quality, high RAM requirements and slow inference |
 
19
 
20
  * The q2_0 file requires the least resources, but does not have great quality compared to the others.
21
  * A 30B model at 4bit is likely to give better results than a 65B model at 2bit.
22
- * The q4_0 file was using an experimental quantisation method which has been superseded and is no longer recommended.
23
- * The q4_2 file offers the best combination of performance and quality.
24
- * The q4_3 file offers the highest quality, at the cost of increased RAM usage and slower inference speed.
 
25
 
26
- ## Creation method, and requirements
27
 
28
- ### 4bit q4_0
29
 
30
- This file was created using an alternative q4_0 quantisation method being trialled in [llama.cpp PR 896](https://github.com/ggerganov/llama.cpp/pull/896)
31
 
32
- This quantisation method has now been deprecated, as it's been replaced by better methods.
33
 
34
- This file is no longer recommended.
35
 
36
- ### 4bit q4_2 and q4_3
37
 
38
- These files were created using the new q4_2 and q4_3 quantisation methods that have now become standard and recommended.
39
 
40
- They will work with any recent version of [llama.cpp](https://github.com/ggerganov/llama.cpp). If they fail to open for you, try upgrading llama.cpp.
41
 
42
- The q4_2 file is recommended for most users. The q4_3 file offers the maximum possible quality, but requires more RAM and will provide slower inference.
43
 
44
- ### 2bit q2_0
45
 
46
- This file was created using an even newer and more experimental 2bit method being trialled in [llama.cpp PR 1004](https://github.com/ggerganov/llama.cpp/pull/1004).
47
-
48
- This code is not yet merged into the main `llama.cpp` repo.
49
 
50
  To run this file you need to compile and run the same `llama.cpp` code that was used to create it.
51
 
@@ -57,6 +57,28 @@ git checkout q2q3
57
  make
58
  ```
59
 
60
  # Original model card not provided
61
 
62
  No model card was provided in [chansung's original repository](https://huggingface.co/chansung/alpaca-lora-65b).
 
5
 
6
  # Quantised GGMLs of alpaca-lora-65B
7
 
8
+ Quantised 2bit, 4bit and 5bit GGMLs of [chansung's alpaca-lora-65B](https://huggingface.co/chansung/alpaca-lora-65b) for CPU inference with [llama.cpp](https://github.com/ggerganov/llama.cpp).
9
 
10
  I also have 4bit GPTQ files for GPU inference available here: [TheBloke/alpaca-lora-65B-GPTQ-4bit](https://huggingface.co/TheBloke/alpaca-lora-65B-GPTQ-4bit).
11
 
12
  ## Provided files
13
  | Name | Quant method | Bits | Size | RAM required | Use case |
14
  | ---- | ---- | ---- | ---- | ---- | ----- |
15
+ `alpaca-lora-65B.ggml.q2_0.bin` | q2_0 | 2bit | 23GB | 26GB | Lowest RAM requirements, minimum quality |
16
+ `alpaca-lora-65B.ggml.q4_0.bin` | q4_0 | 4bit | 39GB | 41GB | Maximum compatibility |
17
+ `alpaca-lora-65B.ggml.q4_2.bin` | q4_2 | 4bit | 39GB | 41GB | Best compromise between resources, speed and quality |
18
+ `alpaca-lora-65B.ggml.q5_0.bin` | q5_0 | 5bit | 39GB | 41GB | Best compromise between resources, speed and quality |
19
+ `alpaca-lora-65B.ggml.q5_1.bin` | q5_1 | 5bit | 39GB | 41GB | Best compromise between resources, speed and quality |
20
 
21
  * The q2_0 file requires the least resources, but does not have great quality compared to the others.
22
  * A 30B model at 4bit is likely to give better results than a 65B model at 2bit.
23
+ * The q4_0 file provides lower quality, but maximal compatibility. It will work with past and future versions of llama.cpp.
24
+ * The q4_2 file offers the best combination of performance and quality. This format is still subject to change and there may be compatibility issues, see below.
25
+ * The q5_0 file uses the brand new 5bit method released on 26th April. This is the 5bit equivalent of q4_0.
26
+ * The q5_1 file uses the brand new 5bit method released on 26th April. This is the 5bit equivalent of q4_1.
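
As a rough illustration of how the "RAM required" column can guide the choice of file, here is a minimal Python sketch. The helper name `files_that_fit` is hypothetical, and the figures are copied from the table above. The q5 rows are left out because the table lists the same size and RAM figures for them as for q4_2, which may not reflect the real 5bit files.

```python
# RAM figures (in GB) copied from the "Provided files" table.
# The q5_0 and q5_1 rows are omitted on purpose: the table repeats the
# q4_2 figures for them, so their real requirements are not known here.
RAM_REQUIRED_GB = {
    "alpaca-lora-65B.ggml.q2_0.bin": 26,
    "alpaca-lora-65B.ggml.q4_0.bin": 41,
    "alpaca-lora-65B.ggml.q4_2.bin": 41,
}


def files_that_fit(available_ram_gb):
    """Return files whose listed RAM requirement fits, largest requirement first."""
    fitting = [
        (ram, name)
        for name, ram in RAM_REQUIRED_GB.items()
        if ram <= available_ram_gb
    ]
    # Sort by RAM requirement (then by name) in descending order.
    return [name for ram, name in sorted(fitting, reverse=True)]


if __name__ == "__main__":
    for ram in (16, 32, 64):
        print(ram, "GB:", files_that_fit(ram))
```

An empty result means none of the listed files fits, in which case a smaller model is the better option, as the note on 30B models above suggests.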
27
 
28
+ ## q4_2 compatibility
29
 
30
+ q4_2 is a relatively new 4bit quantisation method offering improved quality. However, it is still under development and its format is subject to change.
31
 
32
+ To use this file you will need recent llama.cpp code. It is also possible that future updates to llama.cpp will require the file to be regenerated.
33
 
34
+ If and when the q4_2 file no longer works with recent versions of llama.cpp, I will endeavour to update it.
35
 
36
+ If you want compatibility with a wide range of llama.cpp versions, use the q4_0 file.
37
 
38
+ ## q5_0 and q5_1 compatibility
39
 
40
+ These new methods were released to llama.cpp on 26th April. You will need to pull the latest llama.cpp code and rebuild to be able to use them.
41
 
42
+ Don't expect any third-party UIs/tools to support them yet.
43
 
44
+ ## 2bit q2_0 compatibility
45
 
46
+ This file was created using an experimental 2bit method being trialled in [llama.cpp PR 1004](https://github.com/ggerganov/llama.cpp/pull/1004).
47
 
48
+ This code is not yet merged into the main `llama.cpp` repo and it is not clear if it ever will be.
 
 
49
 
50
  To run this file you need to compile and run the same `llama.cpp` code that was used to create it.
51
 
 
57
  make
58
  ```
59
 
60
+ ## How to run in `llama.cpp`
61
+
62
+ I use the following command line; adjust for your tastes and needs:
63
+
64
+ ```
65
+ ./main -t 18 -m alpaca-lora-65B.ggml.q4_2.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.
66
+ ### Instruction:
67
+ Write a story about llamas
68
+ ### Response:"
69
+ ```
70
+ Change `-t 18` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
71
+
72
+ If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
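
The command line and the two adjustments described above (thread count and chat mode) can be assembled programmatically. The following is a minimal Python sketch, not part of llama.cpp: `build_command` and `ALPACA_PROMPT` are hypothetical names, the flags are exactly those shown in the command above, and the thread count assumes two hardware threads per physical core, which will not hold on every CPU.

```python
import os
import shlex

# Prompt template taken from the example command above.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n"
    "{instruction}\n"
    "### Response:"
)


def build_command(model_path, threads, instruction=None):
    """Build the ./main argument list; instruction=None selects chat mode."""
    cmd = [
        "./main", "-t", str(threads), "-m", model_path, "--color",
        "-c", "2048", "--temp", "0.7", "--repeat_penalty", "1.1", "-n", "-1",
    ]
    if instruction is None:
        # Chat-style conversation: use -i -ins instead of -p <PROMPT>.
        cmd += ["-i", "-ins"]
    else:
        cmd += ["-p", ALPACA_PROMPT.format(instruction=instruction)]
    return cmd


if __name__ == "__main__":
    # Assumption: 2 hardware threads per physical core (e.g. 16 threads -> 8 cores).
    # Set the value by hand if your CPU differs.
    logical = os.cpu_count() or 1
    threads = max(1, logical // 2)
    cmd = build_command(
        "alpaca-lora-65B.ggml.q4_2.bin", threads, "Write a story about llamas"
    )
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell from the llama.cpp directory. Passing the list to `subprocess.run` instead avoids shell quoting issues with the multi-line prompt.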
73
+
74
+ ## How to run in `text-generation-webui`
75
+
76
+ Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).
77
+
78
+ Note: at this time text-generation-webui does not yet support the new q5 quantisation methods.
79
+
80
+ **Thireus** has written a [great guide on how to update it to the latest llama.cpp code](https://huggingface.co/TheBloke/wizardLM-7B-GGML/discussions/5) so that these files can be used in the UI.
81
+
82
  # Original model card not provided
83
 
84
  No model card was provided in [chansung's original repository](https://huggingface.co/chansung/alpaca-lora-65b).