brucethemoose committed
Commit eb39dbf
Parent(s): 992011f
Update README.md
README.md CHANGED
@@ -85,7 +85,9 @@ Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you
 To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a lower value than 200,000, otherwise you will OOM!
 
 ***
-24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2. I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/)
+24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2. I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/)
+
+I recommend exl2 quantizations profiled on data similar to the desired task. They are especially sensitive to the quantization data at low bpw!
 ***
 
 Credits:
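
To make the `max_position_embeddings` note concrete, here is a minimal sketch of capping the context at load time with transformers instead of hand-editing config.json. The model path is a placeholder and the 32K cap is just an example value; pick whatever your VRAM tolerates:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "/models/Yi-34B-200K"  # placeholder path to the merged model

# Cap the context window below 200,000 before the weights load, so the
# backend does not size everything for the full 200K context and OOM.
config = AutoConfig.from_pretrained(model_path)
config.max_position_embeddings = 32768  # example value, not a requirement

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",
)
```

For vllm, editing `max_position_embeddings` in config.json directly (or passing a lower `--max-model-len`) accomplishes the same thing.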
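
For the 45K-75K figure, a rough sketch of loading an exl2 quant with exllamav2's Python API follows; the model directory is a placeholder, and the 8-bit KV cache is the kind of trick that stretches context on a 24GB card (the linked post has the details):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache_8bit, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/models/Yi-34B-200K-exl2-4.0bpw"  # placeholder exl2 quant
config.prepare()
config.max_seq_len = 49152  # somewhere in the 45K-75K range that fits 24GB

model = ExLlamaV2(config)

# An 8-bit KV cache roughly halves cache VRAM versus FP16, which is what
# makes these context lengths fit on a single 24GB GPU.
cache = ExLlamaV2Cache_8bit(model, lazy=True)
model.load_autosplit(cache)
```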
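
On the calibration point: exllamav2's convert.py takes a calibration dataset, so a quant profiled on task-like data could be produced roughly as below. Paths and the parquet file are placeholders, and the flags are from my reading of the converter, so double-check them against the exllamav2 repo:

```python
import subprocess

# convert.py ships with the exllamav2 repository. The -c dataset should be
# text resembling the target task; this matters most at low bpw.
subprocess.run(
    [
        "python", "convert.py",
        "-i", "/models/Yi-34B-200K",           # FP16 source model
        "-o", "/tmp/exl2-work",                # scratch/working directory
        "-cf", "/models/Yi-34B-200K-4.0bpw",   # finished quant output
        "-b", "4.0",                           # target bits per weight
        "-c", "/data/task_like_text.parquet",  # task-profiled calibration data
    ],
    check=True,
)
```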