SerialKicked committed
Commit 3d08f1a
1 Parent(s): 20c701a

Update README.md

Files changed (1)
  1. README.md +26 -18
README.md CHANGED
@@ -8,40 +8,48 @@ tags:
  - discussion
  ---

- I'll be using this space to post my (very personal) methodology for testing interesting models I come across.
- In the meantime, you can check [this topic](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13) for the first test category I'm releasing (thanks to Lewdiculous for hosting).
+ # Why? What? TL;DR?

+ Simply put, I'm making my methodology for evaluating RP models public. While none of this is very scientific, it is consistent. I'm focusing on things I'm *personally* looking for in a model, like its ability to obey a character card and a system prompt accurately. Still, I think most of my tests are universal enough that other people might be interested in the results, or might want to run them on their own.


  # Testing Environment

- All models are loaded in Q8_0 (GGUF) using KoboldCPP 1.65 for Windows with CUDA 12, using CuBLAS but not mmq. All layers are on the GPU (NVidia RTX3060 12GB).

- Frontend is the staging version of Silly Tavern.

- All models are extended to 16K context length (auto rope from KCPP) with Flash Attention enabled. Response size set to 1024 tokens max.

- Fixed Seed for all tests: 123

- # Instruct Format

- All models are tested in whichever instruct format they are supposed to be comfortable with.

- However, mergers (as you are the main culprits here): I'm not going to hunt through tons of parent models to try to determine which instruct format you're using (nor are most people). If it's not on your page, I'll assume L3 Instruct for Llama models and ChatML for Mistral ones. If you're using neither of those, nor Alpaca, I'm not testing your model.

- # System Prompt

- [[TODO: add files to repo]]

- # Available Tests

- [[TODO: add to discussions]]

- - [Dog Persona Test](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13) - Testing the model's ability to follow a card despite user actions (and the natural inclination of an LLM). Ability to compartmentalize actions and dialogue.
- - Long Context Test - Various tasks to be executed at full context
- - Group Coherence Test - Testing models in group settings

+ - All models are loaded in Q8_0 (GGUF) with all layers on the GPU (NVidia RTX3060 12GB)
+ - Backend is the latest version of KoboldCPP for Windows using CUDA 12.
+ - Using **CuBLAS** but **not using QuantMatMul (mmq)**.
+ - All models are extended to **16K context length** (auto rope from KCPP) with **Flash Attention** and **ContextShift** enabled.
+ - Frontend is the staging version of Silly Tavern.
+ - Response size set to 1024 tokens max.
+ - Fixed Seed for all tests: **123**

+ # System Prompt and Instruct Format

+ - The exact system prompt and instruct format files can be found in the [file repository](https://huggingface.co/SerialKicked/ModelTestingBed/).
+ - All models are tested in whichever instruct format they are supposed to be comfortable with (as long as it's ChatML or L3 Instruct).

+ # Available Tests

+ ### DoggoEval

+ The goal of this test, featuring Rex (a dog) and his master (EsKa), is to determine whether a model is good at obeying a system prompt and character card. The trick being that dogs can't talk, but LLMs love to.

+ - [Results and discussions are hosted in this thread](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13)
+ - [Files, cards and settings can be found here](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main/DoggoEval)
+ - TODO: Charts and screenshots

+ ### MinotaurEval

+ TODO: The goal of this test is to check whether a model is able to follow a very specific prompting method and maintain situational awareness in the smallest labyrinth in the world.

+ - Discussions will be hosted here.
+ - Files and cards will be available soon (tm).


  # Limitations

- I'm testing for things I'm interested in. I do not pretend any of this is scientific or accurate. As much as I try to reduce the number of variables, a small LLM is still a small LLM at the end of the day. Other seeds, or the smallest change, are bound to give very different results.

- I gave the models a fair shake in more casual settings, regenerating tons of outputs with random seeds (see individual tests), and while there are (large) variations, it tends to even out to the results shown in testing.

+ I'm testing for things I'm interested in. Do not ask for ERP-specific tests. I do not pretend any of this is very scientific or accurate: as much as I try to reduce the number of variables, a small LLM is still a small LLM at the end of the day. Other seeds, or the smallest of changes, are bound to give very different results.

+ I usually give the models I'm testing a fair shake in more casual settings. I regen tons of outputs with random seeds, and while there are (large) variations, it tends to even out to the results shown in testing. Otherwise I'll make a note of it.
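For anyone wanting to reproduce the backend side of the testing environment above, here is a minimal sketch of a KoboldCPP launch with roughly matching settings. It is an illustration rather than a file from this repo: the exact flag names (and defaults such as ContextShift being enabled) can differ between KoboldCPP versions, so treat them as assumptions and check the CLI help for your build.

```python
# Minimal sketch (not from the repo): launching KoboldCPP with settings that
# approximate the "Testing Environment" section above. Flag names are from
# memory of KoboldCPP's CLI and may differ between versions.
import subprocess

MODEL = "some-model.Q8_0.gguf"  # hypothetical Q8_0 GGUF quant

cmd = [
    "koboldcpp.exe",
    "--model", MODEL,
    "--usecublas", "normal",   # CuBLAS backend; "mmq" argument omitted (QuantMatMul off)
    "--gpulayers", "99",       # offload every layer to the GPU (RTX 3060 12GB)
    "--contextsize", "16384",  # 16K context; KoboldCPP handles rope scaling automatically
    "--flashattention",        # Flash Attention on; ContextShift is normally on by default
]

subprocess.run(cmd, check=True)
```

The 1024-token response cap and the fixed seed of 123 are presumably handled in SillyTavern's sampler settings rather than by the backend.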
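To make the two accepted instruct formats concrete, here is a rough sketch of a single system + user turn in the generic ChatML and Llama-3 Instruct templates. These are the publicly known templates, not the exact instruct files from the ModelTestingBed repository, so defer to those files for the settings actually used in testing.

```python
# Generic ChatML and Llama-3 Instruct templates (illustrative only; the real
# test runs use the instruct files from the ModelTestingBed repo).
def format_prompt(system: str, user: str, fmt: str = "chatml") -> str:
    """Wrap a system prompt and a single user turn in the chosen instruct format."""
    if fmt == "chatml":
        return (
            f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            "<|im_start|>assistant\n"
        )
    if fmt == "llama3":
        # The <|begin_of_text|> / BOS token is usually prepended by the backend.
        return (
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
            f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        )
    raise ValueError(f"unknown format: {fmt}")


print(format_prompt("You are Rex, a dog. Dogs cannot talk.", "Rex! Who's a good boy?"))
```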