Update README.md
This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

This model is a fine-tune of [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) using the HCA case law in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) by Umar Butler. The PDF/OCR cases were not used.

The cases were split into context chunks of fewer than 512 tokens using the bge-small-en tokeniser and [semchunk](https://github.com/umarbutler/semchunk).

[mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) was used to generate a legal question for each context chunk.
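The chunking step can be sketched as follows. This is not semchunk itself, just a rough illustration of packing sentences into chunks that stay under a token budget; the whitespace "token" count is a stand-in for the bge-small-en tokeniser, and the function name is hypothetical.

```python
# Rough sketch of token-budget chunking (NOT semchunk itself):
# split on sentences, then greedily pack sentences into chunks
# whose naive whitespace "token" count stays under the budget.
def chunk_text(text, max_tokens=512):
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # stand-in for the bge-small-en tokeniser
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "The appellant was convicted. " * 100
chunks = chunk_text(doc, max_tokens=50)
print(len(chunks), max(len(c.split()) for c in chunks))
```

semchunk additionally splits recursively on semantically meaningful boundaries (headings, paragraphs, sentences) rather than this flat greedy pass, but the token-budget constraint is the same.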
129,137 context-question pairs were used for training, and 14,348 context-question pairs were used for evaluation (see the table below for the results).
Using a 10% subset of the validation dataset, the following hit-rate performance was reached, compared against the base model and OpenAI's default ada embedding model.

| **Model**                 | **Avg. hit-rate** |
|---------------------------|-------------------|
| BAAI/bge-small-en         | 89%               |
| OpenAI                    | 92%               |
| adlumal/auslaw-embed-v1.0 | **97%**           |
## Usage (Sentence-Transformers)

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("adlumal/auslaw-embed-v1.0")
embeddings = model.encode(sentences)
print(embeddings)
```
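Once question and chunk embeddings are computed, semantic search reduces to ranking chunks by cosine similarity. A minimal sketch with stand-in 3-d vectors (the real model produces 384-d vectors via `model.encode`; the `cos_sim` helper and the sample vectors are illustrative only):

```python
import math

def cos_sim(a, b):
    # cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stand-in 3-d embeddings; real ones come from model.encode(...).
question = [0.9, 0.1, 0.0]
chunks = {
    "chunk A": [0.8, 0.2, 0.1],
    "chunk B": [0.0, 0.9, 0.4],
}
best = max(chunks, key=lambda k: cos_sim(question, chunks[k]))
print(best)  # chunk A points in nearly the same direction as the question
```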
## Evaluation Results

The model was evaluated on 10% of the available data. The automated eval results for the final step are presented below.

| Eval               | Score       |
|--------------------|-------------|
| cos_sim-Accuracy@1 | 0.730206301 |
| cos_sim-Accuracy@3 | 0.859562308 |
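cos_sim-Accuracy@k counts a query as a hit when its true context chunk appears among the top-k chunks ranked by cosine similarity. A toy sketch of the metric with made-up rankings (the function name and ids are illustrative, not from the eval harness):

```python
def accuracy_at_k(rankings, true_ids, k):
    # fraction of queries whose true chunk id appears in the top-k ranking
    hits = sum(true in ranked[:k] for ranked, true in zip(rankings, true_ids))
    return hits / len(true_ids)

# Made-up rankings for 4 queries (chunk ids ordered by descending cos_sim).
rankings = [
    [3, 1, 7],  # true id 3 -> hit at rank 1
    [5, 3, 9],  # true id 9 -> hit at rank 3
    [2, 8, 4],  # true id 6 -> miss
    [1, 0, 2],  # true id 1 -> hit at rank 1
]
truth = [3, 9, 6, 1]
print(accuracy_at_k(rankings, truth, k=1))  # 0.5
print(accuracy_at_k(rankings, truth, k=3))  # 0.75
```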