Update README.md
This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

This model is a fine-tune of [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) using the HCA case law in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) by Umar Butler. The PDF/OCR cases were not used.

The cases were split into context chunks of fewer than 512 tokens using the bge-small-en tokeniser and [semchunk](https://github.com/umarbutler/semchunk).

[mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) was used to generate a legal question for each context chunk.
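The chunking step can be sketched as follows. This is not semchunk itself, just a rough illustration of packing sentences into chunks that stay under a token budget; the whitespace "token" count is a stand-in for the bge-small-en tokeniser, and the function name is hypothetical.

```python
# Rough sketch of token-budget chunking (NOT semchunk itself):
# split on sentences, then greedily pack sentences into chunks
# whose naive whitespace "token" count stays under the budget.
def chunk_text(text, max_tokens=512):
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # stand-in for the bge-small-en tokeniser
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "The appellant was convicted. " * 100
chunks = chunk_text(doc, max_tokens=50)
print(len(chunks), max(len(c.split()) for c in chunks))
```

semchunk additionally splits recursively on semantically meaningful boundaries (headings, paragraphs, sentences) rather than this flat greedy pass, but the token-budget constraint is the same.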
129,137 context-question pairs were used for training, and 14,348 context-question pairs were used for evaluation (see the table below for the results).
Using a 10% subset of the validation dataset, the following hit-rate performance was reached, compared against the base model and OpenAI's default ada embedding model.

| **Model**                 | **Avg. hit-rate** |
|---------------------------|-------------------|
| BAAI/bge-small-en         | 89%               |
| OpenAI                    | 92%               |
| adlumal/auslaw-embed-v1.0 | **97%**           |
## Usage (Sentence-Transformers)

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("adlumal/auslaw-embed-v1.0")
embeddings = model.encode(sentences)
print(embeddings)
```
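Once question and chunk embeddings are computed, semantic search reduces to ranking chunks by cosine similarity. A minimal sketch with stand-in 3-d vectors (the real model produces 384-d vectors via `model.encode`; the `cos_sim` helper and the sample vectors are illustrative only):

```python
import math

def cos_sim(a, b):
    # cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stand-in 3-d embeddings; real ones come from model.encode(...).
question = [0.9, 0.1, 0.0]
chunks = {
    "chunk A": [0.8, 0.2, 0.1],
    "chunk B": [0.0, 0.9, 0.4],
}
best = max(chunks, key=lambda k: cos_sim(question, chunks[k]))
print(best)  # chunk A points in nearly the same direction as the question
```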
## Evaluation Results

The model was evaluated on 10% of the available data. The automated eval results for the final step are presented below.

| Eval               | Score       |
|--------------------|-------------|
| cos_sim-Accuracy@1 | 0.730206301 |
| cos_sim-Accuracy@3 | 0.859562308 |
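cos_sim-Accuracy@k counts a query as a hit when its true context chunk appears among the top-k chunks ranked by cosine similarity. A toy sketch of the metric with made-up rankings (the function name and ids are illustrative, not from the eval harness):

```python
def accuracy_at_k(rankings, true_ids, k):
    # fraction of queries whose true chunk id appears in the top-k ranking
    hits = sum(true in ranked[:k] for ranked, true in zip(rankings, true_ids))
    return hits / len(true_ids)

# Made-up rankings for 4 queries (chunk ids ordered by descending cos_sim).
rankings = [
    [3, 1, 7],  # true id 3 -> hit at rank 1
    [5, 3, 9],  # true id 9 -> hit at rank 3
    [2, 8, 4],  # true id 6 -> miss
    [1, 0, 2],  # true id 1 -> hit at rank 1
]
truth = [3, 9, 6, 1]
print(accuracy_at_k(rankings, truth, k=1))  # 0.5
print(accuracy_at_k(rankings, truth, k=3))  # 0.75
```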