<!-- Provide a quick summary of what the model is/does. -->

The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model finetuned via [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych, 2019), the average of the last hidden states (`pooler_type=avg`) is used as the sentence representation.

The fine-tuning script can be accessed [here](Link).
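The average pooling described above (`pooler_type=avg`) can be sketched with a toy tensor; the shapes below are illustrative, not the model's actual hidden size:

```python
import torch

# Toy stand-in for a model output of shape (batch=1, seq_len=4, hidden=3)
last_hidden_state = torch.tensor([[[1.0, 2.0, 3.0],
                                   [3.0, 2.0, 1.0],
                                   [0.0, 0.0, 0.0],
                                   [4.0, 4.0, 4.0]]])

# Average over the token dimension -> one vector per sentence
sentence_embedding = last_hidden_state.mean(dim=1)
print(sentence_embedding)  # tensor([[2., 2., 2.]])
```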
<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [Juri Grosjean](https://huggingface.co/jgrosjean)
- **Model type:** [XMOD](https://huggingface.co/facebook/xmod-base)
- **Language(s) (NLP):** de_CH, fr_CH, it_CH, rm_CH
- **License:** [More Information Needed]
- **Finetuned from model:** [SwissBERT](https://huggingface.co/ZurichNLP/swissbert)
## Use

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
```python
import torch
from transformers import AutoModel, AutoTokenizer
```

### German example
```python
def generate_sentence_embedding(sentence, model_name="jgrosjean-mathesis/swissbert-for-sentence-embeddings"):
    # Load swissBERT model
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.set_default_language("de_CH")

    # Tokenize input sentence
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt", max_length=512)

    # Set the model to evaluation mode
    model.eval()

    # Pass the tokenized input through the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract the average sentence embedding from the last hidden layer
    embedding = outputs.last_hidden_state.mean(dim=1)

    return embedding


sentence_embedding = generate_sentence_embedding("Wir feiern am 1. August den Schweizer Nationalfeiertag.")
print(sentence_embedding)
```

Output:
```
tensor([[ 5.6306e-02, -2.8375e-01, -4.1495e-02,  7.4393e-02, -3.1552e-01,
          1.5213e-01, -1.0258e-01,  2.2790e-01, -3.5968e-02,  3.1769e-01,
          1.9354e-01,  1.9748e-02, -1.5236e-01, -2.2657e-01,  1.3345e-02,
          ...]])
```
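Embeddings obtained this way are typically compared with cosine similarity. A minimal sketch, where the two tensors are hypothetical stand-ins for the outputs of `generate_sentence_embedding` above:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for two sentence embeddings of shape (1, hidden_size)
embedding_a = torch.tensor([[0.5, -0.2, 0.8, 0.1]])
embedding_b = torch.tensor([[0.4, -0.1, 0.7, 0.3]])

# Cosine similarity along the hidden dimension: values close to 1
# indicate semantically similar sentences
similarity = F.cosine_similarity(embedding_a, embedding_b, dim=1)
print(similarity.item())
```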
[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]
80 |
|
|
|
82 |
|
83 |
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
84 |
|
85 |
+
This multilingual model has not been fine-tuned for cross-lingual transfer. It is intended for computing sentence embeddings that can be compared mono-lingually.
|
86 |
|
87 |
### Recommendations
|
88 |
|