jgrosjean committed on
Commit
7e00454
1 Parent(s): 8bd6f24

Update README.md

Files changed (1)
  1. README.md +13 -12
README.md CHANGED
@@ -8,7 +8,7 @@ language:
8
 
9
  <!-- Provide a quick summary of what the model is/does. -->
10
 
11
- The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych,
12
  2019), the average of the last hidden states (pooler_type=avg) is used as sentence representation.
13
 
14
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/zUUu7WLJdkM2hrIE5ev8L.png)
@@ -115,13 +115,13 @@ The sentence swissBERT model has been trained on news articles only. Hence, it m
115
 
116
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
117
 
118
- German, French, Italian and Romansh documents in the [Swissdox@LiRI database](https://t.uzh.ch/1hI) from 2022.
119
 
120
  ### Training Procedure
121
 
122
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
123
 
124
- This model was finetuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552). The same sequence is passed to the encoder twice and the distance between the two resulting embeddings is minimized. Because of the drop-out, it will be encoded at slightly different positions in the vector space.
125
 
126
  The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
127
 
@@ -130,6 +130,7 @@ The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathe
130
  - Number of epochs: 1
131
  - Learning rate: 1e-5
132
  - Batch size: 512
 
133
 
134
  ## Evaluation
135
 
@@ -139,24 +140,24 @@ The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathe
139
 
140
  <!-- This should link to a Dataset Card if possible. -->
141
 
142
- The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French and Italian using a Google Cloud API.
143
 
144
  #### Evaluation via Semantic Textual Similarity
145
 
146
  <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
147
 
148
- Embeddings are computed for the summary and content of each document. Subsequently, the embeddings are matched by minimizing cosine similarity scores between each summary and content embedding pair.
149
 
150
- The performance is measured via accuracy, i.e. the ratio of correct vs. incorrect matches. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
151
 
152
 
153
  #### Evaluation via Text Classification
154
 
155
  <!-- These are the evaluation metrics being used, ideally with a description of why. -->
156
 
157
- Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are set up for the train and test data. The test data is then classified using the training data via a k-nearest neighbor approach. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
158
 
159
- Note: For French and Italian, the training data remains in German, while the test data comprises of translations. This provides insights in the model's abilities in cross-lingual transfer.
160
 
161
  ### Results
162
 
@@ -169,11 +170,11 @@ Making use of an unsupervised training approach, Swissbert for Sentence Embeddin
169
  | Semantic Similarity FR | 82.30 | - |**92.90** | - | 91.10 | - |
170
  | Semantic Similarity IT | 83.00 | - |**91.20** | - | 89.80 | - |
171
  | Semantic Similarity RM | 78.80 | - |**90.80** | - | 67.90 | - |
172
- | Text Classification DE | 95.76 | 91.99 | 96.36 |**92.11**| 95.61 | 91.20 |
173
- | Text Classification FR | 94.55 | 88.52 | 95.76 |**90.94**| 94.55 | 89.82 |
174
- | Text Classification IT | 93.48 | 88.29 | 95.44 | 90.44 | 95.91 |**92.05**|
175
  | Text Classification RM | | | | | | |
176
 
177
  #### Baseline
178
 
179
- The baseline uses mean pooling embeddings from the last hidden state of the original swissbert model.
 
8
 
9
  <!-- Provide a quick summary of what the model is/does. -->
10
 
11
+ The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via self-supervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych,
12
  2019), the average of the last hidden states (pooler_type=avg) is used as sentence representation.
13
 
14
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/zUUu7WLJdkM2hrIE5ev8L.png)
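Not part of the original card: a minimal usage sketch of the pooling described above, assuming the checkpoint loads through `transformers` like the base [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model. The checkpoint name and the `de_CH` adapter id are placeholders, not confirmed by the card.

```python
# Minimal sketch (not from the card): mean pooling over the last hidden states
# (pooler_type=avg). The checkpoint name is a placeholder; substitute the
# released sentence-embedding checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "ZurichNLP/swissbert"  # placeholder for the finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
if hasattr(model, "set_default_language"):
    model.set_default_language("de_CH")  # SwissBERT is X-MOD-based; adapter id is an assumption

def embed(sentences):
    """Return mean-pooled sentence embeddings, ignoring padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    mask = batch["attention_mask"].unsqueeze(-1)        # (batch, seq, 1)
    summed = (out.last_hidden_state * mask).sum(dim=1)  # zero out padding positions
    return summed / mask.sum(dim=1)                     # (batch, hidden)

embeddings = embed(["Der neue Film kommt nächste Woche ins Kino."])
```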
 
115
 
116
  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
117
 
118
+ German, French, Italian and Romansh documents in the [Swissdox@LiRI database](https://t.uzh.ch/1hI) up to 2023.
119
 
120
  ### Training Procedure
121
 
122
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
123
 
124
+ This model was finetuned via self-supervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552). Each positive pair consists of an article body and the corresponding title and lead; no hard negatives are used. A sketch of this objective is shown below, after the training hyperparameters.
125
 
126
  The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
127
 
 
130
  - Number of epochs: 1
131
  - Learning rate: 1e-5
132
  - Batch size: 512
133
+ - Temperature: 0.05
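To make the training objective concrete, below is an illustrative SimCSE-style loss with in-batch negatives and the temperature listed above. It is a sketch only, not taken from the linked fine-tuning script; the function and variable names are assumptions.

```python
# Illustrative SimCSE-style objective: each article body is pulled towards its own
# title+lead embedding, with the other pairs in the batch acting as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(body_emb: torch.Tensor, title_lead_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """body_emb, title_lead_emb: (batch, hidden) embeddings of the positive pair sides."""
    a = F.normalize(body_emb, dim=-1)
    b = F.normalize(title_lead_emb, dim=-1)
    sim = a @ b.T / temperature                            # (batch, batch) cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)  # the diagonal holds the positives
    return F.cross_entropy(sim, labels)
```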
134
 
135
  ## Evaluation
136
 
 
140
 
141
  <!-- This should link to a Dataset Card if possible. -->
142
 
143
+ The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French and Italian using a Google Cloud API, and to Romansh via a [Textshuttle](https://textshuttle.com/en) API.
144
 
145
  #### Evaluation via Semantic Textual Similarity
146
 
147
  <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
148
 
149
+ Embeddings are computed for the summary and content of each document. Subsequently, each summary embedding is matched to the content embedding with which it has the highest cosine similarity.
150
 
151
+ The performance is measured via accuracy, i.e. the ratio of correct to total matches. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
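A sketch of how this matching accuracy can be computed from precomputed embeddings (illustrative only; it is not the linked evaluation script):

```python
# Each summary embedding is assigned to the most similar content embedding;
# accuracy is the fraction of summaries matched to their own article.
import torch
import torch.nn.functional as F

def matching_accuracy(summary_emb: torch.Tensor, content_emb: torch.Tensor) -> float:
    """Row i of summary_emb and content_emb belongs to the same document."""
    sims = F.normalize(summary_emb, dim=-1) @ F.normalize(content_emb, dim=-1).T
    predicted = sims.argmax(dim=1)                            # best-matching content per summary
    expected = torch.arange(sims.size(0), device=sims.device)
    return (predicted == expected).float().mean().item()
```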
152
 
153
 
154
  #### Evaluation via Text Classification
155
 
156
  <!-- These are the evaluation metrics being used, ideally with a description of why. -->
157
 
158
+ Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are computed for the training and test data, and the test data is classified using the training data via a k-nearest neighbors approach. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
159
 
160
+ Note: For French, Italian and Romansh, the training data remains in German, while the test data consists of translations. This provides insight into the model's cross-lingual transfer abilities.
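A sketch of the classification step on precomputed embeddings (illustrative only; the value of k and the use of scikit-learn are assumptions, and the actual script is linked above):

```python
# k-nearest-neighbors classification of test embeddings against the German
# training embeddings, scored by accuracy.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(train_emb: np.ndarray, train_labels, test_emb: np.ndarray,
                 test_labels, k: int = 5) -> float:
    """train_emb: embeddings of German training articles; test_emb: (translated) test articles."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_emb, train_labels)
    return accuracy_score(test_labels, clf.predict(test_emb))
```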
161
 
162
  ### Results
163
 
 
170
  | Semantic Similarity FR | 82.30 | - |**92.90** | - | 91.10 | - |
171
  | Semantic Similarity IT | 83.00 | - |**91.20** | - | 89.80 | - |
172
  | Semantic Similarity RM | 78.80 | - |**90.80** | - | 67.90 | - |
173
+ | Text Classification DE | 95.76 | 91.99 | 96.36 |**92.11**| 96.37 | 96.34 |
174
+ | Text Classification FR | 94.55 | 88.52 | 95.76 |**90.94**| 99.35 | 99.35 |
175
+ | Text Classification IT | 93.48 | 88.29 | 95.44 | 90.44 | 95.91 |**92.05**|
176
  | Text Classification RM | | | | | | |
177
 
178
  #### Baseline
179
 
180
+ The baselines use mean pooling embeddings from the last hidden state of the original SwissBERT model and from the currently best-performing Sentence-BERT model, [distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1).