jgrosjean committed on
Commit
3a17ec9
1 Parent(s): 6ba6e16

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -123,7 +123,7 @@ German, French, Italian and Romansh documents in the [Swissdox@LiRI database](ht
 
  This model was finetuned via self-supervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552). The positive sequence pairs consist of the article body vs. its title and lead, wihout any hard negatives.
 
- The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
+ The fine-tuning script can be accessed [here](https://github.com/jgrosjean-mathesis/sentence-swissbert/tree/main/training).
 
  #### Training Hyperparameters
 
@@ -148,14 +148,14 @@ The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.u
 
  Embeddings are computed for the summary and content of each document. Subsequently, the embeddings are matched by maximizing cosine similarity scores between each summary and content embedding pair.
 
- The performance is measured via accuracy, i.e. the ratio of correct vs. total matches. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
+ The performance is measured via accuracy, i.e. the ratio of correct vs. total matches. The script can be found [here](https://github.com/jgrosjean-mathesis/sentence-swissbert/tree/main/evaluation).
 
 
  #### Evaluation via Text Classification
 
  <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
- Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are set up for the train and test data. The test data is then classified using the training data via a k-nearest neighbors approach. The script can be found [here](https://github.com/jgrosjean-mathesis/swissbert-for-sentence-embeddings/tree/main).
+ Articles with the topic tags "movies/tv series", "corona" and "football" (or related) are filtered from the corpus and split into training data (80%) and test data (20%). Subsequently, embeddings are set up for the train and test data. The test data is then classified using the training data via a k-nearest neighbors approach. The script can be found [here](https://github.com/jgrosjean-mathesis/sentence-swissbert/tree/main/evaluation).
 
  Note: For French, Italian and Romansh, the training data remains in German, while the test data comprises of translations. This provides insights in the model's abilities in cross-lingual transfer.
 