jgrosjean committed
Commit a61b682
1 Parent(s): 25f607a

Update README.md

Files changed (1)
  1. README.md +20 -34
README.md CHANGED
@@ -6,11 +6,9 @@
 
 <!-- Provide a quick summary of what the model is/does. -->
 
- The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych,
 2019), the average of the last hidden states (pooler_type=avg) is used as sentence representation.
 
- The fine-tuning script can be accessed [here](Link).
-
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/zUUu7WLJdkM2hrIE5ev8L.png)
 
  ## Model Details
@@ -22,7 +20,7 @@ The fine-tuning script can be accessed [here](Link).
 - **Developed by:** [Juri Grosjean](https://huggingface.co/jgrosjean)
 - **Model type:** [XMOD](https://huggingface.co/facebook/xmod-base)
 - **Language(s) (NLP):** de_CH, fr_CH, it_CH, rm_CH
- - **License:** [More Information Needed]
 - **Finetuned from model:** [SwissBERT](https://huggingface.co/ZurichNLP/swissbert)
 
  ## Use
@@ -107,16 +105,15 @@ This model has been trained on news articles only. Hence, it might not perform a
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
- [More Information Needed]
 
 ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
- #### Preprocessing [optional]
-
- [More Information Needed]
 
 #### Training Hyperparameters
 
@@ -130,46 +127,35 @@ Batch size: 512
 
 ### Testing Data, Factors & Metrics
 
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
 
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
 
- #### Metrics
 
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
- [More Information Needed]
 
- ### Results
 
- [More Information Needed]
 
- #### Summary
 
- ## Environmental Impact
 
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
 
- ## Technical Specifications [optional]
 
- ### Model Architecture and Objective
 
 [More Information Needed]
 
 
 <!-- Provide a quick summary of what the model is/does. -->
 
+ The [SwissBERT](https://huggingface.co/ZurichNLP/swissbert) model was finetuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021) for sentence embeddings, using ~1 million Swiss news articles published in 2022 from [Swissdox@LiRI](https://t.uzh.ch/1hI). Following the [Sentence Transformers](https://huggingface.co/sentence-transformers) approach (Reimers and Gurevych,
  2019), the average of the last hidden states (pooler_type=avg) is used as sentence representation.
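
A minimal sketch of this pooling with the Hugging Face `transformers` API (shown here with the base SwissBERT checkpoint; the example sentence and variable names are illustrative):

```python
# Hedged sketch: mean pooling over the last hidden states (pooler_type=avg).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModel.from_pretrained("ZurichNLP/swissbert")
model.set_default_language("de_CH")  # select the X-MOD language adapter

inputs = tokenizer("Der Schnee fiel die ganze Nacht über Zürich.", return_tensors="pt")
with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden)

# Average over non-padding tokens to obtain the sentence embedding.
mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
```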
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564ab8d113e2baa55830af0/zUUu7WLJdkM2hrIE5ev8L.png)
 
  ## Model Details
 
 - **Developed by:** [Juri Grosjean](https://huggingface.co/jgrosjean)
 - **Model type:** [XMOD](https://huggingface.co/facebook/xmod-base)
 - **Language(s) (NLP):** de_CH, fr_CH, it_CH, rm_CH
+ - **License:** Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
 - **Finetuned from model:** [SwissBERT](https://huggingface.co/ZurichNLP/swissbert)
 
  ## Use
 
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
+ German, French, Italian and Romansh documents in the [Swissdox@LiRI database](https://t.uzh.ch/1hI) from 2022.
 
 ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
+ This model was finetuned via unsupervised [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552). The same sequence is passed to the encoder twice: due to dropout, the two passes encode it at slightly different positions in vector space, and training minimizes the distance between the two resulting embeddings.
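
As a rough sketch of this objective (following the unsupervised SimCSE recipe of Gao et al., 2021, with in-batch negatives; the temperature value and names are illustrative, and the actual fine-tuning script may differ), minimizing this loss pulls the two dropout views of each sentence together while pushing apart the other sentences in the batch:

```python
# Hedged sketch of the unsupervised SimCSE objective.
import torch
import torch.nn.functional as F

def simcse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """emb_a, emb_b: (batch, hidden) embeddings of the same sentences from
    two forward passes with different dropout masks."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    # Cosine similarity of every sentence in view A to every sentence in view B.
    sims = emb_a @ emb_b.T / temperature       # (batch, batch)
    # Each sentence's second view is its positive; all other sentences in the
    # batch act as negatives.
    labels = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims, labels)
```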
 
 
+ The fine-tuning script can be accessed [here](Link).
 
  #### Training Hyperparameters
 
 ### Testing Data, Factors & Metrics
 
+ #### Baseline
 
+ The first baseline is [distiluse-base-multilingual-cased](https://www.sbert.net/examples/training/multilingual/README.html), a high-performing Sentence Transformers model that supports German, French and Italian, among other languages.
 
+ The second baseline uses mean-pooled embeddings from the last hidden state of the original SwissBERT model.
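
For reference, the first baseline can be loaded via the `sentence-transformers` library (a sketch; the exact checkpoint version, here v1, is an assumption). The second baseline uses the same mean-pooling procedure sketched in the summary above, applied to the base checkpoint:

```python
# Hedged sketch: loading the distiluse baseline; the "-v1" suffix is an
# assumption about the exact checkpoint used.
from sentence_transformers import SentenceTransformer

baseline = SentenceTransformer("distiluse-base-multilingual-cased-v1")
embeddings = baseline.encode(["Ein kurzer Beispielsatz."])  # shape (1, 512)
```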
 
+ #### Testing Data
 
+ <!-- This should link to a Dataset Card if possible. -->
 
+ The two evaluation tasks make use of the [20 Minuten dataset](https://www.zora.uzh.ch/id/eprint/234387/) compiled by Kew et al. (2023), which contains Swiss news articles with topic tags and summaries. Parts of the dataset were automatically translated to French and Italian using a Google Cloud API.
 
+ #### Evaluation via Semantic Textual Similarity
 
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
+ Embeddings are computed for the summary and content of each document. Subsequently, the embeddings are matched by maximizing the cosine similarity between each summary and content embedding pair.
 
+ Performance is measured via accuracy, i.e. the proportion of summaries matched to the correct article.
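
A sketch of this matching and accuracy computation (function and variable names are illustrative, not taken from the evaluation code):

```python
# Hedged sketch: assign each summary embedding to its most cosine-similar
# content embedding and score the share of correct assignments.
import numpy as np

def matching_accuracy(summary_emb: np.ndarray, content_emb: np.ndarray) -> float:
    """summary_emb, content_emb: (n_docs, hidden); row i of each belongs
    to the same article."""
    a = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    b = content_emb / np.linalg.norm(content_emb, axis=1, keepdims=True)
    sims = a @ b.T                    # (n_docs, n_docs) cosine similarities
    predicted = sims.argmax(axis=1)   # best-matching content per summary
    return float((predicted == np.arange(len(a))).mean())
```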
 
+ #### Evaluation via Text Classification
 
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
+ Articles with the topic tags "movies/tv series", "corona" and "football" (or related tags) are filtered from the corpus and split into training data (80%) and test data (20%). Embeddings are then computed for the training and test data, and the test data is classified via a k-nearest-neighbor approach over the training embeddings, as sketched below.
 
+ Note: For French and Italian, the training data remains in German, while the test data consists of translations. This provides insight into the model's cross-lingual transfer abilities.
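
A minimal sketch of this classification setup with scikit-learn (k = 5, the random seed, and the stratified split are assumptions; random vectors stand in for the actual article embeddings):

```python
# Hedged sketch of the kNN evaluation on precomputed sentence embeddings.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 768))  # stand-in for article embeddings
y = rng.choice(["movies/tv series", "corona", "football"], size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```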
 
+ ### Results
 
  [More Information Needed]