Commit 9507de3 (parent: 826c747) by fdelucaf: Update README.md
Files changed: README.md (+38, -43)
---
license: apache-2.0
datasets:
- projecte-aina/CA-PT_Parallel_Corpus
language:
- pt
- ca
metrics:
- bleu
library_name: fairseq
---

## Projecte Aina’s Portuguese-Catalan machine translation model

## Model description

This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Portuguese datasets which, after filtering and cleaning, comprised 6,159,631 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.

## Intended uses and limitations

## How to use

Translate a sentence using Python:

```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

# Download the model from the Hugging Face Hub
model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-pt-ca", revision="main")

# Tokenize the source sentence with the bundled SentencePiece model
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
tokenized = tokenizer.tokenize("Bem-vindo ao Projeto Aina!")

# Load the CTranslate2 model and translate
translator = ctranslate2.Translator(model_dir)
translated = translator.translate_batch([tokenized[0]])
print(tokenizer.detokenize(translated[0][0]['tokens']))
```
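The snippet above translates a single sentence. A small helper that reuses the `tokenizer` and `translator` objects created there (the helper itself is illustrative, not part of either library) can batch several sentences at once:

```python
def translate_sentences(sentences, tokenizer, translator):
    """Translate a list of source sentences and return detokenized strings."""
    # pyonmttok's tokenize() returns (tokens, features); keep only the tokens
    tokenized = [tokenizer.tokenize(s)[0] for s in sentences]
    results = translator.translate_batch(tokenized)
    # take the best hypothesis for each input and detokenize it
    return [tokenizer.detokenize(r[0]["tokens"]) for r in results]
```

Called as `translate_sentences(["Bem-vindo!", "Bom dia!"], tokenizer, translator)`, it returns the translations in the same order as the inputs.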

## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future; if completed, this model card will be updated.

## Training

### Training data

The model was trained on a combination of the following datasets:

| Dataset | Sentences | Sentences after filtering |
|-----------|----------------|----------------|
| … | … | … |
| Europarl | 1,692,106 | 1,631,989 |
| **Total** | **15,391,745** | **6,159,631** |

All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/). The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).

### Training procedure

#### Data preparation

All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75, computed using sentence embeddings from [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 6,159,631 sentence pairs and, before training, the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).

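The similarity filter described above can be sketched in plain Python; the toy embedding vectors below stand in for the LaBSE sentence embeddings, which in practice would be computed with the sentence-transformers model:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_pairs(pairs, src_embs, tgt_embs, threshold=0.75):
    """Keep sentence pairs whose embeddings reach the similarity threshold."""
    return [pair for pair, u, v in zip(pairs, src_embs, tgt_embs)
            if cosine(u, v) >= threshold]
```

Pairs whose source and target embeddings point in clearly different directions fall below the 0.75 threshold and are dropped.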
#### Tokenization

All data is tokenized using [SentencePiece](https://github.com/google/sentencepiece), with a 50,000-token SentencePiece model learned from the combination of all filtered training data. This SentencePiece model is included in the repository.

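Once learned, a SentencePiece-style model segments raw text into known subword pieces. The toy sketch below uses greedy longest-match over an illustrative vocabulary (the real model has 50,000 pieces and uses a unigram language model rather than pure greedy matching):

```python
def segment(text, vocab):
    """Greedy longest-match subword segmentation, '▁' marking word starts."""
    tokens = []
    for word in text.split():
        piece = "▁" + word  # SentencePiece marks word boundaries with ▁
        while piece:
            # try the longest prefix first; fall back to a single character
            for end in range(len(piece), 0, -1):
                if piece[:end] in vocab or end == 1:
                    tokens.append(piece[:end])
                    piece = piece[end:]
                    break
    return tokens
```

For example, with the toy vocabulary `{"▁ben", "vingut", "▁a", "l"}`, the text `"benvingut al"` segments into `["▁ben", "vingut", "▁a", "l"]`.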
#### Hyperparameters

The model was trained for a total of 12,000 updates. Weights were saved every 10 […]

## Evaluation

### Variables and metrics

We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.

### Evaluation results

Below are the evaluation results for machine translation from Portuguese to Catalan, compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):

| Test set | SoftCatalà | Google Translate | aina-translator-pt-ca |
|----------------------|------------|------------------|-----------------------|
| Flores 101 dev | 31.9 | **37.8** | 34.4 |
| Flores 101 devtest | 33.6 | **38.5** | 35.7 |
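The scores above are BLEU. As an illustration of the metric itself (not of the evaluation tooling, which the card does not name), a minimal single-pair BLEU with uniform weights over 1- to 4-grams can be sketched as:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """BLEU for a single whitespace-tokenized sentence pair (0.0 to 1.0)."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngram_counts(hyp, n), ngram_counts(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # some n-gram order has no overlap at all
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * geo_mean
```

An identical hypothesis and reference score 1.0; published figures such as those in the table are this quantity scaled to 0-100 and computed corpus-wide.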
 
## Additional information

### Author
The Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by the Language Technologies Unit, Barcelona Supercomputing Center.

### License
This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it), or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including those regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (the Barcelona Supercomputing Center) be liable for any results arising from its use by third parties.

</details>