AudreyVM committed
Commit 51e4e23
1 Parent(s): 9dc3467

Update README.md

Files changed (1): README.md (+11 −11)
```diff
@@ -48,7 +48,7 @@ import pyonmttok
 from huggingface_hub import snapshot_download
 model_dir = snapshot_download(repo_id="projecte-aina/mt-aina-gl-ca", revision="main")
 tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
-tokenized=tokenizer.tokenize("Benvido ao proxecto Aina.")
+tokenized=tokenizer.tokenize("Benvido ao proxecto Ilenia.")
 translator = ctranslate2.Translator(model_dir)
 translated = translator.translate_batch([tokenized[0]])
 print(tokenizer.detokenize(translated[0][0]['tokens']))
@@ -122,24 +122,24 @@ Weights were saved every 1000 updates and reported results are the average of th
 ### Variables and metrics
 We use the BLEU score for evaluation on test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX)
 ### Evaluation results
-Below are the evaluation results on the machine translation from Galician to Catalan compared to [M2M100 1.2B](https://huggingface.co/facebook/m2m100_1.2B), [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):
-| Test set           | M2M100 1.2B | NLLB 1.3B | NLLB 3.3B | mt-aina-gl-ca |
-|--------------------|-------------|-----------|-----------|---------------|
-| Flores 200 devtest | 32,6        | 22,3      | **34,3**  | 32,4          |
-| TaCON              | 56,5        | 32,2      | 54,1      | **58,2**      |
-| NTREX              | 34,0        | 20,4      | **34,2**  | 33,7          |
-| Average            | 41,0        | 25,0      | 40,9      | **41,4**      |
+Below are the evaluation results on the machine translation from Galician to Catalan compared to [Google Translate](https://translate.google.com/), [M2M100 1.2B](https://huggingface.co/facebook/m2m100_1.2B), [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):
+| Test set           | Google Translate | M2M100 1.2B | NLLB 1.3B | NLLB 3.3B | mt-aina-gl-ca |
+|--------------------|------------------|-------------|-----------|-----------|---------------|
+| Flores 101 devtest | **36,4**         | 32,6        | 22,3      | 34,3      | 32,4          |
+| TaCON              | 48,4             | 56,5        | 32,2      | 54,1      | **58,2**      |
+| NTREX              | **34,7**         | 34,0        | 20,4      | 34,2      | 33,7          |
+| Average            | 39,0             | 41,0        | 25,0      | 40,9      | **41,4**      |
 ## Additional information
 ### Author
-Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center (langtech@bsc.es)
+Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.
 ### Contact information
-For further information, send an email to <aina@bsc.es>
+For further information, send an email to <langtech@bsc.es>
 ### Copyright
 Copyright Language Technologies Unit at Barcelona Supercomputing Center (2023)
 ### Licensing information
 This work is licensed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 ### Funding
-This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
+This work was funded by SEDIA within the framework of ILENIA.
 ### Disclaimer
 <details>
 <summary>Click to expand</summary>
```
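As a sanity check on the evaluation table, the "Average" row for the original four systems can be recomputed as the unweighted mean of the three test-set scores. A minimal sketch (scores copied from the table above, with decimal commas written as dots):

```python
# BLEU scores per system on the three test sets, copied from the
# evaluation table (Flores devtest, TaCON, NTREX).
scores = {
    "M2M100 1.2B":   [32.6, 56.5, 34.0],
    "NLLB 1.3B":     [22.3, 32.2, 20.4],
    "NLLB 3.3B":     [34.3, 54.1, 34.2],
    "mt-aina-gl-ca": [32.4, 58.2, 33.7],
}

# Unweighted mean over the three test sets, rounded to one decimal place.
averages = {name: round(sum(s) / len(s), 1) for name, s in scores.items()}
print(averages)
# → {'M2M100 1.2B': 41.0, 'NLLB 1.3B': 25.0, 'NLLB 3.3B': 40.9, 'mt-aina-gl-ca': 41.4}
```

These reproduce the table's Average row, confirming it is a plain per-test-set mean rather than a length-weighted one.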