Commit 7f604f5 by dinalzein
1 parent: 11f30e5

Update README.md

  1. README.md +26 -6
README.md CHANGED
@@ -14,24 +14,44 @@ should probably proofread and complete it, then remove this comment. -->
  # xlm-roberta-base-finetuned-language-detection-new

- This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on an unknown dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.0436
  - Accuracy: 0.9959

  ## Model description

- More information needed

  ## Intended uses & limitations

- More information needed

  ## Training and evaluation data

- More information needed
-
- ## Training procedure

  ### Training hyperparameters
 
  # xlm-roberta-base-finetuned-language-detection-new

+ This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the [Language Identification dataset](https://huggingface.co/datasets/papluca/language-identification).
+
  It achieves the following results on the evaluation set:
  - Loss: 0.0436
  - Accuracy: 0.9959

  ## Model description

+ The model is an XLM-RoBERTa transformer with a classification head on top.

  ## Intended uses & limitations

+ It identifies the language a document is written in and supports 20 different languages:
+
+ * Arabic (ar)
+ * Bulgarian (bg)
+ * German (de)
+ * Modern Greek (el)
+ * English (en)
+ * Spanish (es)
+ * French (fr)
+ * Hindi (hi)
+ * Italian (it)
+ * Japanese (ja)
+ * Dutch (nl)
+ * Polish (pl)
+ * Portuguese (pt)
+ * Russian (ru)
+ * Swahili (sw)
+ * Thai (th)
+ * Turkish (tr)
+ * Urdu (ur)
+ * Vietnamese (vi)
+ * Chinese (zh)
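Reviewer sketch (not part of the commit): the language list in the new README could be exercised roughly as follows. This is a minimal sketch under stated assumptions. `SUPPORTED_LANGUAGES`, `describe_prediction`, and `detect_language` are hypothetical names introduced here, and the code assumes the checkpoint emits the ISO codes from the list as labels, which the README does not confirm. Only the `transformers` `pipeline("text-classification", ...)` call is an existing library API.

```python
from typing import Dict, List

# Language codes and names as listed in the updated model card.
SUPPORTED_LANGUAGES: Dict[str, str] = {
    "ar": "Arabic", "bg": "Bulgarian", "de": "German", "el": "Modern Greek",
    "en": "English", "es": "Spanish", "fr": "French", "hi": "Hindi",
    "it": "Italian", "ja": "Japanese", "nl": "Dutch", "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian", "sw": "Swahili", "th": "Thai",
    "tr": "Turkish", "ur": "Urdu", "vi": "Vietnamese", "zh": "Chinese",
}


def describe_prediction(scores: List[Dict[str, object]]) -> str:
    """Return a readable name for the highest-scoring label.

    `scores` is a list of {"label": str, "score": float} dicts, the shape
    a transformers text-classification pipeline returns for one input.
    """
    best = max(scores, key=lambda item: item["score"])
    label = str(best["label"])
    # Assumption: labels are the ISO codes above. If the checkpoint uses
    # other label names (e.g. "LABEL_3"), the raw label is returned instead.
    return SUPPORTED_LANGUAGES.get(label, label)


def detect_language(text: str, model_id: str) -> str:
    """Classify `text` with a Hub checkpoint (requires `transformers`)."""
    from transformers import pipeline  # lazy import: optional dependency

    classifier = pipeline("text-classification", model=model_id)
    return describe_prediction(classifier(text))
```

To try it, pass the Hub repository id of this checkpoint as `model_id`. The id is left as a parameter because the commit page does not state it.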
  ## Training and evaluation data

+ The model is fine-tuned on the [Language Identification dataset](https://huggingface.co/datasets/papluca/language-identification), a corpus consisting of text in 20 different languages. The dataset is split into 7000 sentences for training, 1000 for validation, and 1000 for testing. The accuracy on the test set is 99.5%.

  ### Training hyperparameters