Commit 4b90729 by qarib (1 parent: 5e757e5)

Create README.md

Files changed (1)
  1. README.md +93 -0
README.md ADDED
@@ -0,0 +1,93 @@
+ # QARiB: QCRI Arabic and Dialectal BERT
+
+ ## About QARiB
+ <img src="./Qarib_logo.png" width="100" align="left"/>
+ The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
+ The tweets were collected using the Twitter API with the language filter `lang:ar`. The text data is a combination of
+ [Arabic GigaWord](url), [Abulkhair Arabic Corpus]() and [OPUS](http://opus.nlpl.eu/).
+
+ QARiB is the Arabic word for "boat".
+
+ ## Model and Parameters
+
+ - Data size: 14B tokens
+ - Vocabulary: 64k
+ - Iterations: 10M
+ - Number of Layers: 12
+
+ ## Training QARiB
+ See [Training QARiB](./Training_QARiB.md) for details.
+
+ ## Using QARiB
+
+ You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. Check the model hub for versions fine-tuned on a task that interests you. For more details, see [Using QARiB](./Using_QARiB.md).
+
+ ### How to use
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> # Load a fill-mask pipeline from a local model directory
+ >>> fill_mask = pipeline("fill-mask", model="./models/data60gb_86k")
+
+ >>> fill_mask("شو عندكم يا [MASK]")
+ [{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
+  {'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
+  {'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
+  {'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
+  {'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]
+
+ >>> fill_mask("وقام المدير [MASK]")
+ [{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
+  {'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
+  {'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
+  {'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
+  {'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}]
+
+ >>> fill_mask("وقامت المديرة [MASK]")
+ [{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
+  {'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
+  {'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
+  {'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
+  {'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
+
+ >>> fill_mask("قللي وشفيييك يرحم [MASK]")
+ [{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
+  {'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
+  {'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
+  {'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
+  {'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]
+ ```
+
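The pipeline returns a list of dictionaries with `sequence`, `score`, `token`, and `token_str` keys, and some candidates are WordPiece continuations (for example `##لي`) rather than whole words. The following is a minimal sketch, assuming that output shape, of how one might keep only whole-word candidates and rank them by score. The helper name `whole_word_candidates` and its `min_score` parameter are hypothetical and not part of QARiB or `transformers`; the sample values are copied from the last fill-mask output.

```python
def whole_word_candidates(results, min_score=0.0):
    """Return (token_str, score) pairs for predictions that are full words.

    Hypothetical helper: assumes `results` is a list of dicts with the
    'token_str' and 'score' keys seen in the fill-mask outputs.
    """
    kept = [
        (r["token_str"], r["score"])
        for r in results
        # WordPiece marks sub-word continuations with a leading '##'.
        if not r["token_str"].startswith("##") and r["score"] >= min_score
    ]
    # Highest-scoring candidate first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Sample copied from the last fill-mask output ('sequence' and 'token' omitted).
sample = [
    {"score": 0.4152909517288208, "token_str": "والديك"},
    {"score": 0.07663793861865997, "token_str": "##لي"},
    {"score": 0.0453166700899601, "token_str": "حالك"},
    {"score": 0.04390475153923035, "token_str": "امك"},
    {"score": 0.027349254116415977, "token_str": "##ونك"},
]

for word, score in whole_word_candidates(sample):
    print(f"{word}\t{score:.4f}")
```

In practice the same function could be applied directly to the return value of `fill_mask(...)`, since it only reads the two keys named above.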
+ ## Evaluations
+
+ |**Experiment** |**mBERT**|**AraBERT0.1**|**AraBERT1.0**|**ArabicBERT**|**QARiB**|
+ |---------------|---------|--------------|--------------|--------------|---------|
+ |**Dialect Identification** | 6.06% | 59.92% | 59.85% | 61.70% | 65.21% |
+ |**Emotion Detection** | 27.90% | 43.89% | 42.37% | 41.65% | 44.35% |
+ |**Named-Entity Recognition (NER)** | 49.38% | 64.97% | 66.63% | 64.04% | 61.62% |
+ |**Offensive Language Detection** | 83.14% | 88.07% | 88.97% | 88.19% | 91.94% |
+ |**Sentiment Analysis** | 86.61% | 90.80% | 93.58% | 83.27% | 93.31% |
+
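To read the table at a glance, the sketch below finds the top-scoring model for each task and the gap, in percentage points, between QARiB and the best of the other models. The numbers are copied from the table; the README does not name the metric, so the code only assumes that higher is better. The function names are illustrative.

```python
# Scores (%) copied from the evaluation table; column order follows the header.
MODELS = ["mBERT", "AraBERT0.1", "AraBERT1.0", "ArabicBERT", "QARiB"]
SCORES = {
    "Dialect Identification": [6.06, 59.92, 59.85, 61.70, 65.21],
    "Emotion Detection": [27.90, 43.89, 42.37, 41.65, 44.35],
    "Named-Entity Recognition (NER)": [49.38, 64.97, 66.63, 64.04, 61.62],
    "Offensive Language Detection": [83.14, 88.07, 88.97, 88.19, 91.94],
    "Sentiment Analysis": [86.61, 90.80, 93.58, 83.27, 93.31],
}


def best_model_per_task(scores, models):
    """Map each task to the (model, score) pair with the highest score."""
    best = {}
    for task, row in scores.items():
        top = max(range(len(row)), key=row.__getitem__)
        best[task] = (models[top], row[top])
    return best


def margin_over_best_baseline(scores, models, target="QARiB"):
    """Target score minus the best other model's score, in percentage points."""
    idx = models.index(target)
    return {
        task: round(row[idx] - max(s for i, s in enumerate(row) if i != idx), 2)
        for task, row in scores.items()
    }


if __name__ == "__main__":
    margins = margin_over_best_baseline(SCORES, MODELS)
    for task, (model, score) in best_model_per_task(SCORES, MODELS).items():
        print(f"{task}: best={model} ({score:.2f}%), QARiB margin={margins[task]:+.2f}")
```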
+ ## Model Weights and Vocab Download
+
+ From the Hugging Face site: https://huggingface.co/qarib
+
+ ## Contacts
+
+ Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
+
+ ## Reference
+ ```
+ @article{abdelali2020qarib,
+ title={QARiB: QCRI Arabic and Dialectal BERT},
+ author={Abdelali, Ahmed and Hassan, Sabit and Mubarak, Hamdy and Darwish, Kareem and Samih, Younes},
+ link={https://github.com/qcri/QARIB},
+ year={2020}
+ }
+ ```
+