Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

## Training data

The Kaz-RoBERTa model was pretrained on the combination of two datasets:
- [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset): the Multi-Domain Bilingual Kazakh Dataset, a Kazakh-language dataset containing just over 24,883,808 unique texts from multiple domains.
- Conversational data: preprocessed dialogs between the customer support team of [Beeline KZ (Veon Group)](https://beeline.kz/) and its clients.

Together these datasets comprise 25 GB of text.
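
If you want to inspect the public portion of the corpus, MDBKD can be streamed from the Hub. The following is a minimal sketch, assuming the `datasets` library and the dataset id from the link above; field names may differ:

```python
# Minimal sketch: stream a few MDBKD examples without downloading the full corpus.
# The dataset id comes from the MDBKD link above.
from datasets import load_dataset

mdbkd = load_dataset(
    "kz-transformers/multidomain-kazakh-dataset",
    split="train",
    streaming=True,  # iterate lazily instead of fetching ~25 GB
)

for example in mdbkd.take(3):  # peek at the first three records
    print(example)
```
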
## Usage
You can use this model directly with a pipeline for masked language modeling:
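
The snippet below is a minimal sketch: the model id `kz-transformers/kaz-roberta-conversational` is assumed from this repository's namespace, so substitute the id shown on this model card if it differs.

```python
# Minimal sketch: fill-mask pipeline for Kaz-RoBERTa.
# The model id below is an assumption based on this repository's owner;
# replace it with the id shown at the top of this model card if needed.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="kz-transformers/kaz-roberta-conversational",
)

# <mask> is RoBERTa's default mask token.
# "Қазақстанның астанасы - <mask>." = "The capital of Kazakhstan is <mask>."
print(fill_mask("Қазақстанның астанасы - <mask>."))
```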