kz-transformers committed on
Commit
6150653
1 Parent(s): ee7e74a

Update README.md

Files changed (1): README.md +7 -0
README.md CHANGED
@@ -16,6 +16,13 @@ widget:
 Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

+## Training data
+
+The Kaz-RoBERTa model was pretrained on the union of two datasets:
+- [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset) (Multi-Domain Bilingual Kazakh Dataset), a Kazakh-language dataset containing over 24,883,808 unique texts from multiple domains.
+- Conversational data: preprocessed dialogs between the customer support team and clients of [Beeline KZ (Veon Group)](https://beeline.kz/).
+
+Together these datasets comprise 25 GB of text.
 ## Usage

 You can use this model directly with a pipeline for masked language modeling:
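The snippet itself is cut off by the diff context. A minimal sketch of the fill-mask pipeline the text describes — the model id `kz-transformers/kaz-roberta-conversational` and the example Kazakh sentence are assumptions, so check the model card:

```python
from transformers import pipeline

# Fill-mask pipeline; the model id is an assumption -- verify it on the model card.
fill_mask = pipeline("fill-mask", model="kz-transformers/kaz-roberta-conversational")

# Use the tokenizer's own mask token (RoBERTa-style models use "<mask>").
# Example sentence (an assumption): "Astana is Kazakhstan's <mask>."
text = f"Астана - Қазақстанның {fill_mask.tokenizer.mask_token}."
results = fill_mask(text)

for r in results:
    # Each candidate has a predicted token string and a score.
    print(r["token_str"], round(r["score"], 3))
```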