kz-transformers committed on
Commit
6150653
1 Parent(s): ee7e74a

Update README.md

Files changed (1): README.md +7 -0
README.md CHANGED
@@ -16,6 +16,13 @@ widget:
 Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.

+## Training data
+
+The Kaz-RoBERTa model was pretrained on the union of two datasets:
+- [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset) (Multi-Domain Bilingual Kazakh Dataset), a Kazakh-language dataset containing over 24,883,808 unique texts from multiple domains.
+- Conversational data: preprocessed dialogs between the customer support team and clients of [Beeline KZ (Veon Group)](https://beeline.kz/).
+
+Together these datasets comprise 25 GB of text.
 ## Usage

 You can use this model directly with a pipeline for masked language modeling:
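The snippet itself is cut off by the diff context. A minimal sketch of the fill-mask pipeline the text describes — the model id `kz-transformers/kaz-roberta-conversational` and the example Kazakh sentence are assumptions, so check the model card:

```python
from transformers import pipeline

# Fill-mask pipeline; the model id is an assumption -- verify it on the model card.
fill_mask = pipeline("fill-mask", model="kz-transformers/kaz-roberta-conversational")

# Use the tokenizer's own mask token (RoBERTa-style models use "<mask>").
# Example sentence (an assumption): "Astana is Kazakhstan's <mask>."
text = f"Астана - Қазақстанның {fill_mask.tokenizer.mask_token}."
results = fill_mask(text)

for r in results:
    # Each candidate has a predicted token string and a score.
    print(r["token_str"], round(r["score"], 3))
```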