---
license: apache-2.0
datasets:
- kz-transformers/multidomain-kazakh-dataset
language:
- kk
pipeline_tag: fill-mask
library_name: transformers
widget:
- text: "Әжібай Найманбайұлы — батыр. Албан тайпасының қызылбөрік руынан <mask>."
- text: "<mask> — Қазақстан Республикасының астанасы."
---

# Kaz-RoBERTa (base-sized model)

## Model description

Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective: the model randomly masks part of the input tokens and learns to predict them, which lets it build bidirectional representations of Kazakh text.

## Training data

Kaz-RoBERTa was pretrained on the union of two datasets:
- [MDBKD](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset): the Multi-Domain Bilingual Kazakh Dataset, a Kazakh-language corpus containing 24,883,808 unique texts from multiple domains.
- Conversational data: preprocessed dialogs between the Customer Support Team of [Beeline KZ (Veon Group)](https://beeline.kz/) and its clients.

Together these datasets amount to 25 GB of text.

## Training procedure

### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 52,000. The inputs of the model take pieces of 512 contiguous tokens that may span document boundaries. The beginning of a new document is marked with `<s>` and the end of one with `</s>`.

### Pretraining

The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512.

## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
#Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...
#  ...]
```
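
The checkpoint ships with the byte-level BPE tokenizer described in the Preprocessing section, so you can inspect how `<s>` and `</s>` wrap an encoded text. A minimal sketch; the example sentence is only illustrative:

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> enc = tokenizer("Сәлем, әлем!")
>>> tokenizer.convert_ids_to_tokens(enc.input_ids)
#Out: a list of byte-level BPE tokens wrapped in the special tokens '<s>' ... '</s>'
```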
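
Beyond fill-mask, the pretrained encoder can also be used as a feature extractor for downstream Kazakh NLP tasks. A minimal sketch using the generic `AutoModel` API; the example sentence is only illustrative:

```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> model = AutoModel.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> inputs = tokenizer("Сәлеметсіз бе!", return_tensors='pt')
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> outputs.last_hidden_state.shape
#Out: torch.Size([1, sequence_length, hidden_size]); one contextual vector per token
```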