---
datasets:
  - mlsum
  - batubayk/TR-News
  - csebuetnlp/xlsum
  - wiki_lingua
language:
  - tr
results:
  - task:
      type: text-summarization
    dataset:
      name: mlsum
      type: mlsum
    metrics:
      - name: ROUGE (R1/R2/RL)
        type: rouge
        value: 45.75/32.71/39.86
  - task:
      type: text-summarization
    dataset:
      name: batubayk/TR-News
      type: batubayk/TR-News
    metrics:
      - name: ROUGE (R1/R2/RL)
        type: rouge
        value: 41.97/28.26/36.69
  - task:
      type: text-summarization
    dataset:
      name: csebuetnlp/xlsum
      type: csebuetnlp/xlsum
    metrics:
      - name: ROUGE (R1/R2/RL)
        type: rouge
        value: 34.15/17.94/28.03
arxiv: 2403.01308
library_name: transformers
pipeline_tag: text2text-generation
---

# VBART Model Card

## Model Description

VBART is the first sequence-to-sequence model trained on Turkish corpora from scratch. It was developed by VNGRS in [More Information Needed].
When fine-tuned, the model is capable of text transformation tasks such as summarization, paraphrasing and title generation.

It scores better on many tasks while being much smaller than comparable models.

This repository contains the fine-tuned weights of VBART for the summarization task, trained on the Turkish sections of the mlsum, TRNews, XLSum and Wikilingua datasets.

- **Developed by:** VNGRS
- **Model type:** Transformer encoder-decoder, based on mBART
- **Language(s) (NLP):** Turkish
- **License:** [More Information Needed]
- **Finetuned from model:** VBART
- **Paper:** [arXiv:2403.01308](https://arxiv.org/abs/2403.01308)

## How to Get Started with the Model

Use the code below to get started with the model.
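The following is a minimal sketch that assumes the model can be loaded with the standard `transformers` sequence-to-sequence classes; the repository ID below is a placeholder and should be replaced with this model's actual Hub ID.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# NOTE: placeholder repository ID; replace with this model's actual Hub ID.
model_id = "vngrs-ai/VBART-Summarization"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Turkish text to be summarized.
text = "..."

# Adjust max_length to the model's maximum input length.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```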

## Training Details

### Training Data

The base model's training data is a filtered, mixed corpus made of the Turkish portions of the OSCAR-2201 and mC4 datasets. These datasets consist of unstructured web-crawl documents; more information can be found on their respective pages. The data was then filtered using a set of heuristics and rules, which are explained in the appendix of our paper.

The fine-tuning data consists of the Turkish sections of the mlsum, TRNews, XLSum and Wikilingua datasets, as mentioned above.

### Limitations

This model is fine-tuned for the summarization task. It is not intended to be used in any other setting, and it cannot be fine-tuned to another task while retaining the full performance of the base model.

### Training Procedure

The base model was pretrained for 30 days, amounting to 23 epochs of training in total. Total number of training tokens: [More Information Needed].

#### Hardware

- **GPUs:** 8 x Nvidia A100 (80 GB)

#### Software

- TensorFlow

#### Hyperparameters

**Pretraining**
- **Training regime:** fp16 mixed precision
- **Training objective:** Sentence permutation and span masking (mask span lengths sampled from a Poisson distribution with $\lambda = 3.5$, masking 30% of tokens in total); a simplified sketch of the masking procedure follows this list
- **Optimizer:** Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-6}$)
- **Scheduler:** Linear decay scheduler (20,000 warm-up steps); see the schedule sketch after the finetuning settings below
- **Dropout:** 0.1 (dropped to 0.05 and then 0 in the last 160k steps)
- **Learning rate:** $5 \times 10^{-6}$
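The snippet below is a rough, simplified illustration of the span-masking objective (BART-style text infilling), not the actual training code: spans whose lengths are drawn from a Poisson distribution with $\lambda = 3.5$ are replaced by a single mask token until roughly 30% of the tokens are masked. The function name and its defaults are hypothetical.

```python
import numpy as np

def mask_spans(token_ids, mask_id, mask_ratio=0.30, poisson_lambda=3.5, seed=None):
    """Illustrative BART-style infilling: replace whole spans with one mask token."""
    rng = np.random.default_rng(seed)
    out = list(token_ids)
    num_to_mask = int(round(len(out) * mask_ratio))
    masked = 0
    while masked < num_to_mask and len(out) > 1:
        # Span length drawn from Poisson(lambda), clipped to the remaining budget.
        span_len = int(rng.poisson(poisson_lambda))
        span_len = max(1, min(span_len, num_to_mask - masked, len(out) - 1))
        start = int(rng.integers(0, len(out) - span_len + 1))
        out = out[:start] + [mask_id] + out[start + span_len:]
        masked += span_len
    return out

# Toy example: mask a 20-token sequence, using 99 as the mask token id.
print(mask_spans(list(range(20)), mask_id=99, seed=0))
```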
**Finetuning**
- **Training regime:** fp16 mixed precision
- **Optimizer:** Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-6}$)
- **Scheduler:** Linear decay scheduler
- **Dropout:** 0.1
- **Learning rate:** $5 \times 10^{-5}$
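Given that training used TensorFlow (see Software above), a minimal sketch of how the optimizer and a linear warm-up/decay schedule could be configured is shown below. The `LinearWarmupDecay` class, the `total_steps` value and the decay-to-zero endpoint are assumptions made for illustration; only the peak learning rate, warm-up steps and Adam parameters come from the lists above.

```python
import tensorflow as tf

class LinearWarmupDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warm-up to the peak learning rate, then linear decay towards zero."""

    def __init__(self, peak_lr, warmup_steps, total_steps):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.peak_lr * step / self.warmup_steps
        decay = self.peak_lr * (self.total_steps - step) / (self.total_steps - self.warmup_steps)
        return tf.maximum(0.0, tf.minimum(warmup, decay))

# total_steps is a placeholder; peak LR, warm-up steps and Adam betas follow the lists above.
schedule = LinearWarmupDecay(peak_lr=5e-6, warmup_steps=20_000, total_steps=1_000_000)
optimizer = tf.keras.optimizers.Adam(
    learning_rate=schedule, beta_1=0.9, beta_2=0.98, epsilon=1e-6
)
```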

## Metrics

ROUGE scores (R-1 / R-2 / R-L) of this fine-tuned model on the evaluation datasets, as listed in the metadata above:

| Dataset | R-1 | R-2 | R-L |
|---|---|---|---|
| mlsum | 45.75 | 32.71 | 39.86 |
| TR-News | 41.97 | 28.26 | 36.69 |
| XLSum | 34.15 | 17.94 | 28.03 |

## License

## Citation

```bibtex
@misc{VBART,
      title={VBART: The Turkish LLM},
      author={Melikşah Türker and Mehmet Erdi Arı and Aydın Han},
      year={2024},
      eprint={2403.01308},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```