erdiari commited on
Commit
2c33b77
1 Parent(s): e0e6315

Create README.md

Files changed (1)
  1. README.md +110 -0
README.md ADDED
@@ -0,0 +1,110 @@
+ ---
+ datasets:
+ - mlsum
+ - batubayk/TR-News
+ - csebuetnlp/xlsum
+ - wiki_lingua
+ language:
+ - tr
+ results:
+ - task:
+     type: text-summarization
+   dataset:
+     name: mlsum
+     type: mlsum
+   metrics:
+   - name: rouge(r1/r2/rl)
+     type: rouge
+     value: 45.75/32.71/39.86
+ - task:
+     type: text-summarization
+   dataset:
+     name: batubayk/TR-News
+     type: batubayk/TR-News
+   metrics:
+   - name: rouge(r1/r2/rl)
+     type: rouge
+     value: 41.97/28.26/36.69
+ - task:
+     type: text-summarization
+   dataset:
+     name: csebuetnlp/xlsum
+     type: csebuetnlp/xlsum
+   metrics:
+   - name: rouge(r1/r2/rl)
+     type: rouge
+     value: 34.15/17.94/28.03
+ arxiv: 2403.01308
+ library_name: transformers
+ pipeline_tag: text2text-generation
+ ---
+ # VBART Model Card
+
+ ## Model Description
+
+ VBART is the first sequence-to-sequence model trained from scratch on Turkish corpora. It was developed by VNGRS (TODO: confirm when).
+ With fine-tuning, the model can perform text-transformation tasks such as summarization, paraphrasing, and title generation.
+
+ It scores better on many tasks while being much smaller than other implementations.
+
+ This repository contains VBART weights fine-tuned for summarization on the Turkish sections of [mlsum](https://huggingface.co/datasets/mlsum), [TRNews](https://huggingface.co/datasets/batubayk/TR-News), [XLSum](https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/turkish) and [Wikilingua](https://huggingface.co/datasets/wiki_lingua).
+
+ - **Developed by:** [VNGRS](https://vngrs.com/)
+ - **Model type:** Transformer encoder-decoder based on mBART
+ - **Language(s) (NLP):** Turkish
+ - **License:** [More Information Needed]
+ - **Finetuned from model:** VBART
+ - **Paper:** [arXiv](https://arxiv.org/abs/2403.01308)
+ ## How to Get Started with the Model
+ Use the code below to get started with the model.
+ TODO: add a code sample once the model has been uploaded.
+ [More Information Needed]
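Until an official sample is published, the following is a minimal sketch of how a checkpoint like this one is typically loaded with the `transformers` library named in the front matter. It is not the authors' code. The `summarize` helper, the `model_id` argument (pass the Hub id of this repository), and both token limits are assumptions made for illustration, as is the availability of PyTorch-compatible weights.

```python
def summarize(text, model_id, max_input_tokens=1024, max_new_tokens=128):
    """Summarize Turkish text with a seq2seq checkpoint from the Hugging Face Hub.

    `model_id` is a placeholder: pass the Hub id of this repository.
    Both token limits are illustrative defaults, not values from the model card.
    """
    # Imported here so that defining the helper does not require transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    encoded = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    # Pass only the tensors generate() needs, in case the tokenizer returns extras.
    output_ids = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `summarize(article_text, model_id)` with a Turkish article and this repository's id should return a single summary string; generation settings such as beam search are left at the checkpoint's defaults.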
+
+ ## Training Details
+
+ ### Training Data
+ The base model was pretrained on a filtered, mixed corpus made of the Turkish parts of the [OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and [mC4](https://huggingface.co/datasets/mc4) datasets. These datasets consist of unstructured web-crawl documents; more information can be found on their respective pages. The data was then filtered using a set of heuristics and rules, explained in the appendix of our [paper](https://arxiv.org/abs/2403.01308).
+
+ The fine-tuning data consists of the Turkish sections of [mlsum](https://huggingface.co/datasets/mlsum), [TRNews](https://huggingface.co/datasets/batubayk/TR-News), [XLSum](https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/turkish) and [Wikilingua](https://huggingface.co/datasets/wiki_lingua), as mentioned above.
+
+
+ ### Limitations
+ This model is fine-tuned for the summarization task. It is not intended to be used for any other purpose and cannot be fine-tuned for another task with the full performance of the base model.
+
+ ### Training Procedure
+ The base model was pretrained for 30 days, for a total of 23 epochs. TODO: state the total number of training tokens.
+ #### Hardware
+ - **GPUs**: 8x NVIDIA A100 80 GB
+ #### Software
+ - TensorFlow
+ #### Hyperparameters
+ ##### Pretraining
+ - **Training regime:** fp16 mixed precision
+ - **Training objective**: Sentence permutation and span masking (mask lengths sampled from a Poisson distribution with \(\lambda = 3.5\), masking 30% of the data in total)
+ - **Optimizer**: Adam (\(\beta_{1} = 0.9, \beta_{2} = 0.98, \epsilon = 1e-6\))
+ - **Scheduler**: Linear decay (20,000 warm-up steps)
+ - **Dropout**: 0.1 (reduced to 0.05 and then 0 in the last 160k steps)
+ - **Learning rate**: \(5e-6\)
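The pretraining objective listed above can be illustrated with a small sketch. It assumes a BART-style reading of the objective: span lengths are drawn from a Poisson distribution with λ = 3.5, each selected span is replaced by a single mask token, and about 30% of the tokens are masked in total. The function names, the `<mask>` string, and the choice to skip zero-length spans are illustrative assumptions, not the authors' TensorFlow pipeline.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def mask_spans(tokens, mask_ratio=0.30, lam=3.5, mask_token="<mask>", rng=None):
    """Replace non-touching spans with one mask token each (assumed BART-style)."""
    if rng is None:
        rng = random.Random()
    budget = round(len(tokens) * mask_ratio)  # number of tokens left to mask
    masked = [False] * len(tokens)
    attempts = 0
    while budget > 0 and attempts < 10 * len(tokens):
        attempts += 1
        length = min(sample_poisson(lam, rng), budget)
        if length == 0:
            continue  # zero-length spans are skipped in this sketch
        start = rng.randrange(0, len(tokens) - length + 1)
        # Reject spans that overlap or touch an already masked span.
        if any(masked[max(0, start - 1):start + length + 1]):
            continue
        for i in range(start, start + length):
            masked[i] = True
        budget -= length
    # Collapse every masked run into a single mask token.
    out, prev_masked = [], False
    for token, is_masked in zip(tokens, masked):
        if not is_masked:
            out.append(token)
        elif not prev_masked:
            out.append(mask_token)
        prev_masked = is_masked
    return out


def permute_sentences(sentences, rng=None):
    """Return the sentences of a document in random order."""
    if rng is None:
        rng = random.Random()
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled
```

In this reading, a training example would be built by permuting a document's sentences, tokenizing, and masking spans, with the original text as the decoder target.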
+ ##### Finetuning
+ - **Training regime:** fp16 mixed precision
+ - **Optimizer**: Adam optimizer (\(\beta_{1} = 0.9, \beta_{2} = 0.98, \epsilon = 1e-6\))
+ - **Scheduler**: Linear decay scheduler
+ - **Dropout**: 0.1
+ - **Learning rate**: \(5e-5\)
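Both hyperparameter lists name a linear decay scheduler, with 20,000 warm-up steps for pretraining. A minimal sketch of one common reading follows. The linear warm-up from zero, the decay to zero, the `total_steps` value, and the treatment of the listed learning rate as the peak are all assumptions; the model card states none of them.

```python
def linear_warmup_decay(step, peak_lr=5e-6, warmup_steps=20_000, total_steps=1_000_000):
    """Learning rate at `step`: linear warm-up to peak_lr, then linear decay to 0.

    total_steps is a hypothetical value chosen for illustration only.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

For fine-tuning, the same function could be called with `peak_lr=5e-5`; whether fine-tuning used any warm-up is not stated, so `warmup_steps` would have to be chosen by the user.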
+ #### Metrics
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f8b3c84588fe31f435a92b/QCef-9yumzG2sHksOGcUs.png)
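The scores in the front matter are reported as ROUGE-1/ROUGE-2/ROUGE-L. As a rough illustration of what these metrics measure, here is a minimal sketch of the F1 variants over whitespace tokens. The tokenization and the function names are assumptions; the reported numbers were presumably produced with a standard ROUGE implementation and will not match this sketch exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall, 0.0 when there is no overlap."""
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(reference, candidate, n=1):
    """ROUGE-N F1: clipped n-gram overlap between reference and candidate."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence."""
    ref, cand = reference.split(), candidate.split()
    return f1(lcs_length(ref, cand), len(cand), len(ref))
```

The functions return values between 0 and 1; the figures in the front matter appear to be on a 0 to 100 scale.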
+
+ ## License
+
+
+ ## Citation
+ ```
+ @misc{VBART,
+ title={VBART: The Turkish LLM},
+ author={Melikşah Türker and Mehmet Erdi Arı and Aydın Han},
+ year={2024},
+ eprint={2403.01308},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```