meliksahturker committed
Commit 1768a06 • 1 Parent(s): 1b9b3c4
Update README.md
README.md CHANGED
@@ -40,8 +40,8 @@ VBART is the first sequence-to-sequence LLM pre-trained on Turkish corpora from
 The model is capable of conditional text generation tasks such as text summarization, paraphrasing, and title generation when fine-tuned.
 It outperforms its multilingual counterparts, albeit being much smaller than other implementations.
 
-VBART-XLarge is
-VBART-XLarge
+VBART-XLarge is created by adding extra Transformer layers between the layers of VBART-Large. Hence, it was able to transfer learned weights from the smaller model while doubling its number of layers.
+VBART-XLarge slightly improves the results compared to VBART-Large, albeit by small margins.
 
 This repository contains fine-tuned TensorFlow and Safetensors weights of VBART for question-answering and generation tasks described in the [paper](https://doi.org/10.55730/1300-0632.3914).
 
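The added lines above summarize the depth-doubling initialization used for VBART-XLarge. The snippet below is only a rough illustration of the interleaving idea, not the actual VBART code: the layer container and the `make_new_layer` factory are hypothetical, and whether the inserted layers start from random weights or from copies of their neighbours is not stated here.

```python
import copy

def enlarge_depth(pretrained_layers, make_new_layer):
    """Return a stack twice as deep: keep every pretrained layer and insert
    one new layer right after it, so learned weights transfer directly while
    the layer count doubles."""
    enlarged = []
    for layer in pretrained_layers:
        enlarged.append(layer)                  # weights carried over from VBART-Large
        enlarged.append(make_new_layer(layer))  # extra layer added in between
    return enlarged

# Toy example with placeholder "layers"; the real model interleaves Transformer blocks.
large_stack = [f"large_layer_{i}" for i in range(12)]
xlarge_stack = enlarge_depth(large_stack, lambda neighbour: copy.deepcopy(neighbour) + "_new")
assert len(xlarge_stack) == 2 * len(large_stack)
```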
@@ -96,7 +96,7 @@ The fine-tuning dataset is [TQuAD](https://github.com/obss/turkish-question-gene
 This model is fine-tuned for question-answering and question-generation tasks with specific prompts. It is not intended to be used in any other case and cannot be fine-tuned to any other task with the full performance of the base model. It is also not guaranteed that this model will work without the specified prompts.
 
 ### Training Procedure
-Pre-trained for
+Pre-trained for 8 days and for a total of 84B tokens. Finally, fine-tuned for 55 epochs.
 #### Hardware
 - **GPUs**: 8 x Nvidia A100-80 GB
 #### Software
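Because the checkpoint only behaves as intended with its task prompts, loading and prompting it would typically look like the sketch below. This is a hedged example, not the documented usage: the repository id and the prompt prefix are placeholders I am assuming here and should be replaced with the values given in the model card.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder repo id; the exact Hub id is not stated in this diff.
model_id = "vngrs-ai/VBART-XLarge-QAQG"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

context = "Turkish paragraph to ask questions about."
# Hypothetical prompt prefix; use the exact prompt format from the README.
prompt = f"<task prefix> {context}"

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```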
@@ -107,17 +107,11 @@ Pre-trained for 30 days and for a total of 708B tokens. Further pretrained enlarg
 - **Training objective**: Sentence permutation and span masking (using mask lengths sampled from Poisson distribution λ=3.5, masking 30% of tokens)
 - **Optimizer**: Adam optimizer (β1 = 0.9, β2 = 0.98, ε = 1e-6)
 - **Scheduler**: Custom scheduler from the original Transformers paper (20,000 warm-up steps)
-- **
+- **Weight Initialization**: Model Enlargement from VBART-Large. See the related section in the [paper](https://arxiv.org/abs/2403.01308) for the details.
+- **Dropout**: 0.1 (dropped to 0.05 and then to 0 in the last 80k and 80k steps, respectively)
 - **Initial Learning rate**: 5e-6
-- **Training tokens**: 708B
-
-##### Experimental Model Enlargement
-Same as pretraining but;
-- **Scheduler**: with 5,000 warm-up steps
-- **Dropout**: 0.1 (dropped to 0.05 and then to 0 in the last 160k and 80k steps, respectively)
 - **Training tokens**: 84B
 
-
 ##### Fine-tuning
 - **Training regime:** fp16 mixed precision
 - **Optimizer**: Adam optimizer (β1 = 0.9, β2 = 0.98, ε = 1e-6)
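For readers unfamiliar with the BART-style objective listed in the hunk above, the snippet below sketches span masking with Poisson-sampled lengths (λ = 3.5) up to a 30% masking budget. It is only an illustration of the idea under simplifying assumptions (no sentence permutation step, no subword handling, a single <mask> token per span), not the preprocessing code used for VBART.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample (Knuth's method); numpy.random.poisson
    would normally be used instead."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1

def span_mask(tokens, mask_token="<mask>", lam=3.5, mask_ratio=0.30, rng=random):
    """Replace whole spans with a single mask token until roughly
    mask_ratio of the original tokens have been masked."""
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    masked = 0
    while masked < budget:
        length = max(1, min(sample_poisson(lam, rng), budget - masked))
        start = rng.randrange(0, max(1, len(tokens) - length + 1))
        tokens[start:start + length] = [mask_token]
        masked += length
    return tokens

print(span_mask("VBART is pre-trained with a span masking denoising objective .".split()))
```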
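The scheduler entry above ("custom scheduler from the original Transformers paper", 20,000 warm-up steps) is commonly read as the inverse-square-root warm-up schedule from "Attention Is All You Need". The sketch below shows that formula only; the hidden size d_model is a placeholder since it is not given in this section, and how the schedule interacts with the listed initial learning rate of 5e-6 is not spelled out here.

```python
def transformer_lr(step, d_model=1024, warmup_steps=20_000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    (linear warm-up, then inverse-square-root decay)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Adam as listed above: beta_1 = 0.9, beta_2 = 0.98, epsilon = 1e-6.
# With tf.keras this would roughly correspond to:
#   tf.keras.optimizers.Adam(learning_rate=..., beta_1=0.9, beta_2=0.98, epsilon=1e-6)
for step in (1, 1_000, 20_000, 100_000):
    print(step, transformer_lr(step))
```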