erdiari committed
Commit 15b63a3
1 Parent(s): a1cf0bf

Update README.md

Files changed (1)
  1. README.md +6 -5
README.md CHANGED
@@ -54,7 +54,7 @@ The fine-tuning dataset is a mixture of [OpenSubtitles](https://huggingface.co/d
 This model is fine-tuned for paraphrasing tasks and works at the sentence level only. It is not intended to be used in any other case and cannot be fine-tuned to any other task with the full performance of the base model. It is also not guaranteed that this model will work without the specified prompts.
 
 ### Training Procedure
-Pre-trained for 30 days on a total of 708B tokens. Fine-tuned for 25 epochs.
+Pre-trained for 8 days on a total of 84B tokens. Fine-tuned for 25 epochs.
 #### Hardware
 - **GPUs**: 8 x Nvidia A100-80 GB
 #### Software
@@ -65,17 +65,18 @@ Pre-trained for 30 days on a total of 708B tokens. Fine-tuned for 25 epochs.
 - **Training objective**: Sentence permutation and span masking (using mask lengths sampled from a Poisson distribution with λ=3.5, masking 30% of tokens)
 - **Optimizer**: Adam optimizer (β1 = 0.9, β2 = 0.98, Ɛ = 1e-6)
 - **Scheduler**: Custom scheduler from the original Transformers paper (20,000 warm-up steps)
-- **Dropout**: 0.1 (dropped to 0.05 and then to 0 in the last 165k and 205k steps, respectively)
+- **Weight Initialization**: Model Enlargement from VBART-Large. See the related section in the [paper](https://arxiv.org/abs/2403.01308) for the details.
+- **Dropout**: 0.1 (dropped to 0.05 and then to 0 in the last 80k and 80k steps, respectively)
 - **Initial Learning rate**: 5e-6
-- **Training tokens**: 708B
+- **Training tokens**: 84B
 
 ##### Fine-tuning
 - **Training regime**: fp16 mixed precision
 - **Optimizer**: Adam optimizer (β1 = 0.9, β2 = 0.98, Ɛ = 1e-6)
 - **Scheduler**: Linear decay scheduler
 - **Dropout**: 0.1
-- **Learning rate**: 1e-5
-- **Fine-tune epochs**: 25
+- **Learning rate**: 5e-6
+- **Fine-tune epochs**: 55
 
 #### Metrics
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f8b3c84588fe31f435a92b/nrM_FA3bGk9NAYW_044HW.png)
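The pre-training objective listed above (sentence permutation plus span masking, with span lengths drawn from a Poisson distribution with λ=3.5 and roughly 30% of tokens masked) is a BART-style denoising setup. The sketch below is only an illustration of that noising step under stated assumptions; the function, token names, and toy input are not taken from the VBART training code.

```python
import numpy as np

MASK_TOKEN = "<mask>"  # placeholder; the actual tokenizer's mask token may differ

def noise_example(sentences, mask_ratio=0.30, poisson_lambda=3.5, rng=None):
    """BART-style noising sketch: permute sentence order, then replace token
    spans (span lengths ~ Poisson(poisson_lambda)) with a single mask token
    until roughly mask_ratio of the tokens have been masked."""
    rng = rng or np.random.default_rng()

    # 1) Sentence permutation: shuffle the sentences of the document.
    sentences = list(sentences)
    rng.shuffle(sentences)
    tokens = [tok for sent in sentences for tok in sent.split()]

    # 2) Span masking: sample span lengths from Poisson(3.5) and replace each
    #    span with one mask token until about 30% of the tokens are covered.
    budget = int(round(mask_ratio * len(tokens)))
    masked = 0
    while masked < budget and len(tokens) > 1:
        span = max(1, min(int(rng.poisson(poisson_lambda)), budget - masked))
        start = int(rng.integers(0, max(1, len(tokens) - span)))
        tokens = tokens[:start] + [MASK_TOKEN] + tokens[start + span:]
        masked += span
    return " ".join(tokens)

# Toy usage with a two-sentence document.
print(noise_example(["bu bir deneme cümlesi .", "model cümleyi yeniden yazar ."]))
```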
 
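The pre-training optimizer and scheduler bullets (Adam with β1 = 0.9, β2 = 0.98, Ɛ = 1e-6, and the learning-rate schedule from the original Transformer paper with 20,000 warm-up steps) could be wired up in PyTorch roughly as follows. The hidden size, the base-rate handling, and the toy loop are assumptions for illustration; the card also lists an initial learning rate of 5e-6, and how that value interacts with the schedule is not specified here.

```python
import torch

D_MODEL = 1024         # hypothetical hidden size, not stated in this card
WARMUP_STEPS = 20_000  # from the card: 20,000 warm-up steps

model = torch.nn.Linear(D_MODEL, D_MODEL)  # stand-in for the real seq2seq model

# Adam with the betas and epsilon listed in the card.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-6)

def noam_lr(step: int) -> float:
    """Schedule from 'Attention Is All You Need':
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)
    return D_MODEL ** -0.5 * min(step ** -0.5, step * WARMUP_STEPS ** -1.5)

# LambdaLR multiplies the base lr (set to 1.0 above) by noam_lr(step).
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(5):  # toy training loop
    optimizer.zero_grad()
    loss = model(torch.randn(2, D_MODEL)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr())
```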
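The fine-tuning bullets (fp16 mixed precision, Adam with the same betas and epsilon, a linear decay schedule, and the learning rate and epoch count updated in this commit) map onto Hugging Face `Seq2SeqTrainingArguments` as sketched below. Only the listed hyperparameters come from the card; the output directory and batch size are placeholders, and dropout (0.1) would be set on the model config rather than in the training arguments.

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters mirror the fine-tuning list in this commit; output_dir and
# batch size are illustrative placeholders.
args = Seq2SeqTrainingArguments(
    output_dir="vbart-paraphrasing-finetune",  # hypothetical path
    fp16=True,                      # "fp16 mixed precision"
    learning_rate=5e-6,             # value after this commit
    num_train_epochs=55,            # value after this commit
    lr_scheduler_type="linear",     # "Linear decay scheduler"
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    per_device_train_batch_size=8,  # assumption, not stated in the card
)
print(args.learning_rate, args.num_train_epochs)
```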