juewang committed on
Commit ad66547
1 Parent(s): 2d4e402

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -109,8 +109,8 @@ model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")

  ## UL2 Training Objective

- We train GPT-J using UL2 training objective [1][2].
- The usual GPT model, including GPT-J, uses causal mask (as shown in the lower left) to do autoregressive generation, so for each token, it can only see the context information before itself.
+ We train GPT-JT using the UL2 training objective [1][2].
+ The original GPT-J uses a causal mask (as shown in the lower left) to perform autoregressive generation, so each token can only see its previous context.
  In order to fully leverage the context information, we continue training GPT-J with the UL2 training objective and use a causal mask with prefix (as shown in the lower right) -- bidirectional attention for the prompt / input and causal attention for token generation.
  Intuitively, being able to see context bidirectionally might improve downstream tasks that require this information.

@@ -136,7 +136,7 @@ Furthermore, we leverage a large collection of data, including NI, P3, COT, the
  - [Natural-Instructions](https://github.com/allenai/natural-instructions)
  - [P3](https://huggingface.co/datasets/Muennighoff/P3)
  - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
- - [the pile](https://huggingface.co/datasets/the_pile)
+ - [the Pile](https://huggingface.co/datasets/the_pile)

  Specifically, we first conduct training for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens with a mixture of the above datasets: 5% of COT, 20% of P3, 20% of NI, and 55% of the Pile.
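
The hunk above describes the causal-with-prefix attention pattern only in words. As a rough illustration (not code from this repository), here is a minimal NumPy sketch of the two masks being contrasted, assuming the convention that `mask[i, j] == True` means token `i` may attend to token `j`, and that the first `prefix_len` tokens are the prompt / input:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Standard causal mask: position i attends only to positions j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_causal_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Causal mask with prefix: the first prefix_len positions (the prompt/input)
    attend to each other bidirectionally; later positions remain causal."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix block
    return mask

# Example: 6 tokens, of which the first 3 are the prompt/input.
print(causal_mask(6).astype(int))            # "lower left" pattern: strictly causal
print(prefix_causal_mask(6, 3).astype(int))  # "lower right" pattern: bidirectional prefix
```

With the prefix variant, every prompt token can see the whole prompt, while generated tokens still attend only to what precedes them, which is exactly the bidirectional-prompt / causal-generation split the README describes.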
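As a quick back-of-the-envelope check of the token budget quoted in the last context line (the fractions and counts are copied from the text; the snippet itself is only illustrative):

```python
# Token counts and mixture fractions as stated in the README text above.
ul2_pile_tokens = 2.62e9   # first stage: UL2 loss on the Pile
mixture_tokens = 0.92e9    # second stage: mixture of the listed datasets
mixture = {"COT": 0.05, "P3": 0.20, "NI": 0.20, "Pile": 0.55}

assert abs(sum(mixture.values()) - 1.0) < 1e-9  # the fractions sum to 100%
for name, frac in mixture.items():
    print(f"{name}: ~{frac * mixture_tokens / 1e9:.3f}B tokens")
print(f"total: ~{(ul2_pile_tokens + mixture_tokens) / 1e9:.2f}B tokens")  # ~3.54B
```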