juewang committed on
Commit ad66547
1 Parent(s): 2d4e402

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -109,8 +109,8 @@ model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")

  ## UL2 Training Objective

- We train GPT-J using UL2 training objective [1][2].
- The usual GPT model, including GPT-J, uses causal mask (as shown in the lower left) to do autoregressive generation, so for each token, it can only see the context information before itself.
+ We train GPT-JT using the UL2 training objective [1][2].
+ The original GPT-J uses a causal mask (as shown in the lower left) to perform autoregressive generation, so each token can only see its previous context.
  In order to fully leverage the context information, we continue training GPT-J with the UL2 training objective and use a causal mask with prefix (as shown in the lower right) -- bidirectional attention for the prompt / input and causal attention for token generation.
  Intuitively, being able to see context bidirectionally might improve downstream tasks that require this information.

@@ -136,7 +136,7 @@ Furthermore, we leverage a large collection of data, including NI, P3, COT, the
  - [Natural-Instructions](https://github.com/allenai/natural-instructions)
  - [P3](https://huggingface.co/datasets/Muennighoff/P3)
  - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
- - [the pile](https://huggingface.co/datasets/the_pile)
+ - [the Pile](https://huggingface.co/datasets/the_pile)

  Specifically, we first conduct training for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens with a mixture of the above datasets: 5% of COT, 20% of P3, 20% of NI, and 55% of the Pile.
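
The hunk above describes the causal-with-prefix attention pattern only in words. As a rough illustration (not code from this repository), here is a minimal NumPy sketch of the two masks being contrasted, assuming the convention that `mask[i, j] == True` means token `i` may attend to token `j`, and that the first `prefix_len` tokens are the prompt / input:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Standard causal mask: position i attends only to positions j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_causal_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Causal mask with prefix: the first prefix_len positions (the prompt/input)
    attend to each other bidirectionally; later positions remain causal."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix block
    return mask

# Example: 6 tokens, of which the first 3 are the prompt/input.
print(causal_mask(6).astype(int))            # "lower left" pattern: strictly causal
print(prefix_causal_mask(6, 3).astype(int))  # "lower right" pattern: bidirectional prefix
```

With the prefix variant, every prompt token can see the whole prompt, while generated tokens still attend only to what precedes them, which is exactly the bidirectional-prompt / causal-generation split the README describes.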
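As a quick back-of-the-envelope check of the token budget quoted in the last context line (the fractions and counts are copied from the text; the snippet itself is only illustrative):

```python
# Token counts and mixture fractions as stated in the README text above.
ul2_pile_tokens = 2.62e9   # first stage: UL2 loss on the Pile
mixture_tokens = 0.92e9    # second stage: mixture of the listed datasets
mixture = {"COT": 0.05, "P3": 0.20, "NI": 0.20, "Pile": 0.55}

assert abs(sum(mixture.values()) - 1.0) < 1e-9  # the fractions sum to 100%
for name, frac in mixture.items():
    print(f"{name}: ~{frac * mixture_tokens / 1e9:.3f}B tokens")
print(f"total: ~{(ul2_pile_tokens + mixture_tokens) / 1e9:.2f}B tokens")  # ~3.54B
```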