AI4PD
/

ZymCTRL

@@ -37,7 +37,9 @@ output = model.generate(input_ids, top_k=8, repetition_penalty=1.2, max_length=1
 This alternative option to the zero-shot generation permits further improve the model's confidence for EC number with few members. User-defined training and validation files containing the sequences of interest are provided to the model. After a short update of the model's weights, ZymCTRL will generate sequences that follow the input properties. This might not be necessary in cases where the model has already seen many sequences per EC class.
-To create the validation and training file, it is necessary to (1) remove the FASTA headers for each sequence, (2) prepare the sequences in the format: EC number<sep><start>S E Q U E N C E<end><|endoftext|> and (3) split the originating dataset into training and validation files (this is often done with the ratio 90/10, 80/20 or 95/5). Then, to finetune the model to the input sequences, we can use the example below. Here we show a learning rate of 1e-06, but ideally, the learning rate should be optimised in separate runs. After training, the finetuned model will be stored in the ./output folder. Lastly, ZymCTRL can generate the tailored sequences as shown in Example 1:
 ```
 python run_clm.py --model_name_or_path nferruz/ZymCTRL --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ZymCTRL

 This alternative option to the zero-shot generation permits further improve the model's confidence for EC number with few members. User-defined training and validation files containing the sequences of interest are provided to the model. After a short update of the model's weights, ZymCTRL will generate sequences that follow the input properties. This might not be necessary in cases where the model has already seen many sequences per EC class.
+To create the validation and training file, it is necessary to
+(1) remove the FASTA headers for each sequence,
+(2) prepare the sequences in the format: `EC number<sep><start>S E Q U E N C E<end><|endoftext|>` and (3) split the originating dataset into training and validation files (this is often done with the ratio 90/10, 80/20 or 95/5). Then, to finetune the model to the input sequences, we can use the example below. Here we show a learning rate of 1e-06, but ideally, the learning rate should be optimised in separate runs. After training, the finetuned model will be stored in the ./output folder. Lastly, ZymCTRL can generate the tailored sequences as shown in Example 1:
 ```
 python run_clm.py --model_name_or_path nferruz/ZymCTRL --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ZymCTRL