nferruz committed on
Commit a1f80ad
1 Parent(s): 4c0e7d1

Update README.md

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -37,7 +37,9 @@ output = model.generate(input_ids, top_k=8, repetition_penalty=1.2, max_length=1
 
  This alternative option to the zero-shot generation permits further improve the model's confidence for EC number with few members. User-defined training and validation files containing the sequences of interest are provided to the model. After a short update of the model's weights, ZymCTRL will generate sequences that follow the input properties. This might not be necessary in cases where the model has already seen many sequences per EC class.
 
- To create the validation and training file, it is necessary to (1) remove the FASTA headers for each sequence, (2) prepare the sequences in the format: EC number<sep><start>S E Q U E N C E<end><|endoftext|> and (3) split the originating dataset into training and validation files (this is often done with the ratio 90/10, 80/20 or 95/5). Then, to finetune the model to the input sequences, we can use the example below. Here we show a learning rate of 1e-06, but ideally, the learning rate should be optimised in separate runs. After training, the finetuned model will be stored in the ./output folder. Lastly, ZymCTRL can generate the tailored sequences as shown in Example 1:
+ To create the validation and training file, it is necessary to
+ (1) remove the FASTA headers for each sequence,
+ (2) prepare the sequences in the format: `EC number<sep><start>S E Q U E N C E<end><|endoftext|>` and (3) split the originating dataset into training and validation files (this is often done with the ratio 90/10, 80/20 or 95/5). Then, to finetune the model to the input sequences, we can use the example below. Here we show a learning rate of 1e-06, but ideally, the learning rate should be optimised in separate runs. After training, the finetuned model will be stored in the ./output folder. Lastly, ZymCTRL can generate the tailored sequences as shown in Example 1:
 
  ```
  python run_clm.py --model_name_or_path nferruz/ZymCTRL --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ZymCTRL