---
license: cc-by-nc-sa-4.0
tags:
- biology
- protein
- protein language model
- protein embedding
datasets:
- agemagician/uniref50
---

# Important

The model will be uploaded soon, please stay tuned.

# ANKH2-Large model

Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/2301.06568) and first released in
[this repository](https://github.com/agemagician/Ankh). This model is trained on uppercase amino acids: it only works with capital letter amino acids.


## Model description

ANKH2-Large is based on the `ANKH-Large` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on raw protein sequences only, with no human labelling of any kind (which is why it can use lots of
publicly available data), using an automatic process to generate inputs and labels from those protein sequences.

Two important differences between this ANKH2-Large model and the original ANKH-Large version are:
1. The model was trained for a larger number of epochs.
2. The activation function was changed to SiLU.

It has been shown that the features extracted from this self-supervised model (LM embeddings) capture important biophysical properties governing protein shape.
This implies that the model has learned some of the grammar of the language of life as realized in protein sequences.

## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.
We have noticed that on some tasks you can gain more accuracy by fine-tuning the model with LoRA rather than using it as a feature extractor.
We have also noticed that for feature extraction, it is better to use the features extracted from the encoder rather than from the decoder.
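
As a rough illustration of the LoRA route mentioned above, the sketch below attaches low-rank adapters with the `peft` library. It is a minimal example under stated assumptions: the checkpoint id, the T5-style target module names `q`/`v`, and the LoRA hyper-parameters are placeholders, not the settings used for the reported results.

```python
# Minimal LoRA sketch (assumptions: checkpoint id, target modules, hyper-parameters).
from transformers import AutoTokenizer, T5ForConditionalGeneration
from peft import LoraConfig, get_peft_model, TaskType

ckpt = "agemagician/ankh2-large"  # assumed checkpoint id; replace with this repository's id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt)

# Attach low-rank adapters to the attention projections; only these small
# adapter matrices are trained, while the ~2B base parameters stay frozen.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5-style attention projection names (assumption)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only the LoRA weights are trainable
```

For a downstream task you would then train this wrapped model with your usual training loop or `Trainer`, optionally with a task-specific head on top.
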
### How to use

Here is how to use this model to extract the features of a given protein sequence in PyTorch. The loading code below assumes the released checkpoint follows the T5 encoder-decoder format of ANKH-Large and that the repository id matches this model card; adjust both if needed:

```python
from transformers import AutoTokenizer, T5EncoderModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the tokenizer and the encoder part of the model
# (checkpoint id is an assumption; replace it with this repository's id once the weights are uploaded)
tokenizer = AutoTokenizer.from_pretrained("agemagician/ankh2-large")
model = T5EncoderModel.from_pretrained("agemagician/ankh2-large").to(device)
model.eval()

sequence_examples = ["PRTEINO", "SEQWENCE"]
# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1536)
print(f"Shape of per-residue embedding of first sequence: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1536)
# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1536)
print(f"Shape of per-protein embedding of first sequence: {emb_0_per_protein.shape}")
```

## Training data

The ANKH2-Large model was pretrained on [UniRef50](https://www.uniprot.org/help/uniref), a dataset consisting of 60 million protein sequences.

## Training procedure

### Preprocessing

The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 25.
The inputs of the model are then of the form:

```
Protein Sequence </s>
```

The preprocessing step was performed on the fly, by truncating and padding the protein sequences to 512 tokens.

The details of the masking procedure for each sequence are as follows (a small illustrative sketch follows this list):
- 20% of the amino acids are masked.
- In 100% of the cases, each masked amino acid is replaced by an `<extra_id_num>` token, where "num" is a number in the range 0 to 115.

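To make this concrete, here is a small illustrative sketch of how such an input/label pair could be built. It is only an approximation of the described objective, not the actual training pipeline, and the example sequence and helper function are made up for illustration:

```python
# Illustrative sketch of the masking objective described above
# (an approximation for illustration, not the actual training pipeline).
import random

def mask_sequence(seq: str, mask_ratio: float = 0.2, seed: int = 0):
    rng = random.Random(seed)
    masked_positions = set(rng.sample(range(len(seq)), max(1, int(len(seq) * mask_ratio))))
    input_tokens, label_tokens = [], []
    sentinel = 0
    for i, aa in enumerate(seq):
        if i in masked_positions:
            input_tokens.append(f"<extra_id_{sentinel}>")  # each masked residue gets its own sentinel token
            label_tokens.extend([f"<extra_id_{sentinel}>", aa])  # label keeps the original residue
            sentinel += 1
        else:
            input_tokens.append(aa)
    input_tokens.append("</s>")
    label_tokens.append("</s>")
    return " ".join(input_tokens), " ".join(label_tokens)

model_input, model_label = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(model_input)  # e.g. "M K <extra_id_0> A Y ... </s>"
print(model_label)  # e.g. "<extra_id_0> T <extra_id_1> ... </s>"
```
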
### Pretraining

The model was trained on a single TPU Pod v4-256 for 45 epochs in total, using a sequence length of 512 and a batch size of 1k.
It was trained using the ANKH-Large model as an initial checkpoint rather than training from scratch.
It has a total of approximately 2B parameters and uses an encoder-decoder architecture.
The optimizer is Adafactor with a linear warmup followed by a linear decay learning-rate schedule.

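As a rough sketch of such an optimizer setup (not the actual pretraining code; the learning rate, warmup steps and total steps below are placeholders):

```python
# Adafactor with an external linear warmup + linear decay schedule
# (learning rate, warmup and total step counts are placeholders, not the real pretraining values).
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor, get_linear_schedule_with_warmup

model = T5ForConditionalGeneration.from_pretrained("agemagician/ankh2-large")  # assumed checkpoint id

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,                 # fixed LR so the external scheduler controls warmup/decay
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,
    num_training_steps=100_000,
)

# inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```
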
## Evaluation results

When the model is used for feature extraction ("FE") and parameter-efficient fine-tuning ("LoRA"), it achieves the following results:

Test results:

| Task/Dataset | Method | Secondary structure (3-states) | Secondary structure (8-states) | Localization | Membrane | Solubility | Fluorescence |
|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | FE | coming soon | coming soon | | | | |
| CASP12 | LoRA | coming soon | coming soon | | | | |
| TS115 | FE | coming soon | coming soon | | | | |
| TS115 | LoRA | coming soon | coming soon | | | | |
| CB513 | FE | coming soon | coming soon | | | | |
| CB513 | LoRA | coming soon | coming soon | | | | |
| DeepLoc | FE | | | coming soon | coming soon | | |
| DeepLoc | LoRA | | | coming soon | coming soon | | |
| Solubility | FE | | | | | coming soon | |
| Solubility | LoRA | | | | | 74% | |
| Fluorescence | FE | | | | | | coming soon |
| Fluorescence | LoRA | | | | | | 68% |

### BibTeX entry and citation info

```bibtex
@article{elnaggar2023ankh,
  title={Ankh☥: Optimized protein language model unlocks general-purpose modelling},
  author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
  journal={bioRxiv},
  pages={2023--01},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```

> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)