---
license: apache-2.0
metrics:
- accuracy
- bleu
pipeline_tag: text2text-generation
tags:
- chemistry
- biology
- medical
- smiles
- iupac
- text-generation-inference
widget:
- text: CCO
  example_title: ethanol
---
# SMILES2IUPAC-canonical-base

SMILES2IUPAC-canonical-base is designed to translate SMILES strings into IUPAC chemical names accurately.

## Model Details

### Model Description

SMILES2IUPAC-canonical-base is based on the MT5 model, adapted to use different tokenizers for the encoder and the decoder.
- **Developed by:** Knowledgator Engineering
- **Model type:** Encoder-Decoder with attention mechanism
- **Language(s) (NLP):** SMILES, IUPAC (English)
- **License:** Apache License 2.0

### Model Sources
- **Paper:** coming soon
- **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)

## Quickstart
First, install the library:
```commandline
pip install chemical-converters
```
### SMILES to IUPAC
#### ! Preferred IUPAC style
To choose the preferred IUPAC style, place a style token before
your SMILES sequence.

| Style Token | Description                                                                                            |
|-------------|--------------------------------------------------------------------------------------------------------|
| `<BASE>`    | The most widely known name of the substance; sometimes a mixture of traditional and systematic styles |
| `<SYST>`    | A fully systematic style without trivial names                                                         |
| `<TRAD>`    | A style based on the trivial names of the parts of the substance                                       |

#### To perform a simple translation, follow the example:
```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac('CCO'))
print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
```
```text
['ethanol']
['ethanol', 'ethanol', 'ethanol']
```
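Since a style token is just a string prefix on the SMILES input, the prefixed strings can be built with ordinary string handling. The sketch below is a hypothetical convenience helper (`with_style` and `all_styles` are not part of `chemical-converters`); it only assumes the three tokens listed in the table above.

```python
# Hypothetical helpers for illustration; not part of chemical-converters.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, style: str = "<BASE>") -> str:
    """Prefix a SMILES string with one of the documented style tokens."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style token: {style!r}")
    return f"{style}{smiles}"


def all_styles(smiles: str) -> list[str]:
    """Return the SMILES string once per style, ready to pass as a list."""
    return [with_style(smiles, style) for style in STYLE_TOKENS]


print(all_styles("CCO"))  # ['<BASE>CCO', '<SYST>CCO', '<TRAD>CCO']
```

The resulting list can be passed to `converter.smiles_to_iupac(...)` in the same way as the hand-written list in the example above.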
#### Processing in batches:
```python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1, 
                                process_in_batch=True, batch_size=1000))
```
```text
['buta-1,3-diene', 'buta-1,3-diene'...]
```
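To make the effect of `batch_size` concrete, here is a generic sketch of splitting a list of inputs into fixed-size batches. It illustrates the general idea only; how `chemical-converters` batches inputs internally is not described in this card, and `iter_batches` is a hypothetical name.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


smiles = ["<BASE>C=CC=C"] * 10
# With batch_size=4, ten inputs form two full batches and one partial batch.
print([len(batch) for batch in iter_batches(smiles, batch_size=4)])  # [4, 4, 2]
```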
#### Validating SMILES to IUPAC translations
Translations can be validated by translating the predicted IUPAC name back into SMILES
and calculating the Tanimoto similarity between the fingerprints of the two molecules.
````python
from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
print(converter.smiles_to_iupac('CCO', validate=True))
````
````text
['ethanol'] 1.0
````
The higher the Tanimoto similarity, the more likely it is that the prediction is correct.
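
For reference, the Tanimoto similarity of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below shows that formula on hand-written toy fingerprints, represented as sets of on-bit indices; in practice the fingerprints would come from a cheminformatics toolkit, and the bit values here are made up purely for illustration.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Toy fingerprints (illustrative values, not real molecular fingerprints).
fp_input = {1, 4, 7, 9}
fp_roundtrip = {1, 4, 7, 12}

print(tanimoto(fp_input, fp_input))      # 1.0
print(tanimoto(fp_input, fp_roundtrip))  # 0.6
```

A score of 1.0 means the two fingerprints are identical; lower scores mean the round-tripped molecule differs from the input.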

You can also run the validation manually:
```python
from chemicalconverters import NamesConverter

validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
# input_sequence is the original SMILES; predicted_sequence is the IUPAC name to check
print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='ethanol', validation_model=validation_model))
```
```text
1.0
```

## Bias, Risks, and Limitations

This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.

### Training Procedure

The model was trained on 100M SMILES-IUPAC pairs for 2 epochs with a learning rate of 1e-5 and a batch size of 512.
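
As a rough sense of scale, these figures imply a few hundred thousand optimizer steps. The calculation below is a back-of-the-envelope sketch that assumes exactly 100,000,000 examples, one optimizer step per batch, and no gradient accumulation; none of these details are stated in this card.

```python
import math


def optimizer_steps(examples: int, batch_size: int, epochs: int) -> int:
    """Total steps, assuming one step per batch and a partial final batch per epoch."""
    return math.ceil(examples / batch_size) * epochs


steps_per_epoch = optimizer_steps(100_000_000, 512, 1)
total_steps = optimizer_steps(100_000_000, 512, 2)
print(steps_per_epoch, total_steps)  # 195313 390626
```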

## Evaluation

| Model                               | Accuracy | BLEU-4 score | Size (MB) |
|-------------------------------------|----------|--------------|-----------|
| SMILES2IUPAC-canonical-small        | 75%      | 0.93         | 23        |
| SMILES2IUPAC-canonical-base         | 86.9%    | 0.964        | 180       |
| STOUT V2.0\*                        | 66.65%   | 0.92         | 128       |
| STOUT V2.0 (according to our tests) |          | 0.89         | 128       |

\*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
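
This card does not state how accuracy is computed. One common reading for translation tasks is exact-match accuracy: the share of predictions that are string-identical to the reference name. The sketch below implements that reading as an assumption, with made-up predictions and references used only for illustration.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions identical to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative data only; not taken from the model's evaluation set.
preds = ["ethanol", "buta-1,3-diene", "propan-2-ol", "ethanal"]
refs = ["ethanol", "buta-1,3-diene", "propan-1-ol", "ethanal"]
print(exact_match_accuracy(preds, refs))  # 0.75
```

Exact match is strict: a single wrong locant counts as a miss, whereas BLEU-4 gives partial credit for overlapping n-grams, which would be consistent with the BLEU-4 column sitting closer to its maximum than the accuracy column.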

## Citation
Coming soon.

## Model Card Authors

[Mykhailo Shtopko](https://huggingface.co/BioMike)

## Model Card Contact

info@knowledgator.com