CNN Based On SMILES Representation of Compounds for Detecting Chemical Motif

University of Toronto Machine Intelligence Team

Jan 26, 2021

Link to paper: https://doi.org/10.1186/s12859-018-2523-5

This paper discusses the use of a deep learning model to screen compounds in support of drug discovery. By predicting certain properties of drug candidates, the model helps prioritize the compounds most likely to be effective drugs for further in vitro screening.

At the most basic level, compounds are typically represented linearly in SMILES (Simplified Molecular Input Line Entry System) notation. Using a specific alphabet and grammar, SMILES describes the atoms in a chemical compound and expresses its structure. Most molecular modelling software can take the SMILES representation of a compound and output a 2D drawing or a 3D model of it. The first problem faced by researchers in chemical analysis is therefore how to represent compounds as input to a machine learning model. Current methods generate what is known as a “fingerprint” from SMILES notation. There are several fingerprint types, and consequently several algorithms for generating them, but in general a chemical fingerprint is a binary vector that encodes the structural properties of a compound.

Instead of using fingerprints, the authors of this paper explored a novel approach to vectorizing chemical compounds for input to a machine learning model. Following the strategy of established DNA descriptor generation methods, they used columns of one-hot encodings to translate the 1D SMILES representation of a compound into a 2D binary matrix designated the “SMILES feature matrix.” This is possible because SMILES notation has a fixed and limited alphabet. More explicitly, the resulting SMILES feature matrix has 42 features (21 symbols for atoms and 21 symbols representing chemical features captured by SMILES) and a length of 400, the maximum SMILES string length in the dataset; compounds with shorter strings had their feature matrices padded. A convolutional neural network is then applied to the feature matrix, much as it would be applied to an image.
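To make the encoding concrete, here is a minimal sketch of building such a matrix. It rests on assumptions that are not from the paper: the 11-symbol `ALPHABET` is a hypothetical stand-in for the real 42 features, each character is treated as one symbol (so multi-character atoms such as Cl or Br would need a proper tokenizer), and padding is done with all-zero rows.

```python
# Hypothetical, simplified alphabet. The paper uses 42 features; they are not reproduced here.
ALPHABET = ["C", "N", "O", "c", "n", "(", ")", "=", "#", "1", "2"]
MAX_LEN = 400  # maximum SMILES string length mentioned in the article


def smiles_feature_matrix(smiles, alphabet=ALPHABET, max_len=MAX_LEN):
    """Return a max_len x len(alphabet) matrix (list of rows) of 0/1 values.

    Row i is the one-hot encoding of the i-th character of the SMILES string.
    Rows past the end of the string stay all-zero (assumed padding scheme).
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for position, char in enumerate(smiles):
        if char not in index:
            raise ValueError(f"symbol {char!r} is not in the alphabet")
        matrix[position][index[char]] = 1
    return matrix


# Acetic acid, written as the SMILES string CC(=O)O
matrix = smiles_feature_matrix("CC(=O)O")
print(len(matrix), len(matrix[0]))   # 400 11
print(sum(map(sum, matrix)))         # 7, one "hot" entry per character
```

Because every row has at most one nonzero entry, the matrix is very sparse, and a convolution filter sliding along the position axis effectively scores short runs of consecutive SMILES symbols.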

The architecture of the CNN is depicted by the following figure from the paper:

To summarize, it consists of alternating convolutional and average pooling layers followed by a global max pooling layer. Most notably, this CNN produces the “SMILES convolution fingerprint” (SCFP), a vector containing the structural information about the input SMILES feature matrix that training selected as significant. This method of computing a chemical fingerprint has an advantage over conventional computational methods: because the CNN can be trained on different classification targets, the fingerprint reflects the context of the task, whereas traditional methods compute fixed features regardless of drug project parameters.
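The way global max pooling turns a variable number of positions into a fixed-length fingerprint can be sketched in a few lines. This is a simplified illustration, not the paper's architecture: it uses a single convolution and pooling stage, assumes a ReLU activation and non-overlapping pooling windows, and uses hand-set kernels where the real model learns its filters.

```python
def conv1d(matrix, kernel):
    """Valid convolution along the position axis; the kernel spans all feature columns."""
    width = len(kernel)
    out = []
    for start in range(len(matrix) - width + 1):
        total = 0.0
        for k in range(width):
            total += sum(x * w for x, w in zip(matrix[start + k], kernel[k]))
        out.append(max(total, 0.0))  # ReLU (assumed activation)
    return out


def avg_pool(seq, size):
    """Average pooling over non-overlapping windows (assumed stride == size)."""
    return [sum(seq[i:i + size]) / size for i in range(0, len(seq) - size + 1, size)]


def fingerprint(matrix, kernels, pool_size=2):
    """One value per kernel: the global maximum of its pooled activation map."""
    return [max(avg_pool(conv1d(matrix, k), pool_size)) for k in kernels]


# Toy input: 6 positions x 2 features (rows are one-hot or all-zero padding).
toy = [[1, 0], [0, 1], [1, 0], [1, 0], [0, 1], [0, 0]]
kernels = [
    [[1, 0], [0, 1]],  # responds to feature 0 followed by feature 1
    [[0, 1], [0, 1]],  # responds to feature 1 at either of two adjacent positions
]
print(fingerprint(toy, kernels))  # [1.5, 1.0]
```

The fingerprint length equals the number of kernels, regardless of how long the input is, which is what makes the output usable as a fixed-size descriptor.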

The performance of the model was evaluated using ROC-AUC, or area under the receiver operating characteristic curve, a widely used metric for classifiers. The ROC curve plots the false positive rate (the ratio of misclassified negatives to the total number of negative data points) on the x-axis against the true positive rate (the ratio of correctly classified positives to the total number of positive data points) on the y-axis, and so illustrates the classification power of the model. The AUC summarizes the curve in a single number: the larger the area under the curve, the better the model has performed, with 0.5 corresponding to random guessing and 1.0 to a perfect classifier. When the model outlined in this paper was compared to contemporary compound classification models (including logistic regression, random forest, deep neural network, and graph convolution, using conventional fingerprints as input), its performance on a benchmark dataset was superior to all of them. When compared to a deep neural network tailored to the benchmark dataset, this model performed comparably but did not outperform it.
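As a worked illustration of the metric, the sketch below sweeps a decision threshold over the predicted scores, collects the (false positive rate, true positive rate) points, and integrates them with the trapezoidal rule. The labels and scores are made-up toy values, not results from the paper.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), one per distinct threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: 1 = active compound, 0 = inactive compound (illustrative only).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))  # 0.8889
```

Here 8 of the 9 possible (active, inactive) pairs are ranked correctly, which is why the area comes out to 8/9: the AUC can be read as the probability that a randomly chosen positive is scored above a randomly chosen negative.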

As for the chemical motifs mentioned in the title of the paper, these are defined as functional substructures of the compound. The researchers were able to trace each dimension of the SMILES convolution fingerprint back to the input feature matrix, thus associating the values in the SCFP with chemical substructures. Because this identifies the motifs that contribute most to the classification of a compound, SCFP makes it possible to visualize the functional substructures behind a prediction. Further analysis comparing SCFP with conventional fingerprints after dimensionality reduction revealed that SCFP is able to differentiate between active and inactive compounds (graph a) while conventional fingerprints are not (graph b), even though SCFP has only 64 dimensions while the conventional fingerprints used in the analysis had 1024.
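The backtracing idea can be illustrated with a deliberately reduced sketch: a single convolution filter followed directly by global max pooling, so the position of the maximum maps straight back to a window of the SMILES string. The alphabet and the hand-set kernel below are hypothetical, and the paper's model has more layers and learned filters, so this shows the principle only.

```python
def backtrace_motif(smiles, alphabet, kernel):
    """Return (max activation, substring of `smiles` that produced it).

    With one-hot rows, convolving a kernel over a window reduces to summing
    kernel[k][column of the k-th character], so no explicit matrix is needed.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    width = len(kernel)
    best_score, best_start = float("-inf"), 0
    for start in range(len(smiles) - width + 1):
        score = sum(kernel[k][index[smiles[start + k]]] for k in range(width))
        if score > best_score:  # global max pooling, remembering where the max occurred
            best_score, best_start = score, start
    return best_score, smiles[best_start:best_start + width]


# Hypothetical 5-symbol alphabet and a width-2 kernel that responds to "=O".
alphabet = ["C", "O", "(", ")", "="]
carbonyl_kernel = [
    [0, 0, 0, 0, 1],  # first position: "="
    [0, 1, 0, 0, 0],  # second position: "O"
]
print(backtrace_motif("CC(=O)O", alphabet, carbonyl_kernel))  # (2, '=O')
```

In a deeper network the same bookkeeping has to be carried through each pooling and convolution stage, widening the window at every step, but the mapping from a fingerprint value back to a stretch of the input string follows the same logic.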

In conclusion, the CNN proposed in this paper outperformed existing compound classification methods and performed comparably to the best model tailored to the benchmark dataset. Furthermore, the novel method of fingerprint generation detailed in this paper allowed for the identification of chemical motifs and greatly improved the interpretability of the model, in a field that generally treats neural networks as black boxes.

Hirohara, M., Saito, Y., Koda, Y. et al. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics 19, 526 (2018). https://doi.org/10.1186/s12859-018-2523-5
