how to develop a random forest ensemble in python

In bagging, a number of decision trees are created, each from a different bootstrap sample of the training dataset. A bootstrap sample is drawn from the training dataset with replacement, so a given example (row) may appear more than once in the sample.
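As a rough illustration of sampling with replacement, the sketch below draws one bootstrap sample of row indices with NumPy. The 10-row dataset size and the seed are arbitrary assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the sketch is repeatable

n_rows = 10  # assumed size of a toy training dataset
row_indices = np.arange(n_rows)

# Draw n_rows indices with replacement: some rows repeat, others are left out.
bootstrap_indices = rng.choice(row_indices, size=n_rows, replace=True)

# Rows that were never drawn are not part of this bootstrap sample.
left_out = np.setdiff1d(row_indices, bootstrap_indices)

print("bootstrap sample:", bootstrap_indices)
print("rows left out:   ", left_out)
```

Each tree in a bagged ensemble would be fit on a different draw like this one.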

Bagging is an effective ensemble algorithm as each decision tree is fit on a slightly different training dataset, and in turn, has a slightly different performance. Unlike normal decision tree models, such as classification and regression trees (CART), trees used in the ensemble are unpruned, making them slightly overfit to the training dataset. This is desirable as it helps to make each tree more different and have less correlated predictions or prediction errors.

A prediction on a regression problem is the average of the prediction across the trees in the ensemble. A prediction on a classification problem is the majority vote for the class label across the trees in the ensemble.
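The two aggregation rules can be sketched with plain NumPy. The per-tree predictions below are made-up numbers for five hypothetical trees and three rows, used only to show the arithmetic.

```python
import numpy as np

# Hypothetical regression outputs: one row per tree, one column per example.
tree_regression_preds = np.array([
    [2.0, 5.0, 1.0],
    [3.0, 4.0, 1.0],
    [4.0, 6.0, 1.0],
    [3.0, 5.0, 1.0],
    [3.0, 5.0, 1.0],
])
# Regression: average the trees' predictions for each example.
ensemble_regression = tree_regression_preds.mean(axis=0)

# Hypothetical class labels predicted by the same five trees.
tree_class_preds = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [1, 1, 2],
    [0, 2, 2],
    [0, 1, 2],
])

def majority_vote(labels):
    """Return the most frequent label among one example's tree predictions."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

# Classification: take the majority vote for each example (each column).
ensemble_class = np.array([majority_vote(col) for col in tree_class_preds.T])

print(ensemble_regression)
print(ensemble_class)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch shows the idea rather than that library's exact rule.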

Unlike bagging, random forest also involves selecting a subset of input features (columns or variables) at each split point in the construction of trees. Typically, constructing a decision tree involves evaluating the value for each input variable in the data in order to select a split point. By reducing the features to a random subset that may be considered at each split point, it forces each decision tree in the ensemble to be more different.

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. [...] But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

The effect is that the predictions, and in turn, prediction errors, made by each tree in the ensemble are more different or less correlated. When the predictions from these less correlated trees are averaged to make a prediction, it often results in better performance than bagged decision trees.

Random forests' tuning parameter is the number of randomly selected predictors, k, to choose from at each split, and is commonly referred to as mtry. In the regression context, Breiman (2001) recommends setting mtry to be one-third of the number of predictors.

Another important hyperparameter to tune is the depth of the decision trees. Deeper trees are often more overfit to the training data, but also less correlated, which in turn may improve the performance of the ensemble. Depths from 1 to 10 levels may be effective.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.
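A minimal sketch of this evaluation procedure with scikit-learn follows. The synthetic dataset from `make_classification`, its size, and the choice of 50 trees are assumptions made to keep the example small; they are not taken from the text.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Assumption: a synthetic binary classification dataset stands in for real data.
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)

model = RandomForestClassifier(n_estimators=50, random_state=1)

# Three repeats of stratified 10-fold cross-validation gives 30 scores.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

print(f"Accuracy: {mean(scores):.3f} ({std(scores):.3f})")
```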

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

As we did with the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that negative MAE values closer to zero are better, and a perfect model has a MAE of 0.
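The regression evaluation can be sketched in the same way. As before, the synthetic dataset from `make_regression` and the 50-tree forest are assumptions for illustration only.

```python
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Assumption: a synthetic regression dataset stands in for real data.
X, y = make_regression(n_samples=500, n_features=20, n_informative=15,
                       noise=0.1, random_state=2)

model = RandomForestRegressor(n_estimators=50, random_state=1)

# Three repeats of 10-fold cross-validation (no stratification for regression).
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# scikit-learn reports the negated MAE, so every score is <= 0.
scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=cv)

print(f"MAE: {mean(scores):.3f} ({std(scores):.3f})")
```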

A smaller sample size will make trees more different, and a larger sample size will make the trees more similar. Setting max_samples to None will make the sample size the same size as the training dataset and this is the default.
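One way to explore the bootstrap sample size is to loop over a few `max_samples` values and compare cross-validated accuracy. The sketch below assumes a synthetic dataset and uses a reduced evaluation (5 folds, 30 trees) purely to keep the run short; the candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: synthetic data and a small evaluation budget, for illustration.
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

sample_size_scores = {}
# Floats are fractions of the training set; None means "same size as the training set".
for max_samples in [0.1, 0.5, 0.9, None]:
    model = RandomForestClassifier(n_estimators=30, max_samples=max_samples,
                                   random_state=1)
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    sample_size_scores[max_samples] = scores.mean()
    print(f"max_samples={max_samples}: {scores.mean():.3f} ({scores.std():.3f})")
```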

You might like to extend this example and see what happens if the bootstrap sample size is larger or even much larger than the training dataset (e.g. you can set an integer value as the number of samples instead of a float percentage of the training dataset size).

The example below explores the effect of the number of features randomly selected at each split point on model accuracy. We will try values from 1 to 7 and would expect a small value, around four, to perform well based on the heuristic.
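A sketch of such an exploration is shown here. It assumes that the heuristic in question is the square root of the number of input features (which is what scikit-learn's `max_features="sqrt"` setting computes) and that the dataset has 20 features; both are assumptions, as is the synthetic data.

```python
from math import sqrt

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

n_features = 20  # assumed number of input features
X, y = make_classification(n_samples=500, n_features=n_features, n_informative=15,
                           n_redundant=5, random_state=3)

# Square-root heuristic for classification: sqrt(20) is about 4.47, truncated to 4.
heuristic = int(sqrt(n_features))
print("heuristic max_features:", heuristic)

# Reduced evaluation (5 folds, 30 trees) to keep the sketch quick.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

feature_scores = {}
for max_features in range(1, 8):
    model = RandomForestClassifier(n_estimators=30, max_features=max_features,
                                   random_state=1)
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    feature_scores[max_features] = scores.mean()
    print(f"max_features={max_features}: {scores.mean():.3f} ({scores.std():.3f})")
```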

In this case, the results suggest that a value between three and five would be appropriate, confirming the sensible default of four on this dataset. A value of five might even be better given the smaller standard deviation in classification accuracy as compared to a value of three or four.

Typically, the number of trees is increased until the model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, although this is not the case. Both bagging and random forest algorithms appear to be somewhat immune to overfitting the training dataset given the stochastic nature of the learning algorithm.
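To see where performance levels off, one can evaluate a handful of ensemble sizes. The values tried below and the synthetic dataset are assumptions for illustration; on real data the point of stabilization will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: synthetic data and 5-fold evaluation, chosen to keep the run short.
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           n_redundant=5, random_state=3)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

tree_count_scores = {}
for n_trees in [10, 50, 100, 200]:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    tree_count_scores[n_trees] = scores.mean()
    print(f"n_estimators={n_trees}: {scores.mean():.3f} ({scores.std():.3f})")
```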

The authors make grand claims about the success of random forests: most accurate, most interpretable, and the like. In our experience random forests do remarkably well, with very little tuning required.

I'm implementing a Random Forest and I'm getting a shifted time series in the predictions. If I build the model to predict e.g. 4 steps ahead, my time series of predictions seems shifted 4 steps to the right compared to my time series of observations. If I try to predict 16 steps ahead, it seems shifted 16 steps.

Very nice tutorial on RF usage! It is really practical to know good practices for these models. In my experience, Random Forests are very competitive in real industrial applications (they often outperform competitors such as artificial neural networks). Regards!

Hello Jason, I have a question. I have a problem that is already implemented with logistic regression, and I tried the same program with Random Forest to check whether it could improve the accuracy. The accuracy did improve, but I don't know whether it is logical to use Random Forest in my case.

My case study is as follows: based on a market dataset, I need to predict whether a customer will buy a product or not, depending on his prior history, i.e. how many times the customer bought the same product previously and how many times he just checked it without buying it.

Id | clients | CurrentProd | P1+ | P1- | P2+ | P2- | P3+ | P3- | ... | PN+ | PN- | Output
10 | CL1     | P1, P3      | 6   | 1   | 0   | 0   | 8   | 2   | ... | 0   | 0   | 1
11 | CL1     | P1, P2      | 7   | 1   | 5   | 2   | 0   | 0   | ... | 0   | 0   | 1

where CurrentProd is the list of products for which I need to know whether the customer will purchase, P1+ is the number of times the client bought product 1, and P1- is the number of times the client checked product 1 without buying it.

The columns cover all products in the market, so the data has very many features (at least 200 products), and in each row most of them take the value 0 (because they do not belong to CurrentProd).

The key will be to find an appropriate representation for the problem. This may give you ideas (replace site with product):

Do you know how I can get a graphic representation of the trees in the trained model? I was trying to use export_graphviz in sklearn, but since the cross_val_score function fits the estimator on its own, I don't know how to use the export_graphviz function.
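One possible approach, sketched under the assumption that it is acceptable to fit the forest once outside of cross-validation: `cross_val_score` fits internal copies of the estimator, so the model is fit explicitly here and a single tree is taken from its `estimators_` list. The iris data, the 10 trees, and the depth limit are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz, export_text

iris = load_iris()

# Fit the forest directly so that its individual trees are available afterwards.
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=1)
forest.fit(iris.data, iris.target)

# Each fitted tree is a DecisionTreeClassifier stored in estimators_.
first_tree = forest.estimators_[0]

# Plain-text view of the tree, no extra software needed.
print(export_text(first_tree, feature_names=list(iris.feature_names)))

# DOT source for Graphviz; out_file=None returns it as a string.
dot_source = export_graphviz(first_tree, out_file=None,
                             feature_names=list(iris.feature_names),
                             class_names=list(iris.target_names),
                             filled=True)
```

The `dot_source` string can then be rendered with the Graphviz tools if they are installed.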

python - how to get best estimator on gridsearchcv (random forest classifier scikit) - stack overflow

When the grid search is called with various params, it chooses the one with the highest score based on the given scorer func. Best estimator gives the info of the params that resulted in the highest score.
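A short sketch of retrieving the best estimator from a grid search is given below. The parameter grid and the iris dataset are assumptions chosen to keep the search tiny.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Assumed, deliberately small grid: four parameter combinations in total.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid,
                      scoring="accuracy", cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV score:", round(search.best_score_, 3))

# With the default refit=True, best_estimator_ is refit on all of X, y.
best_model = search.best_estimator_
```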

random forests classifiers in python - datacamp

Random forests is a supervised learning algorithm. It can be used for both classification and regression, and it is flexible and easy to use. A forest is made up of trees, and it is said that the more trees it has, the more robust the forest is. Random forests creates decision trees on randomly selected data samples, gets a prediction from each tree and selects the best solution by means of voting. It also provides a pretty good indicator of feature importance.

Random forests has a variety of applications, such as recommendation engines, image classification and feature selection. It can be used to classify loyal loan applicants, identify fraudulent activity and predict diseases. It lies at the base of the Boruta algorithm, which selects important features in a dataset.

Let's suppose you have decided to ask your friends, and talked with them about their past travel experience to various places. You will get some recommendations from every friend. Now you have to make a list of those recommended places. Then, you ask them to vote (or select one best place for the trip) from the list of recommended places you made. The place with the highest number of votes will be your final choice for the trip.

In the above decision process, there are two parts. First, asking your friends about their individual travel experience and getting one recommendation out of multiple places they have visited. This part is like using the decision tree algorithm. Here, each friend makes a selection of the places he or she has visited so far.

The second part, after collecting all the recommendations, is the voting procedure for selecting the best place in the list of recommendations. This whole process of getting recommendations from friends and voting on them to find the best place is known as the random forests algorithm.

It technically is an ensemble method (based on the divide-and-conquer approach) of decision trees generated on a randomly split dataset. This collection of decision tree classifiers is also known as the forest. The individual decision trees are generated using an attribute selection indicator such as information gain, gain ratio, and Gini index for each attribute. Each tree depends on an independent random sample. In a classification problem, each tree votes and the most popular class is chosen as the final result. In the case of regression, the average of all the tree outputs is considered as the final result. It is simpler and more powerful compared to the other non-linear classification algorithms.

Random forests also offers a good feature selection indicator. Scikit-learn provides an extra variable with the model, which shows the relative importance or contribution of each feature in the prediction. It automatically computes the relevance score of each feature in the training phase. Then it scales the relevance down so that the sum of all scores is 1.

Random forest uses Gini importance, also known as mean decrease in impurity (MDI), to calculate the importance of each feature. For each feature, it is the total decrease in node impurity produced by the splits that use that feature, weighted by the share of samples reaching each node and averaged over all trees in the forest. The larger the decrease, the more significant the variable is. Here, the mean decrease is a significant parameter for variable selection. The Gini index can describe the overall explanatory power of the variables.
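In scikit-learn, these impurity-based scores are exposed through the fitted model's `feature_importances_` attribute. The sketch below uses the iris data and 100 trees as assumed settings and checks that the scores sum to 1.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(iris.data, iris.target)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(clf.feature_importances_, index=iris.feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
print("sum of importances:", importances.sum())
```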

You will be building a model on the iris flower dataset, which is a very famous classification set. It comprises the sepal length, sepal width, petal length, petal width, and type of flowers. There are three species or classes: setosa, versicolor, and virginica. You will build a model to classify the type of flower. The dataset is available in the scikit-learn library or you can download it from the UCI Machine Learning Repository.

It's a good idea to always explore your data a bit, so you know what you're working with. Here, you can see the first five rows of the dataset are printed, as well as the target variable for the whole dataset.
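A quick way to take that first look, sketched with pandas. The `species` column name is an assumption introduced for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Put the four measurements into a DataFrame and attach the numeric class label.
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data["species"] = iris.target  # hypothetical column name; 0, 1, 2 index target_names

print(data.head())         # first five rows
print(iris.target_names)   # class names
print(iris.target)         # target variable for the whole dataset
```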

For visualization, you can use a combination of matplotlib and seaborn. Because seaborn is built on top of matplotlib, it offers a number of customized themes and provides additional plot types. Seaborn extends matplotlib rather than replacing it, and the two work well together for good visualizations.

You can see that after removing the least important feature (sepal length), the accuracy increased. This is because you removed misleading data and noise. Using fewer features also reduces the training time.

In this tutorial, you have learned what random forests is, how it works, how to find important features, how random forests compares with decision trees, and its advantages and disadvantages. You have also learned model building, evaluation and finding important features in scikit-learn.

using random forests in python with scikit-learn | oxford protein informatics group

I spend a lot of time experimenting with machine learning tools in my research; in particular I seem to spend a lot of time chasing data into random forests and watching the other side to see what comes out. In my many hours of Googling "random forest foobar" a disproportionate number of hits offer solutions implemented in R. As a young Pythonista in the present year I find this a thoroughly unacceptable state of affairs, so I decided to write a crash course in how to build random forest models in Python using the machine learning library scikit-learn (or sklearn to friends). This is far from exhaustive, and I won't be delving into the machinery of how and why we might want to use a random forest. Rather, the hope is that this will be useful to anyone looking for a hands-on introduction to random forests (or machine learning in general) in Python.

Sklearn comes with several nicely formatted real-world toy data sets which we can use to experiment with the tools at our disposal. We'll be using the venerable iris dataset for classification and the Boston housing set for regression. Sklearn comes with a nice selection of data sets and tools for generating synthetic data, all of which are well-documented. Now, let's write some Python!

First we'll look at how to solve a simple classification problem using a random forest. The iris dataset is probably the most widely-used example for this problem and nicely illustrates the problem of classification when some classes are not linearly separable from the others.

First we'll load the iris dataset into a pandas dataframe. Pandas is a nifty Python library which provides a data structure comparable to the dataframes found in R with database style querying. As an added bonus, the seaborn visualization library integrates nicely with pandas allowing us to generate a nice scatter matrix of our data with minimal fuss.

Neat. Notice that iris-setosa is easily identifiable by petal length and petal width, while the other two species are much more difficult to distinguish. We could do all sorts of pre-processing and exploratory analysis at this stage, but since this is such a simple dataset let's just fire on. We'll do a bit of pre-processing later when we come to the Boston data set.

First, let's split the data into training and test sets. We'll use stratified sampling by iris class to ensure both the training and test sets contain a balanced number of representatives of each of the three classes. Sklearn requires that all features and targets be numeric, so the three classes are represented as integers (0, 1, 2). Here we're doing a simple 50/50 split because the data are so nicely behaved. Typically however we might use a 75/25 or even 80/20 training/test split to ensure we have enough training data. In true Python style this is a one-liner.
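The stratified 50/50 split described above might look like the following sketch. The seed value is an arbitrary assumption; the split itself uses scikit-learn's `train_test_split` with its `stratify` argument.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # classes are already encoded as 0, 1, 2

# 50/50 split, stratified so each class is equally represented on both sides.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=123456)

print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```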

Now let's fit a random forest classifier to our training set. For the most part we'll use the default settings since they're quite robust. One exception is the out-of-bag estimate: by default an out-of-bag error estimate is not computed, so we need to tell the classifier object that we want this.

If you're used to the R implementation, or you ever find yourself having to compare results using the two, be aware that some parameter names and default settings are different between the two. Fortunately both have excellent documentation so it's easy to ensure you're using the right parameters if you ever need to compare models.

Let's see how well our model performs when classifying our unseen test data. For a random forest classifier, the out-of-bag score computed by sklearn is an estimate of the classification accuracy we might expect to observe on new data. We'll compare this to the actual score obtained on our test data.
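A sketch of that comparison, assuming the stratified 50/50 iris split described earlier and 100 trees (both the seed and the tree count are arbitrary choices here):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=123456)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)

predicted = rf.predict(X_test)
test_accuracy = accuracy_score(y_test, predicted)

print(f"Out-of-bag score estimate: {rf.oob_score_:.3f}")
print(f"Mean accuracy on test set: {test_accuracy:.3f}")
```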

Not bad. However, this doesn't really tell us anything about where we're doing well. A useful technique for visualising performance is the confusion matrix. This is simply a matrix whose diagonal values are true positive counts, while off-diagonal values are false positive and false negative counts for each class against the other.
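The confusion matrix can be computed with `sklearn.metrics.confusion_matrix`. This sketch refits a small forest on an assumed 50/50 iris split so that it runs on its own, and labels the matrix with the class names using pandas.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.5, stratify=iris.target,
    random_state=123456)

rf = RandomForestClassifier(n_estimators=100, random_state=123456)
predicted = rf.fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, predicted)
cm_table = pd.DataFrame(cm, index=iris.target_names, columns=iris.target_names)

print(cm_table)
```

The diagonal holds the correctly classified counts, so its sum divided by the total gives the test accuracy.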

Now let's look at using a random forest to solve a regression problem. The Boston housing data set consists of census housing price data in the region of Boston, Massachusetts, together with a series of values quantifying various properties of the local area such as crime rate, air pollution, and student-teacher ratio in schools. The question for us is whether we can use these data to accurately predict median house prices. One caveat of this data set is that the median house price is truncated at $50,000 which suggests that there may be considerable noise in this region of the data. You might want to remove all data with a median house price of $50,000 from the set and see if the regression improves at all.

As before we'll load the data into a pandas dataframe. This time, however, we're going to do some pre-processing of our data by independently transforming each feature to have zero mean and unit variance. The values of different features vary greatly in order of magnitude. If we were to analyse the raw data as-is, we run the risk of our analysis being skewed by certain features dominating the variance. This isn't strictly necessary for a random forest, but will enable us to perform a more meaningful principal component analysis later. Performing this transformation in sklearn is super simple using the StandardScaler class of the preprocessing module. This time we're going to use an 80/20 split of our data. You could bin the house prices to perform stratified sampling, but we won't worry about that for now.
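The standardisation step can be sketched as follows. A synthetic regression dataset with deliberately mismatched feature scales is assumed here in place of the Boston data, and the `feature_i` column names are hypothetical. Fitting the scaler on the training rows only is a choice made in this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data whose columns differ by orders of magnitude.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X = X * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, train_size=0.8, random_state=0)

# Learn the mean and variance from the training data, then apply to both sets.
scaler = StandardScaler().fit(X_train)

# transform() returns a numpy array, so rebuild dataframes with the old labels.
X_train_scaled = pd.DataFrame(scaler.transform(X_train),
                              index=X_train.index, columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test),
                             index=X_test.index, columns=X_test.columns)

print(X_train_scaled.describe().loc[["mean", "std"]])
```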

As before, we've loaded our data into a pandas dataframe. Notice how I have to construct new dataframes from the transformed data. This is because sklearn is built around numpy arrays. While it's possible to return a view of a dataframe as an array, transforming the contents of a dataframe requires a little more work. Of course, there's a library for that, but I'm lazy so I didn't use it this time.

With the data standardised, let's do a quick principal-component analysis to see if we could reduce the dimensionality of the problem. This is quick and easy in sklearn using the PCA class of the decomposition module.
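To illustrate why standardisation matters for PCA, the sketch below compares the explained variance ratios with and without scaling. It assumes synthetic features with very different scales rather than the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: five independent synthetic features, rescaled to differ by
# orders of magnitude, stand in for the real data.
X, _ = make_regression(n_samples=500, n_features=5, random_state=0)
X_raw = X * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
X_scaled = StandardScaler().fit_transform(X_raw)

# Fraction of total variance captured by each principal component.
raw_ratio = PCA().fit(X_raw).explained_variance_ratio_
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print("without standardisation:", np.round(raw_ratio, 3))
print("with standardisation:   ", np.round(scaled_ratio, 3))
```

On the unscaled data the largest-magnitude feature dominates the first component; after scaling the variance is spread across several components.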

Notice how without data standardisation the variance is completely dominated by the first principal component. With standardisation, however, we see that in fact we must consider multiple features in order to explain a significant proportion of the variance. You might want to experiment with building regression models using the principal components (or indeed just combinations of the raw features) to see how well you can do with less information. For now though we're going to use all of the (scaled) features as the regressors for our model. As with the classification problem, fitting the random forest is simple using the RandomForestRegressor class.

Now let's see how we do on our test set. As before we'll compare the out-of-bag estimate (this time it's an R-squared score) to the R-squared score for our predictions. We'll also compute Spearman rank and Pearson correlation coefficients for our predictions to get a feel for how we're doing.
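A sketch of these checks is given below, using `scipy.stats` for the correlation coefficients. Synthetic regression data and 200 trees are assumed in place of the housing data and the author's settings.

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assumption: a synthetic dataset stands in for the housing data.
X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

# For a regressor, oob_score_ is an R-squared estimate from out-of-bag rows.
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

predicted = rf.predict(X_test)
test_r2 = r2_score(y_test, predicted)
spearman = spearmanr(y_test, predicted)
pearson = pearsonr(y_test, predicted)

print(f"Out-of-bag R-squared estimate: {rf.oob_score_:.3f}")
print(f"Test data R-squared score:     {test_r2:.3f}")
print(f"Test data Spearman correlation: {spearman[0]:.3f}")
print(f"Test data Pearson correlation:  {pearson[0]:.3f}")
```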

Congratulations on making it this far. Now you know how to pre-process your data and build random forest models all from the comfort of your iPython session. I plan on writing more in the future about how to use Python for machine learning, and in particular how to make use of some of the powerful tools available in sklearn (a pipeline for data preparation, model fitting, prediction, in one line of Python? Yes please!), and how to make sklearn and pandas play nicely with minimal hassle. If you're lucky, and if I can bring myself to process the data nicely, I might include some fun examples from less well-behaved real-world data sets.

how to save and load random forest from scikit-learn in python? | mljar automated machine learning

In this post I will show you how to save and load a Random Forest model trained with scikit-learn in Python. The method presented here can be applied to any algorithm from scikit-learn (this is what is amazing about scikit-learn!).

Let's save the Random Forest. I'm using the joblib.dump method. The first argument of the method is the variable holding the model. The second argument is the path and file name where the resulting file will be created.

To load the model back I use the joblib.load method. It takes the path and file name as its argument. I will load the forest into a new variable, loaded_rf. Please notice that I don't need to initialize this variable, I just load the model into it.

While saving the scikit-learn Random Forest with joblib you can use compress parameter to save the disk space. In the joblib docs there is information that compress=3 is a good compromise between size and speed. Example below:
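A minimal sketch of the full round trip, including the `compress` option. The file name, the temporary directory, and the small iris model are assumptions made so the example runs anywhere.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Hypothetical location: a file inside a fresh temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "random_forest.joblib")

# Save the fitted model; compress=3 trades a little speed for a smaller file.
joblib.dump(rf, model_path, compress=3)
print(f"saved {os.path.getsize(model_path)} bytes to {model_path}")

# Load it back into a new variable and use it like the original.
loaded_rf = joblib.load(model_path)
print(loaded_rf.predict(X[:5]))
```

As with any pickle-based format, a saved model should only be loaded from a trusted source.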

python - strange behavior of sklearn random forest classifier - stack overflow

You did not show the values passed to tree, but it's possible that your model is overfitting. I suggest you retrain rf with a lower max_depth, a greater number of n_estimators or a smaller number of max_features. In fact, you can run a Grid Search to find the best combination of hyperparameters.

random forest classification for predicting lifespan-extending chemical compounds | scientific reports

Ageing is a major risk factor for many conditions including cancer, cardiovascular and neurodegenerative diseases. Pharmaceutical interventions that slow down ageing and delay the onset of age-related diseases are a growing research area. The aim of this study was to build a machine learning model based on the data of the DrugAge database to predict whether a chemical compound will extend the lifespan of Caenorhabditis elegans. Five predictive models were built using the random forest algorithm with molecular fingerprints and/or molecular descriptors as features. The best performing classifier, built using molecular descriptors, achieved an area under the curve score (AUC) of 0.815 for classifying the compounds in the test set. The features of the model were ranked using the Gini importance measure of the random forest algorithm. The top 30 features included descriptors related to atom and bond counts, topological and partial charge properties. The model was applied to predict the class of compounds in an external database, consisting of 1738 small-molecules. The chemical compounds of the screening database with a predictive probability of 0.80 for increasing the lifespan of Caenorhabditis elegans were broadly separated into (1) flavonoids, (2) fatty acids and conjugates, and (3) organooxygen compounds.

Ageing is a major health, social and financial challenge, characterised by the deterioration of the physiological processes of an organism [1,2]. Ageing is a predominant risk factor for many conditions including various types of cancers, cardiovascular and neurodegenerative diseases [3,4]. Interventions targeting the cellular and molecular process of ageing have the potential to delay and protect against age-related conditions.

Several ageing studies have identified interventions that extend the lifespan of model organisms ranging from nematodes and fruit flies to rodents, using dietary restrictions, genetic modifications and pharmaceutical interventions. Lee et al. (2006) presented the first evidence that long-term dietary deprivation can improve longevity in a multicellular species, Caenorhabditis elegans (C. elegans) [5]. A few years later, Harrison et al. (2009) found that treating mice with rapamycin, an inhibitor of the mTOR pathway, extended the median and maximum lifespan of the mice [6]. Additionally, Selman et al. (2009) reported that genetic deletion of S6 protein kinase 1, a component of the mTOR signalling pathway, increased the lifespan of mice and protected against age-related conditions [7].

Ye et al. (2014) developed a pharmacological network to identify pharmacological classes related to the ageing of C. elegans [8]. The network showed that resistance to oxidative stress and lifespan extension clustered in a few pharmacological classes, most of them related to intercellular signalling [8]. Moreover, Putin et al. (2016) developed a deep learning neural network that predicted human chronological age from a basic blood test [9]. The study identified the top five most critical blood markers for determining chronological age in humans, which were albumin, glucose, alkaline phosphatase, urea and erythrocytes [9]. Additionally, Mamoshina et al. (2018) developed a deep learning-based haematological ageing clock using blood samples from Canadian, South Korean, and Eastern European populations, with millions of subjects [10]. The findings showed that population-specific ageing clocks were more accurate in predicting chronological age and quantifying biological age than generic ageing clocks [10].

Barardo et al. (2017) built a random forest model to predict whether a compound would increase the lifespan of C. elegans based on the data of the DrugAge database [1,4]. The features used to build the model were molecular descriptors and gene ontology terms. Feature selection was performed using the random forest feature importance measure. The best performing model, with an AUC score of 0.80, was applied to predict the class of the compounds in the DGIdb database.

This study builds on the work conducted by Barardo et al. (2017) to further explore the use of the DrugAge database for predicting compounds with anti-ageing properties [4]. Specifically, the random forest algorithm was employed to predict whether a compound will increase the lifespan of C. elegans. This was achieved by building five predictive models, each using different descriptor types, based on the data of the DrugAge database published by Barardo et al. (2017) [4]. The features of the models were molecular fingerprints and/or molecular descriptors calculated from the structure of the compounds in the DrugAge database. Feature selection was performed using variance and mutual information-based methods. To the best of our knowledge, this is the first implementation of molecular fingerprints for building machine learning models based on the entries of the DrugAge database. The best performing model was applied to predict the class of the compounds in an external database, consisting of 1738 small-molecules extracted from the DrugBank database [11].

Random forest is a supervised machine learning technique, consisting of an ensemble of decision trees, where each tree is trained independently using a random subset of the data [12]. The random forest model is widely used in chemo- and bioinformatics related tasks as it is robust to overfitting on high dimensional databases with small sample sizes [4].

The choice of chemical descriptors can significantly impact the quality and predictions of quantitative structure-activity relationship (QSAR) models. Descriptors represent chemical information of molecules in a digital or numerical way that is suitable for model development and are computer-interpretable [13,14]. In this study, 2D and 3D molecular descriptors were calculated using the Molecular Operating Environment (MOE) software [15]. 2D descriptors are calculated from the 2D structure of a molecule and provide information related to its structural, topological and physicochemical properties [16]. On the other hand, 3D descriptors are generated from the 3D structure of a chemical compound and include electronic parameters (e.g. dipole moment), quantum-chemical descriptors (e.g. HOMO and LUMO energies), and surface:volume descriptors [14,17,18].

Molecular fingerprints represent the molecule's structure using binary vectors, where 1 corresponds to a particular feature or structural group being present and 0 that it is absent. Herein, extended-connectivity fingerprints (ECFP) of 1024- and 2048-bit lengths and RDKit topological fingerprints of 2048-bit length were generated in the RDKit Python environment [19]. Lastly, the combination of molecular descriptors with ECFPs was tested.

Variance and mutual information filter-based methods were employed to select a subset of relevant features for predicting the anti-ageing properties of the molecules in the dataset. This was performed only for the training set which contained 80% of compounds in the dataset. Feature selection reduced the number of variables used by each model, making computational calculations less expensive. The median AUC scores and standard deviation of tenfold cross-validation (on the training set) obtained by random forest classification for each feature subset can be found in Supplementary Fig. 1, Additional File 1. The feature subset with the highest AUC score in tenfold cross-validation was selected for classifying the compounds in the test set. In cases where two feature subsets achieved the same AUC score, the subset with the lowest standard deviation was used.
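The general shape of a variance filter followed by a mutual information filter can be sketched with scikit-learn, as below. This is not the study's pipeline: the synthetic data, the zero-variance threshold, and the choice of keeping 10 features are all assumptions made for illustration.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.model_selection import train_test_split

# Assumption: synthetic features plus one constant column that carries no signal.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)
X = np.hstack([X, np.zeros((X.shape[0], 1))])

# 80/20 split; the filters are fit on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 1: drop features whose variance is zero on the training set.
variance_filter = VarianceThreshold(threshold=0.0).fit(X_train)
X_train_var = variance_filter.transform(X_train)

# Step 2: keep the k features with the highest mutual information with y.
score_func = partial(mutual_info_classif, random_state=0)
mi_filter = SelectKBest(score_func, k=10).fit(X_train_var, y_train)
X_train_selected = mi_filter.transform(X_train_var)

# Apply the already-fitted filters to the test set.
X_test_selected = mi_filter.transform(variance_filter.transform(X_test))

print(X_train.shape, "->", X_train_var.shape, "->", X_train_selected.shape)
```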

The test set contained 20% of the data not used in training the models. The performances of the random forest classifiers on the test and train set as well as the optimal number of variables obtained by feature selection are shown in Table 1.

As illustrated in Fig.1, the predictive performances of the random forest models built with ECFP and MD features did not significantly drop for classifying the compounds in the test set and were compatible with the spread of the AUC scores from cross-validation. This indicated that overfitting was minimised. On the other hand, the performance of the classifier using topological RDKit fingerprints on the test set dropped by approximately 6%. Therefore, the RDKit5 model was not selected for predicting the effect of the compounds in the screening database on the lifespan of C. elegans.

Box-and-whisker plot displaying the distribution of the AUC for the tenfold cross-validation (CV) on the training set and the AUC score for the single measurement taken on the test set, obtained by random forest classification. Each box represents the cross-validation data for a different model, where ECFP of 1024-bit length is shown in green, ECFP of 2048-bit length in blue, RDKit fingerprints in pink, molecular descriptors in red and the combination of ECFP of 1024-bit length with molecular descriptors are represented in orange colour. The value reported within the boxes is the median AUC value of the tenfold cross-validation. The points outside the boxplots represent possible outliers.

The receiver operating characteristic (ROC) curve is the plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) at varying classification thresholds. The ROC curves, displayed in Fig.2, compare the performances of the descriptor types for classifying the samples of the test set. The figure illustrates that the five random forest models performed better than a random prediction.
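The area under the ROC curve can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes the AUC from that pairwise definition in plain Python. It is an illustration only (the study used scikit-learn), and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc_constant = roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
```

A classifier that assigns the same score to every compound obtains 0.5 under this definition, which is the reference level drawn as the diagonal of a ROC plot.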

The ROC curves comparing the performances of the ECFP_1024 (in grey), ECFP_2048 (in green), RDKit5 (in orange), MD (in purple) and ECFP_1024_MD (in blue) for classifying the compounds in the test set. The AUC scores are reported for each descriptor type. The red dashed line corresponds to a random classifier, which gives random answers and has an AUC value of 0.5 (ref. 20).

The best performing model, selected by its ability to correctly classify the compounds in the test set, was used for predicting the class of the compounds in the screening dataset. The classifier built solely with molecular descriptors (MD), achieved the highest AUC score for predicting the class of the compounds in the test set. Combining MD with ECFP_1024, the feature type used to obtain the model with the second-highest predictive ability, did not further improve the performance of the classifier. The ECFP_1024 features could have provided additional information that was not useful to the random forest classifier making the predictions more difficult. Therefore, the MD model, which had an AUC score of 0.815 for classifying the compounds in the test set, was selected for further analysis.

In binary classification, the PPV and NPV are the percentage of correctly classified compounds among all compounds predicted as positives or negatives, respectively. Herein, the PPV and NPV indicate that the random forest model performed better on correctly classifying inactive compounds than active ones. The data used in this study was imbalanced as approximately 79% of the samples were negative entries. Thus, a random prediction that a compound is inactive had a much higher initial probability of being correct. To handle the imbalanced data, the class_weight argument of the random forest algorithm was set to balanced, which penalises misclassification of the minority class (i.e. the positive samples)21. The remaining parameters of the random forest model were left to the default settings of the scikit-learn Python library (please refer to the Random forest settings section in the Methods)22. This increased the PPV for classifying the compounds in the test set from 61.1% (value without balancing the class weights) to 65.6% (score achieved after balancing the class weights), while the NPV was reduced by 0.2%.
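For reference, the two predictive values can be computed directly from the counts of a confusion matrix. The sketch below is illustrative only: the function name `ppv_npv` and the counts are hypothetical and are not taken from the study.

```python
def ppv_npv(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts."""
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("no positive or no negative predictions were made")
    ppv = tp / (tp + fp)  # correct positives among all predicted positives
    npv = tn / (tn + fn)  # correct negatives among all predicted negatives
    return ppv, npv


# Hypothetical counts for an imbalanced test set (not the study's results).
ppv, npv = ppv_npv(tp=30, fp=10, tn=90, fn=20)
```

With these toy counts the PPV is 0.75 and the NPV is about 0.82. When negatives dominate the data, a high NPV is easier to reach than a high PPV, which is the asymmetry discussed above.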

Investigating which features are considered more important by black-box models such as random forest can aid understanding of how these models make predictions. In this experiment, the feature relevance was measured using the Gini importance of the random forest algorithm. The selected model, MD, was composed of 69 molecular descriptors calculated by the MOE software23. The table containing the full feature ranking can be found in Additional File 2. The analysis was focused on the top 30 features with the highest Gini importance (Table 2), which contained both 2D and 3D molecular descriptors.
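To make the Gini importance more concrete, the sketch below computes the Gini impurity of a node and the weighted impurity decrease produced by a single split. In a random forest, a feature's Gini importance is obtained by summing such decreases over all nodes that split on the feature, averaging over the trees, and normalising. The function names are hypothetical and the example is a simplified illustration, not the scikit-learn implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples at the node."""
    n_node = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n_node
    return (n_node / n_total) * (gini(parent) - children)


# A toy node with two actives (1) and two inactives (0), split perfectly.
decrease = impurity_decrease([1, 1, 0, 0], [1, 1], [0, 0], n_total=4)
```

A perfect split of a balanced binary node removes all of its impurity (0.5), whereas a split that leaves the class proportions unchanged contributes nothing.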

Atom and bond counts are simple descriptors that do not provide any information on molecular geometry or atom connectivity. The highest-ranking atom and bond count descriptors were a_nN, b_single, a_count, opr_brigid, and a_nH. While very simplistic, the atom and bond counts outperformed more complex 2D and 3D molecular descriptors. This is because atom and bond counts can partially capture the overall properties of a compound such as size, hydrogen bonding and polarity, which often impact the activity of a drug24. The number of nitrogen atoms, a_nN, was the top-ranking feature of the MD random forest model with a Gini importance score of 0.062. This is consistent with the results of Barardo et al. (2017) where a_nN was also ranked highest for predicting the class of the compounds in the DrugAge database4. Nitrogen atoms could have affected the physicochemical properties of the drugs as well as the interactions and binding of the molecules with target residues.

The highest-ranking topological descriptors included chi0_C, chi0v_C, zagreb, weinerPol, Kier3, chi0 and Kier2. Topological descriptors take into account atom connectivity. The descriptors are computed from molecular graphs, where atoms are represented by vertices and bonds by edges25. These descriptors can provide information on the degree of branching of the structure as well as molecular size and shape25. Although topological descriptors are extensively used in predictive modelling, they are usually hard to interpret26. Topological descriptors may have provided information on how well a molecule fits in the binding site and, along with atom counts, on its interactions with the binding residues.

Top ranking partial charge descriptors were PEOE_VSA+2, PEOE_VSA-4, PEOE_VSA+4, PEOE_VSA-6, PEOE_VSA_PPOS, Q_VSA_PNEG, PEOE_VSA_POL, Q_VSA_PPOS and PEOE_VSA_PNEG. The PEOE_ prefix denotes descriptors calculated using the partial equalization of orbital electronegativity (PEOE) algorithm for quantification of partial charges in the \(\sigma\)-system27,28. On the other hand, descriptors prefixed with Q_ were calculated using the Amber10:EHT force field23. In a ligand-receptor system, partial charges can play a key role in the binding properties of the molecule as well as molecular recognition.

The MD random forest model was applied to predict the class of the compounds in an external database consisting of 1738 small molecules obtained from the DrugBank database11. The top-ranking compounds with a predictive probability of \(\ge 0.80\) for increasing the lifespan of C. elegans are shown in Table 3. The full ranking of the molecules in the screening database can be found in Additional File 2. The compounds were broadly separated into the following categories: (1) flavonoids, (2) fatty acids and conjugates, and (3) organooxygen compounds. The compounds were classified based on the categories Class and Sub Class of the chemical taxonomy section of the DrugBank database (provided by Classyfire) or assigned manually if not present29.

Flavonoids are a group of secondary metabolites in plants that are common polyphenols in the human diet30. Major nutritional sources include tea, soy, fruits, vegetables, wine and nuts30,31. Flavonoids are separated into subclasses based on their chemical structure, including flavones, flavonols, flavanones, and isoflavones30.

Flavonoids have been associated with health benefits for age-related conditions such as metabolic diseases, cancer, inflammation and cognitive decline30,31. Possible mechanisms of action include antioxidant activity, scavenging of radicals, central nervous system effects, alteration of the intestinal transport, sequestration and processing of fatty acids, PPAR activation and increase of insulin sensitivity30.

Diosmin was the top-hit molecule in the screening database, with a predictive probability of 0.96. Diosmin is a flavonol glycoside that is either extracted from plants such as Rutaceae or obtained synthetically32. It has anti-inflammatory, free radical scavenging, and anti-mutagenic properties and has been used medically to treat pain and bleeding of haemorrhoids, chronic venous disease and lymphedema33. Nevertheless, diosmin has poor aqueous solubility, which is a challenge for oral administration34. Kamel et al. (2017) found that a combination of diosmin with essential oils showed skin antioxidant, anti-ageing and sun-blocking effects on mice34. The underlying mechanisms for diosmin's anti-ageing and photo-protective effects include enhancing lymphatic drainage, ameliorating capillary microcirculation inflammation and preventing leukocyte activation, trapping, and migration34,35.

Other flavonoids that ranked high for increasing the lifespan of C. elegans were rutin and hesperidin, with predictive probabilities of 0.95 and 0.94, respectively. Rutin (or quercetin-3-rutinoside) is a flavonol glycoside that is abundant in many plants such as passionflower, apple, tea, buckwheat seeds and citrus fruits36,37. It possesses a range of biological properties including antioxidant, anticancer, neuroprotective, cardio-protective and skin-regenerative activities36,37. Rutin had a high structural similarity to other flavonoids in the DrugAge database, particularly to quercetin 3-O-\(\beta\)-D-glucopyranoside-(4\(\rightarrow\)1)-\(\beta\)-D-glucopyranoside (Q3M). The Tanimoto coefficient between the RDKit fingerprints of Q3M and rutin was 0.99. The similarity map between the two compounds is shown in Fig. 4.

Similarity map for ECFP fingerprint with a default radius of 2 (a) structure of reference molecule Q3M from the DrugAge database (b) similarity map of rutin. In green colour are bits that if removed will decrease the similarity, whereas removing bits represented in pink colour will increase the similarity between the two compounds38. The figure was created in the RDKit Python environment19.

Q3M is a flavonoid abundant in onion peel that was found to extend the lifespan of C. elegans39. In the same study, although rutin was found to improve the tolerance of C. elegans to oxidative stress, which is desirable for longevity, rutin (20 \(\mu\)g/mL) did not significantly affect the worm's lifespan39. On the other hand, a more recent study by Cordeiro et al. (2020) found that treatment of C. elegans with 15, 30 and 120 \(\mu\)M rutin increased the lifespan of the nematodes from 28 (control) to 30 days40.

Hesperidin has shown reactive oxygen species (ROS) inhibition and anti-ageing effects in the yeast species Saccharomyces cerevisiae41. Fernández-Bedmar et al. (2011) found that hesperidin extracted from orange juice had a positive influence on the lifespan of D. melanogaster42. Wang et al. (2020) showed that orange extracts, where hesperidin was the predominant phenolic compound, increased the mean lifespan of C. elegans43. In the same study, orange extracts were also found to promote longevity by enhancing motility and reducing the accumulation of age pigment and ROS levels43.

Soy isoflavones include genistein, glycitein, and daidzein. Genistein, a compound in the DrugAge database, has been found to prolong the lifespan of C. elegans and increase its tolerance to oxidative stress44. Gutierrez-Zepeda et al. (2005) found that C. elegans fed with the soy isoflavone glycitein had an improved resistance towards oxidative stress45. However, in comparison to control worms, the lifespan of C. elegans fed with glycitein (100 \(\mu\)g/ml) was not significantly affected45. The effect of daidzein (100 \(\mu\)M) on the lifespan of C. elegans in the presence of pathogenic bacteria was investigated by Fischer et al. (2012)46. The study found that daidzein had an estrogenic effect that extended the lifespan of the nematode in the presence of pathogenic bacteria and heat46. Herein, we applied the MD random forest model to predict the effect of 6''-O-malonyldaidzin on the lifespan of C. elegans. 6''-O-Malonyldaidzin is an O-glycoside derivative of daidzein found in food products such as soybean, miso, soy milk and soy yoghurt47. Its predicted probability for extending the lifespan of the worm was 0.84.

Lipid metabolism has an essential role in many biological processes of an organism. Lipids are used as energy storage in the form of triglycerides and can therefore aid survival under severe conditions48. Additionally, lipids have a key role in intercellular and intracellular signalling as well as organelle homeostasis49. Research on both invertebrates and mammals indicates that alterations in lipid levels and composition are associated with ageing and longevity48,49.

A recent review by Johnson and Stolzing (2019) on lipid metabolism and its role in ageing summarised key lipid-related interventions that promote longevity in C. elegans50. Some of the studies presented in the review are reported here. O'Rourke et al. (2013) showed that supplementing C. elegans with the \(\omega\)-6 polyunsaturated fatty acids (PUFAs) arachidonic acid and dihomo-\(\gamma\)-linolenic acid (DGLA) increased the worm's starvation resistance and prolonged its lifespan by stimulating autophagy51. Similarly, Shemesh et al. (2017) found that DGLA extended the lifespan of C. elegans and maintained protein homeostasis in adulthood52. Additionally, Qi et al. (2017) found that treating C. elegans with the \(\omega\)-3 PUFA \(\alpha\)-linolenic acid extended the nematode's lifespan53. The study indicated that the \(\omega\)-3 fatty acid underwent oxidation to generate a group of molecules known as oxylipins. The findings suggested that the increase in the worm's lifespan could be a result of the combined effects of \(\alpha\)-linolenic acid and its oxylipin metabolites53. Sugawara et al. (2013) found that a low dose of fish oils, which contained the PUFAs eicosapentaenoic acid and docosahexaenoic acid, significantly increased the lifespan of C. elegans54. The authors proposed that a low dose of fish oils induces moderate oxidative stress that extends the lifespan of the organism. In contrast, large amounts of fish oils shortened the worm's lifespan54.

Gamolenic acid, or \(\gamma\)-linolenic acid (GLA), was the second top-hit molecule of the screening database with a predictive probability of 0.95. GLA is an \(\omega\)-6 PUFA composed of an 18-carbon chain with three double bonds in the 6th, 9th and 12th positions55. Rich sources of GLA include evening primrose oil (EPO), black currant oil, and borage oil56. In mammals, GLA is synthesized from dietary linoleic acid via the action of the enzyme \(\delta\)-6 desaturase55,56. GLA is a precursor for other essential fatty acids such as arachidonic acid55,56. Conditions such as hypertension and diabetes, as well as stress and various aspects of ageing, reduce the capacity of \(\delta\)-6 desaturase to convert linoleic acid to GLA57. This may lead to a deficiency of long-chain fatty acid derivatives and metabolites of GLA. GLA has been used as a constituent of anti-ageing supplements and has been shown to possess various therapeutic effects in humans, including improvement of age-related anomalies55.

Sodium aurothiomalate, with a lifespan increase probability of 0.82, is a thia short-chain fatty acid used for the treatment of rheumatoid arthritis and has potential antineoplastic activities29,58. In preclinical models, sodium aurothiomalate inhibited protein kinase C iota (PKC\(\iota\)) signalling, which is overexpressed in non-small cell lung, ovarian and pancreatic cancers58.

Lactose, with a lifespan increase probability of 0.89, is a disaccharide found in milk and other dairy products. In the human intestine, lactose is hydrolysed to glucose and galactose by the enzyme lactase. Out of the compounds in the DrugAge database, lactose had the highest structural similarity with trehalose. Trehalose has been found to increase the mean lifespan of C. elegans by over 30%, without showing any side effects59. The Tanimoto coefficient between the RDKit fingerprint representations of trehalose and lactose was 0.85. Even though lactose has a high (Tanimoto) similarity to trehalose, Xing et al. (2019) found that lactose treatment at 10, 25, 50 and 100 mM concentrations shortened the lifespan of C. elegans60.

Sucrose, with a lifespan increase probability of 0.83, is a disaccharide composed of glucose and fructose61. It is the main form in which carbohydrates are transported in fruits and vegetables61. Other sugars such as trehalose, galactose and fructose have been found to extend the lifespan of C. elegans59,62,63. Zheng et al. (2017) found that treating C. elegans with sucrose (55 \(\mu\)M, 111 \(\mu\)M, or 555 \(\mu\)M) had no significant effect on the organism's mean lifespan63. On the other hand, a more recent study by Wang et al. (2020) found that treatment with 50 \(\mu\)M sucrose significantly increased the lifespan of the nematodes, while a concentration of 400 \(\mu\)M significantly shortened their lifespan64.

Lactulose, with a lifespan increase probability of 0.83, is a synthetic disaccharide composed of the monosaccharides fructose and galactose65. Lactulose has been shown to be an effective treatment for chronic constipation in elderly patients and to improve cognitive function in patients with hepatic encephalopathy65,66.

Other compounds with a predictive probability of \(\ge 0.80\) for increasing the lifespan of C. elegans included alloin, a constituent of aloe vera with a predictive probability of 0.81, as well as the antibiotics fidaxomicin (predictive probability = 0.84), rifapentine (predictive probability = 0.81) and chlortetracycline (predictive probability = 0.80).

Rifapentine is a macrolactam antibiotic approved for the treatment of tuberculosis67. Macrolactams are a small class of compounds that consist of cyclic amides having unsaturation or heteroatoms replacing one or more carbon atoms in the ring29. Other macrolactams such as rifampicin and rifamycin have been found to increase the lifespan of C. elegans68.

Golegaonkar et al. (2015) showed that rifampicin reduced AGE products and extended the mean lifespan of C. elegans by 60%68. Advanced glycation end (AGE) products are formed from the non-enzymatic reaction of sugars, such as glucose, with proteins, lipids or nucleic acids68. AGE products have been implicated in ageing and age-related diseases such as diabetes, atherosclerosis, and neurodegenerative diseases68. The effect of two other macrolactams, rifamycin SV and rifaximin, on the lifespan of the nematode was also investigated. Rifamycin SV was found to exhibit similar activity to rifampicin, while rifaximin lacked anti-glycating activity and did not extend the lifespan of C. elegans. The authors suggested that the anti-glycation properties of rifampicin and rifamycin could be attributed to the presence of a para-dihydroxyl moiety, which was not present in rifaximin68. As shown in Fig.5, this functional group is also present in rifapentine. Experimental testing would be required to investigate whether rifapentine possesses similar properties to rifampicin and rifamycin.

Chemical structure of (a) rifamycin SV (b) rifampicin (c) rifaximin and (d) rifapentine. The para-dihydroxynaphthyl moiety possessed by rifamycin SV, rifampicin and rifapentine is highlighted in blue. Rifaximin possesses a para-aminophenyl moiety incorporated in a ring system, highlighted in red68. The figure was designed in ChemDraw and redrawn from Golegaonkar et al. (2015)68.

Pharmaceutical interventions that modulate ageing-related genes and pathways are considered the most effective approach for combating human ageing and age-related diseases. Widely used strategies for identifying active compounds include screening existing drugs with potential anti-ageing properties.

In this study, random forest models were built to predict whether a compound would increase the lifespan of C. elegans using the entries of the DrugAge database, which contains molecules with known anti-ageing properties such as metformin, spermidine, bacitracin, and taxifolin. Our results provide an update on the findings of Barardo et al. (2017) by employing the latest version of the DrugAge database, which includes many more entries. Specifically, five random forest models were built using molecular fingerprints and/or molecular descriptors as features. Feature selection and dimensionality reduction were performed using variance and mutual information-based pre-selection methods. The best performing classifier, the MD model, was built using molecular descriptors and achieved an AUC score of 0.815 for classifying the compounds in the test set. Combining molecular descriptors with ECFPs did not further improve the model's performance. The features of the MD model were ranked using the random forest's Gini importance measure. Among the 30 most important features were molecular descriptors related to atom and bond counts, topological properties and partial charge properties.

The highest performing model was applied to predict the class of the compounds in the screening database, which consisted of 1738 small molecules from DrugBank. The compounds with a predictive probability of \(\ge 0.80\) for increasing the lifespan of C. elegans were broadly separated into (1) flavonoids, (2) fatty acids and conjugates, and (3) organooxygen compounds.

This study highlighted several substances, such as orange extracts, rutin, lactose and sucrose, that have been experimentally evaluated on C. elegans but were not entries of the predictive database. A limitation of our algorithm is that it does not consider the substance's dose. For example, at certain concentrations sugars such as sucrose can promote longevity in C. elegans, whereas at higher concentrations such sugars have a detrimental effect on the lifespan of the nematodes64. Moreover, lactose, which received a predictive probability of 0.89 for increasing the lifespan of C. elegans from our model, was found to reduce the lifespan of C. elegans at 10–100 mM concentrations by Xing et al. (2019)60. Nevertheless, the compound could have a beneficial health effect at a different concentration.

Future work would involve in vivo testing of promising compounds such as \(\gamma\)-linolenic acid, rutin, lactulose and rifapentine to investigate their effect on the lifespan of C. elegans, as well as re-evaluating the effect of lactose at lower concentrations. Finally, further work would also explore how the predicted probability of lifespan increase is affected when testing two structurally similar compounds that promote longevity at different concentrations.

The dataset published in the study by Barardo et al. (2017) contains positive entries, which are compounds that increase the lifespan of C. elegans and negative entries, compounds that do not increase the lifespan of C. elegans4. In particular, the dataset contains 1392 compounds of which 229 are positive and 1163 are negative entries4. The positive entries of this dataset were obtained from the DrugAge database of ageing-related drugs, (Build 2, release date: 01/09/2016), available on the Human Ageing Genomic Resources website1,69. DrugAge provides information on drugs, compounds and supplements with anti-ageing properties that have been found to extend the lifespan of model organisms1. The species include worms, mice and flies, with the majority of data representing C. elegans4. Data has been obtained from studies performed under standard conditions and contain information relevant to ageing, such as average/median lifespan, maximum lifespan, strain, dosage and gender where available1. The negative entries of the database used in the study of Barardo et al. (2017) were obtained from the literature.

At the time of writing, the latest version of the DrugAge database, Build 3 (release date: 19/07/2019), corrects for small errors and adds hundreds of new entries. Herein, the positive entries in the database used in Barardo et al. (2017) were replaced with the data from the newest version of DrugAge, Build 3. The same negative entries as Barardo et al. (2017) were used4. The modified database contained a total of 1558 compounds with 395 positive entries and 1163 negative ones. In this study, the term DrugAge database refers to the modified dataset with a total of 1558 compounds.

The chemical structures of the DrugAge dataset were converted into canonical SMILES strings using the Python package PubChemPy70. The SMILES strings were standardised by the Standardiser tool developed by Francis Atkinson in 201471. Standardisation removed inorganic compounds, salt/solvent components and metal species, and neutralised the compounds by adding or removing hydrogen atoms71. Stereoisomers, although they may have different biological activities, were treated as duplicates as they had identical SMILES strings. For two or more stereoisomers in the same class, only one was kept. For duplicates in different classes, both were removed72. After standardisation and duplicate removal, the number of molecules in the DrugAge database was reduced to a total of 1430 compounds with 304 positive and 1126 negative entries. The predictive database used in this study can be found in Additional File 2.
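The duplicate-handling rule can be sketched as follows, assuming each entry is a pair of a standardised SMILES string and a class label (1 for positive, 0 for negative). The function name `deduplicate` and the toy entries are hypothetical; this illustrates the stated rule and is not the code used in the study.

```python
from collections import defaultdict


def deduplicate(entries):
    """Keep one entry per SMILES when labels agree; drop the SMILES when they conflict."""
    labels_by_smiles = defaultdict(set)
    for smiles, label in entries:
        labels_by_smiles[smiles].add(label)
    # Dictionaries preserve insertion order, so the first-seen order is retained.
    return [
        (smiles, next(iter(labels)))
        for smiles, labels in labels_by_smiles.items()
        if len(labels) == 1
    ]


# Toy entries: "CCO" is duplicated within one class, "CCN" appears in both classes.
entries = [("CCO", 1), ("CCO", 1), ("CCN", 0), ("CCN", 1), ("C", 0)]
cleaned = deduplicate(entries)
```

In this toy input, the repeated positive entry collapses to one row, and the SMILES that appears in both classes is removed entirely.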

The standardised SMILES strings were converted into mol files in the RDKit environment and opened in the MOE software19,23. The chemical structures were energy minimised in the Energy Minimize General mode of MOE using the Amber10:EHT force field23. A total of 354 descriptors were calculated, including all 2D descriptors as well as the internal-coordinate (i3D) and external-coordinate-dependent (x3D) 3D descriptors. Owing to software limitations, a few 3D descriptors ('AM1_E', 'AM1_Eele', 'AM1_HF', 'AM1_HOMO', 'AM1_IP', 'AM1_LUMO', 'MNDO_E', 'MNDO_Eele', 'MNDO_HF', 'MNDO_HOMO', 'MNDO_IP', 'MNDO_LUMO', 'PM3_E', 'PM3_Eele', 'PM3_HF', 'PM3_HOMO', 'PM3_IP', 'PM3_LUMO') could not be calculated for ten chemical structures. For each of these descriptors, the missing values were replaced with the average value over the remaining chemical structures.
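The mean imputation step can be illustrated with a small plain-Python sketch, in which missing descriptor values are marked as `None`. The function name `impute_column_mean` and the toy table are hypothetical, and the sketch assumes that every descriptor has at least one observed value.

```python
def impute_column_mean(rows):
    """Replace None in each column with the mean of that column's observed values."""
    n_columns = len(rows[0])
    means = []
    for j in range(n_columns):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Toy descriptor table: three molecules, two descriptors, two missing values.
table = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = impute_column_mean(table)
```

Each gap is filled with the mean of the values that could be calculated for the same descriptor, so the observed values themselves are left unchanged.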

Molecular fingerprints were generated in the Python RDKit environment from the standardised SMILES strings19. ECFP of 1024-bits and 2048-bits length were calculated with an atomic radius of 2. These were represented as ECFP_1024 and ECFP_2048, respectively. In addition to the ECFPs, RDKit topological fingerprints of 2048-bits length were generated with a maximum path length of 5 bonds and denoted as RDKit5.

Five random forest models were built using five different feature types and trained with the data of the DrugAge database. The feature types explored in this study, ECFP_1024, ECFP_2048, RDKit5, MD and ECFP_1024_MD, are summarised in Supplementary Table 1, Additional File 1. The ECFP_1024_MD feature was a combined descriptor type consisting of ECFPs of 1024 bit-length and molecular descriptors.

Feature selection was implemented in the scikit-learn Python library22. Features with low variance were removed first, creating three feature subsets var_100, var_95 and var_90. The filters removed features with the same value in all entries (var_100), features that had greater than 95% of constant values (var_95) and features with more than 90% constant values, respectively (var_90)73.
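A literal reading of the three filters can be sketched as follows: a feature is dropped if it is constant, or if its most frequent value covers more than the given fraction of entries. This is an assumption-laden illustration in plain Python. The study used scikit-learn, and the exact thresholding it applied may differ from this frequency-based rule. The function names and toy features are hypothetical.

```python
from collections import Counter


def dominant_fraction(values):
    """Fraction of entries taken up by the most frequent value."""
    return Counter(values).most_common(1)[0][1] / len(values)


def variance_filter(columns, threshold):
    """Keep non-constant features whose dominant value covers at most `threshold`."""
    kept = {}
    for name, values in columns.items():
        fraction = dominant_fraction(values)
        if fraction < 1.0 and fraction <= threshold:
            kept[name] = values
    return kept


# Toy binary features over 20 molecules.
columns = {
    "constant": [0] * 20,            # same value everywhere
    "rare_bit": [0] * 19 + [1],      # 95% of entries share one value
    "common_bit": [0] * 10 + [1] * 10,
}
var_100 = variance_filter(columns, threshold=1.00)
var_90 = variance_filter(columns, threshold=0.90)
```

With these toy features, the var_100 setting removes only the constant feature, whereas the var_90 setting also removes the rarely set bit.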

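The mutual information (MI) between two discrete random variables \(X\) and \(Y\) is defined as

\[I\left( {X;Y} \right) = \sum\limits_{y \in Y} {\sum\limits_{x \in X} {p\left( {x,y} \right)\log \left( {\frac{{p\left( {x,y} \right)}}{{p\left( x \right)p\left( y \right)}}} \right)} }\]

where \(p\left( {x,y} \right)\) is the joint probability mass function and \(p\left( x \right)\) and \(p\left( y \right)\) are the marginal probability mass functions for \(X\) and \(Y\), respectively74. Herein, Adjusted Mutual Information (AMI) was calculated between each feature and the class labels. AMI was used as it is a variation of MI that adjusts for chance75.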
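As an illustration of the quantity involved, the sketch below estimates the plain (unadjusted) mutual information between two discrete sequences, summing \(p(x,y)\log\left(p(x,y)/(p(x)p(y))\right)\) over the observed value pairs with probabilities estimated from counts. It does not apply the chance correction of AMI, and the function name `mutual_information` is hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plain MI (in nats) between two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    marginal_x = Counter(x)
    marginal_y = Counter(y)
    total = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = marginal_x[a] / n
        p_b = marginal_y[b] / n
        total += p_ab * log(p_ab / (p_a * p_b))
    return total


# A feature identical to the labels versus a feature independent of them.
mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A feature that reproduces a balanced binary label exactly yields \(\log 2\) nats, while a feature that is independent of the label yields zero.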

For each feature subset, AMI was computed using the adjusted_mutual_info_score function of scikit-learn, and the features were ordered by their AMI score22. The following settings were tested: using 5%, 10%, 25%, 50%, 75% and 100% of the features with the highest AMI score73. For example, if var_100 for MD contained 349 features, the database with 5% of the features would consist only of the 17 highest-ranking features. This process is outlined in Supplementary Fig. 2, Additional File 1.
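The percentage-based selection can be sketched as below. Rounding the number of retained features down is an assumption; it is consistent with the worked example in the text (5% of 349 features gives 17), but the study does not state the rounding rule. The function name `top_fraction` and the placeholder scores are hypothetical.

```python
def top_fraction(scores, fraction):
    """Return the names of the highest-scoring features, keeping floor(fraction * n)."""
    n_keep = int(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# Placeholder scores for 349 features; a higher index means a higher score.
scores = {f"feature_{i}": float(i) for i in range(349)}
selected = top_fraction(scores, 0.05)
```

With 349 placeholder features, the 5% setting retains 17 features, starting with the highest-scoring one.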

Cross-validation was performed in the scikit-learn Python library using the cross_val_score function22. The predictive database was randomly split into 80% training and 20% test set. The tenfold cross-validation was performed only on the training set. The performance of the models was evaluated using the AUC measure. Cross-validation was repeated 10 times, yielding 10 AUC scores. The predictive accuracy reported was the median AUC value of the 10 measurements obtained by cross-validation. The median, rather than average, AUC score was calculated as the former is more robust to outliers4.
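The evaluation protocol can be outlined in plain Python as follows. The sketch only shows the bookkeeping: a random 80/20 split, ten folds over the training indices, and the median of ten AUC values. The fold AUC scores are placeholders rather than results, the function names are hypothetical, and no stratification is applied, since the text describes the split only as random.

```python
import random
from statistics import median


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and return (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def make_folds(train_indices, n_folds=10):
    """Partition the training indices into n_folds disjoint validation folds."""
    return [train_indices[i::n_folds] for i in range(n_folds)]


train, test = split_indices(1430)   # 1430 compounds after duplicate removal
folds = make_folds(train)

# Placeholder AUC values, one per fold (illustrative only).
fold_aucs = [0.79, 0.81, 0.80, 0.83, 0.78, 0.82, 0.80, 0.84, 0.77, 0.81]
reported_auc = median(fold_aucs)
```

For 1430 compounds this gives 1144 training and 286 test compounds, and each of the ten folds holds out 114 or 115 training compounds in turn.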

The random forest classifiers were built with the scikit-learn Python library22. To handle the imbalanced data used in this study, the random forest parameter class_weight was set to balanced. The remaining parameters of the random forest classifier were left at their default settings. The models were run with 100 estimators (the number of trees in the forest), and the maximum number of features considered at each tree node was the square root of the total number of features. The AUC scores were calculated with the roc_auc_score function of scikit-learn using the class probabilities returned by the predict_proba method22.
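In scikit-learn, the balanced setting weights each class by n_samples / (n_classes × class count), so the minority class receives the larger weight. The sketch below reproduces that formula in plain Python. The class counts are those of the full predictive database (304 positive, 1126 negative) and are used for illustration only, since the weights in the study would have been derived from the training split. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# 304 positive (1) and 1126 negative (0) entries, as in the predictive database.
labels = [1] * 304 + [0] * 1126
weights = balanced_class_weights(labels)
```

With these counts the positive class receives a weight of about 2.35 and the negative class about 0.64, so that both classes carry the same total weight.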

The best performing model on the test set was applied to predict the class of the compounds in an external database, where the effect of the compounds on the lifespan of C. elegans was mostly unknown. The external database consisted of small-molecules obtained from the External Drug Links database of DrugBank (version 5.1.5, released on 2020-01-03)11. The External Drug Links database contained a list of drugs and links to other databases, such as PubChem and UniProt, providing information on these compounds11,76,77.

Generation of SMILES strings, standardisation and descriptor calculation were performed using the same methods as for the training (DrugAge) database, described in the sections above. Some of the entries of the DrugBank database were substances composed of more than one molecule, such as vegetable oils. These entries were either removed from the database or replaced by one of their main active ingredients. For example, borage oil was replaced with gamolenic acid. In the case of soy isoflavones, the major soy isoflavones (genistein, glycitein, and daidzein) had already been experimentally evaluated on the lifespan of C. elegans. Therefore, the entry was replaced with 6''-O-malonyldaidzin, a derivative of daidzein with unknown activity. Stereoisomers were treated as duplicates and only one of them was kept. Substances and stereoisomers present in both the DrugBank and DrugAge databases were removed from the screening database. The resulting database consisted of a total of 1738 small molecules.

The Tanimoto coefficients and similarity maps were computed in the Python RDKit environment19. The Tanimoto similarity is calculated between a reference molecule, which is known to be active, and a compound of interest with unknown activity.

Herein, the reference molecules were the positive entries of the DrugAge database. The compound with unknown activity was a selected entry from the screening database. The Tanimoto coefficient between the compound of interest with each of the reference molecules was calculated. The highest score achieved and the reference molecule used to obtain that score was reported. The Tanimoto coefficients were computed using the RDKit fingerprint representations of the compounds. Similarity maps were generated using ECFP fingerprint representations.
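For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. The sketch below applies this definition to fingerprints stored as sets of on-bit positions and reports the most similar reference molecule, mirroring the procedure described above. It is a plain-Python illustration (the study used RDKit): the toy bit sets and molecule names are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / union


def most_similar(query_bits, references):
    """Return (name, score) of the reference with the highest Tanimoto coefficient."""
    scored = ((name, tanimoto(query_bits, bits)) for name, bits in references.items())
    return max(scored, key=lambda pair: pair[1])


# Toy fingerprints (hypothetical on-bit positions, not real molecules).
references = {"reference_1": {1, 2, 3, 4}, "reference_2": {7, 8, 9}}
best_match = most_similar({1, 2, 3, 5}, references)
```

In this toy case the query shares three of five distinct on-bits with `reference_1`, giving a coefficient of 0.6, and has no bits in common with `reference_2`.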

Mamoshina, P. et al. Population specific biomarkers of human aging: A big data study using South Korean, Canadian, and Eastern European patient populations. J. Gerontol. A. Biol. Sci. Med. Sci. 73, 1482–1490 (2018).

Gaba, V., Rani, K. & Gupta, M. K. QSAR study on 4-alkynyldihydrocinnamic acid analogs as free fatty acid receptor 1 agonists and antidiabetic agents: Rationales to improve activity. Arab. J. Chem. 12, 1758–1764 (2019).

Roy, K., Kar, S. & Das, R. N. Chapter 2: Chemical Information and Descriptors. In Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment 47–80 (Academic Press, 2015).

Bender, A. & Glen, R. C. A discussion of measures of enrichment in virtual screening: Comparing the information content of descriptors with increasing levels of sophistication. J. Chem. Inf. Model. 45, 1369–1375 (2005).

Ramelet, A. A. Venoactive drugs. In Sclerotherapy: Treatment of Varicose and Telangiectatic Leg Veins (eds Goldman, M. P. et al.) 369–377 (W.B. Saunders, 2011).

Mangoni, A. A. Drugs acting on the cerebral and peripheral circulations. In A Worldwide Yearly Survey of New Data in Adverse Drug Reactions and Interactions Vol. 34 (ed. Aronson, J. K.) 311–316 (Elsevier, 2012).

Cordeiro, L. M. et al. Rutin protects Huntington's disease through the insulin/IGF1 (IIS) signaling pathway and autophagy activity: Study in Caenorhabditis elegans model. Food Chem. Toxicol. 141, 111323 (2020).

Fernández-Bedmar, Z. et al. Role of citrus juices and distinctive components in the modulation of degenerative processes: Genotoxicity, antigenotoxicity, cytotoxicity, and longevity in Drosophila. J. Toxicol. Environ. Heal. Part A 74, 1052–1066 (2011).

Shemesh, N., Meshnik, L., Shpigel, N. & Ben-Zvi, A. Dietary-induced signals that activate the gonadal longevity pathway during development regulate a proteostasis switch in Caenorhabditis elegans adulthood. Front. Mol. Neurosci. 10, 254 (2017).

Rezapour-Firouzi, S. Chapter 24: Herbal oil supplement with hot-nature diet for multiple sclerosis. In Nutrition and Lifestyle in Neurological Autoimmune Diseases (eds Watson, R. R. & Killgore, W. D. S.) 229–245 (Academic Press, 2017).

Yahia, E. M., Carrillo-López, A. & Bello-Perez, L. A. Carbohydrates. In Postharvest Physiology and Biochemistry of Fruits and Vegetables (ed. Yahia, E. M.) 175–205 (Woodhead Publishing, 2019).

Vinh, N. X., Epps, J. & Bailey, J. Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary? In Proceedings of the 26th Annual International Conference on Machine Learning 1073–1080 (Association for Computing Machinery, 2009).

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
