In this post I take a look at charge descriptors, or charge fingerprints as one might call them. In 2009 David Winkler published an article on the idea, "Towards Novel Universal Descriptors: Charge Fingerprints". It looks very interesting and may provide insight into the electrostatic properties of molecules, which play an important role in molecular interactions and binding affinity. By encoding the partial charges of atoms, charge fingerprints help compare molecules based on their charge distribution, and may assist in predicting the activity of compounds in biological systems or their physicochemical properties.
Calculating charge fingerprints typically involves two main steps. First, the partial charges of the atoms in the molecule are computed using a charge model such as Gasteiger or MMFF94; Open Babel offers several ways to compute them. A quantum chemistry package such as Psi4 can also be used to calculate Mulliken charges for every atom. These models are based on various approximations and empirical rules derived from quantum chemistry calculations and experimental data. Examples include Gasteiger's approach to charge equalization and the newer electronegativity equalization method (EEM), which is based on Sanderson's equation. Other methods such as semiempirical molecular orbital methods, DFT, or ab initio methods can also be used to calculate atomic charges, provided the bin boundaries are set appropriately. I did not try to reproduce the paper; instead I used the idea to build a fingerprint of my own.
Once the partial charges are obtained, they can be encoded into a fingerprint, which is generally a binary vector of fixed length. A standard way to generate the fingerprint is to discretize the charge values into bins and assign each atom to one bin. This gives a sparse binary vector in which each element corresponds to a specific combination of atom and charge bin. A '1' at a given position indicates that the corresponding atom has a partial charge within the range of the associated bin. By comparing the charge fingerprints of different molecules, one can assess their similarity in terms of electrostatic properties, which is useful for cheminformatics tasks such as virtual screening, similarity searching, and property prediction.
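As a small illustration of the binning step alone, here is a sketch that discretizes a few made-up partial charges into 32 equal-width bins between -1.0 and 1.0 with NumPy. The charge values and the helper name `charge_to_bin` are assumptions for illustration only, and clamping out-of-range charges to the edge bins is my choice rather than something taken from the paper.

```python
import numpy as np

def charge_to_bin(charge, bin_min=-1.0, bin_max=1.0, nbins=32):
    """Return the 0-based index of the equal-width bin that holds `charge`.

    Charges outside [bin_min, bin_max] are clamped to the first or last bin
    (an assumption made for this sketch).
    """
    edges = np.linspace(bin_min, bin_max, nbins + 1)
    # np.digitize gives a 1-based position relative to the bin edges
    idx = np.digitize(charge, edges) - 1
    return int(np.clip(idx, 0, nbins - 1))

# Hypothetical partial charges, not computed from any real molecule
toy_charges = [-0.65, 0.01, 0.28, 1.4]
toy_bins = [charge_to_bin(q) for q in toy_charges]
print(toy_bins)  # [5, 16, 20, 31]
```

Each bin is 0.0625 wide here, so -0.65 lands in bin 5, and the out-of-range value 1.4 is clamped into the last bin (31).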
The code below shows how you can generate these fingerprints using MMFF94 force field charges.
import numpy as np
from openbabel import openbabel as ob

def tanimoto_similarity(fp1, fp2):
    common_bits = np.bitwise_and(fp1, fp2).sum()
    total_bits = np.bitwise_or(fp1, fp2).sum()
    # Two empty fingerprints have no bits to compare; report zero similarity
    return common_bits / total_bits if total_bits else 0.0

def generate_charge_fingerprint(smiles, n_bits=2048, bin_min=-1.0, bin_max=1.0, nbins=32):
    # Initialize Open Babel objects
    ob_conversion = ob.OBConversion()
    ob_conversion.SetInFormat("smi")
    ob_mol = ob.OBMol()
    # Convert SMILES to an Open Babel molecule
    ob_conversion.ReadString(ob_mol, smiles)
    ob_mol.AddHydrogens()
    # Compute MMFF94 partial charges for every atom (Open Babel atoms are 1-indexed)
    ob_charge_model = ob.OBChargeModel.FindType("mmff94")
    ob_charge_model.ComputeCharges(ob_mol)
    charges = [ob_mol.GetAtom(i + 1).GetPartialCharge() for i in range(ob_mol.NumAtoms())]
    # Initialize the fingerprint vector
    fingerprint = np.zeros(n_bits, dtype=np.uint8)
    # Create bin edges for the partial charges
    bins = np.linspace(bin_min, bin_max, nbins + 1)
    # Set the corresponding bit for each atom's partial charge
    for idx, charge in enumerate(charges):
        # Clamp so charges outside [bin_min, bin_max] land in the first or last bin
        # instead of spilling into a neighbouring atom's block of bits
        bin_index = min(max(int(np.digitize(charge, bins)) - 1, 0), nbins - 1)
        bit_index = idx * nbins + bin_index
        # Atoms that do not fit into n_bits are skipped
        if bit_index < n_bits:
            fingerprint[bit_index] = 1
    return fingerprint
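The `tanimoto_similarity` helper is defined above but not exercised anywhere in the post. Here is a minimal usage sketch on two hand-made bit vectors (toy data, not real charge fingerprints). The function is repeated so the snippet runs on its own, and the guard for two all-zero vectors is an assumption on my part.

```python
import numpy as np

def tanimoto_similarity(fp1, fp2):
    """Tanimoto coefficient: shared set bits divided by bits set in either vector."""
    common_bits = np.bitwise_and(fp1, fp2).sum()
    total_bits = np.bitwise_or(fp1, fp2).sum()
    # Assumption: two empty fingerprints are reported as 0.0 instead of dividing by zero
    return common_bits / total_bits if total_bits else 0.0

# Toy bit vectors for illustration only
fp_a = np.array([1, 1, 0, 0, 1, 0, 0, 0], dtype=np.uint8)
fp_b = np.array([1, 0, 0, 0, 1, 1, 0, 0], dtype=np.uint8)

sim = tanimoto_similarity(fp_a, fp_b)
print(f"Tanimoto similarity: {sim:.2f}")  # 2 shared bits / 4 bits set in either = 0.50
```

Keep in mind that the fingerprint in this post indexes bits by atom position, so the score between two real molecules depends on the atom ordering of the input.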
Adding hydrogens to a molecular structure before calculating charge descriptors is important because hydrogen atoms play a large role in how charge is distributed within a molecule. Many molecular representations, such as SMILES or SDF, often leave hydrogen atoms implicit, omitting them for brevity. However, hydrogen atoms are involved in various chemical interactions, such as hydrogen bonding and protonation/deprotonation, which can significantly affect a molecule's charge distribution and its physicochemical properties. When calculating charge descriptors, the underlying charge models, like Gasteiger or MMFF94, need accurate information about the molecular structure to give reliable partial charge estimates. Adding hydrogens explicitly ensures that the charge model sees the correct bonding environment of every atom, which leads to more accurate charge descriptors.
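One practical side effect of adding hydrogens follows from the layout used in `generate_charge_fingerprint` (bit index = atom index * nbins + bin index): the atom count goes up, but the vector only has room for `n_bits // nbins` atoms. A quick arithmetic sketch; the helper names are hypothetical and the 90-atom molecule is a made-up example.

```python
def max_encoded_atoms(n_bits=2048, nbins=32):
    """Number of atoms that fit when each atom owns a block of `nbins` bits."""
    return n_bits // nbins

def skipped_atoms(n_atoms, n_bits=2048, nbins=32):
    """Atoms (hydrogens included) whose bits would fall past the end of the vector."""
    return max(0, n_atoms - max_encoded_atoms(n_bits, nbins))

capacity = max_encoded_atoms()
print(capacity)           # 64
print(skipped_atoms(90))  # 26 atoms of a hypothetical 90-atom molecule are not encoded
```

With the defaults used in this post, only the first 64 atoms of a molecule contribute to the fingerprint, so larger `n_bits` or fewer bins may be worth trying for bigger molecules.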
The next part is straightforward: once you have the fingerprints, check whether they carry useful signal by training a model. I used XGBoost here with 5-fold CV. The dataset I was interested in is hERG, taken from the TDC benchmark. I haven't studied other datasets much yet, but the results on this one suggest the descriptor has something in it. The average ROC-AUC came out at about 0.7929.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn import metrics
from sklearn.metrics import roc_auc_score, confusion_matrix
from tdc.single_pred import Tox
data = Tox(name = 'hERG')
split = data.get_split()
train,test = split['train'],split['test']
smiles_list = train['Drug'].tolist()
y = train['Y'].values
# Generate charge fingerprints for every molecule within the dataset
fingerprints = np.array([generate_charge_fingerprint(smiles,n_bits=2048, nbins=32) for smiles in smiles_list])
# Cross-validation parameters
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=123)
params = {
    'colsample_bynode': 0.8,
    'learning_rate': 0.01,
    'max_depth': 12,
    'alpha': 0.5,
    'lambda': 0.5,
    'min_child_weight': 1,
    'num_parallel_tree': 100,
    'objective': 'binary:logistic',
    'subsample': 0.8,
    'scale_pos_weight': 1,
    'tree_method': 'hist',
    'eval_metric': ['auc', 'error'],
    'n_jobs': -1,
    'random_state': 123,
}
# Perform cross-validation
roc_auc_scores = []
for train_index, test_index in kf.split(fingerprints):
    X_train, X_test = fingerprints[train_index], fingerprints[test_index]
    y_train, y_test = y[train_index], y[test_index]
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    # Note: early stopping monitors the same fold that is scored below
    bst = xgb.train(params, dtrain, num_boost_round=100, early_stopping_rounds=10, evals=[(dtest, 'test')])
    y_pred_proba = bst.predict(dtest)
    y_pred = y_pred_proba > 0.5
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    roc_auc_scores.append(roc_auc)
# Calculate the average ROC-AUC score across folds
average_roc_auc = np.mean(roc_auc_scores)
print(f"Average ROC-AUC: {average_roc_auc:.4f}")
# Train the model on the full training set
dtrain_full = xgb.DMatrix(fingerprints, label=y)
bst_full = xgb.train(params, dtrain_full, num_boost_round=100)
The results on the held-out test set are below. They suggest these features could be a valuable addition to models.
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

test = split['test']
test_list = test['Drug'].tolist()
y_test = test['Y'].values
# Generate charge fingerprints for every molecule in the test set
test_fp = np.array([generate_charge_fingerprint(smiles, n_bits=2048, nbins=32) for smiles in test_list])
# Use a separate name so the DMatrix does not overwrite the `test` DataFrame
dtest_final = xgb.DMatrix(test_fp, label=y_test)
y_pred_proba = bst_full.predict(dtest_final)
y_pred = y_pred_proba > 0.5
roc_auc = roc_auc_score(y_test, y_pred_proba)
confusion_matrix(y_test, y_pred)
print('Precision: %.3f' % precision_score(y_test, y_pred))
print('Recall: %.3f' % recall_score(y_test, y_pred))
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
print('F1 Score: %.3f' % f1_score(y_test, y_pred))

Precision: 0.839
Recall: 0.959
Accuracy: 0.832
F1 Score: 0.895
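As a quick sanity check on the reported numbers, the F1 score can be recomputed from the precision and recall above, since F1 is their harmonic mean. The snippet uses only the rounded values printed in this post, so it is a consistency check and not a re-run of the model.

```python
# Rounded values as reported in the post
precision = 0.839
recall = 0.959

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from reported precision and recall: {f1:.3f}")  # 0.895
```

The recomputed value matches the reported F1 of 0.895.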
Charge fingerprints can be valuable for modeling various molecular properties and activities, because they give insight into the electrostatic behavior of compounds, which is a key factor in many chemical and biological interactions. Incorporating charge information into molecular models makes it possible to better capture the nuances of molecular recognition, binding, and reactivity, which can lead to more accurate predictions and a better understanding of the underlying molecular mechanisms.
Please leave a comment if you find this idea useful and would like to explore the topic further.