Histograms of molecular weight (a) and experimental binding free energy ∆G (b)
for the training sets used to build the final COMBINE models of trypsin,
thrombin, and urokinase. The molecular weights of the ligands were distributed
between 100 and 650 Da, and the experimental ∆G values (in kcal/mol) spanned a
total range of 11 log units, with a maximum of 9 log units for a single
training set. The training sets of thrombin and trypsin showed a wider
distribution of molecular weight and binding affinity than that of urokinase.
(some
more histograms)
For the training set used to build the COMBINE model of thrombin, a matrix of
Tanimoto similarity values was generated. The Tanimoto similarity is a value
between 0 and 1 based on the 2D structure information of two ligands.
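A minimal sketch of how such a similarity matrix can be computed is given here. It assumes that each ligand is represented by the set of "on" bits of a 2D fingerprint; the fingerprints, the function names and the convention for two empty fingerprints are hypothetical and are not taken from this work.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| of two sets of 'on' bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(bits_a & bits_b) / union


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto values."""
    n = len(fingerprints)
    return [[tanimoto(fingerprints[i], fingerprints[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical fingerprints for three ligands (illustrative values only).
fingerprints = [
    frozenset({1, 2, 3, 4}),
    frozenset({2, 3, 4, 5}),
    frozenset({7, 8}),
]
matrix = similarity_matrix(fingerprints)
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

Every diagonal element is 1, and ligands that share no bits have a similarity of 0.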
The best coefficients of determination R2 and predictive correlation
coefficients Q2 of the COMBINE models of thrombin, trypsin, and urokinase,
tabulated with respect to the number of latent variables (LV).
The best Q2-LOO (leave-one-out) and Q2-LTO (leave-two-out) values for
thrombin, trypsin and urokinase were 0.89 (LV5), 0.83 (LV3), and 0.68 (LV4),
respectively. For trypsin, variable selection did not improve the model, but
for thrombin and urokinase the models could be improved, according to internal
cross-validation, by using D-optimal pre-selection (D-opt) and fractional
factorial design (FFD) variable selection at LV4. For thrombin, LV4 and LV5
resulted in nearly the same values; because of the risk of overfitting, the
lower number of latent variables was chosen. The COMBINE model of thrombin
could be slightly improved by including four highly conserved water molecules
in the active site as additional ‘residues’ (X-variables).
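The difference between R2 and the leave-one-out Q2 can be illustrated with a minimal sketch. It assumes a one-variable least-squares line as a stand-in for the PLS regression of the COMBINE models, and it uses the common definition Q2 = 1 - PRESS/SS with SS taken around the mean of all observations (other definitions exist). All data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept (stand-in for the PLS model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def total_sum_of_squares(ys):
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r2(xs, ys):
    """Fit on all data and score on the same data."""
    slope, intercept = fit_line(xs, ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / total_sum_of_squares(ys)


def q2_loo(xs, ys):
    """Leave each observation out once, refit, and predict the left-out one."""
    press = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / total_sum_of_squares(ys)


# Hypothetical descriptor values and binding free energies (kcal/mol).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
delta_g = [-6.1, -7.0, -7.8, -9.1, -9.9, -11.2]
r2_value = r2(descriptor, delta_g)
q2_value = q2_loo(descriptor, delta_g)
print(f"R2 = {r2_value:.3f}, Q2-LOO = {q2_value:.3f}")
```

Because each left-out ligand is predicted by a model that never saw it, Q2 cannot exceed R2 for a least-squares fit; a large gap between the two is a typical sign of overfitting.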
a) The R2 and Q2 values of the internal cross-validation of the different
COMBINE models were plotted as a function of the number of latent variables
(LV). (For more details see the legend of table 1.) b) Predicted versus
experimental binding free energy ∆G in kcal/mol. For the COMBINE models of
urokinase and thrombin, four latent variables (LV4) were chosen; results are
shown before (blue dots) and after variable selection (red dots). For trypsin
the best model was reached with three latent variables (LV3) without any
variable selection. The R2 values based on the plots are given in the figures
and in table 1.
In the Regression Error Characteristic (REC) curves, the cumulative proportion
was plotted versus the error tolerance for the absolute difference between the
experimental and predicted ∆G values. Ligands of the ‘pseudo’ test set were
docked ten times into the corresponding receptor models of thrombin, trypsin
and urokinase. For each docking solution a ∆G value was predicted, and the
solutions were ranked according to RMSD, GoldS, ∆∆Gexper‑pred (best
abs(exper-pred)), ∆∆Gdesolv (best dG bind elec) and ∆Gpred (for more details
see results).
a) The curves show the cumulative distribution of the best RMSD (dotted blue
line), GoldS (green line), ∆∆Gexper‑pred (dotted dark red) and ∆∆Gdesolv
(brown) ranked and the 5th ranked (red) ∆G values against the error of
prediction in kcal/mol. In addition, curves based on predicted ∆G values of
ligand conformations taken from X-ray structures before (blue) and after
(purple) variable selection are given.
b) The same
ranking was used to plot the cumulative proportion versus the RMSD (in
Å) of
the docking solutions.
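A minimal sketch of how an REC curve of this kind can be built: for each error tolerance, it reports the fraction of ligands whose absolute prediction error does not exceed that tolerance. The ∆G values, the tolerance grid and the function name are hypothetical and serve only as an illustration.

```python
def rec_curve(experimental, predicted, tolerances):
    """Return (tolerance, cumulative proportion) pairs of an REC curve."""
    errors = [abs(e - p) for e, p in zip(experimental, predicted)]
    n = len(errors)
    return [(tol, sum(err <= tol for err in errors) / n) for tol in tolerances]


# Hypothetical experimental and predicted binding free energies (kcal/mol).
experimental = [-8.0, -9.5, -7.2, -10.1]
predicted = [-7.6, -9.9, -8.9, -10.0]
tolerances = [0.0, 0.5, 1.0, 1.5, 2.0]

curve = rec_curve(experimental, predicted, tolerances)
for tolerance, proportion in curve:
    print(f"tolerance {tolerance:.1f} kcal/mol: {proportion:.2f}")
```

A curve that rises steeply at small tolerances indicates that most predictions lie close to the experimental values; the same construction applies to panel b) if the RMSD of each docking solution is used in place of the ∆G error.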
The 5th ranked predicted ∆G value of the docking solution for trypsin was
divided by the 5th ranked predicted ∆G value of the docking solution for
thrombin, which gives the predicted selectivity. The predicted selectivity was
plotted against the experimental selectivity (experimental ∆G of trypsin /
experimental ∆G of thrombin). The inhibitors of the Klebe data set were not
used. The five points in the lower right part were not used for calculating
the R2, because the absolute difference between the experimental and predicted
∆G values for thrombin was greater than 3 log units. Although the prediction
for the Klebe data set was quite good, the experimental ∆G values were within
the noise of the prediction, so no prediction of selectivity could be given.
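The selectivity calculation described in this caption can be sketched as follows. The sketch assumes that R2 is the squared Pearson correlation between experimental and predicted selectivity, and that the exclusion threshold of 3 is applied in the same units as the supplied ∆G values; both points are assumptions, and the ligand names and values are hypothetical.

```python
def squared_pearson(xs, ys):
    """Squared Pearson correlation (assumed definition of R2 here)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Hypothetical ligands:
# (name, exp. trypsin, pred. trypsin, exp. thrombin, pred. thrombin)
ligands = [
    ("lig_a", -8.0, -7.5, -10.0, -9.0),
    ("lig_b", -6.0, -6.6, -9.0, -8.5),
    ("lig_c", -9.0, -8.8, -7.0, -11.0),  # thrombin error of 4: excluded
    ("lig_d", -7.0, -7.2, -8.0, -8.4),
]

MAX_THROMBIN_ERROR = 3.0  # assumed threshold, same units as the dG values

# Keep only ligands whose thrombin prediction is within the threshold.
kept = [lig for lig in ligands if abs(lig[3] - lig[4]) <= MAX_THROMBIN_ERROR]

# Selectivity = dG(trypsin) / dG(thrombin), experimental and predicted.
points = [(exp_try / exp_thr, pred_try / pred_thr)
          for _, exp_try, pred_try, exp_thr, pred_thr in kept]

r2_selectivity = squared_pearson([p[0] for p in points],
                                 [p[1] for p in points])
print(f"{len(points)} ligands kept, R2 = {r2_selectivity:.2f}")
```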
The real coefficients for van der Waals and electrostatic interactions of the
COMBINE models of thrombin and trypsin were plotted for the different residues.
The surfaces of the active sites of thrombin and trypsin were coloured
according to the real coefficients for van der Waals and electrostatic
interactions. The labeling of the residues is based on chymotrypsin numbering.