PROTEINS: Structure, Function, and Bioinformatics 66:588–599 (2007)

Learning About Protein Hydrogen Bonding by Minimizing Contrastive Divergence

Alexei A. Podtelezhnikov,1 Zoubin Ghahramani,2 and David L. Wild1,3*

1 Keck Graduate Institute of Applied Life Sciences, Claremont, California 91711
2 Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, United Kingdom
3 Systems Biology Centre, University of Warwick, Coventry CV4 7AL, United Kingdom

ABSTRACT

Defining the strength and geometry of hydrogen bonds in protein structures has been a challenging task since the early days of structural biology. In this article, we apply a novel statistical machine learning technique, known as contrastive divergence, to efficiently estimate both the hydrogen bond strength and the geometric characteristics of strong interpeptide backbone hydrogen bonds, from a dataset of structures representing a variety of different protein folds. Despite the simplifying assumptions of the interatomic energy terms used, we determine the strength of these hydrogen bonds to be between 1.1 and 1.5 kcal/mol, in good agreement with earlier experimental estimates. The geometry of these strong backbone hydrogen bonds features an almost linear arrangement of all four atoms involved in hydrogen bond formation. We estimate that about a quarter of all hydrogen bond donors and acceptors participate in these strong interpeptide hydrogen bonds. Proteins 2007;66:588–599. © 2006 Wiley-Liss, Inc.

Key words: hydrogen bond; machine learning; contrastive divergence; Metropolis Monte Carlo

INTRODUCTION

It is hardly possible to identify a concept more fundamental in protein structure than hydrogen bonding. Hydrogen bonds between backbone atoms play a central role in stabilizing the secondary structure of proteins.1,2 Based on surveys of the Protein Data Bank (PDB),3 important reviews of hydrogen bonding in globular proteins have formulated the basics of our present understanding of hydrogen bond geometry and networking.4–7 In these reviews, the geometric definitions of hydrogen bonds were designed to capture as many reasonable hydrogen bonds as possible. Although similar to each other, these geometric criteria remain empirical and even idiosyncratic. The strength of hydrogen bonds relative to other interactions is surrounded by even greater controversy (as reviewed by Rose and Wolfenden8 and Fleming and Rose9). Although most researchers agree that the formation of interpeptide hydrogen bonds is enthalpically favorable, some argue that the stabilizing effect of hydrogen bonds is marginal.10

Fueled by the growing PDB, more recent hydrogen bond studies have concentrated on inferring a knowledge-based hydrogen bond potential.11,12 This potential is usually obtained from the frequencies at which a certain interaction angle or distance is observed in a dataset, while assuming that the dataset correctly represents thermodynamic equilibrium with an underlying Boltzmann distribution.13–15 The main goal of this study was to evaluate backbone hydrogen bond strength and geometric criteria for use in an ab initio modeling procedure, from a dataset of structures representing a variety of different protein folds. Winther and Krogh have previously demonstrated that optimized potential functions learned from training data can stabilize a set of native conformations.16 In this work, we used interatomic energy terms that were inspired by the classical force fields but greatly simplified for the purposes of our simulation. We applied a novel statistical machine learning technique known as contrastive divergence (CD)17 to estimate backbone hydrogen bond parameters. This method essentially optimizes the parameters to minimize the difference between the original dataset and the structures resulting from small perturbations by a Metropolis method in the force field of hydrogen bonds and van der Waals repulsions. We use a diverse set of 247 protein structures that represents a variety of different protein folds extracted from the SCOP/ASTRAL database.18,19 To perturb the structures on each CD iteration, we draw on a high-performance Metropolis procedure described in detail elsewhere.20 To facilitate the interpretability of results, we approximate hydrogen bonding by a square-well potential with well-defined hydrogen bond strength and cutoffs for angles and distances between the atoms. To the best of our knowledge, the geometric criteria and the strength of hydrogen bond formation have never been estimated simultaneously without a prior assumption about one or the other.

Grant sponsor: National Institutes of Health; Grant number: 1 P01 GM63208.
*Correspondence to: David L. Wild, Keck Graduate Institute of Applied Life Sciences, 935 Watson Drive, Claremont, CA 91711. E-mail: [email protected]
Received 7 June 2006; Revised 11 August 2006; Accepted 8 September 2006
Published online 15 November 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21247


PROTEIN HYDROGEN BONDS AND CONTRASTIVE DIVERGENCE

The model of the polypeptide backbone used in this work is characterized by absolutely rigid planar peptide bonds with explicit hydrogen atoms. The bond lengths and angles are fixed and correspond to the classical averages from Engh and Huber.21 Interestingly, despite the simplified force-field used and the reduced number of degrees of freedom in our model, our estimates of hydrogen bond geometry and strength agree with traditional views on hydrogen bonds. We also demonstrate that the CD learning technique employed in this work is capable of evaluating physical interactions in terms of arbitrarily selected parameters.

METHODS

Polypeptide Model

We modeled the polypeptide as a chain of absolutely rigid peptide groups elastically connected at α-carbons, with the valence angles constrained to 109.5° ± 2.8°. The positions of all peptide bond atoms including hydrogen were specified by the orientations of the peptide bonds and were consistent with the trans-conformation. We fixed the peptide bond lengths and angles at standard values.21,22 The distance between adjacent α-carbons was fixed at 3.8 Å in our model. The β-carbon positions were stipulated by the tetrahedral geometry of the α-carbon atoms and corresponded to L-amino acids. A more detailed description of the model was given in our previous work.20 No other side chain atoms, besides β-carbons, were considered in the model. Glycine residues lacked a β-carbon. Proline residues lacked peptide bond hydrogens and their dihedral angles, φ, were elastically constrained to −60° ± 7°.23 Therefore, our model incorporated only very limited sequence information.

The local elastic bending interactions mentioned above were introduced as a harmonic potential, $E^{B}_{i}$, pertaining to each amino acid i. Global interactions included only van der Waals repulsions between colliding atoms separated by at least three chemical bonds and hydrogen bonding between the amide hydrogens and carbonyl oxygens of the peptide backbone. As in our previous work,20 we mimicked van der Waals repulsions with hard-sphere potentials:

$$E^{\mathrm{vdW}}_{ij} = n_c W \qquad (1)$$

where $n_c$ is the number of overlaps between the atoms of amino acids i and j. The collision cost W was set at a very high value of 15 kcal/mol, effectively prohibiting overlaps between atoms during simulations. We used values of hard-sphere atomic radii close to the lower limit of the range found in the literature24–27: r(Cα) = r(Cβ) = 1.57 Å, r(C) = 1.42 Å, r(O) = 1.29 Å, r(N) = 1.29 Å. The atomic radii and the collision cost were constant parameters in the course of this work. In our model, backbone hydrogen atoms were excluded from the collision analysis but were important in identifying hydrogen bonds. The energy of the hydrogen bond (see Fig. 1) was similarly described by a square-well potential,

Fig. 1. (A) Hydrogen bond geometry. The distance and two angular parameters of hydrogen bonds are shown. (B) Schematic one-dimensional approximation of hydrogen bond energy with a square-well potential. This approximation sharply discriminates between strong and weak hydrogen bonds. Weak bonds do not contribute to the total energy and are dropped from consideration in this work. The hydrogen bond strength H corresponds to an average strength of hydrogen bonds.

$$E^{\mathrm{HB}}_{ij} = -n_h H \qquad (2)$$

where H is the strength of each hydrogen bond, and $n_h$ is the number of hydrogen bonds between the amino acids i and j. We considered the hydrogen bond formed when three distance and angular conditions were satisfied: r(O, H) < δ, ∠OHN > Θ, and ∠COH > Ψ, where r(O, H) is the distance between oxygen and hydrogen, and the symbol ∠ denotes the angle between the three atoms (see Fig. 1). The lower bound on the separation between the atoms (r(O, H) > 1.8 Å) was implicitly set by the hard-sphere collision between oxygen and nitrogen. We used the same hydrogen bond potential regardless of the secondary structure adopted by the peptide backbone. The primary focus of this work was to determine the strength of the hydrogen bonds, H, as well as the three cutoff parameters, {δ, Θ, Ψ}.

To summarize, the total energy of a polypeptide chain conformation X, given the set of model parameters θ = {H, δ, Θ, Ψ}, was calculated as follows:

$$E(X; \theta) = \sum_{i=1}^{N} E^{B}_{i} + \sum_{i=1}^{N} \sum_{j=1}^{i} \left( E^{\mathrm{vdW}}_{ij} + E^{\mathrm{HB}}_{ij} \right) \qquad (3)$$
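As an illustration of the geometric test and of the square-well energy of Eq. (2), the following Python sketch evaluates the three conditions for a single C=O···H–N pair. It is a minimal sketch and not the software used in this work: the function names and the example coordinates are hypothetical, the default cutoffs are the values reported for the whole dataset in Table II, and the sign convention (each formed bond lowers the energy by H, in units of RT) is an assumption.

```python
import math

# Cutoffs reported for the entire dataset (Table II); H is in units of RT.
DELTA = 2.14      # upper bound on r(O, H), in angstroms
THETA = 151.9     # lower bound on the O-H-N angle, in degrees
PSI = 142.1       # lower bound on the C-O-H angle, in degrees
H_OVER_RT = 2.06  # hydrogen bond strength


def angle_deg(a, b, c):
    """Return the angle a-b-c in degrees, with atom b at the vertex."""
    u = [ai - bi for ai, bi in zip(a, b)]
    v = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    cos_t = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cos_t))


def is_strong_hbond(c, o, h, n, delta=DELTA, theta=THETA, psi=PSI):
    """Apply the three square-well conditions to one C=O...H-N pair."""
    return (math.dist(o, h) < delta
            and angle_deg(o, h, n) > theta
            and angle_deg(c, o, h) > psi)


def hbond_energy(n_h, h_over_rt=H_OVER_RT):
    """Square-well energy of n_h strong bonds, in units of RT (assumed sign)."""
    return -n_h * h_over_rt


# Hypothetical coordinates (angstroms): a perfectly linear C=O...H-N arrangement.
c, o, h, n = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0), (3.23, 0.0, 0.0), (4.23, 0.0, 0.0)
print(is_strong_hbond(c, o, h, n))
```

Because the potential is a square well, a pair either contributes the full strength H or nothing at all; bending the approach angle ∠COH to 90° in the example above would switch the bond off even though the O···H distance is unchanged.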

Our model force field does not include other energy terms such as electrostatic and hydrophobic interactions and other interactions with solvent. In particular, we reduced hydrogen bond interactions to square-well


A.A. PODTELEZHNIKOV ET AL.

potentials. Therefore, our model parameters only describe the hydrogen bonding in terms of its average strength and the cutoff geometry for relatively strong backbone hydrogen bonds under these assumptions [Fig. 1(B)].

Contrastive Divergence Learning

The energy given by Eq. (3) defines the probability of a particular conformation X via the Boltzmann distribution:

$$P(X \mid \theta) = \frac{1}{Z(\theta)} \exp[-E(X; \theta)], \qquad Z(\theta) = \int dX \, \exp[-E(X; \theta)] \qquad (4)$$

where Z(θ) is the partition function. Here, the energy is expressed in units of RT, the product of the molar gas constant and absolute temperature. Assuming that $X_0$ is a native conformation with energy near the minimum, the inverse problem of estimating the values of the parameters, θ, can be solved by maximum likelihood (ML) optimization with the gradient ascent method:28

$$\theta^{(i+1)} \leftarrow \theta^{(i)} + \eta \, \frac{\partial \ln P(X_0 \mid \theta)}{\partial \theta} \qquad (5)$$

where η is a positive learning rate, which needs to be small enough for the algorithm to converge. Taking the derivative of the log-likelihood from Eq. (4):

$$\begin{aligned}
\frac{\partial \ln P(X_0 \mid \theta)}{\partial \theta}
&= -\frac{1}{Z(\theta)} \frac{\partial Z(\theta)}{\partial \theta} - \frac{\partial E(X_0; \theta)}{\partial \theta} \\
&= \frac{1}{Z(\theta)} \int dX \, \exp[-E(X; \theta)] \, \frac{\partial E(X; \theta)}{\partial \theta} - \frac{\partial E(X_0; \theta)}{\partial \theta} \\
&= \left\langle \frac{\partial E(X; \theta)}{\partial \theta} \right\rangle - \frac{\partial E(X_0; \theta)}{\partial \theta}
\end{aligned} \qquad (6)$$

where angular brackets denote the expectation value under the distribution P(X|θ). While the second term in this equation can be easily computed for a given conformation, $X_0$, the evaluation of the expectation value in the first term requires extensive Monte Carlo sampling from the canonical distribution of conformations.

As an alternative to the gradient evaluation according to Eq. (6), Hinton17 proposed an approximate ML algorithm called contrastive divergence (CD) learning. Instead of extensively sampling conformations, the CD method lets the system evolve for a very small number of steps, K, from the initial conformation, $X_0$, to a conformation $X_K$, using, for example, a Metropolis procedure. The gradient is then estimated as

$$\frac{\partial \ln P(X_0 \mid \theta)}{\partial \theta} \approx \frac{\partial E(X_K; \theta)}{\partial \theta} - \frac{\partial E(X_0; \theta)}{\partial \theta} \qquad (7)$$
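The update of Eqs. (5) and (7) can be illustrated on a toy problem. The sketch below is not the polypeptide model of this work: it assumes a hypothetical system of independent two-state sites ("bond formed" or "bond broken") with energy E = −Hn in units of RT, where n is the number of formed bonds. Each CD iteration starts from the data, applies one Metropolis sweep, and moves H along the estimated gradient n(X0) − n(XK). The function names, learning rate, and number of iterations are illustrative assumptions.

```python
import math
import random


def metropolis_sweep(states, h, rng):
    """One Metropolis sweep over independent two-state sites.

    Toy energy in units of RT: E = -h * (number of sites in state 1).
    """
    new_states = list(states)
    for i, s in enumerate(new_states):
        delta_e = h if s == 1 else -h  # energy change if site i is flipped
        if delta_e <= 0 or rng.random() < math.exp(-delta_e):
            new_states[i] = 1 - s
    return new_states


def learn_strength(data, h_init=0.5, eta=0.5, n_iter=300, seed=0):
    """Estimate h by contrastive divergence with one sweep per iteration."""
    rng = random.Random(seed)
    h = h_init
    n0 = sum(data)
    history = []
    for _ in range(n_iter):
        perturbed = metropolis_sweep(data, h, rng)  # always restart from the data
        nk = sum(perturbed)
        # Eq. (7) with dE/dh = -n gives the gradient n(X0) - n(XK), here per site.
        grad = (n0 - nk) / len(data)
        h += eta * grad
        history.append(h)
    tail = history[-100:]  # average the late, converged part of the curve
    return sum(tail) / len(tail)


data = [1] * 800 + [0] * 200  # hypothetical data: 80% of the sites are bonded
print(round(learn_strength(data), 2), round(math.log(4), 2))
```

For this toy model the maximum-likelihood value is known in closed form, H = ln[p/(1 − p)] with p the fraction of formed bonds in the data, which gives an independent check on the learned value.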

In the case of hydrogen bond strength H, the meaning of the negative partial derivative, −∂E(X, H)/∂H, is easy to grasp, since it corresponds to the total number of hydrogen bonds in the structure [see Eq. (2)]. If, as a result of the short system evolution, the number of hydrogen bonds decreased, then the hydrogen bond strength was too low and would need to be increased. We do not need to determine the average number of hydrogen bonds to come to this conclusion. The hydrogen bond cutoff parameters were updated according to changes in the corresponding partial derivatives, which were also evaluated numerically, as products of hydrogen bond strength and the observed density of states near the boundaries.

To let the system evolve for K steps starting from the fixed initial conformation $X_0$, as required on each CD learning iteration, we used a high-performance Metropolis sampler. The procedure utilizes local crankshaft and pivotal rotations of several adjacent peptide bonds on each step.20 In our Metropolis procedure, the root-mean-square step (an average change in a peptide bond orientation) was about 4.5°, with approximately half of the trial moves accepted. We chose to perform K = 4096 Metropolis steps to perturb the conformation of each protein on each iteration of CD learning. This ensures that in a 100-residue long protein each peptide bond orientation was acceptably perturbed about 80 times. In such a protein, the root-mean-square deviation between the initial and perturbed Cα positions was about 0.8 Å.

A single native protein conformation, $X_0$, is hardly representative of the natural variety of protein structures. Given a large set of proteins, CD learning can be done by maximizing the product of individual likelihoods. It can be shown that, in this case, the energy gradients in Eq. (7) should be substituted with their averages over the entire training set of proteins.17

Dataset

To prepare a training set of proteins of known structure, we used ASTRAL 1.69.19 We initially downloaded the 945 highest SPACI scoring PDB-style structures that represent different folds according to SCOP classification.18 SPACI (Summary PDB Astral Check Index) is an approximate measure of structure quality, which incorporates resolution, R-factor, and stereochemical checks.19 A large portion of these structures was eliminated from this dataset. We kept only representatives of the α, β, α/β, and α+β classes. We dropped all structures with gaps or cis-residues, because our model software cannot presently treat such structures correctly. Finally, we removed the structures with SPACI scores of less than 0.4 and NMR structures. This left us with 247 high-quality diverse X-ray structures. The structures in the training set are listed in Table I, along with the corresponding number of residues and the SPACI score.

Our polypeptide model featured absolutely planar peptide bonds and equidistant α-carbons. The backbone conformations provided by PDB entries are not always ideal in this sense. While parsing the PDB entries, we closely followed the α-carbon trace and peptide bond orientations. The α-carbon positions and the peptide bond ori-


TABLE I. Each Structure in the Dataset Is Described by Its SCOP Fold Class, the Number of Amino Acids (N), the Number of Strong Interpeptide Hydrogen Bonds (Nh), the Fraction of Strong Hydrogen Bonds (Nh/N), SPACI Structure Quality Score, RMSE of Our Modeling, and the Primary Refinement Program

ID   Class   N   Nh   Nh/N   SPACI   RMSE   Prog.

d1a1x_ d1a3aa_ d1a6m_ d1a6q_2 d1aa7a_ d1af7_1 d1aie_ d1ai1_ d1ako_ d1axn_ d1b25a2 d1b3aa_ d1b67a_ d1bd8_ d1bgf_ d1bkra_ d1bm8_ d1boua_ d1bxya_ d1c1da2 d1c1ka_ d1c75a_ d1cipa1 d1cq3a_ d1csei_ d1cy5a_ d1d4oa_ d1d8ia_ d1dcia_ d1dd3a1 d1dfup_ d1di8a_ d1d15a2 d1duvg1 d1dvoa d1dzfa2 d1e19a_ d1e58a_ d1ed1a_ d1e16a_ dlew4a_ dleyqa_ dlez3a_ dlf3ua_ dlf7ta_ dlf86a_ dlfid_ dlfk5a_ dlflma_ dlfn9a_ dlfs1a1 d1fs1b1 d1fw9a_ dlg2ra_ dlg6la_ dlg66a_ dlg6ga_ dlg7sa3 d1g8ea_

β α+β α α+β α α α α α+β α α+β α+β α α+β α α α+β α α+β α/β α α α β α+β α α/β α+β α/β α β α α+β α/β α α+β α/β α/β α α+β α+β α+β α β α+β β α+β α β α+β α α α+β α+β α+β α/β β α/β α

106 145 151 295 158 81 31 70 268 323 210 67 68 156 124 108 99 132 60 148 217 71 121 224 63 92 177 288 275 57 94 79 104 150 152 72 313 247 114 208 106 212 124 118 119 115 258 93 122 365 41 55 164 94 225 207 127 131 98

25 32 65 76 42 17 12 26 76 100 55 13 22 39 36 32 27 22 14 40 65 0 27 54 19 28 49 92 73 17 23 14 22 46 39 16 101 67 21 36 44 60 44 21 33 34 42 24 30 86 5 15 41 29 61 49 34 36 19

0.24 0.22 0.43 0.26 0.27 0.21 0.39 0.37 0.28 0.31 0.26 0.19 0.32 0.25 0.29 0.30 0.27 0.17 0.23 0.27 0.30 0.00 0.22 0.24 0.30 0.30 0.28 0.32 0.27 0.30 0.24 0.18 0.21 0.31 0.26 0.22 0.32 0.27 0.18 0.17 0.42 0.28 0.35 0.18 0.28 0.30 0.16 0.26 0.25 0.24 0.21 0.27 0.25 0.31 0.27 0.24 0.27 0.27 0.19

0.43 0.51 1.04 0.41 0.42 0.41 0.62 0.52 0.56 0.51 0.50 0.60 0.63 0.51 0.66 0.91 0.52 0.42 0.45 0.61 0.61 1.06 0.59 0.45 0.79 0.77 0.66 0.40 0.63 0.41 0.49 0.42 0.52 0.53 0.47 0.43 0.64 0.85 0.41 0.44 0.69 0.44 0.47 0.48 0.49 0.92 0.44 0.77 0.77 0.52 0.48 0.48 0.68 0.73 0.74 1.16 0.54 0.41 0.46

0.023 0.021 0.028 0.032 0.023 0.022 0.021 0.022 0.020 0.030 0.026 0.028 0.022 0.024 0.027 0.028 0.032 0.013 0.062 0.046 0.013 0.029 0.025 0.026 0.050 0.020 0.028 0.061 0.021 0.017 0.019 0.025 0.020 0.029 0.023 0.024 0.031 0.033 0.023 0.016 0.016 0.028 0.013 0.020 0.021 0.044 0.026 0.025 0.034 0.021 0.029 0.022 0.034 0.029 0.026 0.037 0.021 0.016 0.028

X-PLOR X-PLOR SHELXL X-PLOR X-PLOR ARP X-PLOR REFMAC X-PLOR PROLSQ X-PLOR SHELXL SHELXL REFMAC X-PLOR SHELXL X-PLOR X-PLOR X-PLOR TNT X-PLOR SHELXL X-PLOR CNS EREF SHELXL SHELXL REFMAC REFMAC CNS CNS CNS CNS CNS CNS X-PLOR REFMAC SHELXL X-PLOR CNS CNS CNS CNS CNS CNS SHELXL X-PLOR SHELXL SHELXL CNS CNS CNS CNS REFMAC SHELXL SHELXL REFMAC CNS CNS


TABLE I. (Continued)

ID   Class   N   Nh   Nh/N   SPACI   RMSE   Prog.

d1g8qa_ d1g8ta_ d1g9za_ d1gk8i_ d1gmua1 d1goia1 d1gp0a_ d1gpqa_ d1gw1ya_ d1gxja_ d1h09a1 d1h2ca_ d1h4ax1 d1h6wa1 d1h99a1 d1hufa_ d1hw1a2 d1i27a_ d1i2ta_ d1i4ja_ d1i4ma_ d1i6pa_ d1ig0a2 d1ihra_ d1io0a_ d1iq4a_ d1itxa2 d1iw0a_ d1ixh_ d1j09a1 d1j0pa_ d1jf8a_ d1jh6a_ d1jida_ d1jnra1 d1jo0a_ d1jo8a_ d1josa_ d1jr2a_ d1jsda_ d1jyha_ d1k0ra4 d1k20a_ d1k3xa1 d1k3xa2 d1k3ya1 d1k6ka_ d1k8ke_ d1kafa_ d1keka4 d1khda1 d1kid_ d1kjqa1 d1kkoa2 d1knma_ d1kpta_ d1kq1a_ d1kqpa_ d1kyfa2 d115oa_

α α+β α+β α+β β β β α+β β α+β β β β β α α+β α α α α+β α+β α/β α/β α+β α/β α+β α+β α α/β α α α/β α+β α+β α α+β β α+β α/β β α+β α+β α/β α β α α α α+β α/β α α/β β α+β β α+β β α/β α+β α/β

90 241 152 125 70 52 133 127 175 161 149 124 85 82 115 123 152 73 61 110 108 214 221 73 166 179 72 207 321 163 108 130 181 114 141 97 58 100 260 317 155 104 310 89 124 142 142 173 108 253 69 193 74 159 129 105 60 271 114 346

29 41 31 29 20 6 32 28 45 32 27 34 19 6 25 29 57 18 26 28 24 55 49 11 44 54 2 66 85 33 15 21 51 19 30 36 17 11 61 66 50 1 76 24 34 39 39 32 33 52 22 55 20 35 24 28 25 89 32 79

0.32 0.17 0.20 0.23 0.29 0.12 0.24 0.22 0.26 0.20 0.18 0.27 0.22 0.07 0.22 0.24 0.38 0.25 0.43 0.25 0.22 0.26 0.22 0.15 0.27 0.30 0.03 0.32 0.26 0.20 0.14 0.16 0.28 0.17 0.21 0.37 0.29 0.11 0.23 0.21 0.32 0.01 0.25 0.27 0.27 0.27 0.27 0.18 0.31 0.21 0.32 0.28 0.27 0.22 0.19 0.27 0.42 0.33 0.28 0.23

0.56 0.92 0.48 0.71 0.59 0.67 0.62 0.58 0.49 0.41 0.41 0.60 0.85 0.44 0.57 0.41 0.64 1.00 0.98 0.49 0.44 0.46 0.47 0.64 0.63 0.47 0.93 0.72 1.05 0.50 1.13 0.85 0.50 0.51 0.59 0.73 0.78 0.51 0.48 0.48 0.50 0.47 0.64 0.79 0.79 0.78 0.50 0.40 0.55 0.46 0.46 0.55 0.88 0.74 0.86 0.53 0.55 1.03 0.81 0.61

0.031 0.023 0.017 0.020 0.025 0.047 0.028 0.025 0.027 0.014 0.017 0.020 0.025 0.044 0.010 0.029 0.010 0.022 0.015 0.021 0.024 0.013 0.013 0.036 0.011 0.027 0.028 0.019 0.034 0.009 0.032 0.030 0.033 0.036 0.016 0.035 0.034 0.38 0.044 0.019 0.018 0.007 0.014 0.022 0.026 0.024 0.014 0.022 0.017 0.023 0.011 0.025 0.037 0.031 0.017 0.027 0.036 0.023 0.032 0.019

REFMAC REFMAC X-PLOR REFMAC CNS SHELXH CNS CNS CNS CNS CNS REFMAC REFMAC REFMAC CNS CNS CNS SHELXL SHELXL CNS CNS CNS CNS SHELXL X-PLOR CNS REFMAC REFMAC SHELXL CNS SHELXL CNS CNS CNS CNS SHELXL REFMAC CNS REFMAC CNS CNS CNS CNS SHELXL SHELXL SHELXL CNS REFMAC CNS X-PLOR CNS REFMAC TNT REFMAC REFMAC X-PLOR CNS SHELXL SHELXL CNS


TABLE I. (Continued)

ID   Class   N   Nh   Nh/N   SPACI   RMSE   Prog.

d1191a_ d11bu_1 d11c0a2 d11kka_ d11ria_ d11yva_ d1m15a1 d1m44a_ d1m7ja1 d1mc2a_ d1me4a_ d1mixa1 d1mk0a_ d1moga_ d1mqoa_ d1mvfd_ d1n5ua1 d1n62a1 d1n62b1 d1n62c1 d1n62c2 d1n81a_ d1n8va_ d1nbua_ d1ng6a_ d1nh2b_ d1nkd_ d1nkpb_ d1nm8a1 d1nox_ d1nppa1 d1nqua_ d1nrza_ d1nwwa_ d1o0wa1 d1o26a_ d1o7qa_ d1oa8a_ d1oaia_ d1ocya_ d1od3a_ d1ok7a1 d1on2a2 d1oo0a_ d1or7c_ d1os1a2 d1ospo_ d1ou8a_ d1ovnal d1ow1a2 d1ax0al d1p57a d1p60a_ d1p9ya_ d1pcfa_ d1pda_2 d1pdo_ d1pfval d1pm4a_ d1ptf_

α α α+β α+β α α/β α α+β β α α+β α α+β α+β α+β β α α α+β α+β α+β α α α+β α α α α α/β α+β β α/α α/β α+β α α+β α/β β α α+β β α+β α α+β α α/β β β α α/β α/β α+β α/β α+β α+β α+β α/β α β α+β

74 83 118 105 98 293 94 177 55 122 215 114 97 67 221 44 195 82 141 109 117 186 101 118 148 46 59 83 377 200 81 154 163 145 169 219 287 128 59 198 131 122 74 144 66 224 251 106 107 202 256 110 156 117 66 88 129 162 117 87

27 12 28 24 27 79 21 42 13 33 40 30 23 22 60 7 69 26 24 33 42 53 31 29 49 16 35 25 86 40 25 46 49 45 62 67 64 36 22 15 38 36 14 52 23 44 66 26 33 50 70 26 48 31 22 31 36 39 36 32

0.36 0.14 0.24 0.23 0.28 0.28 0.22 0.24 0.24 0.27 0.19 0.26 0.24 0.33 0.27 0.16 0.35 0.32 0.17 0.30 0.24 0.28 0.31 0.25 0.33 0.35 0.59 0.30 0.23 0.20 0.31 0.30 0.30 0.31 0.37 0.31 0.22 0.28 0.37 0.08 0.29 0.30 0.19 0.36 0.35 0.20 0.26 0.25 0.31 0.25 0.27 0.24 0.31 0.26 0.33 0.35 0.28 0.24 0.31 0.37

1.10 0.50 0.75 1.00 0.67 0.75 0.85 0.59 0.65 1.23 0.88 0.52 0.59 0.52 0.66 0.53 0.42 0.92 0.92 0.92 0.92 0.42 0.71 0.56 0.68 0.47 0.90 0.52 0.56 0.60 0.40 0.58 0.50 0.84 0.46 0.58 0.80 0.50 1.03 0.66 1.01 0.54 0.57 0.45 0.45 0.49 0.42 0.57 0.43 0.51 0.78 0.52 0.91 0.41 0.54 0.55 0.55 0.55 0.52 0.61

0.032 0.034 0.018 0.020 0.023 0.026 0.019 0.031 0.024 0.033 0.048 0.013 0.016 0.036 0.027 0.022 0.014 0.057 0.073 0.083 0.067 0.023 0.021 0.034 0.011 0.014 0.031 0.011 0.026 0.029 0.031 0.024 0.021 0.035 0.022 0.017 0.029 0.042 0.018 0.026 0.029 0.030 0.011 0.038 0.023 0.021 0.021 0.018 0.048 0.016 0.033 0.046 0.34 0.021 0.022 0.062 0.028 0.013 0.050 0.032

SHELXL PROLSQ CNS SHELXL SHELXL SHELXL SHELXL CNS CNS SHELXL SHELXL CNS CNS CNS CNS CNS CNX REFMAC REFMAC REFMAC REFMAC CNS REFMAC SHELXL CNS CNS SHELXL CNS CNS REFMAC CNS REFMAC REFMAC REFMAC CNS CNS SHELXL REFMAC REFMAC REFMAc REFMAC CNS CNS CNS REFMAC X-PLOR X-PLOR CNS REFMAC CNS REFMAC CNS SHELXL CNS REFMAC X-PLOR X-PLOR CNS REFMAC X-PLOR


TABLE I. (Continued)

ID   Class   N   Nh   Nh/N   SPACI   RMSE   Prog.

d1puc_ d1pz4a_ d1q5za_ d1q9ia3 d1qhva_ d1qwza_ d1r0va3 d1r29a_ d1r2ma_ d1r89a1 d1r89a2 d1rj1a_ d1r1ha_ d1rv9a_ d1rwha2 d1rwha3 d1ry9a_ d1s95a_ d1sdia_ d1seda_ d1seia_ d1sfda_ d1sy1a_ d1szha_ d1t15al d1t1ja_ d1t2da2 d1t3ta3 d1t7ra_ d1t8ka_ d1t95a2 d1tbfa_ d1tfe_ d1tjya_ d1tuaa1 d1tvfa1 d1tx4a_ d1u55a_ d1u94a2 d1uc2a_ d1ucda_ d1udxal d1udxa3 d1ufya_ d1umwa1 d1uowa_ d1uq5a_ dlutg_ d1uyla d1v33a_ d1v54h_ d1v74a_ d1vcc_ d1vcla3 d1vf6a_ d1vh5a_ d1vioa2_ d1vk5a_ d1vkia_ d1vkka_

α+β α+β α α+β β β α+β α+β β α α+β α α+β α+β β β α+β α+β α α α+β β β α α/β α/β α+β α+β α α α+β α α+β α/β α+β β α α+β α+β α+β α+β β α+β α+β α+β β α+β α α+β α+β α α+β α+β α+β α α+β α+β α α+β α+β

101 113 145 146 195 235 75 122 70 115 142 148 151 242 113 272 133 324 213 112 130 105 184 147 109 119 165 152 250 77 76 326 142 316 84 68 196 188 60 480 190 156 76 121 128 156 263 70 207 346 79 107 77 149 58 138 58 121 165 137

7 24 35 31 43 56 18 41 3 30 27 51 40 59 30 59 42 71 61 35 27 23 40 41 27 37 45 36 68 24 23 106 39 105 21 13 47 55 22 112 49 48 21 22 33 38 66 21 54 89 15 24 13 46 22 53 16 36 37 49

0.07 0.21 0.24 0.21 0.22 0.24 0.24 0.34 0.04 0.26 0.19 0.34 0.26 0.24 0.27 0.22 0.32 0.22 0.29 0.31 0.21 0.22 0.22 0.28 0.25 0.31 0.27 0.24 0.27 0.31 0.30 0.33 0.27 0.33 0.25 0.19 0.24 0.29 0.37 0.23 0.26 0.31 0.28 0.18 0.26 0.24 0.25 0.30 0.26 0.26 0.19 0.22 0.17 0.31 0.38 0.38 0.28 0.30 0.22 0.36

0.44 0.64 0.52 0.60 0.69 0.53 0.45 0.79 1.00 0.49 0.49 0.51 0.52 0.59 0.83 0.83 0.49 0.59 0.62 0.47 0.42 1.02 1.01 0.62 0.45 0.53 0.77 0.47 0.69 0.93 0.45 0.78 0.52 0.77 0.62 0.47 0.56 0.49 0.49 0.43 0.71 0.40 0.40 0.93 0.42 0.93 0.68 0.65 0.64 0.48 0.50 0.46 0.55 0.55 0.41 0.71 0.58 0.64 0.62 0.73

0.025 0.041 0.013 0.022 0.021 0.016 0.056 0.018 0.021 0.023 0.030 0.020 0.021 0.032 0.044 0.034 0.018 0.039 0.041 0.017 0.037 0.026 0.033 0.018 0.024 0.024 0.018 0.017 0.011 0.018 0.039 0.032 0.025 0.021 0.019 0.046 0.023 0.016 0.012 0.027 0.030 0.020 0.013 0.037 0.021 0.037 0.046 0.021 0.030 0.021 0.024 0.016 0.032 0.027 0.014 0.027 0.013 0.025 0.028 0.048

TNT TNT CNS REFMAC REFMAC CNS CNS SHELXL REFMAC REFMAC REFMAC CNS CNS CNS REFMAC REFMAC REFMAC REFMAC REFMAC CNS X-PLOR SHELXL REFMAC CNS REFMAC CNS REFMAC CNS CNS REFMAC REFMAC REFMAC X-PLOR REFMAC CNS CNS REFMAC CNS CNS CNS CNS CNS CNS SHELXL REFMAC REFMAC REFMAC PROLSQ REFMAC CNS X-PLOR CNS X-PLOR REFMAC CNS REFMAC REFMAC REFMAC REFMAC REFMAC


TABLE I. (Continued)

ID        Class  N    Nh   Nh/N  SPACI  RMSE   Prog.
d1whi_    β      122  28   0.23  0.62   0.021  X-PLOR
d2cpl_    β      164  33   0.20  0.57   0.036  X-PLOR
d2end_    α      137  26   0.19  0.70   0.029  PROLSQ
d2igd_    α+β    61   22   0.36  0.97   0.028  SHELXL
d21isa_   α      131  41   0.31  0.78   0.022  SHELXL
d2pvba_   α      107  26   0.24  1.15   0.026  SHELXL
d3ezma_   β      101  19   0.19  0.61   0.022  CNS
d4ubpa_   α+β    100  25   0.25  0.64   0.025  REFMAC

entations were sufficient to reconstruct the positions of CO and NH groups in our model polypeptides.20 The inevitable inaccuracies of this conversion from PDB entries to our model coordinates were well below the resolution of the PDB structures: the average root-mean-square error (RMSE) in α-carbon positions was around 0.026 Å and never exceeded 0.083 Å. The RMSE for each structure in the set is listed in Table I and can be used as another measure of the structure quality and compatibility with our procedure.

On a 3 GHz Intel Pentium PC, the CD learning procedure on the set of 247 structures converged reasonably fast, within 24 h.

RESULTS

Hydrogen Bonds in the Dataset and Model

In this study, we used a set of 247 high-quality X-ray structures consisting of 74 α-proteins, 131 αβ-proteins, and 42 β-proteins. The number of residues in the structures ranged from 31 to 480 with an average of 145.8. The total number of residues in the set was 36,015. The structures in the set are listed in Table I. According to the PDB annotations, eight programs were used for refinement of the structures: 95 structures were refined with CNS,29 31 with X-PLOR,22 69 with REFMAC,30 42 with SHELXL,31 4 with PROLSQ,32 4 with TNT,33 1 with ARP,34 and 1 with EREF.35 The primary refinement method for each structure in the dataset is listed in Table I. For all the proteins in the dataset, the refinement methods optimized covalent bond lengths and angles. We note that none of the programs optimized hydrogen bond geometry in the structures besides pushing apart atoms that are too close, in some cases. Therefore, it is unlikely that the distribution of the hydrogen bonds in the dataset and, in turn, our estimation of the hydrogen bond geometry and strength from the dataset are significantly influenced by the different refinement methods that were used.

Instead of assaying the distribution of distances and angles between the hydrogen-bonded atoms in the available structures, we used the CD machine learning technique (see Methods section) to establish the geometric criteria for hydrogen bonding. In our model, we considered an interpeptide hydrogen bond to be formed only when certain geometric criteria on the mutual position and orientation of CO and NH groups were satisfied. In aqueous solution, many groups that did not form interpeptide hydrogen bonds would form them with water molecules. Therefore, in our model without explicit water, it seems appropriate to interpret the hydrogen bond strength as the energetic difference between interpeptide and peptide–water hydrogen bonds. Our model explicitly accounted for covalent and van der Waals interactions between the atoms in the immediate vicinity of hydrogen bond donors and acceptors. Therefore, significant interference between these interactions and our evaluation of hydrogen bond potential is unlikely. Our model polypeptide chain, however, lacked explicit side chains and interactions between them. It should be noted as a further limitation of this study that our estimates of the model hydrogen bonding potential may be influenced by side chain interactions.

Hydrogen Bond Geometry

Defining hydrogen bond geometry has been a challenging task since the early days of structural biology. The definition of hydrogen bonds between the CO group of one amino acid and the NH group of another amino acid [Fig. 1(A)] was the basis for the classic work on protein secondary structure by Pauling et al.1,2 Their geometric criteria for hydrogen bonds correspond to an r(O, H) of 1.92 Å and ∠OHN > 135°, emphasizing the colinearity of N–H···O. Baker and Hubbard4 in their extensive review established wider ranges: r(O, H) < 2.50 Å, ∠OHN > 120°, and ∠COH > 90°, stressing the importance of the hydrogen approach angle. A similar extensive set of criteria was employed by Stickle et al.6 and corresponded to r(O, H) < 2.50 Å, ∠OHN > 135°, and ∠COH > 90°, i.e., both the colinearity and the approach angle were taken into account. In recent studies, which explored the preferential directions of hydrogen bonds, the cutoff parameters corresponded to r(O, H) < 3.5 Å, ∠OHN > 110°, and ∠COH > 90°.11,36 Our geometric criteria for hydrogen bonding correspond to an approximating square-well potential [Fig. 1(B)].
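The published criteria quoted above can be compared side by side by applying each set to the same bond geometry. The sketch below encodes the cutoffs as quoted in this section, together with the values learned in this work for the whole dataset (Table II); Pauling's criteria are omitted because they do not include a ∠COH cutoff. The dictionary layout, the function name, and the example geometries are hypothetical.

```python
# Each entry: (max r(O,H) in angstroms, min angle OHN in degrees, min angle COH in degrees)
CRITERIA = {
    "Baker and Hubbard": (2.50, 120.0, 90.0),
    "Stickle et al.": (2.50, 135.0, 90.0),
    "Kortemme et al.": (3.5, 110.0, 90.0),
    "this work (all proteins)": (2.14, 151.9, 142.1),
}


def satisfied_criteria(r_oh, angle_ohn, angle_coh):
    """Return the names of all criteria sets met by one bond geometry."""
    return [name for name, (d_max, ohn_min, coh_min) in CRITERIA.items()
            if r_oh < d_max and angle_ohn > ohn_min and angle_coh > coh_min]


# Hypothetical geometries: (r(O,H), angle OHN, angle COH)
for geometry in [(2.0, 160.0, 150.0), (2.3, 140.0, 100.0), (3.0, 115.0, 95.0)]:
    print(geometry, satisfied_criteria(*geometry))
```

A short, nearly linear bond passes every set, whereas longer or more bent geometries pass only the more permissive literature criteria, which were designed to capture as many reasonable hydrogen bonds as possible.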
We conducted learning experiments on an entire set of 247 structures and separately on subsets of 74 aproteins, 42 b-proteins, and 131 mixed proteins. The learning curves for hydrogen bonding parameters are shown in Figure 2. The initial guess for the parameters was set arbitrarily in the range close to commonly accepted values. Five hundred steps of CD learning were performed in the space of four parameters, includ-

PROTEINS: Structure, Function, and Bioinformatics

DOI 10.1002/prot

596

A.A. PODTELEZHNIKOV ET AL.

TABLE II. Hydrogen Bond Strength and the Cutoff Parametersa H/RT

˚) d (A

Y

W

2.06 1.86 2.12 2.47 13.00

2.14 2.19 2.14 2.09 1.92 2.5 2.5 3.5

151.98 148.38 152.08 151.68 1358.0 1208.0 1358.0 1108.0

142.18 141.68 141.88 140.48

All a-Proteins ab-Proteins b-Proteins Pauling (1951)1,2 Baker and Hubbard (1984)4 Stickle et al. (1992)6 Kortemme et al. (2003)11 Schellman (1955)38 Myers and Pace (1996)39

908.0 908.0 908.0

2.50 2.00

a

The values of parameters as learned with contrastive divergence on the entire data set and its subsets are given. Selected values from the literature are also listed.

ing the strength of the hydrogen bond. The algorithm converged by the 100th step. The resulting values were estimated by averaging the last 400 steps and are collected in Table II. The resulting cutoff parameters are noticeably stricter than previous theoretical estimates. For the whole set of proteins, we concluded that all four atoms in the hydrogen bond N–H···O=C were almost collinear, with r(O,H) < 2.14 Å, ∠OHN > 151.9°, and ∠COH > 142.1°. It is important to emphasize that these criteria correspond to strong hydrogen bonds as illustrated in Figure 1(B), whereas the criteria employed by other authors were often designed to capture all hydrogen bonds, regardless of their strength.* Even more importantly, these values were largely independent of the chosen subsets of protein structures containing only α-proteins, only β-proteins, and only mixed proteins. This observation validated our model assumption that the interpeptide hydrogen bonding potential did not depend on the secondary structure adopted by the polypeptide backbone.

Hydrogen Bond Strength

The strength of the hydrogen bond is a subject of ongoing discussion in the literature (see recent review9). Pauling et al.1 suggested that the strength of the hydrogen bond is about 8 kcal/mol. Some experimental evidence suggests that the strength is about 1.5 kcal/mol.38,39 Others suggest that hydrogen bonding has a negligible or even a destabilizing effect.10 At present, the consensus is that the strength of the hydrogen bond is in the range of 1–2 kcal/mol.9 The hydrogen bond strength, or more exactly the dimensionless quantity H/RT, was learned by CD along with the geometric parameters (see Fig. 2). The initial guess for the strength was arbitrarily set based on previous theoretical estimates. The procedure normally converged in 100 steps. The averaged estimates for the hydrogen bond strength are presented in Table II. The hydrogen bond strength H/RT found by CD ranged from 1.86 to 2.47. Slightly different values were found in CD learning for different subsets of proteins. Notably, β-proteins appeared to require higher values of hydrogen bond strength than α-proteins. It is well known that β-sheets feature strong interactions between lateral neighbors.40,41 Since our model lacked these detailed amino acid residues, stronger hydrogen bonds in β-sheets might be required to compensate for the missing interactions. Assuming that the crystal structures in our training set correspond to snapshots taken at room temperature (RT = 0.6 kcal/mol), the hydrogen bond strength determined for the entire set of structures (H = 1.2 kcal/mol) agrees very well with the net stabilization of a protein by the formation of a single hydrogen bond, which lies in the range of 1–2 kcal/mol (see review9).
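The geometric criteria and the unit conversion described above can be illustrated with a minimal Python sketch. It is not the original implementation. It assumes that the angle OHN is measured at H and the angle COH at O, that a bond satisfying all three cutoffs contributes a constant energy of -H while any other geometry contributes zero (a square well), and that RT = 0.6 kcal/mol. The function names and the idea of passing atoms as coordinate tuples are hypothetical.

```python
import math

# Cutoffs reported in the text for the whole training set.
MAX_OH_DISTANCE = 2.14   # r(O,H), angstroms
MIN_OHN_ANGLE = 151.9    # angle O-H-N measured at H, degrees (assumed convention)
MIN_COH_ANGLE = 142.1    # angle C-O-H measured at O, degrees (assumed convention)
RT_KCAL_PER_MOL = 0.6    # room-temperature value used in the text
H_KCAL_PER_MOL = 1.2     # strength reported for the entire set


def angle_deg(a, vertex, b):
    """Angle a-vertex-b in degrees for 3D points given as (x, y, z) tuples."""
    u = [ai - vi for ai, vi in zip(a, vertex)]
    v = [bi - vi for bi, vi in zip(b, vertex)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def is_strong_hbond(n, h, o, c):
    """True if the N-H...O=C geometry satisfies all three cutoffs."""
    return (math.dist(o, h) < MAX_OH_DISTANCE
            and angle_deg(o, h, n) > MIN_OHN_ANGLE
            and angle_deg(c, o, h) > MIN_COH_ANGLE)


def square_well_energy(n, h, o, c, strength=H_KCAL_PER_MOL):
    """Square-well sketch: -strength inside the well, zero outside (kcal/mol)."""
    return -strength if is_strong_hbond(n, h, o, c) else 0.0


def h_over_rt_to_kcal(h_over_rt, rt=RT_KCAL_PER_MOL):
    """Convert the dimensionless strength H/RT to kcal/mol."""
    return h_over_rt * rt


if __name__ == "__main__":
    # Range of H/RT reported in the text, converted and rounded to two decimals.
    print(round(h_over_rt_to_kcal(1.86), 2), round(h_over_rt_to_kcal(2.47), 2))
```

Applied to the reported range of H/RT (1.86 to 2.47), the conversion gives roughly 1.1 to 1.5 kcal/mol, which is the interval quoted in the abstract.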

Hydrogen Bond Saturation

Using the geometric criteria of hydrogen bond formation determined for the entire training set, we calculated the number of hydrogen bonds satisfied in each protein of the dataset (see Table I). The maximum theoretically possible number of hydrogen bonds is approximately equal to the number of available hydrogen bond donors or acceptors (or the number of residues in the polypeptide chain). Figure 3 demonstrates proportionality between the number of residues, N, and the number of observed hydrogen bonds, N_h. The ratio of the two numbers, or the fraction of satisfied hydrogen bonds, is listed in Table I for each individual structure in the dataset. The average observed ratio was p = 0.258 with a standard error of 0.005. Figure 3 also shows that this observation was consistent for the subsets of structures representing different protein classes. It is important to note that this is the fraction of strong hydrogen bonds, as a consequence of the square-well approximation for the hydrogen bonding potential used in this work [Fig. 1(B)]. This finding does not contradict previous estimates of a 90% overall fraction of hydrogen bonds,7 because the latter includes all hydrogen bonds, regardless of their strength.

We further tested the hypothesis that the distribution of hydrogen bonds in our dataset corresponds to a thermodynamic equilibrium at a given temperature. The homogeneous equilibrium distribution must be described by a single value of p, which is related to the equilibrium constant. Figure 3 demonstrates significant variance of the observed values of N_h around their expectation values, Np. Assuming that the distribution of N_h is described by a binomial distribution, its variance can be expressed as follows:

$$\mathrm{var}(N_h) = f N p (1 - p) \qquad (8)$$

The coefficient f = 1 corresponds to independent Bernoulli trials, while f > 1 corresponds to correlated trials.42–45 We determined that f = 3 provides the best fit for the observed variance of N_h. Figure 3 shows the 95% confidence interval according to Eq. (8) with f = 3. In other words, the observed variance corresponds to 3 hydrogen bonds formed in a concerted fashion (see Appendix). The concerted formation of hydrogen bonds in peptides during the helix-coil transition is well known and has been extensively studied both theoretically and experimentally.20,27,46,47 To summarize, the distribution of N_h indeed appears to be a homogeneous binomial distribution with p = 0.258 and f = 3. This successful parameterization of the distribution by a single value of p supports the view that the training set of conformations corresponds to an equilibrium distribution of protein conformations at a given temperature. This is an important requirement for applicability of the CD procedure used in this work (see Methods section).15 It is worth adding that we did not detect any noticeable correlation between the structure quality measures, such as SPACI or RMSE, and the fraction of hydrogen bonds in our training set (see Table I). This is another indication that the observed hydrogen bonding reflects the equilibrium distribution rather than the quality of each structure in the dataset.

*It should be noted that our definition of "strong" hydrogen bonds in this context differs from that of some authors, such as Perrin and Nielson,37 who use it to refer to hydrogen bonds characterized by short distances and with a strength of  10 kcal/mol.

Fig. 2. Learning hydrogen bond parameters with contrastive divergence. The four panels in the figure represent the training sequences on the full set of 247 proteins as well as subsets of α-, β-, and αβ-proteins. Noisy curves correspond to the four parameters of hydrogen bonds in our model: the strength H/RT in black and the geometrical cutoff parameters d in red, cosY in green, and cosW in blue.

Fig. 3. The number of satisfied hydrogen bonds in the structures vs. the number of residues. α-proteins are in black, β-proteins are in red, α/β-proteins are in blue, and α + β-proteins are in green. The slope of the dashed line represents an average fraction of satisfied hydrogen bonds, p = 0.258. The dotted lines show the 95% confidence interval for the observed number of hydrogen bonds assuming a binomial distribution according to Eq. (8) with f = 3.
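The 95% band of Figure 3 can be approximated directly from Eq. (8). The sketch below is an illustration under stated assumptions rather than the procedure used in this work: it applies a normal approximation (z = 1.96) to obtain the interval, and `estimate_f` is one simple moment-based way to fit f from (N, N_h) pairs, which may differ from how f = 3 was actually determined. All function names are hypothetical.

```python
import math

P_STRONG = 0.258   # average fraction of strong hydrogen bonds reported in the text
F_CONCERTED = 3    # best-fit coefficient f reported in the text


def hbond_count_interval(n_residues, p=P_STRONG, f=F_CONCERTED, z=1.96):
    """Return (low, mean, high) for N_h using var(N_h) = f*N*p*(1-p), Eq. (8).

    The interval uses a normal approximation, which is an assumption here.
    """
    mean = n_residues * p
    sd = math.sqrt(f * n_residues * p * (1.0 - p))
    return mean - z * sd, mean, mean + z * sd


def estimate_f(counts, p=P_STRONG):
    """Moment-based estimate of f from an iterable of (N, N_h) pairs.

    Averages the squared deviation of N_h from N*p, scaled by the
    variance N*p*(1-p) expected for independent trials.
    """
    ratios = [(nh - n * p) ** 2 / (n * p * (1.0 - p)) for n, nh in counts]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    low, mean, high = hbond_count_interval(100)
    print(f"N = 100: expected {mean:.1f} bonds, 95% band {low:.1f} to {high:.1f}")
```

For a chain of 100 residues this gives an expected 25.8 strong hydrogen bonds with a band of roughly 11 to 41, which conveys how wide the scatter becomes when bonds form in concerted groups of three.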


DISCUSSION

Our results demonstrate that the CD method is an effective technique for estimating the parameters of continuous potential functions, which could well be applied to other biological problems in which relative free energies need to be estimated from equilibrium samples. Despite the simplifying assumptions of the force field used in this work, we report good agreement between our CD learning experiments and traditional perceptions of the geometry and the strength of the hydrogen bond. The estimated strength of interpeptide backbone hydrogen bonds of 1.2 kcal/mol is within the range of 1–2 kcal/mol obtained from experimental evidence.9 We also find that all four atoms seem to be approximately collinear when they form a strong hydrogen bond in this model system, in contrast to the established view that only N–H···O are collinear. We determined that the same hydrogen bonding potential works reasonably well in both α-helices and β-sheets, validating the assumption that interpeptide hydrogen bonding is independent of secondary structure. Other interactions omitted from our analysis may compensate for the small dissimilarities and stability differences between secondary structure elements. The fraction of satisfied hydrogen bonds found in earlier PDB surveys was estimated to be 90%.7 The hydrogen bonding criteria used in those studies were designed to capture as many reasonable hydrogen bonds as possible. We had a different goal in this study. We tried to connect the hydrogen bond geometry with its strength, using a convenient square-well approximation for the hydrogen bonding potential. As a result, we determined the criteria for strong backbone hydrogen bonds and concluded that about a quarter of all backbone hydrogens form these strong hydrogen bonds.

Although our model lacked explicit water and protein–water hydrogen bonds, the perturbations of protein structure in CD learning were rather small and unlikely to change the environment around obviously exposed peptide bonds. Our study focused only on interpeptide hydrogen bonds. The CD perturbations were only sufficient to break almost broken bonds and to form almost formed bonds, effectively probing the static distribution provided by the PDB structures. It appears that high-quality diverse X-ray structures reasonably represent an equilibrium distribution of interpeptide hydrogen bonds. To obtain the strength of the hydrogen bond, we assumed that the distribution corresponds to room temperature. The evaluation of hydrogen bond potentials from the frequencies of different hydrogen bond geometries (e.g., Kortemme et al.11) is based on an assumption of a uniform distribution of peptide bond orientations. This is, strictly speaking, not the case for a connected polypeptide chain, where the orientations of consecutive peptide bonds are correlated, especially within secondary structure elements. The CD learning method presented here is free of this assumption and allows direct evaluation of the interaction parameters for a known physical system. For the first time, to our knowledge, we describe a technique for the simultaneous optimization of hydrogen bond geometry and strength. Such simultaneous optimization is important because the hydrogen bond geometry cutoffs may be related to the loss of polypeptide entropy upon formation of hydrogen bonds.

REFERENCES

1. Pauling L, Corey RB, Branson R. The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain. Proc Natl Acad Sci USA 1951;37:205–211.
2. Pauling L, Corey RB. The pleated sheet, a new layer configuration of polypeptide chains. Proc Natl Acad Sci USA 1951;37:251–256.
3. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000;28:235–242.
4. Baker EN, Hubbard RE. Hydrogen bonding in globular proteins. Prog Biophys Mol Biol 1984;44:97–179.
5. Savage HJ, Elliott CJ, Freeman CM, Finney JM. Lost hydrogen bonds and buried surface area: rationalising stability in globular proteins. J Chem Soc Faraday Trans 1993;89:2609–2617.
6. Stickle DF, Presta LG, Dill KA, Rose GD. Hydrogen bonding in globular proteins. J Mol Biol 1992;226:1143–1159.
7. McDonald IK, Thornton JM. Satisfying hydrogen bonding potential in proteins. J Mol Biol 1994;238:777–793.
8. Rose GD, Wolfenden R. Hydrogen bonding, hydrophobicity, packing, and protein folding. Annu Rev Biophys Biomol Struct 1993;22:381–415.
9. Fleming PJ, Rose GD. Do all backbone polar groups in proteins form hydrogen bonds? Protein Sci 2005;14:1911–1917.
10. Baldwin RL. In search of the energetic role of peptide hydrogen bonds. J Biol Chem 2003;278:17581–17588.
11. Kortemme T, Morozov AV, Baker D. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 2003;326:1239–1259.
12. Morozov AV, Kortemme T, Tsemekhman K, Baker D. Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations. Proc Natl Acad Sci USA 2004;101:6946–6951.
13. Pohl FM. Empirical protein energy maps. Nat New Biol 1971;234:277–279.
14. Sippl MJ. Knowledge-based potentials for proteins. Curr Opin Struct Biol 1995;5:229–235.
15. Shortle D. Propensities, probabilities, and the Boltzmann hypothesis. Protein Sci 2003;12:1298–1302.
16. Winther O, Krogh A. Teaching computers to fold proteins. Phys Rev E Stat Nonlin Soft Matter Phys 2004;70(3 Part 1):030903.
17. Hinton GE. Training products of experts by minimizing contrastive divergence. Neural Comput 2002;14:1771–1800.
18. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247:536–540.
19. Brenner SE, Koehl P, Levitt M. The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000;28:254–256.
20. Podtelezhnikov AA, Wild DL. Exhaustive Metropolis Monte Carlo sampling and analysis of polyalanine conformations adopted under the influence of hydrogen bonds. Proteins 2005;61:94–104.
21. Engh RA, Huber R. Accurate bond and angle parameters for X-ray protein-structure refinement. Acta Crystallogr Sect A 1991;47:392–400.
22. Brunger AT. X-PLOR, version 3.1: a system for X-ray crystallography and NMR. New Haven, CT: Yale University Press; 1992.
23. Ho BK, Coutsias EA, Seok C, Dill KA. The flexibility in the proline ring couples to the protein backbone. Protein Sci 2005;14:1011–1018.
24. Ramachandran GN, Ramakrishnan C, Sasisekharan V. Stereochemistry of polypeptide chain configurations. J Mol Biol 1963;7:95–99.
25. Hopfinger AT. Conformational properties of macromolecules. New York: Academic Press; 1973. x, p 339.
26. Word JM, Lovell SC, LaBean TH, Taylor HC, Zalis ME, Presley BK, Richardson JS, Richardson DC. Visualizing and quantifying molecular goodness-of-fit: small-probe contact dots with explicit hydrogen atoms. J Mol Biol 1999;285:1711–1733.
27. Pappu RV, Srinivasan R, Rose GD. The Flory isolated-pair hypothesis is not valid for polypeptide chains: implications for protein folding. Proc Natl Acad Sci USA 2000;97:12565–12570.
28. Gilks WR, Richardson S, Spiegelhalter DJ. Markov chain Monte Carlo in practice. Boca Raton, FL: CRC; 1998.
29. Brunger AT, Adams PD, Clore GM, DeLano WL, Gros P, Grosse-Kunstleve RW, Jiang JS, Kuszewski J, Nilges M, Pannu NS, Read RJ, Rice LM, Simonson T, Warren GL. Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr D Biol Crystallogr 1998;54(Part 5):905–921.
30. Murshudov GN, Vagin AA, Dodson EJ. Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallogr D Biol Crystallogr 1997;53(Part 3):240–255.
31. Sheldrick GM, Schneider TR. SHELXL: high-resolution refinement. Methods Enzymol 1997;277:319–343.
32. Konnert JH, Hendrickson WA. A restrained-parameter thermal-factor refinement procedure. Acta Crystallogr Sect A 1980;36:344–350.
33. Tronrud DE. TNT refinement package. Methods Enzymol 1997;277:306–319.
34. Lamzin VS, Wilson DS. Automated refinement for protein crystallography. Methods Enzymol 1997;277:269–305.
35. Jack A, Levitt M. Refinement of large structures by simultaneous minimization of energy and R factor. Acta Crystallogr Sect A 1978;34:931–935.
36. Fabiola F, Bertram R, Korostelev A, Chapman MS. An improved hydrogen bond potential: impact on medium resolution protein structures. Protein Sci 2002;11:1415–1423.
37. Perrin CL, Nielson JB. "Strong" hydrogen bonds in chemistry and biology. Annu Rev Phys Chem 1997;48:511–544.
38. Schellman JA. The stability of hydrogen-bonded peptide structures in aqueous solution. C R Trav Lab Carlsberg [Chim] 1955;29(14/15):230–259.
39. Myers JK, Pace CN. Hydrogen bonding stabilizes globular proteins. Biophys J 1996;71:2033–2039.
40. Hutchinson EG, Sessions RB, Thornton JM, Woolfson DN. Determinants of strand register in antiparallel β-sheets of proteins. Protein Sci 1998;7:2287–2300.
41. Fooks HM, Martin AC, Woolfson DN, Sessions RB, Hutchinson EG. Amino acid pairing preferences in parallel β-sheets in proteins. J Mol Biol 2006;356:32–44.
42. Gabriel KR. The distribution of the number of successes in a sequence of dependent trials. Biometrika 1959;46:454–460.
43. Viveros R, Balasubramanian K, Balakrishnan N. Binomial and negative binomial analogues under correlated Bernoulli trials. Am Stat 1994;48:243–247.
44. Katz RW. Comments on "Binomial and negative binomial analogues under correlated Bernoulli trials", by R. Viveros et al. (with reply). Am Stat 1995;49:325, 326.
45. Clay O. Standard deviations and correlations of GC levels in DNA sequences. Gene 2001;276(1/2):33–38.
46. Nguyen HD, Marchut AJ, Hall CK. Solvent effects on the conformational transition of a model polyalanine peptide. Protein Sci 2004;13:2909–2924.
47. Scheraga HA, Vila JA, Ripoll DR. Helix-coil transitions revisited. Biophys Chem 2002;101/102:255–265.
48. Feller W. An introduction to probability theory and its applications. New York: Wiley; 1968. v. p.

APPENDIX

Let us consider a model of N trials with concerted binary outcomes, where the outcomes can be combined into a sequence of M groups with f trials in each group, N = fM. All outcomes are always the same within each group, whereas the groups are independent of each other. Since the groups are independent, the distribution of the number of successful groups, M_h, is a classical binomial distribution, with the mean of Mp and the variance given by the following expression:

$$\mathrm{var}(M_h) = M p (1 - p) \qquad (9)$$

where p is the equilibrium probability of success.48 The distribution of the total number of successes, N_h = fM_h, can be obtained directly from the distribution of M_h. The expectation value of N_h is fMp = Np, and its variance can be expressed as follows:

$$\mathrm{var}(N_h) = f^2\,\mathrm{var}(M_h) = f^2 M p (1 - p) = f N p (1 - p) \qquad (10)$$

This is the same equation as Eq. (8). Again, in the framework of this statistical model, the coefficient f is the number of simultaneous concerted trials.
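The variance formula of Eq. (10) can also be checked numerically. The following Monte Carlo sketch is an illustration only and is not part of the original analysis: the number of groups, the sample size, and the random seed are arbitrary assumptions, and the function names are hypothetical. Each sample draws M independent groups that succeed with probability p and counts f successes per successful group.

```python
import random
import statistics


def simulate_concerted_counts(n_groups, f, p, n_samples, seed=0):
    """Draw n_samples values of N_h = f * M_h for M = n_groups independent groups.

    Every group succeeds with probability p, and all f trials inside a
    group share the group's outcome (the concerted model of the Appendix).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_samples):
        successful_groups = sum(rng.random() < p for _ in range(n_groups))
        counts.append(f * successful_groups)
    return counts


def predicted_variance(n_trials, f, p):
    """Variance of N_h predicted by Eq. (10): f * N * p * (1 - p)."""
    return f * n_trials * p * (1.0 - p)


if __name__ == "__main__":
    f, p, n_groups = 3, 0.258, 50
    n_trials = f * n_groups  # N = f * M
    counts = simulate_concerted_counts(n_groups, f, p, n_samples=5000, seed=1)
    print("sample mean    :", statistics.mean(counts), "expected:", n_trials * p)
    print("sample variance:", statistics.pvariance(counts),
          "predicted:", predicted_variance(n_trials, f, p))
```

With f = 1 the same code reduces to ordinary independent Bernoulli trials, so comparing the two settings shows directly how concerted formation inflates the variance by the factor f while leaving the mean unchanged.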
