Applications of Knowledge-Tunnelling to ...

Viewer
Transcript

Applications of Knowledge-Tunnelling to Crystallographic Renement of Macromolecules Swanand Gore

A thesis submitted for the degree of Doctor of Philosophy

Queens' College Cambridge - United Kingdom 2007

Declaration This is a summary of research carried out in the Department of Biochemistry, University of Cambridge, from October 2003 to August 2007. The work described in the dissertation is my own, unless otherwise stated in the text. It has not, either in part or as a whole, been submitted for a degree, diploma or other qualication at any other university. The length of this dissertation does not exceed the word limit.

Swanand Gore August 2007 Cambridge, United Kingdom

i

dedicated to the loving memory of my father and grandparents

ii

Contents Declaration

i

Acknowledgements

vi

Abbreviations

viii

Abstract

ix

1 Introduction

1

1.1

Role of structural biology

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

1.2

Proteins: Primary and Secondary Structure . . . . . . . . . . . . . . . . . .

2

1.2.1

Amino acids and polypeptides . . . . . . . . . . . . . . . . . . . . . .

2

1.2.2

Ramachandran map . . . . . . . . . . . . . . . . . . . . . . . . . . .

3

1.2.3

Sidechains and rotamers . . . . . . . . . . . . . . . . . . . . . . . . .

7

rna Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

1.3.1

Polynucleotides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

1.3.2

Base interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

1.3.3

Sugar pucker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13

1.3.4

Backbone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13

1.4

Interactions within macromolecules . . . . . . . . . . . . . . . . . . . . . . .

14

1.5

On folding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

1.6

3D modelling of protein structures . . . . . . . . . . . . . . . . . . . . . . . .

16

1.6.1

Model representation . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

1.6.2

Energy function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

17

1.6.2.1

Statistical potentials . . . . . . . . . . . . . . . . . . . . . .

17

1.6.2.2

Molecular mechanics force elds . . . . . . . . . . . . . . . .

18

Sampling strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

19

1.6.3.1

Direct methods . . . . . . . . . . . . . . . . . . . . . . . . .

19

1.6.3.2

Molecular dynamics and simulated annealing (mdsa) . . . .

19

1.3

1.6.3

iii

1.6.3.3

Genetic algorithms . . . . . . . . . . . . . . . . . . . . . . .

20

1.6.3.4

Tree search . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

Comparative and ab initio approaches to modelling . . . . . . . . . .

21

Experimental structure determination . . . . . . . . . . . . . . . . . . . . . .

21

1.7.1

X-ray Crystallography . . . . . . . . . . . . . . . . . . . . . . . . . .

22

1.7.2

NMR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

1.7.3

Cryo EM

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

1.7.4

SAXS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

Crystallographic renement . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

1.8.1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

1.8.2

Automation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

27

1.8.3

Errors and Validation . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

1.8.4

Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

1.6.4 1.7

1.8

1.9

2 Rapper tk: a versatile engine for discrete restraint-based conformational sampling of macromolecules 33 2.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

34

2.2

Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

35

2.2.1

Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

35

2.2.2

Samplers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

38

2.2.3

Restraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

38

2.2.4

Builders and followers . . . . . . . . . . . . . . . . . . . . . . . . . .

39

2.2.5

Sampling Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . .

39

2.2.5.1

Ordering of builders and restraints . . . . . . . . . . . . . .

39

2.2.5.2

Execution of builders and restraints . . . . . . . . . . . . . .

42

2.2.6

Spatial grid for checking clashes . . . . . . . . . . . . . . . . . . . . .

42

2.2.7

pdb reader, model renderer etc. . . . . . . . . . . . . . . . . . . . . .

42

2.2.8

Application scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . .

42

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

43

2.3.1

Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

43

2.3.1.1

Mainchain modelling . . . . . . . . . . . . . . . . . . . . . .

43

2.3.1.2

All atom modelling . . . . . . . . . . . . . . . . . . . . . . .

44

2.3.1.3

Computational Cost and Quality Check . . . . . . . . . . .

45

Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

50

2.3.2.1

50

2.3

2.3.2

Protein-ligand interface sampling in electron density . . . .

iv

2.3.2.2

Protein-rna interface sampling . . . . . . . . . . . . . . . .

51

2.3.2.3

Sampling β sheets . . . . . . . . . . . . . . . . . . . . . . .

53

2.3.2.4

All-atom model generation from approximate secondary structure information and particle shape . . . . . . . . . . . . . .

2.4

Discussion and Conclusions

. . . . . . . . . . . . . . . . . . . . . . . . . . .

3 Optimal sidechain packing in proteins and crystallographic renement

53 58

61

3.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

62

3.2

Sidechain reassignment procedure . . . . . . . . . . . . . . . . . . . . . . . .

63

3.3

Single- and multi-conformer renement . . . . . . . . . . . . . . . . . . . . .

64

3.3.1

Renement protocols . . . . . . . . . . . . . . . . . . . . . . . . . . .

64

3.3.2

Single-conformer renement . . . . . . . . . . . . . . . . . . . . . . .

70

3.3.3

Multiconformer renement . . . . . . . . . . . . . . . . . . . . . . . .

74

3.4

Sequence assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

78

3.5

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

79

4 Modelling protein loops and their heterogeneity

82

4.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

83

4.2

Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

85

4.2.1

Overview of iterative renement . . . . . . . . . . . . . . . . . . . . .

85

4.2.2

Electron Density Ranker . . . . . . . . . . . . . . . . . . . . . . . . .

86

4.2.3

Symmetry-related clash cheking . . . . . . . . . . . . . . . . . . . . .

86

4.2.4

Loop closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

86

4.2.5

Both-sided sampling . . . . . . . . . . . . . . . . . . . . . . . . . . .

87

4.2.6

Multiconformer sampling . . . . . . . . . . . . . . . . . . . . . . . . .

88

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

88

4.3.1

Reproducing

renement . . . . . . . . . . . . . . . . . .

88

4.3.2

Rebuilding missing loops . . . . . . . . . . . . . . . . . . . . . . . . .

90

4.3.3

Framework and loops . . . . . . . . . . . . . . . . . . . . . . . . . . .

91

4.3.4

Variation of Rf ree with collection size . . . . . . . . . . . . . . . . . .

91

4.3.5

Mistakes in the composite protocol . . . . . . . . . . . . . . . . . . .

96

4.3.6

Modelling loop heterogeneity with collections and ensembles . . . . .

98

Concluding Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

99

4.3

4.4

rapper/cns

5 rna sampling and crystallographic renement 5.1

106

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

107

5.1.1

107

Role of rna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

5.2

5.3

5.4

5.1.2

rna structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

107

5.1.3

rna structure prediction . . . . . . . . . . . . . . . . . . . . . . . . .

108

5.1.4

rna crystallographic renement . . . . . . . . . . . . . . . . . . . . .

109

5.1.5

This work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

109

rna tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

110

5.2.1

Sampling styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

111

5.2.2

Initial observations . . . . . . . . . . . . . . . . . . . . . . . . . . . .

111

5.2.3

Sampling performance . . . . . . . . . . . . . . . . . . . . . . . . . .

114

Foray into crystallographic renement . . . . . . . . . . . . . . . . . . . . . .

118

5.3.1

About trna structure . . . . . . . . . . . . . . . . . . . . . . . . . .

118

5.3.2

Composite renement protocol . . . . . . . . . . . . . . . . . . . . . .

118

5.3.3

Rening a helical fragment . . . . . . . . . . . . . . . . . . . . . . . .

120

5.3.4

Rening the T ψC loop . . . . . . . . . . . . . . . . . . . . . . . . . .

120

5.3.5

Rening the anticodon loop . . . . . . . . . . . . . . . . . . . . . . .

124

5.3.6

Typical problems with renement protocols . . . . . . . . . . . . . .

124

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

128

6 Conclusion and Future Directions

129

6.1

Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

129

6.2

Methodological challenges in knowledge-tunnelling . . . . . . . . . . . . . . .

130

6.3

Short-term objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

131

6.3.1

Sequence assignment on approximate mainchains in low resolution crystallographic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.3.2

Combining secondary structure and loose positional restraints for protein crystallographic renement . . . . . . . . . . . . . . . . . . . . .

6.4

6.5

131 131

6.3.3

Including sidechain heterogeneity in protein crystallographic renement 132

6.3.4

Characterizing protein loop heterogeneity . . . . . . . . . . . . . . . .

132

6.3.5

Knowledge-based crystallographic renement of trna structures . . .

132

6.3.6

rna secondary structure to 3D models . . . . . . . . . . . . . . . . .

132

Long-term goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

133

6.4.1

Heterogeneity annotation of the pdb . . . . . . . . . . . . . . . . . .

133

6.4.2

Addressing assemblies . . . . . . . . . . . . . . . . . . . . . . . . . .

133

6.4.3

Combining data from various sources . . . . . . . . . . . . . . . . . .

133

Final Comment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

134

Bibliography

135

vi

Acknowledgements First and foremost, I thank my supervisor Prof Sir Tom Blundell for invaluable guidance and patient support he has provided to me during my PhD. I have learnt a lot from him. I am grateful to Dr Anil Seal also for oering me a prestigious scholarship and for patiently pursuing me to use it fruitfully. I thank Dr Michele Vendruscolo for oering me a chance to work on an NMR project and introducing me to very interesting issues therein. I thank Prof N Srinivasan at IISc for introducing me to the world of biocomputing and also suggesting me to pursue my PhD in this lab. Thanks to Dr David Burke for patiently guiding me initially during my work, to Dr Dimitri Chirgadze for addressing crystallography doubts, and Mr Graham Eli, the everhelpful systems administrator of the lab. Thanks to Dr Kenji Mizuguchi and Dr Ben Luisi also for helpful discussions and advice. Work in this thesis would not have been possible without the availability of computing clusters at Chemistry Department and University HPC facility: I thank the system managers Dr Catherine Pitt, Dr Stuart Rankin and their teams for keeping the systems available and highly usable. Thanks to the rapper team! Thanks to Paul and Mark started for starting this project and for a few thought-provoking meetings. Thanks to Nick for frequent brainstorming and reading my writeups. Thanks to Anjum for benchmarking Rapper tk and many helpful suggestions. Besides rapper people, I owe thanks to labmates Rinaldo, Catherine, Richard, Tammy, Sock Hoon. I have been very lucky in meeting and befriending exceptional people kind to me in many dierent ways. The list is rather long: my school friends Mandar, Nikhil, Amit, Shailesh, Rajesh, Ashutosh, Makarand, Avinash, Devendra, Sachin, Vaibhav, Yogesh, Amol, Shrikrishna, Umesh; friends from engineering days: Ishan, Prashant, Nitin, Medha, Narkhede, Amogh, Narendra, Ravi, Ajit, Chetan, Krunal; friends from IISc Anand, Manish, Sujatha, Vikrant, Srikanta, GVSK, Balaji; friends from Strand: Unni, Janaki, Nimisha, Dr Narasimha, Mahesh, Abhi; and of course in Cambridge: Anand, Shilpa, Muzaar, Deepti, Lekshmi, Pradeep, Neelam, Saroj, Xavier, Sonia, James, Rose, Shreyas, Mair, Esther. One has to be fortunate to get instructed at good Institutes by inspirational teachers. vii

I must thank my teachers, especially at Somalwar High School and IISc for whetting my appetite to seek and know. Squash have been a great way to relax, thanks to Anant Krishnan for introducing me to this wonderful game. University library has been a fun place for wondering at the wealth of books there and occasionally reading some random books. Lastly and most importantly, I must try thanking my immediate and extended family. No words would suce to thank my mother and brother Deodatta who have almost always encouraged my endeavours and been reliable pillars of strength through all ups and downs. My extended family has also been very kind and supportive: especially Shweta, Sheelamavshi, Sharadkaka, Andumama, Smitamami, Latamavshi, Nimamavshi, Sudhirkaka, Madhavitai, Sampadatai, Praulla, Kedar, Devashri, Abhijit, Ashwini, Ashish, Ashokkaka, Veenakaku, Shashikaka, Devyanikaku. No research is possible without nancial support for which I heartily thank the Cambridge Commonwealth Trust and Universities UK ORS funding program.

viii

Abbreviations AA all-atom (excluding hydrogens unless indicated otherwise), or amino acid BB backbone

B-factor The Debye-Waller factor for modelling uctuations around equilibrium coordinates CNS Crystallography and NMR system, a popular software by Axel Brunger EM Electron microscopy GABB genetic cum branch and bound algorithm MC mainchain

MDSA Molecular Dynamics Simulated Annealing MSA multiple sequence alignment

PDB Protein Data Bank

Rapper tk

rapper-toolkit

RMSD root mean squared deviation between two ordered sets of 3D coordinates of the same size, unsuperposed unless indicated

(D)RNA (De-oxy) Ribonucleic acid

ix

Abstract The vastness and complexity of the conformational landscapes present a major challenge to structure determination of macromolecules. Use of prior structural knowledge in various forms such as scoring functions, dierentiable energy functions and statistical preferences can make that task more tractable. Procedures using energy functions and statistical preferences modify a conformation in contrasting ways (continuous and discrete respectively). Knowledge-tunnelling is a discrete knowledge-based modication that can rescue a conformation from a local minimum of an energy function. The complementarity between knowledge-tunnelling and kinetic sampling has been exploited by the rapper program for ab initio protein loop prediction and automatic crystallographic renement of an approximate protein model. But rapper's applicability is limited to sampling proteins only in a way dictated by an algorithm that is hard to modify. This thesis addresses these limitations to extend the promise of knowledge-tunnelling to new challenges in macromolecular crystallographic renement. rapper has been reformulated as Rapper tk, a modular and scriptable software framework

that facilitated the conformational sampling tasks undertaken later. Along with rapper's genetic branch-and-bound algorithm, a combinatorial optimization algorithm for sidechains is implemented in Rapper tk. Ligand and rna sampling are incorporated in addition to restraint-sensitive protein sampling. For crystallographic renement of proteins, optimal sidechain placement, loop sampling and multiconformer sampling techniques are developed by using the discrete residue-specic conformational preferences of mainchain and sidechains. For rna renement, rotameric preferences of sugar-phosphate backbone are used. In all cases, the value added by knowledgetunnelling is assessed by comparing the cns/Rapper tk composite protocol with the cns-only renement protocol. I observed that irrespective of sampling algorithms or conformational preferences used, knowledge-tunnelling improves the renement statistics and is useful in sequence assignment, all-atom renement and estimation of structural heterogeneity.

x

Chapter 1 Introduction 1.1 Role of structural biology Biological research has experienced such fundamental transformations in recent decades that they seem almost revolutionary when contrasted against the centuries prior to them. Many technological, scientic and computing breakthroughs turned biological sciences upside down, changing the perspectives and ambitions of researchers. Modern biotechnology enterprise generates, catalogues, researches and integrates various kinds of data about biological systems. Cheminformatics (Agraotis et al. (2007)), genomics (Miller et al. (2004)), metabolomics (Kell (2004)), systems biology (Kritikou et al. (2006)) and synthetic biology (Andrianantoandro et al. (2006)) are some of the major current growth areas which contribute uniquely to the drug discovery process and understanding of life. Two major trends have emerged in this explosion of data and knowledge: (a) the search for microscopic explanations of macroscopic phenomena has become very common and (b) reductionist approaches are being replaced by integrative approaches. Biologists wish to identify the molecular basis of biological phenomena and also to understand the network of causation from molecular to ecological levels. Structural biology has been a part and driver of the reductionist style. That the atomic structures of components of cellular machinery are key to its understanding and manipulation is an undisputed consensus within the biomedical community. Rational identication of small drug-like molecules that bind a pocket in a protein domain is perhaps the most common impression of structural biology in the non-structural-biology community. Indeed rational drug discovery for well-identied targets has been one of the most visible success stories of structural biology, but there is much more to the eld as it spans various levels of detail - from investigating detailed reaction mechanisms of enzymes, exibility of parts of macromolecules and protein folding, it extends up to macromolecular complexes, 1

multi-component assemblies and massive particles like ribosomes and viruses. Like any other vibrant discipline, structural biology is about application of existing structure-investigation methods and development of new ones, as well as introspection i.e. analysis of known structural data. The established methods of structure determination (X-ray crystallography and NMR) are at the heart of reductionist investigations, involving moderate particle sizes. Relatively new methods (e.g. cryo electron microscopy) are aimed at elucidating structures of large particles to provide understanding of relationships between component macromolecules. For integrative investigations, such understanding is of vital importance. Thus it becomes clear that structural biology is of immense value to the biotechnology enterprise in both the reductionist and integrative pursuits. An important distinction between structural and non-structural biology research is that the latter tends to be concerned with the what, when,

where questions whereas the former is inspired by the why. This thesis is about using knowledge-based sampling in crystallographic renement, and can be classied under the method development category in structural biology research. The remaining part of this introduction is a small collection of facts that forms the context for rest of the thesis based on my understanding of macromolecular structure and its determination. In-depth comprehensive review of the issues discussed here can be obtained from many excellent texts in molecular biology and biochemistry (e.g. Lehninger et al. (1995), Voet and Voet (1995)), protein structure and function (Branden and Tooze (1999), Creighton (1993)), crystallography (Blundell and Johnson (1976), Drenth (1999)) and various review journals (e.g. Current Opinion, Nature Reviews, Annual Reviews, Trends). I begin by describing briey protein and rna structures. I then discuss the energy and forces involved in the structure and function of macromolecules and discuss their simulation. Methods of predicting and modelling are briey described, followed by major experimental techniques for macromolecular structure determination. X-ray crystallography is discussed in greater detail than others. I conclude the introduction by describing the concept and importance of knowledge-tunnelling, and an overview of the following chapters which are attempts to apply that concept in the context of X-ray crystallography.

1.2 Proteins: Primary and Secondary Structure 1.2.1 Amino acids and polypeptides As the name implies, an amino acid consists of an acid group (COOH ) and amino group (N H3 ) linked together by a sp3 -hybridized carbon atom (Cα ). The remaining two single bonds of Cα atom are made with a hydrogen atom and another chemical group, called the 2

sidechain. Two amino acids can undergo condensation and form a peptide bond (Fig.1.1). An unbranched polymer formed with repeated condensations is called a polypeptide. Polypeptides of respectable size (> 30 amino acid residues long) are known as proteins. This simple chemical architecture gives rise to a diverse set of properties like catalysis (proteases), recognition (antibodies), strength (collagen), elasticity (elastin), etc. A peptide bond consists of two partial double bonds C − O and C − N . Due to this, the dihedral angle ω (Cα − C − N − Cα ) has a strong tendency to be in one of the two planar congurations (Fig.1.1), cis (ω = 0o ) or trans (ω = 180o ). Since the cis conformation increases the chance of steric interactions between consecutive residues, the probability of it occurring is very small (0.04%). But when the next residue is proline, this probability increases significantly to 6.02% (Jabs et al. (1999)) due to the non-standard sidechain-mainchain covalent linkage in proline. The restricted nature of ω is very useful for sampling as it almost removes a degree of freedom. The sidechain group distinguishes an amino acid from another. The genetic code species 20 frequently occurring (listed in Table 1.1) and 2 rarely occurring (selenocysteine and pyrrolysine) amino acids. Translational and post-translational in-vivo and in-vitro modications can be performed on sidechains, e.g. a common modication performed for crystallographic phasing is replacement of sulphur by selenium in methionine by growing a microorganism on a substrate containing seleno-methionine instead of methionine. Due to the unbranched nature of the polypeptide chain and limited number of sidechain types, proteins can be described as sequences. As Cα atom is chiral, an amino acid can exist as two isomeric forms L and D. The majority of proteins synthesized by cells consist of L-form only, with rare exceptions such as peptidoglycans in bacterial cell walls or exotic animals like cone snails.

1.2.2 Ramachandran map Repeating peptide units in proteins are collectively termed as mainchain. Dihedral angles in the mainchain are φ (C i−1 −N i −Cαi −C i ), ψ (N i −Cαi −C i −N i+1 ) and ω (Cαi −C i −N i+1 −Cαi+1 ) (Fig.1.1). Although ω is highly restricted, φ and ψ confer signicant exibility upon the mainchain. Importantly, since the variation in covalent geometry is negligible, φ−ψ trajectory is an almost lossless compression of mainchain's Cartesian coordinates because the latter can be modelled accurately using the former. φ−ψ maps are useful in protein structure validation, sampling, prediction and determination. From Linus Pauling's famous prediction of secondary structures (Pauling et al. (1951)), it was clear that some φ − ψ pairs would be used in proteins more than others due to the requirement of secondary structure formation. Using a simple hard-sphere model of atoms, G. 3

Figure 1.1: A 3-residue section containing 2 peptide bonds from PDB entry 9ILB is shown. Mainchain dihedrals φ, ψ, ω and sidechain dihedral χ1 , χ2 , χ3 , χ4 are marked. The ω dihedral angle between Tyr-90 and Pro-91 is in the cis conformation, that between Pro-91 and Lys-92 is in the trans conformation. Note the change in inter-Cα distance due to ω .

NZ

CE χ4

OH CD

χ3 CG

CZ CE1

LYS−92

χ2

CE2

CB χ1 C

CD1

CD2 CG χ2

N CA

N

ϕ

φ

C O

ω

φ

O ϕ CA

N

χ1

PRO−91

CB χ2 CD

4

O

ω

C 2.8

CB χ1

TYR−90

CA

3.8

CG

Table 1.1: Commonly occurring amino acids.

Amino Acid Glycine Alanine Valine Leucine Isoleucine Proline Phenylalanine Tyrosine Tryptophan Aspartic Acid Glutamic Acid Histidine Lysine Arginine Cysteine Methionine Serine Threonine Asparagine Glutamine

Abbreviations Physicochemical properties GLY G ALA A VAL V LEU L ILE I PRO P PHE F TYR Y TRP W ASP D GLU E HIS H LYS K ARG R CYS C MET M SER S THR T ASN N GLN Q

aliphatic, hydrophobic aliphatic, hydrophobic aliphatic, hydrophobic aliphatic, hydrophobic aliphatic, hydrophobic aliphatic, hydrophobic aromatic, hydrophobic aromatic aromatic hydrophilic, acidic hydrophilic, acidic hydrophilic, basic, imidazole hydrophilic, basic, amine hydrophilic, basic sulfurous, hydrophobic sulfurous, hydrophobic alcohol alcohol hydrophilic hydrophilic

5

Figure 1.2: Various secondary structures in proteins mapped onto the Ramachandran map. Regions corresponding to α helix, β sheet, γ turn, type II' turn, 310 helix etc. can be conveniently viewed. Copied from Lovell et al. (2003).

N. Ramachandran recognized that all possible φ − ψ pairs would not be energetically feasible due to the interatomic steric clashes they could cause. In their seminal work, Ramachandran and Sasisekharan (1968) divided the φ − ψ space into allowed and forbidden regions. These early insights have become cornerstones of protein structure studies and inspired many researchers to revisit φ − ψ maps to understand their ner features. Residue-specic maps with detailed analysis of the boundary between allowed and forbidden regions have since been derived (Kleywegt and Jones (1996b), Laskowski et al. (1993), Lovell et al. (2003), Walther and Cohen (1999)). Many types of secondary structures in proteins (Fig.1.2) can be neatly viewed on the Ramachandran map. Residue-specic φ − ψ maps reect the distinctive nature of glycine and proline from the other 18 standard residues (Fig.1.3). Proline cannot adopt +φ conformation and has a 6

signicantly reduced −φ region. Glycine does not have a sidechain group, hence it is free to adopt both φ, ψ and−φ, −ψ conformations to yield a nearly symmetric map. Another important deviation from the usual map is pre-proline residues which lose some of their conformational freedom due to steric hindrance from the following proline. Generally the higher resolution structure of a protein has a smaller number of residues in low-propensity regions than its low-resolution counterpart. This makes the Ramachandran map an eective tool to diagnose errors in structure (Laskowski et al. (1993),Lovell et al. (2003)). Some analyses (Gunasekaran et al. (1996)) do nd that outliers in the Ramachandran map of a protein may be genuinely strained conformations necessary for function, but also note that such distortions are extremely rare and must be strongly supported by experimental data. Another important use of Ramachandran maps is in protein structure prediction to restrain and bias the mainchain towards protein-like conformations. Fine-grained, propensityweighted sampling of φ − ψ distributions has been used by many researchers (Abagyan and Totrov (1994), DePristo et al. (2003), Sali and Blundell (1993), Evans et al. (1995)). In contrast, researchers have also tried exhaustive mainchain sampling by deriving representative coarse-grained φ − ψ sets (Park and Levitt (1995), Moult and James (1986), Rooman et al. (1991), Deane and Blundell (2000)). Choice of granularity of φ − ψ maps is an example of the classic tradeo between model accuracy and sampling eciency. For restraint-rich scenarios requiring accurate sampling, ne-grained φ−ψ sets are suitable whereas in scenarios where discrimination function itself is coarse, coarse-grained maps would increase coverage of conformational space during sampling.

1.2.3 Sidechains and rotamers Sidechains impart physicochemical personality to a protein. Their conformations are described using the χ dihedral angles (Fig.1.1), e.g. the most common of them is the χ1 dened over N, Cα , Cβ and the atom in γ position (Cγ in valine, Sγ in cysteine, Oγ in serine etc.) and also the major determinant of sidechain conformation with respect to the mainchain. Much like the mainchain, sidechain conformational freedom is also restricted by steric clashes. Consistent with small-molecule observations, sidechains show preference for gauche and trans congurations over eclipsed. This leads to clustering of sidechain dihedral angles observed in protein crystal structures. A representative of such a cluster is called a rotamer (rotational isomer). Many rotamers together with their observed probabilities form a rotamer library that represents the local energy minima in the χ space. Interestingly, secondary structure strongly inuences rotameric propensities. For instance, in α-helices, χ1 is rarely around

+60o due to the steric clashes that would follow. 7

Figure 1.3: Ramachandran Maps generated by Rampage (http://wwwcryst.bioc.cam.ac.uk/rampage) from the analysis of Top500 dataset. Although φ − ψ surface is toroidal, for convenience it is depicted on a planar grid. The four maps shown are general, glycine, pre-proline and proline. Darkly shaded regions are favoured, lightly shaded are allowed and rest of the maps are forbidden. (Copied from Depristo (2003))

8

O-rotameric sidechains are likely to be due to non-local intervention in the local energy landscape of the sidechain, but a signicant proportion can be artefacts of structure determination process (Petrella and Karplus (2001)). Similar to the mainchain, sidechain rotamericity increases in higher resolution structures (Lovell et al. (2000)), making it a diagnostic tool to spot errors in protein structure determination (Laskowski et al. (1993)). Modelling programs prefer using sidechain libraries as they reduce the sampling requirements. This has led to many compilations of rotameric libraries (Dunbrack (2002)). Backbone-dependent rotamer libraries of Dunbrack and Karplus (1993) and Dunbrack and Cohen (1997) have been used extensively in SCWRL (Bower et al. (1997), Canutescu et al. (2003)). The penultimate rotamer library of Lovell et al. (2000) has been used in rapper (DePristo et al. (2003)) because it uses strict ltering criteria to avoid rotamers likely to be artefacts of crystallography. Shetty et al. (2003) have derived a ne-grained rotamer library with stringent quality checks, retaining the reported bond lengths and angles, and observed improvements in modelling. Fig.1.4 illustrates the rotamers available in the Penultimate Rotamer library for backbones in α-helix or β -strand. Width of sticks represents the propensities of rotamers.

1.3

rna

Structure

1.3.1 Polynucleotides While polypeptides dominate the catalytic and structural roles, polynucleotides have a monopoly over communication (rna only) and storage (mostly dna) of genetic information in a cell, much like information channels and warehouses. More research on the rna is revealing its catalytic role too. Polynucleotides are unbranched polymers of nucleotides. A nucleotide in general consists of a phosphate, a sugar and a heterocyclic base (Fig.1.5). In dna and rna, the sugar is ribose (a pentose) and the base is either a purine or a pyrimidine. Two

constitutional dierences between dna and rna are (a) dna's sugar lacks the −OH group at 20 position and (b) dna utilizes thymine instead of uracyl. Various polymerases extend dna or rna chains by joining the sugar's −OH group at 30 position with phosphate group

of another nucleotide.

1.3.2 Base interactions A major structural motif for rna is the A-form helix, so an eye-catching feature in most rna structures is the famous Watson-Crick base pairing predicted initially for B-form dna.

This hydrogen-bonding pattern is possible only between base-pairs AT, AU and GC, which consist of a purine and a pyrimidine respectively. The beauty of this pairing is the isostericity 9

Figure 1.4: Propensity-weighted secondary-structure-dependent rotamers from the Penultimate rotamer library. Stick-widths are proportional to rotameric propensities. Note the dierences in propensities of the same rotameric state for dierent backbones, e.g. for Phe. PHE-α

PHE-β

TYR-α

TYR-β

TRP-α

TRP-β

VAL-α

VAL-β

LEU-α

LEU-β

ILE-α

ILE-β

PRO-α

PRO-β

ASP-α

ASP-β

GLU-α

GLU-β

HIS-α

HIS-β

LYS-α

LYS-β

ARG-α

ARG-β

CYS-α

CYS-β

MET-α

MET-β

SER-α

SER-β

THR-α

THR-β

ASN-α

ASN-β

GLN-α

GLN-β

10

Figure 1.5: Ribonucleotide structure of guanosine. Dihedral angles and names of 3 sides of base are marked.

ge n Ed

stee oog

Previous Nucleotide

N1

C6

ric

kE

dg

Base

O2P

O1P

C1* Sugar

γ

δ

ξ

P

N3

dge Sugar E

C4*

ε

Phosphate

C4

O4*

β

O2P

N9 χ

O5*

C5*

C2

C8

Phosphate α

−C

C5

O3*

O1P

ats on

H

N7

P

W

O6

C2* C3* O2* O3* O5*

Next Nucleotide

11

N2

e

Figure 1.6: The two puckers adopted by sugar in RNA. Nucleotides G-19, U-20 from PDB entry 2FMT are shown. C2' (C3') shifts towards the 5' end in C2'-endo (C3'-endo) pucker with respect to the rest of the sugar ring.

Prev Nucleotide (5’)

BASE (G−19) C2’−endo BASE

(U−20)

C3’−endo

Next Nucleotide (3’)

12

- changing a base pair for any other does not alter the rest of the geometry. Among the nonstandard (non-Watson-Crick) hydrogen-bonded base-pairs a great deal of variety is observed. This variety arises from the atoms participating in hydrogen bonding. There are three sets of atoms in a base which hydrogen-bond with another base (Fig.1.5) - the Watson-Crick edge (e.g. N1-C2 in adenosine), Hoogsteen edge (e.g. N7-C6 in adenosine) and sugar edge (e.g. C4-N3 in adenosine). This gives rise to 6 kinds of edge pairs and hydrogen bonding patterns which describe most base pairs (Leontis and Westhof (2001)). Due to the phenomenon of bifurcated hydrogen bonding (2 donor hydrogens attached to same base, pointing to the same acceptor), a new category sometimes arises between Watson-Crick and Hoogsteen edge. Another signicant feature in most rna structures is the base stacking that goes with base-paired strands. The π − π interactions therein are believed to be stabilizing interactions whereas hydrogen-bonding interactions in base-pairing are responsible for orientation of stretches of backbone and specicity of base-base interactions (Voet and Voet (1995)).

1.3.3 Sugar pucker Pentose sugar can occur in various puckers to relieve the steric clashes of ring substituents. A pucker is a deviation from planar geometry and can occur in many ways for a free sugar. But in all rna structures observed so far, only two puckers are seen: C20 -endo and C30 -endo (Fig.1.6). In the former, the C20 atom is shifted in the 50 direction out of the plane formed by C10 , C30 , C40 , O50 . In a similar way, in the latter, C30 atom is shifted in the same direction out of the C10 , C20 , C40 , O50 plane. Due to this limited freedom, the sugar ring is analogous to the Cβ atom in proteins. Pucker is conformationally important, e.g. B-dna has C20 -endo pucker whereas A-form rna helix has C30 -endo pucker, Z-dna has C20 -endo in purines and

C30 -endo in pyrimidines.

1.3.4 Backbone rna chain is elongated at its 30 end by rna polymerase using a nucleoside triphosphate releas-

ing pyrophosphate forming the rna backbone of alternating groups of sugar and phosphate. In most pdb les, a nucleotide backbone consists of atoms P, O1P, O2P, O50 , C50 , C40 , C30 , O30 . There are thus 6 independent dihedral angles in the backbone per nucleotide. This generates a huge conformational space for the backbone to explore, but in the crystallographic structures of rna and dna, most nucleotides are found clustered around a limited set of values. Unlike the protein backbone, rna backbone preferences are harder to study due to the data sparsity. rna backbone has been studied by observing the dinucleotide preferences (Schneider et al. (2004)), heminucleotide preferences (Malathi and Yathindra (1985)) 13

and approximating them to two virtual dihedral angles (Duarte and Pyle (1998)). The observed rotamericity of the backbone (Murray et al. (2003)) is encouraging for the discrete restraint-based sampling.

1.4 Interactions within macromolecules The free energy of a system has enthalpic and entropic contributions. Synthesis of macromolecules from various ingredients is endothermic because despite enthalpy reduction from favorable interactions between ingredients, ordering costs signicant entropic penalty. Folding of a macromolecule to a stable 3D structure represents a further rise in order and the ensuing entropic penalty must be compensated by a comparable drop in enthalpy. Covalent structure of the folded and unfolded states is the same (with exceptions like disulphide bridges in proteins), which implies that the stability of a folded macromolecule (drop in enthalpy − rise in entropy) is a function of non-covalent interactions within the molecule and with water. Most non-covalent interactions require physical proximity of interacting groups except electrostatics. Electrostatic force of attraction (repulsion) between unlike (like) charges can exert long-range inuence according to Coulomb's inverse square law, the potential form of which is:

Eelectrostatic =

qi qj 4πD²rij

(1.1)

where qi, qj are the charges separated by distance rij in a medium of dielectric constant

D. Electrostatic interactions are heavily inuenced by the solvent due to its shielding eect. Van der Waals interaction diers both in form and range from electrostatic. Its repulsion component arises from overlap of atomic orbitals and attraction from the transient dipole eect. The interaction can be compactly expressed as the Lennard-Jones 6 − 12 form:

Evdw = (

σR 12 σA ) − ( )6 rij rij

(1.2)

Apart from these two fundamental interatomic non-covalent interactions, many other interactions arise between two groups of atoms. The most noticeable of these is hydrogen bonding because it is involved in prominent features like Watson-Crick base pairing and α,β secondary structures in proteins. The donor hydrogen is covalently bonded to a base atom, but interacts favorably with the acceptor atom. Oxygen and nitrogen are frequently observed to be acceptor and base respectively. π hydrogen bonding (Brandl et al. (2001), Steiner and Koellner (2001)) occurs when the acceptor is aromatic ring. Other interactions involving aromatic groups are π − π stacking and cation-π (Mecozzi et al. (1996)) interactions. Salt 14

bridges (Waldburger et al. (1995)) are formed between opposite-charged sidechains such as arginine and glutamate. Finally, aggregation of hydrophobic groups away from solvent is also an important non-covalent interaction, but it is rather unspecic as compared to others described above. Hydrophobic burial is thought to arise from the fact that it is more favorable enthalpically to expose polar sidechains to water than nonpolar ones. Hydrophobic collapse (Agashe et al. (1995)) of aliphatic sidechains is regarded as a key event in protein folding.

1.5 On folding Despite (or due to?) their deceptively simple chemical architectures, macromolecules have a mind-boggling conformational freedom available to them, e.g. a 100-residue protein, with at least 3 conformational degrees of freedom per residue and at least 3 values per degree of freedom, leads to 3300 conformational possibilities. Assuming that the native state is the global minimum of the energy landscape and 10−13 seconds per conformation to sample, unbiased search will take more than 10100 years to fold the protein, but most proteins take milliseconds to seconds for folding. This startling disagreement is called as Levinthal's paradox (Levinthal (1968)), which naturally leads to the speculation that cooperative synergy between dierent parts of the chain must be shaping the energy landscape to achieve fast folding. According to the folding funnel hypothesis (Dill and Chan (1997), Fersht and Daggett (2002), Leopold

et al. (1992)), energy landscape is hierarchical and funnel-like, with native state at the bottom of the funnel. Annsen's suggestion of spontaneous, completely sequence-determined folding (Annsen (1973)) is applicable to only to a subset of proteins. This is supported by existence of proteins which help folding of other proteins (e.g. chaperones, see Fink (1999)), degradation mechanisms for misfolded proteins (e.g. ubiquitin-proteasome pathway, see Goldberg (2003)) and many diseases which may be linked to misfolding-related aggregation (e.g. mad-cow disease, see Chiti and Dobson (2006)). Even the notion of a unique global minimum energy conformation is uncertain. The rugged nature of the energy landscape does not become smoother close to the global minimum, instead the landscape hosts many comparable yet distinct energy minima (Frauenfelder and Leeson (1998)). Molecular heterogeneity results when multiple basins of this kind are occupied (Burling et al. (1996), Rejto and Freer (1996)). These conformational subpopulations, which are believed to be in dynamic equilibrium, are important for protein function (Rader and Agard (1997), Wilson and Brunger (2000)).

15

1.6 3D modelling of protein structures Prediction of 3D structure from sequence is the most fundamental challenge for structural biology. The process of 3D structure modelling depends heavily upon available experimentallydetermined structures and the inferences that can be drawn from them about the sequence. Discrimination between two candidate conformations is achieved using these structural clues. Another major inuence is the available resource (computing time) - it decides how exhaustively the solution space is sampled. The purpose of modelling, discriminating ability of the energy function and the sampling resources available inuence the level of detail represented by a model. Thus a modelling method can be classied on the following correlated choices: (a) model resolution; (b) sampling strategy; (c) energy function. After these choices are made, the modelling process is the use of (b) to explore the conformational space dened by (a) and choosing the least energy conformation according to (c) within the limits of available resources. In addition to structure prediction, these choices extend to structure determination too, because the nal step of structure determination is structure modelling with modications to the energy function to include experimental observations. After the choices, I discuss the common modelling methodologies - comparative and ab initio.

1.6.1 Model representation The more detailed the model, the more degrees of freedom it will have to make conformational sampling challenging. On the other hand, realistic energy functions can be computed only on detailed models. Lattice-based models represent a residue with one or two points which can exist only on a discrete grid (Hinds and Levitt (1994), Dill et al. (1995)). The conformational space of such simple models can be exhaustively sampled to derive coarse-grained qualitative inferences about energy landscape. Obviously such models are not designed to gain atomistic insights. If the condition of lattice occupation is relaxed, we get Cα -only models, or Cα − Cβ models (Fogolari et al. (1996)). Such models have more conformational freedom, but they do not capture many features of mainchain or sidechain. Models with all mainchain atoms and a pseudo sidechain atom manage to capture mainchain features without sidechain exibility (Simons et al. (1997)). All-atom models use all heavy-atoms explicitly but rarely hydrogens, so they are better termed as united-atom models. Such models are generally sucient to calculate interesting features of macromolecules like non-covalent interactions, solvent accessibility, packing density, cavities, X-ray scattering, chemical shifts etc. True all-atom models are rarely used but use of hydrogens is important for calculating accurate hydrogen bonding potentials and atom contact validation (Yang et al. (2007), Word et al. (1999)). An extremely detailed model would consider not only all heavy atoms and hydrogens, but also 16

their electronic states and nuclei. Although such a detailed model is prohibitively costly for macromolecules, hybrid QM/MM models are used which carry out quantum mechanical calculations within the site of interest and molecular dynamics in the rest of the molecule (Senn and Thiel (2007)). Since the macromolecule of interest exists mostly in aqueous or fatty surroundings, modelling of the environment also needs attention. Waters can form intricate structures around molecules, modify the electrostatics by shielding, participate and mediate in solute interactions (Raschke (2006)). Solvent environment also oers viscous drag, random perturbation and damping to atoms in solute molecule. When explicitly represented, generally a water model such as TIP3P is coupled with heuristics like periodic boundaries, particle mesh Ewald (PME) for electrostatics etc. In implicit solvent (bulk) modelling, solvent freedom is integrated out using an approximation to the Poisson Boltzmann equation such as Generalized Born. Composite bulk/explicit models are also used (see Krol (2003)).

1.6.2 Energy function Energy functions are vital to modelling because they bias sampling and discriminate between candidate conformations. In some scenarios where near-exhaustive sampling is possible, energy functions are completely responsible for prediction quality, underlining the need to use high-quality energy functions. The resource-based tradeo here is between the the computing cost and sophistication of the function. There are two broad categories of energy functions: knowledge-based (statistical potentials, see Buchete et al. (2004)) and physics-based (molecular mechanics forceelds, see Ponder and Case (2003)). Knowledge-based functions are derived by performing statistical analyses on accumulated information in structural and sequence databases. They need not correspond to well-known physical theories. Physics-based energy functions are based on well-established physicochemical theories of atomic behaviour. Of course, experimental observations are needed to parametrize the forceelds and in that sense the classication is not absolute. Statistical potentials can be coarse-grained and undierentiable, whereas the physics-based energy functions are atomistic and dierentiable. Statistical potentials simultaneously run the risk of being statistical artefacts, and have the advantage of capturing a little-recognized physical fact.

1.6.2.1 Statistical potentials A typical statistical potential is based on pairwise distances between Cα atoms (coarse), functional groups (intermediate), individual atomtypes (ne-grained) etc. Such distances are 17

mined from structural data and presented by using statistical techniques of binning, smoothing, calculation of propensities and potentials of mean force (e.g. Buchete et al. (2003)). Pairwise potentials are assumed to be additive and a structure is scored by summing up likelihood-based energies of all pairs (Samudrala and Moult (1998)). 3- or 4-body statistical potentials have also been derived based on the assumption of cooperativity (Gan et al. (2001)). The so-called database potentials have been used for various applications like structure validation (Rojnuckarin and Subramaniam (1999)), fold recognition (Jones et al. (1992)), loop modelling (Samudrala and Moult (1998)) etc., although meaningfulness of statistical potentials is sometimes questioned (Ben-Naim (1997)).

1.6.2.2 Molecular mechanics force elds Research on small molecules reveals some strikingly unifying themes about atomic structure (a) covalent bond lengths and angles dened on same kinds of atoms tend to be very nearly the same (b) dihedral angles dened on 4 atoms A, B, C, D where pairs (A,B) (B,C) (C,D) are covalently bonded, are more variable (c) geometric relationship between atoms covalently bonded to the same atom tends to vary little, (d) minimum distance between atomic nuclei rarely falls below certain values (e) partial charges on atoms generate electrostatic forces of attraction and repulsion. Based on extensive experimentation and theoretical work, these observations still constitute the most popular basic framework to think on atomic scale about the energies of molecules and forces. To a good approximation, covalent potentials are harmonic (springlike) in nature. Electrostatics and van der Waals interactions are discussed earlier. These functional forms are adapted to participating atom set by calculating various parameters from experimental observations. The overall molecular mechanics potential for an atom p takes the following form:

E(p) =

X

Kb (l − l0 )2 +

bonds

X

Kθ (θ − θ0 )2 +

angles

X

Kω (1 + cos(nω − γ))

(1.3)

dihedrals

nonbonded X

all X σR 12 σA 6 qa qp + (( ) − ( ) ) + rn rn 4πD²rap atoms neighbours

Solvation is also an important component of physics-based energy functions as mentioned earlier. The combination of functional forms, atom-typing rules, atomtype-specic parameters and solvation models constitutes a forceeld (e.g. amber (Cornell et al. (1999)), charmm (Brooks et al. (1983)), gromacs (Lindahl et al. (2001))). In practice, a forceeld

18

and software to use it are tightly coupled and not interchangeable.

1.6.3 Sampling strategy The goal of sampling strategy is to sample the global minimum so that the energy functions can identify it. The simplest sampling strategy is exhaustive and impractical for most interesting systems and model representations. This necessitates intelligent strategies which bias the sampling process towards low energies without overly restricting the conformational exploration. Thus sampling is not independent of the energy function.

1.6.3.1 Direct methods The simplest sampling strategy would use only a random number generator to obtain the search direction and an energy calculator to evaluate the suitability of that direction. An improvement on this strategy is a set of methods that use an empirical rule to direct the search, e.g. the simplex method biases sampling by moving away from highest energy vertex of presently occupied simplex. The next class of methods uses the rst derivative to bias sampling towards a direction that seems most promising. Such methods (conjugate gradient, steepest descent) show better convergence than simplex method but are limited by dierentiability. Methods like Newton-Raphson use second derivatives to achieve even better results than rst-derivative methods but they are computationally demanding due to the matrix operations involved. Derivative-based methods, by denition, do not go against the derivative and can easily get trapped in local minima of a complicated function.

1.6.3.2 Molecular dynamics and simulated annealing (mdsa) Simulated annealing (SA) is inspired by two physical observations: (a) the characteristics of a metal are better if it is cooled slowly from a high temperature than fast cooling and (b) repeated cycles of heating/cooling improve the quality further. Two major components of this sampling strategy are (i) a move-set that can sample the states of the system around the present state and (ii) concept of temperature that falls as annealing progresses. A favorable change of state (δE < 0) is accepted irrespective of temperature but even an unfavorable change may be accepted when

e−δE/kB T > random(0, 1)

(1.4)

where kB is the Boltzmann constant, T is the temperature and random number generation is uniform between 0 and 1. Temperature-dependent randomness helps in crossing some 19

energy barriers in the landscape and better sampling. Simulated annealing is generally run several times with dierent random seeds and the best result is retained. Molecular dynamics (MD) is inspired by classical mechanics and the belief that repeatedly solving Newton's equations of motion will take an atomic system towards its global minimum. Starting from random assignment of atomic velocities, energy function is repeatedly dierentiated to calculate forces, velocities and displacements. For reducing error, the timestep is very small (∼ 10−15 seconds). MD can also be regarded as a kinetic improvization over derivative-based methods. A combination of these two techniques (mdsa), consists of molecular dynamics at varying temperature similar to heating/cooling cycles of simulated annealing. High temperature imparts high kinetic energy to molecule and achieves better sampling than MD alone. A popular application of mdsa to structural determination is available within Axel Brunger's cns program (Brunger et al. (1998)).

1.6.3.3 Genetic algorithms GA is a technique inspired by evolution which makes a gene tter in its environment due to events like inheritance, recombination and mutations. A typical genetic algorithm starts with many dierent copies of the system, a system represented by a linear gene-like representation of its properties. A tness function evaluates the ttest individuals in a generation and recombines them to create the next generation of individuals. This process is repeated until a satisfactory solution is found. By denition, this strategy operates on a population of solutions as opposed to previously discussed methods that operate on a single copy of the system. Many programs in molecular modelling (e.g. DePristo et al. (2003), Gunn (1997), Yang and Liu (2006)) use some variation of this technique.

1.6.3.4 Tree search Most energy functions require that the entire molecule is sampled at every stage. But when the energy function can be suitably split into parts, the molecule can be sampled incrementally. The conformational possibilities of a molecule can be cast in terms of freedom of parts in a tree-like manner, with an energy function biasing the choice of branching. The optimal solution is a leaf of the tree, so heuristics are aimed at avoiding subtrees that do not contain the best leaf. Use of this technique in DePristo et al. (2003) and Canutescu et al. (2003) will be discussed in detail later.

20

1.6.4 Comparative and ab initio approaches to modelling Evolution acts on genotype through phenotype. Macromolecular structure is closer to selection pressure than genetic cause of the structure, resulting in greater structural conservation. Inspection of protein families reveals that most of the structures are closely superposable. Hence a rational approach to structure modelling is to establish the fold of a sequence and model it based on structures of members of that fold. Information derived from templates, such as mainchain states, sidechain rotamers, long-range interactions etc., can be expressed as probability distributions and the target modelled to be the most probable. The two essential steps of fold recognition and 3D modelling are encoded in popular softwares like Fugue (Shi et al. (2001)), Modeller (Sali and Blundell (1993)) etc. Contrary to the comparative approach, evolutionary information is not used in ab initio modelling because it may not always be available or identiable. Instead physics-based potentials are used assuming that native structure of the sequence is the global minimum of the landscape dened by them. Thus molecular dynamics (Brooks et al. (1983), Cornell et al. (1999)) is the only truly ab

initio approach, and as noted earlier, folding even small proteins correctly is not yet possible due to prohibitive computing and forceeld inaccuracies. In practice, most methods are neither at the comparative, nor at the ab initio extremes of the modelling methodology spectrum which is dened by the extent that a method depends on a knowledge-base. Intermediate methods employ sequence-structure preferences to a certain extent along with physical energy functions. Generally ne-grained structural features have better coverage, and their statistical preferences are more reliable. But for coarse-grained features, e.g. φ − ψ preferences for 5 residue fragments, the problem of data sparsity makes the derived statistics unreliable, and may limit the conformational space articially. But some studies also indicate that fragments from protein structures in the pdb are sucient to model new protein structures, except for long loops (Zhang and Skolnick (2005)). Thus strong conformational preferences exhibited by some sequences have been catalogued (Bystro and Baker (1998)) and used for structure prediction (Simons et al. (1999)) using a Monte Carlo approach in a successful mid-spectrum method called Rosetta.

1.7 Experimental structure determination Despite the extensive research in structure prediction, signicance of experimental structure determination is undiminished. This is due to the following main reasons: (a) prediction is not good enough to make determination redundant (b) even if it were, there is more to structure than producing a model below 1Å all-atom rmsd: computational methods have at best started addressing dynamics and many functional forms a molecule can adopt 21

(open/closed, active/inactive etc.) (c) prediction and determination activities share resources and techniques, and enhance each other. X-ray crystallography has been the most popular technique for determination of atomicresolution structures of macromolecules till now. Perhaps this reects the kinds of targets pursued so far. NMR is the second most popular technique with a rich variety of observables that may hold information about dynamics. Low-resolution techniques like SAXS and EM are gaining ground due to growing need to understand large multi-component assemblies. Joint use of restraints derived from these techniques is also becoming popular. Here I briey discuss what exactly is measured by these techniques.

1.7.1 X-ray Crystallography When a beam of light is elastically scattered by two particles in the same direction, the phase of the beam changes (Fig.1.7). This observation can be combined with spherical approximation of an atom's electronic distribution to estimate its scattering behaviour (also called atomic scattering factor, a real number due to atomic symmetry). The same observation can be used to calculate the scattering from a unit cell, by summing up individual scattering factors of atoms therein in the complex plane. Similarly, for a crystal consisting of a practically innite number of unit cells, scattering can be estimated. An interesting observation in this calculation is that diraction from a crystal must obey the Laue conditions, hence the systematic absences of scattering observations are valuable to determine the spacegroup. A scattering direction can be labelled by the direction indices (hkl) of its imaginary reecting plane. A scattered wave is a complex number obtained by integrating contributions from the entire unit cell :

Z iφhkl

Fhkl e

ρ(xyz)ei2π(hkl).(xyz)

=V xyz

(1.5)

As structure factors are Fourier transforms of electron density, the latter at a point (xyz) in the cell of volume V can be calculated by reversing the transform :

ρ(xyz) =

1 X Fhkl ei(φhkl −2π(hkl).(xyz)) V hkl

(1.6)

2 A crystallographic experiment measures diraction intensity Ihkl = Fhkl only and not φhkl ,

which is called the phase problem. Experimental (such as multiple isomorphous replacement (MIR), single/multiple wavelength anomalous diraction (SAD/MAD) etc.) and computational techniques (like molecular replacement (MR)) can be used to determine initial phases. Due to the phase challenge, the process of determining an atomic model is incrementally cor22

Figure 1.7: Schematic overview of the crystallographic diraction phenomenon. (A) Diraction changes phase. When a parallel beam of light of wavelength λ is diracted through angle 2θ by 2 planes d apart, one of the rays travels an additional distance of 2d sin θ compared to the other ray. This causes a phase change of d sin θ/λ . (B) The atom can be modelled as spherical electron distribution and net diraction calculated from it by integrating over the distribution. (C) Many atomic diractions constitute diraction from a unit cell.

d

(A)

θ Additional distance traveled

(C) (B)

rective. Crystallographic renement is discussed later in section 1.8. A major requirement of this technique is to obtain crystals of molecule of interest, which is still a very challenging, time-consuming and laborious task due to limited understanding of the crystallization process. Crystallization may also restrict the native dynamical nature of the molecule due to close packing.

1.7.2 NMR Nuclear magnetic resonance exploits the characteristic excitation/relaxation behaviour of magnetosensitive atomic nuclei. A population of such nuclei resonate with characteristic frequency in presence of strong magentic eld with small oscillations at that frequency. But 23

when the nuclei are in a covalent environment within a molecule, this frequency shifts in response to the shielding eect of the electronic environment. Thus chemical shifts hold information about the covalent environment of an atom. Apart from this, relaxation behaviours of dierent nuclear species in a molecule are magnetically correlated through the electronic environment independent of covalent bonding. This is called the nuclear Overhauser effect (NOE) and it gives information about non-covalent environments of atoms. J-couplings arise due to the inuence of electronic environment on spin-spin interaction. When used with Karplus relations, they give information about dihedral angles. Residual dipolar couplings arise when internuclear vectors between covalently bonded atoms are unisotropically distributed. Residual dipolar couplings (RDCs) hold information about orientation of internuclear vectors with respect to the magnetic eld. Chemical shifts, NOEs, J-couplings, RDCs etc. are used as restraints along with expected stereochemistry to calculate a series of models which explain the observables equally well. A strength of NMR is that it does not require crystals. But a large number of experiments are required to collect enough data to determine the structure. Requirement of concentrated solution may cause problematic articial oligomericity. For large structures, it becomes dicult to resolve the spectra and number of experiments required becomes impractical. Developing techniques to address bigger structures is an active area of NMR methodological research (Tzakos et al. (2006)).

1.7.3 Cryo EM Electron microscopy uses high-energy electrons to bombard a sample and then electromagnetic lenses to capture 2D images. It has been highly eective in studying ultrastructure of biological samples, e.g. internal organization of cell orgenelles. But radiation damage is a serious concern when it is applied to large macromolecules, to avoid which cryo-cooling must be used. Cryo-cooling creates a thin layer of water around the molecule without iceformation. The prepared sample can then be arranged into an ordered 2D crystal and images taken at various tilts of the crystal plane. These images are computationally assembled into 3D density. Most of the good resolution structures are generally obtained by 2D electron cryocrystallography, but 2D crystals are not absolutely necessary (Saibil (2000)). Even without the regular arrangement, images can be obtained and assembled computationally if particle size is large. Image assembly requires pattern recognition, noise reduction and advanced geometric 2D to 3D conversion techniques. It is computationally demanding and a major hurdle in improving resolution. Various softwares are dedicated to this task (Ludtke et al. (1999), van Heel et al. (1996), Frank et al. (1996)). The advantages of cryo EM are that it provides the larger picture of macromolecular assemblies which can be used to dock high-resolution structures of their components. Snapshots of large assemblies in various functional states 24

can be obtained by initiating the change just before vitrication. As computational challenges are addressed, cryo EM is expected to become more popular in the structural biology community.

1.7.4 SAXS Crystallographic order makes large diraction angles possible for crystals of macromolecules. But particles in solution are undergoing random thermal motion and yield only small-angle X-ray scattering (SAXS). Yet, due to the non-spherical nature of the particles and their partial order in solution, there is structural information in the intensity of scattering versus wavelength curve (Svergun and Koch (2003)). SAXS data can be interpreted to extract a unique shape envelope by making some simplifying assumptions like uniform density within the particle. Softwares like ATSAS (Konarev et al. (2006)) and Saxs3D (Walther et al. (2000)) are dedicated to interpretation of SAXS data. It is a promising complementary approach to atomic-resolution techniques because it does not require crystallization nor is limited by size, especially suited to the niche of molecular sizes too small for EM and too large/dicult for NMR and crystallography.

1.8 Crystallographic renement 1.8.1 Overview The goal of crystallographic renement is to obtain a model that accurately reproduces observed structure factor amplitudes. The main hurdle in this process is the phase problem, which is partially alleviated by phase estimation techniques (MIR, MAD, MR etc.). Still the initial phase errors can be large. Typically, the phase errors are slowly corrected in an iterative renement process consisting of manual and automatic components. Automatic components may consist of shape recognition, fragment searching, minimization, mdsa etc. as discussed in a later section. Intervention by an expert crystallographer is necessary due to imperfections in automatic procedures. The quality of the atomic model at a stage in renement reects mainly the knowledge of phases at that stage. Structure factors calculated using this model are denoted as Fc , φc and Fo denotes the observed structure factor amplitudes. Agreement between observed and calculated structure factor amplitudes is usually assessed by calculating the R-factor:

P R=

− Fchkl | hkl hkl |Fo |

|F hkl hkl P o

25

(1.7)

The R-factor is a simple global measure which does not point at the problematic regions in the model: such regions are identied by calculating their t to the electron density map. Electron density maps calculated by combining Fo , Fc and φc , e.g. the 2Fo − Fc map, reduce the model bias in the density by weighting Fo more. Fo − Fc map indicates regions of bad t by peaks in the density and good regions by almost no density. Model bias in various maps can be reduced by using procedures such the σA -weighing (Read (1986)) which assigns optimal weights to Fo and Fc for every (hkl) and leads to omit maps. In order to make maps comparable, the absolute electron density values are normalized by using the mean and standard deviation of the density map (σ(r) =

ρ(r)−µmap ). σmap

For automatic assessment of local problems in the structure, a real space R-factor is calculated:

Rreal

P |ρo − ρc | =P |ρo + ρc |

(1.8)

where the summation runs over the area of interest in the map. Local errors can also be indicated by violation of expected stereochemistry and known preferences like φ, ψ, χ states. In addition to these measures, patterns in the map give important clues to a crystallographer, e.g. bulky sidechains, tube-like density for α-helix etc. In a typical semi-automated renement, features in electron density become more recognizable as the process advances and phases improve. This is continued till the model and density are in good agreement and no more improvement is obvious. Generally an R-factor of 0.2 (20%) is considered satisfactory for a 2Å resolution diraction data. The crystallographic renement problem is generally underdetermined, as at least 4 parameters per atom (x, y, z, B ) are to be determined using the measured structure factor amplitudes. The number of reections increases quadratically in the higher resolution shells, hence a crystal diracting to low resolution is likely to have fewer reections than parameters. Moreover, reections are not independent of one another and phases can at best be partially determined. To oset the low observations-to-parameter ratio, well-known restraints of covalent geometry and excluded volume are used. Yet the danger of overtting remains in a quest to improve the R-factor. To guard against this, Brunger (1992) famously devised the Rf ree measure. Rf ree factor is calculated in the same way as R but over only a small (5 − 10%) set of all reections. This set is not used in the renement process. Overtting is indicated by an Rf ree signicantly higher than R - well-rened structures have only slightly higher Rf ree .

26

1.8.2 Automation Automation of any kind is always welcomed by the crystallographer as the large sizes of biomolecules and phase errors make the renement process slow, laborious and error-prone. Approximate positioning of the entire molecule in the asymmetric unit is possible if the structure is known to be similar to some other structure, say through homology. This technique, called molecular replacement, consists of rotational and translational search. Another important approach to knowledge-assisted search is to identify fragments from known structures that t a given region in electron density (see Adams and Grosse-Kunstleve (2000)). Finegrained knowledge-based sampling has recently emerged as a promising automation technique (DePristo et al. (2005), Furnham et al. (2006b)). On the other hand, there are methods which use no knowledge from a database, such as Perrakis et al. (1997) which iteratively adds and removes dummy atoms to introduce up to thrice the expected number of atoms, then chooses the ones which satisfy the covalent geometry. Direct methods operate on the premise that intensities already contain phase information and the phase triplet relationship can be used to estimate phases. They have been applied only to small proteins so far with high-resolution data. A popular implementation is Shake-and-Bake (Weeks and Miller (1999)) which uses phase probabilities and triplet relationship to derive and minimize a function of all phases. This minimization is nontrivial and performed in both reciprocal (Shake component) and real (Bake component) spaces. Too many uniformly distributed atoms in macromolecular unit cell is the major obstacle for this otherwise interesting and potentially very useful class of methods. Continuous optimization techniques like conjugate gradient (e.g. Murshudov et al. (1997)) and mdsa (Brunger et al. (1998)) continue to be important and popular due to their netuning of a model close to the optimal structure. In cns, the energy function is a combination of molecular mechanics energy terms and a weighted crystallographic target:

Etotal = EM M + w.Exray

(1.9)

Exray can be the least squares target or the maximum likelihood target. In least squares, a line of best t is rst found between Fo , Fc sets and the derivatives generated change the model to push Fc towards this line. This target is based on the simplistic assumption that experimental error in an Fo set has a Gaussian distribution. It is a special case of the more general maximum likelihood target (e.g. refmac, Murshudov et al. (1997)) which does not assume Gaussian error. The addition of a crystallographic target biases the molecular mechanics landscape towards the experimental data. The existence of molecular mechanics terms increases the radius of convergence of the crystallographic target. 27

1.8.3 Errors and Validation Due to its manually-driven iterative nature, crystallographic renement remains prone to errors. Some gross errors are tracing a chain in reverse direction, incorrect topology of secondary structure elements, compensating mistakes in sequence register etc. Finer errors include nonplanar peptide bonds, incorrect bond angles (e.g. N − Cα − C ), interchanged sidechain densities, D-amino acids, rare φ − ψ state, a questionable rotamer etc. which occur due to overtting. Crystallographic measures like Rf ree , Rreal can be used to assess model quality globally and locally. A model can be validated by comparing its structural characteristics against the distribution of those characteristics in a database of known high-quality structures. Ideally, validation should be independent of renement, i.e. the same distributions should not be used in validation as well as renement (Kleywegt (2000)). Apart from covalent geometry, stereochemistry and dihedral distribution (φ, ψ, χ) checks, all-atom contacts and packing can also be used for validation as non-covalent interactions are essential for stability, and are expected to occur. Procheck (Laskowski et al. (1993)), MolProbity (Davis et al. (2004)), Rampage (Lovell et al. (2003)) are popularly used for structure validation. Validation tools need frequent updating to incorporate knowledge obtained from new structures.

1.8.4 Variability The common perception that a single conformation with isotropic B -factors in pdb les represent the diraction data in the best possible manner is highly doubtful, so is the perception of a very rigid macromolecule in the crystalline state. Due to the high solvent content in macromolecular crystals, macromolecules have enough conformation freedom in that highly ordered state (Jensen (1997)), even comparable to the in vivo state, as demonstrated by reversible ligand binding (Petsko (1996)). Not surprisingly, single-conformer rened structures of the same sequence dier due to local practices in dierent labs, experience of crystallographers, slightly dierent starting conformations etc. (Ohlendorf (1994), Zoete et al. (2002)). Often the R, Rf ree statistics of macromolecular renement are signicantly above the experimental noise: investigation of this with simulated diraction data reveals the inability of single isotropic conformer model to account for the underlying variability which may be neither harmonic nor isotropic (Kuriyan et al. (1986), Vitkup et al. (2002)). This state of macromolecular crystallographic renement (diering interpretations, none quite satisfactory) is indeed a cause for concern. This may be attributed to physical causes which blur the data, like internal motion within the molecule, overall motion of the molecule and imperfect packing in the crystal, as well as theoretical fact of the renement problem being underde28

termined. It is not trivial to estimate the expected variability in the coordinates determined from crystallography, nor the individual contributions to it. Although the questions of magnitude of coordinate variability, its causes and correlations therein are largely unresolved, the general best-practices consensus is to look for and deposit ensembles instead of single conformer (Furnham et al. (2006a)). Yet single-conformer deposition remains the common practice. There are two reasons for this, practical and theoretical: (a) additional human eort is necessary for reliable identication of multiple conformers and (b) observations-toparameter ratio drops proportionally as ensemble size increases. Observations-to-parameters ratio is an indicator of the reliability of the rened model. The higher this ratio, the less is the chance of model features being artifacts. But if structural detail in the model is lowered, it may not serve the purpose of structure determination, nor would it be possible to achieve good R-factor - another indicator of reliability. Thus the tradeo for a crystallographer is between incorporating enough structural detail in the model and keeping observations-to-parameters ratio respectably high. Parameters are of three types: spatial (Cartesian coordinates), thermal (describing uncertainty around modelled coordinate) and occupancy (probability of nding an atom at its modelled coordinate). Spatial parameters are generally reduced easily by choosing not to model hydrogens and waters. Assuming a constant full occupancy for all heavy atoms is also a common practice. Thermal parameters are reduced by assuming an isotropic B -factor. These atom-specic simple approximations reduce the parameters signicantly. But further reduction can be obtained by considering correlations between atomic parameters. Spatial coordinates have high covalent correlation which can be used to reduce 3N Cartesian coordinates to 0.4N torsion angles for a typical polypeptide chain (as in torsion angle renement of Brunger et al. (1998)). Common isotropic B -factor can be assigned to a group of atoms, reducing thermal parameters. Anisotropic displacement modelling requires six parameters per atom for ellipsoid denition but TLS (translation, libration, screw) renement assumes that molecules or their parts (entire domains or aromatic sidechains or helices etc.) are internally rigid and the displacement of each rigid unit can be modelled en bloc with twenty parameters (T + L + S → 6 + 6 + 8) (Winn et al. (2001)), implying that TLS renement is economic in terms of parameters for groups larger than three atoms. Using non-crystallographic symmetry also reduces parameters (Borhani et al. (1997), Furnham et al. (2006b)), e.g. monomers in a homodimer can be modelled identically. Experimental approach of assessing macromolecular dynamics bypasses the observationsto-parameters issue altogether by observing conformational changes as frequently as 150ps in time-resolved crystallography (Moat (1998)), especially for light-sensitive proteins. But theoretical eorts also have tried to assess dynamics and heterogeneity of macromolecules in 29

Energy

Figure 1.8: Knowledge tunnelling. The renement trajectory on the left is derivative-driven and can get trapped in a local minimum. But when assisted with knowledge-tunnelling steps, it is more likely to lead to the global minimum.

Knowledge tunnel

Conformational Coordinate

crystals despite the observation-to-parameter ratio issue, relying on R, Rf ree relationship to indicate overtting if any. Kuriyan et al. (1991) used molecular dynamics simulations with crystallographic term, starting from deposited structure, to model single conformers with isotropic B -factors as well as two-copy ensembles, to nd that both reduced Rf ree . Similar approaches were adopted by Pellegrini et al. (1997) and Wilson and Brunger (2000) etc., but multiconformer renement is rarely performed. Recent eorts like DePristo et al. (2005) and Terwilliger et al. (2007) use a practical workaround by nding many single-conformer isotropic models and presenting them as an ensemble1 . Terwilliger et al. (2007) have suggested a lower bound on coordinate variability by analyzing such collection.

30

1.9 Thesis Overview The challenge of macromolecular structure determination boils down to searching and optimization in a complex landscape shaped by experimental observations, various potentials of covalent geometry, van der Waals and electrostatic potentials. As discussed earlier, derivativebased methods have a small radius of convergence. Addition of kinetic energy to the system in MD improves sampling, but in a random unbiased manner. Hence these methods have to be rescued from local minima often during the renement process. The rescue eort is usually mounted by an expert crystallographer. This makes crystallographic renement subjective and laborious, also discouraging multi-conformer interpretations. The crystallographer tweaks the structure by correcting obvious mismatches between model and density, and traverses a knowledge-tunnel through an energy-hill in the landscape (Fig.1.8). Knowledge-tunnelling is attempted by all methods that use statistical preferences derived from structural data to bias the model's sampling towards native-like features. Its eectiveness depends on completeness and reliability of the kind of knowledge used. Recent work in this lab has eectively used knowledge-tunnelling to address heterogeneity and knowledge-based automation in protein crystallography. The knowledge-based preferences used, the φ − ψ propensities and the sidechain rotameric libraries, were ne-grained and reliable. To complement the ne-grained sampling, positional restraints on sidechain centroid and Cα atom of each residue were used. Starting from approximate mainchain and repeated sidechain assignment, DePristo et al. (2004) showed that slightly diering structures are very comparable in terms of quality measures like Rf ree , Ramachandran outliers, covalent geometry etc. This highlighted the need for ensemble interpretation of crystallographic data (Furnham et al. (2006a)). DePristo et al. (2005) demonstrated that knowledge-based all-atom sampling can automatically rene an approximately known structure with a composite rapper/cns protocol. Furnham et al. (2006b) extended this approach, with minimal manual intervention, to low resolution data also to solve an important structure (Dore et al. (2006)). This thesis was inspired by the goal of extending knowledge-tunnelling to other challenges in structure determination. Macromolecules are of dierent kinds, so are the methods to sample their conformations and determine their structures. My hypothesis is that the ideas of energy landscape and knowledge-tunnelling apply equally well in all dierent scenarios involving macromolecules and experimental data. This thesis supports that hypothesis by demonstrating the eectiveness of knowledge-tunnelling for some new scenarios in macromolecular crystallography. 1 But

it must be noted that this is a distinct approach from multi-conformer renement and it is safer to term the its outcome as a collection to avoid confusion with ensemble (see Chapter 4).

31

I begin by describing in Chapter 2 a versatile software framework Rapper tk to perform knowledge-based sampling. rapper-like Cα -tracing is reproduced with Rapper tk and benchmarked to observe improved sampling. Chapter 3 describes the implementation of a sidechain optimization algorithm within this software. This algorithm is demonstrated in the context of protein crystallographic renement to rene approximate mainchains, identify sequence registry for low-resolution data and multiconformer renement. Chapter 4 demonstrates knowledge-tunnelling for all-atom renement. I show that loops with little positional information can be automatically modelled with Rapper tk and their structural heterogeneity can be explored. Chapter 5 demonstrates the use of Rapper tk for incremental conformational sampling of rna chains. I demonstrate automated crystallographic renement of parts of trnaAsp chain starting from approximate positional information. Chapter 6 concludes this thesis by summarizing the contributions and weaknesses of this work, with a discussion of future plans.

32

Chapter 2

Rapper tk: a versatile engine for discrete restraint-based conformational sampling of macromolecules Background Macromolecular structures are modelled by conformational optimization within ex-

perimental and knowledge-based restraints. Discrete restraint-based sampling generates highquality structures within these restraints and facilitates further renement in a continuous all-atom energy landscape. This approach has been used successfully for protein loop modelling, comparative modelling and electron density tting in X-ray crystallography.

Results Here I present a software toolkit (Rapper tk) which generalizes discrete restraint-based

sampling for use in structural biology. Modular design and multi-layered architecture enables Rapper tk to sample conformations of any macromolecule at many levels of detail and within a variety of experimental restraints. Performance against a Cα -tracing benchmark shows that the eciency has not suered despite the overhead required by this exibility. I demonstrate the toolkit's capabilities by building high-quality β -sheets and by introducing restraint-driven sampling. rna sampling is demonstrated by rebuilding a protein-rna interface. Ability to construct arbitrary ligands is used in sampling protein-ligand interfaces within electron density. Finally, secondary structure and shape information derived from EM are combined to generate multiple conformations of a protein consistent with the observed density.

Conclusion Through its modular design and ease of use, Rapper tk enables exploration of a wide

variety of interesting avenues in structural biology. This toolkit, with illustrative examples, is freely available to academic users from http://www-cryst.bioc.cam.ac.uk/~swanand/mysite/rtk/index.html.

33

2.1 Introduction1 Atomic structures of biological macromolecules give key insights into their biochemical function and are determined by conformational optimization within the landscape dened by experimentally derived and knowledge-based restraints. Molecular dynamics and minimization, implemented in popular softwares like charmm (Brooks et al. (1983)) and gromacs (Lindahl et al. (2001)), play a signicant role in this process. However, all-atom forceelds used by these methods give rise to complex and rugged energy landscapes, which often create substantial diculties in locating meaningful minima. This task can be facilitated if seed conformations are obtained within convergence radii of these minima. Recent studies have illustrated this for proteins in various contexts (de Bakker et al. (2006)). Analyses of high resolution structures have yielded discrete preferred conformational states for protein backbones and sidechains (Murray et al. (2003), Bystro and Baker (1998), Lovell et al. (2000), Dunbrack and Cohen (1997)). Protein loops modelled with weighted sampling of knowledge-based preferences, excluded volume restraints and ideal stereochemistry, when further optimized with an all-atom forceeld, have accurately predicted the native conformation (Jacobson et al. (2004), de Bakker et al. (2003)). The combination of knowledgebased local preferences (for fragments smaller than 10 residues) with non-local physical energy terms like hydrophobic burial and hydrogen bonding in a simulated annealing protocol has been eective in protein structure prediction (Bradley et al. (2005)), homology modelling ( Fan and Mark (2004)) and structure determination (Rohl and Baker (2002)). Interpretations of crystallographic data of both high (DePristo et al. (2005)) and low (Furnham et al. (2006b)) resolutions have been achieved by combining discrete and continuous approaches. The promise of this hybrid approach has not yet been fully exploited; for instance it has not been used to assess conformational ensembles to enhance structure determination with NMR and EM data, to explore exibility of ligands including macromolecules such as rna or to examine diversity at macromolecular interfaces. Our approach, encoded in rapper (DePristo et al. (2003), see Fig.2.1), has been applied successfully to a range of protein modelling problems where restraints have been introduced from knowledge of structures or experimental observations. But rapper is limited in applicability due to its inexibility in molecular representation (proteins only), sampling direction (N to C) and search algorithm (Genetic Algorithm with Branch and Bound : gabb). These limitations have to be removed if the idea of discrete restraints-based sampling is to be applied to new problems. I found that this was quite challenging within the rapper codebase (>30,000 lines of c++ code). In this paper I describe an alternative framework, Rapper tk, which (a) programmatically 1 The

benchmarking part of this work was done in collaboration with Anjum Karmali and published as Gore et al. (2007).

34

decouples the logically distinct concepts like search algorithms, knowledge-based conformational preferences, sampling and building techniques and (b) provides access to them with a scripting language. The former reduces development time by allowing modules to be treated in isolation - e.g. rna sampling and building can be implemented independent of gabb. The latter speeds up the process of adapting the software to new scenarios, say by coding high level tasks like parsing and le manipulations in the scripting language. I show that both impact scientic productivity by allowing faster application of discrete restraints-based sampling to new problems. Analogous to MD softwares which provide a platform to run MD/minimization schedules, Rapper tk provides a platform for discrete restraints-based sampling and reproduces rapper functionality for proteins as a special case. Following sections describe the design, implementation and benchmarking of Rapper tk. I demonstrate that

Rapper tk has a exible, robust and easy-to-use software library which generalizes and builds upon the major concepts from rapper methodology in a modular, multi-layered fashion.

2.2 Implementation Fig.2.2 shows a typical step in rapper-like incremental sampling. This involves three distinct steps : sampling of dihedral angles φ, ψ, ω , building coordinates for the next peptide using those of the previous and checking the Cα -positional restraint. This suggests the concepts of sampler, builder and restraint. rapper maintains a population of conformers and executes these steps repeatedly on them according to gabb. This can be abstracted as search strategy which is responsible for correct ordering and execution of samplers, builders and restraints. In the modular, layered design of Rapper tk (Fig.2.3), application scripts reside at a level higher than search strategies - they carry out the task of preprocessing, creating necessary builders, samplers and restraints for the problem at hand, and passing them to the appropriate strategy. I have chosen a c++/swig/python style of coding, whereby the interface of c++ code is exposed in python by generating suitable wrappers automatically with swig. Such architecture has become popular among academic softwares (e.g. Xplor-NIH, Schwieters et al. (2003)) as it provides robustness without losing the uidity needed in academic implementations. I now describe the major concepts in more detail.

2.2.1 Coordinates Dierent sets of coordinates need to be maintained in order to allow for sets of conformers, either for population-based searches or for using ensemble averaged restraints. Some coordinates are known and xed, e.g. secondary structure elements in a loop building exercise. 35

Figure 2.1: The central search algorithm in Rapper. gabb is a blend of genetic algorithm and branch-and-bound approaches. Red nodes represent restraints-violating conformation extensions and green nodes stand for the restraints-obeying ones. Some conformational extensions may be left unsampled (not shown). Subtrees emanating from green nodes only are explored further. Set of green nodes at each level is kept below a xed size (population size), and this allows conformational exploration in time proportional to protein length, leading to an ensemble of restraints-satisfying conformations. START

LEVEL−1 (N−terminal)

LEVEL−2

LEVEL−3

Intermediate levels

LEVEL−K (C−terminal)

STOP

36

Figure 2.2: Basic concepts in Rapper tk. A sampler samples discrete conformational preferences (e.g. φ, ψ, ω ). A builder uses the sample and calculates a set of unknown coordinates from a set of known coordinates (e.g. peptide building). A restraint checks that the calculated coordinates satisfy some geometric criterion (e.g. whether calculated Cα coordinate lies within a spherical region).

......COORDINATE SET.... .

. . .

ϕψ Sampler

Peptide Builder

ω Sampler

CA positional restraint sphere

37

. . .

Figure 2.3: Rapper tk architecture - modular and layered. Modules written in c++ are exposed to python using automatic wrappers generated with swig.

Application Scripts

Utility Scripts

Search Strategies

PYTHON

SWIG

AUTOMATICALLY GENERATED PYTHON WRAPPERS

Builders

Samplers

Restraints

Geometry

Misc.

Knowledge−based conformational distributions

CPP

DATA

Each point has an associated hard-sphere (van der Waals) radius adapted from those used in probe (Word et al. (1999)). A high-level application script generates the coordinates. Builders and restraints operate upon specied indices in given coordinates.

2.2.2 Samplers A sampler chooses a datum from an underlying distribution of conformational preferences by random weighted sampling. Well-known examples are weighted φ, ψ sampling for protein backbone (Lovell et al. (2003)), rna backbone (Murray et al. (2003)) and sidechain rotamer sampling (Lovell et al. (2000)), all derived from high quality crystallographic structures. New types of sampling can be easily incorporated by writing a new sampler for the corresponding builder, say tri-φ, ψ sampler for tripeptide fragments, substructure sampler etc.

2.2.3 Restraints Values of various geometric entities are useful in constraining the conformational space, e.g. internuclear distances derived from NMR NOEs, electron-density from X-ray analyses, Cα positional information from templates in comparative modelling, and so on. A restraint object holds the information of points on which it is to be tested, and the method of testing. 38

A restraint is generally binary; it is either satised or violated. A restraint can be also be optional, i.e. it can be discarded if sampling consistently fails due to that restraint.

2.2.4 Builders and followers A builder consists of the indices of coordinates it uses and those it calculates, along with the calculation technique. For instance, the φ, ψ -based peptide backbone builder uses coordinates of {C i−2 , N i−1 , Cα i−1 } to calculate coordinates of {C i−1 , N i , Cα i }; {C i−1 , N i , Cα i , Cβ i ,

C i , N i+1 } coordinates are used by a backbone-dependent sidechain builder; and so on. Thus builder is an abstraction of coordinate calculations operating on input and output indices within a coordinate set. A builder may have an empty input set or may have only known coordinates as inputs, in which case it is called a seed builder (e.g. peptide N terminal anchor builder). There is a maximum number of trials a builder can undertake to extend a conformation; this will depend upon the conformational space available to sample. In order to avoid futile sampling, the builder may implement a session in which only unique samples are used, thus improving sampling diversity. Follower is a concept specic to populationbased searches. A builder is another's follower if it is advisable to execute it in the same population-search step as the leading builder. This was an improvisation rst used during

Cα -trace scripting to build sidechain immediately after the relevant mainchain.

2.2.5 Sampling Strategy The sampling strategy orchestrates the builders and restraints systematically to generate conformations. The sampling strategy can be divided into ordering and execution of restraints and builders. Automatic ordering allows the application script to create builders and restraints in any convenient order. Because strategies are coded in python, it is easy to write a new strategy.

2.2.5.1 Ordering of builders and restraints A correct strategy must calculate the order of execution for builders and restraints. There is a partial ordering induced on builders due to their input and output coordinates, i.e. a builder may not be executed unless its input coordinates have been computed, except for seed builders. Thus there is a digraph of builders, with possibly many seed builders and others depending on one or more builders. Restraints can be checked only after all coordinates to be tested have been computed, hence there is restraint-builder dependence. An ecient strategy must test a restraint as early as possible in order to avoid sampling the disallowed conformational space. Once a builder succeeds or fails in its task, an ecient sampling 39

strategy must use the builder dependence digraph to identify the builder to be attempted next. The strategy currently implemented in Rapper tk determines the builder order by topologically sorting the builder digraph, more specically as follows:

• In case of multiple seed (parentless) builders, a dummy builder is assumed to be their parent. A procedure similar to dfs (depth rst search) is used to assign unique parents to all nodes, i.e. convert the digraph into a tree. A node appears as child of another node only if the latter is the only unvisited parent of the former.

• The size of subtree rooted at each node is found. • Using dfs again, an order is established for the nodes. When a node is popped o the dfs stack, its children are pushed onto the stack in the ascending order of subtree sizes.

• The order thus obtained is the nal ordering used by the default strategy. If a builder fails, its unique parent builder may be executed, and the results of the parent and all its children discarded. If a builder succeeds, the builder next in order may be called. • From this builder order, restraints are identied for each builder such that they have all the necessary points computed after the builder. Thus every builder has associated restraints to check after it is executed. As an illustration, consider conformational sampling of a three residue peptide (see Fig.2.4) under the Cα spherical positional restraints. Four kinds of builders are employed. NanchorBuilder uses the rst two Cα restraints to anchor the peptide. Backbone-dependent sidechain builder is used for sampling sidechains. Since this builder requires parts of the backbone from adjacent residues also, two dummy Gly residues are added, one each at the beginning and end of the tripeptide. PeptideBuilder is used to build peptides in forward and backward directions. NanchorBuilder is the seed builder as it has no input points. Reverse PeptideBuilder, PeptideBuilder-1 and SidechainBuilder-1 depend on it because their input coordinates are partly or completely contained in its output coordinates. Similarly, SidechainBuilder-3 depends upon PeptideBuilder-1 and PeptideBuilder-2. Restraints CARestraint-1 and 2 depend upon PeptideBuilder-1 and 2. From these dependences, a directed graph can be constructed with builders and restraints as nodes. Topological sort on this graph produces a linear order of the builders, which suggests the builder to be tried after a successful (restraints-satisfying) builder. The backward ordering (or fallback ordering ) determines the builder to be called after an unsuccessful (not satisfying restraints) builder.

40

Figure 2.4: Automatic ordering of builders and restraints involved in sampling a threeresidue peptide. Upper panel shows the builders and their output coordinates, Cα positional restraints and the coordinates they test. NB (N-anchor-builder), PB (φ, ψ -sampling based peptide builder), RPB (PB-like, but in reverse direction), SC (backbone dependent sidechain builder) are used. A builder points into the coordinates it computes and is pointed at by the output coordinates of the builders it depends on. Restraints contain the coordinates they test. Bottom panel shows the corresponding dependence digraph on builders and restraints. A builder node depends upon the builder node it is pointed from. A restraint node is connected to a builder after execution of which it is possible to execute the restraint. Topological sort on this digraph results in a linear forward order (black) and fallback order (red) on builders.

SC3

CAR2

NB PB1 PB2 SC2

SC1

CAR1

RPB

NB

NB

CAR1

CAR2

PB1

RPB

SC1 TOPOLOGICAL SORT

RPB

SC2

SC3

PB2

SC1

CAR1

PB1 SC2 PB2

CAR2 SC3

41

2.2.5.2 Execution of builders and restraints Once the ordering among builders and restraints is established, various search strategies can be used to sample conformational space. The simplest is an exhaustive search, where each restraints-satisfying option available to a builder is explored. rapper uses PopulationSearch algorithm (gabb) as mentioned earlier. gabb limits the number of restraints to be checked at every extension step and provides a pool of t parents to build upon. Each parent is allowed to contribute more than one child and parents compete to put their children in the children pool. In addition to PopulationStrategy, Rapper tk provides a minor variation which allows limited backtracking (using fallbacks described earlier). The number and size of backtrack steps can be specied. In cases where the parents are not extensible at a certain step, the population search is restarted some steps earlier, determined by number and size of backtrack step specied. This saves the cost of starting from rst step in case of failure at an advanced step.

2.2.6 Spatial grid for checking clashes Steric clashes are a very important restraint on conformational freedom. Hence the output of every builder is veried with a 3D grid that uses geometric caching to check the clashes eciently. A GridHelper is provided to the grid to modify clash-checking functionality according to the application requirements. For instance, in atomic models, rst and second covalent neighbours of an atom need not be clash-checked, the van der Waals radius of sidechain atoms needs to be reduced due to discrete sidechain rotameric states, etc.

2.2.7

pdb

reader, model renderer etc.

The i/o functionality is written in python. pdb reader is largely adapted from a previous work (Gore et al. (2005)). ModelRenderer is currently a pdb writer, but can be extended to write models in other formats too. ModelRenderer is invoked by the strategy when it succeeds in sampling a conformation within given restraints.

2.2.8 Application scripts Application scripts are high-level python scripts which generate problem-specic context by preprocessing given information and creating necessary Rapper tk components to be used by the search strategy. They can be invoked as execution modes from Rapper tk launcher script. Application scripts are assisted by various utility scripts like the one for creating a standard set of builders and restraints. 42

2.3 Results 2.3.1 Benchmarking As tracing a polypeptide chain is central to all the tasks performed by rapper, I compare

Rapper tk's performance at chain tracing with that of rapper for 9 large (> 300 residues) proteins from the DePristo et al. (2003) benchmark set (see Table 2.1). Ensembles of 50 models were generated and the average rmsd values for the ensembles calculated. If the conformational search could not generate a model within 24 hours of computational time, the search was considered unsuccessful. Mainchain-only models were generated using Cα restraint radii of 0.5, 1, 1.5 and 2Å. The Cα restraint threshold denes the radius of the sphere within which the Cα atom of the modelled residue is restrained to lie. The centre of this sphere is given by the native Cα position. All-atom models were generated under

Cα restraint of 1Å and 2Å restraint on the centroid of the sidechain atoms. The van der Waals radii were reduced by 25% to compensate for the fact that only specic sidechain rotamers were allowed. Sidechain centroid restraint places and orients the side chain atoms with respect to the mainchains and aects bulky side chains more than the smaller sidechain groups.

Rapper tk can trace either from N to C terminal (forward) or in the C to N (backward) directions, with and without sidechains, in guided or standard sampling modes. Standard sampling is rapper-like φ, ψ sampling which is unaware of the Cα restraint to be satised. Such sampling can be the bottleneck when restraints are tight or only a small portion of the restraint spheres are reachable geometrically. Hence I have also incorporated guided sampling in which the sampler is aware of the restraint and produces samples within that restraint. As shown in Fig.2.5, the location of Cα i is dened by C i−2 , N i−1 , Cα i−1 and r (distance between Cα i−1 ,Cα i ), α (angle N i−1 -Cα i−1 -Cα i ), θ (torsion angle C i−2 -N i−1 -Cα i−1 -Cα i ). Thus the restraint sphere is sampled spatially by the guided sampler to obtain r, α, θ samples. Corresponding φ, ψ, ω values are found using a precalculated mapping from r, α, θ to allowed

φ, ψ, ω . Since this mapping is one-to-many, a random sample is taken from available φ, ψ, ω values. Such sampling ensures that the restraint sphere is sampled eciently while still using

φ, ψ values from the allowed region of Ramachandran plot.

2.3.1.1 Mainchain modelling In addition to comparing the main chain modelling accuracy between rapper and Rapper tk in standard forward mode, models were built in the backward (C to N) mode in order to check whether the performance varies . Table 2.2 shows the model accuracy under a

43

Figure 2.5: Guided sampling. Location of Cα i can be described by specifying locations of C i−2 , N i−1 , Cα i−1 along with {r, α, θ} or {φ, ψ, ω}. This leads to one-to-many mapping between {r, α, θ} and {φ, ψ, ω}.

ω θ

r

φ α

ψ

spherical positional restraint of radius 1Å on Cα atoms. Similar values of mainchain and Cα rmsd obtained demonstrate that performance of Rapper tk is comparable to that of rapper

and consistent across the whole target set. The low standard deviation values within each ensemble show that all the three approaches produce tight clusters containing models that are all equally acceptable. Larger restraint radii result in looser restraints and give models that deviate further from the native structures. rmsd values in Table 2.3 demonstrate that both rapper and Rapper tk perform equally well under dierent Cα restraint thresholds. For the restraint radius of 0.5Å both rapper and Rapper tk failed to nd complete ensembles for proteins 4enl, 8abp and 8tln. For 8tln, the conformational search repeatedly failed at Leu133. Since the conformational search builds one residue at a time, slight errors introduced earlier can sometimes make it dicult to nd a suitable conformation for a residue causing repeated failures at the same position.This limitation can be circumvented by building the peptide chain in the reverse direction. Using backward building for 8tln, 5 models could be found having an average main chain rmsd of 0.41Å (0.01) and a Cα rmsd of 0.35Å (0.01). Models for proteins 8abp and 4enl were built using the guided sampling mode in Rapper tk.

2.3.1.2 All atom modelling As can be seen from Table 2.4, the model accuracies for rapper and Rapper tk are comparable and do not vary signicantly across the dierent proteins. The average all atom rmsd for rapper and Rapper tk for the entire protein set is 1.07Å and 1.08Å respectively. On using

44

Figure 2.6: Variation in sampling time as a function of restraint strictness. Computational cost scales inversely as a function of Cα restraint radius for rapper (squares), Rapper tk (diamonds) and Rapper tk using backward building (triangles). 5cpa is excluded. 800

Rappertk RAPPER Rappertk Backward

700

Running time [sec/model]

600 500 400 300 200 100 0 0.5

0.75

1 Restraint Radius [Angstrom]

1.5

2

the guided sampling mode within Rapper tk there is a 0.06Å to 0.08Å reduction in the main chain and a 0.07Å to 0.09Å reduction in the all atom rmsd values. The average all atom rmsd over the target set also reduces to 1.0Å. The average χ1 error is comparable for rapper

(42.3o ) , Rapper tk (42.0o ) and guided sampling (41.6o ). The average χ1,2 values for the three approaches are also similar, 42.4o for rapper, 42.4o for Rapper tk and 42.1o for Rapper tk using guided sampling.

2.3.1.3 Computational Cost and Quality Check Model quality was assessed using PROCHECK (Laskowski et al. (1993)). All structures have main chain bond lengths and angles within the limits of the standard deviation of their small molecule values and also have good sidechain stereochemistries. Computational cost scales with the size of the restraint sphere used to generate the models. As can be seen from Fig.2.6 the computational cost for Rapper tk is less than that for rapper for restraint radii 0.5, 1, and 1.5 1.5Å. The average time taken by Rapper tk under a Cα restraint of 2Å is only slightly 45

Figure 2.7: Computational cost for all-atom modelling across target set. The average time required to build a successful model is shown for rapper (diamonds), Rapper tk (squares) and Rapper tk with guided sampling (triangles). 500

RAPPER Rappertk Rappertk Guided 3323 sec

Running time [sec/model]

400

300

200

100

0 lcem

1nif

1php

3app

3pte PDB ID

46

4enl

5cpa

8abp

8tln

Table 2.1: Set of target proteins pdb Ida

1cem 1nif 1php 3app 3pte 4enl 5cpa 8abp 8tln:E

Protein Name Cellulase Nitrite reductase Phosphoglycerate kinase Penicillopepsin Transpeptidase Enolase Hydrolase (c-terminal peptidase) Arabinose binding protein Thermolysin

Resolutionb 1.65 1.6 1.65 1.8 1.6 1.9 1.54 1.49 1.60

Sizec 363 333 394 323 347 436 307 305 316

a pdb code (and chain identier, if necessary) b Resolution of the crystal structure in Å. c Number of amino acids in the protein chain. higher at 37 s/models compared to rapper which takes 32 s/model. There is a visible improvement in speed at the restraint radius of 0.5Å. This demonstrates that Rapper tk is more able to nd a solution within a very tight restraint network with fewer failed attempts. Fig.2.7 shows the computational cost for all-atom modelling of each protein in the target set. The time taken to build a successful all atom model by rapper is similar to that taken by

Rapper tk. For 5cpa, rapper repeatedly failed at TYR:198 which has an unusual ω angle of

154.5o . rapper takes an average of 3323 s to nd a solution whereas Rapper tk is able to build a model in an average time period of 147 s. On using guided sampling the computational cost signicantly decreases. The average time taken reduces to 69 s/model compared to the average time of 165 s/model taken by Rapper tk and 176 s/model by rapper. Also on using guided sampling, the cost is nearly the same across the set, irrespective of the stereochemistry of the individual structures.

47

Table 2.2: Comparison of mainchain rmsd under 1.0Å Cα restraint. rapper

Target 1cem 1nif 1php 3app 3pte 4enl 5cpa 8abp 8tln:E a b

a

Mainchain 0.69 (0.01) 0.72 (0.01) 0.70 (0.01) 0.71 (0.02) 0.71 (0.01) 0.72 (0.02) 0.73 (0.02) 0.71 (0.01) 0.71 (0.02)

b

Cα 0.66 (0.01) 0.67 (0.01) 0.66 (0.01) 0.67 (0.01) 0.67 (0.01) 0.68 (0.01) 0.68 (0.02) 0.67 (0.01) 0.67 (0.01)

Rapper tk Mainchaina 0.69 (0.01) 0.71 (0.01) 0.70 (0.01) 0.72 (0.02) 0.71 (0.01) 0.71 (0.02) 0.73 (0.02) 0.72 (0.02) 0.70 (0.02)

Forward Cα b 0.67 (0.01) 0.67 (0.01) 0.66 (0.01) 0.67 (0.01) 0.67 (0.01) 0.67 (0.02) 0.67(0.01) 0.67(0.02) 0.67(0.02)

Rapper tk Mainchaina 0.68 (0.01) 0.72 (0.02) 0.70 (0.01) 0.70 (0.01) 0.71 (0.01) 0.71 (0.01) 0.72 (0.02) 0.71 (0.01) 0.71 (0.01)

Backward Cα b 0.66 (0.01) 0.67 (0.02) 0.66 (0.01) 0.66 (0.01) 0.67 (0.01) 0.67 (0.01) 0.67 (0.02) 0.67 (0.01) 0.67 (0.01)

Ensemble average main chain rmsd. Ensemble average Cα rmsd.

Table 2.3: Comparison of model accuracy under dierent Cα restraint thresholds

rapper

c

Rapper tk Forwardc Rapper tk Backwardd

b

MC Cα a MCb Cα a MCb Cα a

0.5 0.42 (0.01) 0.35 (0.01) 0.43 (0.01) 0.35 (0.01) 0.42 (0.01) 0.35 (0.01)

Cα restraint threshold in Å 1 1.5 0.71 (0.01) 1.00 (0.02) 0.67(0.01) 0.98(0.02) 0.71 (0.01) 1.00 (0.02) 0.67 (0.01) 0.98 (0.02) 0.71 (0.01) 1.00 (0.02) 0.67 (0.01) 0.98 (0.02)

1.30 1.29 1.30 1.29 1.30 1.29

2 (0.03) (0.02) (0.03) (0.03) (0.03) (0.03)

Values in parentheses indicate standard deviations. a Ensemble average C rmsd[Å] over all successfully modelled proteins. α b Ensemble average main chain rmsd[Å] over all successfully modelled proteins. c No models could be generated for targets 4ENL, 8ABP, 8TLN under 0.5Å C restraint by α rapper and Rapper tk forward building. d No models could be generated for targets 3PTE, 4ENL, 5CPA, 8ABP under 0.5Å restraint by Rapper tk backward building

48

Table 2.4: Comparison of accuracy of all atom modelling pdb

ID 1cem 1nif 1php 3app 3pte 4enl 5cpa 8abp 8tln:E

MC a 0.69 0.71 0.7 0.71 0.71 0.71 0.73 0.71 0.71

rapper

AA b 1.06 1.10 1.07 1.08 1.07 1.06 1.13 1.10 1.07

χ1 c 38.6 40.4 39.4 46.1 44.0 42.2 44.7 41.6 43.5

χ1,2 d 36.5 42.7 41.1 53.1 39.7 39.4 48.3 41.9 39.1

MCa 0.69 0.71 0.70 0.71 0.71 0.71 0.72 0.72 0.70

Rapper tk AAb χ1 c 1.06 38.9 1.09 39.6 1.07 40.1 1.08 46.4 1.07 42.8 1.06 42.4 1.12 43.6 1.10 40.5 1.06 43.4

a

χ1,2 d 37.2 42.4 41.4 53.4 39.8 39.3 48.1 41.7 38.6

Rapper tk Guided MCa AAb χ1 c χ1,2 d 0.61 0.97 38.7 37.0 0.68 1.04 39.8 42.5 0.63 0.99 39.8 41.9 0.66 1.00 44.3 52.1 0.65 0.98 43.1 38.9 0.64 0.98 41.9 39.4 0.65 1.04 42.9 47.6 0.63 1.01 41.2 41.3 0.64 0.98 42.3 38.1

Average (over 50 models) of the main chain rmsd (Å) between a model and the crystal structure. b Average (over 50 models) of the all atom rmsd (Å) between a model and the crystal structure. c Average (over 50 models) of the mean side chain χ1 error (degrees) between a model and the crystal structure. d Average (over 50 models) of the mean side chain χ1,2 error (degrees) between a model and the crystal structure.

49

2.3.2 Illustrations I now describe the use of Rapper tk to carry out some new sampling tasks.

2.3.2.1 Protein-ligand interface sampling in electron density Protein ligand interactions are central to understanding the roles of ligands as well as the mechanisms of enzymes. The approximate location of a ligand is often known but small ligands often have poor electron density. This scenario is suitable for automatically tting various ligand conformations into the density with Rapper tk, thus creating an ensemble for further renement. From a recent paper on automatic modelling of ligands (Wlodek et al. (2006)), I chose a medium resolution (2.6Å) structure (1di9) of p38 kinase in complex with a quinazoline ligand. In order to describe the degrees of freedom in a ligand, a le format was devised. It describes the ligand's bootstrapping (init lines), rotatable bonds (rotbond lines) and internal distance restraints (mindist lines). Builders and restraints are created using the information given in this le. Covalent bond lengths and angles are not altered from the initial coordinates given as input. Depending on ligand proximity, small sections of protein chains are identied and sampled using a loop sampling and closure procedure. Loop closure samples the location of Cα i given the locations of Cα i−1 and Cα i+1 as shown in Fig.2.8. Sidechain centroid restraint and Cα restraint are lenient close to ligand. Electron density restraints are employed using the excellent Clipper libraries (Cowtan (2003)) for crystallographic computations. The deposited pdb structure is used to phase the structure factor amplitudes and to obtain an electron density map. EDrestraint is satised by builder outputs which lie in reasonable density (> 0.25σ ) and have good mean density (> 1σ ). EDrestraints are optional except for the ligand. EDrestraints operate on the output of each builder. This scheme of exible-protein exible-ligand yields an ensemble of protein-ligand interface conformations which are consistent with the expected degrees of freedom of ligand, electron density, hard-sphere clash restraints and covalent geometry of the protein (Fig.2.9). Further renement and ensemble interpretation will be addressed in future work. Apart from crystallographic application, such sampling can be used by small molecule docking programs also to generate trial conformers of the ligand and protein.

50

Figure 2.8: Loop closure procedure in Rapper tk. Cα i lies on a circle centered at the midpoint of line joining Cα i−1 and Cα i+1 . Radius of the circle is determined by length of the line. The circle is in a plane perpendicular to the line. Candidate Cα i positions are sampled on this circle and {r, α, θ} − {φ, ψ, ω} mapping (explained earlier in relation to guided sampling) is used to select a position.

2.3.2.2 Protein-rna interface sampling Although rna conformational preferences are harder to identify due to the much larger conformational space (7 backbone dihedral angles), recent analysis has revealed the rotameric nature of the rna backbone (Murray et al. (2003)). I use these preferences to extend the rna chain as shown in Fig.2.10. Bootstrapping copies the initial few atoms from the given

structure to the region specied by restraints on them. Incremental build of the rna chain is done by RNAsuiteBuilder, which depends on atoms {C5*, C4*, C3*} and builds atoms {O3*, P, O1P, O2P, O5*, C5*, C4*, C3*} along with sugar and base. In this illustration (Fig.2.11), I choose protein chain A and rna chain E from a recently solved protein-rna complex (helicase-core region of Vasa bound to a single stranded rna, Sengoku et al. (2006)). I identify sections of protein chain in close proximity to the rna. These sections are later sampled as loops with loop closure and restrained with Cα and sidechain centroid positional restraints. rna bootstrap builder regards {C5*, C4*, C3*, P, O1P, O2P, O5*} atoms of the rst nucleotide as a rigid body and translates/rotates it so that C5*, C4*, C3* atoms are within 2Å of native positions. During incremental building, the C3* atom is restrained to lie within 2Å of the native C3* atom. As before, the deposited pdb structure is used to phase the deposited structure factor amplitudes and builders are

restrained to build within a mean electron density of 1σ . Generation of multiple conformations of protein-rna interface with Rapper tk can be useful in deriving multiple interpretations permitted by the crystallographic data. Interface 51

Figure 2.9: p38 kinase in complex with a quinazoline ligand (pdb 1di9). Green conformations are generated by Rapper tk. Sticks show the ligand. Electron density not shown for clarity.

52

diversity thus assessed may lead to novel insights into function. This issue will be addressed in detail in a future study.

2.3.2.3 Sampling β sheets In low-resolution crystallographic or EM data, salient features of the structure (β -sheet or

α-helix) are more detectable than the terminal regions or loops, making it desirable to start building a model at such features. α-helices are easier to sample than β -sheets because hydrogen bond restraints in helices are sequential unlike those in sheets. Hence sequential sampling is inecient for the later strands in a sheet. As Rapper tk is not restricted to sequential sampling, a β -hairpin can be built as shown in Fig.2.12, by bootstrapping at the linker of the strands and extending in forward and reverse directions. The building order is zigzag and helps in maintaining strict hydrogen bond geometry (distance O-N within between 1.5Å,3.5Å angle C-O-N > 100o ). I observed that this builder order is more ecient in sampling the hairpin under positional and hydrogen bonding restraints, than the simple sequential order.

Rapper tk extends this scheme of sampling β -sheets to parallel sheets and arbitrarily many strands (see Fig.2.13). If residue positions (...i − 1, i, i + 1...) correspond to positions (...j − 1, j, j + 1...) in parallel β -sheets, hydrogen bonding distance restraints are applied on (N i ,Oj−1 ), (Oi ,N j+1 ), (N i+2 ,Oj+1 ), (Oi+2 ,N j+3 ) etc. In antiparallel sheets, where residue positions (...i − 1, i, i + 1...) correspond to positions (...j + 1, j, j − 1...), distance restraints are applied on (N i ,Oj ), (Oi ,N j ), (N i+2 ,Oj−2 ), (Oi+2 ,N j−2 ) etc. Sheets with multiple strands are tricky due to the variable number of hydrogen bonds between adjacent strands. Additionally, a strand may be linked to other strands in both parallel and antiparallel arrangements, e.g. in a 3-stranded sheet with corresponding residue positions (...i − 1, i, i + 1...), (...j − 1, j, j + 1...) and (...k +1, j, j −1...), residue j is involved in (N j ,Oi−1 ), (Oj ,N i+1 ) while residue j +1 forms hydrogen bonds (N j+1 ,Ok−1 ), (Oj+1 ,N k−1 ); this pattern repeats every alternate residue. This scheme is used in the next example.

2.3.2.4 All-atom model generation from approximate secondary structure information and particle shape Techniques like EM and SAXS are valued for their ability to estimate macromolecular shape and to help in global relative positioning of parts of the particle. Automatic identication of secondary structures and prediction of their topology is possible (Kong et al. (2004a), Kong

et al. (2004b)) by morphological analysis of EM data. Coupled with secondary structure prediction from sequence, this generates approximate positional restraints on Cα atoms in 53

Figure 2.10: Rapper tk procedure for RNA sampling. Bootstrap builder assigns the location of a few initial atoms of the rst nucleotide by rigid-body transformation from given structure satisfying the positional restraints specied on C5*, C4*, C3* atoms. Chain extension is carried out by RNAbuilder according to the backbone dihedrals sampled by RNAsampler (using the rotamericity described in Murray et al. (2003)). RNAbuilder builds the corresponding sugar and base also.

Bootstrapping P

O2P

O4

C5 C6

O5* O4*

O1P C5*

N1 C2 C1* O2

C4* C3* O3*

RNA Builder

C4

C2* O2*

RNA Sampler

54

N3

Figure 2.11: Protein/RNA interface sampling with Rapper tk. Multiple conformations possible for a protein-RNA interface (helicase-core region of Vasa, chains A, E in pdb 2db3) within electron density restraints are shown. Protein chain in the native structure is rendered as green cartoon whereas the native RNA chain is rendered as orange cartoon and blue/green sticks. Five models sampled for the interface are shown as blue ribbons (protein), orange ribbons and blue/green lines (RNA). Electron density not shown for clarity.

55

Figure 2.12: β -hairpin building. Blue is NanchorBuilder's output, red is that of forward PeptideBuilder, brown is that of backward PeptideBuilder and magenta that of PeptideBridgeBuilder. Dotted lines show the distance restraints used for hydrogen bonding in addition to 0.5Å Cα positional restraints.

ASP−47

ILE−46 GLY−48

VAL−45

GLU−49

VAL−44

THR−50 GLN−43 CYS−51 LEU−52

56

Figure 2.13: β sheet building by identifying the ladder and sampling along the steps. Multiple strands and both (anti/parallel) arrangements can be sampled within hydrogen bonding (angle C-O-N and distance O-N) restraints.

57

secondary structures. I demonstrate here that Rapper tk can combine the shape and secondary structure positional restraints to generate atomic models. In order to simulate this scenario, I generated an articially blurred electron density map at 10Å resolution for pdb 121p using EMAN (Ludtke et al. (1999)). 3Å Cα positional restraints are placed on residues in secondary structures. All atoms are restrained to lie within the shape restraint dened by the 1σ contour of the map. There are no positional restraints on sidechains and loops. Hydrogen bonding restraints are used for β sheets as described earlier, and also on α helices. Ten models thus generated are shown in Fig.2.14. Model variations are large in loops but not in secondary structures due to secondary-structure-specic sampling style used by Rapper tk.

2.4 Discussion and Conclusions Rapper tk's design makes it possible to apply discrete restraint-based modelling to a variety of problems robustly and easily because

• Introducing new builders, restraints, samplers and search strategies is easy. • Any level of granularity can be chosen to represent the structure. • Automatic ordering of builders and restraints spares the user from the tedious task, but a preferred order may be imposed if needed. • Any number of coordinates may be known before modelling. They can be used as restraints or to make seed builders or just as steric obstructions. • Ensemble building and average restraints can be introduced easily by adding restraints which check the average value of some property of the conformational pool. The modularity and exibility of Rapper tk makes it an attractive platform for carrying out discrete restraint-based modelling tasks under a variety of restraints, as I have demonstrated here. Rapper tk can also be useful to generate decoy sets useful in developing energy functions for discriminating between non-native and native conformations. Our immediate goals with this toolkit include exploring protein-ligand and protein-rna interface conformations, aiding automation of X-ray renement and developing a protocol for interpreting NMR restraints. To address these tasks more eectively, some more features will likely be needed. For instance, non-binary restraints are not at present implemented. To introduce such analog restraints, the population search strategy will be modied to allow scoring of conformational extensions as well as members of an ensemble of conformations. 58

Figure 2.14: Combining shape and secondary structure skeleton to generate atomic models. The cartoon shows the pdb structure 121p. Shape restraint comes from 1σ contour of a map (not shown) articially generated with EMAN. Positional restraints used are 3Å Cα restraints on secondary structure residues, shown as spheres. Ten Rapper tk models sampled within these restraints are shown as ribbons.

59

I also intend to implement coarse samplers to address sparse restraint scenarios, e.g. by analyzing geometric preferences between adjacent secondary structure elements, a coarsegrained secondary structure incremental sampling can be achieved. Another concern is that although builder order in Rapper tk is exible, still it is a linear order, hence concerted conformational change is not possible. We are working on implementing a strategy inspired by the scwrl algorithm (Canutescu et al. (2003)), which will operate at the level of side-chains as well as fragments and optimize the conformational possibilities independent of builder order. Another strategy under consideration involves simulated annealing and incorporation of conformation-modiers which tweak the structure in a particular way, e.g. local backbone moves, rigid-body fragment movements, sidechain ips and so on. Tweakers will form the move-set for simulated annealing which will be used to obtain a coarse structural framework that will be further explored to get atomic models. In conclusion, I believe that Rapper tk will prove to be a useful platform for conformational sampling and searching for a wide range of applications.

60

Chapter 3 Optimal sidechain packing in proteins and crystallographic renement Background Amino acid sidechains often adopt one of a few distinct, physicochemically favorable

conformational states called rotamers. Rotameric preferences and compact packing are sucient to estimate the conformations of most sidechains, as demonstrated by approaches such as scwrl in homology modelling. But such algorithms have not been applied to protein crystallographic renement yet.

Results A scwrl-like combinatorial optimization algorithm was adapted for assigning sidechain

rotameric states that maximize the electron density map occupation while minimizing steric clashes. Self energy of a rotamer was dened as its electron density map occupation score. Pairwise energy between rotamers of dierent sidechains was dened to be zero or innite depending on van der Waals clashes. The program (opsax) was tested on 5 proteins by introducing error in mainchains and comparing the subsequent cns-only and cns/opsax renements. The latter renement was extended to multiconformer models too. Sequence assignment exercise examined, at various articially lowered resolutions, whether cns/opsax renement can discriminate between correct and incorrect assignments.

Conclusion The composite cns/opsax renement yielded better Rf ree values than cns-only

renement. A further drop in Rf ree was observed with multiconformer renement. On complete mainchains, correct sequence could be discriminated eciently in most cases even for a low observation-to-parameter ratio of 4, but this was not possible for a fragmented mainchain. Thus the value of opsax approach was clearly evident in automating some tasks of protein X-ray renement.

61

3.1 Introduction Amino acid sidechains are vital to folding, function and dynamics of proteins because they are the main ingredients of physicochemical diversity therein. Two salient features of sidechains are their rotamericity and compact packing in globular protein domains. Rotamericity is the tendency to be in one of a few distinct, locally optimal conformations and was hypothesized from small molecule analysis (Sasisekharan and Ponnuswamy (1971)). With increasing size of the pdb, this hypothesis was statistically conrmed by many studies, leading to compilation of rotamer libraries for use in homology modelling (Dunbrack and Cohen (1997), Lovell et al. (2000), Maeyer et al. (1997), Tuery et al. (1997)). Rotamer libraries describe a set of discrete states for sidechains and their propensities, dependent or regardless of the backbone conformation. Packing of sidechains in globular protein domains is complementary, compact and like a jigsaw puzzle (Levitt et al. (1997), Richards (1973)). Taken together, packing and rotamericity can determine the conformations adopted by sidechains. This has been demonstrated by sidechain placement programs (Canutescu et al. (2003), Maeyer et al. (1997), Smith et al. (2007), Xu (2005)) which nd the optimal balance among rotamer propensity score, van der Waals energy and sometimes evolutionary information. These methods are capable of nding the global minimum of a composite function consisting of self-energy of rotameric states (rotameric propensity, conservation scores from homologous templates) and pairwise energies (van der Waals energy between close rotamers). They rely on clever techniques like dead-end elimination, branch-and-bound and graph decomposition to reduce the search space and avoid futile energy calculations. Sidechain optimization programs are valuable in homology modelling because mainchain can be deduced from templates more accurately than sidechains, in addition to the fact that sidechains are densely packed. For the same reasons, they could be useful in protein crystallography too, including the following scenarios. First, renement of a model with approximately known mainchain can be automated with an optimal sidechain assignment procedure. Second, sidechain heterogeneity can be assessed by running the procedure several times and performing a multi-conformer renement. Thirdly, hypotheses about sequence registry of a full or fragmented mainchain can be quickly tested. Perhaps the sidechain placement procedure will be more eective when combined with mainchain sampling, but that is not the scope of the present chapter. Here I assess the value of optimal sidechain assignment for some crystallographic scenarios. I begin by describing the sidechain placement procedure (termed opsax for Optimal Sidechain Assignment for Crystallography) and its implementation within the Rapper tk framework (Chapter 2, Gore et al. (2007)). Then I describe and compare the composite (cns/opsax) renement protocol against the cns-only

62

protocol and show that the former generally outperforms the latter when renement statistics are compared. Multi-conformer renement with 2 and 4 copies is performed to observe a further improvement in renement. Then I ask whether the correct sequence registry can be eciently discriminated with simulated low resolution datasets for complete and fragmented mainchains.

3.2 Sidechain reassignment procedure This procedure is inspired by scwrl (Canutescu et al. (2003)), a popular program for optimal sidechain reassignment. For clarity and completeness, I describe the implementation which is a new module written in the conformation sampling software Rapper tk.

• Rotamers from the Penultimate Rotamer library (Lovell et al. (2000)) are used in this work. For each sidechain to be assigned, steric clashes of all of its rotamers with rest of the system are checked. All clashing rotamers are removed. • Self-energy of a rotamer is the average value of electron density in the 2Fo − Fc map over grid points that lie within 1Å of atoms in the rotamer. Pairwise energy of rotamers of two dierent sidechains is dened as ∞ or 0 depending on whether they sterically clash or not. van der Waal radii of rotamer atoms are reduced by 50% while calculating the steric clashes.

• Rotamers are further removed using dead-end elimination (Goldstein DEE, Goldstein (1994)). For a sidechain, a set of neighbouring sidechains which may sterically clash with it are identied. Then two rotamers a, b of this sidechain are compared by checking whether:

(a, i) − (b, i) +

N brs X

minc {(a, i, c, j) − (b, i, c, j)} > 0

(3.1)

j6=i

where (a, i) is the self-energy of the rotamer a of residue i and (a, i, c, j) is the pairwise energy of rotamer a of residue i and rotamer c of residue j . If this relation holds, then rotamer a is said to be dominated by rotamer b and is removed from further calculations.

• Sidechain pairs with possibility of steric clash are connected to form a sidechain graph (Fig.3.1). A connected graph is biconnected when removal of any one vertex does not disconnect the graph. Articulation vertices are those that belong to more than 63

one biconnected subgraph in a graph. Such biconnected subgraphs and articulation vertices are identied by using the Boost Graph Library (Siek et al. (2002)). The biconnected subgraphs and articulation points can be viewed as vertices and edges in a complementary graph, which is then topologically sorted to order the biconnected subgraphs. These subgraphs are then solved in that order.

• A biconnected subgraph has an articulation vertex assigned to it. Solving a subgraph means nding the best rotamer assignment for the subgraph for each rotamer of the articulation vertex. A biconnected subgraph is solved with a branch-and-bound technique. In principle, the number of possible rotamer assignments in a subgraph grow exponentially as the number of vertices. This can be viewed as a tree (Fig.3.2), but most of the leaves of the tree are not feasible due to steric clashes. In order to avoid visiting all possible rotamer assignments for the components, at each stage of the search, the best possible energy from unassigned nodes in the component is calculated in the context of currently assigned nodes. This is the bounding energy :

Ebound =

X

mina (i, a) +

i>j

XX

mina,b (i, a, k, b)

(3.2)

i>j k6j

If Ebound + Ej > Ebest , then the current subtree of search is abandoned as it is not possible to obtain a solution better than already known from that subtree. When a subgraph is solved, the internal energy of each rotamer of its articulation vertex is replaced by the minimum energy of the subgraph with articulation vertex in that rotameric state and the corresponding subgraph rotamer assignment remembered.

• After all subgraphs have been solved like this, the biconnected subgraphs are visited in the reverse order. For each articulation vertex assigned to the subgraph, the least energy rotamer is now known. This is used to assign rotamers to all vertices - this is the global minimum assignment for this energy function. • Unconnected sidechains are assigned the least energy rotamer.

3.3 Single- and multi-conformer renement 3.3.1 Renement protocols The goal of this exercise is to compare the the cns-only and cns/opsax renement procedures to assess the value of the sidechain placement procedure (Fig.3.3). The deposited 64

Figure 3.1: Identication and ordering of biconnected components in a sidechains graph. The graph (left) is prepared by checking which sidechains have clashing rotamers. This graph is decomposed into biconnected subgraphs and the articulation vertices are identied. The biconnected subgraphs can be represented as another graph (right): two subgraphs are connected if they share an articulation vertex. The biconnected subgraphs are ordered using topological sort and solved in that order. The nal rotamer assignment is found by traversing the subgraphs in the reverse order. sc.8

5

sc.9

sc.7

4 sc.6

2

sc.5

3

sc.3

sc.4

1

sc.2

sc.1

65

Figure 3.2: Branch-and-bound optimization implemented in OPSAX. For each rotameric state of the articulation vertex of a biconnected subgraph with N+1 nodes, there are N assignments to be decided in order to solve the component. This can be viewed as N-level tree, each level corresponding to an assignment. Search in this tree can be pruned to make it more ecient. The best assignment in the tree seen so far is remembered (Ebest ). At the current stage of search, the best possible energy from the subtree below is computed and added to present energy. If this sum is worse than Ebest then the subtree is not searched further. Level−1

Level−2

E_best

E_j

Level−j

Level−j+1 E_bound

E_bound E_bound

Level−N

66

Table 3.1: 5-protein set used in this work. #AA stands for number of amino acid residues. Number and percentage of loop residues are also listed. PDB id 1aac 1g35 8cho 9ilb 1kx8

Name amicyanin HIV protease δ5 − 3-ketosteroid isomerase interleukin-β chemosensory protein A6

#AA 105 198 174 153 109

#loop AA(%) 64 (61) 95 (48) 40 (23) 82 (54) 20 (18)

Resolution 1.31Å 1.80Å 1.47Å 2.28Å 2.80Å

#reections 21199 19059 7439 9535 5096

structure is perturbed by carrying out an all-atom Cα -trace on the structure with given Cα restraint but without any sidechain restraints. B-factors of all protein atoms are reset to 30. In cns-only renement, the trace is rened with 10 rounds of cns, each round consisting of 3 cycles of mdsa starting from 5000K. At the start of cns/opsax composite renement, Cβ and sidechain atoms are removed and deposited structure factors are phased with the mainchain. Sidechains are reassigned within this map after calculating the Cβ atom positions. This is followed by similar cns renement as cns-only protocol. In subsequent cns/opsax rounds, all sidechains are rst scored by the correlation coecient between Fc and 2Fo − Fc maps on grid points around sidechain atoms. Whenever a rotamer is reassigned, its B -factors are set to 30, else they are copied from the previous round. Only the sidechains with < 0.9 correlation are reassigned. In both protocols, structures with the best Rf ree are considered for further analysis. The cns/opsax protocol is extended to multiconformer models with 2 and 4 copies. In the 2 copies case, 2 Cα -traces are generated and stripped of their sidechains and Cβ atoms. They are assigned partial occupancy of 0.5 and deposited structure factors are phased with this model. Cβ and sidechains are built for both traces with this map. All-atom models with partial occupancy are rened with cns. In subsequent rounds, in a way similar to single-conformer models, each copy is assessed to detect ill-tting sidechains and rebuilt independently of each other, but subjected to cns renement together as a multiconformer model. This is depicted in Fig.3.4. Similar protocol is replicated for 4-copies case also. Single and multi-conformer renements are carried out over a range of Cα -trace radii. Renement with each radius is repeated 5 times. The dataset used in this chapter is shown in Table 3.1.

67

Figure 3.3: Single conformer renement owchart. The dierence between CNS-only (left) and CNS/OPSAX (right) protocols is that the ill-tting sidechains are identied and reassigned in the latter.

68

Figure 3.4: Multiconformer renement. 2Fo − Fc maps are calculated using the composite model and equal partial occupancies. Composite model at ith stage is split into single conformers. Ill-tting sidechains in single conformers are identied and reassigned using the composite map. Rebuilt conformers are combined into a multiconformer model and rened with CNS in 3 cycles of MDSA starting at 5000K.

69

3.3.2 Single-conformer renement Fig.3.5 shows the mean Rf ree plotted against the trace radius for all proteins. For single conformer case, I observe that Rf ree values are consistently better in cns/opsax renement than cns-only renement over all trace radii, with the improvements more noticeable in 1AAC,

1G35 and 8CHO. Fig.3.6 shows the comparison of rotameric states between the deposited structure and single conformer models. χ1 and χ1,2 values of the structure and a model are compared by dening χ1 to be similar for two corresponding residues if they are within 40o of one another, and by dening χ1,2 to be similar if both χ1 and χ2 for two corresponding residues are within 40o of one another. Clearly, the rotamer accuracies depend on the quality of mainchain. For 1AAC, 1KX8 and 1G35, cns/opsax renement shows better rotamer statistics than cns-only renement, but this trend is reversed in lower resolution structures 9ILB and 8CHO,

suggesting that there is a greater sidechain variability aorded by lower resolution datasets. Fig.3.7 shows the reduction in rotamer accuracies as resolution drops. The HIV protease structure (1G35) does not follow this trend, but it can be explained by existence of a large ligand (inhibitor AHA024). In the cns/opsax renement protocol, ligand is copied from the deposited structure in the beginning and then copied from last to present iteration. This indirectly informs cns about incorrectly placed rotamers. Except this outlier, lowering of mean rotamer accuracy and increase in its standard deviation is a clear tendency and indicates that lower resolution data accommodates greater variability. Improvement in cns renement due to opsax procedure is primarily due to the fact that cns alone rarely relocates bulky sidechains into better density. Due to the gradient-driven

nature of cns, relocation into better density is likely only if there is a continuous favorable gradient from the present to better, but this is rarely the case due to intermediate regions of low density. opsax relocates such sidechains whenever permitted by rotamericity and excluded volume restraints. An ill-tting sidechain will be moved out of a patch of density if there is another sidechain which ts better. This process enriches the model with more wellplaced rotamers and leads to better renement. But the best Rf ree statistics do not indicate native-like renement, but a dependence on quality of mainchain provided at the start of renement. This is due to a major limitation of opsax that it does not modify incorrectly placed parts of mainchain. Due to this, it is not often that renement statistics become comparable to that of the native. But in the rare instances when they do, the corresponding models can point towards the structural heterogeneity. For instance, in the 1KX8 case (see Fig.3.8), I observed that 3 of the 15 models generated had a comparable Rf ree (0.287, 0.287, 0.284) to the native (0.283), yet the dierences between the deposited structure and models were dierent for each model. This example suggests that opsax can be a quick way to 70

Figure 3.5: Improvement in Rf ree with sidechain placement protocol as compared with CNSonly protocol. Each CNS run consists of 2 cycles of MDSA starting at 5000K with isotropic B factor renement and maximum likelihood target. Order of plots is 9ILB, 1AAC, 1KX8, 1G35, 8CHO. 0.4

0.4

CNS only CNS+RTK 2-conformer 4-conformer

0.35

Rfree

Rfree

0.35

CNS only CNS+RTK 2-conformer 4-conformer

0.3

0.25

0.3

0.25

0.2

0.2 0

0.5

1

1.5

2

0

0.5

CA error

0.4

0.4

CNS only CNS+RTK 2-conformer 4-conformer

1.5

2

1.5

2

CNS only CNS+RTK 2-conformer 4-conformer

0.35

Rfree

0.35

0.3

0.25

0.3

0.25

0.2

0.2 0

0.5

1

1.5

2

0

0.5

CA error

1 CA error

0.4

CNS only CNS+RTK 2-conformer 4-conformer

0.35

Rfree

Rfree

1 CA error

0.3

0.25

0.2 0

0.5

1 CA error

71

1.5

2

Figure 3.6: Sidechain accuracy statistics for models obtained by CNS-only and CNS/OPSAX protocols. χ1 accuracy of a model with respect to the deposited structure is dened as the percentage of model sidechains that have a χ1 within 40o of the corresponding deposited sidechain. χ1,2 accuracy is dened similarly. The accuracy plots are ordered as 9ILB, 1AAC, 1KX8, 1G35, 8CHO. 100

% residues with similar rotamer

% residues with similar rotamer

100

80

60

40

20

0

CNS only Chi-1 CNS+RTK Chi-1 CNS only Chi-1,2 CNS+RTK Chi-1,2 0

0.5

80

60

40

20

0 1

1.5

2

CNS only Chi-1 CNS+RTK Chi-1 CNS only Chi-1,2 CNS+RTK Chi-1,2 0

0.5

CA error

% residues with similar rotamer

80

60

40

0

1.5

2

1.5

2

100

CNS only Chi-1 CNS+RTK Chi-1 CNS only Chi-1,2 CNS+RTK Chi-1,2 0

0.5

80

60

40

20

0 1

1.5

2

CNS only Chi-1 CNS+RTK Chi-1 CNS only Chi-1,2 CNS+RTK Chi-1,2 0

0.5

CA error

1 CA error

100

% residues with similar rotamer

% residues with similar rotamer

100

20

1 CA error

80

60

40

20

0

CNS only Chi-1 CNS+RTK Chi-1 CNS only Chi-1,2 CNS+RTK Chi-1,2 0

0.5

1 CA error

72

1.5

2

Figure 3.7: Accuracies of χ1 , χ1,2 as a function of resolution. The accuracy generally drops with resolution and sizes of errorbars increase, indicating greater variability accommodated by low resolution data. Note that the increasing order of resolution is 1AAC, 8CHO, 1G35, 9ILB, 1KX8. The bump for 1G35 is due to a large ligand.

% residues with similar rotamer

100

90

80

70

60 Chi-1, without CA error Chi-1, 0.5 A CA error Chi-1, 1.0 A CA error Chi-1, 2.0 A CA error

50

40 1

1.5

2

2.5

3

Resolution (Angstrom)

% residues with similar rotamer

100

Chi-1,2, without CA error Chi-1,2, 0.5 A CA error Chi-1,2, 1.0 A CA error Chi-1,2, 2.0 A CA error

90

80

70

60

50

40 1

1.5

2 Resolution (Angstrom)

73

2.5

3

identify the regions of structural heterogeneity.

3.3.3 Multiconformer renement Previous work (e.g. Wilson and Brunger (2000)) has identied two ways to describe structural heterogeneity better than a single-conformer isotropic B-factor model: single-conformer anisotropic B-factor model or multiconformer isotropic B-factor model with partial occupancies on each conformer. Previous section suggested that a collection of single-conformer isotropic B-factor models generated with cns/opsax protocol can give clues about structural heterogeneity. This encouraged me to extend the protocol to multiconformer renement for deriving an ensemble. Multiconformer renement is generally expected to yield better renement statistics than its single-conformer counterpart because the renement process has more parameters to t to the experimental data. Fig.3.5 shows that this is true in all 5 cases for the 2-copy renement. But contrary to expectations, 4-copy renement is not consistently better than 2-copy renement, but this can be explained as follows: (a) all conformer have dierent and erroneous mainchains and (b) due to the nature of derivative calculation procedure, errors in a conformation do not get corrected as vigorously as in single-conformer case and (c) overtting may occur due to too many parameters and (d) the relationship between multiple conformers and their combined density is not as clear as a single conformer and its density. Inspite of these diculties, sometimes the renement yields Rf ree values slightly lower than the native model and the resulting ensemble may be considered as good an interpretation of the diraction data as the deposited model. I nd four such cases, 2 for 1KX8 and 1AAC each - the two cases for both proteins consist of a 2- and a 4-conformer renement (Fig.3.9). Both ensembles show variability in the same regions, lending credibility to the interpretations. The 4-conformer ensemble shows more variation than the 2-conformer ensemble. An interesting case of concerted conformational change is observed in the 4-conformer 1KX8 case (Fig.3.10). Residues Tyr-98, Trp-81, Leu-73 and His-72 are systematically dierent in one of the copies while the other copies are very similar to each other. It is unclear whether this is an artefact of the cns/opsax protocol or a genuine concerted movement. In any case, the observed ensemble renements suggest that the protocol developed here may be of value to derive multiconformer interpretations.

74

Figure 3.8: Structural heterogeneity identied for 1KX8. Upper panel shows the residuewise all-atom rmsd in the 3 models with Rf ree comparable to the deposited. Lower panel shows the deposited model as green sticks and the 3 models as lines. Heterogeneity occurs at dierent residues in dierent models.

All-atom RMSD

3 2.5 2 1.5 1 0.5 0 10

20

30

40

50

60 70 Residue Number

75

80

90

100

110

Figure 3.9: Multiconformer interpretations resulting from CNS/OPSAX protocol. 2conformer (left) and 4-conformer (right) ensemble interpretations of 1AAC (top) and 1KX8 (bottom), having a lower Rf ree than the deposited structure are shown.

76

Figure 3.10: Conformational variability in 1KX8 multiconformer rened model with 4 copies. Rf ree values are comparable for the multiconformer and deposited. Tyr-98, Trp-81, Leu-73 and His-72 are signicantly dierent in one of the copies (sticks) than other 3 (lines) and the deposited model (red lines). This can be interpreted as concerted motion occurring rarely.

HIS−72

LEU−73 TRP−81

TYR−98

77

3.4 Sequence assignment Typical protein crystallographic renement begins with approximate identication of complete or fragmented mainchain, followed by sequence identication and detailed all-atom iterative renement. At medium and low resolution, sequence assignment on approximate mainchain becomes dicult because electron density contours may not contain much shape information and sidechain densities may be absent or intermerged. Pattern matching techniques, alignment with homologous structures and manual guesswork all play a part in addressing this problem. In theory, all possible assignments can be made and renement carried out to nd the correct one, but it is easy to see that even ve mainchain fragments create a combinatorial explosion beyond addressing with brute force. Here I investigate whether an ecient testing of a sequence assignment is possible at low resolution when mainchain is known only approximately. The same 5 proteins as before are used but with articially lowered resolution such that observation to parameter ratio drops till 3. On a 2Å Cα -traced mainchain, 21 dierent sequences are assigned and rened. The 21 sequences consist of the correct sequence and the correct one oset by upto 10 residues in either direction. The terminal region left vacant by osetting is assigned a random sequence. cns renement carried out here is shorter than the previous protocol - a round consists of 2 cycles of mdsa starting at 3000K. After initial sequence assignment, 5 rounds of such cns/opsax renement are carried out. A sequence discrimination exercise consists of 21 sequence trials and it is repeated 25 times to get a statistical estimate of success rate in identication of correct sequence. An exercise is successful if Rf ree for correct sequence is the least compared to the other 20 incorrect sequences. Fig.3.11 shows the success rate (alternatively, discriminating power) in identication of correct sequence as a function of observation to parameter ratio. For even a small observation to parameter ratio of 4, the discriminating power is more than 80% for 4 of the 5 proteins. In other words, in 4 of 5 attempts, the correct sequence renes to a less Rf ree than other sequences. Thus cns/opsax protocol is highly eective is identifying correct sequence. As expected, the discrimination power increases with increase in observations to parameter ratio. The time required to rene a combination of sequence and mainchain for observations to parameter ratio of 4 is around 20 minutes, i.e. around 7 hours for 21 sequence/mainchain combinations, suggesting that the protocol is ecient also. Investigation into the 1AAC case where the protocol does not work well revealed that too many observations were removed in articial lowering of resolution, and the protocol worked well when all data was used. I attempted to extend this exercise further to fragmented mainchains. I assumed that

78

secondary structure elements were approximately known and repeated the same procedure as before to estimate the discriminatory power. Loop residues were not included in the model during any round of renement. Results suggest that this case is more challenging than the complete mainchain (Fig.3.12). For low resolution scenario, the success rate is better than

50% only for 1KX8. Even when all reections are used, only 1KX8 and 8CHO have satisfactory success rates. This observation is consistent with the secondary structure content - both 1KX8 and 8CHO have nearly 80% of residues in secondary structure compared to nearly 50% for others (Table 3.1). This suggests that in the fragmented mainchain case, packing restraint is lost due to excluded sidechains and mainchains of loop residues, and packing restraint is important to drive renenement to a non-random Rf ree . Perhaps rotameric propensities may be helpful here to compensate for loss of packing and this needs to be further investigated.

3.5 Conclusions One of the main motivations for this work was to experiment with orderless sampling. In rapper's gabb algorithm, there is a xed N to C order on sampling. In Rapper tk, this order

can be suitably changed but not eliminated, indeed the existence of sampling order is essential to keep the sampling tractable. But some scenarios where sets of conformational samples of residues are independent of one another allow orderless sampling. Sidechain sampling is a classic example of this. The limitations of this approach are computational in nature, for instance, I noted it takes a long time to solve large biconnected components, similar to the observations of Canutescu et al. (2003). Availability of an orderless sampling algorithm makes

Rapper tk a more useful software package and complements the gabb algorithm, extending its applicability. opsax can be extended beyond amino acid sidechains, to entities like exible ligands or mainchain fragments adopting limited set of conformations. Work in this chapter tested whether orderless sampling as implemented in opsax is useful for protein crystallographic renement. I showed that (a) sidechain optimization can be used in conjunction with cns to carry out automated renement starting from approximate mainchains (b) multiconformer renement improves renement statistics and may help in speculating concerted conformational changes (c) ecient discrimination of correct sequence assignment is possible at low resolution with approximate mainchain. The cases of renement where Rf ree of a single or multiconformer model is comparable to the Rf ree of the deposited structure are useful to assess the structural heterogeneity inherent in the structure of aorded by diraction data. Single conformer renements lead to collections and multiconformer renements lead to ensembles. A simple application of cns/opsax protocol could obtain both collections and ensembles, but only rarely of a qual-

79

#successful attempts (out of 25)

Figure 3.11: Sequence assignment statistics for complete approximate mainchain. Upper panel shows the success rates of registry recognition and the lower panel shows the average time required for rening a mainchain for a registry, for various observation-to-parameter ratios. All mainchains were generated at 2A CA restraints. For a mainchain, 20 incorrect registries are tried in addition to the correct one. Each CNS run consists of 2 cycles of MDSA starting at 3000K with isotropic B factor renement with standard residual target.

25

20

15

10 9ILB 1KX8 1AAC 1G35 8CHO

5

0 3

4

5

6

7

#Observations / #Parameters ratio

Avg time (minutes)

25

20

15

10 9ILB 1KX8 1AAC 1G35 8CHO

5

0 3

4

5

6

#Observations / #Parameters ratio

80

7

Figure 3.12: Success rates of registry recognition in the fragmented mainchain case. Only the residues in secondary structure elements are used in the renement. Discrimination of correct sequence depends largely upon the fraction of secondary structure residues.

#successful attempts (out of 25)

25

9ILB 1KX8 1AAC 20 1G35 8CHO

15

10

5

0 3

5

7

all

#Observations / #Parameters ratio

ity better than the deposited structure. Further work will be required to devise new methods to do this routinely. An interesting avenue is to detect rotameric heterogeneity and build multiple sidechains instead of one. Ecient discrimination of correct sequence given an approximate mainchain and low resolution data is a promising result. But I also realized that the same procedure was not so promising for fragmented mainchains due to loss of packing. Further work will have to nd ways to simulate the packing eect that missing loops would provide, e.g. building missing loops rst before sequence assignment, use of rotameric propensities in the energy function, more realistic treatment of van der Waals interactions etc. From the computational perspective, opsax algorithm can be improved by nding out articulation edges and components, especially when the biconnected subgraphs are large. Large subgraphs can be divided into more manageable subgraphs this way and many interesting scenarios be addressed which would otherwise be computationally prohibitive, e.g. taking into account long range correlations between sidechain rotamers like NOEs and electrostatic energy. To conclude, this work demonstrates that optimal sidechain assignment is useful for crystallographic renement and points at further avenues to adapt it better to multiconformer and low-resolution applications.

81

Chapter 4 Modelling protein loops and their heterogeneity Background All-atom crystallographic renement of proteins is a laborious manually driven pro-

cedure, as a result of which, alternative and multiconformer interpretations are not routinely investigated.

Results I describe ecient loop sampling procedures in Rapper tk and demonstrate that single

loops in proteins can be automatically and accurately modelled with few positional restraints. Loops constructed with a composite cns/Rapper tk protocol consistently have better Rf ree than those with cns alone. This approach is extended to a more realistic scenario where there are often large positional uncertainties in loops along with small imperfections in the secondary structural framework. Both ensemble and collection methods are used to estimate the structural heterogeneity of loop regions.

Conclusion Apart from benchmarking Rapper tk for the all-atom protein renement task, this work

also demonstrates its utility in both aspects of loop modelling - building a single conformer and estimating structural heterogeneity the loops can exhibit.

82

4.1 Introduction X-ray crystallography has been the most popular protein structure determination technique of both pre- and post-genomic eras. The challenges of macromolecular crystallography are manifold - after the dicult steps of expression, purication, crystallization and data collection, there remains the nal and important task of data interpretation in order to build a model which explains the observed diractions. Structural interpretation requires overcoming the phase problem and often starts with partial and incorrect phases. Typically, semi-automatic iterative renement is carried out, gradually improving the model's quality as indicated by the R and Rf ree factors as well as decrease in covalent geometry and excluded volume violations. Although excellent softwares like CCP4 (CCP4 (1994)), Phenix (Adams

et al. (2002)) and

cns

(Brunger et al. (1998)) make this task possible, the structure rene-

ment procedure remains manually-driven hence laborious and subjective. Due to this, the heterogeneity in structural interpretation of diraction data is often ignored in favour of a single-conformer isotropic B-factor model. Protein structure is important for its function. But very stable, rigid proteins cannot exhibit enzymatic activity. This suggests that proteins have to be stable enough to retain their fold yet dynamic enough to be functional. Both experimental and computational studies indicate that single-conformer interpretation of crystallographic data is not adequate to capture the native state dynamics which is largely conserved even in a crystal owing to its high solvent content (Petsko (1996), Jensen (1997)). Reporting a multiconformer interpretation of data will make use of the structure less misleading, especially in the analyses that depend on geometry such as shapes of binding sites, orientations of sidechains, detection of non-covalent interactions and so on. While multiple interpretations are necessary, they should be free from any bias such as that introduced when dierent crystallographers solve the same diraction data. Multiconformer interpretation will be greatly facilitated by automated methods. Thus multiple persuasive justications emerge for automating the protein crystallographic renement task: (a) capturing the dynamics of protein in the crystalline state (b) removing subjective bias from the renement process and (c) reducing the need for precious human resource. But this goal is hard to achieve in practice. The under-determined nature of the problem (number of independent observation < number of parameters) prevents a straightforward solution by minimization. Even when sucient restraints exist, minimization methods like conjugate gradient, steepest descent etc. suer from the problem of local minima. Hence use of well-known features of proteins is unavoidable. Automatic pattern recognition in electron density is very successful in presence of high resolution data and good phases because it looks for such features (Perrakis et al. (1997)). But at medium resolution or given poor

83

phases, this strategy can get misled. Our recent eorts with automated crystallographic renement started with rapper, which is a conformation sampling program for proteins and uses a genetic algorithm cum branchand-bound (gabb) algorithm. DePristo et al. (2004) showed that multiple interpretations similar to the deposited structure are possible given the deposited data, and the divergence in interpretation is correlated to resolution. With

rapper,

it was demonstrated (DePristo

et al. (2005)) that when a protein structure is approximately known, it can be rened to native-like quality, unlike mdsa in cns which may get stuck in local minima. Fundamental features of rapper responsible for avoidance of local minima traps were (a) use of ne-grained, propensity-weighted φ − ψ maps for backbone sampling (b) use of backbone-dependent rotameric libraries (c) use of ideal Engh and Huber covalent geometry (d) mild use of electron density and positional restraints to guide the sampling process. Later Furnham et al. (2006b) demonstrated that a low-resolution dataset can be rescued and interpreted semiautomatically to obtain structure of a system with great biological signicance. DePristo et al. (2005) observed that automatic renement becomes less satisfactory as positional restraints become weaker: the structures could not be rened if the initial Cα perturbation was of order of 3Å or more. This is not unexpected because larger positional restraints dilute the information and would make the search harder. But often a practical problem encountered in crystallography is that of missing loops, i.e. knowing loop regions with far less Cα positional certainty than the regions with regular secondary structure. By denition, loops exhibit rich variability in backbone torsion angles. They are thought to be more dynamic than the protein secondary structural framework and also functionally more interesting. Thus it is important to use the available restraints as eciently as possible to build loops despite weaker electron density and greater positional uncertainty while tolerating small positional errors in the framework. After determining a single-conformer loop structure, the second important challenge is to estimate the structural variability of the loops. It is easy to see by generating articial data that existence of structural heterogeneity for a loop results in confusing electron density. In general, partial occupancies result in weaker density than full occupancy. Sidechains of the same residue may occupy dierent density contours. Overlaps in conformations may lead to signicant loss of shape information. These challenges can be expected to make the task only harder for minimization-based programs when rening a multiconformer structure. Following the reformulation of

rapper

as a versatile modular software called Rapper tk

(Chapter 2), it was essential to benchmark its performance for all-atom protein crystallographic renement. Hence the rst result in this chapter is the all-atom knowledge-based crystallographic renement given the positional restraints for the entire protein, establishing 84

that a similar result as

rapper

(DePristo et al. (2005)) can be achieved. I then demonstrate

that single loops in proteins can be reconstructed to a high quality with Rapper tk using little positional information. This case is then extended to include all loop regions and a small error in the framework to show that the composite

cns/Rapper tk

renement approach

is suitable in a realistic scenario. Finally, I ask whether single-loop heterogeneity can be modelled with collections of independently generated models or ensembles of conformers.

4.2 Methods 4.2.1 Overview of iterative renement Each step in renement procedure consists of: (a) identication of residues which do not t well into density, (b) nding contiguous bands of such residues, (c) rebuilding the bands with knowledge-based conformational sampling within restraints, (d) optimal sidechain placement of rebuilt sidechains and (e) rening the resulting model with

cns.

The t of a set of atoms to electron density is calculated as the correlation coecient between the σA -weighted omit map and Fc map for a region around 1Å of the atoms. The maps used are both generated by

cns

renement script, hence they are described on the

same grid. Following Kleywegt and Jones (1996a) and DePristo et al. (2005), the correlation coecient between the maps is calculated on the grid neighbourhood around atoms of interest:

P CorrCoef = pP

σomit σc P 2 2 σomit σc

(4.1)

If the correlation is below 0.9, the atoms are agged for rebuilding. Correlation is calculated on all atoms in residues and then on mainchains only, sidechains only and peptide atoms only. Ill-tting sidechains are marked for sidechain reassignment whereas residues with ill-tting peptide or mainchain or all-atoms are marked for all-atom reconstruction. Once the residues are agged for all-atom rebuilding, contiguous bands are identied and marked either as N-terminal, C-terminal or intermediate. Bands are then sampled in random order using the PopulationSearch algorithm. Each band is attempted 5 times and left as it was if it cannot be sampled within given restraints. Previously sampled bands are considered while sampling later bands. N and C terminal bands are built using forward and reverse techniques and weighted sampling of φ − ψ propensities. Building of intermediate regions is described in a later section. Once all bands are sampled, all the resampled sidechains are reassigned using the optimal sidechain placement procedure described in the previous chapter.

85

4.2.2 Electron Density Ranker Generally a single conformer model would rene better if its occupation of the 2Fo − Fc map is better - so a model within 1σ contour is more reliable than 0.5σ . Thus, on output of each builder, a binary electron density restraint can be applied with a σ -level cuto. But the quality of map is not uniform over all residues and hence such binary restraint is useful only for ensuring that model remains within positive electron density. Hence, in addition to that restraint, an analog ranker is used for electron density that ranks the possibilities and chooses the better ones. In the population search algorithm, the ranker asks more children to be generated at each conformation extension step than the population size (typically 5 times more), ranks them and chooses top-ranking ones to ll the conformation pool. The ratio of number of children generated to the population size is termed as the enrichment ratio. The electron density ranking scheme calculates score of a set of atoms by summing up the σ values in a 1Å region around their coordinates. The eective σ value is calculated by penalising the negative σ and attening the peaks by upper cuto, the latter for better recognition of shape of density rather than spikes, say due to waters or ions. In addition to ltering children at each step, the ranker also chooses the best member from the conformational pool generated, which is returned as the sampled model for the band.

4.2.3 Symmetry-related clash cheking As described in Chapter 2, Rapper tk uses geometric caching implemented as Clashchecking grid for eciently deciding whether atomic van der Waals spheres are overlapping. This excluded-volume restraint rules out many unproductive sampling trajectories. When loop positions are largely uncertain, the existence of symmetry-related images of the molecule around it acts as an excluded volume restraint to loop sampling. Rapper tk uses the Clipper (Cowtan (2003)) libraries for crystallographic computing for symmetry-related calculation. Clashchecking grid uses Clipper's Spacegroup class and symmetry operators therein to calculate the images of atoms to be added into the grid. Images within 20Å of the bounding box of given protein coordinates are considered. First the grid looks for any clashes between sampled coordinates and their images. Then it is veried that they do not clash against the rest of the coordinates or their images. In case of no clashes, the new coordinates and their images are added to the grid. Removing coordinates from the grid removes their images too.

4.2.4 Loop closure The typical incremental sampling step in Rapper tk builds Cαi and (i − 1)th sidechain in the forward mode or Cαi and (i + 1)th sidechain in the reverse mode. In this context, loop closure 86

can be formulated as nding the locations of mainchain atoms {C i−1 , Oi−1 , N i , Cαi , C i , Oi ,

N i+1 } and sidechains of (i − 1)th ,ith and (i + 1)th residues. Seamless loop closure of this kind is challenging because many conditions need to be met: (a) the covalent angles and lengths should be correct (b) φ, ψ states of 3 residues should be in the allowed regions (c) two ω angles should be adopt cis or trans conformation, but not be restricted to one or the other (d) 3 sidechains should be rotameric and (e) van der Waals restraints should be obeyed. A sampling procedure was devised for meeting conditions (a), (b), (c) and (d), while (e) is met using clash-checking restraints. The sampling procedure is similar to the one described in Chapter 2, but modied to meet the cis ω state too. First, the two ω angles are sampled, leading to the center, plane and radius of the circle on which the middle Cα is sampled. The circle is uniformly sampled. For each sample, the {r, α, θ, } − {φ, ψ, ω} mapping is used to build the mainchain atoms. Then the three sidechains are sampled from a rotamer library. Sampling is continued until a conformation satisfying all restraints is found. The problem with this sampling is that the restraint density abruptly increases at loop closure because it is not clear how to back-propagate the geometric requirements (a) and (c). Due to this, the sampling procedure fails often and takes a long time to nd a valid sample. Often an incorrect conformation is built in case of imperfect density because sampling of φ, ψ, ω is not propensity-weighted. Hence after signicant experimentation with this approach, it was abandoned in favour of a simpler approximate approach. In the simpler approach, the loop closure is formulated as nding the coordinates of mainchain atoms {C i , Oi , N i+1 } and sidechains of residues i and i + 1. A φ sampler is used to build the C i atom which is required to lie between 0.5 to 2Å from the N i+1 atom. Covalent angles N i − Cαi − C i and C i − N i+1 − Cαi+1 are restrained to lie between 90o and 150o . ω dihedral angle Cαi − C i − N i+1 − Cαi+1 is allowed a maximum deviation of 30o from cis or trans conformation. Two sidechains are sampled for each C i sampled. It is observed that it is more ecient to close a loop with this method than the previous.

4.2.5 Both-sided sampling Bands to be rebuilt can be of three types: the N terminal band, the C terminal or intermediate. For the C and N terminal bands, forward and reverse sampling are used respectively. For the intermediate regions, the most ecient way is to use a both-sided sampling approach as opposed to only forward sampling. As explained previously (Chapter 2, Fig.12, in the context of β -sheet sampling), in both-sided sampling, residues are sampled in the order

i, k, i+1, k −1, i+2, k −2, .... In case of forward or reverse sampling, only a weak loop-closure distance restraint informs the sampling process of the other end of the loop, but with bothsided sampling, information at both N and C termini is actively used. A distance restraint is 87

used between Cα atoms at the same sequence distance from both termini, so that the chance of loop closure remains high despite both sided sampling. Initial experiments with crystallographic loop building clearly showed that renement was better with both-sided sampling than forward-only sampling, especially with larger loops and weaker positional restraints. Thus, in this work, I have used forward-only, backward-only and both-sided sampling for C terminal, N-terminal and intermediate bands respectively.

4.2.6 Multiconformer sampling This type of sampling constructs many conformations of the same band simultaneously. Instead of incrementally sampling one band, multiple models of the band are extended simultaneously. This is achieved by re-implementing the PopulationSearch algorithm in its plural form in which each builder is replaced by a set of builders that have same input and output atoms in dierent models. Clashchecking is not performed across models. Electron density ranker uses the combined output of a set of corresponding builders to calculate the score of a child conformation. Due to this, the possibility of getting attracted into higher density is reduced and the chance of occupying the density generated due to genuine heterogeneity increases. The disadvantage of this kind of sampling is the obvious increase in conformational freedom and execution time.

4.3 Results 4.3.1 Reproducing

rapper/cns

renement

The utility of knowledge-based renement has been demonstrated by DePristo et al. (2005) with automatic renement of perturbed starting structures of 9ILB and 1KX8 to an Rf ree almost same as the deposited structure. When that renement protocol was closely reproduced in Rapper tk, very similar results were obtained. Five proteins were selected in the 2Å-3Å resolution range from the pdb: 9ILB (Yu et al. (1999)), 1KX8 (Lartigue et al. (2002)), 1MB1 (Taylor et al. (1997)), 1BYW (Cabral et al. (1998)) and 1RN7 (Alvarez-Fernandez et al. (2005)). Five perturbed structures were generated for each of them within 2Å Cα and 3Å sidechain centroid restraints respectively. 20 rounds of both

cns-only

and

cns/Rapper tk

renement protocols were performed on these starting models to obtain the Rf ree statistics summarized in Table 4.1. Rf ree gures reported for 9ILB and 1KX8 by DePristo et al. (2005) for same restraints were 0.27(0.01) and 0.32(0.01) respectively - the corresponding statistics observed here are comparable.

88

89

2.28

2.80

2.10

2.60

2.50

9ILB

1KX8

1MB1

1BYW

1RN7

112

110

98

99

153

#AA

0.270

0.292

0.292

0.294

0.225

Baselinea Rf ree

0.366 (0.054)

0.342 (0.020)

0.333 (0.021)

0.317 (0.018)

0.266 (0.016)

CNS-only Rf ree

0.318 (0.009)

0.316 (0.013)

0.302 (0.003)

0.310 (0.010)

0.253 (0.011)

Rf ree

CNS/Rapper tk AA-RMSDb χ1 c χ12 d (pdb vs. models, among models) 1.29 (0.08) 75.2 (2.63) 53.0 (4.64) 0.585 79.100 55.100 0.60 (0.02) 76.2 (1.93) 50.2 (4.06) 0.551 77.800 57.500 0.60 (0.02) 75.8 (2.31) 56.4 (2.05) 0.598 78.300 59.600 0.58 (0.06) 78.6 (3.32) 57.4 (5.60) 0.595 79.700 58.400 0.61 (0.04) 81.0 (2.09) 53.2 (1.93) 0.604 80.300 54.500

Baselines Rf ree values are computed by running a long CNS-only renement starting from the deposited structure. b All-atom RMSDs are in Å units. First two numbers are the mean and standard deviations of RMSDs between the PDB and 5 models. The third number is the mean of RMSDs among 5 models based on 5 C2 = 10 pairwise comparisons. c χ1 accuracy of a protein structure with respect to another is dened as the percentage of sidechains in the former that have a χ1 diering by less than 40o from the corresponding sidechain in the latter. First two numbers are the mean and standard deviation of the χ1 accuracies between 5 models and the deposited structure. The third number is the mean of χ1 accuracies from 5 C2 = 10 pairwise comparisons among the 5 models. d Same as c , but with χ1 and χ2 both instead of only χ1 .

a

Resolution (Å)

PDB Id

Table 4.1: Full-protein testset and renement statistics for 5 starting models generated within 2Å Cα and 3Å sidechain restraints.

Table 4.1 shows the variation in all-atom rmsd and χ1 , χ12 values as function of resolution. The reported

rmsd

is the average of pairwise unsuperposed

rmsd

between the deposited and

each of the composite models and thus can be said to indicate the inaccuracy in retrieving the deposited model from an approximate starting model. This inaccuracy does not seem to be sensitive to the resolution, suggesting that at least in the 2.1Å-2.8Å resolution range, an approximate model can be corrected to a similar quality with respect to the deposited one irrespective of the resolution. When the models are compared among themselves in a pairwise manner, the average rmsd

and χ1 , χ1,2 gures can be said to represent heterogeneity. This heterogeneity is slightly

lower than the inaccuracy, but the dierence is insignicant, i.e. each model is as far from the deposited structure as from any other model. Recently the heterogeneity dened similarly has been suggested to be the minimum uncertainty expected in the coordinates of a singleconformer model of that structure (Terwilliger et al. (2007)). An ideal renement method should start with approximate models and yield a set of high heterogeneity models each of which agrees at least as well with the data as the deposited structure, i.e. a combination of results in DePristo et al. (2004) and DePristo et al. (2005). Clearly, the protocol used is similar to DePristo et al. (2005) and perhaps expectedly, does not yield greater heterogeneity than inaccuracy. But when the models are assigned partial occupancies and combined to create a multiconformer model (Fig.4.4, Section 4.3.4), the collection Rf ree values for 1MB1, 1BYW and 1KX8 drop signicantly by 1.5 − 2% than the deposited structure. This drop suggests that perhaps structural heterogeneity is captured to some extent.

4.3.2 Rebuilding missing loops Five structures of various resolutions and no obvious homology were selected from the

pdb:

1MB1 (Taylor et al. (1997)), 1BYW (Cabral et al. (1998)), 1KXB (Choi et al. (1996)), 1RN7 (Alvarez-Fernandez et al. (2005)) and 2DBO (Ishii et al. (2005)). All structures have a single continuous peptide chain between 100-200 residues and no ligands. For each structure, a loop at least 10 residues long was chosen for rebuilding (Table 4.2). Unlike the previous exercise, there are no positional restraints on loop sidechains. Cα atoms are positionally restrained, in the rst case with 5Å restraints and later with 10Å restraints. The loops were rebuilt using the both-sided loop sampling and loop closure techniques within the iterative renement protocol. For 4 of 5 loops considered, the 5Å perturbation of the loop can be corrected to within one point of the baseline Rf ree . In the 10Å case, this performance drops marginally to two points from the baseline Rf ree for the same cases. Fig.4.1 shows the large dierence in the quality of 90

cns-only

and cns/Rapper tk renement

protocol. Every starting model renes to a structure very similar to the deposited using the composite protocol whereas it gets trapped in local minima during

cns-only

renement. The

1RN7 case (Fig.4.2) is unsatisfactory due to a dicult 5-residue segment in the loop (Pro-80, Asn-81, Leu-82, Asp-83, Asn-84). As noted by Alvarez-Fernandez et al. (2005), the density for this segment is confusing, perhaps due to underlying heterogeneity, and consistently misleads the band sampling into a conformation dierent from the deposited.

4.3.3 Framework and loops Perhaps a more frequently encountered scenario than the previous two is the one in which both the secondary structure and loops have positional uncertainty. In such cases, the loop regions invariably are more unreliable than the secondary structure framework. In order to simulate this scenario, the framework was restrained to 1Å Cα and 3Å sidechain centroid restraints, whereas loops were restrained to 3Å Cα restraints and no sidechain restraints. Five models were built for each protein and then iteratively rened using both cns/Rapper tk

cns-only

and

protocols. The renement statistics are summarized in Table 4.3. Note that

the renement composite renement statistics are not as good as in the previous exercises, but still better than the cns-only

cns-only

renement. Fig.4.3 shows a typical contrast between the

and the composite renement protocols in this scenario.

4.3.4 Variation of Rf ree with collection size I dene a collection as a set of independently rened structures which when taken together may capture some aspects of structural heterogeneity. This term is introduced to distinguish the collection from an ensemble (see Section 4.3.6) which is also a set of structures, but rened in an interdependent manner. For the previous exercises of renement (whole chain, 5Å loop, 10Å loop and loops with framework), single best Rf ree models from ve

cns/Rapper tk

trajectories are chosen and

combined to create collections, e.g. a three model collection is created by choosing three lowest Rf ree models from the ve and assigning occupancy of 0.33 to each of them. A collection is subjected to a short

cns

renement and Rf ree at its end is noted as collection

Rf ree . Fig.4.4 shows the variation of such Rf ree values as a function of collection size. The drop in Rf ree is modest and the highest when going from collection size of 1 to 2 or 3. The collection Rf ree generally rises for sizes 4 and 5. This indicates the danger of overtting due to increase in number of parameters. Thus a straightforward combination of models does not seem to be the correct way of describing heterogeneity. Intelligent schemes for parameter reduction need to be investigated in this regard, such as upper-bounding B -factors, enforcing 91

1MB1 1BYW 1KXB 1RN7 2DBO

PDB Id

Resolution (Å) 2.1 2.6 2.9 2.5 2.76 98 110 158 112 148

#AA

Loop range and size 66-75 (12) 115-124 (12) 206-218 (15) 76-95 (22) 83-98 (18)

Baseline Rf ree 0.292 0.292 0.283 0.270 0.289

Rf ree (CNS-only) 5Å 10Å 0.331 (0.024) 0.368 (0.022) 0.329 (0.006) 0.335 (0.008) 0.326 (0.007) 0.337 (0.010) 0.347 (0.018) 0.366 (0.022) 0.332 (0.015) 0.354 (0.006)

Rf ree (CNS/Rapper tk) 5Å 10Å 0.304 (0.013) 0.307 (0.005) 0.299 (0.012) 0.314 (0.008) 0.290 (0.008) 0.293 (0.017) 0.311 (0.009) 0.316 (0.013) 0.291 (0.010) 0.291 (0.008)

Table 4.2: Dataset for loop building and renement statistics.

92

Figure 4.1: Loop building exercise for the 1MB1 loop with 10Å Cα restraints. Top panel shows the loop in the deposited structure (green) and starting models generated for it. Middle panel shows the best Rf ree models (slate) obtained during the CNS-only renement. Bottom panel shows the CNS/Rapper tk models (magenta) and the loop in the deposited structure (green) in all-atom representation.

93

Figure 4.2: Loop building exercise for the 1RN7 loop with 5Å Cα restraints. Panels arranged in a similar way to Fig.4.1.

94

Figure 4.3: Framework/loop exercise for 1BYW. The deposited structure is shown as thick green ribbon and starting structures as brown ribbon in the top panel. Middle panel shows the best Rf ree structures from CNS-only trajectories (slate) and bottom panel shows those from from the CNS/Rapper tk trajectories (magenta).

95

Table 4.3: Dataset and renement statistics for the loop-framework renement PDB

Resolution (Å)

#AA

1MB1 1BYW 1KXB 1RN7 2DBO

2.1 2.6 2.9 2.5 2.76

98 110 158 112 148

Loops #AA (%) 27 50 54 33 57

(28) (45) (36) (29) (39)

baseline 0.292 0.292 0.283 0.270 0.289

Rf ree CNS-only Mean (Std.Dev.) 0.352 (0.012) 0.400 (0.046) 0.351 (0.022) 0.342 (0.032) 0.370 (0.020)

CNS/Rapper tk Mean (Std.Dev.) 0.325 (0.0l0) 0.321 (0.016) 0.336 (0.006) 0.328 (0.008) 0.335 (0.017)

the same B -factor on corresponding atoms across all models and positionally constraining them together when large variability is not expected.

4.3.5 Mistakes in the composite protocol Although the

cns/Rapper tk

renement produces a well-rened structure very close to the

baseline Rf ree , it never betters the latter. This is due to imperfections in various components of the protocol. Identication of residues to rebuild relies on the χ2 correlation coecient which can sometimes be an unsatisfactory substitute for human judgement. This can lead to unnecessary resampling of satisfactory bands and sometimes incorrect conformers are not detected. The copying of non-protein atoms from one round to next may sometimes result in their permanent misplacement. The sampling problem may result in unsatisfactory bands because the right conformation must be generated in order to be picked by the electron density ranker. On the other hand, the ranker may not score a correct conformation as the best one. If the density for a band is weak, band sampling may sometimes lead into the density of waters or ligands. In spite of these diculties, the renement statistics presented in previous sections are satisfactory. But when restraint radii are increased beyond those used here, the serious problem of spatial overlap between restraint spheres of two dierent bands starts aecting the band sampling. The correct density for a band then may be occupied by a band sampled before it. In such case, the correct density is always occupied by the wrong band. For the wrong band, subsequent cns renement may take it so far from its correct location that the restraint radii may be too small to let the band be built correctly again. Further work is required to get rid of the problem of band overlaps, either by restraint adjustments or change in the sampling strategy.

96

Rfree

Rfree

97

3

Collection Size

0.27 5

0.27 4

0.28

0.31

0.32

0.33

0.34

0.22

0.28

3

5

0.29

2

4

1MB1 frame-loop 1BYW frame-loop 1KXB frame-loop 1RN7 frame-loop 2DBO frame-loop

Collection Size

2

0.29

1

1

0.3

0

0

0.24

0.26

0.28

0.3

0.32

0.3

0.31

0.32

0.33

0.34

0.27

0.28

0.29

0.3

0.31

0.32

0.33

1MB1 loop 10 1BYW loop 10 1KXB loop 10 1RN7 loop 10 2DBO loop 10

Rfree Rfree

0.34

0

0

1

1

3

3 Collection Size

2

Collection Size

2

4

1MB1 loop 5 1BYW loop 5 1KXB loop 5 1RN7 loop 5 2DBO loop 5

4

1MB1 whole-chain 1BYW whole-chain 1RN7 whole-chain 9ILB whole-chain 1KX8 whole-chain

5

5

Figure 4.4: Variation of Rf ree with collection size in whole chain, loop and framework-loop exercises. For collection size of 0, baseline Rf ree is shown. For collection size 1, the mean and standard deviation of best Rf ree models in CNS/Rapper tk trajectories are shown. The rest are calculated by combining the best Rf ree individual models with partial occupancies.

4.3.6 Modelling loop heterogeneity with collections and ensembles Conformational diversity is most pronounced in loop regions due to the relatively smaller number of non-covalent interactions to maintain order. Absence of good density for a loop when rest of the structure has good density is a sure indication of the loop's exibility. Modelling heterogeneity is challenging because density is generally more confusing for such regions owing to conformer overlaps and subsequent dilution of shape information. Heterogeneity can be modelled with collections or ensembles. Members of a collection are single-conformer isotropic B-factor models determined independently of one another. Members of an ensemble are determined in a highly interdependent manner and have partial occupancies. Derivation of a collection is a simple way to estimate the unavoidable uncertainty in structure determination, but it cannot be said to represent any structural correlations. A major advantage of collections is their simplicity. A procedure that produces a single model can be executed multiple times with dierent random seeds or starting models to generate a collection. Thus the time taken increases only linearly as the collection size. An implicit assumption in the ensemble representation is that the members are in dynamic equilibrium, making ensemble a much stronger statement than the collection. Determining ensembles is very challenging because it is unclear how to determine the number and occupancies of the ensemble members prior to or during the renement process. Another major challenge is the linear increase in the number of parameters which results in an exponential increase in search space and execution time. In order that its output be credible, any procedure that aims to model the structural heterogeneity must be rst validated using articial data where the real heterogeneity is accurately known. To that end, I have chosen a signicantly simplied kind of heterogeneity by generating articial diraction data in which the underlying heterogeneity is restricted to a single loop and consists of two equally-occupied loop conformations. For a loop each from 1MB1, 1BYW and 2DBO, two conformers were generated by perturbing the loop to within

3Å Cα restraints and no sidechain restraints. All non-protein atoms (ions, waters etc.) have been removed so as to reduce the density dilution. Articial diraction data were created with the same cell, spacegroup and resolution as the deposited structure. For self-consistency in the cns forceeld, data was generated iteratively. The average of the two conformations was considered as the starting conformation for further heterogeneity modelling. A collection of 4 members was generated for each loop with

cns/Rapper tk

protocol used

previously. An ensemble consisting of 2 members was generated for each loop using multiconformer sampling described previously. As with the single-conformer protocol, multiconformers were sampled iteratively and alternatingly with cns. Enrichment was increased to 10 98

and population size to 200 to be able to build reasonable models. Positive electron density restraint was enforced on mainchain. The performance of collections and ensembles can be visually inspected in Fig.4.5, Fig.4.6 and Fig.4.7 but it is essential to quantify quality of heterogeneity modelling objectively. The two important quantities to measure are: the extent to which both conformers are captured and the extent to which at least one conformer is captured. The former (multiconformer quality index, MQI) should quantify how much of the underlying diversity is represented and the latter (single-conformer quality index, SQI) should quantify how well at least one of the heterogenous states is modelled. If conformers Hi constitute the true underlying heterogeneity and Mi are the ensemble or collection members which model it, then MQI and SQI can be calculated as:

M QI =

X

minj (Rmsd(Hi , Mj ))

(4.2)

i

SQI = mini,j (Rmsd(Hi , Mj )) where

rmsd

is calculated over the atoms of interest, e.g. a sidechain or all Cα atoms in

the loop. Note that these expressions do not consider the occupancies. Table 4.4 quanties the performance of ensembles and collections using Rf ree , MQI over loop Cα atoms, and MQI, SQI over each sidechain. The ensemble Rf ree values are smaller than those for collections as expected due to greater number of parameters. Cα MQI suggests that mainchain heterogeneity is modelled better in the ensemble method. Sidechain SQIs do not show any systematic dierence between the two methods, which suggests that both methods capture the rotameric heterogeneity to a similar extent. But sidechain MQIs tend to be slightly better for ensemble than collection. This is not surprising because in principle, the only limitations on the ensemble method are sampling and ranking of conformational extensions. Generally the higher density option is chosen in single-conformer modelling but a lower density can also be chosen in multiconformer modelling due to a greater number of atoms to place in the density. This is evident from residues Lys-72 and Tyr-73 in 1MB1, Glu-118, Asp-119 in 1BYW and Lys-86, Lys-87 in 2DBO where collection models are biased towards one of the loop conformations due to weak density.

4.4 Concluding Discussion The problem of automated crystallographic renement is interesting, challenging and has signicant immediate practical relevance. An automated solution for building single con99

Figure 4.5: Heterogeneity modelling with collection and ensemble for a loop in PDB 1MB1. The top panel shows the articially generated loop heterogeneity with corresponding electron density contoured at 1σ . The middle and bottom panels respectively show a 4-member collection and 2-member ensemble model of that heterogeneity.

100

Figure 4.6: Heterogeneity modelling with collection and ensemble for a loop in PDB 1BYW. Panels arranged as in Fig.4.5.

101

Figure 4.7: Heterogeneity modelling with collection and ensemble for a loop in PDB 2DBO. Panels arranged as in Fig.4.5. In the top panel, the red contours correspond to 0.5σ .

102

Baseline Rf ree

0.055

0.159

0.292

PDB

1MB1

1BYW

2DBO

103 0.312 (0.005)

0.242 (0.004)

0.127 (0.003)

Collection Rf ree Mean (Std.Dev.)

0.271

0.192

0.094

Ensemble Rf ree

2.01

0.79

2.11

MQI-Cα Collection

1.10

0.58

1.42

MQI-Cα Ensemble 4.269 3.575 5.624 3.316 1.606

0.994 1.849 2.171 4.580 1.402 0.829

0.696 0.598 2.160 6.821 0.814 1.739 0.850 0.396

Gln-67 Phe-70 Lys-72 Tyr-73 Gln-74

Lys-116 Asn-117 Glu-118 Asp-119 Val-122 Ile-123

Asn-84 Val-85 Lys-86 Lys-87 Arg-89 Arg-90 Pro-91 Ser-92

0.286 0.210 0.901 0.456 0.136 0.394 0.220 0.173

0.325 0.529 0.169 1.833 0.279 0.273

0.354 0.262 0.889 0.151 0.349

1.710 3.202 0.622 3.830 7.495 2.174 1.273 0.576

1.004 1.861 0.178 3.703 1.000 2.559

4.285 3.271 2.539 0.233 1.614

0.147 1.073 0.153 0.451 1.480 0.442 0.522 0.189

0.275 0.275 0.049 1.281 0.262 0.677

0.516 0.670 0.235 0.115 0.189

Residue-wise MQI, SQI for collection; MQI, SQI for ensemble

Table 4.4: Comparison of collection and ensemble heterogeneity modelling styles with articially generated 2-conformer heterogeneity for 3 loops. MQI and SQI are in Å units.

former models from approximate spatial restraints, combined with an automated method to explore heterogeneity, will allow the crystallographic community to revisit and annotate the entire

pdb

with structural variability information. Such information will have an im-

pact on all analyses that rely on accurate coordinate information, like in-silico ligand design and binding, non-covalent interactions and sequence-structure conservation, etc. It will also signicantly change the understanding of crystalline state and protein exibility, benetting the renement process. The main components for successful heterogeneity annotation are reliable construction of single-conformer models and reliable estimation of heterogeneity that is predominantly seen in loops. This chapter has attempted to develop methods to that end. Application of the rapper approach to crystallographic renement (DePristo et al. (2005), Furnham et al. (2006b)) has the primary benet of crossing the energy barriers in a nonrandom manner, based on knowledge-based sampling instead of kinetic sampling. As described in Chapter 2,

rapper

has been reformulated as Rapper tk recently, creating possibili-

ties of applying knowledge-based sampling in many dierent ways. In this chapter, I showed that Rapper tk can be used to automatically rene the whole protein structure starting from reliable positional restraints on mainchain and sidechain. This benchmarked its performance vis-a-vis

rapper

as reported by DePristo et al. (2005) for a similar task. Then I showed

that by ecient use of available restraints, single loops can be modelled in protein structures to native-like quality with few positional restraints. The ecient use of available restraints was a result of symmetry-related clashchecking, restraint propagation using loop anchors and sampling from both anchors simultaneously. The same strategy could be extended to a more realistic problem of a large uncertainty in loop regions and an imperfect secondary structure framework. The

cns-only

renements, run as controls, showed the value added by Rapper tk

to the renement task. In addition to determination of single-conformer models under diering restraint qualities, I started addressing the challenge of heterogeneity assessment in loops. This is indeed a very dicult problem with fundamental unknowns like number of conformers, correlations within heterogeneity and relative occupancies. Conformational heterogeneity can be divided into two types: the simpler sidechain-only heterogeneity where mainchain is nearly the same and the all-atom heterogeneity where mainchain also takes distinct conformations. The latter can be further divided based on the extent of spatial overlap between the conformers. Sidechain-only heterogeneity is relatively easy than the all-atom heterogeneity because the density is likely to contain good cues about diversity. But for overlapping conformers, a visual inspection of density is less likely to be helpful. There are two distinct ways to model heterogeneity, which I have termed collections and ensembles, depending upon interdependence of member conformations. For single-loop 2-conformer overlapping heterogeneity, I generated both the 104

collections and ensembles and assessed how well they modelled the heterogeneity. The main observation was that the collection was generally biased towards the higher electron density. The ensemble method, due to more freedom and parameters available to it, manages to avoid this trap and t two distinct conformers, leading to better modelling of heterogeneity than the collection. Various pitfalls of the composite renement protocol were recognized and they need to be addressed in future. Addressing the problem of overlapping bands will signicantly increase the reliability of the method given very approximate positional restraints, and make the method more useful in low resolution, large uncertainty cases. Perhaps many models can be generated for such bands and the best combination of those models can be used. At lower resolution, use of coarse-grained sampling (fragment sampling) may also be useful, followed by ne-grained φ − ψ − χ sampling. From the heterogeneity perspective, a fundamental question to address would be the estimation of the nature of underlying conformations before attempting to model it because ensemble sampling must have prior knowledge of the number and occupancies of its members. Generation of collections seems the only way for such estimation, for which collection modelling method will have to be modied suitably to sample within electron density yet avoid the bias towards higher density. The main challenge of ensemble sampling is the explosion in conformational freedom and the work ahead will have to focus on eciently scaling this sampling method for larger ensemble size.

105

Chapter 5 sampling and crystallographic renement rna

Background Dramatic increases in rna structural data have made it possible to recognize its

conformational preferences much better than a decade ago. This has created an opportunity to use discrete restraint-based conformational sampling for modelling rna and automating its crystallographic renement.

Results All-atom sampling of entire rna chains, termini and loops is achieved using the Richard-

son rna backbone rotamer library and an unbiased distribution for glycosidic dihedral angle. Sampling behaviour of Rapper tk on a diverse dataset of rna chains under varying spatial restraints is benchmarked. The iterative composite crystallographic renement protocol developed here is demonstrated to outperform cns-only renement on parts of trnaAsp structure.

Conclusion This work opens exciting possibilities for further work in rna modelling and crystallography.

106

5.1 Introduction 5.1.1 Role of

rna

rna is involved in many important biochemical functions involving genetic information, such

as its storage (viral rna), communication (mrna) and modulation (snorna, microrna). rna also performs protein-like functions like enzymatic catalysis (ribosomal peptide bond

formation - rrna) and specic binding (amino-acid-specic trna) etc. It is believed to have played a major role in the early evolution of cellular life because it is functionally intermediate to proteins and dna, exhibiting enzymatic activity as well as information storage and transfer (Voet and Voet (1995)). There is an increasing recognition of rna's importance in cellular life (Schlick (2006)) and attempts to organize available experimental information as rna ontology (Leontis et al. (2006)).

5.1.2

rna

structure

rna is simpler than proteins in the sequence space due to a much smaller alphabet, but

structurally it is more complicated. A typical nucleotide contains at least thrice as many non-hydrogen atoms as an amino acid residue. The most prominent parts of polynucleotide structures are nucleotide bases which are purines or pyrimidines. Purines adenosine and guanine are 5,6 aromatic rings and resemble tryptophan's sidechain. Pyrimidines uracyl and cytosine are aromatic 6-rings which resemble phenylalanine and tyrosine sidechains. The bases can undergo a variety of post-transcriptional modications, increasing the eective number of base types (Dunin-Horkawicz et al. (2006)). A striking feature of dna and (most) rna structures is the common Watson-Crick pairing of purines with pyrimidines, and the

associated base stacking. But in rna structures, there are other non-canonical base interactions which contribute to stabilization of various rna motifs (Leontis and Westhof (2003)). Bases are linked to 5-membered ribose sugar rings through glycosidic linkages. The χ torsion angle, which describes base rotation with respect to sugar, is distributed around −120o or diametrically opposite to it, around −60o (Schneider et al. (2004)). Sugar ring connects bases to the backbone, and occurs only in two conformations, C30 -endo or C20 -endo. The phosphate-sugar backbone has six torsion angles (α, β, γ, δ, ², ζ ) and much greater freedom than the protein mainchain. But conformational correlations in that space have been recognized recently (Duarte and Pyle (1998), Murray et al. (2003), Schneider et al. (2004)). Despite chemical dierences, protein and rna chains are logically similar. rna backbone and protein mainchain are the unbranched chains in both polymers and show clear preferences for parts of their dihedral spaces. In proteins, mainchain completely determines Cβ coordinate 107

and similarly, rna backbone almost completely determines the sugar coordinates. Bases are similar to sidechains, because both are rotameric and confer chemical characteristics to respective polymers1 . Thus, rna backbone, sugar and bases are analogous respectively to protein mainchain, Cβ atom and sidechains.

5.1.3

rna

structure prediction

Like other biopolymers, sequence data for rna is far greater than 3D structural data. rna crystals generally do not diract as well as proteins because rna is harder to purify and crystallize, possibly due to size and exibility. Hence structure prediction methods are important to bridge the sequence-structure gap. rna structure prediction is done at two levels - secondary and 3D. Secondary structure prediction is important because it can help identify a variety of motifs like stem, hairpin loop, internal loop, junction loop, bulges and pseudoknots. These predictions can prove to be important restraints to guide further 3D structure prediction. 3D structure prediction is important to locate interesting sites and tertiary interactions, but it has so far been dependent on secondary structure prediction (Shapiro et al. (2007)). Secondary structure prediction estimates the base pairings given a sequence. Due to standard Watson-Crick base-pairing, rna commonly exhibits helical stem regions. The sequence that connects the two strands in a stem is called a loop. Stem and loop arrangement can develop in a hierarchical fashion, giving rise to a structure that can be represented like a tree. Dynamic programming based algorithms like Mfold (Zuker et al. (1999)), Sfold (Mathews

et al. (1999)), RNAstructure (Mathews et al. (2004)) assign secondary structure in such a way as to minimize the free energy for the sequence2 . Optimal and highly-ranked suboptimal solutions are very likely to contain the correct secondary structure. Suboptimal solutions can be ltered using Boltzmann sampling (Ding et al. (2004)) or abstract shape analysis (Steen et al. (2006)) to enrich the solutions of dynamic programming algorithms. In addition to dynamic programming, various other approaches have also been utilized such as genetic algorithms (Shapiro et al. (2001)) and Monte-Carlo sampling (Xayaphoummine et al. (2005)). All approaches can be further enhanced by using multiple sequence alignments, based on the information-theoretic principle that MSAs improve the signal to noise ratio. Tree-like simplicity of rna secondary structure is lost when pseudoloops are formed by base-pairing of a stretch in loop with another strand. Pseudoloops are known to occur in many more complicated ways than the simplest H-type. They reduce exibility of the 1 But

base rotamericity is weaker and not used in this work. energies used here are experimentally determined as a function of host secondary structure type and base-pairing. 2 Free

108

structure because often the stems involved in a pseudoloop are coaxially stacked. Dynamic programming algorithms which include general psudoknots scale poorly but simple H-type pseudoknots can be incorporated without loss of eciency (Shapiro et al. (2007)). Fully-automated 3D structure prediction procedures are yet to be devised for rna. This is perhaps due to the complexity of rna structure and relatively less structural information as rna is studied more often from a non-structural perspective. Present approaches encoded in programs like Erna-3D (Zwieb and Muller (1997)), rna2D3D (Yingling and Shapiro (2006)) and S2S (Jossinet and Westhof (2005)) are focussed on assisting the 3D model building exercise interactively. The inputs are a combination of known/predicted secondary structure, features derived from 3D structural data and available experimental restraints. The interactively assembled model is generally subjected to molecular dynamics renement and minimization (Shapiro et al. (2007)). Recurrent 3D motifs in rna structure are short sequence-dependent combinations of backbone conformations and base interactions. A complex set of noncovalent interactions stabilize them. Motif identication has not matured enough to be usable in 3D structure prediction (Leontis and Westhof (2003)).

5.1.4

rna

crystallographic renement

rna crystallography is harder than protein crystallography because nucleotides are bigger

and more exible than amino acid residues. rna crystals rarely diract better than 2Å. Due to many high-quality protein structures, their statistical preferences can be used eectively to solve more protein structures. This critical mass eect is yet to be achieved for rna as there are not enough structures for condent identication of backbone preferences and 3D motifs. Apart from their stand-alone utility, high-quality single-chain rna structures are also essential for docking into low-resolution EM data of large complexes containing rna chains. Temperature factors suggest that rna exibility is the least for paired bases and the highest for phosphates. Yet phosphates are also easy to detect due to greater electron density. Hence rna crystallographer identies bases and phosphates of rna chain in the initial map and then iteratively completes and renes the structure. Due to lack of structural preferences, this process is manual, tedious and laborious. Methods and progress in rna crystallography have been reviewed by Holbrook and Kim (1999) and Holbrook (2005).

5.1.5 This work This work is inspired by the success of rapper's protein sampling which proved eective in loop sampling, comparative modelling and automation of crystallographic renement 109

(de Bakker et al. (2003), DePristo et al. (2005), Furnham et al. (2006b)). It is the last task that would be very useful to the crystallographer if replicated for rna. In protein crystallography, approximate locations of Cα atoms and sidechains identied by the crystallographer are sucient for rapper to reach an almost rened structure. It is expected that a similar approach would work for rna chains too, given the approximate locations of phosphates and bases visually identiable in the electron density. Apart from crystallographic use, a generalized restraint-based all-atom sampler of rna would be useful for generating decoy structures useful for benchmarking of energy functions. It would also allow generation of models with a prescribed sequence and secondary structure, and serve as a tool for generating 3D models of rna motifs. In this chapter, I show that rapper's gabb (genetic algorithm using branch-and-bound technique) algorithm can be extended to rna structures to sample it accurately and eciently under a variety of positional restraints on backbone and bases. I also demonstrate the allatom iterative crystallographic renement of parts of a trnaAsp structure.

5.2

rna

tracing

These benchmarks assess the utility of rna sampling for the intended application of crystallographic renement, hence the restraints chosen here reect the kind of information a crystallographer can provide. Spherical positional restraints are used for phosphates (P atoms) and C40 atoms. Base planes are restrainted using a union-of-spheres restraint (baseplane restraint). This restraint is satised when the sampled set of coordinates lie within the union of given spheres. As noted in Chapter 2, Rapper tk uses the Richardson rotamer library (Murray et al. (2003)) for rna backbone sampling. Sugar-phosphate backbone consists of six dihedral angles : α (O30i−1 −Pi −O50i −C50i ), β (Pi −O50i −C50i −C40i ), γ (O50i −C50i −C40i −C30i ), δ (C50i −C40i −

C30i − O30i ), ² (C40i − C30i − O30i − Pi+1 ) and ζ (C30i − O30i − Pi+1 − O50i+1 ). Murray et al. (2003) dene the rna backbone suite as a set of seven dihedral angles {δ i , ²i , ζ i , αi+1 , β i+1 , γ i+1 , δ i+1 } and identify 42 distinct rotamers. In a recent eort, this library has been extended to 46 rotamers, with standard deviations specied for each cluster (J. M. Richardson, personal communication). Glycosidic dihedral χ is dened over O40i − C10i − N 10i − C20i for pyrimidines

O40i − C10i − N 90i − C40i for purines. χ preferences have not been rigorously analyzed, hence in this work it is randomly sampled between −180o and +180o in steps of 10o . The basic operation in rna chain extension is building next or previous backbone suite by sampling a backbone suite rotamer and building sugar/base of present nucleotide using a random sample for glycosidic linkage. Various styles of sampling use this building block in 110

dierent ways.

5.2.1 Sampling styles For iterative crystallographic renement, basic operations over the rna chain are rebuilding the whole chain, or its terminal (50 or 30 ) or an intermediate fragment (loop) :

• Forward sampling (50 → 30 ) is performed using the default rna builder as described in Chapter 2. This builder depends on atoms (C50i , C40i , C30i ) and yields (O30i , Pi+1 , O50i+1 , C50i+1 , C40i+1 , C30i+1 ) atoms (see Chapter 2 for gure). It also builds the sugar and base of ith nucleotide.

• Bootstrapping required for sampling the whole chain is explained in Chapter 2. It involves approximate positioning of (P, O1P, O2P, O50 , C50 , C40 , C30 ) atoms of the rst nucleotide. • Backward sampling (30 → 50 ) is performed by slightly changing the forward builder. The same backbone rotamers are sampled, but the builder depends on atoms (O30i , C30i , C40i ) to calculate coordinates for atoms (C50i , O50i , Pi−1 , O30i−1 , C30i−1 , C40i−1 ) (see Fig.5.1). Sugar and base for ith nucleotide are also built.

• Loop sampling uses forward sampling. Nucleotides between and including start and end indices are rebuilt. Base of (start − 1)th nucleotide is resampled within 2Å positional restraints. Approximate loop closure is achieved by partial sampling of (end + 1)th nucleotide's (P, C50 , C40 , C30 ) atoms under similar restraints. Loop closure restraint is back-propagated by enforcing a spherical positional restraint centered at P end+1 atom with radius 7 ∗ (end + 1 − i) Å on P i atom and also forcing it to remain 5Å away from the P end+1 atom.

5.2.2 Initial observations For benchmarking of rna tracing capabilities, I have used a set of diverse rna chains compiled by Duarte and Pyle (1998) for their virtual dihedral analysis (summarized in Table 5.1). In the rst exercise, I restrained P, C40 atoms to 2Å positional restraints and sampled only the backbones of the chains. But a model could be generated for only 12 of the 48 chains. This suggested that the Richardson rotamer set consisting of only 46 states was too coarse-grained for the task being attempted. Indeed, 46 is a small number for capturing preferences of a

111

Figure 5.1: Reverse RNA builder

C4*

C4 C5

C3*

C6

O3* O2P P

O4

C1* C5*

C2*

C4* C3* RNA Builder

N1

O4*

O5*

O1P

N3

O3*

RNA Sampler

112

O2*

C2

O2

Table 5.1: Dataset of RNA chains used for the tracing exercise. PDB id 1i6u 1kh6 1l2x 2fmt 1mzp 1duh 1cx0 1ec6 1m5k 1b7f 1f1t 1cvj 1b23 1ddy 1e7x 1et4 1g1x 1f7u 1hmh 1jbt 1qf6 1m8x 1ddl 1h4s

Sizea 37 27 27 77 55 44 71 19 91 12 37 8 73 35 16 35 39 75 34 28 76 8 7 67

Filtered Sizeb 11 10 10 17 18 19 18 6 20 3 4 1 13 4 16 9 14 23 11 7 25 3 0 12

PDB id 1jj2 1l9a 1n78 361d 1k8w 1kq2 1f7y 1c0a 1kxk 1hq1 1ivs 1gid 1qtq 1lng 1f27 1jbs 1ehz 1hr2 1mji 1y 1e7k 1ntb 1ser 429d

a

Sizea 121 125 75 19 21 6 56 77 69 48 75 158 73 96 18 28 76 156 33 74 9 21 64 12

Filtered Sizeb 18 15 24 1 10 0 9 37 11 35 6 29 28 8 11 11 26 19 7 11 4 4 9 6

Size is the number of nucleotides in the chain as in the deposited PDB structure. Filtered size is the size of the largest contiguous segment of rotameric backbone suites in the given chain. The Richardson RNA backbone rotamer set consists of 46 7-dihedral tuples, along with standard deviations for all dihedrals in each tuple. A backbone suite is rotameric if the largest single-dihedral dierence between the suite and the closest Richardson backbone rotamer is < 30o or < 3σ of that dihedral angle. b

113

exible backbone consisting of 6 dihedral angles. Hence it was decided to supplement sampling by perturbation - after a rotamer is sampled, a random noise within 1σ of the respective dihedrals is added to them. Standard deviations were kindly provided by J. M. Richardson (personal communication). When the latest exercise was repeated with perturbed sampling, at least one model could be generated for 47 of 48 examples. When perturbed backbone-only sampling was performed within tighter 1Å restraints on P, C40 atoms, this dropped to 36 of

48 chains. These failures could be traced to the non-rotameric backbone suites present in the chains. Then the longest stretch of good backbone suites was identied within every chain. A good suite was dened to be the one for which the largest single angle dierence from the closest rotamer was within 30o or 3σ for that angle. As seen from Table 5.1, such good fragments are fairly small as compared to whole chain. 45 of 46 such fragments could be sampled successfully under the same restraints. On these fragments, all-atom sampling was also possible within 1Å restraints on P, C40 atoms and 5Å baseplane restraints. By dropping the C40 from these restraints, an increase in sampling time was observed, accompanied by a reduction in number of examples for which 10 models could be built (33 of 45). This is due to population dilution, which in this case is the reduction in number of members which will satisfy restraints for the base of next nucleotide. As expected, using stricter base restraint of 3Å made the matters worse due to greater base restraint violations and no propagation of base restraints onto the backbone. Base restraint used here is hard to satisfy closely because a small error close to sugar amplies towards the far end of the planar base. This problem can be addressed if given base restraint can be propagated onto C40 atom, but it is unclear at present how to achieve this.

5.2.3 Sampling performance Two characteristics are desirable in a sampling process: (a) given tight restraints, sampling should be ecient and (b) given loose restraints, sampling should produce native-like conformations owing to the knowledge of native structure incorporated in it. In other words, sampling cost should be directly proportional to length of the sampled fragment and inversely proportional to the restraint strictness. Sampling accuracy should be directly proportional to restraint radius. To check conformity with these expected traits, I carried out backbone-only sampling of ltered fragments under positional restraint of 1, 2 and 3Å on phosphorous atom. Note that fragments may be the entire chains or at either terminus of the rna chain or in between, hence this also tests corresponding sampling styles. All-atom sampling exercises were carried out under the same restraints on P atom and baseplane restraint of 5Å on bases. 10 modelling attempts were made in each sampling exercise. A modelling attempt fails if it cannot produce 114

a model in 5 trials. Each trial uses backtracking, i.e. if sampling fails at a nucleotide, it is restarted from a position 3 nucleotides before it in the sampling order. In all-atom sampling, glycosidic linkage (χ dihedral) is sampled uniformly over the entire range at 10o intervals. van der Waals radii of base and sugar atoms are reduced by 50%. Sampling performance is quantied by measuring the average rmsd of models and average time taken to produce a model as functions of fragment size, restraint radius and whether bases are modelled. The time plots (Fig.5.2) suggest a linear correlation between fragment size and sampling time for both backbone-only and all-atom models, hence lines of best t have also been plotted. Regression coecients of these lines are informative. In both cases, regression coecients suggest that the time needed for sampling with P restraint radius of 2Å is twice as much as that with 3Å and four times as much with 1Å as with 2Å. Comparison of the regression coecients in backbone and all-atom cases suggest that latter is nearly ten times costlier than the former. The rmsd plots (Fig.5.3) suggest a weak correlation between the rmsd and fragment sizes, i.e. rmsd is lower for smaller fragments with the same restraint radius. This prompted the tting of a log curve. rmsd falls with the size of P restraint. For each restraint size, all-atom rmsd is more than backbone rmsd. Interestingly, the backbone rmsd in all-atom case is better than that in backbone-only case, indicating the inuence of base restraint in guiding the backbone. Thus, the main sampling trends are: (a) smaller restraint radius leads to greater sampling time, (b) all-atom sampling is costlier than backbone-only, but leads to backbones with less rmsd (c) sampling time is proportional to fragment size and (d) rmsd tends to be smaller

for smaller fragments. These trends are expected from previous experience with protein sampling exercises. But there are signicant dierences too, due to dierences in restraint density. In protein Cα tracing, backbone sampling models 3N atoms under N positional restraints (ignoring carbonyl oxygen), hence positional restraint density is backbone tracing exercise carried out here, this density is

1 6

1 . 3

In the rna

(ignoring phosphate oxygens

O1P, O2P ). This is reected in the backbone rmsd: in the case of proteins, backbone rmsd is generally lower than the Cα restraint radius but it is generally higher for rna than the P restraint radius. Another dierence is rotamericity of protein sidechains and lack of it in glycosidic linkages. This is indicated by lower all-atom rmsd for proteins than rna chains under similar restraints. To sum up, trends observed in rna sampling are expected and satisfactory enough to attempt application to the crystallographic scenario.

115

Figure 5.2: Variation in sampling time with RNA fragment size. The following scatter plots (above backbone-only, below all-atom) indicate a linear correlation between average sampling times and fragment sizes. Regression coecients of lines of best t suggest that sampling time nearly doubles from 3Å to 2Å and almost quadruples from 2Å to 1Å. Similarly, all-atom sampling is roughly a magnitude costlier than backbone-only case. Note that 3 outliers have not been considered for the 1Å backbone-only plot and 4 fragments did not yield any model during 1Å all-atom sampling.

Time per model (seconds)

200

P1 4.13x-5.12 P2 0.79x+2.11 P3 0.49x+0.47

150

100

50

0 0

5

10

15

20

25

30

35

40

30

35

40

Fragment size (# nucleotides)

Time per model (seconds)

2000

P 1 BB 5 46.44x-290.06 P 2 BB 5 10.36x-32.31 P 3 BB 5 5.95x-9.35

1500

1000

500

0 0

5

10

15

20

25

Fragment size (# nucleotides)

116

Figure 5.3: Variation in sampling accuracy with RNA fragment size. The following plots (top1Å, middle-2Å, bottom-3Å) show relationship between average rmsd of models of fragments and fragment lengths. There is a weak tendency to have lower rmsd for lower lengths, hence a log curve was tted for each scatter. In general, at a given P restraint radius, all-atom models have better backbone rmsd than backbone-only models. All-atom rmsd is slightly greater than backbone rmsd in all-atom models. rmsd increases as restraint radius increases. 3.5

RMSD (angstrom)

3 2.5 2 1.5 1

P 1 backbone 0.15 * log(x) + 1.60 P 1 BB 5 backbone -0.04 * log(x) + 1.67 P 1 BB 5 all-atom 0.02 * log(x) + 1.88

0.5 0 0

5

10

15

20

25

30

35

40

35

40

35

40

Fragment size (# nucleotides)

3.5

RMSD (angstrom)

3 2.5 2 1.5 1

P 2 backbone 0.20 * log(x) + 1.91 P 2 BB 5 backbone 0.04 * log(x) + 1.80 P 2 BB 5 all-atom 0.04 * log(x) + 2.14

0.5 0 0

5

10

15

20

25

30

Fragment size (# nucleotides)

3.5

RMSD (angstrom)

3 2.5 2 1.5 1

P 3 backbone 0.25 * log(x) + 2.21 P 3 BB 5 backbone 0.15 * log(x) + 1.98 P 3 BB 5 all-atom 0.08 * log(x) + 2.37

0.5 0 0

5

10

15

20

25

Fragment size (# nucleotides)

117

30

5.3 Foray into crystallographic renement 5.3.1 About trna structure Transfer rnas are classic structures from the 1970s. Till mid-90s, structures of trna (Hingerty

et al. (1978), Sussman et al. (1978), Westhof et al. (1988), Westhof and Sundaralingam (1986)) were the only large rna structures in pdb (Shi and Moore (2000)), making them remarkable achievements of crystallography techniques of that decade. trna is a cloverleafshaped molecule in its secondary structure representation and has a L-shaped 3D form (Fig.5.4). trna is an essential cog in the translational machinery of the cell which incrementally translates the transcripted mrna into peptide chain one residue at a time. Ribosome nds a trna with a 3-nucleotide anticodon complementary to current mrna codon. This trna has an amino acid attached to its 50 end, which the ribosome then attaches to the growing polypeptide. trna structures are attractive for demonstrating crystallographic utility of discrete restraintbased rna sampling because they are neither too small nor too large, are structurally wellstudied and have 3 loop regions (anticodon loop, T ψC loop, D loop) with non-Watson-Crick base pairing. For this work, trnaAsp structure was used, solved at 3Å by Westhof et al. (1988). This structure (pdb 2tra) renes to R/Rf ree of 0.2552/0.3063 with cns starting from deposited structure and data.

5.3.2 Composite renement protocol Similar to composite renement protocols used earlier in the thesis, this work also uses perturbed starting structures and rebuilds them with the aim of improving Rf ree . In brief,

Rapper tk identies the ill-t nucleotides by calculating the correlation coecient between Fc map and σA -weighted cns omit map for regions around the backbone, sugar/base and entire nucleotide. Low (< 0.9) correlation coecient indicates nucleotide stretches to rebuild, which are then built incrementally using gabb algorithm. Ten times more children are generated as the population size, and top 10% are retained based on their electron density occupation score, leading to an enriched population. Resampled nucleotides get a B -factor of 30 assigned to all of their atoms. Non-rna atoms (ligands and waters) are not used during sampling. Best member of population (according to density occupation) is written out as the new model along with non-rna atoms appended to it. The coordinates and B -factors of non-rna atoms are copied from the previous renement iteration. This model is rened with cns (2 rounds of mdsa starting at 5000K , intervened by a 200-step minimization). This procedure is repeated

for 10 iterations. It is expected that rna models generated with rotameric backbone states 118

Figure 5.4: tRNA structure: the schematic diagram shows the typical secondary structure of tRNA. 3D representation below it shows all-atom and cartoon representation of tRNAAsp as in PDB entry 2tra (Westhof et al. (1988)). 3’end

5’end

Acceptor stem Tψ C Loop

D−loop

Variable loop

Anticodon Loop

Anticodon loop

water−coordinated Mg ion

D Loop

Variable loop spermine

TψC loop

119

Acceptor Stem

to obey given positional restraints, positive electron density restraints and excluded volume restraints would be within the convergence radius of cns, i.e. such models can be used to assist cns in nding well-rened structures, starting from ill-tting ones.

5.3.3 Rening a helical fragment cns renement was performed initially on the anticodon loop (nucleotides 33 − 37) and the

T ψC loop (nucleotides 54 − 60), starting from models where the loops were perturbed by rna tracing within tight positional restraints (P , baseplane restraints of 2Å, 5Å respectively).

In both cases I observed that cns was able to correct the errors introduced in the native structure. This was in contrast to proteins where similar positional restraints on Cα and sidechains generally result in unsatisfactory cns-only renement. But removal of baseplane restraints from the rna trace deteriorated the renement quality. This suggested that cns convergence radius is larger for rna structures than proteins, and Rapper tk sampling may be of value only in cases where spatial information about the structure is highly uncertain. In order to use a simple example to begin with, a fragment in rna duplex (nucleotides

23 − 27) was chosen, with clear base densities. Initial perturbation was carried out with 2Å P restraints and no base restraints to generate 5 models. The perturbed models were subjected to cns renement only. In 4 of 5 cases, cns renement was unsatisfactory. 3 such cases are shown in Fig.5.5. When the composite renement protocol was applied to the same region with the same starting models, all trajectories resulted in well-rened structures (Fig.5.6). The mean of best Rf ree values in cns-only trajectories was 0.311 as compared to 0.304 for the composite protocol trajectories. It is interesting to note that Rf ree does not strongly reect the salient dierences in the renement trajectories indicated by Fig.5.5 and Fig.5.6.

5.3.4 Rening the T ψC loop The same exercise was repeated for nucleotides 54 − 60, the T ψC loop. The native density for this loop is not as good as the helical fragment (see Fig.5.7). cns-only renement resulted in mean best Rf ree of 0.316 over the 5 renement attempts, whereas the same for composite renement was 0.303. Visual inspection of these models shows the greater variability in the cns models and that each attempt was stuck in a local minimum. 3 of 5 composite models

rened to a structure very similar to native, but the rest were trapped in a local minima. Close observation of these 2 cases revealed that spurious density appearing elsewhere led

Rapper tk sampling away from the native.

120

Figure 5.5: CNS-only renment of trnaAsp (PDB 2tra) can be unsatisfactory. Starting models were generated by perturbing a 5 nucleotide fragment (23-27) with 2Å P restraints and no base restraints. Top-left panel shows the native structure of the fragment with its omit map contoured at 2σ . Other 3 panels show the best Rf ree structures in 3 dierent CNSonly renement trajectories, with respective omit maps also contoured at 2σ . This suggests that CNS can get trapped in local minima in case of high initial structural uncertainty. Note that the CNS renement here is with minimal restraints, i.e. hydrogen-bonding restraints between base-pairs were not provided to CNS.

121

Figure 5.6: Composite cns/Rapper tk renement of a helical fragment from trnaAsp . Spherical positional restraints of radius 2Å were used around P atoms of the 5-nucleotide (23-27) fragment from PDB 2tra. No restraints were imposed on bases. The cns/Rapper tk renement resulted in satisfactory renement in all 5 attempts. Best Rf ree models in each trajectory are shown in magenta, with the deposited structure in sticks representation.

122

Figure 5.7: Composite cns/Rapper tk renement of the 7-nucleotide T ψC loop (54-60) from trnaAsp with 2Å P restraints and no base restraints. The top panel shows the native fragment with its omit map density. Middle panel shows the best Rf ree models of cns-only renement. Bottom panel shows the same for the composite renement. Native structure (green) is shown for reference in middle and bottom panels.

123

5.3.5 Rening the anticodon loop Anticodon loop spans nucleotides 33−37, of which G−34, U −35 and C −36 have two equally occupied states in the 2tra structure. Initial attempts to repeat the previous exercise on this loop were unsatisfactory because Rapper tk tried to t a single conformation to these heterogenous nucleotides. Due to this, I created articial diraction data at the same resolution by considering only the rst conformation of each nucleotide and assigning full occupancy to it. This signicantly changed the renement trajectories and a similar trend as previous two exercises could be observed. For ve cns-only and composite renements, mean best

Rf ree values were 0.254 and 0.215 respectively, indicating a much improved renement with the composite protocol. Fig.5.8 shows that the composite protocol yields almost identical structures and cns-only renement gets trapped in dierent local minima.

5.3.6 Typical problems with renement protocols There are two main reasons for suboptimal cns-only renement, identiable from successive structures in the renement trajectories (Fig.5.9). Firstly, if a base is very far away from its native-like location, cns renement does not restore it. Secondly, a base may get trapped into densities of phosphate, sugar or another base, in which case even if the base is not too far away, it is dicult to restore it. This is reminiscent of bulky misplaced sidechains in protein crystallographic renement. Structure trajectories suggest that improved renement with cns/Rapper tk protocol must be due to relocation of bases by rna sampling, which is brought about by the electron-density based enrichment of incremental building using rotameric backbones. A typical corrective rebuilding step is shown in Fig.5.10. The obvious mistakes in base placement are corrected with Rapper tk whereas cns carries out small corrections to take the conformations towards the optimal. There are some imperfections in the Rapper tk sampling scheme which may sometimes lead to incorrect nal structures: (a) rna is very exible and population size of 300 and enrichment factor of 10 may not be sucient (b) Lack of χ preferences means that selective pressure due to bases is low - it is further weakened in case of weak base density (c) Collateral damage may be caused by cns renement of a defective loop, e.g. perturbations in nearby regions of structure or symmetry-related copies do not get repaired during Rapper tk rebuilding step leading to higher Rf ree (d) Scoring scheme based on maximizing the electron density occupation may promote occupation of sharp peaks like waters and phosphates although there are obvious dissimilarities between such peaks and the shape of a base.

124

Figure 5.8: Composite cns/Rapper tk renement of the 5-nucleotide anticodon loop (33-37) with 2Å P restraints and no base restraints. The top panel shows the native fragment with its omit map density. Middle panel shows the best Rf ree models of cns-only renement. Bottom panel shows the same for the composite renement. Native structure (green) is shown for reference in middle and bottom panels.

125

Figure 5.9: CNS renement of RNA can get trapped in local minima. An instance of CNSonly renement yields a structure (green) very dierent from native (magenta). Corresponding 2Fo − Fc omit map is shown contoured at 1σ around a three nucleotide stretch (green sticks, nucleotides 54-56) only for clarity. Corresponding nucleotides in the two structures are shown with arrows. CNS model has bases too far away from native locations and also occupying the wrong density. Also note the appearance of density around wrong bases.

C−56

PSU−55

5MU−54

126

Figure 5.10: A typical corrective step carried out by Rapper tk RNA sampling with rotameric backbone and density enrichment. Blue is the native structure, green is a perturbed and cns-rened model. Magenta is the Rapper tk model found using the positional restraints and omit map of the green model. Note that Rapper tk model removes gross errors yet small errors may remain in comparison to native.

C−56 PSU−55

RapperTK RNA sampling

A−57

C−56

A−57 PSU−55

127

5.4 Conclusion This chapter suggests that knowledge-based sampling can be applied eciently and productively to rna structures. gabb algorithm was extended to sampling of rna chains at

50 -end, 30 -end and intermediate regions. Modied nucleotides were incorporated in addition to standard ones for all-atom sampling. Using a 48-chains dataset, I showed that sampling performance is along expected lines and suggests its suitability for real-world applications like crystallography. Then I demonstrated that a helical strand, the T ψC loop and the anticodon loop in the trnaAsp structure can be automatically sampled and iteratively rened using crystallographic data. It was found that the composite cns/Rapper tk protocol yields structures better rened than those by the cns-only protocol. Shortcomings of both protocols were discussed. This work shows that automated crystallographic renement of rna chains is possible given the approximate trajectory of phosphates. This is a promising result for reducing manual eort and allowing exploration of multiple conformations. Yet some concerns remain and must be addressed in future work. Sampling preferences themselves are imperfect. A-form conformation is adopted by more than 50% rna suites but population frequencies for rest of the backbone rotamers are unclear. Hence I have used equal weights for all suite rotamers. For similar reasons, I have not used the weakly bimodal nature of the glycosidic linkage. A careful analysis of available structural data will be required before incorporating such preferences reliably, because sampling preferences are meant to bias the conformational search and not restrict it. Another improvement necessary for quicker sampling is the propagation of phosphate and base restraint onto the backbone (e.g. on C40 ) so that base restraint satisfaction becomes more likely. At present this is a sampling bottleneck. There are a few promising ways to extend this work. Firstly, whole-chain crystallographic renement of rna structures can be performed for low resolution structures to reduce the number of non-rotameric suites. Secondly, rna sampling can be used to generate 3D all-atom conformations for secondary structures or motifs by expressing the base pairing/stacking interactions as distance restraints. All conformations sampled to satisfy these restraints will be useful in 3D structure prediction which is, as noted before, a process of assembling 3D coordinates of predicted secondary structures. Finally, protein and rna sampling can be combined together for automating the crystallography of protein-rna complexes, especially the very large ones like ribosomes so that human attention will be required only in the early and late stages of renement.

128

Chapter 6 Conclusion and Future Directions 6.1 Thesis Summary Using available structural knowledge to derive new structural knowledge has been a popular practice from as long as molecules have been modelled. Prior structural knowledge is generally formulated as scoring functions, dierentiable energy functions or discrete statistical preferences. Derivatives of energy functions with respect to atomic coordinates bias a model progressively towards native-like characteristics in a kinetic sampling protocol like mdsa in cns. In contrast, the changes brought about by a sampling protocol using knowledge-based

preferences are discrete and impart often-observed features of previously known structures to a conformation. Kinetic sampling approaches are prone to getting trapped in the local minima of their energy functions. A discrete change to a trapped conformation, informed by sampling of knowledge-based preferences, can rescue the conformation from its local minimum. This is termed as knowledge-tunnelling. rapper encodes a form of knowledge-tunnelling that has been used successfully to com-

plement continuous optimization protocols, e.g. ab initio loop structure prediction in combination with amber and protein crystallographic renement in combination with cns. But rapper has limited applicability because macromolecules other than proteins cannot be

sampled and the sampling algorithm cannot be easily modied. In Chapter 2, I describe the reformulation of rapper as Rapper tk, a versatile software framework capable of undertaking a diverse set of conformational modelling tasks due to its modularity and scripting interface. A benchmark shows that rapper-like performance is possible with the new toolkit. Subsequent chapters show that this reformulation is useful to implement a sidechain placement algorithm, to alter the gabb algorithm and to sample a non-protein entity like rna. In Chapter 3, I describe the implementation of a combinatorial optimization algorithm 129

to place protein sidechains optimally in electron density maps. Its utility is demonstrated for renement of approximate mainchain-only models. For similar models, this algorithm is shown to be ecient in accurately identifying the correct sequence registry in low resolution data. In Chapter 4, the suitability of Rapper tk for all-atom protein renement is demonstrated when mainchain and sidechain positional restraints are given. A loop sampling technique is developed for ecient utilisation of both anchors, with which single loops with little positional information can be modelled to almost native-like quality. Even with little information about most loops and an imperfect secondary structure framework, whole proteins can be rened to near-native accuracy. In general, collections of models are found to be a better description of diraction data than single models. A multiconformer sampling technique is developed to create ensemble descriptions of loop heterogeneity. Ensemble and collection methods are compared on articial data to observe that ensembles were better models of heterogeneity. In Chapter 5, all-atom rna sampling technique is developed. Sampling performance of

Rapper tk is benchmarked on a diverse set of rna chains using the Richardson backbone rotamer library. cns/Rapper tk composite crystallographic renement protocol is demonstrated with renement of three regions in the trnaAsp structure and observed to impart similar benets as seen previously for proteins. Thus in a variety of practically relevant crystallographic renement scenarios, knowledge tunnelling approach has been developed and used successfully.

6.2 Methodological challenges in knowledge-tunnelling Any knowledge-based approach faces two challenges: mining of structural information and using the knowledge derived from that process. While statistical preferences should reliably capture structural knowledge, they should not rule out appearance of genuinely new features in the structure being modelled. Hence it is necessary to use ne-grained smoothened preferences such as the rapper φ − ψ grid. Choosing the right granularity remains a balancing act based on available data, intended application, statistical reliability of preferences, computational expense in using them and available restraints. Use of knowledge is trickier in knowledge-tunnelling than, for instance, in energy functions where it is sucient to formulate available knowledge as a dierentiable function. This is due to the many algorithmic possibilities available for discrete structural sampling given a set of restraints and knowledge-based preferences. In principle, dierent algorithms using the same knowledge within the same restraints dier only in their eciency in producing 3D models. Eciency depends on the sensitivity to available restraints and exploitation of parallelization 130

opportunities. An important aspect of an algorithm's restraint sensitivity is restraint propagation, i.e. the inference of new restraints or sharpening of the existing ones given a set of 3D coordinates and restraints. Iterative application of triangle or quadrangle inequalities is useful for this task. As this is an open area of mathematical research, distance geometry smoothing technique is nearly the best available for restraint propagation.

6.3 Short-term objectives Based on the work presented in this thesis, following short-term objectives can be justiably and productively pursued.

6.3.1 Sequence assignment on approximate mainchains in low resolution crystallographic data As found in Chapter 3, sequence assignment exercise with the optimal sidechain placement protocol yielded satisfactory results when complete mainchain was present. But the more substantial problem of fragmented mainchain could not be addressed satisfactorily. A possible approach is to slide a sequence over a fragment by building all-atom models for the fragment within the positional and electron density restraints to choose the assignment with the highest density occupation score or lowest Rf ree . The sliding task should become progressively easier as the number of already assigned fragments increases. The approach should be benchmarked by varying the number and sizes of fragments, condence in fragment mainchain placement, unmodelled protein fraction and data resolution.

6.3.2 Combining secondary structure and loose positional restraints for protein crystallographic renement After assignment of sequence to an approximate fragmented mainchain and identication of secondary structure, the framework-loop scenario in Chapter 4 suggests that a complete allatom model can be developed automatically. Methodological developments necessary would be generalised restraint propagation and choosing the best combination of multiple models for bands. Secondary-structure based sampling as demonstrated in Chapter 2 should be helpful for eciency.

131

6.3.3 Including sidechain heterogeneity in protein crystallographic renement Sidechains can switch between multiple rotameric states to lead to diuse density. Often with high resolution data and a nearly inexible mainchain, it is possible to identify such multiple rotamers. Unlike Chapters 3 and 4, this scenario does not necessitate any prior assumptions about the expected number of distinct conformers. Thus, assuming an accurate mainchain, a procedure for all-atom sampling of a protein chain within electron density restraints will be developed with the capability of assigning multiple rotameric states at each residue. Overtting will be checked by monitoring the Rf ree of such model. This protocol will be greatly helpful in annotating a nearly-correct protein model with sidechain heterogeneity.

6.3.4 Characterizing protein loop heterogeneity In contrast to sidechain heterogeneity, making any judgements about the all-atom heterogeneity is very dicult. Any estimates of loop heterogeneity, e.g. number of distinct conformers, their relative occupancies, the bounding envelope etc. which can be derived from crystallographic diraction data would be of great use. This project would involve the development of a protocol that rst uses collections to estimate the number of conformers and then builds an ensemble of that size to model heterogeneity. This protocol will be benchmarked on articial data to measure the reliability of the modelled heterogeneity. A few instances of missing loops would be chosen from the pdb and subjected to this protocol. This will help characterize loop heterogeneity with a measure of reliability.

6.3.5 Knowledge-based crystallographic renement of trna structures A demonstration of whole-chain renement of a few trna structures is the logical continuation of Chapter 5 which dealt with modelling small fragments in the rna chain. Such procedure will be of much value for automation of rna crystallographic renement. Exploration of heterogeneity will also be very valuable, especially for non-A-form rna fragments.

6.3.6

rna

secondary structure to 3D models

rna secondary structure is mostly hierarchical but sometimes signicantly interrupted by

non-Watson-Crick interactions leading to various structural motifs including pseudoloops. Secondary structure prediction drives 3D model building: models for predicted secondary 132

structures are mined from known structures and joined together. But predicted secondary structure can be translated into interatomic spatial restraints which in turn can be possibly used to obtain multiple 3D models obeying that secondary structure. A tool which does so will be useful as it will break the dependence on known structures to obtain a 3D model. It will also enable theoretical exploration of possible conformations of a sequence adopting a secondary structure.

6.4 Long-term goals 6.4.1 Heterogeneity annotation of the

pdb

Many pdb structures are incomplete because reliable model building cannot be done due to the poor quality of electron density, possibly due to mobility. Thus the issues of missing substructure and conformational heterogeneity are intertwined. Annotation of the whole pdb with putative multiconformer models for missing regions and other low-quality regions (high

B -factor, doubtful density) is an attractive long term goal.

6.4.2 Addressing assemblies Scaling the present methods of automatic atomic structure renement to address large, multicomponent assemblies is essential as integrative and systems biology replaces the reductionist style. An attempt to understand and replicate the process of solving ribosome-scale structures would be a promising start for such an eort. Method development will be required in areas of using sequence to fragment preferences and binding preferences, in the context of structure renement with experimental data.

6.4.3 Combining data from various sources Experimental data from various sources yield complementary information about structure and have been productively combined previously, e.g. modelling missing regions in crystal structures with SAXS data, using high-resolution crystal structures to dock into EM envelope, etc. Concerns in combining dierent experimental data are the relative weighing of data and removing the source-specic structural bias (e.g. a domain with dierent structures in a crystal and an assembly). The importance of weighing arises from the use of derivativedriven continuous sampling: perhaps it would cease to be an important issue when discrete restraints-based sampling is used. Source-specic bias also can be reduced similarly if only relatively invariable structural frameworks are imported and resampled in the context of 133

the host data. Especially for the structures that do not yield good data by any one of the experiments, unforeseen methods of structure determination may emerge if reliable ways of combining data could be developed.

6.5 Final Comment As I close this thesis, the similarities between the processes of crystallographic renement and thesis completion become ever more evident: both are challenging, labour-intensive and nally enriching. The similarity extends to tunnelling also: the former benets from knowledge-tunnelling while the latter benets from experience-tunnelling (advice from the supervisor and colleagues, cross-germination of ideas from other disciplines etc.). Perhaps joys at the end of both are also similar. Structure determination is a fundamental addition to our understanding of biology. The granularity of such understanding depends on the resolution of crystallographic data, e.g. at low resolution, it is possible to understand assembly and at high resolution it is possible to gain insights about biochemistry and exibility, yet the importance of both cannot be overemphasised. The coarse- and ne-grained understanding of structure are two equally important strands in crystallographic renement that are addressable with the rapper approach. It is towards both these goals that future eorts will be directed by using the Rapper tk framework.

134

Bibliography Abagyan,R. and Totrov,M. (1994) Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins. J Mol Biol, 235, 9831002. Adams,P., Grosse-Kunstleve,R., Hung,L.W., Ioerger,T., McCoy,A., Moriarty,N., Read,R., Sacchettini,J., Sauter,N. and Terwilliger,T. (2002) PHENIX: building new software for automated crystallographic structure determination. Acta Cryst., D58, 19481954. Adams,P.D. and Grosse-Kunstleve,R.W. (2000) Recent developments in software for the automation of crystallographic macromolecular structure determination. Curr Op Str Biol,

10, 564568. Agashe,V.R., Shastry,M.C. and Udgaonkar,J.B. (1995) Initial hydrophobic collapse in the folding of barstar. Nature, 377, 754757. Agraotis,D.K., Bandyopadhyay,D., Wegner,J.K. and van Vlijmen,H. (2007) Recent Advances in Chemoinformatics. J Chem Inf Model, 1, 1. Alvarez-Fernandez,M., Liang,Y.H., Abrahamson,M. and Su,X.D. (2005) Crystal structure of human cystatin D, a cysteine peptidase inhibitor with restricted inhibition prole. J Biol

Chem, 280, 1822118228. Andrianantoandro,E., Basu,S., Karig,D.K. and Weiss,R. (2006) Synthetic biology: new engineering rules for an emerging discipline. Mol Sys Biol, 1, 1. Annsen,C.B. (1973) Principles that govern the folding of protein chains. Science, 181, 223230. Ben-Naim,A. (1997) Statistical potentials extracted from protein structures: Are these meaningful potentials? J Chem Phys, 107, 36983706. Blundell,T. and Johnson,L. (1976) protein Crystallography. Academic Press.

135

Borhani,D., Rogers,D., Engler,J. and Brouillette,C. (1997) Crystal structure of truncated human apolipoprotein A-I suggests a lipid-bound conformation. Proc Natl Acad Sci, 94, 1229112296. Bower,M.J., Cohen,F.E. and Dunbrack,R.L. (1997) Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool. J Mol Biol,

267, 12681282. Bradley,P., Misura,K.M. and Baker,D. (2005) Toward high-resolution de novo structure prediction for small proteins. Science, 309, 18681871. Branden,C. and Tooze,J. (1999) Introduction to Protein Structure. Garland Publishing, New York. Brandl,M., Weiss,M.S., Jabs,A., Suhnel,J. and Hilgenfeld,R. (2001) C-H PI-interactions in proteins. J. Mol. Biol., 307, 357377. Brooks,B.R., Bruccoleri,R.E., Olafson,B.D., States,D.J., Swaminathan,S. and Karplus,M. (1983) CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J. Comp. Chem., 4, 187217. Brunger,A.T. (1992) Free R-value - A novel statistical quantity for assessing the accuracy of crystal structures. Nature, 355, 472. Brunger,A.T., Adams,P.D., Clore,G.M., Delano,W.L., Gros,P., Grosse-Kunstleve,R.W., Jiang,J.S., Kuszewski,J., Nilges,M., Pannu,N.S., Read,R.J., Rice,L.M., Simonson,T. and Warren,G.L. (1998) Crystallography & NMR system: A new software system for macromolecular structure determination. Acta Cryst., D54, 905921. Buchete,N.V., Straub,J.E. and Thirumalai,D. (2003) Anisotropic coarse-grained statistical potentials improve the ability to identify nativelike protein structures. J. Chem. Phys.,

118 (16), 76587671. Buchete,N.V., Straub,J.E. and Thirumalai,D. (2004) Development of novel statistical potentials for protein fold recognition. Curr Op Str Biol, 14, 225232. Burling,F.T., Weis,W.I., Flaherty,K.M. and Brunger,A.T. (1996) Direct observation of protein solvation and discrete disorder with experimental crystallographic phases. Science,

271, 7277. Bystro,C. and Baker,D. (1998) Prediction of Local structure in Proteins Using a Library of Sequence-Structure Motifs. J. Mol. Biol., 281, 565577. 136

Cabral,J.H.M., Lee,A., Cohen,S.L., Chait,B.T., Li,M. and Mackinnon,R. (1998) Crystal Structure and Functional Analysis of the Herg Potassium Channel N-Terminus: A Eukaryotic Pas Domain. Cell, 95, 649655. Canutescu,A.A., Shelenkov,A.A. and Jr.,R.L.D. (2003) A graph theory algorithm for protein side-chain prediction. Protein Science, 12, 20012014. CCP4 (1994) The CCP4 Suite: Programs for Protein Crystallography. Acta Cryst., D50, 760763. Chiti,F. and Dobson,C. (2006) Protein Misfolding, Functional Amyloid and Human Disease.

Annu Rev Biochem, 75, 333366. Choi,H.K., Lee,S., Zhang,Y.P., McKinney,B.R., Wengler,G., Rossmann,M.G. and Kuhn,R.J. (1996) Structural analysis of Sindbis virus capsid mutants involving assembly and catalysis.

J Mol Biol, 262, 151167. Cornell,W.D.,

Cieplak,P.,

Bayly,C.I.,

Gould,I.R.,

Jr,K.M.M.,

Ferguson,D.M.,

Spellmeyer,D.C., Fox,T., Caldwell,J.W. and Kollman,P.A. (1999) A second generation force eld for the simulation of proteins and nucleic acids. J. Am. Chem. Soc., 117, 51795197. Cowtan,K. (2003) The Clipper C++ libraries for X-ray crystallography. IUCr Computing

Commission Newsletter, 2, 49. Creighton,T.E. (1993) Proteins: Structures and Molecular Properties. W. H. Freeman and Company, New York. Davis,I.W., Murray,L.W., Richardson,J.S. and Richardson,D.C. (2004) MOLPROBITY: structure validation and all-atom contact analysis for nucleic acids and their complexes.

Nucleic Acids Research, 32, W615W619. de Bakker,P.I., Furnham,N., Blundell,T.L. and DePristo,M.A. (2006) Conformer generation under restraints. Current Opinion in Structural Biology 6 (2), 6 (2), 13111319. de Bakker,P.I.W., DePristo,M.A., Burke,D.F. and Blundell,T.L. (2003) Ab Initio Construction of Polypeptide Fragments: Accuracy of Loop Decoy Discrimination by an All-Atom Statistical Potential and the AMBER Force Field With the Generalized Born Solvation Model. Proteins: Struct., Func., and Genet., 51, 2140.

137

Deane,C.M. and Blundell,T.L. (2000) A novel exhaustive search algorithm for predicting the conformation of polypeptide segments in proteins. Proteins: Struct. Funct. Genet., 40 (1), 135144. Depristo,M. (2003). Ab initio Conformational Sampling r Protein Structure Determina-

tion, Analysis and Prediction. PhD thesis, Department of Biochemistry, University of Cambridge. DePristo,M., de Bakker,P., Shetty,R. and Blundell,T. (2003) Discrete restraint-based protein modeling and the CA-trace problem. Protein Science, 12 (9), 203246. DePristo,M.A., de Bakker,P.I. and Blundell,T.L. (2004) Heterogeneity and inaccuracy in protein structures solved by X-ray crystallography. Structure, 12(5), 831838. DePristo,M.A., de Bakker,P.I., Johnson,R.J. and Blundell,T.L. (2005) Crystallographic renement by knowledge-based exploration of complex energy landscapes. Structure, 13 (9), 13111319. DePristo,M.A., de Bakker,P.I.W., Lovell,S.C. and Blundell,T.L. (2003) Ab-initio construction of polypeptide fragments : Ecient generation of accurate representative samples. Protein:

Structure, Function, and Genetics, 51, 4155. Dill,K.A., Bromberg,S., Yue,K., Fiebig,K.M. and Yee,D.P. (1995) Principles of protein folding - a perspective from simple exact models. Protein Sci., 4, 561602. Dill,K.A. and Chan,H.S. (1997) From Levinthal to pathways to funnels. Nat Struct Biol, 4, 1019. Ding,Y., Chan,C.Y. and Lawrence,C.E. (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res, 32, W135W141. Dore,A.S., Furnham,N., Davies,O.R., Sibanda,B.L., Chirgadze,D.Y., Jackson,S.P., Pellegrini,L. and Blundell,T.L. (2006) Structure of an Xrcc4-DNA ligase IV yeast ortholog complex reveals a novel BRCT interaction mode. DNA Repair, 5, 362368. Drenth,J. (1999) Principles of Protein X-Ray Crystallography. Springer. Duarte,C.M. and Pyle,A.M. (1998) Stepping through an RNA structure: A novel approach to conformational analysis. J Mol Biol, 284, 14651478. Dunbrack,R.L. (2002) Rotamer libraries in the 21st century. Curr Opin Struct Biol, 12, 431440. 138

Dunbrack,R.L. and Cohen,F.E. (1997) Bayesian statistical analysis of protein sidechain rotamer preferences. Protein Science, 6, 16611681. Dunbrack,R.L. and Karplus,M. (1993) Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J Mol Biol, 230, 543574. Dunin-Horkawicz,S., Czerwoniec,A., Gajda,M.J., Feder,M., Grosjean,H. and Bujnicki,J.M. (2006) MODOMICS: a database of RNA modication pathways. Nucleic Acids Res, 34, D145D149. Evans,J.S., Mathiowetz,A.M., Chain,S.I. and Goddard,W.A. (1995) De novo prediction of polypeptide conformations using dihedral probability grid Monte Carlo methodology. Pro-

tein Sci, 4, 12031216. Fan,H. and Mark,A.E. (2004) Renement of homology-based protein structures by molecular dynamics simulation techniques. Protein Science, 13, 211220. Fersht,A.R. and Daggett,V. (2002) Protein folding and unfolding at atomic resolution. Cell,

108, 573582. Fink,A.L. (1999) Chaperone-Mediated Protein Folding. Physiological Reviews, 79, 425449. Fogolari,F., Esposito,G., Viglino,P. and Cattarinussi,S. (1996) Modeling of polypeptide chains as C alpha chains, C alpha chains with C beta, and C alpha chains with ellipsoidal lateral chains. Biophys J, 70, 11831197. Frank,J., J.,M.R., Penczek,P., Zhu,J., Li,Y., Ladjadj,M. and Leith,A. (1996) SPIDER and WEB: Processing and Visualization of Images in 3D Electron Microscopy and Related Fields. J Struct Biol, 116, 190199. Frauenfelder,H. and Leeson,D.T. (1998) The energy landscape in non-biological and biological molecules. Nature Structural Biology, 5, 757759. Furnham,N., Blundell,T.L., DePristo,M.A. and Terwilliger,T.C. (2006a) Is one solution good enough? Nat Struct Mol Biol, 13, 184185. Furnham,N., Dore,A.S., Chirgadze,D.Y., de Bakker,P.I.W., Depristo,M. and Blundell,T.L. (2006b) Knowledge-Based Real-Space Exporations for Low-Resolution Structure Determination. Structure, 14 (8), 13131320. Gan,H.H., Tropsha,A. and Schlick,T. (2001) Lattice Protein Folding With Two and FourBody Statistical Potentials. Proteins: Struct. Func. and Genet., 43, 161174. 139

Goldberg,A.L. (2003) Protein degradation and protection against misfolded or damaged proteins. Nature, 426, 895899. Goldstein,R.F. (1994) Ecient rotamer elimination applied to protein side-chains and related spin glasses. Biophys. J., 66, 13351340. Gore,S.P., Burke,D.F. and Blundell,T.L. (2005) PROVAT: a tool for Voronoi tessellation analysis of protein structures and complexes. Bioinformatics, 21 (15), 33163317. Gore,S.P., Karmali,A.M. and Blundell,T.L. (2007) Rappertk: a versatile engine for discrete restraint-based conformational sampling of macromolecules. BMC Structural Biology,

7:13, doi:10.1186/14726807713. Gunasekaran,K., Ramakrishnan,C. and Balaram,P. (1996) Disallowed Ramachandran conformations of amino acid residues in protein structures. J Mol Biol, 264, 191198. Gunn,J.R. (1997) Sampling protein conformations using segment libraries and a genetic algorithm. J Chem Phys, 106, 42704281. Hinds,D.A. and Levitt,M. (1994) Exploring conformational space with a simple lattice model for protein structure. J Mol Biol, 243, 668682. Hingerty,B., Brown,R.S. and Jack,A. (1978) Further renement of the structure of yeast tRNA. J Mol Biol, 124, 523534. Holbrook,S. (2005) RNA structure: the long and the short of it. Curr Op Struct Biol, 15, 302308. Holbrook,S.R. and Kim,S.H. (1999) RNA crystallography. Biopolymers, 44, 321. Ishii,T., Shibata,R., Bessho,Y., Shirouzu,M. and Yokoyama,S. (2005) Crystal structure of D-Tyr-tRNA(Tyr) deacylase from Aquifex aeolicus. To be Published, 0, 0. Jabs,A., Weiss,M.S. and Hilgenfeld,R. (1999) Non-proline Cis Peptide Bonds in Proteins. J.

Mol. Biol., 286, 291304. Jacobson,M.P., Pincus,D.L., Rapp,C.S., Day,T.J., Honig,B., Shaw,D.E. and Friesner,R.A. (2004) A Hierarchical Approach to All-Atom Protein Loop Prediction. Proteins: Structure,

Function, and Bioinformatics, 55, 351367. Jensen,L.H. (1997) Renement and reliability of macromolecular models based on X-ray diraction data. In Methods in Enzymology, (Jr.,C.C. and Sweet,R., eds),. Academic Press Inc. pp. 353366. 140

Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) A new approach to protein fold recognition. Nature, 358, 8689. Jossinet and Westhof,E. (2005) Sequence to Structure (S2S): display, manipulate and interconnect RNA data from sequence to structure. Bioinformatics, 21, 33203321. Kell,D.B. (2004) Metabolomics and systems biology: making sense of the soup. Curr Op

Microbiol, 7, 296307. Kleywegt,G.J. (2000) Validation of protein crystal structures. Acta Cryst, D56, 249265. Kleywegt,G.J. and Jones,A.T. (1996a) Ecient rebuilding of protein structures. Acta Cryst

D, 52, 829832. Kleywegt,G.J. and Jones,T.A. (1996b) Phi/psi-chology - Ramachandran revisited. Structure,

4(12), 13951400. Konarev,P.V., Petoukhov,M.V., Volkov,V.V. and Svergun,D.I. (2006) ATSAS 2.1, a program package for small-angle scattering data analysis. J App Cryst, 39, 277286. Kong,Y., Zhang,X., Baker,T. and Ma,J. (2004a) A Structural-informatics Approach for Tracing Beta-Sheets: Building Pseudo-C-Alpha Traces for Beta-Strands in Intermediateresolution Density Maps. JMB, 1, 110. Kong,Y., Zhang,X., Baker,T.S. and Ma,J. (2004b) A Structural-informatics Approach for Tracing Beta-Sheets: Building Pseudo-C-Alpha Traces for Beta-Strands in Intermediateresolution Density Maps. J Mol Biol, 339, 117130. Kritikou,E., Pulverer,B. and Heinrichs,A. (2006) All systems go! Nat Rev Mol Cell Biol, 7, s3. Krol,M. (2003) Comparison of various implicit solvent models in molecular dynamics simulations of immunoglobulin G light chain dimer. J Comp Chem, 24, 531546. Kuriyan,J., Osapay,K., Burley,S.K., BrÃ 14 nger,A.T., Hendrickson,W.A. and Karplus,M. (1991) Exploration of disorder in protein structures by X-ray restrained molecular dynamics. Proteins, 10, 340358. Kuriyan,J., Petsko,G., Levy,R.M. and Karplus,M. (1986) Eect of anisotropy and anharmonicity on protein crystallographic renement. An evaluation by molecular dynamics. J

Mol Biol, 190, 227254. 141

Lartigue,A., Campanacci,V., Roussel,A., Larsson,A.M., Jones,T.A., Tegoni,M. and Cambillau,C. (2002) X-Ray Structure and Ligand Binding Study of a Chemosensory Protein. J

Biol Chem, 277, 3209432098. Laskowski,R.A., MacArthur,M.W., Moss,D.S. and Thornton,J.M. (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst., 26, 283291. Lehninger,A., Nelson,D. and Cox,M. (1995) Principles of Biochemistry. Leontis,N.B., Altman,R.B., Berman,H.M., Brenner,S.E., Brown,J.W., Engelke,D.R., Harvey,S.C., Holbrook,S.R., Jossinet,F., Lewis,S.E., Major,F., Mathews,D.H., Richardson,J., Williamson,J.R. and Westhof,E. (2006) The RNA Ontology Consortium: An open invitation to the RNA community. RNA, 12, 533541. Leontis,N.B. and Westhof,E. (2001) Geometric nomenclature and classication of RNA base pairs. RNA, 7, 499512. Leontis,N.B. and Westhof,E. (2003) Analysis of RNA motifs. Curr Op Struct Biol, 13, 300308. Leopold,P.E., Montal,M. and Onuchic,J.N. (1992) Protein folding funnels: a kinetic approach to the sequence-structure relationship. Proc Natl Acad Sci, 89, 87218725. Levinthal,C. (1968) Are there pathways for protein folding? Journal de Chimie Physique et

de Physico-Chimie Biologique, 65, 44. Levitt,M., Gerstein,M., Huang,E., Subbiah,S. and Tsai,J. (1997) Protein Folding: The Endgame. Annu. Rev. Biochem, 66, 54979. Lindahl,E., Hess,B. and van der Spoel,D. (2001) GROMACS 3.0: A package for molecular simulation and trajectory analysis. J. Mol. Mod., 7, 306317. Lovell,S., Davis,I., III,W.A., de Bakker,P., Word,J., Prisant,M., Richardson,J., and Richardson,D. (2003) Structure Validation by CA Geometry: P,S and CB Deviation. Proteins:

Structure, Function and Genetics, 50, 437450. Lovell,S., Word,J., Richardson,J. and Richardson,D. (2000) The Penultimate Rotamer Library. Proteins: Structure Function and Genetics, 40, 389408. Ludtke,S.J., Baldwin,P.R. and Chiu,W. (1999) EMAN: semiautomated software for highresolution single-particle reconstructions. J Struct Biol, 128, 8297. 142

Maeyer,M.D., Desmet,J. and Lasters,I. (1997) All in one: a highly detailed rotamer library improves both accuracy and speed in the modelling of sidechains by dead-end elimination.

Fold Des, 2, 5366. Malathi,R. and Yathindra,N. (1985) Backbone conformation in nucleic acids: an analysis of local helicity through heminucleotide scheme and a proposal for a unied conformational plot. J Biomol Struct Dyn, 3, 127144. Mathews,D., Sabina,J., Zuker,M. and Turner,H. (1999) Expanded Sequence Dependence of Thermodynamic Parameters Provides Robust Prediction of RNA Secondary Structure. J

Mol Biol, 288, 911940. Mathews,D.H., Disney,M.D., Childs,J.L., Schroeder,S.J., Zuker,M. and Turner,D.H. (2004) Incorporating chemical modication constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Nat Acad Sci, 101, 72877292. Mecozzi,S., Jr.,A.P.W. and Dougherty,D.A. (1996) Cation-pi interactions in aromatics of biological and medicinal interest: Electrostatic potential surfaces as a useful qualitative guide. Proc Natl Acad Sci, 93, 1056610571. Miller,W., Makova,K.D., Nekrutenko,A. and Hardison,R.C. (2004) Comparative Genomics.

Annual Review of Genomics and Human Genetics, 5, 1556. Moat,K. (1998) Time-Resolved Crystallography. Acta Cryst, A54, 833841. Moult,J. and James,M.N.G. (1986) An algorithm for determining the conformation of polypeptide segments in proteins by systematic search. Proteins: Struct., Func., Genet.,

1, 146163. Murray,L.J.W., III,W.B.A., Richardson,D.C. and Richardson,J.S. (2003) RNA backbone is rotameric. PNAS, 100, 1390413909. Murshudov,G.N., Vagin,A.A. and Dodson,E.J. (1997) Renement of Macromolecular Structures by the Maximum-Likelihood Method. Acta Cryst, D53, 240255. Ohlendorf,D.H. (1994) Accuracy of Rened Protein Structures. II. Comparison of Four Independently Rened Models of Human Interleukin. Acta Cryst, D50, 808812. Park,B.H. and Levitt,M. (1995) The complexity and accuracy of discrete state models of protein structure. J. Mol. Biol., 249, 493507.

143

Pauling,L., Corey,R.B. and Branson,H.R. (1951) The structures of proteins: two hydrogenbonded helical congurations of the polypeptide chain. Proc Natl Acad Sci, 37, 205211. Pellegrini,M., Gronbech-Jensen,N., Kelly,J.A., Puegl,G.M. and Yeates,T.O. (1997) Highly Constrained Multiple-Copy Renement of Protein Crystal Structures. Proteins, 29, 426 432. Perrakis,A., Sixma,T.K., Wilson,K.S. and Lamzin,V.S. (1997) wARP: Improvement and Extension of Crystallographic Phases by Weighted Averaging of Multiple-Rened Dummy Atomic Models. Acta Cryst, D53, 448455. Petrella,R.J. and Karplus,M. (2001) The energetics of O-rotamer Protein Side-chain Conformations. J Mol Biol, 312, 11611175. Petsko,G.A. (1996) Not just your average structures. Nat Struct Biol, 3, 565566. Ponder,J. and Case,D. (2003) Force elds for protein simulations. Adv Prot Chem, 66, 2785. Rader,S.D. and Agard,D.A. (1997) Conformational substates in enzyme mechanism: the 120 K structure of alpha-lytic protease at 1.5 A resolution. Protein Sci, 6, 13751386. Ramachandran,G.N. and Sasisekharan,V. (1968) Conformation of polypeptides and proteins.

Adv Protein Chem, 28, 283437. Raschke,T.M. (2006) Water structure and interactions with protein surfaces. Curr Op Str

Biol, 16, 152159. Read,R.J. (1986) Improved Fourier coecients for maps using phases from partial structures with errors. Acta Cryst, A42, 140149. Rejto,P. and Freer,S. (1996) Protein Conformational Substates from X-ray Crystallography.

Prog Biophys Molec Biol, 66, 167196. Richards,F.M. (1973) The interpretation of protein structures total volume, group volume distributions and packing density. J. Mol. Biol., 82, 114. Rohl,C.A. and Baker,D. (2002) De novo determination of protein backbone structure from residual dipolar couplings using Rosetta. J Am Chem Soc, 124, 27232729. Rojnuckarin,A. and Subramaniam,S. (1999) Knowledge-based interaction potentials for proteins. Proteins, 36, 5467. 144

Rooman,M.J., Kochar,J.P. and Wodak,S.J. (1991) Prediction of protein backbone conformations based on seven structure assignments. Inuence of local interactions. J Mol Biol,

221, 961979. Saibil,H. (2000) Macromolecular structure determination by cryo-electron microscopy. Acta

Cryst, D56, 12151222. Sali,A. and Blundell,T. (1993) Comparative protein modeling by satisfaction of spatial restraints. J Mol Biol, 234, 779815. Samudrala,R. and Moult,J. (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol., 275, 895916. Sasisekharan,V. and Ponnuswamy,P.K. (1971) Studies on the conformation of amino acids. X. Conformations of norvalyl, leucyl, aromatic side groups in a dipeptide unit. Biopolymers,

10, 583592. Schlick,T. (2006) RNA: The Cousin Left Behind Becomes a Star. In Computational Stud-

ies of DNA and RNA, (Sponer,J. and Lankas,F., eds),. Springer Verlag, Dordrecht, The Netherlands pp. 259281. Schneider,B., Moravek,Z. and Berman,H.M. (2004) RNA conformational classes. Nucleic

Acids Res, 32, 16661677. Schwieters,C., Kuszewski,J., Tjandra,N. and Clore,G. (2003) The Xplor-NIH NMR Molecular Structure Determination Package. J. Magn. Res., 160, 6674. Sengoku,T., Nureki,O., Nakamura,A., Kobayashi,S. and Yokoyama,S. (2006) Structural basis for RNA unwinding by the DEAD-box protein Drosophila Vasa. Cell, 125 (2), 21921. Senn,H.M. and Thiel,W. (2007) QM/MM studies of enzymes. Curr Op Chem Biol, 11, 182187. Shapiro,B.A., Wu,J.C., Bengali,D. and Potts,M.J. (2001) The massively parallel genetic algorithm for RNA folding: MIMD implementation and population variation. Bioinformatics,

17, 137148. Shapiro,B.A., Yingling,Y.G., Kasprzak,W. and Bindewald,E. (2007) Bridging the gap in RNA structure prediction. Curr Op Struct Biol, 17, 157165. Shetty,R.P., de Bakker,P.I., DePristo,M.A. and Blundell,T.L. (2003) Advantages of negrained side chain conformer libraries. Protein Eng, 16, 963969. 145

Shi,H. and Moore,P.B. (2000) The crystal structure of yeast phenylalanine tRNA at 1.93 A resolution: A classic structure revisited. RNA, 6, 10911105. Shi,J., Blundell,T. and Mizuguchi,K. (2001) FUGUE: sequence-structure homology recognition using environment-specic substitution tables and structure- dependent gap penalties.

J. Mol. Biol., 310, 243257. Siek,J.G., Lee,L.Q. and Lumsdaine,A. (2002) The boost graph library user guide and reference

manual. Addison-Wesley Longman Publishing Co. Inc. Boston, MA, USA. Simons,K., Ruczinski,I., Kooperberg,C., Fox,B., Bystro,C. and Baker,D. (1999) Improved Recognition of Native-Like Protein Structures Using a Combination of SequenceDependent and Sequence-Independent Features of Proteins. Proteins: Structure, Function,

and Genetics, 34, 8295. Simons,K.T., Kooperberg,C., Huang,E. and Baker,D. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol, 268, 209225. Smith,R.E., Lovell,S.C., Burke,D.F., Montalvao,R.W., and Blundell,T.L. (2007) Andante: Reducing side-chain rotamer search space during comparative modeling using environmentspecic substitution probabilities. Bioinformatics, 23, 10991105. Steen,P., Vob,B., Rehmsmeier,M., Reeder,J. and Giegerich,R. (2006) RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics, 22, 500503. Steiner,T. and Koellner,G. (2001) Hydrogen bonds with pi-acceptors in proteins: frequencies and role in stabilizing local 3D structures. J. Mol. Biol., 305, 535557. Sussman,J.L., Holbrook,S.R., Warrant,R.W., Church,G.M. and Kim,S.H. (1978) Crystal structure of yeast phenyl alanine transfer RNA I. Crystallographic renement. J Mol

Biol, 123, 607630. Svergun and Koch (2003) Small-angle scattering studies of biological macromolecules in solution. Rep. Prog. Phys., 66, 17351782. Taylor,I.A., Treiber,M.K., Olivi,L. and Smerdon,S.J. (1997) The X-ray structure of the DNAbinding domain from the Saccharomyces cerevisiae cell-cycle transcription factor Mbp1 at 2.1 A resolution. J Mol Biol, 272, 18.

146

Terwilliger,T.C.,

Grosse-Kunstleve,R.W.,

Afonine,P.V.,

Adams,P.D.,

Moriarty,N.W.,

Zwart,P., Read,R.J., Turk,D. and Hung,L.W. (2007) Interpretation of ensembles created by multiple iterative rebuilding of macromolecular models. Acta Cryst, D63, 597610. Tuery,P., Etchebest,C. and Hazout,S. (1997) Prediction of protein side-chain conformations: a study of the inuence of backbone accuracy on conformation stability in the rotamer space. Protein Eng, 10, 361372. Tzakos,A.G., Grace,C.R.R., Lukavsky,P.J. and Riek,R. (2006) Large Proteins and RNAs in Solution. Annu Rev Biophys Biomol Struct, 35, 319342. van Heel,M., Harauz,G., Orlova,E.V., Schmidt,R. and Schatz,M. (1996) A new generation of the IMAGIC image processing system. J Struct Biol, 116, 1724. Vitkup,D., Ringe,D., Karplus,M. and Petsko,G.A. (2002) Why protein R-factors are so large: a self-consistent analysis. Proteins, 46, 345354. Voet,D. and Voet,J. (1995) Biochemistry. Waldburger,C., Schildbach,J. and Sauer,R. (1995) Are buried salt bridges important for protein stability and conformational specicity? Nat Struct Biol, 2, 122128. Walther,D. and Cohen,F.E. (1999) Conformational attractors on the Ramachandran map.

Acta Cryst D, 55, 506517. Walther,D., Cohen,F.E. and Doniach,S. (2000) Reconstruction of low-resolution threedimensional density maps from one-dimensional small-angle X-ray solution scattering data for biomolecules. J Appl Cryst, 33, 350363. Weeks,C.M. and Miller,R. (1999) The design and implementation of SnB v2.0. J App Cryst,

32, 120124. Westhof,E., Dumas,P. and Moras,D. (1988) Restrained renement of two crystalline forms of yeast aspartic acid and phenylalanine transfer RNA crystals. Acta Cryst, 44, 112123. Westhof,E. and Sundaralingam,M. (1986) Restrained renement of the monoclinic form of yeast phenylalanine transfer RNA: Temperature factors and dynamics, coordinated waters, and base-pair propeller twist angles. Biochemistry, 25, 48684878. Wilson,M.A. and Brunger,A.T. (2000) The 1.0 A crystal structure of Ca(2+)-bound calmodulin: an analysis of disorder and implications for functionally relevant plasticity. J Mol

Biol, 301, 12371256. 147

Winn,M.D., Isupovb,M.N. and Murshudov,G.N. (2001) Use of TLS parameters to model anisotropic displacements in macromolecular renement. Acta Cryst, D57, 122133. Wlodek,S., Skillman,A.G. and Nicholls,A. (2006) Automated ligand placement and renement with a combined force eld and shape potential. Acta. Cryst., D62, 741749. Word,J.M., Lovell,S.C., LaBean,T.H., Taylor,H.C., Zalis,M.E., Presley,B.K., Richardson,J.S. and Richardson,D.C. (1999) Visualizing and Quantifying Molecular Goodness-of-Fit: Small-probe Contact Dots with Explicit Hydrogen Atoms. J. Mol. Biol., 285, 17111733. Xayaphoummine,A., Bucher,T. and Isambert,H. (2005) Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic Acid Res,

33, 605610. Xu,J. (2005) Rapid Protein Side-Chain Packing via Tree Decomposition. Lecture Notes in

Computer Science, 3500, 423439. Yang,J.S., Chen,W.W., Skolnick,J. and Shakhnovich,E.I. (2007) All-Atom Ab Initio Folding of a Diverse Set of Proteins. Structure, 15, 5363. Yang,Y. and Liu,H. (2006) Genetic algorithms for protein conformation sampling and optimization in a discrete backbone dihedral angle space. J Comp Chem, 27, 15931602. Yingling,Y. and Shapiro,B. (2006) The prediction of the wild-type telomerase RNA pseudoknot structure and the pivotal role of the bulge in its formation. J Mol Graph Model, 25, 261274. Yu,B., Blaber,M., Gronenborn,A.M., Clore,G.M. and Caspar,D.L. (1999) Disordered water within a hydrophobic protein cavity visualized by x-ray crystallography. Proc Natl Acad

Sci, 96, 103108. Zhang,Y. and Skolnick,J. (2005) The protein structure prediction problem could be solved using the current PDB library. PNAS, 102, 10291034. Zoete,V., Michielin,O. and Karplus,M. (2002) Relation between sequence and structure of HIV-1 protease inhibitor complexes: a model system for the analysis of protein exibility.

J Mol Biol, 315, 2152. Zuker,M., Mathews,D. and Turner,D. (1999) Algorithms and Thermodynamics for RNA Secondary Structure Prediction. In A Practical Guide in RNA Biochemistry and Biotechnol-

ogy, (Barciszewski,J. and Clark,B., eds),. NATO ASI Series, Kluwer Academic Publishers. 148

Zwieb,C. and Muller,F. (1997) Three-dimensional comparative modeling of RNA. Nucleic

Acids Symp Ser, 36, 6971.

149

Applications of Homogeneous Functions to Geometric Inequalities ...