Modeling Macromolecules

Overview

The purpose of these experiments is to familiarize you with the techniques of modeling and to give you an understanding of the strengths and limitations of the process. The goal is for you to be able to critically examine a modeling study and to initiate a modeling study on a problem of your own. While the program AMMP will be used in the experiments, learning to use this program is not a major point of the lessons. You will not be tested on AMMP. There are many programs available for modeling, but despite the advertising copy they are very similar in choice of algorithms and quality of results.

The focus of these experiments is on homology or similarity modeling. The utility of homology modeling and the general approach to it are well established (although there is considerable controversy over the details). Other aspects of modeling, such as threading or ab initio folding, are still experimental and so will not be treated in these lectures.

The experiments

You will perform three experiments. To perform them you will need an Intel PC running Windows (the programs have been tested on Windows 95 through XP, but not on Vista). Other versions of AMMP will run the scripts, but most are not (yet) graphical. The Linux graphical version will run these scripts.

The first is modeling two similar proteins from each other. BPTI and Dendrotoxin have similar structures. BPTI will be modeled from Dendrotoxin and Dendrotoxin from BPTI. This is an example of a straightforward homology or similarity model. Several modeling algorithms will be available, and the purpose of this experiment is to see how they perform and what kinds of errors are seen in the results.

The second experiment is to model two dissimilar proteins from each other. Protein G and BPTI are similar in size, but not homologous. The same modeling approaches used in the first experiment will be used here. Do they work? How can you tell if they did?

The third experiment is for you to model a protein of your own choice. It can be one that you are working on or one which is interesting for some other reason. Find the homologous structures from the PDB and set up to model the protein.

Accuracy of Structure

The quality of the results is a critical question in modeling. If we're not interested in accurate results, we can model anything. However, if we want to use the results to propose experiments (and this is the goal of theory) or understand biological results, then it is critical that we actually aim for an accurate model.

Evaluating the difference between two structures, or the accuracy of a model, is not trivial. For example, at the Critical Assessment of Structure Prediction meetings (CASP1 through CASP7), most of the discussion centered not on the differences between modeling methods but on how they were evaluated. This is still not settled. For the purpose of these experiments, a simple RMS deviation and graphical examination of the differences will suffice.
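
A minimal sketch of the RMS deviation mentioned above, in Python. It assumes the two coordinate sets are already superimposed and list the same atoms in the same order. The function name and the sample coordinates are illustrative and are not part of AMMP.

```python
import math


def rms_deviation(coords_a, coords_b):
    """RMS deviation between two equally long lists of (x, y, z) tuples.

    Assumes the structures are already superimposed and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates in angstroms, for illustration only.
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.5, 0.3, 0.0), (3.0, 0.0, 0.4)]
print(f"RMS deviation: {rms_deviation(model, target):.3f} A")
```

A single number like this hides where the differences are, which is why the graphical examination is still needed.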

Two factors enter into accuracy. First is the obvious question of how close our result is to the experimental structure. However, it is equally important to understand the accuracy of the experiment. If the experiment has an expected error of 0.5 Å and our deviation from the experiment is 0.5 Å, then we are essentially correct. On the other hand, if the experiment has an expected error of 0.1 Å and our error is 0.5 Å, then we have some real errors.

Real errors in protein structure depend on several factors. First of all is the method of structure determination. NMR structures tend to be less accurate than crystal structures simply because the NMR data are sparser than X-ray data (NMR data only relate distances between hydrogen atoms, rather than all atoms). In crystal structures, high-resolution structures have less inherent error than low-resolution structures because the atomic positions are better defined. However, except at the very highest resolutions (about 1 Å), the atomic positions are not determined solely by the diffraction data. The crystal lattice can also affect the accuracy of the structure. The same protein crystallized in different crystal forms will have small but real differences in structure (note: the overall structure of the protein will remain the same). Examination of differences between high-resolution crystal structures of proteins has shown expected errors of 0.4-0.5 Å. Differences between NMR structures and crystal structures have shown expected errors of 0.75-1.5 Å.
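
The comparison described in the two paragraphs above can be sketched as a small check. The expected-error ranges are the ones quoted in the text. Using the upper end of each range as the threshold is an assumption made for illustration, as are the names `EXPECTED_ERROR_ANGSTROM`, `has_real_errors` and `judge_model`.

```python
# Expected errors quoted in the text, in angstroms (low end, high end).
EXPECTED_ERROR_ANGSTROM = {
    "crystal_vs_crystal": (0.4, 0.5),
    "nmr_vs_crystal": (0.75, 1.5),
}


def has_real_errors(model_deviation, expected_error):
    """True when the model deviates by more than the experiment's own error."""
    return model_deviation > expected_error


def judge_model(model_deviation, comparison):
    """Compare a deviation against the high end of a quoted range (assumption)."""
    _low, high = EXPECTED_ERROR_ANGSTROM[comparison]
    return has_real_errors(model_deviation, high)


print(has_real_errors(0.5, 0.5))  # deviation equals experimental error
print(has_real_errors(0.5, 0.1))  # deviation well above experimental error
print(judge_model(1.2, "crystal_vs_crystal"))
print(judge_model(1.2, "nmr_vs_crystal"))
```

The same 1.2 Å deviation is judged differently depending on how the reference structure was determined, which is the point of the discussion above.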

Papers about structural accuracy

  1. Flores, T.P., Orengo, C.A., Moss, D.S. and Thornton, J.M. Comparison of conformational characteristics in structurally similar protein pairs. Prot. Sci. 2, 1811-1826, 1993.
  2. Zegers, I., Maes, D., Dao-Thi, M.-H., Poortmans, F., Palmer, R. and Wyns, L. The structures of RNase A complexed with 3'CMP and d(CpA): active site conformation and conserved water molecules. Prot. Sci. 3, 2322-2339, 1994.
  3. Smith, L.J., Redfield, C., Smith, R.A.G., Dobson, C.M., Clore, G.M., Gronenborn, A.M., Walter, M.R., Naganbushan, T.L., Wlodawer, A. Comparison of four independently determined structures of human recombinant interleukin-4. Nature Struct. Biol. 1, 301-310, 1994.
  4. Chothia, C. and Lesk, A.M. The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823-826, 1986.
  5. Hilbert, M., Bohm, G. and Jaenicke, R. Structural relationships of homologous proteins as a fundamental principle in homology modeling. Proteins: Struct. Funct. Genet. 17, 138-151, 1993.

Experiment 1: Modeling dendrotoxin and BPTI

BPTI and Dendrotoxin are small proteins with moderate homology. Despite quite different biological functions (BPTI is a serine protease inhibitor, and Dendrotoxin is a neurotoxin which blocks ion channels) these proteins are quite similar in structure.
[Figures: BPTI and Dendrotoxin]

The sequence alignment between the two proteins, shown below, shows moderate homology. However, notice that all of the Cysteine residues are conserved, which suggests (correctly) that the disulfide bonds are conserved between the two proteins. Conservation of sequence patterns, like this one, can suggest homology even where it is otherwise weak. The conserved patterns are not always cysteines, but can be patterns of hydrophobic residues or active site residues.
Dendrotoxin  pRrklCilhrnpGrCydkIpafyYNqKkkqCerFdwsGCggnsNrFKtiEeCrRyCig*
BPTI         *RpdfCleppytGpCkarIiryfYNaKaglCqtFvygGCrakrNnFKsaEdCmRtCgga
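
As a rough sketch, the alignment above can be checked in Python for overall identity and for cysteine conservation. The two sequences are copied from the alignment, with '*' marking a gap. The helper names are illustrative, and the comparison ignores letter case (the alignment uses upper case to highlight conserved positions).

```python
DENDROTOXIN = "pRrklCilhrnpGrCydkIpafyYNqKkkqCerFdwsGCggnsNrFKtiEeCrRyCig*"
BPTI = "*RpdfCleppytGpCkarIiryfYNaKaglCqtFvygGCrakrNnFKsaEdCmRtCgga"
GAP = "*"


def aligned_pairs(seq_a, seq_b):
    """Return (column, residue_a, residue_b) for every column without a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return [
        (column, a.upper(), b.upper())
        for column, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != GAP and b != GAP
    ]


def count_identities(seq_a, seq_b):
    """Number of gap-free columns where both sequences have the same residue."""
    return sum(1 for _, a, b in aligned_pairs(seq_a, seq_b) if a == b)


def percent_identity(seq_a, seq_b):
    """Identical columns as a percentage of the gap-free columns."""
    return 100.0 * count_identities(seq_a, seq_b) / len(aligned_pairs(seq_a, seq_b))


def cysteine_columns(seq):
    """1-based alignment columns that hold a cysteine."""
    return [col for col, res in enumerate(seq, start=1) if res.upper() == "C"]


print(f"identity: {percent_identity(DENDROTOXIN, BPTI):.1f}%")
print("cysteines conserved:",
      cysteine_columns(DENDROTOXIN) == cysteine_columns(BPTI))
```

A percent identity alone would not reveal the conserved cysteine pattern, so the two checks are kept separate.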

When running this experiment it is important to look at the conserved residues, the surface residues, the tyrosine and tryptophan residues which form the core structure, and the N and C-terminal residues. The strand from residues 15 to 22 is also worth examining for changes.

Required Programs

Required files

Required Command files

Procedure

AMMP-specific procedures are colored green in order to differentiate them from generic procedures which any program might perform. The AMMP help file is also available online and may be useful where this writeup is obscure.
  1. Retrieve the files
    You can't run the examples without the program or the program files.
  2. Build Dendrotoxin from BPTI
    Start AMMP. Read in the file bptidtx.ammp (use the file menu item, Input AMMP file). Open the display window (use the Draw menu item). You will see the atoms in common between BPTI and dendrotoxin, with many bonds drawn to what appears to be an arbitrary point (which happens to be 0,0,0). These are the atoms which have to be placed to build a complete model.

    Two approaches to building these new atoms should be used for this experiment, and you should monitor the differences in the solutions. Torsion search algorithms are quite popular and are implemented in the script Tsearch.ammp. An attractive alternative is to use distance geometry which is implemented in dgeom.ammp. Since it is important to understand the limits of modeling, as opposed to the limits of a particular algorithm, it is important to run both of these procedures and compare the results.

    Using Tsearch.ammp. Tsearch.ammp builds a structure with highly ideal bond-angle covalent geometry and then searches the side chain torsion angles to find the energy minima. Select the AMMP text window by clicking on it. Either use the file menu item to input tsearch.ammp or directly enter "read tsearch.ammp;". The procedure can be watched from the draw window (use the draw menu entry to open this).

    The molecule can be moved about in the draw window with either the mouse or the arrow keys. The size of the molecule will appear to hop about because of the "autoscaling" feature. You probably want to turn this off using the Controls|toggle autoscale menu item. While you're at it, you should play with the other controls to see what they do. You can select an atom by double-clicking on it; the atom name and serial number are then displayed. Once you've clicked on two atoms, the Geometry|distance menu will give the distance between them. The angle is available once three atoms have been selected. Other items worth exploring are the View|color scheme (force coloring is a good way to see why a structure is not approaching an energy minimum), the Geometry|analyze feature (which gives detailed listings of errors in geometry), and the Geometry|Show tether feature (which draws a purple line between each atom and the tether value associated with it). The size of the molecule and the "slab" or depth of view are controlled by changing the control menu item x-y to scale-slab. Use the size (horizontal control) to zoom in, and the slab (vertical control) to narrow the view so that only a limited region of the molecule is visible. The draw window can be closed and re-opened with no effect on the main process.

    Using Dgeom.ammp. Dgeom.ammp performs a distance geometry calculation to build the missing atoms. This is the complement or dual of the torsion search. Instead of building a highly ideal structure and then finding the best packing, dgeom.ammp builds a well-packed structure and tries to find the closest ideal structure. If you have already performed the tsearch experiment, open another AMMP window by running AMMP again and read in the file bptidtx.ammp. Select the AMMP text window by clicking on it. Either use the file menu item to input dgeom.ammp or directly enter "read dgeom.ammp;". The procedure can be watched from the draw window (use the draw menu entry to open this).

    Cleaning up the structure. The script polish.ammp performs a short run of energy minimization with the whole potential. The use of energy minimization on a homology model is somewhat controversial, as one school of thought holds that the best model is obtained simply by replacing the side chains on residues in the conserved regions. However, homology modeling provides a valuable test case for the development of potential energy force fields, and therefore declining to perform energy minimization because some molecular mechanics force fields are poor is an unscientific choice.

    Saving the results. While multiple copies of AMMP can be run under Windows, you may want to save the results for later study. Use the File|output Ammp file menu item. It is a good idea to choose a new, unique name for the output.

    Analyzing the results. The script rms.ammp superimposes the structure on the experimental coordinates of the target. The coordinates are stored in the tether data structure of AMMP. This script uses a genetic algorithm to superimpose the structures. For highly similar structures, the genetic algorithm is somewhat wasteful, but for dissimilar structures it is an appropriate algorithm. After the superposition, you should examine the detailed differences in structure. Use the Geometry|show tether menu item to display the tethers. A purple line is drawn between each atom and the position of the atom in the dendrotoxin target.

    Note where the largest errors are (hint: look at the side chains on the surface and at the N- and C-termini). Are they significant? Why or why not? Turn on the view|color scheme|force menu item. Do the errors correlate with the force (which is proportional to the residual error in the structure)? Do you see the same errors in both the distance geometry and torsion search models?

  3. Build BPTI from Dendrotoxin
    Start AMMP. Read in the file dtxbpti.ammp (use the file menu item, Input AMMP file). Open the display window (use the Draw menu item). You will see the atoms in common between BPTI and dendrotoxin, with many bonds drawn to what appears to be an arbitrary point (which happens to be 0,0,0). These are the atoms which have to be placed to build a complete model.

    Two approaches to building these new atoms should be used for this experiment, and you should monitor the differences in the solutions. Torsion search algorithms are quite popular and are implemented in the script Tsearch.ammp. An attractive alternative is to use distance geometry which is implemented in dgeom.ammp. Since it is important to understand the limits of modeling, as opposed to the limits of a particular algorithm, it is important to run both of these procedures and compare the results.

    Using Tsearch.ammp. Tsearch.ammp builds a structure with highly ideal bond-angle covalent geometry and then searches the side chain torsion angles to find the energy minima. Select the AMMP text window by clicking on it. Either use the file menu item to input tsearch.ammp or directly enter "read tsearch.ammp;". The procedure can be watched from the draw window (use the draw menu entry to open this).

    The molecule can be moved about in the draw window with either the mouse or the arrow keys. The size of the molecule will appear to hop about because of the "autoscaling" feature. You probably want to turn this off using the Controls|toggle autoscale menu item. While you're at it, you should play with the other controls to see what they do. You can select an atom by double-clicking on it; the atom name and serial number are then displayed. Once you've clicked on two atoms, the Geometry|distance menu will give the distance between them. The angle is available once three atoms have been selected. Other items worth exploring are the View|color scheme (force coloring is a good way to see why a structure is not approaching an energy minimum), the Geometry|analyze feature (which gives detailed listings of errors in geometry), and the Geometry|Show tether feature (which draws a purple line between each atom and the tether value associated with it). The size of the molecule and the "slab" or depth of view are controlled by changing the control menu item x-y to scale-slab. Use the size (horizontal control) to zoom in, and the slab (vertical control) to narrow the view so that only a limited region of the molecule is visible. The draw window can be closed and re-opened with no effect on the main process.

    Using Dgeom.ammp. Dgeom.ammp performs a distance geometry calculation to build the missing atoms. This is the complement or dual of the torsion search. Instead of building a highly ideal structure and then finding the best packing, dgeom.ammp builds a well-packed structure and tries to find the closest ideal structure. If you have already performed the tsearch experiment, open another AMMP window by running AMMP again and read in the file dtxbpti.ammp. Select the AMMP text window by clicking on it. Either use the file menu item to input dgeom.ammp or directly enter "read dgeom.ammp;". The procedure can be watched from the draw window (use the draw menu entry to open this).

    Cleaning up the structure. The script polish.ammp performs a short run of energy minimization with the whole potential.

    Saving the results. While multiple copies of AMMP can be run under Windows, you may want to save the results for later study. Use the File|output Ammp file menu item. It is a good idea to choose a new, unique name for the output.

    Analyzing the results. The script rms.ammp superimposes the structure on the experimental coordinates of the target. The coordinates are stored in the tether data structure of AMMP. This script uses a genetic algorithm to superimpose the structures. For highly similar structures, the genetic algorithm is somewhat wasteful, but for dissimilar structures it is an appropriate algorithm. After the superposition, you should examine the detailed differences in structure. Use the Geometry|show tether menu item to display the tethers. A purple line is drawn between each atom and the position of the atom in the BPTI target.

    Note where the largest errors are (hint: look at the side chains on the surface and at the N- and C-termini). Are they significant? Why or why not? Turn on the view|color scheme|force menu item. Do the errors correlate with the force (which is proportional to the disagreement between ideal molecular geometry and the molecular geometry in the current model)? Do you see the same errors in both the distance geometry and torsion search models?

    Questions

    • What errors do you see?
    • Do the errors relate to the local environment of the side chain?
    • Did the two algorithms produce drastically different structures?
    • Were errors in the placement of the peptide backbone (main-chain errors) corrected?
    • Were there any obvious symptoms of error in the models? (i.e., in the absence of the experimental structure, could you have told where the errors were?)
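
The "note where the largest errors are" step in the procedure above can also be done numerically. The following Python sketch ranks atoms by their distance from the target position, which corresponds to the length of the purple tether line drawn for each atom. It assumes the model has already been superimposed on the target. The dictionary layout, the function name and the coordinates are hypothetical and do not reflect how AMMP stores its data.

```python
import math


def largest_deviations(model, target, top=3):
    """Return the `top` largest (distance, (residue, atom)) entries.

    `model` and `target` map (residue_number, atom_name) to (x, y, z).
    Atoms missing from the target are skipped.
    """
    deviations = []
    for key, xyz in model.items():
        if key in target:
            deviations.append((math.dist(xyz, target[key]), key))
    deviations.sort(reverse=True)
    return deviations[:top]


# Hypothetical coordinates in angstroms, for illustration only.
model = {
    (1, "CA"): (0.0, 0.0, 0.0),
    (1, "CB"): (1.5, 0.0, 0.0),
    (30, "CA"): (5.0, 5.0, 5.0),
    (58, "CA"): (9.0, 0.0, 0.0),
}
target = {
    (1, "CA"): (0.0, 0.0, 3.0),
    (1, "CB"): (1.5, 0.4, 0.0),
    (30, "CA"): (5.0, 5.0, 5.1),
    (58, "CA"): (9.0, 2.0, 0.0),
}
for distance, (residue, atom) in largest_deviations(model, target):
    print(f"residue {residue:3d} atom {atom:<3s} deviates by {distance:.2f} A")
```

In this made-up example the terminal residues dominate the list, which is the pattern the hint in the procedure suggests looking for.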

Experiment 2: Modeling Protein G and BPTI

[Figures: BPTI and Protein G]

The sequence alignment between protein G and BPTI shows poor homology. However, there are enough similar residues (Tyr|Phe, Arg|Lys,...) that with a few short gaps a decent threading alignment could be built. In fact, the proteins do share a small sub-structure.
Protein G  ***mtyklilnGktlkgettteavdaAtaekvFkqyandngvdgewtydDatkTftvte
BPTI       rpdfcleppytGpckariiryfynakAglcqtFvyggcrakrnnfksaeDcmrTcgga*
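
As a rough illustration of what "similar residues" means here, the Python sketch below counts identical and similar columns in the alignment above. The Tyr|Phe and Arg|Lys groups come from the text. The Asp|Glu and Ser|Thr groups are added as an assumption for illustration, and the function name is hypothetical.

```python
PROTEIN_G = "***mtyklilnGktlkgettteavdaAtaekvFkqyandngvdgewtydDatkTftvte"
BPTI = "rpdfcleppytGpckariiryfynakAglcqtFvyggcrakrnnfksaeDcmrTcgga*"
GAP = "*"

# Y|F and R|K are named in the text; D|E and S|T are assumed additions.
SIMILAR_GROUPS = [set("YF"), set("RK"), set("DE"), set("ST")]


def classify_columns(seq_a, seq_b):
    """Return (identical, similar, aligned) counts over gap-free columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = aligned = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == GAP or b == GAP:
            continue
        aligned += 1
        if a == b:
            identical += 1
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    return identical, similar, aligned


identical, similar, aligned = classify_columns(PROTEIN_G, BPTI)
print(f"{identical} identical and {similar} similar of {aligned} aligned columns")
```

The counts depend on which substitution groups are accepted, so a different grouping would give a different picture of the same alignment.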

Required Programs

Required files

Required Command files

Essentially you should follow the same procedure as was followed for the BPTI-Dendrotoxin experiment. IT IS NOT A GOOD IDEA TO PERFORM THIS EXPERIMENT BEFORE THE OTHER ONE. You will need to retrieve the coordinate files bptipg.ammp and pgbpti.ammp. The script files are exactly the same as for BPTI-Dendrotoxin so you should check there for any details about usage.

Questions

Experiment 3: Choosing your own problem

Now we take the training wheels off.

Required Programs

Required Command files

Required Skills

Procedure

  1. Select a protein target. Choose a protein where you don't know the structure. This can be one from your work or one that you find interesting for other reasons. While AMMP can handle quite large problems (>10000 atoms), it is probably a good idea to choose a sequence of about 100 residues (unless you can arrange to run the program overnight). Run the alignment program to find similar sequences in the Protein Data Bank where the structures are known. Ideally you will find several structures with stretches of high sequence similarity and few insertions or deletions. The proteins should also be nearly the same size, as it is unlikely that part of a large protein will be the correct fold for a small protein unless it is a separately folded domain.

    Use your judgement on the results. Are the alignments like the BPTI-Dendrotoxin alignment or more like the BPTI-Protein G alignment?

    If you can't find a match, or have no idea for an interesting protein, run the sequence search in reverse. For example, take the sequence of hen egg white lysozyme and find a sequence from another species. A few more exciting examples are Human growth hormone (pdb1hgu.ent), Epidermal Growth factor (pdb1egf.ent), GCSF (pdb1bgc.ent), Beta Nerve growth factor (pdb1bet.ent), CD2 (pdb1cdb.ent), CD4 (pdb3cd4.ent) and Protein G (pdb1pga.ent).

  2. Change the sequence. The program sequence_remap can change the sequence from one protein to another. The sequence will need to be changed from the starting model to the target. You will need to supply an alignment which consists of consecutive lines of amino acid sequence (using single letter codes). The first line consists of the target sequence and the second line contains the known sequence. Every other (non-blank) line corresponds to one protein. Insertions and deletions are specified with '*' in the appropriate sequence (many programs supply '.'). Sequence_remap will align the known sequence against the sequence found in the pdb file and (try to) correct for missing residues and database errors.

    The alignment file for BPTI-Dendrotoxin is:
    pRrklCilhrnpGrCydkIpafyYNqKkkqCerFdwsGCggnsNrFKtiEeCrRyCig*
    *RpdfCleppytGpCkarIiryfYNaKaglCqtFvygGCrakrNnFKsaEdCmRtCgga

    Edit the pdb entry file to extract the particular chain in a multi-chain entry, or the average coordinate set in an NMR structure entry. While we usually include ligands and water molecules when generating homology models, these should be removed here in order to make the generation of the molecular geometry file straightforward. While multiple chains can easily be handled in AMMP, the writeup below describes how to handle a single chain.

    Edit the output file and replace the CYS with CSS for any cysteine in a disulfide bond. The preammp dictionary differentiates between CYS and CSS.

    Note how many residues there are in the chain, because you will need this later.

  3. Generate the molecular geometry file. Prewin is the program which converts a Cartesian atomic description of the structure into the molecular geometry file which AMMP reads. It is necessary to match up the atoms with the appropriate terms in the potential set, and ALL molecular modeling programs do this in one way or another.

    Prewin first requests a parameter file. Select the file named "atoms.sp4". It then asks to find a template. The template defines the atoms and bonds in each type of amino acid. The templates should be in a directory "protein", but could be somewhere else depending on how they were unzipped. There will be many files in this directory, but you should see files named like "ALA", "ASP", ..., and "TRP". Open any file, because all that is being extracted is the directory name. It then asks for a pdb file. This is the file you just generated with Sequence_remap. Open it. Finally it asks for an output file. Supply a new file name ending in ".ammp" or ".amp".

    The program should run silently, without warnings or error messages. You may receive a message about "something.OXT" not being in the geometry. This means the C-terminal residue has both oxygen atoms of the terminal acid. If that is the case you will have to add the geometry terms to link that atom into the structure (which is very easy to do).

    Linking the chain. The individual amino acid residues are not joined at this stage. It is necessary to link them. The script linkme.ammp will link all the residues. However, you must set the variable iresm. If there are N amino acids then set the value to N+1 with the command "seti iresm value;" (e.g. seti iresm 101; for a 100 residue structure). Alternatively, explicitly setting the variable ires to a value and the variable jres to a value, and then reading the script "peplink.amp" (in the protein directory with the templates), will link residues ires and jres. Linkme.ammp will also attempt to link an oxt group at the C-terminus.

    Disulfides must be specified by hand. Set the variable ires to one cysteine and the variable jres to the other and read "sslink.amp" (also in the protein directory). "seti ires 100; seti jres 110; read protein/sslink.amp;" will generate the disulfide between cysteines 100 and 110.

    The terminal acid group can be linked, if necessary, by using the script "oxtlink.amp" (again in the protein directory with the templates). Set the variable ires to the last residue and read oxtlink.amp. "seti ires 100; read protein/oxtlink.amp;" links the C-terminal residue 100. If you used the linkme.ammp script and the variable iresm was properly set, this routine will not be needed.

    All of the linking scripts check for valid atoms. You cannot create a disulfide bond between two non-sulfur atoms, nor improperly link in a nonexistent oxt atom.

    It is a good idea to save the file at this point, because generating all these linkages is a bit of a pain. You don't really want to do it again. The main window "file|output AMMP file" menu is the easiest way to do this.

  4. Build the model. Use the command files tsearch.ammp or dgeom.ammp which you used before. You only need to use one of them, and the choice is yours. The command files use a variable to determine the range of amino acids to build. You must set this variable. If there are N amino acids then set the value to N+1 with the command "seti iresm value;" (e.g. seti iresm 101; for a 100 residue structure).
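
Some of the bookkeeping in the procedure above (renaming disulfide cysteines to CSS, counting residues, and writing out the linking commands) can be sketched in Python. This is an illustration only and is not part of the AMMP distribution. It assumes standard fixed-column PDB ATOM records, a single chain numbered from 1, and that linkme.ammp is read in the same way as the other scripts. The function names and the demo records are hypothetical.

```python
def mark_disulfide_cysteines(pdb_lines, disulfide_residues):
    """Rename CYS to CSS for the listed residue numbers in ATOM records."""
    fixed = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[17:20] == "CYS":
            if int(line[22:26]) in disulfide_residues:
                line = line[:17] + "CSS" + line[20:]
        fixed.append(line)
    return fixed


def count_residues(pdb_lines):
    """Count distinct residues (chain, number plus insertion code)."""
    seen = set()
    for line in pdb_lines:
        if line.startswith("ATOM"):
            seen.add((line[21], line[22:27]))
    return len(seen)


def linking_commands(n_residues, disulfide_pairs):
    """AMMP commands as quoted in the text: iresm is set to N+1."""
    commands = [f"seti iresm {n_residues + 1};", "read linkme.ammp;"]
    for first, second in disulfide_pairs:
        commands.append(
            f"seti ires {first}; seti jres {second}; read protein/sslink.amp;"
        )
    return commands


def demo_atom(serial, name, resname, chain, resnum):
    """Build a fixed-column ATOM record with dummy coordinates (demo only)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, resname, chain, resnum, 0.0, 0.0, 0.0)


# Hypothetical four-residue chain with a disulfide between residues 1 and 3.
records = [
    demo_atom(1, " N", "CYS", "A", 1),
    demo_atom(2, " N", "ALA", "A", 2),
    demo_atom(3, " N", "CYS", "A", 3),
    demo_atom(4, " N", "CYS", "A", 4),
]
fixed = mark_disulfide_cysteines(records, {1, 3})
for command in linking_commands(count_residues(fixed), [(1, 3)]):
    print(command)
```

The residue numbers passed to the linking commands must match the numbering in the file that AMMP actually reads, so check them against the Prewin output before relying on a script like this.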