---
title: "Protein Folding with Homology Simulations"
date: 2018-08-12
tags: tech biology simulations
categories: tech
---

This article is based on research I did as a freshman undergrad; see the [full project
here](http://2018.igem.org/Team:Washington/Model).

![protein](protein.gif)

Proteins naturally fold into their lowest energy configuration, just like a ball rolling down a
hill to a lower potential. This folded structure gives proteins their function, much as a key's
shape, not the metal it is made of, determines which lock it can open.
In theory, it is possible to calculate how a protein will fold using only the
electrostatic attraction between all of the protein's atoms. However, modeling this process
on an atom-by-atom basis is NP-hard, meaning that it cannot easily be scaled to average-size
proteins without a supercomputer.

![protein](protein-surf.png)

[Rosetta][5] is a software suite for modeling and simulating complex proteins, designed to help
solve such problems. Rosetta reduces the computational complexity of this problem by dividing the
sequence into fragments and substituting fragments of common folding patterns into the protein. It
then samples random conformations in space and uses a Monte Carlo algorithm to find the lowest
energy conformation. Essentially, random changes to the conformation are tried in
an iterative process to reduce the free energy until optimal conformations are obtained. If a change reduces the free
energy, Rosetta continues from the new conformation. If the change increases the free energy, the
simulation reverts to the old conformation.

[5]: https://www.rosettacommons.org/

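The accept-or-revert loop described above can be sketched in a few lines of Python. This is a toy illustration, not Rosetta code: `toy_energy` and `toy_perturb` are made-up stand-ins for a real energy function and a real conformational move, and the "conformation" is reduced to a single number. As far as I know, Rosetta's actual sampler uses the Metropolis criterion, which also accepts some energy-increasing moves so the search can escape local minima; the sketch keeps the simpler rule from the paragraph above.

```python
import random


def greedy_monte_carlo(energy, perturb, start, steps=1000, seed=0):
    """Try random changes and keep only the ones that lower the energy."""
    rng = random.Random(seed)
    current = start
    current_energy = energy(current)
    for _ in range(steps):
        candidate = perturb(current, rng)
        candidate_energy = energy(candidate)
        if candidate_energy < current_energy:
            # The change reduced the energy: continue from the new state.
            current, current_energy = candidate, candidate_energy
        # Otherwise the candidate is discarded, i.e. we revert to the old state.
    return current, current_energy


def toy_energy(x):
    """Hypothetical energy landscape with a single minimum at x = 2."""
    return (x - 2.0) ** 2


def toy_perturb(x, rng):
    """Hypothetical move: nudge the state by a small random amount."""
    return x + rng.uniform(-0.5, 0.5)


best, best_energy = greedy_monte_carlo(toy_energy, toy_perturb, start=10.0)
```

Because uphill moves are never accepted, this version can only walk downhill, which is fine for the one-minimum toy landscape but would get stuck on a rugged one.
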
Homology modeling can be used to predict the structure of a protein from just
its amino acid sequence very quickly and accurately. Rosetta has built-in
homology modeling support through RosettaCM. Homology modeling uses the known
structures of proteins with similar sequences to see how differences from the
known structure will affect the structure of the protein you want. Because
you rely on previously mapped protein structures, the quality and accuracy of the
models you produce will depend on how similar the structures you find are to
your protein of interest. More similar proteins with fewer addition gaps (places
where the target has residues that the known structure lacks) lead to
faster and more accurate results.

![protein](protein-close.png)

The first step is to make a FASTA file for your target sequence. This is simple
and can be done manually. An example is shown below. Change `name` to the name
of your protein; this should match the name of your file (e.g. `5vnv.fasta`).
The amino acid sequence goes on the lines after the `>name` header.

```
>name

```

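If you would rather script this step, here is a minimal sketch of producing the same kind of file from Python. The function names are my own invention, and the sequence in the usage comment is a made-up placeholder rather than a real protein.

```python
import textwrap


def format_fasta(name, sequence, width=60):
    """Return FASTA text: a '>name' header followed by wrapped sequence lines."""
    lines = [f">{name}"] + textwrap.wrap(sequence, width)
    return "\n".join(lines) + "\n"


def write_fasta(name, sequence, width=60):
    """Write the record to '<name>.fasta' so the file name matches the header."""
    path = f"{name}.fasta"
    with open(path, "w") as handle:
        handle.write(format_fasta(name, sequence, width))
    return path


# Usage with a placeholder sequence (not a real protein):
#   write_fasta("5vnv", "MKTAYIAKQR")  creates the file 5vnv.fasta
```
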
Next you want to get the PDB files for around 2-5 similar proteins. This can be
done using a website called [BLAST][1]. Simply enter the target protein's sequence,
select the option to only show proteins with structures already mapped, and press enter.
This will query their database of known proteins and find the ones most
similar. You want to select proteins which are most similar and have the fewest
gaps, giving priority to the ones with the fewest addition gaps: Rosetta has a
harder time adding new amino acids than deleting them or swapping residues out.
So it is commonly a good
idea to select sequences which have a lower similarity score but have fewer gaps
and require fewer additions. Depending on the protein you are trying to simulate,
there might be several options which are very similar, in which case fewer than 5 proteins
would be needed.

[1]: https://blast.ncbi.nlm.nih.gov/Blast.cgi

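One way to make the selection rule above concrete is to sort the candidates by how many residues would have to be added first, and only then by similarity. The sketch below is just one possible weighting, not a rule from Rosetta or BLAST; the `Hit` fields, the PDB IDs, and all the numbers are hypothetical and would have to be filled in by hand from the search results.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    pdb_id: str       # hypothetical identifier of a candidate structure
    identity: float   # percent identity to the target sequence
    additions: int    # target residues with no counterpart in this candidate


def pick_templates(hits, max_templates=5):
    """Prefer candidates needing the fewest additions, then the most similar."""
    ranked = sorted(hits, key=lambda hit: (hit.additions, -hit.identity))
    return ranked[:max_templates]


# Made-up candidates for illustration only.
hits = [
    Hit("AAAA", 92.0, 14),
    Hit("BBBB", 85.0, 2),
    Hit("CCCC", 88.0, 2),
    Hit("DDDD", 70.0, 0),
]
chosen = pick_templates(hits, max_templates=3)
```

Here the most similar candidate, `AAAA`, is dropped because it needs the most additions, which is the trade-off described above.
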
![protein](protein-lines.png)

Once you have selected your similar proteins, you need to use another website
called the [Protein Data Bank][2] to download the PDB files you want. However, a
PDB file from the Protein Data Bank most likely contains chains you do not care
about, so you have to remove them for Rosetta to understand which chains you
want to simulate. Luckily, Rosetta comes with a Python script called
`clean_pdb.py` to do exactly this. Simply give this script the PDB file name
and the chain you want to keep. The script will also produce a FASTA file for
the sequence of the chain, which is needed in the next step.

[2]: http://www.rcsb.org/

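To show what "keeping one chain" means at the file level, here is a small sketch that filters PDB lines by chain. It is not how `clean_pdb.py` is implemented and does far less than that script; it only relies on the fact that in the PDB format the chain identifier of an `ATOM` record sits in column 22. The helper `atom_line` and the records it builds are hypothetical, shortened demo data without coordinates.

```python
def keep_chain(pdb_lines, chain_id):
    """Keep only ATOM records whose chain identifier (column 22) matches."""
    kept = []
    for line in pdb_lines:
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    return kept


def atom_line(serial, atom, residue, chain, number):
    """Build a shortened ATOM record (no coordinates) for this demo only."""
    return f"ATOM  {serial:>5} {atom:<4} {residue:>3} {chain}{number:>4}"


# Made-up records: two atoms in chain A, one in chain B, one water (HETATM).
demo = [
    atom_line(1, " N", "MET", "A", 1),
    atom_line(2, " CA", "MET", "A", 1),
    atom_line(3, " N", "GLY", "B", 1),
    atom_line(4, " O", "HOH", "A", 2).replace("ATOM  ", "HETATM"),
]
chain_a = keep_chain(demo, "A")
```
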
Yet another website, [Clustal Omega][3], is used to align the target sequence with the
similar protein sequences so they can be easily compared by Rosetta. The file it
outputs, however, cannot be read by Rosetta, so I wrote a [Python script][4]
to convert this Clustal file into a format Rosetta can read, Grishin.

[3]: https://www.ebi.ac.uk/Tools/msa/clustalo/
[4]: https://edryd.org/projects/aln2grishin/

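For a feel of what such a conversion has to start from, here is a rough sketch that reads a Clustal alignment into one aligned string per sequence. It is not the linked script and it stops short of writing the Grishin file; it assumes the usual Clustal layout of a `CLUSTAL` header, blocks of `name sequence` lines, and a conservation line that starts with whitespace. The alignment text and sequence names are made up. As a bonus it counts the addition gaps discussed earlier.

```python
def parse_clustal(text):
    """Join the per-block segments of a Clustal alignment by sequence name."""
    sequences = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and the conservation (*, :, .) lines.
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, segment = parts[0], parts[1]
        sequences[name] = sequences.get(name, "") + segment
    return sequences


def residues_to_add(target, template):
    """Count aligned positions where the target has a residue but the template has a gap."""
    return sum(1 for t, m in zip(target, template) if t != "-" and m == "-")


# Made-up two-block alignment for illustration only.
example = "\n".join([
    "CLUSTAL O(1.2.4) multiple sequence alignment",
    "",
    "target      MKTAYIAKQR",
    "template    MKT--IAKQR",
    "            ***  *****",
    "",
    "target      QISF-VK",
    "template    QISFAVK",
    "            **** **",
])
aligned = parse_clustal(example)
additions = residues_to_add(aligned["target"], aligned["template"])
```
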
This Grishin alignment file and the similar protein PDBs are then given to
Rosetta to generate a threaded PDB for each input PDB (the threading command is
shown below). These threaded
PDBs, the target sequence FASTA file, and a RosettaCM XML file are then all
given to another Rosetta program to generate the target protein's PDB file. It
is recommended you generate anywhere from 50-200 PDBs depending on how quickly
the results seem to converge.

```
$ rosetta/main/source/bin/partial_thread.default.linuxgccrelease \
    -in:file:fasta molxa3.fasta \
    -in:file:alignment molxa3_6dbeA.grishin \
    -in:file:template_pdb 6dbeA.pdb
```

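Since this command has to be repeated once per similar protein, it can be convenient to build it in a loop. The sketch below only reuses the executable path and the three flags from the command above; the file naming scheme (`<target>.fasta`, `<target>_<template>.grishin`, `<template>.pdb`) is inferred from that one example and is an assumption, as are the function names.

```python
import subprocess


def partial_thread_command(target, template, rosetta_bin="rosetta/main/source/bin"):
    """Build the argument list for one threading run (file names are assumed)."""
    return [
        f"{rosetta_bin}/partial_thread.default.linuxgccrelease",
        "-in:file:fasta", f"{target}.fasta",
        "-in:file:alignment", f"{target}_{template}.grishin",
        "-in:file:template_pdb", f"{template}.pdb",
    ]


def thread_all(target, templates):
    """Run the threading step once per template, stopping on the first failure."""
    for template in templates:
        subprocess.run(partial_thread_command(target, template), check=True)


# Building the command does not run anything; thread_all() does.
command = partial_thread_command("molxa3", "6dbeA")
```
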
A full flow diagram for this process can be seen below:

![workflow](workflow.png)