---
title: "Protein Folding with Homology Simulations"
date: 2018-08-12
tags: tech biology simulations
categories: tech
---

This article is based on research I did as a freshman undergrad; see the [full project
here](http://2018.igem.org/Team:Washington/Model).

![protein](protein.gif)

Proteins naturally fold into their lowest energy configuration, just like a ball rolling down a
hill to a lower potential. This folded structure gives a protein its function, much as the
shape of a key, not the metal it is made of, determines which lock it can open.
In theory, it is possible to calculate how a protein will fold using only the
electrostatic attraction of all of the protein's atoms to each other. However, modeling this process
atom by atom is NP-hard, meaning that it cannot easily be scaled to average-sized
proteins without a supercomputer.

![protein](protein-surf.png)

[Rosetta][5] is a software suite for modeling and simulating complex proteins, designed to help
solve such problems. Rosetta reduces the computational complexity of the problem by dividing the
sequence into fragments and substituting fragments of common folding patterns into the protein. It
then samples random conformations in space and runs a Monte Carlo algorithm to find the lowest
energy conformation. Essentially, random changes to the conformation are tried in
an iterative process that reduces the free energy until optimal conformations are obtained. If a change reduces the free
energy, Rosetta continues from the new conformation. If a change increases the free energy, the
simulation reverts to the old conformation.

[5]: https://www.rosettacommons.org/

Homology modeling can be used to predict the structure of a protein from just
its amino acid sequence very quickly and accurately. Rosetta has built-in
homology modeling support through RosettaCM.
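
To make the accept-or-revert search described above for Rosetta more concrete, here is a minimal Python sketch of that loop. It is not Rosetta code: the `energy` function (a sum of squared angles) and the `perturb` step are toy stand-ins I made up for illustration, and the conformation is just a list of numbers. Real Monte Carlo samplers often also accept some uphill moves (the Metropolis criterion); this sketch follows the simpler rule given in the text.

```python
import random


def energy(conformation):
    """Toy stand-in for a free-energy score: sum of squared angles."""
    return sum(angle * angle for angle in conformation)


def perturb(conformation, rng, step=0.5):
    """Return a copy of the conformation with one random angle nudged."""
    trial = list(conformation)
    i = rng.randrange(len(trial))
    trial[i] += rng.uniform(-step, step)
    return trial


def minimize(conformation, iterations=2000, seed=0):
    """Keep a random change only if it lowers the energy, else revert."""
    rng = random.Random(seed)
    current = list(conformation)
    current_energy = energy(current)
    for _ in range(iterations):
        trial = perturb(current, rng)
        trial_energy = energy(trial)
        if trial_energy < current_energy:
            # The change reduced the energy: continue from the new state.
            current, current_energy = trial, trial_energy
        # Otherwise `current` is left untouched, i.e. the change is reverted.
    return current, current_energy


start = [3.0, -2.0, 1.5, -4.0]  # made-up starting "angles"
final, final_energy = minimize(start)
print(f"energy: {energy(start):.3f} -> {final_energy:.3f}")
```

Because a trial is only kept when it scores lower, the energy can never go up from one iteration to the next, which is also why this simple rule can get stuck in a local minimum.
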
Homology modeling uses the known
structures of proteins with similar sequences to see how differences from a
known structure will affect the structure of the protein you want. Because
you rely on previously mapped protein structures, the quality and accuracy of the
models you produce will depend on how similar the structures you find are to
your protein of interest. More similar proteins with fewer addition gaps lead to
faster and more accurate results.

![protein](protein-close.png)

The first step is to make a FASTA file for your target sequence. This is simple
and can be done manually. An example is shown below. Change `name` to the name
of your protein; this should match the name of your file (e.g. `5vnv.fasta`).

```
>name
EVQLQASGGGFVQPGGSLRLSCAASGSTSRQYDMGWFRQAPGKEREFVSAISSNQDQPPYYADSVKGRFTISRDNSKNTVYLQMNSLRAEDTATYYCAFKQHHANGAYWGQGTQVTVSS
```

Next you want to get the PDB files for around 2-5 similar proteins. This can be
done using a website called [BLAST][1]. Simply enter the target protein's sequence,
choose to show only proteins whose structures have already been mapped, and press enter.
This will query their database of known proteins and find the ones most
similar. You want to select proteins which are most similar and have the fewest
gaps, especially addition gaps, where residues have to be added. Rosetta has a
harder time adding new amino acids than deleting them or swapping residues out,
so it is often a good idea to pick a sequence with a lower similarity score if
it has fewer gaps and requires fewer additions. Depending on the protein you are
trying to simulate, there may be several very similar options, in which case
fewer than 5 proteins are needed.

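
The selection rule above (fewest additions first, then fewest gaps overall, then highest similarity) can be written as a simple sort. This is only a sketch: the `Hit` record, its field names, the template names, and all the numbers are hypothetical and made up for illustration. They are not BLAST output fields, so you would fill them in by hand from the alignments BLAST shows you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate template (hypothetical record, not a BLAST format)."""
    pdb_id: str
    identity: float        # percent identity to the target sequence
    added_residues: int    # residues Rosetta would have to add
    deleted_residues: int  # residues Rosetta would have to delete


def rank_templates(hits, limit=5):
    """Order hits by fewest additions, then fewest gaps, then identity."""
    ordered = sorted(
        hits,
        key=lambda h: (
            h.added_residues,
            h.added_residues + h.deleted_residues,
            -h.identity,
        ),
    )
    return ordered[:limit]


# Made-up candidates, for illustration only.
hits = [
    Hit("tmplA", 92.0, 6, 0),
    Hit("tmplB", 85.0, 0, 2),
    Hit("tmplC", 88.0, 0, 2),
    Hit("tmplD", 95.0, 3, 1),
]
for hit in rank_templates(hits, limit=3):
    print(hit.pdb_id, hit.identity, hit.added_residues)
```

Note that `tmplD` has the highest identity here but is ranked below the two templates that need no additions, which is the trade-off described above.
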

[1]: https://blast.ncbi.nlm.nih.gov/Blast.cgi

![protein](protein-lines.png)

Once you have selected your similar proteins, you need to use another website
called the [Protein Data Bank][2] to download the PDB files you want. However, a
PDB file from the Protein Data Bank most likely contains chains you do not care
about, so you have to remove them for Rosetta to understand which chains you
want to simulate. Luckily, Rosetta comes with a Python script called
`clean_pdb.py` to do exactly this. Simply give this script the PDB file name
and the chain you want to keep. The script will also produce a FASTA file for
the sequence of the chain, which is needed in the next step.

[2]: http://www.rcsb.org/

Yet another website, [Clustal Omega][3], is used to align the similar and the
target protein sequences so they can be easily compared by Rosetta. The file it
outputs, however, cannot be read by Rosetta, so I wrote a [Python script][4]
to convert this Clustal file into the Grishin format, which Rosetta can read.

[3]: https://www.ebi.ac.uk/Tools/msa/clustalo/
[4]: https://edryd.org/projects/aln2grishin/

This Grishin alignment file and the similar protein PDBs are then given to
Rosetta to generate a threaded PDB for each input PDB. These threaded
PDBs, the target sequence FASTA file, and a RosettaCM XML file are then all
given to another Rosetta program to generate the target protein's PDB file. It
is recommended that you generate anywhere from 50-200 PDBs, depending on how quickly
the results seem to converge.

```
$ rosetta/main/source/bin/partial_thread.default.linuxgccrelease -in:file:fasta molxa3.fasta -in:file:alignment molxa3_6dbeA.grishin -in:file:template_pdb 6dbeA.pdb
```

A full flow diagram for this process can be seen below:

![workflow](workflow.png)
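
Since the threading command shown earlier has to be run once per template, it can help to generate the commands from a list. The sketch below only rebuilds that exact command with the same three flags. The file naming scheme (`<target>.fasta`, `<target>_<template>.grishin`, `<template>.pdb`) is an assumption inferred from the example, and `templB` is a hypothetical second template name.

```python
import shlex

# Binary path as it appears in the example command; adjust to your install.
PARTIAL_THREAD = "rosetta/main/source/bin/partial_thread.default.linuxgccrelease"


def partial_thread_command(target, template):
    """Build the argument list for threading `target` onto one template.

    Assumes the file naming used in the example command:
    <target>.fasta, <target>_<template>.grishin and <template>.pdb.
    """
    return [
        PARTIAL_THREAD,
        "-in:file:fasta", f"{target}.fasta",
        "-in:file:alignment", f"{target}_{template}.grishin",
        "-in:file:template_pdb", f"{template}.pdb",
    ]


# "6dbeA" is from the example; "templB" is a hypothetical placeholder.
templates = ["6dbeA", "templB"]
for template in templates:
    cmd = partial_thread_command("molxa3", template)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list, rather than one long string, avoids shell quoting problems if a file name ever contains spaces.
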