index.md - edryd.org - some of my neat stuff

---
title: "Protein Folding with Homology Simulations"
date: 2018-08-12
tags: tech biology simulations
categories: tech
---

This article is based on research I did as a freshman undergrad; see the [full project
here](http://2018.igem.org/Team:Washington/Model).

![protein](protein.gif)

Proteins naturally fold into their lowest energy configuration, just like a ball rolling down a
hill to a lower potential. This folded structure gives a protein its function, much like a key:
which lock it opens is determined by its shape, not by the metal it is made of.
In theory, it is possible to calculate how a protein will fold using nothing more than the
electrostatic attraction of all of the protein's atoms to each other. However, modeling this process
atom by atom is NP-hard, so it does not scale to average-sized
proteins without a supercomputer.

![protein](protein-surf.png)

[Rosetta][5] is a software suite for modeling and simulating complex proteins, designed to help
solve such problems. Rosetta reduces the computational complexity of this problem by dividing the
sequence into fragments and substituting fragments with common folding patterns into the protein. It
then samples random conformations in space and uses a Monte Carlo algorithm to find the lowest
energy conformation. Essentially, Rosetta makes a random change to the conformation and recomputes
the free energy, repeating this iteratively until it reaches an optimal conformation. If the change
reduced the free energy, Rosetta continues from the new conformation. If the change increased the
free energy, the simulation reverts to the old conformation.

[5]: https://www.rosettacommons.org/
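
The accept-or-revert loop described above can be sketched in a few lines of Python. This is only a toy: `toy_energy` is a made-up stand-in for a real energy function, and the "conformation" is just a short list of numbers. It is not Rosetta's code or scoring.

```python
import random


def toy_energy(angles):
    """Hypothetical energy function: lowest (zero) when every angle is 0."""
    return sum(a * a for a in angles)


def accept_or_revert(angles, steps=2000, step_size=0.1, seed=0):
    """Randomly perturb one value at a time, keeping only changes that lower the energy."""
    rng = random.Random(seed)
    current = list(angles)
    current_energy = toy_energy(current)
    for _ in range(steps):
        # Propose a random change to one coordinate of the conformation.
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.uniform(-step_size, step_size)
        candidate_energy = toy_energy(candidate)
        if candidate_energy < current_energy:
            # The change reduced the energy: continue from the new conformation.
            current, current_energy = candidate, candidate_energy
        # Otherwise: keep the old conformation (revert).
    return current, current_energy


start = [1.0, -2.0, 0.5]
final, energy = accept_or_revert(start)
print(toy_energy(start), "->", energy)
```

A Monte Carlo search usually also accepts an energy-increasing change with some small probability (the Metropolis criterion), which helps it escape local minima. The sketch leaves that out to match the simplified description here.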

Homology modeling can predict the structure of a protein from just
its amino acid sequence very quickly and accurately. Rosetta has built-in
homology modeling support through RosettaCM. Homology modeling uses the known
structures of proteins with similar sequences to see how the differences from a
known structure will affect the structure of the protein you want. Because
you rely on previously mapped protein structures, the quality and accuracy of the
models you produce depend on how similar the structures you find are to
your protein of interest. More similar proteins with fewer gaps that require
adding residues lead to faster and more accurate results.

![protein](protein-close.png)

The first step is to make a fasta file for your target sequence. This is simple
and can be done manually. An example is shown below. Change `name` to the name
of your protein; this should match the name of your file (e.g. `5vnv.fasta`).

```
>name
EVQLQASGGGFVQPGGSLRLSCAASGSTSRQYDMGWFRQAPGKEREFVSAISSNQDQPPYYADSVKGRFTISRDNSKNTVYLQMNSLRAEDTATYYCAFKQHHANGAYWGQGTQVTVSS
```
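
If you would rather script this step, here is a minimal sketch that writes the file so the header and the file name always match. `write_fasta` is a helper made up for this example, and pairing the example sequence with the name `5vnv` is only for illustration.

```python
import tempfile
from pathlib import Path


def write_fasta(name, sequence, directory):
    """Write <name>.fasta containing a '>name' header and the sequence on one line."""
    cleaned = "".join(sequence.split()).upper()  # drop stray whitespace and newlines
    path = Path(directory) / f"{name}.fasta"
    path.write_text(f">{name}\n{cleaned}\n")
    return path


SEQUENCE = (
    "EVQLQASGGGFVQPGGSLRLSCAASGSTSRQYDMGWFRQAPGKEREFVSAISSNQDQPPYYADSVKGRF"
    "TISRDNSKNTVYLQMNSLRAEDTATYYCAFKQHHANGAYWGQGTQVTVSS"
)

out_dir = tempfile.mkdtemp()  # a scratch directory, just for the example
fasta_path = write_fasta("5vnv", SEQUENCE, out_dir)
print(fasta_path.read_text(), end="")
```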

Next you want to get the PDB files for around 2-5 similar proteins. You can find
them using a website called [BLAST][1]. Simply enter the target protein's sequence,
choose to show only proteins whose structures have already been mapped, and press enter.
This will query their database of known proteins and find the ones most
similar. You want to select proteins which are highly similar and have few
gaps, especially gaps where residues would have to be added. Rosetta has a
harder time adding new amino acids than deleting them or swapping residues
out, so it is often a good idea to pick a sequence with a lower similarity
score if it has fewer gaps and needs fewer additions. Depending on the protein
you are trying to simulate, there might be several options which are very
similar, in which case fewer than 5 proteins would be needed.

[1]: https://blast.ncbi.nlm.nih.gov/Blast.cgi
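
To make the "fewer additions first" rule concrete, here is a small sketch that counts additions and deletions in a pairwise alignment and ranks templates accordingly. The alignments and the names `template_a` and `template_b` are made up for illustration; they are not real BLAST output.

```python
def alignment_stats(target_aln, template_aln):
    """Count additions, deletions and identity for two gapped, equal-length strings."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    additions = deletions = identical = aligned = 0
    for t, s in zip(target_aln, template_aln):
        if t == "-" and s == "-":
            continue
        if s == "-":
            additions += 1   # target residue with no template residue to copy
        elif t == "-":
            deletions += 1   # template residue that is dropped
        else:
            aligned += 1
            identical += t == s
    identity = identical / aligned if aligned else 0.0
    return {"additions": additions, "deletions": deletions, "identity": identity}


# Hypothetical alignments as (target, template) pairs.
candidates = {
    "template_a": ("EVQLQASGGG", "EVQL--SGGG"),
    "template_b": ("EVQL-QASGGG", "EVKLAQASGAG"),
}
stats = {name: alignment_stats(*pair) for name, pair in candidates.items()}

# Prefer fewer additions; break ties by higher identity.
ranked = sorted(stats, key=lambda n: (stats[n]["additions"], -stats[n]["identity"]))
print(ranked)
```

In this toy case `template_b` ranks first even though its identity is lower, because it needs no added residues.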

![protein](protein-lines.png)

Once you have selected your similar proteins, use another website,
the [Protein Data Bank][2], to download the PDB files you want. However, a
PDB file from the Protein Data Bank most likely contains chains you do not care
about, so you have to remove them for Rosetta to know which chain you
want to simulate. Luckily Rosetta comes with a Python script called
`clean_pdb.py` to do exactly this. Simply give the script the PDB file name
and the chain you want to keep. It will also produce a fasta file for
the sequence of that chain, which is needed in the next step.

[2]: http://www.rcsb.org/
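
To show what "keeping one chain" means at the file level, here is a rough sketch that filters `ATOM` records by the chain ID, which the PDB format stores in column 22. It is not a replacement for `clean_pdb.py`, which does more than this; the helper names and the toy records (with made-up coordinates) exist only for this example.

```python
def atom_line(serial, atom, residue, chain, number, x, y, z):
    """Build a toy ATOM record with the chain ID in column 22 (index 21)."""
    return (f"ATOM  {serial:5d} {atom:<4s} {residue:>3s} {chain}{number:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def keep_chain(pdb_text, chain_id):
    """Return only the ATOM records that belong to the given chain."""
    kept = [
        line for line in pdb_text.splitlines()
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id
    ]
    return "\n".join(kept) + "\n"


# A made-up three-atom structure with two chains.
toy_pdb = "\n".join([
    "HEADER    TOY STRUCTURE (MADE-UP COORDINATES)",
    atom_line(1, "N", "GLU", "A", 1, 11.104, 6.134, -6.504),
    atom_line(2, "CA", "GLU", "A", 1, 11.639, 6.071, -5.147),
    atom_line(3, "N", "MET", "B", 1, 20.001, 1.500, 3.250),
    "END",
])

chain_a = keep_chain(toy_pdb, "A")
print(chain_a, end="")
```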

Yet another website, [Clustal Omega][3], is used to align the target sequence with
all the similar protein sequences so they can be easily compared by Rosetta. Rosetta
cannot read the file it outputs, however, so I wrote a [python script][4]
to convert the Clustal file into a format Rosetta can read, Grishin.

[3]: https://www.ebi.ac.uk/Tools/msa/clustalo/
[4]: https://edryd.org/projects/aln2grishin/
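
For a rough idea of what such a conversion involves, here is a minimal sketch. It is not the script linked above. The Grishin layout it writes (a `##` line with the target and template names, a `#` line, a `scores_from_program` line, the two aligned sequences each prefixed with a start offset of 0, and a closing `--`) is my understanding of the format, so check it against the Rosetta documentation before relying on it. The sequence names and alignment are made up.

```python
def parse_clustal(text):
    """Collect the aligned sequence for each name from Clustal-formatted text."""
    sequences = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and conservation lines (which start with spaces).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        name, chunk = line.split()[:2]
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


def drop_shared_gaps(a, b):
    """Remove columns where both sequences have a gap (left over from other sequences)."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in pairs), "".join(y for _, y in pairs)


def to_grishin(target, template, sequences):
    """Format one target/template pair in the (assumed) Grishin layout."""
    target_aln, template_aln = drop_shared_gaps(sequences[target], sequences[template])
    lines = [f"## {target} {template}", "#", "scores_from_program: 0",
             f"0 {target_aln}", f"0 {template_aln}", "--"]
    return "\n".join(lines) + "\n"


# Hypothetical Clustal Omega output for a target and two templates.
clustal_text = """CLUSTAL O(1.2.4) multiple sequence alignment

target      EVQL-QASGGG
templateA   EVQL-QA-GGG
templateB   EVKLAQASGAG
            ** * ** * *
"""

sequences = parse_clustal(clustal_text)
grishin = to_grishin("target", "templateA", sequences)
print(grishin, end="")
```

Each template gets its own pairwise block, so for several templates you would call `to_grishin` once per template name.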

This Grishin alignment file and the similar protein PDBs are then given to
Rosetta to generate a threaded PDB for each input PDB. These threaded
PDBs, the target sequence fasta file, and a RosettaCM XML file are then all
given to another Rosetta program to generate the target protein's PDB file. It
is recommended that you generate anywhere from 50 to 200 PDBs, depending on how
quickly the results seem to converge. For example, the threading command for
one template looks like this:

```
$ rosetta/main/source/bin/partial_thread.default.linuxgccrelease -in:file:fasta molxa3.fasta -in:file:alignment molxa3_6dbeA.grishin -in:file:template_pdb 6dbeA.pdb
```
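
With several templates you end up running the same command once per template, so it can help to generate the commands. The sketch below only builds and prints them, using the flags and file naming from the command above. The `<target>_<template>.grishin` naming pattern is taken from that one example and is an assumption for any other template.

```python
import shlex

PARTIAL_THREAD = "rosetta/main/source/bin/partial_thread.default.linuxgccrelease"


def partial_thread_command(target, template):
    """Build the threading command for one target/template pair."""
    return [
        PARTIAL_THREAD,
        "-in:file:fasta", f"{target}.fasta",
        "-in:file:alignment", f"{target}_{template}.grishin",
        "-in:file:template_pdb", f"{template}.pdb",
    ]


templates = ["6dbeA"]  # add the other templates you selected here
commands = [partial_thread_command("molxa3", t) for t in templates]

for command in commands:
    print(shlex.join(command))
    # To actually run it: subprocess.run(command, check=True)
```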

A full flow diagram for this process can be seen below:

![workflow](workflow.png)