Blogs written by Savita Jayaram, Ph.D., Bioinformatics Scientist

Proteomics

Proteomics was the most natural next step after the genomic revolution ushered in by the Human Genome Project. Proteomics essentially means cataloging the proteins that the genome encodes. Every chemical and biological reaction in the body depends on proteins. They give every cell its structure and function and form the glue that binds the body together; they are the hormones that course through our veins and the antibodies that fight infections; they are the enzymes that catalyze our biochemical reactions, and hence are essential to life itself.

What a protein does is largely determined by its shape. Proteins have pockets and grooves into which other molecules fit, just as a key fits into a lock. The need to understand this relationship between shape and function gave rise to the field of structural genomics, or proteomics as it is now called.


The 20 amino acids are the building blocks of proteins. Each amino acid differs from the others in the nature of its side chain. The sequence of amino acids forms the primary structure. Secondary structure forms when this sequence folds into alpha helices, beta sheets, or turns and loops. Secondary structural elements are then assembled through long-range interactions between residues on different strands, such as hydrogen bonds, disulphide bonds and salt bridges, to form the tertiary structure. Tertiary structure is generally stabilized by these non-local interactions, most commonly by the formation of a hydrophobic core, and sometimes by post-translational modifications.


Tertiary structure is synonymous with the term ‘fold’, and the two are used interchangeably. Complexes built from more than one protein chain form the quaternary structure: several polypeptide chains, usually called subunits, associate to form multimeric proteins.


Example: Hemoglobin

Post-translational modifications add a further level of complexity to determining protein function. Some proteins need to undergo phosphorylation, methylation, acetylation, glycosylation, ubiquitination, oxidation, nitrosylation and so on in order to function. These modifications profoundly affect protein function; for instance, some proteins are not active until they are phosphorylated. Additionally, many proteins form complexes with other proteins or RNA molecules and function only in the presence of these molecules.


Triose phosphate isomerase shown in three different representations: ball-and-stick, ribbon and space-fill.

The PDB (Protein Data Bank) reported 56,635 different protein structures as of March 2009, but only 1,283 unique folds as defined by SCOP (Structural Classification Of Proteins), showing that all these proteins are but variations on a handful of themes. Proteins with similar functions, whether from an insect, a worm or a human, share structural characteristics that are sometimes reflected in the genomic sequences as well. Often, however, proteins with less than 30% sequence identity (the so-called twilight zone) can still be very closely related: they share structural properties and motifs, and can have closely matching functions. Below 30% sequence identity, BLAST similarity searches give many false positives, so annotation by structure becomes very important. Example: mouse Abl tyrosine kinase and human p38 serine kinase, which carry out closely related kinase functions, share only 28% sequence identity.
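To make the 30% threshold concrete, here is a minimal Python sketch of how percent identity might be computed from a pairwise alignment. The function names and the toy sequences are illustrative assumptions, not real Abl or p38 data, and the choice to count only columns without gaps is one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: both strings come from the same pairwise alignment, so they
    have equal length and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are skipped in this convention
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def in_twilight_zone(identity: float, threshold: float = 30.0) -> bool:
    """True when identity falls below the threshold quoted in the text."""
    return identity < threshold


# Toy alignment (made-up sequences, for illustration only)
identity = percent_identity("MKT-AYIAKQR", "MRTLAYLA-QK")
print(round(identity, 1), in_twilight_zone(identity))
```

A pair that lands in the twilight zone by this measure is a candidate for structure-based comparison rather than sequence search alone.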

Small, specific arrangements of secondary structure elements (such as the helix-turn-helix) are called motifs. These often include loops of variable length that bring together structural elements not encoded by adjacent DNA sequences in a gene. Motifs are often called supersecondary structures.

An analysis of protein 3-D structure reveals that proteins have multiple folding units called domains, each of which has its own hydrophobic core and satisfies most of its residue-residue contacts internally. Several motifs combine to form these compact globular structures. A domain is defined as a polypeptide chain, or a part of a polypeptide chain, that can fold independently into a tertiary structure. Domains are thus units of function. Examples: the calcium-binding domain of calmodulin, and the DNA-binding and dimerization domains of the lambda repressor. Sequence pairs frequently exhibit localized regions of similarity, with the remainder of the proteins being totally dissimilar. Finding multiple domains increases our confidence that a sequence belongs to a protein family even if each domain individually is a weak match. Large polypeptide chains fold into several domains, and for such sequences it is better to search a database of protein domains such as PFAM. The HMMER algorithm is the most sensitive at identifying complete domains.

Traditional methods of determining protein structure, X-ray crystallography and NMR, are tedious and time-consuming. Computational methods have lent scientists a helping hand and can predict structure with considerable accuracy. The three major classes of protein structure prediction are:

  1. Homology (Comparative) Modelling
  2. Fold Recognition or threading – Secondary Structure Prediction
  3. Ab Initio Method – Tertiary Structure Prediction.

Successful prediction of protein structures by any of these methods must first surmount the following issues:

(1) Choosing a representation of protein conformation that includes structures similar to the correct conformation but limits the search space;

(2) Formulating a scoring function that relates a particular protein conformation to its free energy; and

(3) Devising a method to combine the first two elements in a search through conformational space for the state with the globally optimum score.
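The three ingredients above (a restricted representation, a scoring function and a search) can be illustrated with a toy model. The sketch below uses the 2D hydrophobic-polar (HP) lattice model, which is a classic teaching simplification and not one of the methods described in this post: residues are either H or P, conformations are self-avoiding walks on a square grid, the score is -1 per non-bonded H-H contact, and the search is brute-force enumeration. All names are illustrative.

```python
from itertools import product

# (1) Representation: a chain is a string of moves on a square lattice.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def walk(moves):
    """Convert moves to lattice coordinates; return None if the chain overlaps itself."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        if (x, y) in coords:
            return None  # not a self-avoiding walk
        coords.append((x, y))
    return coords


# (2) Scoring function: -1 for every H-H pair that touches on the lattice
# without being adjacent in the chain (a crude stand-in for free energy).
def energy(sequence, coords):
    e = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):
            if sequence[i] == "H" and sequence[j] == "H":
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    e -= 1
    return e


# (3) Search: exhaustive enumeration, feasible only for very short chains.
def best_fold(sequence):
    best_e, best_moves = None, None
    for moves in product("UDLR", repeat=len(sequence) - 1):
        coords = walk(moves)
        if coords is None:
            continue
        e = energy(sequence, coords)
        if best_e is None or e < best_e:
            best_e, best_moves = e, "".join(moves)
    return best_e, best_moves


print(best_fold("HPPH"))
```

The search cost grows as 4^(n-1) here, which is a small-scale picture of why real methods need both a tightly limited search space and a smarter search strategy.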

Homology Modelling:

The first step in modeling a protein sequence is to attempt to find related known protein structures in the PDB for as many domains in the modeled sequence as possible. If your unknown protein has significant homology with another protein of known 3D structure, a fairly accurate model of its 3D structure can be obtained using homology modelling. This is recommended, however, when sequence identity is greater than 50%. The average accuracy of such models approaches that of low-resolution X-ray structures (3 Å resolution) or medium-resolution NMR structures (10 long-range restraints per residue). Templates with less than 30% sequence identity give less accurate models and fold assignments.

There are various algorithms for determining structural similarity:

  • SSAP (Sequential Structure Alignment Program) – uses dynamic programming.
  • VAST (Vector Alignment Search Tool) – based on 3D comparison and clustering; allows for different topologies and large insertions. It is used to find structure neighbors for new proteins in MMDB (the Molecular Modeling Database, derived from PDB). For proteins already in MMDB, the structural neighbors have been pre-computed and can be viewed through the structure summary pages.
  • DALI http://ekhidna.biocenter.helsinki.fi/dali_server/ – based on distance matrices; it is insensitive to insertions and works even at 10-18% sequence identity. It offers a network service for comparing protein structures in 3D. One can submit the coordinates of a query protein via the web or email, and DALI compares them with structures in PDB. You get an email notification when the search is completed. The structural neighbors of a protein in PDB can reveal biologically interesting similarities that cannot be detected by comparing sequences alone.
  • FSSP (Fold Classification based on structure-structure alignment of proteins).
  • Modeller – developed by Dr. Andrej Sali at UCSF. It performs comparative (homology) modeling of three-dimensional structures using a technique called ‘satisfaction of spatial restraints’, drawing on known related structures. It is available for download for most Unix/Linux systems, Windows and Mac. Commercial versions can be licensed from Accelrys.
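The distance-matrix idea mentioned for DALI can be sketched in a few lines of Python. This is only an illustration of the representation, not DALI's actual algorithm, which goes on to compare and assemble matching sub-matrices between proteins of different lengths. The coordinates are made up and the function names are assumptions.

```python
import math


def distance_matrix(coords):
    """All-against-all distances between points (for example, C-alpha atoms)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def mean_abs_difference(dm_a, dm_b):
    """Average absolute difference between two equally sized distance matrices."""
    n = len(dm_a)
    total = sum(abs(dm_a[i][j] - dm_b[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


# Made-up coordinates for a four-residue fragment (in Angstroms)
fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 1.5)]

# The same fragment rotated 90 degrees about the z axis and then translated
moved = [(-y + 10.0, x - 5.0, z + 2.0) for x, y, z in fragment]

print(mean_abs_difference(distance_matrix(fragment), distance_matrix(moved)))
```

Because internal distances do not change when a structure is rotated or shifted, two structures can be compared through their distance matrices without first superimposing them.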

Various other considerations are:

  • Is your protein a trans-membrane protein? (TMpred, TMHMM, TMAP, PredictProtein)
  • Does it have coiled coils? (COILS server)
  • Does your protein contain regions of low complexity that need to be masked? (SEG program; proteins frequently have runs of polyglutamine or polyserine that do not predict well)
  • Does it have a signal peptide? (SIGNALP program, to check whether it is a secreted protein)
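As a rough illustration of what masking low-complexity regions means, the sketch below slides a window along the sequence and replaces residues with X wherever the Shannon entropy of the window is low. This is a simplified stand-in, not the SEG algorithm itself, and the window size and threshold are arbitrary assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace residues with 'X' in every window whose entropy is below threshold.

    Assumption: window and threshold are illustrative values, not SEG defaults.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


# Made-up sequence with a polyglutamine run in the middle
seq = "ACDEFGHIKLMN" + "Q" * 15 + "PRSTVWYACDEF"
print(mask_low_complexity(seq))
```

Note that a window-based mask like this one also spills a few residues into the flanking sequence, which is one reason dedicated tools refine the boundaries of the masked segment.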

Fold Recognition or Threading:

Proteins can adopt similar folds despite having no significant sequence or functional similarity. Nature is apparently restricted to a limited number of protein folds, as seen in the small number of unique folds (1,283) as opposed to the number of protein structures in PDB (56,635). Methods of fold recognition detect similarity between protein 3D structures that is not accompanied by any significant sequence similarity. The folds of domains in the target sequence can be assigned by pairwise or multiple sequence similarity searches, as well as by threading methods that rely explicitly on the known structures of the candidate template proteins. While fold assignment predicts a structural relationship between two proteins, it does not produce an explicit three-dimensional model of the target sequence. It is generally followed by alignment of the target sequence with one or more template structures to establish the best possible correspondence between the residues.

Ab Initio Modelling:

If no suitable fold assignments, alignments or models can be obtained, the only recourse is ab initio protein structure prediction, which attempts to predict the native structure based solely on the sequence of the protein to be modeled. Unfortunately, these methods are still being perfected and are not generally applicable, given the complexity of protein folding. So far, ab initio methods have produced successful models with correct folds for only a few small protein domains. But recent progress in ab initio modeling of long inserted loops and of domains shorter than 150 residues will benefit structural genomics, as these regions are not easily accessible to experimental structure determination methods. IBM intends to apply its Blue Gene supercomputer to the ab initio protein folding problem. Its massive computing power will be needed to develop more accurate energy functions and protein representations, as well as to simulate molecular dynamics.
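The order in which the three approaches are tried can be summarised as a small decision function. This is a simplified sketch based only on the thresholds quoted in this post; the treatment of the 30-50% band as "homology modelling with caution" is an assumption, and the function and parameter names are hypothetical.

```python
from typing import Optional


def suggest_method(template_identity: Optional[float], fold_assigned: bool = False) -> str:
    """Suggest a structure prediction approach.

    template_identity: percent identity to the best template of known structure,
                       or None when sequence searches find no template.
    fold_assigned:     True when fold recognition or threading found a template.
    """
    if template_identity is not None and template_identity > 50.0:
        return "homology modelling"
    if template_identity is not None and template_identity >= 30.0:
        # Assumption: between 30% and 50% a model is still attempted, with care.
        return "homology modelling (check the alignment carefully)"
    if fold_assigned:
        return "fold recognition, then target-template alignment"
    return "ab initio prediction"


for case in [(62.0, False), (35.0, False), (22.0, True), (None, False)]:
    print(case, "->", suggest_method(*case))
```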

Structure Prediction Flowchart:

[Figure: structure prediction flowchart]

Courtesy: http://www.russell.embl-heidelberg.de/gtsp/

Protein Structure Prediction Databases and Servers:

CASP http://predictioncenter.org/ stands for Critical Assessment of Techniques for Protein Structure Prediction. Protein structure modelers and their methods are tested every two years in these experiments. The predicted 3D models are compared with the experimentally determined structures, which makes a bona fide evaluation of protein structure modeling methods possible.

PFAM http://pfam.sanger.ac.uk/ – The Protein Family Database contains a large collection of protein families, each represented by multiple sequence alignments and HMM profiles of protein domains. Conserved features are recognized and given higher weight. HMMER-PFAM can detect distant relationships with less than 15% sequence identity and is more sensitive than pairwise alignment approaches, which makes it a method for finding remote homologs.

UNIPROT http://www.ebi.ac.uk/uniprot/ – SWISS-PROT, TrEMBL and PIR together formed the UniProt Knowledgebase, which is maintained by the UniProt Consortium, including EMBL-EBI (European Bioinformatics Institute). TrEMBL is the translated EMBL nucleotide database; its predicted proteins are hypothetical and poorly annotated, while Swiss-Prot contains only actual proteins and is well annotated.

PROSITE http://www.expasy.ch/prosite/ is a database of protein families and domains. It is manually curated and tightly integrated with Swiss-Prot protein annotation. It offers a collection of functionally important protein motifs, emphasizing the most highly conserved residues in a protein family. It is a library of patterns and profiles, and does not describe complete domains or whole proteins.
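PROSITE patterns are written in a compact syntax: elements are separated by hyphens, x means any residue, [ ] lists allowed residues, { } lists forbidden residues, (n) or (n,m) gives a repeat count, and < or > anchors the pattern to the N- or C-terminus. The Python sketch below converts that syntax to a regular expression. It covers only these common cases (for example, it does not handle > placed inside square brackets), and both the example pattern and the sequence are illustrative rather than taken from a specific PROSITE entry.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # patterns conventionally end with '.'
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):       # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):         # C-terminal anchor
            suffix, element = "$", element[:-1]
        if element.endswith(")"):         # repeat count such as (3) or (2,4)
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":                # any residue
            core = "."
        elif element.startswith("{"):     # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                             # a single residue or a [..] class
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(sequence: str, pattern: str):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Illustrative pattern and made-up sequence
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
sequence = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(prosite_to_regex(pattern))
print(find_motifs(sequence, pattern))
```

Short patterns like this match by chance fairly often in large databases, which is why pattern hits are usually treated as hints to be confirmed rather than as annotations in their own right.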

BLOCKS – http://blocks.fhcrc.org/ and

PRINTS – http://www.bioinf.manchester.ac.uk/PRINTS are two different motif databases that represent protein domain families. PRINTS is a collection of so-called fingerprints of protein families. A fingerprint is a group of conserved motifs taken from a multiple sequence alignment; together the motifs form a characteristic ‘signature’ for the aligned protein family. The BLOCKS database is no longer updated, and its site suggests using EBI’s INTERPRO instead to annotate your protein sequences.

INTERPRO http://www.ebi.ac.uk/interpro/, provided by EBI, is a database of protein families, domains and functional sites. It is available for text- and sequence-based searches via a web server and for download by anonymous FTP.

COG http://www.ncbi.nlm.nih.gov/COG/, developed by NCBI, stands for Clusters of Orthologous Groups. The most promising approach to predicting the exact function of a protein is to find its characterized “ortholog” from a different species.

Proteopedia http://proteopedia.org/wiki, which has recently become popular, is a wiki-style 3D interactive encyclopedia of proteins. There is one entry for each of the over 50,000 proteins in the Protein Data Bank. (Ref: Proteopedia – a scientific ‘wiki’ bridging the rift between 3D structure and function of biomacromolecules, Genome Biology 2008, 9:R121 doi:10.1186/gb-2008-9-8-r121)

PredictProtein http://www.predictprotein.org/ provides services for sequence analysis, structure and function prediction. It retrieves sequences from the database and predicts aspects of protein structure and function.
