Protein Sequence Analysis Structure & Function Prediction

Download Report

Transcript Protein Sequence Analysis Structure & Function Prediction

Function Prediction from Protein Sequence

Basic definitions

• Primary structure – the linear sequence of amino acids in a protein • Secondary structure – regions of local regularity •

i.e.,

a -helices, b -strands, -sheets & -turns

Definitions contd.

• Super-secondary structure – the packing of secondary structure elements into stable units •

e.g

., b -barrels, bab units, Greek keys,

etc

..

Definitions contd.

• Tertiary structure – the overall chain fold that results from packing of secondary structure elements

Definitions contd.

• Quaternary structure – the arrangement of separate chains within a protein that has more than one subunit •

e.g

., haemoglobin

Definitions contd.

• Quinternary structure – the arrangement of separate molecules, such as in protein protein or protein-nucleic acid interactions

Importance of sequence analysis

• Feb 2008 >67 million sequences available in Genbank – & millions more (including ESTs) in proprietary dbs – these #s will snowball with completion of more genomes – so what?

• Locked up in sequences is a huge amount of structural, functional & evolutionary info – they're a highly valuable resource • By contrast, the # of solved protein structures is ~ 45000 – a huge information deficit

PDB Feb 2007

The legacy of the genome projects

Sequence-structure deficit 800 700 600 500 400 300 200 100 1988 2002

Non-redundant growth of sequences during 1988-2002 ( ) & the corresponding growth in the number of structures ( ).

Challenges for bioinformatics

• Spurred on by the seq/structure deficit, the challenges – rationalise the mass of sequence data – derive more efficient means of data storage – design more incisive & reliable analysis tools • The imperative - to convert sequence information into biochemical & biophysical knowledge – to decipher the structural, functional & evolutionary clues encoded in the language of biological sequences

The Holy Grail of bioinformatics

...to be able to understand the words in a sequence sentence that form a particular protein structure

Pattern recognition & prediction

• In investigating the meaning of sequences, two distinct analytical approaches have emerged –

pattern recognition

is used to detect similarity between sequences & hence to infer related structures & functions –

ab initio prediction

is used to deduce structure, & to infer function, directly from sequence • These methods are quite different!

– pattern recognition methods characteristic has been seen demand that some before & housed in a db – prediction methods remove the need for template dbs, because deductions are made directly from sequence

Science fact & fiction

• Sequence pattern recognition is easier to achieve, & is much more reliable, than fold recognition

– which is ~50% reliable even in expert hands

• Prediction is still not possible

– & is unlikely to be so for decades to come (if ever)

• But, to debunk a popular myth, knowing structure alone does not inherently tell us function

The Twilight Zone

• Prediction methods don’t work because we don’t fully understand the Folding Problem – we can’t read the language sequences use to create their folds • But, with sequence analysis techniques, we can try to find similarities between new sequences & those in dbs – whose structures & functions we hope have been elucidated • This is straightforward at high levels of identity, but below 50% it is difficult to establish relationships reliably • Analyses can be pursued with decreasing certainty towards the Twilight Zone – ~20% identity, where results may look plausible to the eye, but are no longer statistically significant

Beyond the Twilight Zone

• To penetrate deeper into the Twilight Zone is the aim of most analytical methods – whether using single sequences, motifs, complex weighting schemes or raw amino acid frequencies • Each offers a different perspective , depending on the type of information used in the search – none gives the

right

answer • It is good practice to devise an analysis protocol that uses a variety of methods – but don’t expect the impossible – no method is infallible!

Application areas of analysis tools

• • The scale indicates % identity between aligned sequences Alignment of 2 random seqs can produce ~20% identity – less than 20% does not constitute a significant alignment – around this threshold is the Twilight Zone , where alignments may appear plausible to the eye, but can’t be proved by conventional methods

Homology & analogy

• The term homology is confounded & abused!

– sequences are homologous from a common ancestor if they are related by divergence – analogy relates to the acquisition of common features from unrelated ancestors via • convergent evolution

e.g

., b -barrels occur in soluble & membrane proteins; enzymes chymotrypsin & subtilisin share groups of catalytic residues, with near identical spatial geometries, but no other similarities

The Midnight Zone

• In the genome era, prediction of function from sequence is of more immediate value than is the prediction of structure • However, between distantly-related proteins, structure is more conserved than the underlying sequences – thus, some relationships are only apparent at the structural level • Such relationships can’t be detected by even the most sensitive sequence comparison methods – the region of identity where sequence comparisons fail completely to detect structural similarity is the Midnight Zone – there is thus a theoretical limit to the effectiveness of sequence analysis methods

Family (pattern) databases

• SWISS-PROT is emerging as a standard, & most pattern dbs use it as their basis

PROSITE PRINTS (fingerprints) Pfam (HMMs) Profiles Blocks (blocks) eMOTIF SWISS-PROT SWISS-PROT/TrEMBL SWISS-PROT/TrEMBL SWISS-PROT InterPro/PRINTS Blocks/PRINTS Regular expressions (patterns) Aligned motifs Hidden Markov Models Weight matrices (profiles) Weighted motifs Permissive regular expressions

Why create pattern databases?

• Pattern dbs arise from the need to make more specific functional diagnoses than are possible simply by searching • They are built on the principle that homologous sequences may be gathered together in multiple alignments, within which are regions ( motifs ) that show little variation – these motifs usually reflect some vital biological role in terms of either structure or function • Motifs are exploited in different ways to build diagnostic patterns for protein families – new sequences can be searched against dbs of such patterns to see if they can be assigned to known families – hence they offer a fast track to the inference of function

What's in a sequence?

Methods for family analysis

Single motif methods

Fuzzy regex (eMOTIF) Exact regex (PROSITE)

Full domain alignment methods

Profiles (PROFILE LIBRARY) HMMs (Pfam) Identity matrices (PRINTS)

Multiple motif methods

Weight matrices (Blocks)

PROSITE

• The first pattern db – based on the idea that a protein family can be characterised by a pattern of conserved residues within a single motif • Sequence information in motifs is reduced to consensus or regular expressions (regexs) & the seed regex used to search SP – results are inspected manually to achieve optimal results • Some families can’t be characterised by single motifs – here, additional regexs are created until an optimal set is achieved that captures most or all of the family – results are then manually annotated for inclusion in the db

R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)

Regular expressions (patterns)

• These are derived from single conserved regions in alignments – they are minimal expressions, so sequence information is lost – the more divergent the sequences used, the more fuzzy & poorly discriminating the regex becomes

Alignment Regex

GAVDFIALCDRYF GPIDFVCFCERFY G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2 GRVEFLNRCDRYY • Regexs do not tolerate similarity – sequences either match or not, regardless of how similar they are – matching is a binary ‘on-off’ event & frequently misses true matches – single-motif methods are very hit-or-miss – you've encoded the ‘best’ region?

how do you know if

Regular expressions (rules)

• Regex patterns are most effective when applied to highly conserved, family-specific motifs • It is often possible to identify, shorter generic patterns within sequences, characteristic of common functional sites

Functional site

N-glycosylation Protein kinase C phosphorylation Casein kinase II phosphorylation

Rule

N-{P}-[ST]-{P} [ST]-X-[RK] [ST]-X2-[DE] • Such features result from convergence to a common property – glycosylation sites, phosphorylation sites,

etc

.

• They cannot be used for family diagnosis & don’t discriminate – they can only be used to suggest whether a certain functional site might exist (which must then be tested by experiment) – such patterns are normally termed rules

Residue groups for fuzzy regexs

• It is possible to assign residues to groups based on various biochemical properties –

e.g

., charge & size – using such groups theoretically ensures that resulting regexs have sensible biochemical interpretations small small hydroxyl basic aromatic aliphatic acidic/amide small/polar Ala, Gly Ser, Thr His, Lys, Arg Phe, Tyr, Trp Val, Leu, Ile, Met Asp, Glu, Asn, Gln Ala, Gly, Ser, Thr, Pro • This is more flexible than exact regex matching

Profiles & Pfam

• An alternative to motif-based methods exploits regions between motifs , which also contain valuable information – the full alignment effectively becomes the discriminator • A complex scoring scheme allowing for substitutions & INDELs is used to create family-specific profiles • These profiles can be used to detect distant relation ships, where only few residues are conserved – this is the basis of the Profile library • In an extension of this approach, alignments are encoded as probabilistic models termed HMMs – this is the basis of Pfam

Profiles

• Profiles are scoring tables derived from full domain alignments – these define which residues are allowed at given positions – which positions are conserved & which degenerate – which positions, or regions, can tolerate insertions – the scoring system is intricate, & may include evolutionary weights, results from structural studies, & data implicit in the alignment – variable penalties are specified to weight against INDELs occurring in core 2' structure elements • Within a profile, the I & M fields contain position-specific scores for insert & match positions – in conserved regions, INDELs aren't totally forbidden, but are strongly impeded by large penalties defined in the DEFAULT field – these are superseded by more permissive values in gapped regions – the inherent complexity of profiles renders them highly potent discriminators, but they are time-consuming to derive

S N B

A HMM

E D

M1 I1

C L L C A W R Y D G E

I2

L Y

I3

L W

M2 M3 M4

C C

E C T D1 D2 D3 D4 J

Fingerprints

• Fingerprints are groups of conserved (ungapped) motifs excised from alignments & used for iterative db searching – no weighting scheme is used – searches depend only on residue frequencies – resulting scoring matrices are thus sparse • Each motif trawls the db independently – search results are correlated to determine which sequences match all the motifs & which match only partially – no information is thrown away • The iterative process refines the fingerprint & increases its power – potency is gained from the mutual context of motif neighbours – results are biologically more meaningful than those from single motifs

TM domain loop region TM domain

Composite pattern databases

• To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource – InterPro – release 4.0 contains 4691 entries – a central annotation resource, with pointers to its satellite dbs – initial partners were PRINTS, PROSITE, profiles & Pfam – new partners include ProDom, TIGRfam, SMART & hopefully others (

e.g

., Blocks, MetaFam) – lags behind its sources – major role in fly & human genome annotation