Transcript Slide 1

Sequence Based Analysis Tutorial

NIH Proteomics Workshop

Cecilia Arighi, Ph.D.

Protein Information Resource at Georgetown University Medical Center

Retrieval, Sequence Search & Classification Methods

• Retrieve protein info by text / UID
• Sequence Similarity Search
  • BLAST, FASTA, Dynamic Programming
• Family Classification
  • Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks
• Integrated Search and Classification System

Sequence Similarity Search (I)

• Based on Pair-Wise Comparisons
• Dynamic Programming Algorithms
  • Global Similarity: Needleman-Wunsch
  • Local Similarity: Smith-Waterman
• Heuristic Algorithms
  • FASTA: Based on K-Tuples (2-Amino Acid)
  • BLAST: Triples of Conserved Amino Acids
  • Gapped-BLAST: Allow Gaps in Segment Pairs
  • PHI-BLAST: Pattern-Hit Initiated Search
  • PSI-BLAST: Position-Specific Iterated Search
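To make the dynamic programming idea concrete, here is a minimal Python sketch of Smith-Waterman local alignment scoring. The match, mismatch and gap values are illustrative assumptions (real searches use a substitution matrix and affine gap penalties), and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACDEXX", "YYACDEYY"))
```

A global (Needleman-Wunsch) version would differ mainly in dropping the floor at zero, initializing the borders with gap penalties, and reading the score from the final cell.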

Sequence Similarity Search (II)

• Similarity Search Parameters
  • Scoring Matrices: Based on Conserved Amino Acid Substitution
    • Dayhoff Mutation Matrix, e.g., PAM250 (~20% Identity)
    • Henikoff Matrix from Ungapped Alignments, e.g., BLOSUM62
  • Gap Penalty
• Search Time Comparisons
  • Smith-Waterman: 10 Min
  • FASTA: 2 Min
  • BLAST: 20 Sec

Feature Representation

• Features of Amino Acids: Physicochemical Properties, Context (Local & Global) Features, Evolutionary Features
• Alternative Amino Acids: Classification of Amino Acids To Capture Different Features of Amino Acid Residues

Alphabet          Size  Features                Membership
AA Identity       20    Sequence Identity       A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
Exchange Group    6     Evolution/Substitution  {HRK} {DENQ} {C} {STPAG} {MILV} {FYW}
Charge/Polarity   4     Charge and Polarity     {HRK} {DE} {CTSGNQY} {PMLIVFW}
Hydrophobicity    3     Hydrophobicity          {DENQRK} {CSTPGHY} {AMILVFW}
Structural        3     Surface Exposure        {DENQHRK} {CSTPAGWY} {MILVF}
2D Propensity     3     Secondary Structure     {AEQHKMLR} {CTIVFYW} {SGPDN}
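As a rough illustration of how an alternative alphabet is applied, the Python sketch below recodes a sequence with the three hydrophobicity groups listed above. The group labels "0", "1", "2" and the function names are assumptions made for this example only.

```python
# Hydrophobicity groups as listed on the slide; the digit labels are arbitrary.
HYDROPHOBICITY_GROUPS = ["DENQRK", "CSTPGHY", "AMILVFW"]


def build_mapping(groups):
    """Map each residue letter to the index of the group that contains it."""
    return {aa: str(index) for index, group in enumerate(groups) for aa in group}


def recode(sequence, groups):
    """Rewrite a sequence in the reduced alphabet; unknown letters become '?'."""
    mapping = build_mapping(groups)
    return "".join(mapping.get(aa, "?") for aa in sequence.upper())


print(recode("DKAC", HYDROPHOBICITY_GROUPS))
```

Any other row of the table can be swapped in by passing a different list of groups.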

Substitution Matrix

• Likelihood of One Amino Acid Mutated into Another Over Evolutionary Time
• Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
• Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
• High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
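A small Python sketch of how such scores are used: look up each aligned pair in a symmetric table and sum over an ungapped alignment. Only the Gly/Trp and Lys/Arg values come from the slide; the Trp/Trp and Cys/Cys entries are illustrative assumptions standing in for "high score for identical rare residues", and the table is far from a complete matrix.

```python
# Toy symmetric score table keyed by the unordered residue pair.
TOY_SCORES = {
    frozenset("GW"): -7,  # from the slide: unlikely substitution
    frozenset("KR"): 3,   # from the slide: conservative substitution
    frozenset("W"): 17,   # assumed value for an identical rare residue
    frozenset("C"): 12,   # assumed value for an identical rare residue
}


def pair_score(x, y, scores=TOY_SCORES):
    """Score one aligned residue pair; the order of x and y does not matter."""
    return scores[frozenset((x, y))]


def ungapped_score(a, b, scores=TOY_SCORES):
    """Sum pair scores over two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y, scores) for x, y in zip(a, b))


print(ungapped_score("KWC", "RWC"))
```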

 

Secondary Structure Features

• α Helix: Patterns of Hydrophobic Residue Conservation Showing an i, i+3, i+4, i+7 Pattern Are Highly Indicative of an α Helix (Amphipathic)
• β Strands: Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions i, i+2, i+4, i+6
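The periodicity rules can be checked mechanically. The Python sketch below scans a single sequence for positions where all offsets in a pattern are hydrophobic. It is a simplification: the slide refers to conservation patterns in alignments, and the hydrophobic set used here is an assumption taken from the hydrophobicity grouping on the Feature Representation slide.

```python
HYDROPHOBIC = set("AMILVFW")   # assumed hydrophobic residues
HELIX_OFFSETS = (0, 3, 4, 7)   # i, i+3, i+4, i+7
STRAND_OFFSETS = (0, 2, 4, 6)  # i, i+2, i+4, i+6


def periodic_hits(sequence, offsets):
    """Return start indices i where every position i+offset is hydrophobic."""
    span = max(offsets)
    hits = []
    for i in range(len(sequence) - span):
        if all(sequence[i + k] in HYDROPHOBIC for k in offsets):
            hits.append(i)
    return hits


print(periodic_hits("LKKLLKKL", HELIX_OFFSETS))
print(periodic_hits("VKVKVKV", STRAND_OFFSETS))
```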

BLAST

BLAST (Basic Local Alignment Search Tool)

• Extremely fast
• Robust
• Most frequently used
• It finds very short segment pairs (“seeds”) between the query and the database sequence
• These seeds are then extended in both directions until the maximum possible score for extensions of this particular seed is reached
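The seed-and-extend idea can be sketched in a few lines of Python. This is not the BLAST implementation: it assumes exact 3-letter seeds, a simple match/mismatch score instead of a substitution matrix, and a hypothetical `drop` cutoff that stops an extension once the score falls that far below its best value.

```python
def find_seeds(query, subject, k=3):
    """Yield (query_pos, subject_pos) for every exact shared k-letter word."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            yield i, j


def extend(query, subject, i, j, k=3, match=1, mismatch=-1, drop=2):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def walk(qi, sj, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= drop:
                break  # score fell too far below the best seen so far
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + k, j + k, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    return k * match + left + right, i - left_len, i + k + right_len


query, subject = "MKACDEFGHW", "PPACDEFGTT"
for qi, sj in find_seeds(query, subject):
    print(qi, sj, extend(query, subject, qi, sj))
```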

 

BLAST Search

From BLAST Search Interface

Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment

• Links to iProClass and UniProtKB reports
• Link to NCBI taxonomy
• Link to PIRSF report
• Click to see SSEARCH alignment

BLAST Result & Pairwise Alignment

BLAST Alignment

Classification

• What is classification?
• Why do we need protein classification?
• Different levels of classification
• Basis for functional protein classification
• How to classify a protein of unknown function?

Classification Databases

Protein motif

Protein domain

3-D structure

Whole-protein

C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H

The 2 C's and the 2 H's are zinc ligands
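A motif pattern of this kind can be matched with an ordinary regular expression. The Python sketch below translates the PROSITE-style syntax used above (x for any residue, brackets for alternatives, braces for exclusions, parentheses for repeat counts). The function name is hypothetical, and anchors such as `<` and `>` are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        # Split off an optional repeat suffix such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


zinc_finger = "C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H"
print(prosite_to_regex(zinc_finger))
```

The resulting string can be passed to `re.search` or `re.finditer` to locate the motif in a protein sequence.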

Group proteins according to the presence Group proteins according to common 3D structure common domain architecture and length

Family Classification Methods

• Based on Other Classification Information
• Multiple Sequence Alignment (ClustalW)
• PROSITE Pattern Search
• Profile Search
• Hidden Markov Models (HMMs)
  • Domain (Pfam); Whole protein (PIRSF)
• Neural Networks

How do you build a tree?

• Pick sequences to align
• Align them
• Verify the alignment
• Keep the parts that are aligned correctly
• Build and evaluate a phylogenetic tree

Integrated Analysis

Multiple Sequence Alignment: CLUSTALW

• Pairwise alignment: Calculate distance matrix (mean number of differences per residue)
• Unrooted Neighbor-Joining Tree: Branch length drawn to scale
• Rooted NJ Tree (guide tree): Root placed at a position where the means of the branch lengths on either side of the root are equal
• Progressive Alignment guided by the tree: Alignment starts from the tips of the tree towards the root

Thompson et al., NAR 22, 4675 (1994).
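The first step, a distance matrix of mean differences per residue, can be sketched in Python as follows. The sketch assumes the sequences have already been pairwise aligned with "-" for gaps and skips gapped columns; the function names are hypothetical and this is not the ClustalW code.

```python
def p_distance(a, b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Build a {(name1, name2): distance} table from a dict of aligned sequences."""
    names = list(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q]) for p in names for q in names}


toy = {"s1": "ACDE", "s2": "ACDF", "s3": "AC-E"}  # toy sequences, not real data
for pair, distance in distance_matrix(toy).items():
    print(pair, distance)
```

A neighbor-joining step would then take this table as input to build the guide tree.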

PIR Multiple Alignment and Tree

From Text/Sequence Search Result or CLUSTAL W Alignment Interface

 

PIR Pattern Search

Signature Patterns for Functional Motifs

From Text/Sequence Search Result or Pattern Search Interface

Alignment of a region involved in catalytic activity: P-[IV]-[WY]-x(3)-H-[MR]-V-x(3,4)-Q-x(1,2)-D-x(4,5)-G-A-N

A. Create pattern and search in database: P-[IV]-[WY]-x(3)-H-[MR]-V-x(3,4)-Q-x(1,2)-D-x(4,5)-G-A-N

B. Test sequence against PROSITE database: O05689

A. Pattern Search Result (I)

One Query Pattern Against UniProtKB or UniRef100 DBs

• Display the query pattern
• Indicate pattern sequence region(s)
• Links to iProClass and UniProtKB reports
• Link to NCBI taxonomy
• Link to PIRSF report

B. Pattern Search Result (II)

One Query Sequence Against PROSITE Pattern Database

Profile Method

• Profile: A Table of Scores to Express Family Consensus Derived from Multiple Sequence Alignments
  • Num of Rows = Num of Aligned Positions
  • Each row contains a score for the alignment with each possible residue
• Profile Searching
  • Summation of Scores for Each Amino Acid Residue along Query Sequence
  • Higher Match Values at Conserved Positions
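A minimal Python sketch of the profile idea follows: one row of scores per aligned column, and a search that sums row scores along the query. The log-odds scoring with a uniform background and a pseudocount of 1 is an assumption for illustration; real profiles (e.g., PROSITE) use substitution-matrix-based weights and gap penalties.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """One row per aligned column; each row maps residue -> log-odds score."""
    background = 1.0 / len(ALPHABET)  # assumed uniform background frequency
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        }
        profile.append(row)
    return profile


def score(profile, segment):
    """Sum the row scores for each residue of the segment."""
    return sum(row[aa] for row, aa in zip(profile, segment))


def best_window(profile, sequence):
    """Return (score, start) of the best ungapped match along the sequence."""
    width = len(profile)
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


profile = build_profile(["CAH", "CGH", "CSH"])  # toy alignment
print(best_window(profile, "WWCAHWW"))
```

Conserved columns (C and H here) contribute higher match values than the variable middle column, which is the behavior the slide describes.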

PROSITE PS50157 profile for Zinc finger C2H2

PIRSF scan

• Search One Query Protein Against all the Full-length and Domain HMM models for the fully curated PIRSFs by HMMER
• The matched regions and statistics will be displayed.

• Shows the PIRSF that the query belongs to
• Statistical data for all domains
• Statistical data per domain
• Alignment with consensus sequence

Creation and Curation of PIRSFs

Integrated Bioinformatics System for Function and Pathway Discovery

• Data Integration
• Associative Analysis

• Query Sequence
• UniProt
• Family Classification & Functional Analysis
• BLAST Search
• HMM Domain Search
• Top-Matched Superfamilies/Domains
• HMM Motif Search
• Pattern Search
• SignalP/TMHMM
• Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs
• SSEARCH
• CLUSTALW
• Superfamily/Domain/Motif Alignments
• Family Relationships & Functional Features

Analytical Pipeline

Integrated Bioinformatics System

• Gene/Peptide-Protein Mapping
• Expression Pattern
• Global Bioinformatics Analysis of 1000’s of Genes and Proteins
• Pathway Discovery, Target Identification
• Functional Analysis (Sequence Analysis & Information Retrieval)
• Comprehensive Protein Information Matrix
• Pathway Discovery (Browsing, Sorting, Visualization & Statistical Analysis)

Lab Section

Rat eye lens phosphoproteomics in normal and cataract

Kamei et al., Biol. Pharm. Bull., 2005.

[Figure: 2D gels of normal and cataract samples, pI axis from (-) to (+)]

More phosphorylated spots in cataract sample.

Digestion and MS from Spot 16 gave these peptides:

• ALGPFYPSR
• CSLSADGMLTFSG
• YRLPSNVDQSALS

We want to identify the protein(s) that contain these peptides

Use Peptide Search
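Conceptually, a peptide search is an exact substring match of each peptide against every protein sequence in the database. The Python sketch below shows the idea on a toy dictionary; the protein IDs and sequences are made up for illustration and are not UniProtKB entries, and the function name is hypothetical.

```python
PEPTIDES = ["ALGPFYPSR", "CSLSADGMLTFSG", "YRLPSNVDQSALS"]  # peptides from Spot 16


def peptide_hits(peptides, proteins):
    """Map each protein ID to the (peptide, start) pairs found by exact match."""
    hits = {}
    for protein_id, sequence in proteins.items():
        found = [(p, sequence.find(p)) for p in peptides if p in sequence]
        if found:
            hits[protein_id] = found
    return hits


# Hypothetical toy database, for illustration only.
toy_db = {"toy1": "MMALGPFYPSRKK", "toy2": "MKKKK"}
print(peptide_hits(PEPTIDES, toy_db))
```

Restricting the search to an organism corresponds to filtering the database entries before matching.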

Peptide Search

Restrict search to an organism

Peptide Search & Results

• Species-restricted search
• Search in UniProtKB, 23 proteins
• Links to iProClass and UniProtKB reports
• Sorting arrows
• Link to NCBI taxonomy
• Link to PIRSF report
• Matching peptide highlighted in the sequence

Batch Retrieval Results (I)

• Retrieve multiple proteins from iProClass using a specific identifier or a combination of them
• Provides a means to easily retrieve and analyze proteins when the identifiers come from different databases

Retrieve more sequences

BLAST Similarity Search

What proteins are related to rat CRYAA?

• Perform sequence similarity search with query >P24623 at http://pir.georgetown.edu/pirwww/search/blast.shtml

Pairwise Alignment

PIR Text Search

(http://pir.georgetown.edu/search/textsearch.shtml)

• UniProtKB database and unique UniParc sequences
• PIR protein family classification database

Let’s search for human crystallins

Let’s look for crystallins which have 3D structure

• Display PDB ID
• Refine your search or start over

Domain Display allows simultaneous comparison of Pfam domains present in multiple proteins

• Share same domain architecture

Let’s perform a multiple alignment on the sequences containing PF00030

Multiple Alignment

Interactive Phylogenetic Tree and Alignment

Beta B1 and gamma crystallins share the same domains and SCOP fold and show significant sequence similarity, suggesting that they are related

Pattern Search (I)

• Select P07320 and perform a pattern search
• Search for proteins containing this pattern (PS00225) in rat

Pattern Search Result

Beta and gamma crystallins have multiple copies of this pattern

PIRSF provides a single platform where all the previous analysis has been done by curators

• Pfam domains assigned with high confidence
• Validation tag: represents extent of manual curation
• Link to PIRSF report

• Taxonomic Distribution: alpha-crystallin is found exclusively in metazoans
• Domain Architecture
• Multiple Alignment

PIRSF scan

PIRSF report (I): a single platform to study proteins

• Subfamily level

PIRSF report (II)

Cross-links to other databases

http://www.geneontology.org/

alpha-Crystallin and Related Proteins

• Alpha crystallin beta chain
• HSPs
• Alpha crystallin alpha chain