Protein Family Classification for Functional Genomics


Tutorial: Bioinformatics Resources

( http://pir.georgetown.edu/pirwww/workshop/bioinfo_resource.html )

Bio-Trac 25 (Proteomics: Principles and Methods)

March 23, 2007

Zhang-Zhi Hu, M.D.
Research Associate Professor
Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

What is Bioinformatics?

computer (information) + mouse (biology) = bioinformatics

• NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Molecular Biology Database Collection

( http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D3/DC1 )

968 key databases in 14 categories

Database Collection in Nucleic Acids Res.

Online Access to Database Collection

http://pir.georgetown.edu/pirwww/workshop/2005_database_update.html

2007: http://www.oxfordjournals.org/nar/database/cap/

Overview

Database Contents, Search and Retrieval

I. Text search / Information retrieval
II. Sequence & genomics databases
III. Protein family databases
IV. Database of protein functions
V. Databases of protein structures
VI. Proteomics databases

Entrez Text Searches

( http://www.ncbi.nlm.nih.gov/Entrez/ )


PubMed Literature Database

( http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed )

Literature mining

iProLINK: Protein Literature Mining Resource

( http://pir.georgetown.edu/iprolink/ )

• Text mining for protein phosphorylation
• Gene/protein name thesaurus: synonyms, ambiguous names…

BioThesaurus: Gene/protein name searches (synonyms, ambiguous names…)

( http://pir.georgetown.edu/iprolink/biothesaurus )

Synonyms: CRYAA; crystallin, alpha A; CRYA1; HSPB4…

RLIMS-P: Text mining for protein phosphorylation

( http://pir.georgetown.edu/iprolink/rlimsp/ )

UniProt Text Search

( http://www.pir.uniprot.org/cgi-bin/textSearch )

Google-type search vs. Boolean searches: AND, OR, NOT
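The three Boolean operators can be sketched on a few toy records. This is only an illustration of the logic: the record texts, the record IDs, and the helper names `matches` and `search` are made up here, and nothing below reflects how the UniProt text search is actually implemented.

```python
# Toy records (hypothetical annotation text, not real database entries).
records = {
    "rec1": "alpha crystallin A chain eye lens protein",
    "rec2": "delta crystallin argininosuccinate lyase enzyme",
    "rec3": "argininosuccinate lyase enzyme urea cycle",
}


def matches(text, all_of=(), any_of=(), none_of=()):
    """Return True if text satisfies AND (all_of), OR (any_of), NOT (none_of)."""
    words = set(text.lower().split())
    if not all(term in words for term in all_of):
        return False
    if any_of and not any(term in words for term in any_of):
        return False
    return not any(term in words for term in none_of)


def search(all_of=(), any_of=(), none_of=()):
    """Return the sorted IDs of all toy records that satisfy the query."""
    return sorted(
        record_id
        for record_id, text in records.items()
        if matches(text, all_of, any_of, none_of)
    )


# crystallin AND enzyme
and_hits = search(all_of=("crystallin", "enzyme"))
# crystallin OR lyase
or_hits = search(any_of=("crystallin", "lyase"))
# lyase NOT crystallin
not_hits = search(all_of=("lyase",), none_of=("crystallin",))

print(and_hits, or_hits, not_hits)
```

AND narrows the result set, OR widens it, and NOT excludes records, which is why a Boolean query can be more selective than typing the same words into a single free-text box.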

PIR Text Search (I)

( http://pir.georgetown.edu/pirwww/search/textsearch.html )

Search: which alpha crystallin A chain entries are classified in protein families?

Search for synonyms

PIR Text Search (II)

Search: which crystallins are enzymes, and which families do they belong to?

Can you find which crystallins have a determined 3D structure?

I. Sequence & Genomics Databases

• GenBank: An annotated collection of all publicly available nucleotide and protein sequences.
• RefSeq: NCBI non-redundant set of reference sequences, including genomic DNA, transcript (RNA), and protein products.
• UniProt Consortium Database: Universal protein resource, a central repository of protein sequence and function.
• Entrez Gene: Gene-centered information at NCBI.
• UniGene: Unified clusters of ESTs and full-length mRNA sequences.
• OMIM: Online Mendelian Inheritance in Man, a catalog of human genetic and genomic disorders.
• Model Organism Genome Databases: MGD, RGD, SGD, FlyBase…
• GeneCards: Integrated database of human genes, maps, proteins and diseases.
• SNP Consortium Database; International HapMap Project: Genes associated with human disease.

( http://www.oxfordjournals.org/nar/database/cap/ )

UniProt Consortium Databases

Universal Protein Resource ( http://www.uniprot.org )

4.1 million

• UniProtKB
• UniRef
• UniParc

UniProt Sequence Report (I)

UniProtKB

( http://www.pir.uniprot.org/cgi-bin/unipEntry?id=CRYAA_RABIT )

What’s the difference between CRYAA_RABIT & CYRBAA?
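A UniProtKB record can be referred to by an entry name such as CRYAA_RABIT or by an accession such as P02493 (both forms appear in the lab examples). The sketch below tells the two apart. The accession pattern follows the format described in UniProt's own documentation; the entry-name pattern is a simplified assumption, and the function names are invented for this illustration.

```python
import re

# Accession format as described in UniProt's documentation.
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][0-9][A-Z0-9]{2}[0-9]){1,2})$"
)

# Simplified assumption: PROTEIN_SPECIES, both parts upper-case alphanumerics.
ENTRY_NAME_RE = re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")


def classify_identifier(identifier):
    """Classify a string as 'accession', 'entry name', or 'unknown'."""
    if ACCESSION_RE.match(identifier):
        return "accession"
    if ENTRY_NAME_RE.match(identifier):
        return "entry name"
    return "unknown"


def split_entry_name(entry_name):
    """Split an entry name into its protein part and its species part."""
    protein, species = entry_name.split("_", 1)
    return protein, species


for identifier in ("CRYAA_RABIT", "P02493", "ARLY2_ANAPL", "P24058"):
    print(identifier, "->", classify_identifier(identifier))

print(split_entry_name("CRYAA_RABIT"))
```

The entry name is readable (protein plus species), while the accession is the stable key used in report URLs such as the iProClass entry link.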

UniProt Report (II): UniRef100 & 90

UniRef100 ( http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef100_P02489 )

UniRef90 ( http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef90_P02489 )
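The two cluster identifiers in these links share one shape: a `UniRef` prefix, a sequence-identity level (100 or 90), an underscore, and a member identifier. A minimal sketch of splitting such an ID follows; the function name `parse_uniref_id` is hypothetical, and the parsing rule is inferred only from the two IDs shown on this slide.

```python
def parse_uniref_id(cluster_id):
    """Split an ID such as 'UniRef100_P02489' into (identity level, member ID)."""
    prefix, separator, member = cluster_id.partition("_")
    level = prefix[len("UniRef"):]
    if not separator or not prefix.startswith("UniRef") or not level.isdigit():
        raise ValueError(f"not a UniRef-style identifier: {cluster_id!r}")
    return int(level), member


for cluster_id in ("UniRef100_P02489", "UniRef90_P02489"):
    level, member = parse_uniref_id(cluster_id)
    print(f"{cluster_id}: identity level {level}%, member {member}")
```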

Entrez Gene: Gene-centric information

( http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=12954#ubor0_RefSeq )

OMIM: Online Mendelian Inheritance in Man

( http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580 )

II. Protein Family Databases

• Whole Proteins
– PIRSF: Network Classification Based on Evolutionary Relationship of Whole Proteins
– COG (Clusters of Orthologous Groups) of Complete Genomes
– PANTHER: Proteins Classified into Families/Subfamilies of Shared Function
– ProtoNet: Automated Hierarchical Classification of Proteins
• Protein Domains
– Pfam: Alignments and HMM Models of Protein Domains
– SMART: Protein Domain Families
– CDD: Conserved Domain Database
• Protein Motifs
– PROSITE: Protein Patterns and Profiles
– BLOCKS: Protein Sequence Motifs and Alignments
– PRINTS: Compendium of Protein Fingerprints (a fingerprint is a group of conserved motifs)
• Integrated Family Databases
– InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SUPERFAMILY…

Protein Clustering: COGs

( http://www.ncbi.nlm.nih.gov/COG/ )

• Initial version
• New version: Includes eukaryotic clusters (KOGs)

PIRSF: Full Length Classification

iProClass Family Report ( http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280 )

Domain Classification – Pfam Domain

( http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=CRYAA_RABIT )

( http://pir.georgetown.edu/cgi-bin/ipcEntry?id=P02493 )

Pfam Domain

( http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00525 )

Protein Motifs: PROSITE

A database of protein families and domains. It consists of biologically significant sites, patterns and profiles.

( http://us.expasy.org/prosite/ )

Integrated Family Classification

InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF.

( http://www.ebi.ac.uk/interpro/search.html )

Mapping of families

III. Databases of Protein Functions

• Metabolic Pathways, Enzymes, and Compounds
– Enzyme Classification: Classification and Nomenclature of Enzyme Catalysed Reactions (EC-IUBMB)
– KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
– LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
– EcoCyc: Encyclopedia of E. coli Genes and Metabolism
– MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
– BRENDA: Enzyme Database
– UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
• Inter-Molecular Interactions and Regulatory Pathways
– IntAct: Protein interaction data from literature and user submission
– BIND: Descriptions of interactions, molecular complexes and pathways
– DIP: Catalogs experimentally determined interactions between proteins
– Reactome: A curated knowledgebase of biological pathways
– BioCarta: Biological pathways of human and mouse
– GO: Gene Ontology Consortium Database
• Pathway Resources: Pathguide

Biological Pathway Resource Collection

( http://www.pathguide.org/ )

• Protein-protein interactions
• Metabolic pathways
• Signaling pathways
• Pathway diagrams
• Transcription factors / gene regulatory networks
• Protein-compound interactions
• Genetic interaction networks

KEGG Metabolic & Regulatory Pathways

KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks, the information of genes and proteins, and of chemical compounds and reactions.

( http://www.genome.ad.jp/kegg/kegg2.html )

( http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1 )
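The second KEGG link packs two pieces of information into its query string: a pathway identifier (`hsa00220`) and an EC number to highlight (`4.3.2.1`, argininosuccinate lyase, the enzyme named for delta crystallin II in the lab examples). The sketch below unpacks that query and looks up the top-level EC class. The function names are hypothetical, and reading the identifier as an organism code followed by a map number is an assumption based on this one URL.

```python
from urllib.parse import urlsplit

# Top-level classes of the EC (Enzyme Commission) hierarchy.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",  # class added to the EC system in 2018
}


def parse_kegg_pathway_url(url):
    """Return (organism code, map number, highlighted items) from the query."""
    query = urlsplit(url).query
    pathway_id, *highlighted = query.split("+")
    organism = pathway_id.rstrip("0123456789")
    map_number = pathway_id[len(organism):]
    return organism, map_number, highlighted


def ec_class(ec_number):
    """Return the name of the top-level class of an EC number like '4.3.2.1'."""
    return EC_CLASSES[int(ec_number.split(".")[0])]


url = "http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1"
organism, map_number, highlighted = parse_kegg_pathway_url(url)
print(organism, map_number, highlighted)
print(ec_class(highlighted[0]))
```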

BioCyc: EcoCyc/MetaCyc Metabolic Pathways

The BioCyc Knowledge Library is a collection of Pathway/Genome Databases.

( http://biocyc.org/ )

BioCarta Cellular Pathways

( http://www.biocarta.com/index.asp )

Reactome:

( http://www.reactome.org/ )

• Collaboration of CSHL, EBI and GO Consortium
• Curated resource of core pathways and reactions in human biology
• Authored by biological researchers who are experts in their fields
• Cross-referenced with NCBI, Ensembl, UniProt, HapMap, KEGG…
• Inferred orthologous events in 22 non-human species (mouse, rat…)

Reactome: events and objects (including modified forms and complexes)

Transforming Growth Factor (TGF) beta signaling [Homo sapiens]

( http://reactome.org/cgi-bin/eventbrowser?DB=gk_current&FOCUS_SPECIES=Homo%20sapiens&ID=170834& )

Event -> REACT_6879.1: Activated type I receptor phosphorylates R-SMAD directly [Homo sapiens]
Object -> REACT_7364.1: Phospho-R-SMAD [cytosol]
Event -> REACT_6760.1: Phospho-R-SMAD forms a complex with CO-SMAD [Homo sapiens]
Object -> REACT_7344.1: Phospho-R-SMAD:CO-SMAD complex [cytosol]
Event -> REACT_6726.1: The phospho-R-SMAD:CO-SMAD transfers to the nucleus
Object -> REACT_7382.2: Phospho-R-SMAD:CO-SMAD complex [nucleoplasm]
……
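The alternation of events and objects in this TGF beta example can be captured in a small data structure. The identifiers and names below are copied from the slide; the `Step` class, the function name, and the reading of each object as the product of the event listed just before it are assumptions made for this sketch, not Reactome's own data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str       # "event" or "object"
    stable_id: str  # identifier as shown on the slide
    name: str


tgf_beta_steps = [
    Step("event", "REACT_6879.1",
         "Activated type I receptor phosphorylates R-SMAD directly"),
    Step("object", "REACT_7364.1", "Phospho-R-SMAD [cytosol]"),
    Step("event", "REACT_6760.1",
         "Phospho-R-SMAD forms a complex with CO-SMAD"),
    Step("object", "REACT_7344.1", "Phospho-R-SMAD:CO-SMAD complex [cytosol]"),
    Step("event", "REACT_6726.1",
         "The phospho-R-SMAD:CO-SMAD transfers to the nucleus"),
    Step("object", "REACT_7382.2",
         "Phospho-R-SMAD:CO-SMAD complex [nucleoplasm]"),
]


def pair_events_with_objects(steps):
    """Pair each event with the object listed directly after it."""
    pairs = []
    for current, following in zip(steps, steps[1:]):
        if current.kind == "event" and following.kind == "object":
            pairs.append((current, following))
    return pairs


pairs = pair_events_with_objects(tgf_beta_steps)
for event, obj in pairs:
    print(f"{event.stable_id} -> {obj.stable_id}: {obj.name}")
```

Note that the same complex appears as two separate objects, one in the cytosol and one in the nucleoplasm, so the cellular location is part of what distinguishes one object from another in this listing.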

Protein-Protein Interaction Database: IntAct

( http://www.ebi.ac.uk/intact/ )

Gene Ontology (GO)

( http://www.geneontology.org/ )

• Molecular Function
• Biological Process
• Cellular Component

IV. Databases of Protein Structures

• Protein Structure
– PDB: Structure Determined by X-ray Crystallography and NMR
– PDBsum: Summaries and analyses of PDB structures
– MMDB: NCBI’s database of 3D structures, part of NCBI Entrez
– SWISS-MODEL Repository: Database of annotated protein 3D models
– ModBase: Annotated comparative protein structure models
• Structure Classification
– CATH: Hierarchical Classification of Protein Domain Structures
– SCOP: Familial and Structural Protein Relationships
– FSSP: Protein Fold Classification Based on Structure-Structure Alignment

PDB: Experimental 3D Structure Repository

( http://www.rcsb.org/pdb/ )

Example: rat gamma-crystallin (chains A and B). Can you run a text search at PIR to find this entry (CRGE_RAT)?

PDBsum: Pictorial Database to Provide Summary and Analysis of PDB Entries

( http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/ )

• Search
• 3-D structure summary
• 2-D structure

Protein Structural Classification (1)

CATH: Hierarchical domain classification of protein structures

( http://www.cathdb.info/latest/index.html )

Protein Structural Classification (2)

SCOP: Comprehensive description of structural and evolutionary relationships between all proteins whose structure is known.

( http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html )

SWISS-MODEL Repository

A database of annotated three-dimensional comparative protein structure models

( http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRGE_RAT&job=2 )

VI. Proteomic Resources

• GELBANK ( http://gelbank.anl.gov ): 2D-gel patterns of species with completed genomes
• SWISS-2DPAGE ( http://www.expasy.org/ch2d/ ): Index of 2D-gels
• PEP ( http://cubic.bioc.columbia.edu/pep/ ): Predictions for Entire Proteomes; summarized analyses of protein sequences
• Integr8 ( http://www.ebi.ac.uk/integr8/ ): A browser for information relating to completed genomes and proteomes, based on data contained in Genome Reviews and the UniProt proteome sets
• PRIDE ( http://www.ebi.ac.uk/pride/ ): PRoteomics IDEntifications database

Expression Profiling databases

• GPMdb ( http://gpmdb.thegpm.org/ ): Mass Spec Proteomics Databases

2D-Gel Image Databases

( http://us.expasy.org/ch2d/ )

Part of WORLD-2DPAGE: index to 2-D PAGE databases and services

( http://us.expasy.org/swiss-2dpage/ac=P02489 )

GPMdb: MS Data Search

( http://gpmdb.thegpm.org/ )

Craig, et al., J Proteome Res. 2004, 3:1234-42.

HUPO Plasma Proteome Project

( http://www.ebi.ac.uk/pride/ )

PRIDE: Centralized, standards-compliant, public data repository for proteomics data

Lab:

I. Text search / Information retrieval

1. Literature search and text mining
– Finding synonyms (BioThesaurus)
– Information extraction (e.g., protein phosphorylation sites)
2. Find the sequence for the rabbit alpha crystallin A chain
3. Find all alpha crystallin A chain entries classified in protein families
4. Search for crystallins that have enzymatic activity
5. Find crystallins that have determined 3D structures

II. Database contents (reports)

1. Sequence & genomics databases (UniProt)
2. Protein family databases (PIRSF)
3. Database of protein functions (KEGG)
4. Databases of protein structures (PDB)
5. Proteomics databases (SWISS-2DPAGE)

Protein Examples

• Rabbit alpha crystallin A (UniProtKB: CRYAA_RABIT/P02493)
• Delta crystallin II (Argininosuccinate lyase) (UniProtKB: ARLY2_ANAPL/P24058)
• Any additional proteins of interest to you for search and retrieval
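For the report-retrieval part of the lab, the entry links shown earlier in this tutorial can be rebuilt for each example protein. The sketch below reuses the two URL patterns that appear in the tutorial (the UniProt entry report and the iProClass entry report); the dictionary and function names are hypothetical, the pairing of identifier type to pattern simply mirrors the links shown earlier, and the 2007 addresses are assumed as given and may no longer resolve.

```python
from urllib.parse import urlencode

# URL patterns copied from the report links in this tutorial (2007 addresses).
REPORT_PATTERNS = {
    "UniProt entry": "http://www.pir.uniprot.org/cgi-bin/unipEntry",
    "iProClass entry": "http://pir.georgetown.edu/cgi-bin/ipcEntry",
}

# Lab example proteins: (description, UniProtKB entry name, accession).
EXAMPLE_PROTEINS = [
    ("Rabbit alpha crystallin A", "CRYAA_RABIT", "P02493"),
    ("Delta crystallin II", "ARLY2_ANAPL", "P24058"),
]


def report_url(resource, identifier):
    """Build a report URL of the form <pattern>?id=<identifier>."""
    return REPORT_PATTERNS[resource] + "?" + urlencode({"id": identifier})


for description, entry_name, accession in EXAMPLE_PROTEINS:
    print(description)
    print("  ", report_url("UniProt entry", entry_name))
    print("  ", report_url("iProClass entry", accession))
```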