Protein Family Classification for Functional Genomics


Tutorial:
Bioinformatics Resources
(http://pir.georgetown.edu/~huz/class/bioinfo_resource.html)
Bio-Trac 25 (Proteomics: Principles and Methods)
March 24, 2006
Zhang-Zhi Hu, M.D.
Senior Bioinformatics Scientist, Protein Information Resource
Research Assistant Professor, Department of
Biochemistry and Molecular Biology
Georgetown University Medical Center
What is Bioinformatics?
computer (information) + mouse (biology) = bioinformatics
NIH Biomedical Information Science and Technology
Initiative (BISTI) Working Definition (2000) - Research,
development, or application of computational tools and
approaches for expanding the use of biological, medical,
behavioral or health data, including those to acquire, store,
organize, archive, analyze, or visualize such data.
2
Molecular Biology Database Collection
-- 858 key databases in 15 categories
(http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D3/DC1)
3
Database Collection in Nucleic Acids Res.
4
Online Access to Database Collection
2006
http://pir.georgetown.edu/~huz/class/2005_database_update.html
http://www.oxfordjournals.org/nar/database/cap/
5
Overview
Database Contents, Search and Retrieval
I. Text search / Information retrieval
II. Sequence & genomics databases
III. Protein family databases
IV. Database of protein functions
V. Databases of protein structures
VI. Proteomics databases
6
Entrez Text Searches
(http://www.ncbi.nlm.nih.gov/Entrez/)
7
PubMed Literature Database
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed)
8
UniProt Text Search
(http://www.pir.uniprot.org/cgi-bin/textSearch)
Google-type search vs. Boolean searches: AND, OR, NOT
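The difference between a free-text query and a Boolean query can be illustrated with a minimal sketch. The records and the helper names `matches` and `boolean_search` below are hypothetical and are not part of the UniProt or PIR search interfaces; the sketch only shows how AND, OR and NOT combine term matches.

```python
# Hypothetical mini-index: record ID -> description text (illustrative data only).
RECORDS = {
    "rec1": "alpha crystallin A chain",
    "rec2": "delta crystallin II argininosuccinate lyase",
    "rec3": "argininosuccinate lyase",
}


def matches(term, text):
    """True if the term occurs as a whole word in the text (case-insensitive)."""
    return term.lower() in text.lower().split()


def boolean_search(records, all_of=(), any_of=(), none_of=()):
    """Return sorted record IDs satisfying AND (all_of), OR (any_of), NOT (none_of)."""
    hits = []
    for rec_id, text in records.items():
        # AND: every required term must be present.
        if not all(matches(t, text) for t in all_of):
            continue
        # OR: if alternatives are given, at least one must be present.
        if any_of and not any(matches(t, text) for t in any_of):
            continue
        # NOT: no excluded term may be present.
        if any(matches(t, text) for t in none_of):
            continue
        hits.append(rec_id)
    return sorted(hits)


# "crystallin NOT lyase" keeps only the non-enzyme crystallin record.
print(boolean_search(RECORDS, all_of=["crystallin"], none_of=["lyase"]))
```

A Google-type search would typically rank records by how well they match all the words typed; the Boolean form makes the inclusion and exclusion rules explicit.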
9
PIR Text Search (I)
(http://pir.georgetown.edu/pirwww/search/textsearch.html)
Search: Alpha crystallin A chain and protein family?
10
PIR Text Search (II)
Search: Crystallins that are enzymes?
Can you find which crystallin has a 3D structure determined?
11
I. Sequence & Genomics Databases
GenBank: An annotated collection of all publicly available nucleotide
and protein sequences.
RefSeq: NCBI non-redundant set of reference sequences, including
genomic DNA, transcript (RNA), and protein products
UniProt Consortium Database: Universal protein knowledgebase, a
central resource of protein sequence and function from Swiss-Prot,
TrEMBL and PIR.
Entrez Gene: Gene-centered information at NCBI.
UniGene: Unified clusters of ESTs and full-length mRNA sequences.
OMIM: Online Mendelian Inheritance in Man: a catalog of human
genetic and genomic disorders.
Model Organism Genome Databases: MGD, RGD, SGD, Flybase…
GeneCards: Integrated database of human genes, maps, proteins and
diseases.
SNP Consortium Database
12
UniProt Consortium Databases
Universal Protein Resource
2.85 million
(http://www.uniprot.org)
UniProtKB
UniRef
UniParc
13
UniProt Sequence Report (I)
What’s the difference between CRYAA_RABIT & CYRBAA?
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=CRYAA_RABIT)
14
UniProt Sequence Report (II)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef100_P02489)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef90_P02489)
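The sequence reports above use several kinds of identifiers: entry names such as CRYAA_RABIT, accession numbers such as P02489, and UniRef cluster IDs such as UniRef90_P02489. A minimal sketch of telling them apart by their shape follows. The accession pattern is the one published in the UniProt documentation; the entry-name pattern is a simplified assumption (up to five characters on each side of the underscore), and the function name `classify_identifier` is hypothetical.

```python
import re

# UniProtKB accession pattern as given in the UniProt documentation.
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# Simplified assumption: entry names look like PROTEIN_SPECIES, e.g. CRYAA_RABIT.
ENTRY_NAME_RE = re.compile(r"^[A-Z0-9]{1,5}_[A-Z0-9]{1,5}$")
# UniRef cluster IDs carry the identity level as a prefix, e.g. UniRef90_P02489.
UNIREF_RE = re.compile(r"^UniRef(100|90|50)_(\S+)$")


def classify_identifier(identifier):
    """Guess the identifier type from its format alone (no database lookup)."""
    uniref = UNIREF_RE.match(identifier)
    if uniref:
        return f"UniRef{uniref.group(1)} cluster"
    if ACCESSION_RE.match(identifier):
        return "UniProtKB accession"
    if ENTRY_NAME_RE.match(identifier):
        return "UniProtKB entry name"
    return "unknown"


for example in ["CRYAA_RABIT", "P02489", "UniRef100_P02489", "UniRef90_P02489"]:
    print(example, "->", classify_identifier(example))
```

A format check like this only says what an identifier looks like; whether it resolves to a real entry still has to be confirmed against the database.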
15
Entrez Gene
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=12954#ubor0_RefSeq
16
OMIM: Online Mendelian Inheritance in Man
(http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580)
17
II. Protein Family Databases
Whole Proteins
- PIRSF: A Network Classification System of Protein Families
- COG (Clusters of Orthologous Groups) of Complete Genomes
- ProtoNet: Automated Hierarchical Classification of Proteins
Protein Domains
- Pfam: Alignments and HMM Models of Protein Domains
- SMART: Protein Domain Families
- CDD: Conserved Domain Database
Protein Motifs
- PROSITE: Protein Patterns and Profiles
- BLOCKS: Protein Sequence Motifs and Alignments
- PRINTS: Protein Sequence Motifs and Signatures
Integrated Family Databases
- iProClass: Superfamilies/Families, Domains, Motifs, Rich Links
- InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SuperFamily
18
Protein Clustering
COGs:
(http://www.ncbi.nlm.nih.gov/COG/)
19
KOGs: Eukaryotic Clusters
(http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi?KOG3591)
20
Domain Classification
(http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=CRYAA_RABIT)
(http://pir.georgetown.edu/cgi-bin/ipcEntry?id=CRYAA_RABIT)
21
Pfam Domain
(http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00525)
22
Integrated Family Classification
InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom,
Pfam, SMART, TIGRFAMs, and PIRSF.
(http://www.ebi.ac.uk/interpro/search.html)
23
PIRSF: Full-Length Classification
iProClass Family Report
(http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280)
24
Protein Motifs
PROSITE is a database of protein families and domains. It consists of
biologically significant sites, patterns and profiles. (http://us.expasy.org/prosite/)
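PROSITE patterns use a compact syntax: residues are separated by hyphens, `x` stands for any residue, `[ST]` for one of the listed residues, `{P}` for any residue except those listed, and `(n)` or `(n,m)` for repetition. As a minimal sketch, such a pattern can be translated into a regular expression and scanned against a sequence. The function name `prosite_to_regex` and the test sequences are hypothetical, and the converter covers only the basic syntax elements named here (plus the `<` and `>` terminus anchors), not every PROSITE construct.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern, e.g. N-{P}-[ST]-{P}, into a regex string."""
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # Single letters and [..] sets are already valid regex syntax.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


# N-glycosylation site pattern (PROSITE PS00001) on a made-up sequence.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
print([m.start() for m in re.finditer(regex, "AANGSAA")])
```

Short patterns like this one match frequently by chance, which is one reason PROSITE also provides profiles for families and domains.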
25
III. Databases of Protein Functions
Metabolic Pathways, Enzymes, and Compounds
- Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
- KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
- LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
- EcoCyc: Encyclopedia of E. coli Genes and Metabolism
- MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
- BRENDA: Enzyme Database
- UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
Cellular Regulation and Gene Networks
- EpoDB: Genes Expressed during Human Erythropoiesis
- BIND: Descriptions of interactions, molecular complexes and pathways
- DIP: Catalogs experimentally determined interactions between proteins
- BioCarta: Biological pathways of human and mouse
- GO: Gene Ontology Consortium Database
26
KEGG Metabolic & Regulatory Pathways
KEGG is a suite of databases and associated software, integrating our current knowledge
on molecular interaction networks, the information of genes and proteins, and of chemical
compounds and reactions. (http://www.genome.ad.jp/kegg/kegg2.html)
(http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1)
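The pathway URL above ends in 4.3.2.1, the EC number of argininosuccinate lyase. EC numbers are hierarchical: the first field is the enzyme class and each following field narrows the reaction type. A minimal sketch of splitting an EC number into its levels follows; the function name `parse_ec_number` and the returned dictionary keys are hypothetical, and the class table lists only the six top-level classes in use when this tutorial was given.

```python
# Top-level EC classes (the six classes defined at the time of this tutorial).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec_number(ec):
    """Split an EC number like '4.3.2.1' into its four hierarchical levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()  # tolerate an optional "EC" prefix
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    # A dash marks a level that has not been assigned, e.g. 4.3.2.-
    levels = [None if f == "-" else int(f) for f in fields]
    return {
        "class": levels[0],
        "subclass": levels[1],
        "sub_subclass": levels[2],
        "serial": levels[3],
        "class_name": EC_CLASSES.get(levels[0], "unknown"),
    }


print(parse_ec_number("4.3.2.1"))
```

Partial numbers such as 4.3.2.- appear in annotations when only the general reaction type is known, so the sketch keeps those levels as `None` instead of rejecting them.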
27
BioCyc (EcoCyc/MetaCyc Metabolic Pathways)
The BioCyc Knowledge Library is a collection of
Pathway/Genome Databases (http://biocyc.org/)
28
BioCarta Cellular Pathways
(http://www.biocarta.com/index.asp)
29
Protein-Protein Interaction: BIND
(http://www.bind.ca/)
30
Gene Ontology
(http://www.geneontology.org/)
Three GOs:
Molecular Function
Biological Process
Cellular Component
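GO annotation files mark each term with a one-letter aspect code: F for Molecular Function, P for Biological Process and C for Cellular Component. As a minimal sketch, annotations can be grouped under the three ontologies as shown below. The annotation list is illustrative only and does not describe any particular protein, and the function name `group_by_ontology` is hypothetical.

```python
from collections import defaultdict

# Aspect codes used in GO annotation files.
ASPECTS = {
    "F": "Molecular Function",
    "P": "Biological Process",
    "C": "Cellular Component",
}

# Illustrative annotations for an unspecified protein: (GO ID, term name, aspect).
ANNOTATIONS = [
    ("GO:0003824", "catalytic activity", "F"),
    ("GO:0008152", "metabolic process", "P"),
    ("GO:0005737", "cytoplasm", "C"),
    ("GO:0005634", "nucleus", "C"),
]


def group_by_ontology(annotations):
    """Group (GO ID, name) pairs under the full name of their ontology."""
    grouped = defaultdict(list)
    for go_id, name, aspect in annotations:
        grouped[ASPECTS[aspect]].append((go_id, name))
    return dict(grouped)


for ontology, terms in group_by_ontology(ANNOTATIONS).items():
    print(ontology, terms)
```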
31
IV. Databases of Protein Structures
Protein Structure
- PDB: Structure Determined by X-ray Crystallography and NMR
- PDBsum: Summaries and analyses of PDB structures
- MMDB: NCBI’s database of 3D structures, part of NCBI Entrez
- SWISS-MODEL Repository: Database of annotated protein 3D models
- ModBase: Annotated comparative protein structure models
Structure Classification
- CATH: Hierarchical Classification of Protein Domain Structures
- SCOP: Familial and Structural Protein Relationships
- FSSP: Protein Fold Classification Based on Structure-Structure Alignment
32
PDB: Experimental 3D Structure Repository
Rat gamma-crystallin, chain A, B.
Can you do a text search at PIR to find this?
(http://www.rcsb.org/pdb/)
33
PDBsum:
Summary and Analysis
(http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/)
Search
3-D structure summary
2-D structure
34
Protein Structural Classification (1)
CATH: Hierarchical domain classification of protein structures
(http://www.biochem.ucl.ac.uk/bsm/cath_new/)
35
Protein Structural Classification (2)
SCOP: comprehensive description of structural and evolutionary relationships
between all proteins whose structure is known.
(http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html)
36
SWISS-MODEL Repository
A database of annotated three-dimensional
comparative protein structure models
(http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRGE_RAT&job=2)
37
VI. Proteomic Resources
GELBANK (http://gelbank.anl.gov): 2D-gel patterns from completed
genomes; SWISS-2DPAGE (http://www.expasy.org/ch2d/)
PEP (http://cubic.bioc.columbia.edu/pep/): Predictions for Entire
Proteomes: summarized analyses of protein sequences
Integr8 (http://www.ebi.ac.uk/integr8/): A browser for information
relating to completed genomes and proteomes, based on data
contained in Genome Reviews and the UniProt proteome sets
PRIDE (http://www.ebi.ac.uk/pride/): PRoteomics IDEntifications
database Expression Profiling databases
GPMdb (http://gpmdb.thegpm.org/): Mass Spec Proteomics
Databases
38
2D-Gel Image Databases (1)
(http://us.expasy.org/ch2d/2d-index.html)
(http://us.expasy.org/cgi-bin/nice2dpage.pl?P02489)
39
2D-Gel Image Databases (2)
(http://gelbank.anl.gov/2dgels/index.asp)
40
GPMdb MS Data Search
Craig, et al., J Proteome Res. 2004, 3:1234-42.
http://gpmdb.thegpm.org/
41
iProLINK: Protein Literature Mining Resource
Text mining of protein phosphorylation
Gene/protein name thesaurus:
synonyms, ambiguous names…
http://pir.georgetown.edu/iprolink/
42
Lab:
Alpha crystallin A (UniProt: CRYAA_RABIT/P02493)
Delta crystallin II (Argininosuccinate lyase)
(UniProt: ARLY2_ANAPL/P24058)
Choose additional protein IDs to browse the variety of
molecular biology databases each sequence report links to.
43