Protein Family Classification for Functional Genomics


Tutorial:
Bioinformatics Resources
(http://pir.georgetown.edu/~huz/class/bioinfo_resource.html)
Bio-Trac 25 (Proteomics: Principles and Methods)
March 26, 2004
Zhang-Zhi Hu, M.D.
Senior Bioinformatics Scientist
Protein Information Resource
National Biomedical Research Foundation, GUMC
What is Bioinformatics?
computer (information) + mouse (biology) = bioinformatics
NIH Biomedical Information Science and Technology
Initiative (BISTI) Working Definition (2000) - Research,
development, or application of computational tools and
approaches for expanding the use of biological, medical,
behavioral or health data, including those to acquire, store,
organize, archive, analyze, or visualize such data.
Molecular Biology Database Collection
(http://nar.oupjournals.org/cgi/content/full/32/suppl_1/D3)
-- 548 key databases in 11 categories
(http://pir.georgetown.edu/~huz/class/2004_database_update.html)
Overview
Database Contents, Search and Retrieval
I. Text search / Information retrieval
II. Sequence & genomics databases
III. Protein family databases
IV. Database of protein functions
V. Databases of protein structures
VI. 2D-gel databases
VII. Proteomics databases
Entrez Text Searches
(http://www.ncbi.nlm.nih.gov/Entrez/)
PubMed Literature Database
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed)
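The same kind of text search can also be expressed as a URL built in a script. The sketch below is a minimal illustration, assuming NCBI's E-utilities ESearch endpoint rather than the query.fcgi page cited on the slide; the helper name build_esearch_url and the query string are hypothetical, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not the query.fcgi page on the slide).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for a text query; nothing is sent over the network."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Hypothetical query: PubMed records mentioning alpha crystallin and rabbit.
url = build_esearch_url("pubmed", "alpha crystallin AND rabbit")
print(url)
```

Changing the db argument (for example to "protein") would point the same query at a different Entrez database.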
UniProt Text Search
(http://www.pir.uniprot.org/cgi-bin/textSearch)
PIR Text Search (I)
(http://pir.georgetown.edu/pirwww/search/textsearch.html)
What’s different between CRAA_RABIT & CYRBAA?
How about Search: Crystallin and SuperFamily?
PIR Text Search (II)
Can you use PIR text search to find which crystallin has a determined 3D structure?
I. Sequence & Genomics Databases
GenBank: An annotated collection of all publicly available nucleotide and protein sequences.
RefSeq: NCBI non-redundant set of reference sequences, including genomic DNA, transcript (RNA), and protein products.
UniProt Consortium Database: Universal protein knowledgebase, a central resource of protein sequence and function from Swiss-Prot, TrEMBL and PIR.
LocusLink: Curated sequences and descriptions of genetic loci.
UniGene: Unified clusters of ESTs and full-length mRNA sequences.
OMIM: Online Mendelian Inheritance in Man, a catalog of human genetic and genomic disorders.
Model Organism Genome Databases: MGD, RGD, SGD, FlyBase…
GeneCards: Integrated database of human genes, maps, proteins and diseases.
SNP Consortium Database
UniProt Consortium Database
(http://www.uniprot.org)
UniProt (knowledgebase)
UniRef (100, 90, 50)
UniParc (archive)
UniProt Sequence Report (I)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=CRAA_RABIT)
UniProt Sequence Report (II)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef90_P02489)
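The identifier in this report encodes two things: the UniRef identity level (100, 90 or 50 percent) and the accession of the cluster's representative sequence. A minimal sketch of splitting such an identifier follows; the function name parse_uniref_id is hypothetical and the check covers only the three levels named on the UniProt slide.

```python
def parse_uniref_id(uniref_id):
    """Split an identifier such as 'UniRef90_P02489' into (identity level, accession)."""
    prefix, sep, accession = uniref_id.partition("_")
    if not sep or not prefix.startswith("UniRef") or not accession:
        raise ValueError(f"not a UniRef identifier: {uniref_id!r}")
    # int() raises ValueError if nothing numeric follows 'UniRef'.
    identity = int(prefix[len("UniRef"):])
    if identity not in (100, 90, 50):
        raise ValueError(f"unexpected identity level: {identity}")
    return identity, accession


level, accession = parse_uniref_id("UniRef90_P02489")
print(level, accession)
```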
NCBI LocusLink
(http://www.ncbi.nlm.nih.gov/LocusLink)
OMIM: Online Mendelian Inheritance in Man
(http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580)
II. Protein Family Databases
Whole Proteins
- PIRSF: A Network Classification System of Protein Families
- COG (Clusters of Orthologous Groups) of Complete Genomes
- ProtoNet: Automated Hierarchical Classification of Proteins
Protein Domains
- Pfam: Alignments and HMM Models of Protein Domains
- SMART: Protein Domain Families
- CDD: Conserved Domain Database
Protein Motifs
- PROSITE: Protein Patterns and Profiles
- BLOCKS: Protein Sequence Motifs and Alignments
- PRINTS: Protein Sequence Motifs and Signatures
Integrated Family Databases
- iProClass: Superfamilies/Families, Domains, Motifs, Rich Links
- InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SuperFamily
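Motif databases such as PROSITE describe a family by a short pattern that a sequence either matches or does not. As a rough illustration of how such a pattern can be applied, the sketch below translates PROSITE pattern syntax into a Python regular expression. The function name prosite_to_regex, the example pattern and the test sequence are illustrative assumptions, and only the basic syntax elements (x, [..], {..}, repeat counts, < and > anchors) are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate basic PROSITE pattern syntax into a Python regular expression."""
    regex = pattern.rstrip(".")                         # patterns may end with a period
    regex = regex.replace("-", "")                      # '-' only separates positions
    regex = regex.replace("x", ".")                     # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = repeat 2 to 4 times
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex


# Illustrative pattern written in PROSITE syntax (not taken from the slides).
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)

# Made-up sequence constructed to contain one match.
sequence = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
match = re.search(regex, sequence)
print(regex, match.span() if match else None)
```

Domain databases such as Pfam use hidden Markov models rather than patterns, so this approach applies only to pattern-style motifs.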
Domain Classification
(http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=CRAA_RABIT)
(http://pir.georgetown.edu/cgi-bin/ipcEntry?id=CRAA_RABIT)
Pfam Domain
(http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00525)
Integrated Family Classification
InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF.
(http://www.ebi.ac.uk/interpro/search.html)
PIRSF: Full Length Classification
iProClass Family Report
(http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280)
III. Databases of Protein Functions
Metabolic Pathways, Enzymes, and Compounds
- Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
- KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
- LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
- EcoCyc: Encyclopedia of E. coli Genes and Metabolism
- MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
- WIT: Functional Curation and Metabolic Models
- BRENDA: Enzyme Database
- UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
Cellular Regulation and Gene Networks
- EpoDB: Genes Expressed during Human Erythropoiesis
- BIND: Descriptions of interactions, molecular complexes and pathways
- DIP: Catalogs experimentally determined interactions between proteins
- BioCarta: Biological pathways of human and mouse
- GO: Gene Ontology Consortium Database
KEGG Metabolic & Regulatory Pathways
KEGG is a suite of databases and associated software that integrates current knowledge of molecular interaction networks, information on genes and proteins, and information on chemical compounds and reactions.
(http://www.genome.ad.jp/kegg/kegg2.html)
(http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1)
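The pathway link above packs two identifiers into its query string: a KEGG pathway map for an organism (hsa00220) and an EC number to highlight (4.3.2.1, the EC number of argininosuccinate lyase). A minimal sketch of pulling these apart and reading the top-level EC class follows. The function names are hypothetical, and the class table lists only the six classes in use when this tutorial was given.

```python
from urllib.parse import urlsplit

# Top-level EC classes (first digit of an EC number).
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def parse_show_pathway(url):
    """Return (organism code, pathway map ID, highlighted items) from a show_pathway URL."""
    query = urlsplit(url).query                  # e.g. "hsa00220+4.3.2.1"
    pathway, *highlights = query.split("+")
    organism = pathway.rstrip("0123456789")      # leading letters, e.g. "hsa"
    return organism, pathway, highlights


def ec_class(ec_number):
    """Name the top-level class of an EC number such as '4.3.2.1'."""
    return EC_CLASSES[int(ec_number.split(".")[0])]


organism, pathway, highlights = parse_show_pathway(
    "http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1"
)
print(organism, pathway, highlights, ec_class(highlights[0]))
```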
BioCyc (EcoCyc/MetaCyc Metabolic Pathways)
The BioCyc Knowledge Library is a collection of Pathway/Genome Databases (http://biocyc.org/)
BioCarta Cellular Pathways
(http://www.biocarta.com/index.asp)
Protein-Protein Interaction: BIND
(http://www.bind.ca/)
Gene Ontology
(http://www.geneontology.org/)
Three GOs:
Molecular Function
Biological Process
Cellular Component
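GO annotation files conventionally mark each annotation with a one-letter aspect code (F, P or C) for these three ontologies. The sketch below groups a few annotations by aspect. The annotation list is illustrative, typed in by hand rather than retrieved from any database, and the function name group_by_aspect is hypothetical.

```python
from collections import defaultdict

# One-letter aspect codes used in GO annotation files.
ASPECTS = {
    "F": "Molecular Function",
    "P": "Biological Process",
    "C": "Cellular Component",
}

# Illustrative (term ID, term name, aspect code) tuples; not retrieved from a database.
annotations = [
    ("GO:0005212", "structural constituent of eye lens", "F"),
    ("GO:0007601", "visual perception", "P"),
    ("GO:0005737", "cytoplasm", "C"),
]


def group_by_aspect(annotations):
    """Group (id, name, aspect) tuples under the full name of their ontology."""
    grouped = defaultdict(list)
    for go_id, name, aspect in annotations:
        grouped[ASPECTS[aspect]].append((go_id, name))
    return dict(grouped)


for ontology, terms in group_by_aspect(annotations).items():
    print(ontology, terms)
```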
IV. Databases of Protein Structures
Protein Structure
- PDB: Structures Determined by X-ray Crystallography and NMR
- PDBsum: Summaries and analyses of PDB structures
- MMDB: NCBI’s database of 3D structures, part of NCBI Entrez
- SWISS-MODEL Repository: Database of annotated protein 3D models
- ModBase: Annotated comparative protein structure models
Structure Classification
- CATH: Hierarchical Classification of Protein Domain Structures
- SCOP: Familial and Structural Protein Relationships
- FSSP: Protein Fold Classification Based on Structure-Structure Alignment
PDB 3D Structure
(http://www.rcsb.org/pdb/)
Rat gamma-crystallin, chains A and B.
Can you do a text search at PIR to find this?
PDBsum: Summary and Analysis
(http://www.biochem.ucl.ac.uk/bsm/pdbsum)
Protein Structural Classification (1)
CATH: Hierarchical domain classification of protein structures
(http://www.biochem.ucl.ac.uk/bsm/cath_new/)
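The hierarchy is reflected in CATH's dotted codes, whose first four numbers give the Class, Architecture, Topology and Homologous superfamily levels. A minimal sketch of reading such a code follows; the function name parse_cath_code and the example code are illustrative assumptions, not taken from the slides.

```python
CATH_LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_cath_code(code):
    """Map a dotted CATH code (up to four numbers) onto the C, A, T and H levels."""
    parts = code.split(".")
    if len(parts) > len(CATH_LEVELS) or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a CATH code: {code!r}")
    # zip stops at the shorter input, so partial codes give partial results.
    return dict(zip(CATH_LEVELS, map(int, parts)))


# Illustrative code; a shorter code such as "2.60" names a node higher in the hierarchy.
print(parse_cath_code("2.60.40.10"))
print(parse_cath_code("2.60"))
```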
Protein Structural Classification (2)
SCOP: comprehensive description of structural and evolutionary relationships
between all proteins whose structure is known.
(http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html)
SWISS-MODEL Repository
A database of annotated three-dimensional comparative protein structure models
(http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRGE_RAT&job=2)
VI. Proteomic Resources
GELBANK (http://gelbank.anl.gov): 2D-gel patterns from completed genomes
SWISS-2DPAGE (http://www.expasy.org/ch2d/)
PEP, Predictions for Entire Proteomes (http://cubic.bioc.columbia.edu/pep/): Summarized analyses of protein sequences
Proteome BioKnowledge Library (http://www.proteome.com): Detailed information on human, mouse and rat proteomes
Proteome Analysis Database (http://www.ebi.ac.uk/proteome/): Online application of InterPro and CluSTr for the functional classification of proteins in whole genomes
Expression Profiling databases:
GNF (http://expression.gnf.org/cgi-bin/index.cgi): human and mouse transcriptome
SMD (http://genome-www5.stanford.edu/MicroArray/SMD/): Stanford microarray data analysis
EBI Microarray Informatics (http://www.ebi.ac.uk/microarray/index.html): managing, storing and analyzing microarray data
2D-Gel Image Databases (1)
(http://us.expasy.org/ch2d/2d-index.html)
(http://us.expasy.org/cgi-bin/nice2dpage.pl?P02489)
2D-Gel Image Databases (2)
(http://gelbank.anl.gov/2dgels/index.asp)
Expression Profiling
(http://genome-www.stanford.edu/serum/)
Human and Mouse Transcriptome
(http://expression.gnf.org/cgi-bin/index.cgi)
Lab:
Alpha crystallin (UniProt: CRAA_RABIT)
Delta crystallin II (Argininosuccinate lyase) (UniProt: CRD2_ANAPL)
Choose additional protein IDs to browse the variety of molecular biology databases each sequence report links to.
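For browsing several protein IDs, the report links can be generated rather than typed. The sketch below fills the entry-report URL patterns shown earlier in this tutorial with each lab ID. The templates date from 2004 and may no longer resolve, the function name report_links is hypothetical, and nothing is fetched.

```python
from urllib.parse import quote

# URL templates copied from the slides (2004); they may no longer resolve.
REPORT_TEMPLATES = {
    "UniProt": "http://www.pir.uniprot.org/cgi-bin/unipEntry?id={id}",
    "iProClass": "http://pir.georgetown.edu/cgi-bin/ipcEntry?id={id}",
    "Pfam": "http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name={id}",
}


def report_links(protein_id):
    """Return a {database name: report URL} mapping for one protein ID."""
    safe_id = quote(protein_id, safe="")  # escape anything unsafe in a query string
    return {name: template.format(id=safe_id)
            for name, template in REPORT_TEMPLATES.items()}


for protein_id in ("CRAA_RABIT", "CRD2_ANAPL"):
    for name, link in report_links(protein_id).items():
        print(f"{protein_id}\t{name}\t{link}")
```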