Transcript Protein Family Classification for Functional Genomics
Tutorial: Bioinformatics Resources
( http://pir.georgetown.edu/pirwww/workshop/bioinfo_resource.html )
Bio-Trac 25 (Proteomics: Principles and Methods)
March 23, 2007
Zhang-Zhi Hu, M.D.
Research Associate Professor
Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
What is Bioinformatics?
computer + mouse = bioinformatics
(information) (biology)
• NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Molecular Biology Database Collection ( http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D3/DC1 ): 968 key databases in 14 categories
Database Collection in Nucleic Acids Res.
Online Access to Database Collection http://pir.georgetown.edu/pirwww/workshop/2005_database_update.html
2007
http://www.oxfordjournals.org/nar/database/cap/
Overview
Database Contents, Search and Retrieval
I. Text search / Information retrieval
II. Sequence & genomics databases
III. Protein family databases
IV. Database of protein functions
V. Databases of protein structures
VI. Proteomics databases
Entrez Text Searches
( http://www.ncbi.nlm.nih.gov/Entrez/ )
PubMed Literature Database
( http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed )
Literature mining
iProLINK: Protein Literature Mining Resource
• Text mining for protein phosphorylation
• Gene/protein name thesaurus: synonyms, ambiguous names…
http://pir.georgetown.edu/iprolink/
BioThesaurus: Gene/protein name searches - synonyms, ambiguous names…
Synonyms: CRYAA; crystallin, alpha A; CRYA1; HSPB4…
http://pir.georgetown.edu/iprolink/biothesaurus
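A synonym lookup of this kind can be sketched in a few lines. The example below is a minimal illustration, not BioThesaurus's actual implementation: the only data is the synonym group shown on the slide, and the function names (build_index, synonyms, is_ambiguous) are hypothetical. A name is treated as ambiguous when it appears in more than one synonym group.

```python
from collections import defaultdict

# Single synonym group copied from the slide; a real thesaurus holds far more.
SYNONYM_GROUPS = [
    ["CRYAA", "crystallin, alpha A", "CRYA1", "HSPB4"],
]

def build_index(groups):
    """Map each lower-cased name to the ids of the groups that contain it."""
    index = defaultdict(set)
    for group_id, names in enumerate(groups):
        for name in names:
            index[name.lower()].add(group_id)
    return index

def synonyms(name, groups, index):
    """Return every other name that shares a group with `name` (case-insensitive)."""
    found = set()
    for group_id in index.get(name.lower(), ()):
        found.update(groups[group_id])
    return sorted(n for n in found if n.lower() != name.lower())

def is_ambiguous(name, index):
    """A name is ambiguous if it belongs to more than one synonym group."""
    return len(index.get(name.lower(), ())) > 1

INDEX = build_index(SYNONYM_GROUPS)
print(synonyms("hspb4", SYNONYM_GROUPS, INDEX))
```

Lower-casing the keys is a design choice made here so that "HSPB4" and "hspb4" resolve to the same group; a production thesaurus would likely need more careful name normalization.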
RLIMS-P: Text mining for protein phosphorylation
http://pir.georgetown.edu/iprolink/rlimsp/
UniProt Text Search
( http://www.pir.uniprot.org/cgi-bin/textSearch )
Google-type search vs. Boolean searches: AND, OR, NOT
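The three Boolean operators map directly onto set operations over an inverted index. The sketch below assumes a tiny hypothetical corpus (the record ids and texts are illustrative, not real database rows) and is not how the UniProt search engine is implemented; it only shows what AND, OR and NOT mean for a result set.

```python
# Hypothetical mini-corpus: record id -> free-text description.
RECORDS = {
    "rec1": "alpha crystallin A chain",
    "rec2": "alpha crystallin B chain",
    "rec3": "delta crystallin argininosuccinate lyase",
}

def build_inverted_index(records):
    """Map each lower-cased word to the set of record ids that contain it."""
    index = {}
    for rec_id, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(rec_id)
    return index

def term(index, word):
    """Records matching a single word; an unknown word matches nothing."""
    return index.get(word.lower(), set())

INDEX = build_inverted_index(RECORDS)

and_hits = term(INDEX, "crystallin") & term(INDEX, "alpha")   # crystallin AND alpha
or_hits = term(INDEX, "alpha") | term(INDEX, "delta")         # alpha OR delta
not_hits = term(INDEX, "crystallin") - term(INDEX, "alpha")   # crystallin NOT alpha

print(sorted(and_hits), sorted(or_hits), sorted(not_hits))
```

A Google-type search, by contrast, usually ranks records by relevance rather than returning an exact set, which is why the two search styles can give different results for the same words.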
PIR Text Search (I)
( http://pir.georgetown.edu/pirwww/search/textsearch.html )
Search: which alpha crystallin A chains are classified in protein families?
Search for synonyms
PIR Text Search (II)
Search: which crystallins are enzymes, and which families do they belong to?
Can you find which crystallins have a determined 3D structure?
I. Sequence & Genomics Databases
• GenBank: An annotated collection of all publicly available nucleotide and protein sequences.
• RefSeq: NCBI non-redundant set of reference sequences, including genomic DNA, transcript (RNA), and protein products.
• UniProt Consortium Database: Universal Protein Resource, a central repository of protein sequence and function.
• Entrez Gene: Gene-centered information at NCBI.
• UniGene: Unified clusters of ESTs and full-length mRNA sequences.
• OMIM: Online Mendelian Inheritance in Man, a catalog of human genetic and genomic disorders.
• Model Organism Genome Databases: MGD, RGD, SGD, FlyBase…
• GeneCards: Integrated database of human genes, maps, proteins and diseases.
• SNP Consortium Database; International HapMap Project: Genes associated with human disease.
( http://www.oxfordjournals.org/nar/database/cap/ )
UniProt Consortium Databases
Universal Protein Resource 4.1 million ( http://www.uniprot.org )
UniProtKB, UniRef, UniParc
UniProt Sequence Report (I)
UniProtKB
What's the difference between CRYAA_RABIT & CYRBAA?
( http://www.pir.uniprot.org/cgi-bin/unipEntry?id=CRYAA_RABIT )
UniProt Report (II): UniRef100 & 90
UniRef100 ( http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef100_P02489 )
UniRef90 ( http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef90_P02489 )
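UniRef100 groups identical sequences into one entry, while UniRef90 groups sequences that share at least 90% identity. The sketch below illustrates the difference with a naive greedy clustering. It is a simplification under stated assumptions and not the UniRef production pipeline: identity is computed position by position on equal-length strings rather than from an alignment, sequence fragments are ignored, and the peptides are artificial.

```python
def identity(a, b):
    """Fraction of matching positions; a naive stand-in for alignment-based identity."""
    if len(a) != len(b):
        raise ValueError("this toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold):
    """Put each sequence in the first cluster whose representative it matches."""
    clusters = []  # list of (representative sequence, member names)
    for name, seq in seqs:
        for rep_seq, members in clusters:
            if identity(seq, rep_seq) >= threshold:
                members.append(name)
                break
        else:
            # No representative was close enough, so start a new cluster.
            clusters.append((seq, [name]))
    return [members for _, members in clusters]

# Artificial 10-residue peptides, chosen only to make the thresholds visible.
TOY = [
    ("seqA", "ACDEFGHIKL"),
    ("seqB", "ACDEFGHIKL"),  # identical to seqA
    ("seqC", "ACDEFGHIKV"),  # 9 of 10 positions match seqA
    ("seqD", "WWWWYYYYPP"),  # unrelated
]

print(greedy_cluster(TOY, 1.0))  # UniRef100-like grouping
print(greedy_cluster(TOY, 0.9))  # UniRef90-like grouping
```

At the 100% threshold only seqA and seqB merge; at 90% seqC joins them, which mirrors why a UniRef90 cluster typically contains more members than the corresponding UniRef100 cluster.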
Entrez Gene: Gene-centric information
( http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=12954#ubor0_RefSeq )
OMIM: Online Mendelian inheritance in man
( http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580 )
II. Protein Family Databases
• Whole Proteins
– PIRSF: Network Classification Based on Evolutionary Relationship of Whole Protein
– COG (Clusters of Orthologous Groups) of Complete Genomes
– PANTHER: Proteins Classified into Families/Subfamilies of Shared Function
– ProtoNet: Automated Hierarchical Classification of Proteins
• Protein Domains
– Pfam: Alignments and HMM Models of Protein Domains
– SMART: Protein Domain Families
– CDD: Conserved Domain Database
• Protein Motifs
– PROSITE: Protein Patterns and Profiles
– BLOCKS: Protein Sequence Motifs and Alignments
– PRINTS: Compendium of Protein Fingerprints (a group of conserved motifs)
• Integrated Family Databases
– InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SuperFamily…
Protein Clustering
Initial version
COGs: ( http://www.ncbi.nlm.nih.gov/COG/ )
New version: Includes Eukaryotic Clusters (KOGs)
PIRSF: Full Length Classification
iProClass Family Report ( http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280 )
Domain Classification – Pfam Domain
( http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=CRYAA_RABIT )
( http://pir.georgetown.edu/cgi-bin/ipcEntry?id=P02493 )
Pfam Domain
( http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00525 )
Protein Motifs: PROSITE
A database of protein families and domains. It consists of biologically significant sites, patterns and profiles.
( http://us.expasy.org/prosite/ )
Integrated Family Classification
InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF.
( http://www.ebi.ac.uk/interpro/search.html )
Mapping of families
III. Databases of Protein Functions
• Metabolic Pathways, Enzymes, and Compounds
– Enzyme Classification: Classification and Nomenclature of Enzyme Catalysed Reactions (EC-IUBMB)
– KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
– LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
– EcoCyc: Encyclopedia of E. coli Genes and Metabolism
– MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
– BRENDA: Enzyme Database
– UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
• Inter-Molecular Interactions and Regulatory Pathways
– IntAct: Protein interaction data from literature and user submission
– BIND: Descriptions of interactions, molecular complexes and pathways
– DIP: Catalogs experimentally determined interactions between proteins
– Reactome: A curated knowledgebase of biological pathways
– BioCarta: Biological pathways of human and mouse
– GO: Gene Ontology Consortium Database
• Pathway Resources: Pathguide
Biological Pathway Resource Collection
http://www.pathguide.org/
• Protein-protein interactions
• Metabolic pathways
• Signaling pathways
• Pathway diagrams
• Transcription factors / gene regulatory networks
• Protein-compound interactions
• Genetic interaction networks
KEGG Metabolic & Regulatory Pathways
KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks, the information of genes and proteins, and of chemical compounds and reactions.
( http://www.genome.ad.jp/kegg/kegg2.html )
( http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1 )
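The query string in the pathway URL above, hsa00220+4.3.2.1, appears to combine a pathway identifier with an EC number to highlight. The sketch below splits such a string and reads the four-level EC number. The function names are hypothetical, the interpretation of the "+" separator is an assumption based on the URL alone, and the class table is the standard IUBMB top-level list (class 7 was added after this tutorial was written).

```python
# Top-level EC classes as defined by IUBMB.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",  # added in 2018, later than this tutorial
}

def split_kegg_query(query):
    """Split 'pathway+EC' into its two parts (assumed meaning of the '+')."""
    pathway_id, _, ec_number = query.partition("+")
    return pathway_id, ec_number

def parse_ec(ec_number):
    """Break an EC number into its four levels and name the top-level class."""
    levels = ec_number.split(".")
    if len(levels) != 4 or not levels[0].isdigit():
        raise ValueError(f"not a four-level EC number: {ec_number!r}")
    return {
        "class": EC_CLASSES.get(int(levels[0]), "unknown"),
        "levels": levels,
    }

pathway_id, ec_number = split_kegg_query("hsa00220+4.3.2.1")
print(pathway_id, parse_ec(ec_number))
```

EC 4.3.2.1 falls in class 4 (lyases), which is consistent with the argininosuccinate lyase activity of delta crystallin II used as a protein example in the lab.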
BioCyc: EcoCyc/MetaCyc Metabolic Pathways
The BioCyc Knowledge Library is a collection of Pathway/Genome Databases ( http://biocyc.org/ )
BioCarta Cellular Pathways
( http://www.biocarta.com/index.asp )
Reactome: http://www.reactome.org/
• Collaboration of CSHL, EBI and GO Consortium
• Curated resource of core pathways and reactions in human biology
• Authored by biological researchers who are experts in their fields
• Cross-referenced with NCBI, Ensembl, UniProt, HapMap, KEGG…
• Inferred orthologous events in 22 non-human species (mouse, rat…)
Transforming Growth Factor (TGF) beta signaling [Homo sapiens]
( http://reactome.org/cgi-bin/eventbrowser?DB=gk_current&FOCUS_SPECIES=Homo%20sapiens&ID=170834& )
Reactome: events and objects (including modified forms and complexes)
Event -> REACT_6879.1: Activated type I receptor phosphorylates R-SMAD directly [Homo sapiens]
Object -> REACT_7364.1: Phospho-R-SMAD [cytosol]
Event -> REACT_6760.1: Phospho-R-SMAD forms a complex with CO-SMAD [Homo sapiens]
Object -> REACT_7344.1: Phospho-R-SMAD:CO-SMAD complex [cytosol]
Event -> REACT_6726.1: The phospho-R-SMAD:CO-SMAD transfers to the nucleus
Object -> REACT_7382.2: Phospho-R-SMAD:CO-SMAD complex [nucleoplasm]
……
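The alternating event/object listing can be held in a small data structure. The sketch below is only an illustration of that listing, not Reactome's data model or API: the Entry class and the pairing function are hypothetical, and pairing each event with the object listed directly after it is an assumption drawn from the order on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    kind: str       # "Event" or "Object"
    stable_id: str  # identifier as printed on the slide
    name: str

# Entries copied from the slide, in the order shown.
LISTING = [
    Entry("Event", "REACT_6879.1", "Activated type I receptor phosphorylates R-SMAD directly"),
    Entry("Object", "REACT_7364.1", "Phospho-R-SMAD [cytosol]"),
    Entry("Event", "REACT_6760.1", "Phospho-R-SMAD forms a complex with CO-SMAD"),
    Entry("Object", "REACT_7344.1", "Phospho-R-SMAD:CO-SMAD complex [cytosol]"),
    Entry("Event", "REACT_6726.1", "The phospho-R-SMAD:CO-SMAD transfers to the nucleus"),
    Entry("Object", "REACT_7382.2", "Phospho-R-SMAD:CO-SMAD complex [nucleoplasm]"),
]

def pair_events_with_objects(listing):
    """Pair each event with the object listed right after it."""
    pairs = []
    for event, obj in zip(listing[::2], listing[1::2]):
        if event.kind != "Event" or obj.kind != "Object":
            raise ValueError("listing must alternate Event, Object")
        pairs.append((event, obj))
    return pairs

for event, obj in pair_events_with_objects(LISTING):
    print(f"{event.stable_id} -> {obj.stable_id}: {obj.name}")
```

Reading the pairs in order traces the pathway step by step: phosphorylation in the cytosol, complex formation, then the same complex appearing as a distinct object in the nucleoplasm.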
Protein-Protein Interaction Database: IntAct
( http://www.ebi.ac.uk/intact/ )
Gene Ontology (GO)
( http://www.geneontology.org/ )
• Molecular Function
• Biological Process
• Cellular Component
IV. Databases of Protein Structures
• Protein Structure
– PDB: Structure Determined by X-ray Crystallography and NMR
– PDBsum: Summaries and analyses of PDB structures
– MMDB: NCBI's database of 3D structures, part of NCBI Entrez
– SWISS-MODEL Repository: Database of annotated protein 3D models
– ModBase: Annotated comparative protein structure models
• Structure Classification
– CATH: Hierarchical Classification of Protein Domain Structures
– SCOP: Familial and Structural Protein Relationships
– FSSP: Protein Fold Classification Based on Structure-Structure Alignment
PDB: Experimental 3D Structure Repository
Rat gamma-crystallin (chains A, B). Can you do a text search at PIR to find this (CRGE_RAT)?
( http://www.rcsb.org/pdb/ )
PDBsum: Pictorial Database to Provide Summary and Analysis of PDB Entries
Search ( http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/ )
3-D structure summary; 2-D structure
Protein Structural Classification (1)
CATH: Hierarchical domain classification of protein structures
( http://www.cathdb.info/latest/index.html )
Protein Structural Classification (2)
SCOP: Comprehensive description of structural and evolutionary relationships between all proteins whose structure is known.
( http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html )
SWISS-MODEL Repository
A database of annotated three-dimensional comparative protein structure models
( http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRGE_RAT&job=2 )
V. Proteomic Resources
• GELBANK ( http://gelbank.anl.gov ): 2D-gel patterns of species with completed genomes.
• SWISS-2DPAGE ( http://www.expasy.org/ch2d/ ): index of 2D-gels
• PEP ( http://cubic.bioc.columbia.edu/pep/ ): Predictions for Entire Proteomes: summarized analyses of protein sequences
• Integr8 ( http://www.ebi.ac.uk/integr8/ ): A browser for information relating to completed genomes and proteomes, based on data contained in Genome Reviews and the UniProt proteome sets
• PRIDE ( http://www.ebi.ac.uk/pride/ ): PRoteomics IDEntifications database
• Expression Profiling databases
• GPMdb ( http://gpmdb.thegpm.org/ ): Mass Spec Proteomics Databases
2D-Gel Image Databases
( http://us.expasy.org/ch2d/ )
Part of WORLD-2DPAGE: index to 2-D PAGE databases and services
( http://us.expasy.org/swiss-2dpage/ac=P02489 )
GPMdb: MS Data Search
( http://gpmdb.thegpm.org/ )
Craig, et al., J Proteome Res. 2004, 3:1234-42.
HUPO Plasma Proteome Project
http://www.ebi.ac.uk/pride/
PRIDE: centralized, standards-compliant, public data repository for proteomics data
Lab:
I. Text search / Information retrieval
1. Literature search and text mining
– Finding synonyms (BioThesaurus)
– Information extraction (e.g., protein phosphorylation sites)
2. Find the sequence for the rabbit alpha crystallin A chain
3. Find all alpha crystallin A chains classified in protein families
4. Search for crystallins that have enzyme activity
5. Find crystallins that have determined 3D structures
II. Database contents (reports)
1. Sequence & genomics databases (UniProt)
2. Protein family databases (PIRSF)
3. Database of protein functions (KEGG)
4. Databases of protein structures (PDB)
5. Proteomics databases (Swiss-2D)
Protein Examples
• Rabbit alpha crystallin A (UniProtKB: CRYAA_RABIT/P02493)
• Delta crystallin II (Argininosuccinate lyase) (UniProtKB: ARLY2_ANAPL/P24058)
• Any additional proteins of your interest for search and retrieval
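Once a sequence has been retrieved in FASTA format, the identifiers used in the protein examples can be read from the header line. The sketch below parses a UniProt-style header of the form >db|accession|entry_name description. The function name is hypothetical, and the sequence in the sample text is a placeholder, not the real rabbit alpha crystallin A sequence.

```python
def parse_uniprot_fasta(text):
    """Parse a single-record UniProt-style FASTA string into a dict."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    header = lines[0][1:]
    ids, _, description = header.partition(" ")
    db, accession, entry_name = ids.split("|")
    # Entry names have the form MNEMONIC_SPECIES, e.g. CRYAA_RABIT.
    mnemonic, _, species = entry_name.partition("_")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "mnemonic": mnemonic,
        "species": species,
        "description": description,
        "sequence": "".join(lines[1:]),
    }

# Header built from the identifiers in the lab; the residues are a placeholder.
SAMPLE = """>sp|P02493|CRYAA_RABIT Alpha-crystallin A chain
ACDEFGHIKL
MNPQRSTVWY
"""

record = parse_uniprot_fasta(SAMPLE)
print(record["accession"], record["mnemonic"], record["species"], len(record["sequence"]))
```

Splitting the entry name at the underscore separates the protein mnemonic (CRYAA) from the species code (RABIT), which helps when comparing entries for the same protein across organisms.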