NGS Bioinformatics Workshop
2.5 Meta-Analysis of Genomic Data
May 30th, 2012
IRMACS 10900
Facilitator: Richard Bruskiewich
Adjunct Professor, MBB
Acknowledgment:
Several slides courtesy of Professor Fiona Brinkman, MBB
Today’s Agenda
A brief overview of the bioinformatics for
SNP detection software
Proteins
Systems biology
Metagenomics (some resources; very brief…)
Group feedback: bioinformatics needs at SFU?
NGS-based SNP Analysis Programs
From: Nielsen et al. 2011. Nature Reviews Genetics 12:443-451
BIOINFORMATICS OF PROTEINS
From DNA to Protein to Systems
ATGGAATTC…
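As a minimal sketch of the first step (DNA to protein), the snippet below translates the example sequence from this slide using the standard genetic code. The function name `translate` and the choice to read frame 1 of the coding strand are assumptions made for illustration; they are not part of the original slides.

```python
# Standard genetic code, built from the conventional T, C, A, G base ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate coding-strand DNA in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = codon with an ambiguous base
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGAATTC"))  # MEF
```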
Amino Acid Properties – Venn Diagram
Polypeptides
[Figure: polypeptide chain starting at the amino terminus (H3N+), with four residues bearing side chains R1 to R4 linked by peptide bonds]
Ramachandran Plot
Secondary Structure (SS) Prediction
Note the major assumptions shared by all methods:
All the information needed to form secondary structure (SS) is contained in the primary sequence
The side groups of the residues determine the structure
Pattern recognition
Looks for patterns typical of common SS elements, such as amphipathic alpha-helices (e.g. a periodic
pattern of polar and non-polar residues)
Homology
Predict the SS of the central residue of a given segment from homologous segments
(neighbors)
Based on alignments of homologous residues from a protein family
Assumption: homologous proteins = similar structure
Extension: Use BLOSUM to detect similarity, or, better, use a Position Specific
Scoring Matrix (PSSM)
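The pattern-recognition idea above can be illustrated with a hydrophobic moment calculation: if hydrophobic residues recur with the period of an alpha-helix (about 100 degrees per residue), their hydropathy vectors add up instead of cancelling. This is a rough sketch, not the method used by any of the programs on these slides; the Kyte-Doolittle scale, the 100-degree default and both test peptides are assumptions chosen for illustration.

```python
import math

# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(segment: str, degrees_per_residue: float = 100.0) -> float:
    """Mean hydrophobic moment of a segment for a given helical periodicity."""
    delta = math.radians(degrees_per_residue)
    sin_sum = sum(KYTE_DOOLITTLE[aa] * math.sin(i * delta) for i, aa in enumerate(segment))
    cos_sum = sum(KYTE_DOOLITTLE[aa] * math.cos(i * delta) for i, aa in enumerate(segment))
    return math.hypot(sin_sum, cos_sum) / len(segment)


# Hypothetical peptides: Leu on one helical face and Lys on the other, versus all Leu.
amphipathic = "LKKLLKKLLKLLKKLLKK"
uniform = "L" * 18

print(round(hydrophobic_moment(amphipathic), 2))
print(round(hydrophobic_moment(uniform), 2))
```

A high moment at 100 degrees per residue suggests a possible amphipathic helix; the uniform peptide is just as hydrophobic overall but has essentially no moment because its contributions cancel around the helix.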
SS Prediction Programs
• PredictProtein-PHD (72%)
– http://www.predictprotein.org/
• PREDATOR (75%)
– http://www-db.embl-heidelberg.de/jss/servlet/de.embl.bk.wwwTools.GroupLeftEMBL/argos/predator/predator_info.html
• PSIpred (77%)
– http://bioinf.cs.ucl.ac.uk/psipred/ (PSSM generated by
PSI-BLAST, better sequence database, won CASP
competition for many years)
• Jpred (81%)
– http://www.compbio.dundee.ac.uk/jpred/
Tertiary Structure
Lactate Dehydrogenase: Mixed α/β
Immunoglobulin Fold: β
Hemoglobin B Chain: α
Tertiary Structure: Protein Folds
Holm, L. and Sander, C. (1996)
Mapping the protein universe.
Science, 273, 595-603.
Protein Folds
Folds: definition difficult and different criteria
used for different classification systems
– Normally formed around a separate hydrophobic core
Current protein fold taxonomy
– Very roughly …
– Approx. 1000-2000 different estimated folds,
depending on method of analysis – of which about half
are estimated to be known (500-1000)
– Average domain size approx. 150 aa
(50 – 250 aa approx std dev)
Protein Fold Major Classes
All alpha proteins (all α)
All beta proteins (all β)
Alpha/beta proteins (α/β)
- Parallel strands connected by helices (βαβ motifs)
Alpha plus beta proteins (α+β)
- More irregular α and β combinations
“Other”
- Often subclassified now
Protein Fold Classification
• Curated/Semi Manual Classification
– SCOP (Structural Classification Of Proteins)
http://scop.mrc-lmb.cam.ac.uk/scop/
– CATH (Class, Architecture, Topology, Homologous
superfamily)
http://www.cathdb.info/
SCOP classification
Family: clear evolutionary relationship
– Residue identities >= 30%
– OR known similar functions and structures (example: globins form a family though some are only 15% identical)
Superfamily: probable common evolutionary origin
– Low sequence identities, but structural and functional features suggest a common evolutionary origin (example: actin, the ATPase domain of heat shock proteins, and hexokinase form a superfamily)
Fold: major structural similarity
– Same major SS elements in the same arrangement with the same topological connections
– May occur by convergent evolution
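The 30% residue-identity criterion for a SCOP family can be checked on a pairwise alignment with a few lines of code. This is only a sketch: the two aligned fragments are hypothetical, and the denominator used here (columns where neither sequence has a gap) is one of several conventions in use, so reported identities can differ between tools.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are excluded from the denominator (an assumption)
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned fragments, for illustration only.
seq_a = "MKT-AYIAKQR"
seq_b = "MKTLAYLAK-R"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identical; meets the 30% family criterion: {identity >= 30.0}")
```

Identity alone does not settle family membership: as the globin example shows, SCOP also accepts similar function and structure when identity is low.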
SCOP example
CATH example
Protein Fold Classification
• Automated Classification
– DALI
http://ekhidna.biocenter.helsinki.fi/dali
– VAST (Vector Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/Structure/
VAST/vast.shtml
DALI/FSSP – Automated classification
Exhaustive all-against-all 3D structure comparison of
protein structures currently in the PDB
Domain Classification # (DC_l_m_n_p)
l: fold space attractor region
m: globular folding topology/fold type (clusters of structural neighbours in fold
space with average pairwise Z-scores, by Dali, above 2)
n: functional family (PSI-Blast, clusters of identically conserved functional
residues, E.C. numbers, Swissprot keywords)
p: sequence family (>25% identities)
VAST – Automated classification
http://www.ncbi.nlm.nih.gov/Structure/VAST/vasthelp.html
All against all BLAST comparison of NCBI’s MMDB (database of
known protein structure at NCBI, derived from the PDB)
Clustered into groups by a neighbor joining procedure, using
BLAST p-value cutoffs of C or less (where C=10e-7, 10e-40 or
10e-80, to reflect three different levels of redundancy). A fourth
level of classification is based on sequence identity
Motif and Domain Searching
• InterPro – an integration of tools (PROSITE,
PFAM, PRINTS, PRODOM)
– http://www.ebi.ac.uk/interpro/
• Expasy Tools has more…
– PATTINPROT, to search for patterns in proteins yourself, etc…
But first… Check if the analysis you want to do has
already been done!
e.g. www.ebi.ac.uk/proteome/
db.psort.org
Phylofacts
http://phylogenomics.berkeley.edu/phylofacts/
PhyloFacts includes hidden Markov models for classification of user-submitted protein sequences to protein families across the Tree of Life.
Subcellular Localization Prediction – Example of the
benefit of integrating results with a Bayesian approach
Localization Prediction - methods
Several programs analyze single features:
TargetP
Initially one program analyzed multiple features:
PSORT I (eukaryotes and prokaryotes)
Developed in 1990
PSORT I prediction method: Rule based
Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991)
Compositional Analysis
Molecular Weight
Amino Acid Frequency
Isoelectric Point
UV Absorptivity
Solubility, Size, Shape
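Two of the compositional features listed above, amino acid frequency and molecular weight, are simple enough to compute directly from the sequence. The sketch below uses approximate average residue masses and adds one water per chain; the function names and the example peptide are assumptions for illustration, and isoelectric point, UV absorptivity and solubility are deliberately left out because they need additional parameter tables.

```python
from collections import Counter

# Approximate average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def amino_acid_frequency(protein: str) -> dict:
    """Fraction of the sequence made up by each residue type present."""
    counts = Counter(protein.upper())
    return {aa: n / len(protein) for aa, n in counts.items()}


def molecular_weight(protein: str) -> float:
    """Approximate average molecular weight of an unmodified linear chain."""
    return sum(RESIDUE_MASS[aa] for aa in protein.upper()) + WATER_MASS


peptide = "MEF"  # hypothetical example peptide
print(amino_acid_frequency(peptide))
print(round(molecular_weight(peptide), 1))
```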
SYSTEMS BIOLOGY
Systems Biology
What is systems biology?
① Considers all (or many) of the proteins and genes in
the system
② Links proteins and genes using interactions and
functions
③ Uses computational models to study system
④ Provides insights into mechanisms, system
dynamics, global properties
Molecular Interaction (MI) Network
Nodes = Gene / Protein
Edge = Interaction
Possible interactions:
phosphorylation
physical binding
transcriptional regulation
others?
Cytoscape
Cytoscape supports many use cases in molecular and systems biology, genomics, and proteomics:
Load molecular and genetic interaction data sets in many formats
Project and integrate global datasets and functional annotations
Establish powerful visual mappings across these data
Perform advanced analysis and modeling using Cytoscape plugins
Visualize and analyze human-curated pathway datasets such as Reactome or KEGG.
http://www.cytoscape.org/
Cytoscape
Control tabs: Network, VizMapper, plugin tabs
Search for nodes
Visible networks
Network navigation
Change visible attributes
Attributes for highlighted nodes / edges
Cytoscape – Loading Data
Data Files:
1. Network (Simple Interaction Format)
2. Node attributes (tab-delimited)
3. Gene expression (tab-delimited)
Cytoscape – Loading Data
1. Network (Simple Interaction Format)
• Format:
gene1 interaction_type gene2
• E.g.:
C1QB pp C1R
C1R pp C2
C2 pp C4
…
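To make the Simple Interaction Format concrete, the sketch below parses the three example interactions from this slide into an edge list and an undirected adjacency map. It assumes whitespace-delimited fields and treats any extra names on a line as additional targets of the same source; `parse_sif` and `adjacency` are illustrative helper names, not Cytoscape functions.

```python
from collections import defaultdict

# The example network from the slide, one "source interaction_type target" per line.
SIF_TEXT = """\
C1QB pp C1R
C1R pp C2
C2 pp C4
"""


def parse_sif(text: str) -> list:
    """Return (source, interaction_type, target) tuples from SIF text."""
    edges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and lines that name only an isolated node
        source, interaction, *targets = fields
        edges.extend((source, interaction, target) for target in targets)
    return edges


def adjacency(edges: list) -> dict:
    """Undirected neighbour sets, ignoring the interaction type."""
    neighbours = defaultdict(set)
    for source, _, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return dict(neighbours)


edges = parse_sif(SIF_TEXT)
print(edges)
print(adjacency(edges)["C1R"])
```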
Cytoscape – Loading Data
2. Gene Attribute (tab-delimited table)
•
Maps data values to nodes
Load File
Check off “Show Text File Import Options”
Check off “Transfer first line as attribute names..”
Preview
Cytoscape – Loading Data
3. Gene expression (tab-delimited table)
• Format:
gene1 exp_cond1 exp_cond2 … sig_cond1 sig_cond2 …
• Expression value: fold-change or intensity from
microarray
• Significance value: P-value for the test that
expression differs between conditions (a smaller
value is stronger evidence of a difference).
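A file in this layout can be read with the standard `csv` module. The sketch below assumes a header row, one gene per line, and N expression columns followed by N significance columns in the same condition order; the gene values are invented for illustration and `parse_expression` is a hypothetical helper, not part of Cytoscape.

```python
import csv
import io

# Hypothetical tab-delimited content following the column layout on the slide.
EXPRESSION_TEXT = (
    "gene\texp_cond1\texp_cond2\tsig_cond1\tsig_cond2\n"
    "C1QB\t2.1\t-0.3\t0.001\t0.40\n"
    "C1R\t1.4\t0.2\t0.020\t0.75\n"
)


def parse_expression(text: str) -> dict:
    """Map gene -> {condition: (expression_value, p_value)}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    n_cond = (len(header) - 1) // 2  # assumes N expression then N significance columns
    conditions = header[1:1 + n_cond]
    table = {}
    for row in reader:
        if not row:
            continue
        values = [float(x) for x in row[1:]]
        table[row[0]] = {
            cond: (values[i], values[n_cond + i]) for i, cond in enumerate(conditions)
        }
    return table


table = parse_expression(EXPRESSION_TEXT)
print(table["C1QB"]["exp_cond1"])  # (expression value, p-value) for condition 1
```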
Cytoscape – Network Style
In “VizMapper” tab…
Double-click “Node color”
Select expression fold-change values (CMexp)
Select “Continuous Mapping” as mapping type
Can change color by double-clicking on arrows
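Conceptually, a continuous mapping interpolates a node colour from a numeric attribute. The sketch below shows one way to do that for fold-change values, using a blue-white-red gradient; the colour stops, the range of -2 to 2 and the function names are assumptions for illustration and do not reproduce Cytoscape's own implementation.

```python
BLUE = (0, 0, 255)
WHITE = (255, 255, 255)
RED = (255, 0, 0)


def lerp_color(low: tuple, high: tuple, t: float) -> tuple:
    """Linear interpolation between two RGB colours, with t in [0, 1]."""
    return tuple(round(lo + (hi - lo) * t) for lo, hi in zip(low, high))


def fold_change_to_hex(value: float, vmin: float = -2.0, vmax: float = 2.0) -> str:
    """Map a fold-change to a hex colour; requires vmin < 0 < vmax."""
    value = max(vmin, min(vmax, value))  # clamp values outside the mapped range
    if value < 0:
        rgb = lerp_color(BLUE, WHITE, (value - vmin) / (0 - vmin))
    else:
        rgb = lerp_color(WHITE, RED, value / vmax)
    return "#{:02x}{:02x}{:02x}".format(*rgb)


for fc in (-2.0, 0.0, 2.0):
    print(fc, fold_change_to_hex(fc))
```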
Systems Biology Analyses
1. Differentially-expressed subnetworks
•
jActiveModules
2. Functional enrichment
• BiNGO
Differentially-Expressed Subnetworks
Search for sub-networks that contain a significant
number of differentially expressed genes (nodes)
All genes in sub-network interact…
SO these highly differentially-expressed sub-networks
may represent a critical pathway or complex involved in
a condition of interest
Differentially-Expressed Subnetworks
jActive algorithm:
Searches for sub-networks that contain a significant
number of differentially expressed (DE) genes (or nodes)
Heuristic – won’t always find the optimum result
Z-score indicates how unlikely it is to find, by chance, a subnetwork
with a similar number of DE genes (higher = more significant).
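The subnetwork score can be sketched as follows, based on my reading of the scoring described by Ideker et al. (2002) rather than on the plugin's source code: each gene's p-value is converted to a z-score, and the z-scores of the k genes in a subnetwork are summed and divided by the square root of k. The published method additionally calibrates this aggregate against randomly sampled gene sets of the same size, which is omitted here; the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_to_z(p: float) -> float:
    """Convert a p-value to a z-score; smaller p gives a larger z."""
    p = min(max(p, 1e-300), 1 - 1e-16)  # keep p strictly inside (0, 1)
    return -_STD_NORMAL.inv_cdf(p)      # equals inv_cdf(1 - p), but stable for tiny p


def aggregate_z(p_values: list) -> float:
    """Aggregate z-score of a subnetwork: sum of gene z-scores over sqrt(k)."""
    k = len(p_values)
    return sum(p_to_z(p) for p in p_values) / math.sqrt(k)


# Hypothetical p-values for the genes of two candidate subnetworks.
strong = [0.001, 0.004, 0.0005, 0.01]
weak = [0.2, 0.6, 0.35, 0.5]
print(round(aggregate_z(strong), 2), round(aggregate_z(weak), 2))
```

Because the search over subnetworks is heuristic, repeated runs can return different high-scoring modules even with identical scoring.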
jActive - Inputs
Select expression significance (p-values)
Search from highlighted nodes
jActive - Results
Subnetworks listed here
Highlight result and click “Create Network”
Functional Enrichment
Functional Enrichment:
Also called over-representation analysis
Searches for common or related functions in a gene set
Is there a common annotation (e.g. pathway, GO term)
for a set of genes that is more frequent than you would
expect by chance?
Gene Ontology
• Controlled vocabulary describing functions, processes and cell
components
• Consistency between organisms and gene products
• GO terms linked by relationships (is-a, part-of) and have
hierarchy (parent – child)
[Figure: example GO hierarchy with the terms protein complex, organelle, mitochondrion, fatty acid beta-oxidation multienzyme complex, [other protein complexes] and [other organelles], linked by is-a and part-of relationships]
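Because GO terms are linked by is-a and part-of relationships, a gene annotated to a specific term is implicitly annotated to all of that term's ancestors. The sketch below walks such a hierarchy; the term names come from the slide, but the specific edges in `GO_PARENTS` are assumptions chosen for illustration and should not be read as the exact content of the figure or of the current ontology.

```python
# Child term -> list of (relationship, parent term). Edges are illustrative assumptions.
GO_PARENTS = {
    "mitochondrion": [("is-a", "organelle")],
    "fatty acid beta-oxidation multienzyme complex": [
        ("is-a", "protein complex"),
        ("part-of", "mitochondrion"),
    ],
}


def ancestors(term: str, parents: dict = GO_PARENTS) -> set:
    """All terms reachable from `term` by following is-a and part-of links upward."""
    seen = set()
    stack = [term]
    while stack:
        for _, parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("fatty acid beta-oxidation multienzyme complex")))
```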
Functional Enrichment
BiNGO:
Looks for GO terms that are over-represented in a set of
genes.
Displays the results in two ways
A table with p-values
A graph showing relationships between terms
Uses the hypergeometric test to statistically test for over-representation of each GO term.
Performs multiple hypothesis correction (since we are testing
multiple GO terms for over-representation).
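The statistics behind this can be sketched in a few lines. For one GO term, the hypergeometric test gives the probability of seeing at least k annotated genes in a set of n, when K of the N genes in the reference set carry the term. The Benjamini-Hochberg procedure is shown as one common choice of multiple hypothesis correction; which correction to apply is a setting in the tool, so treat that choice, the function names and the example numbers as assumptions.

```python
from math import comb


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(p_values: list) -> list:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical example: 5 of 5 selected genes carry a term held by 5 of 10 genes.
print(hypergeom_pvalue(k=5, n=5, K=5, N=10))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```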
BiNGO - Inputs
Fill in Name
Lower significance level
Select “Custom” and then
load go.annot file
Click Start BiNGO
BiNGO - Results
General GO Terms
Significance
Specific GO Terms
EGAN: Exploratory Gene Association Networks
http://akt.ucsf.edu/EGAN/
METAGENOMICS
What is Metagenomics?
The culture-independent isolation and characterization
of DNA from uncultured microorganism communities
Nice reading list on the topic:
http://www.cbcb.umd.edu/confcour/CMSC828Gmaterials/reading-list.html
See also: Torsten Thomas, Jack Gilbert and Folker Meyer. 2012.
Metagenomics - a guide from sampling to data analysis.
Microb. Inform. Exp. doi:10.1186/2042-5783-2-3
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3351745/
I will just mention a few relevant bioinformatics tools
here (no specific endorsements implied).
MG-RAST server
http://metagenomics.nmpdr.org/
Meyer, F. et al. 2008. The metagenomics RAST server –
a public resource for the automatic phylogenetic and
functional analysis of metagenomes. BMC
Bioinformatics. 9:386 doi:10.1186/1471-2105-9-386
MEGAN - MEtaGenome ANalyzer
http://ab.inf.uni-tuebingen.de/software/megan/
Huson DH et al. 2007. MEGAN analysis of metagenomic data. Genome Res. 17: 377-386