Status and plans, human vs. mouse alignments

Comparative Genomics
Ross Hardison, Penn State University
Major collaborators: Webb Miller, Francesca Chiaromonte, Laura
Elnitski, David King, et al., PSU
James Taylor: Courant Institute, New York University
David Haussler, Jim Kent, Univ. California at Santa Cruz
Ivan Ovcharenko, Lawrence Livermore National Lab
PSU Nov. 28, 2006
Major goals of comparative genomics
• Identify all DNA sequences in a genome that are
functional
– Selection to preserve function
– Adaptive selection
• Determine the biological role of each functional sequence
• Elucidate the evolutionary history of each type of
sequence
• Provide bioinformatic tools so that anyone can easily
incorporate insights from comparative genomics into their
research
Three major classes of evolution
• Neutral evolution
– Acts on DNA with no function
– Genetic drift allows some random mutations to become fixed in a
population
• Purifying (negative) selection
– Acts on DNA with a conserved function
– Signature: Rate of change is significantly slower than that of neutral DNA
– Sequences with a common function in the species examined are under
purifying (negative) selection
• Darwinian (positive) selection
– Acts on DNA in which changes benefit an organism
– Signature: Rate of change is significantly faster than that of neutral DNA
Ideal case for interpretation
[Figure: similarity plotted against position along chromosome, relative to the level for neutral DNA]
Negative (purifying) selection: Exonic segments coding for regions of a polypeptide with common function in two species.
Positive (adaptive) selection: Exonic segments coding for regions of a polypeptide in which change is beneficial to one of the two species.
Taxonomic distribution of homologs of mouse proteins
Waterston et al.
Conservation in different parts of genes
Average percent identity (black) or percent aligned (blue) for 10,000
orthologous genes
Waterston et al, Mouse Genome, Nature
Levels of conservation (Human vs Mouse) in
different types of proteins
Black: all orthologous proteins (Hum-mouse)
12,845 1:1 gene pairs
Red: proteins with recognized domains
Gray: proteins without recognized domains
Waterston et al. Nature 2002
Black: Nuclear proteins
Red: Cytoplasmic proteins
Gray: Extracellular proteins; positive,
diversifying selection
KA= rate of nonsynonymous substitutions
KS= rate of synonymous substitutions
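The KA and KS definitions above are usually compared as a ratio. A minimal sketch, assuming the common reading that KA/KS well below 1 suggests purifying selection and well above 1 suggests positive selection; the function name and the tolerance are illustrative choices, not values from the slides, and a real analysis would use a statistical test rather than a fixed cutoff.

```python
def classify_selection(ka: float, ks: float, tolerance: float = 0.1) -> str:
    """Label a gene pair by its KA/KS ratio.

    ka: rate of nonsynonymous substitutions
    ks: rate of synonymous substitutions
    tolerance: illustrative band around 1.0 treated as near-neutral (assumption)
    """
    if ks <= 0:
        raise ValueError("KS must be positive to form the ratio")
    ratio = ka / ks
    if ratio < 1 - tolerance:
        return "purifying"
    if ratio > 1 + tolerance:
        return "positive"
    return "near-neutral"


if __name__ == "__main__":
    # Made-up rates, for illustration only.
    print(classify_selection(0.05, 0.5))
    print(classify_selection(0.9, 0.6))
```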
Rat-specific gene expansions
• Genes that have expanded in number in rats are enriched in
– Immune function/ antigen recognition
• immunoglobulins, T-cell receptor alpha
– Detoxification
• cytochrome P450
– Reproduction
• alpha2u-globulin
– Olfaction and odorant detection
• Olfactory receptors
• Also are rapidly evolving
• Segmental duplications are enriched for the same genes
Rat Genome SPC 2004 Nature
Adaptive remodeling of gene clusters
Figure 13 Adaptive remodeling of genomes and genes. a, Orthologous regions of rat, human and mouse
genomes encoding pheromone-carrier proteins of the lipocalin family (a2u-globulins in rat and major
urinary proteins in mouse) shown in brown. Zfp37-like zinc finger genes are shown in blue. Filled arrows
represent likely genes, whereas striped arrows represent likely pseudogenes. Gene expansions are
bracketed. Arrowhead orientation represents transcriptional direction. Flanking genes 1 and 2 are TSCOT
and CTR1, respectively.
Rat Genome SPC 2004 Nature
DCODE.org Comparative Genomics: Align your
own sequences
blastZ
multiZ and TBA
zPicture interface for aligning sequences
Automated extraction of sequence and annotation
Pre-computed alignment of genomes
• blastZ for pairwise alignments
• multiZ for multiple alignment
– Human, chimp, mouse, rat, chicken, dog
– Also multiple fly, worm, yeast genomes
– Organize local alignments: chains and nets
Webb Miller
• All against all comparisons
– High sensitivity and specificity
• Computer cluster at UC Santa Cruz
– 1024 cpus Pentium III
– Job takes about half a day
Jim Kent
• Results available at
– UCSC Genome Browser http://genome.ucsc.edu
– Galaxy server: http://www.bx.psu.edu
Schwartz et al., 2003, blastZ, Genome Research
Blanchette et al., 2004, TBA and multiZ, Genome Research
David Haussler
Genome-wide local alignment chains
Human: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 10 Mb.
Human
Mouse
blastZ: Each segment of human is given the opportunity to align with all mouse sequences.
Run blastZ in parallel for all human segments. Collect all local alignments above threshold.
Organize local alignments into a set of chains based on position in assembly and orientation.
Level 1 chain
Level 2 chain
Net
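The segmentation step (mask repeats, break the assembly into 10 Mb segments, run blastZ on each in parallel) can be sketched as below. This covers only the splitting; the chromosome names and lengths are toy values, and the function name is hypothetical. For scale, 2.9 Gb divided by 10 Mb is 290 segments, in line with the roughly 300 quoted above.

```python
def split_into_segments(chrom_lengths, segment_size=10_000_000):
    """Split each chromosome into half-open (chrom, start, end) segments.

    chrom_lengths: dict mapping chromosome name to length in bp
    segment_size: segment length in bp (10 Mb, as on the slide)
    """
    segments = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, segment_size):
            end = min(start + segment_size, length)
            segments.append((chrom, start, end))
    return segments


if __name__ == "__main__":
    # Toy lengths, not real chromosome sizes.
    toy = {"chrA": 25_000_000, "chrB": 10_000_000}
    for seg in split_into_segments(toy):
        print(seg)
```

Each segment could then be handed to an independent alignment job, which is what makes the all-against-all comparison practical on a cluster.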
Comparative genomics to find functional sequences
[Figure: genome sizes in million base pairs (Mbp): human 2,900; further values of 2,400, 2,500 and 1,200 appear alongside the mouse and rat labels. Find common sequences with blastZ and multiZ: all mammals 1000 Mbp; also birds: 72 Mb. Identify functional sequences: ~145 Mbp]
Papers in Nature from mouse and rat and chicken genome consortia, 2002, 2004
Use measures of alignment quality to discriminate
functional from nonfunctional DNA
• Compute a conservation score adjusted for the local
neutral rate
• Score S for a 50 bp region R is the normalized fraction of
aligned bases that are identical
– Subtract mean for aligned ancestral repeats in the
surrounding region
– Divide by standard deviation
p = fraction of aligned sites in R that are identical between human and mouse
m = average fraction of aligned sites that are identical in aligned ancestral repeats in the surrounding region
n = number of aligned sites in R
Waterston et al., Nature
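The formula image for S did not survive in this transcript. A sketch consistent with the definitions of p, m and n above, assuming the standard deviation is the binomial one, sqrt(m(1 - m)/n), so that S = sqrt(n)(p - m)/sqrt(m(1 - m)). That choice is an assumption here and should be checked against Waterston et al.

```python
import math


def conservation_score(p: float, m: float, n: int) -> float:
    """Conservation score S for one window.

    p: fraction of aligned sites in the window that are identical
    m: mean identity in aligned ancestral repeats nearby (local neutral level)
    n: number of aligned sites in the window
    Assumes a binomial standard deviation, sqrt(m * (1 - m) / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < m < 1.0:
        raise ValueError("m must lie strictly between 0 and 1")
    standard_deviation = math.sqrt(m * (1.0 - m) / n)
    return (p - m) / standard_deviation


if __name__ == "__main__":
    # Illustrative numbers: a 50 bp window at 80% identity against a
    # local neutral identity of 60%.
    print(round(conservation_score(0.8, 0.6, 50), 3))
```

A window that matches the local neutral identity scores 0; windows more similar than neighboring ancestral repeats score above 0.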
Decomposition of conservation score into
neutral and likely-selected portions
Neutral DNA (ARs)
All DNA
Likely selected DNA
At least 5-6%
S is the conservation score adjusted for variation in the local substitution rate.
The frequency of the S score for all 50bp windows in the human genome is shown.
From the distribution of S scores in ancestral repeats (mostly neutral DNA), can
compute a probability that a given alignment could result from locally adjusted
neutral rate.
Waterston et al., Nature
DNA sequences of mammalian genomes
• Human: 2.9 billion bp, “finished”
– High quality, comprehensive sequence, very few gaps
• Mouse, rat, dog, opossum, chicken, frog, etc.
• About 40% of the human genome aligns with mouse
– This is conserved, but not all is under selection.
• About 5-6% of the human genome is under purifying selection since the
rodent-primate divergence
• About 1.2% codes for protein
• The 4 to 5% of the human genome that is under selection but does not
code for protein should have:
– Regulatory sequences
– Non-protein coding genes (UTRs and noncoding RNAs)
– Other important sequences
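The percentages above can be checked with simple arithmetic. The sketch below uses only the figures quoted on the slides (2.9 billion bp, 5-6% under selection, 1.2% coding); the function names are illustrative.

```python
GENOME_MBP = 2900  # human assembly size in Mbp, from the slides


def percent_to_mbp(percent, genome_mbp=GENOME_MBP):
    """Convert a percentage of the genome into million base pairs."""
    return genome_mbp * percent / 100


def noncoding_selected_range(selected=(5.0, 6.0), coding=1.2):
    """Percent of the genome under selection that does not code for protein."""
    return tuple(round(s - coding, 1) for s in selected)


if __name__ == "__main__":
    print(percent_to_mbp(5))           # Mbp under selection at the 5% bound
    print(noncoding_selected_range())  # percent range, selected minus coding
```

The subtraction gives 3.8 to 4.8%, which the slide rounds to 4 to 5%, and 5% of 2,900 Mbp is 145 Mbp, matching the ~145 Mbp of functional sequence quoted earlier.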
Conservation score S in different types of regions
Red: Ancestral repeats (mostly neutral)
Blue: First class in label
Green: Second class in label
Waterston et al., Nature
Leverage many species to improve accuracy and resolution of signals for constraint
ENCODE multi-species alignment group
Margulies et al., 2007
Coverage of human by alignments with other
vertebrates ranges from 1% to 91%
[Figure: tree of vertebrates aligned to human, with divergence times in millions of years: 5.4, 91, 92, 173, 220, 310, 360, 450; one branch labeled 5%]
Distinctive divergence rates for different types
of functional DNA sequences
[Plot: percent of regions (or of the human genome) not in alignments, 0 to 100, versus time of divergence from common ancestor to human, 0 to 500 Myr ago. Series: Genome, Coding exons, Ultraconserved (HM), Log. (Genome)]
Large divergence in cis-regulatory modules
from opossum to platypus
cis-Regulatory modules conserved from human
to fish
[Figure: tree with divergence times of 91, 173, 310 and 450 million years]
• About 20% of CRMs
• Tend to regulate genes whose products control transcription and development
cis-Regulatory modules conserved in eutherian
mammals and marsupials
[Figure: tree with divergence times of 91, 173, 310 and 450 million years]
• Human-marsupial alignments capture about 60% of CRMs
– Tend to occur close to genes involved in aminoglycan synthesis, organelle biosynthesis
• Human-mouse alignments capture
about 87% of CRMs
– Tend to occur close to genes
involved in apoptosis, steroid
hormone receptors, etc.
• Within aligned noncoding DNA of
eutherians, need to distinguish
constrained DNA (purifying
selection) from neutral DNA.
Score multi-species alignments for features
associated with function
• Multiple alignment scores
– Margulies et al. (2003) Genome Research 13: 2505-2518
– Binomial, parsimony
• PhastCons
– Siepel et al. (2005) Genome Research 15:1034-1050
– Phylogenetic Hidden Markov Model
– Posterior probability that a site is among the most highly conserved
sites
• GERP
– Cooper et al. (2005) Genome Research 15:901-913
– Genomic Evolutionary Rate Profiling
– Measures constraint as rejected substitutions = nucleotide
substitution deficits
phastCons: Likelihood of being constrained
• Phylogenetic Hidden
Markov Model
• Posterior probability that
a site is among the
most highly conserved
sites
• Allows for variation in
rates along lineages
c is “conserved” (constrained)
n is “nonconserved” (aligns but
is not clearly subject to
purifying selection)
Siepel et al. (2005) Genome
Research 15:1034-1050
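The idea of a posterior probability over "conserved" (c) and "nonconserved" (n) states can be illustrated with a toy two-state HMM and the forward-backward algorithm. This is not phastCons itself: phastCons scores each column with phylogenetic models, whereas this sketch reduces each column to "identical across species or not", and every parameter value below is an illustrative assumption.

```python
def posterior_conserved(columns, p_match_c=0.95, p_match_n=0.6,
                        stay_c=0.9, stay_n=0.95, prior_c=0.3):
    """Return P(state = conserved | all columns) for each alignment column.

    columns: list of bools, True when the column is identical across species.
    All probabilities are toy values chosen for illustration.
    """
    states = ("c", "n")
    init = {"c": prior_c, "n": 1.0 - prior_c}
    trans = {"c": {"c": stay_c, "n": 1.0 - stay_c},
             "n": {"c": 1.0 - stay_n, "n": stay_n}}

    def emit(state, match):
        p = p_match_c if state == "c" else p_match_n
        return p if match else 1.0 - p

    # Forward pass, rescaled at each column to avoid underflow.
    fwd, prev = [], None
    for i, obs in enumerate(columns):
        cur = {}
        for s in states:
            prior = init[s] if i == 0 else sum(prev[r] * trans[r][s] for r in states)
            cur[s] = prior * emit(s, obs)
        total = sum(cur.values())
        prev = {s: v / total for s, v in cur.items()}
        fwd.append(prev)

    # Backward pass, also rescaled.
    bwd = [None] * len(columns)
    nxt = {s: 1.0 for s in states}
    for i in range(len(columns) - 1, -1, -1):
        bwd[i] = nxt
        if i > 0:
            cur = {r: sum(trans[r][s] * emit(s, columns[i]) * nxt[s] for s in states)
                   for r in states}
            total = sum(cur.values())
            nxt = {r: v / total for r, v in cur.items()}

    # Combine and normalize per column.
    posteriors = []
    for f, b in zip(fwd, bwd):
        c = f["c"] * b["c"]
        posteriors.append(c / (c + f["n"] * b["n"]))
    return posteriors


if __name__ == "__main__":
    cols = [True] * 30 + [False] * 10
    post = posterior_conserved(cols)
    print(round(post[10], 3), round(post[35], 3))
```

Because the states persist from column to column, a run of identical columns raises the posterior for its neighbors as well, which is the smoothing behavior that makes an HMM useful here.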
Larger genomes have more of the constrained DNA in noncoding regions
Siepel et al. 2005,
Genome Research
Some constrained introns are editing complementary regions: GRIA2
Siepel et al. 2005,
Genome Research
3’UTRs can be highly constrained over large
distances
Siepel et al. 2005,
Genome Research
3’ UTRs contain RNA processing signals, miRNA targets,
other regions subject to constraints
Ultraconserved elements = UCEs
• At least 200 bp with no interspecies differences
– Bejerano et al. (2004) Science 304:1321-1325
– 481 UCEs with no changes among human, mouse and rat
– Also conserved out to dog and chicken
– More highly conserved than vast majority of coding regions
• Most do not code for protein
– Only 111 out of 481 overlap with protein-coding exons
– Some are developmental enhancers.
– Nonexonic UCEs tend to cluster in introns or in vicinity of genes
encoding transcription factors regulating development
– 88 are more than 100 kb away from an annotated gene; may be
distal enhancers
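The UCE definition (at least 200 bp with no interspecies differences) amounts to scanning an alignment for long runs of identical columns. A minimal sketch, assuming gapped, equal-length, uppercase aligned sequences; the function name is hypothetical and this is not the procedure of Bejerano et al.

```python
def identical_runs(aligned_seqs, min_len=200):
    """Find runs of columns that are identical, and ungapped, in every sequence.

    aligned_seqs: list of equal-length aligned strings ('-' marks a gap)
    min_len: minimum run length in columns (200 in the UCE definition)
    Returns half-open (start, end) column intervals.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    runs, start = [], None
    for i in range(length + 1):  # one extra step closes a run at the end
        same = (i < length
                and aligned_seqs[0][i] != "-"
                and all(s[i] == aligned_seqs[0][i] for s in aligned_seqs))
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


if __name__ == "__main__":
    # Toy alignment with a short threshold so the example stays readable.
    toy = ["ACGTACGTTT", "ACGTACCTTT", "ACGTACGTTT"]
    print(identical_runs(toy, min_len=4))
```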
GO category analysis of UCE-associated genes
• Genes in which a
coding exon overlaps a
UCE
– 91 Type I genes
– RNA binding and
modification
– Transcriptional
regulation
• Genes in the vicinity of
a UCE (no overlap of
coding exons)
– 211 Type II genes
– Transcriptional
regulation
– Developmental
regulators
Bejerano et al. (2004) Science
Intronic UCE in SOX6 enhances expression
in melanocytes in transgenic mice
UCEs
Tested UCEs
Pennacchio et al.,
http://enhancer.lbl.gov/
The most stringently conserved
sequences in eukaryotes are mysteries
• Yeast MATa2 locus
– Most conserved region in 4 species of yeast
– 100% identity over 357 bp
– Role is not clear
• Vertebrate UCEs
– More constrained than exons in vertebrates
– Noncoding UCEs are not detectable outside chordates, whereas coding
regions are
• Were they fast-evolving prior to vertebrate/invertebrate divergence?
• Are they chordate innovations? Where did they come from?
– Role of many is not clear; need for 100% identity over 200 bp is not
obvious for any
• What molecular process requires strict invariance for at least 200 nucleotides?
• One possibility: Multiple, overlapping functions
Use measures of alignment texture to
discriminate functional classes of DNA
• Mouse Cons track (L-scores) are measures of alignment quality.
– Match > Mismatch > Gap
• Alternatively, can analyze the patterns within alignments
(texture) to try to distinguish among functional classes
– Regulatory regions vs bulk DNA
– Patterns are short strings of matches, mismatches, gaps
– Find frequencies for each string using training sets
• 93 known regulatory regions
• 200 ancestral repeats (neutral)
• Regulatory potential genome-wide
– Elnitski et al. (2003) Genome Research 13: 64-72.
Evaluate patterns in alignments to discriminate
functional classes of DNA
1. Collapse the alignment to a small alphabet, e.g.
Match involving G or C = S
Match involving A or T = W
Transition = I
Transversion = V
Gap = G
Alignment
seq1 G T A C C T A C T A C G C A
seq2 G T G T C G - - A G C C C A
Collapsed alphabet S W I I S V G G V I S V S W
Example ratios shown on the slide: (5/10)/(1/6) = 3, (1/4)/(2/8) = 1, (1/4)/(3/6) = 0.5
2. Is a pattern, e.g., SWIIS followed by V found more
frequently in alignments of
known cis-regulatory modules (set of 93)
or neutral DNA (200 ancestral repeats)?
3. The regulatory potential for any alignment is a log-likelihood estimate of the extent to which its patterns are more like those in regulatory regions than in neutral DNA.
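Step 1 can be sketched directly from the symbol definitions above (S, W, I, V, G). The function names are illustrative, and the example reproduces the seq1/seq2 alignment from the slide.

```python
PURINES = {"A", "G"}


def collapse_column(a: str, b: str) -> str:
    """Map one aligned column of two sequences to the collapsed alphabet."""
    if a == "-" or b == "-":
        return "G"                        # gap
    if a == b:
        return "S" if a in "GC" else "W"  # match involving G/C or A/T
    same_class = (a in PURINES) == (b in PURINES)
    return "I" if same_class else "V"     # transition or transversion


def collapse_alignment(seq1: str, seq2: str) -> str:
    """Collapse a pairwise alignment column by column."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return "".join(collapse_column(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(collapse_alignment("GTACCTACTACGCA", "GTGTCG--AGCCCA"))
    # → SWIISVGGVISVSW
```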
Regulatory potential (RP) to distinguish
functional classes
Good performance of regulatory potential (RP)
for finding cis-regulatory modules
Taylor et al. (2006) Genome Research, in press (October or November)
Genes Co-expressed in Late Erythroid Maturation
G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1.
Can rescue by expressing an estrogen-responsive form of GATA-1
Rylski et al., Mol Cell Biol. 2003
Predicted cis-Regulatory Modules (preCRMs)
Around Erythroid Genes
Conservation of predicted binding sites for
transcription factors
Binding site for GATA-1
See poster from Yuepin Zhou, Yong Cheng, Hao Wang et al.
preCRMs with conserved consensus GATA-1 BS
tend to be active on transfected plasmids
preCRMs with conserved consensus GATA-1
BS tend to be active after integration into a
chromosome
Examples of validated preCRMs
Correlation of Enhancer Activity with RP Score
Validation status for 99 tested fragments
preCRMs with High RP and Conserved
Consensus GATA-1 Tend To Be Validated
Conclusions
• Multispecies alignments can be used to predict whether a
sequence is functional (signature of purifying selection).
• Patterns in alignments and conservation of some TFBSs
can be used to predict some cis-regulatory elements.
• The predictions of cis-regulatory elements for erythroid
genes are validated at a good rate.
• Databases and servers such as the UCSC Table Browser,
Galaxy, and others provide access to these data.
– http://genome.ucsc.edu/
– http://www.bx.psu.edu/
Many thanks …
Wet Lab: Yuepin Zhou, Hao Wang, Ying
Zhang, Yong Cheng, David King
Alignments, chains, nets, browsers, ideas, …
Webb Miller, Jim Kent, David Haussler
PSU Database crew: Belinda Giardine,
Cathy Riemer, Yi Zhang, Anton Nekrutenko
RP scores and other bioinformatic input:
Francesca Chiaromonte, James Taylor, Shan Yang,
Diana Kolbe, Laura Elnitski
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
Regulatory Potential (RP) features
Computation of RP score using 2-way, 5-symbol (MAT, MGC, V, T, GAP), 5th order Markov model
[Figure: human (Hum) and mouse (Mus) alignment collapsed into the 5-symbol alphabet, with tables of 5-symbol contexts such as MAT-MAT-MAT-MAT-MAT and MAT-T-T-MGC-V for each training set]
Positive training set: 93 known CRMs
Negative training set: 200 ancestral repeats
A score matrix is formed by taking the log-odds ratio
To measure how much more likely an alignment is regulatory as compared with neutral, the log-odds ratios for each symbol of the alignments are summed over the entire length and normalized for the length of the alignments
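The log-odds computation described on this slide can be sketched as two Markov models, one per training set, scored against each other. This is a simplified illustration, not the published RP implementation: the pseudocount, the function names and the toy training data are assumptions, and the example uses order 2 rather than 5 to stay small.

```python
import math
from collections import Counter

SYMBOLS = ("MAT", "MGC", "V", "T", "GAP")


def train(sequences, order, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) from symbol sequences."""
    counts, context_totals = Counter(), Counter()
    for seq in sequences:
        for i in range(order, len(seq)):
            context = tuple(seq[i - order:i])
            counts[(context, seq[i])] += 1
            context_totals[context] += 1

    def prob(context, symbol):
        # The pseudocount keeps unseen transitions from having zero probability.
        return ((counts[(context, symbol)] + pseudocount)
                / (context_totals[context] + pseudocount * len(SYMBOLS)))

    return prob


def rp_score(seq, pos_prob, neg_prob, order):
    """Sum of per-symbol log-odds ratios, normalized by number of scored symbols."""
    total, scored = 0.0, 0
    for i in range(order, len(seq)):
        context = tuple(seq[i - order:i])
        total += math.log(pos_prob(context, seq[i]) / neg_prob(context, seq[i]))
        scored += 1
    return total / scored if scored else 0.0


if __name__ == "__main__":
    # Toy training sets standing in for known CRMs and ancestral repeats.
    positive = [["MGC"] * 20]
    negative = [["V", "T"] * 10]
    pos_prob, neg_prob = train(positive, 2), train(negative, 2)
    print(rp_score(["MGC"] * 10, pos_prob, neg_prob, 2) > 0)
```

Positive scores mean the alignment's patterns look more like the positive (regulatory) training set, and negative scores more like the neutral set.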
Finding and analyzing genome data
NCBI Entrez
Ensembl/BioMart
UCSC Table Browser
Galaxy
http://www.ncbi.nlm.nih.gov
http://www.ensembl.org
http://genome.ucsc.edu
http://www.bx.psu.edu
Browsers vs Data Retrieval
• Browsers are designed to show selected information on one locus or
region at a time.
– UCSC Genome Browser
– Ensembl
• Run on top of databases that record vast amounts of information.
• Sometimes need to retrieve one type of information for many
genomics intervals or genome-wide.
• Access this by querying on the tables in the databases or “data marts”
– UCSC Table Browser
– EnsMart or BioMart
– Entrez at NCBI
Retrieve all the protein-coding exons in humans
Galaxy: Data retrieval and analysis
• Data can be retrieved from multiple
external sources, or uploaded from
user’s computer
• Hundreds of computational tools
– Data editing
– File conversion
– Operations: union, intersection,
complement …
– Compute functions on data
– Statistics
– EMBOSS tools for sequence
analysis
– PHYLIP tools for molecular
evolutionary analysis
– PAML to compute substitutions per
site
• Add your own tools
Galaxy via Table Browser: coding exons
Retrieve human mutations
Find exons with human mutations: Intersection
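The intersection step can be pictured as an interval query: keep each exon that contains at least one mutation position. The sketch below is not Galaxy's implementation; it assumes half-open, 0-based exon intervals, and the record layout and names are hypothetical.

```python
import bisect


def exons_with_mutations(exons, mutations):
    """Return exons that contain at least one mutation.

    exons: list of (chrom, start, end, name), half-open 0-based intervals
    mutations: list of (chrom, position), 0-based
    """
    positions_by_chrom = {}
    for chrom, pos in mutations:
        positions_by_chrom.setdefault(chrom, []).append(pos)
    for positions in positions_by_chrom.values():
        positions.sort()

    hits = []
    for exon in exons:
        chrom, start, end = exon[0], exon[1], exon[2]
        positions = positions_by_chrom.get(chrom, [])
        # First mutation at or after the exon start; a hit if it is before the end.
        i = bisect.bisect_left(positions, start)
        if i < len(positions) and positions[i] < end:
            hits.append(exon)
    return hits


if __name__ == "__main__":
    # Toy records, not real annotations.
    exons = [("chr1", 100, 200, "e1"), ("chr1", 300, 400, "e2"),
             ("chr2", 100, 200, "e3")]
    mutations = [("chr1", 150), ("chr1", 400), ("chr2", 250)]
    print(exons_with_mutations(exons, mutations))
```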
Compute length using “expression”
Statistics on exon lengths
Plot a histogram of exon lengths
Distribution of (human mutation) exon lengths
What is that really long exon? Sort by length
SACS has an 11kb exon
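The remaining steps (compute length with an expression, summarize, sort to find the longest exon) reduce to a few lines. The sketch below uses toy records with hypothetical names, not the real coordinates of SACS or any other gene, and assumes half-open intervals so that length is end minus start.

```python
import statistics


def exon_lengths(exons):
    """Return (name, length) pairs; length = end - start for half-open intervals."""
    return [(name, end - start) for _chrom, start, end, name in exons]


def summarize(lengths):
    """Basic statistics on a list of (name, length) pairs."""
    values = [length for _name, length in lengths]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
    }


def sort_by_length(lengths):
    """Longest exon first."""
    return sorted(lengths, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_exons = [("chr1", 100, 250, "exonA"), ("chr1", 300, 420, "exonB"),
                 ("chr2", 0, 11000, "exonC")]
    lengths = exon_lengths(toy_exons)
    print(summarize(lengths))
    print(sort_by_length(lengths)[0])
```

Sorting puts the outlier at the top, the same way the long exon is found on the slides.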