CNV - University of Pittsburgh

Copy Number Variation
Eleanor Feingold
University of Pittsburgh
March 2012
What do we mean by “copy number variation?”
[Figure labels: kb - Mb (gene or gene region); GCTCATATATATTTG]
Copy number variation in a gene or gene region
“normal”
duplication of one gene
duplication of several genes
duplication of part of a gene
deletion
Classical copy number study types
Cancer genetics
What
Find chromosomal segments (usually large ones) that are duplicated and/or deleted in tumor cell lines
Why
Learn something about cancer biology, or implications for treatment and prognosis
Clinical pediatrics
What
Detect inherited or de novo deletions in individuals
Why
“Diagnose” birth defects
And now:
Genetic association studies for CNVs
1) Collect cases and controls.
2) “Genotype” everyone at a CNV.
[Figure: example individuals labeled with their copy number at the CNV: 2, 0, 5, 0, 4, 1]
3) Test genotype/phenotype association.
[Table: counts of cases and controls by copy number (0, 1, 2+); values on the slide: 65, 133, 202, 81, 316, 16]
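One simple way to carry out step 3 is a Pearson chi-square test of independence on the table of case/control status by copy number class. The sketch below is only an illustration: the function names and the counts are hypothetical (they are not the slide's values), and it assumes exactly three copy number classes (0, 1, 2+), so the test has 2 degrees of freedom and the p-value has the closed form exp(-x/2).

```python
import math


def chi_square_2xk(cases, controls):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table.

    `cases` and `controls` are lists of counts, one entry per copy number class.
    """
    if len(cases) != len(controls):
        raise ValueError("cases and controls need the same number of classes")
    col_totals = [a + b for a, b in zip(cases, controls)]
    if any(total == 0 for total in col_totals):
        raise ValueError("every copy number class needs at least one sample")
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for row, row_total in ((cases, n_cases), (controls, n_controls)):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat, len(cases) - 1


def chi2_sf_df2(stat):
    """Upper-tail p-value of a chi-square statistic with exactly 2 df."""
    return math.exp(-stat / 2.0)


# Hypothetical counts for copy number classes 0, 1, 2+ (illustration only).
example_cases = [30, 50, 20]
example_controls = [20, 50, 30]
stat, df = chi_square_2xk(example_cases, example_controls)
print(stat, df, chi2_sf_df2(stat))
```

With more than three classes the closed-form p-value no longer applies, and a general chi-square distribution function would be needed instead.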
How do we assay copy number variation?
Generation 1 - Array CGH
What
Microarray of clones (e.g. BACs), usually on a glass slide.
Competitive hybridization of test and reference samples.
Measure fluorescence ratio clone by clone.
Limitations
Large clones.
Sparse coverage.
High noise due to spotting process.
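The clone-by-clone fluorescence ratio is commonly summarized on a log2 scale. A minimal sketch, assuming one test and one reference intensity per clone (the function name and the intensities are hypothetical):

```python
import math


def log2_ratios(test_intensity, reference_intensity):
    """Return log2(test / reference) for each clone, or None for unusable spots."""
    if len(test_intensity) != len(reference_intensity):
        raise ValueError("need one test and one reference value per clone")
    ratios = []
    for test, reference in zip(test_intensity, reference_intensity):
        if test <= 0 or reference <= 0:
            ratios.append(None)  # a non-positive intensity has no log ratio
        else:
            ratios.append(math.log2(test / reference))
    return ratios


# Hypothetical intensities for three clones (illustration only).
print(log2_ratios([100.0, 150.0, 50.0], [100.0, 100.0, 100.0]))
```

In an idealized, noise-free experiment with a two-copy reference, a single-copy gain (3 copies vs. 2) would give log2(1.5) and a single-copy loss (1 copy vs. 2) would give -1. Real array data are typically noisier and compressed toward 0.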
Generation 2 - SNP chips
What
High-throughput SNP genotyping platforms (e.g. Affymetrix, Illumina)
Advantage
Hundreds of thousands of points of info.
Disadvantages
Technology was never intended for measuring copy number.
SNPs on chip selected to avoid CNV regions by design.
Generation 3 - SNP chips with CNV markers (Affy 6.0, Illumina 1M)
Advantages
SNPs in known CNV regions are now included.
Also have “non-polymorphic SNPs” (SNs?)
Illumina
1M markers in 10K regions of various types and sizes
Affymetrix
200K probes in 5K known large CNV regions
700K probes “evenly spaced along the genome”
Generation 4 (Illumina 2.5M, 5M)
Changes
Got rid of the non-polymorphic markers.
Special coverage of CNV regions???
Are these better or worse for CNVs than the previous generation?
What data do these technologies give us, and how do we use it?
Standard genotyping
Genotype information is in the angle (relative intensity of the two alleles).
Copy number information is in the distance from the origin (total intensity).
[Figure: allele-intensity scatter plot with genotype clusters BB, AB, AA]
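The two quantities can be computed directly from the pair of allele intensities. The sketch below assumes one common convention: the angle is rescaled to run from 0 (pure A) to 1 (pure B), and total intensity is taken as the sum of the two allele intensities. The function name is hypothetical, and the Euclidean distance from the origin would be an equally reasonable choice for the second quantity.

```python
import math


def polar_coordinates(a_intensity, b_intensity):
    """Convert (A, B) allele intensities to (theta, r).

    theta: angle rescaled to [0, 1] for non-negative intensities;
           0 is pure A, 0.5 is balanced A and B, 1 is pure B.
    r:     total intensity, taken here as A + B (an assumed convention).
    """
    theta = (2.0 / math.pi) * math.atan2(b_intensity, a_intensity)
    r = a_intensity + b_intensity
    return theta, r


# Hypothetical intensities for an AA-like, AB-like and BB-like SNP.
for a, b in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]:
    print(polar_coordinates(a, b))
```

Genotype calling then clusters on theta, while copy number analysis works with r (usually after comparing it to the value expected for that SNP).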
In theory
[Figure: idealized clusters by copy number: null (0 copies); A, B (1 copy); AA, AB, BB (2 copies); AAA, AAB, ABB, BBB (3 copies)]
But when you look at the data …
[Figure: all SNPs on chromosome 21, disomic vs. trisomic (Down Syndrome) samples. Axes: total intensity (disomic), total intensity (trisomic). Cluster labels: AAA and AA; AAB; AB; ABB; BBB and BB.]
In theory
[Figure: the idealized clusters again: null; A, B; AA, AB, BB; AAA, AAB, ABB, BBB]
In practice
[Figure: observed clusters; surviving labels: A, null, B]
So how are copy numbers called?
Look for runs of SNPs that are high or low in intensity
Many available algorithms, e.g. hidden Markov models (HMM), circular binary segmentation (CBS), change-point methods
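The idea of looking for runs can be illustrated with a deliberately naive threshold scan. This is not an HMM, CBS, or change-point method. It is a simplified stand-in that only shows what a "run" of high or low markers means. The function name, the thresholds and the minimum run length are all hypothetical, and the input is assumed to be one normalized intensity value per SNP, in genomic order, centered at 0 for normal copy number.

```python
def find_runs(values, low=-0.3, high=0.3, min_len=3):
    """Return (start, end, state) for runs of consecutive markers.

    A marker is "loss" if its value is below `low`, "gain" if above `high`,
    otherwise "normal". Only loss/gain runs of at least `min_len` markers
    are reported. `end` is exclusive.
    """
    def state_of(value):
        if value < low:
            return "loss"
        if value > high:
            return "gain"
        return "normal"

    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of the data or when the state changes.
        if i == len(values) or state_of(values[i]) != state_of(values[start]):
            state = state_of(values[start])
            if state != "normal" and i - start >= min_len:
                runs.append((start, i, state))
            start = i
    return runs


# Hypothetical normalized intensities along a chromosome (illustration only).
example = [0.0, 0.1, -0.6, -0.5, -0.7, 0.05, 0.5, 0.6, 0.0, 0.4, 0.5, 0.45, 0.6]
print(find_runs(example))
```

A fixed threshold like this breaks a run at any single noisy marker, which is one reason model-based segmentation methods are used in practice.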
Basic picture
Komura et al., Genome Research, 2006
More complex examples (cancer genetics)
Peiffer et al., Genome Research, 2006
[Figure: total intensity and angle (genotype info; AA, AB, BB) along a chromosome, with one amplification and two deletions marked]
Extra copy of whole chromosome
Total intensity high over whole chromosome; 3 genotype groups
No copy number change, but a region of homozygosity (LOH)
[Figure: region marked LOH]
Basic picture
Wang et al. Genome Research, 2007
Chromosome 9
A few statistical issues to think about …
(there’s still a lot to do)
Many run-calling algorithms are oriented towards clinical applications.
Many CNV detection algorithms are very conservative - aim for zero false positive rate.
Most use normalization methods that assume a large reference population is not available.
Many use models that make assumptions about what kinds of variation are likely (e.g. cancer).
Family data should be modeled together.
CNV “calls” will be much more accurate if you use the whole family, but the model you use should depend on whether you are expecting de novo mutations or not.
For some diseases you’ll expect associations with de novo changes. For others you might expect inherited variants.
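As a toy illustration of why the de novo question matters when modeling a family, the sketch below checks whether a child's copy number at a deletion locus could have been inherited from the parents' called copy numbers. It rests on strong assumptions that are not from the slides: an autosomal locus, deletions only (copy number 0, 1 or 2), and error-free calls. The function names are hypothetical.

```python
def transmissible_copies(parent_cn):
    """Copies a parent can pass on, assuming a deletion-only locus (CN 0, 1 or 2)."""
    options = {0: {0}, 1: {0, 1}, 2: {1}}
    if parent_cn not in options:
        raise ValueError("this sketch only handles copy numbers 0, 1 and 2")
    return options[parent_cn]


def possible_child_cn(mother_cn, father_cn):
    """All child copy numbers that plain inheritance could produce."""
    return {
        m + f
        for m in transmissible_copies(mother_cn)
        for f in transmissible_copies(father_cn)
    }


def is_inheritance_consistent(child_cn, mother_cn, father_cn):
    """True if the child's copy number can be explained without a de novo event."""
    return child_cn in possible_child_cn(mother_cn, father_cn)


# A child with one copy and two-copy parents: de novo deletion or a calling error.
print(is_inheritance_consistent(1, 2, 2))
# A child with one copy and a one-copy mother: consistent with inheritance.
print(is_inheritance_consistent(1, 1, 2))
```

A model that does not expect de novo events would treat the first trio as a likely calling error, while a model that does expect them would keep it as a candidate de novo deletion.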
How do we group CNVs for association testing?
[Figure: CNVs in a region, labeled deletion (x4) and duplication]
Separate methods for deletions?
Deletions are easier to detect than other changes.
Deletions are likely to have simpler biological effects.
The most important one …
The technology is still NOT intended for reliably and comparably measuring total intensity!
Total intensity numbers are very sensitive to DNA source, sample handling, etc., so extreme measures must be taken to ensure that cases and controls are comparable.