Hunting Disease Genes HMGP Richard A. Spritz, M.D. April 13, 2015 [email protected] 303-724-3107 Why Find Disease Genes? Accelerated by finding the disease gene.


Hunting Disease Genes
HMGP
Richard A. Spritz, M.D.
April 13, 2015
[email protected]
303-724-3107
Why Find Disease Genes?
Accelerated by finding the disease gene
Why Find Disease Genes?
• Virtually all diseases result from a combination of
genes and environmental factors
• We have no systematic ways to discover
environmental risk factors
• We do have systematic ways to discover disease
genes
• Discovery of disease genes will provide clues to
pathogenic mechanisms, new
approaches to treatment, inference of
environmental risk factors, and
ultimately disease prevention
• Personalized medicine ( = “Precision
Medicine”)
The Holy Grail
Personalized/Precision Medicine
Paradigm
• Discover risk genes for common diseases, specific
risk variants, high-risk combinations
• Carry out accurate DNA-based predictive
diagnostics of disease susceptibilities based on
individualized genetic risks
• Apply optimized individualized treatment or
prevention based on genetic diagnosis of disease
susceptibilities and pharmacogenetic
analysis of optimized drug
efficacy/specificity
• This is why there was a
Human Genome Project
Personalized/Precision Medicine
Paradigm--Problems
• For most common complex traits, individual
genes/variants confer low odds ratios
- OR = odds of disease with a given gene variant / odds of disease without the variant (close to the ratio of the two risks when the disease is rare)
- Population/study wide; no meaning at level of individual
• We do not yet know how to do “combinatorial”
complex trait risk prediction
- Genetic risk scores
• For most complex diseases it has been hard to
account for much of the ‘heritability’ of the trait
- H^2 = Var(G) / Var(P)
• Low positive predictive value
of genetic tests for complex
traits
- significant non-genetic component
- late onset
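The odds ratio on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the counts are invented for the example, not taken from any study. It computes both the ratio of risks and the ratio of odds from a 2x2 table of variant carriers and non-carriers, showing that the two are close when the disease is uncommon.

```python
def risk_and_odds_ratio(carrier_cases, carrier_noncases,
                        noncarrier_cases, noncarrier_noncases):
    """Return (relative_risk, odds_ratio) for a 2x2 table of counts."""
    risk_carrier = carrier_cases / (carrier_cases + carrier_noncases)
    risk_noncarrier = noncarrier_cases / (noncarrier_cases + noncarrier_noncases)
    relative_risk = risk_carrier / risk_noncarrier
    # Odds = affected / unaffected within each group.
    odds_ratio = (carrier_cases * noncarrier_noncases) / (
        carrier_noncases * noncarrier_cases)
    return relative_risk, odds_ratio


# Hypothetical counts: 1000 carriers and 1000 non-carriers of a variant.
rr, oratio = risk_and_odds_ratio(30, 970, 20, 980)
print(f"relative risk = {rr:.2f}, odds ratio = {oratio:.2f}")
```

Either number describes the study population as a whole; neither says whether a given carrier will become affected.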
Hunting for Disease
Genes
1. In a “Mendelian”, single-gene
trait, one gene is sufficient to cause
(most of) the disease phenotype
2. In a polygenic/multifactorial,
“complex” trait, no one gene is
sufficient to cause the disease
phenotype
How Do You Find Disease
Genes?
I. Hypothesis-driven approaches
   - Candidate gene association
   - Candidate gene sequencing
II. Hypothesis-free approaches
   - Genomewide linkage
   - Genomewide association
   - (Genomewide expression)
   - Genomewide sequencing
     - Exome
     - Full-genome
Disease Gene Identification—
“Functional Cloning” vs. “Positional Cloning”
Positional Cloning: Determine a Disease
Gene’s Genomic Position, and then
Identify the Gene
Obviated by the Human Genome Project
Gene Mapping Technology
Polymorphic DNA Markers
• You can only track/measure differences
between people and through families
• Polymorphic DNA markers constitute any
scorable differences at known genomic
positions
• Surrogates for disease mutations; some
polymorphisms cause disease; most don’t
• Most commonly used marker types:
– microsatellites
– single-nucleotide polymorphisms (SNPs)
– copy-number variations (CNVs)
The First Goal of the HGP was to
Assemble a High-Density Genome Map
of Polymorphic Markers
Genetic Linkage Studies
•Studies families
•Search for regions of genome that are
systematically co-inherited along with
disease on passage through families
•Requires families with multiple affected
relatives (multiplex families)
•Best at detecting genes with Mendelian
effects (uncommon alleles with strong
effects)
•Statistical measure of genetic linkage is the LOD (“log of the
odds”) score; significant if >3
Principle of genetic linkage:
Loci close by on a chromosome tend not to be
separated by recombination vs. loci far apart

                                    Loci on the same chromosome        Loci on different
                                    Very close   Nearby   Far apart    chromosomes
Freq. of crossover between 2 loci   Rare         Some     Frequent     -
Linkage                             Tight        Some     Absent       Absent
Recombination                       0%           1-49%    50%          50%
• Unit of genetic “distance” is centiMorgan
(cM) = 1% recombination/meiosis; ~ 1 Mb
Genetic Linkage Analysis
• Statistical measure is LOD (log of odds)
score
LOD(θ) = log10 [ (Likelihood of data if loci linked at recombination fraction θ) / (Likelihood of data if loci unlinked, θ = 0.5) ]
• Significance level: LOD >3.0 for Mendelian trait
                      LOD >3.3 for Polygenic trait
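A worked LOD calculation may help here. The sketch below assumes the simplest textbook case, which the slides do not spell out: phase-known meioses scored as recombinant or non-recombinant, so the likelihood at recombination fraction θ is θ^k (1-θ)^(n-k). The function names and the example counts are illustrative.

```python
import math


def lod_score(theta, n_meioses, n_recombinants):
    """LOD = log10[L(theta) / L(0.5)] for phase-known meioses."""
    k, n = n_recombinants, n_meioses
    likelihood_linked = (theta ** k) * ((1 - theta) ** (n - k))
    likelihood_unlinked = 0.5 ** n
    if likelihood_linked == 0:
        return float("-inf")  # data impossible at this theta
    return math.log10(likelihood_linked / likelihood_unlinked)


def max_lod(n_meioses, n_recombinants, step=0.01):
    """Scan theta from 0 to 0.5 and return (best LOD, best theta)."""
    thetas = [i * step for i in range(int(round(0.5 / step)) + 1)]
    return max((lod_score(t, n_meioses, n_recombinants), t) for t in thetas)


# Hypothetical family data: 10 informative meioses, 1 recombinant.
best_lod, best_theta = max_lod(10, 1)
print(f"max LOD = {best_lod:.2f} at theta = {best_theta:.2f}")

# With 0 recombinants in 10 meioses, the LOD at theta = 0 just exceeds 3.
print(f"LOD(0 recombinants) = {lod_score(0.0, 10, 0):.2f}")
```

Under these assumptions, ten fully informative meioses with no recombinants are about the minimum needed to reach the LOD > 3 threshold.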
Restriction Fragment Length
Polymorphism (RFLP)
Allele 1  AGAGCCTCAACTTGAATTCGTTTAGTAA  (EcoRI site present)
Allele 2  AGAGCCTCAACTTGAATTTGTTTAGTAA
Restriction enzyme EcoRI cuts at sequence
5’-GAATTC-3’
Allele 1 has an EcoRI cut site; Allele 2 does not
• This RFLP is assaying a SNP
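The RFLP above can be reproduced in a few lines. This is a minimal sketch of an in-silico digest using the two allele sequences from the slide. The function name is made up for illustration, and only one strand is scanned, which is sufficient because the EcoRI site is palindromic. EcoRI cuts between the G and the first A of GAATTC.

```python
ECORI_SITE = "GAATTC"
CUT_OFFSET = 1  # EcoRI cuts G^AATTC, i.e. one base into the site


def digest(seq, site=ECORI_SITE, cut_offset=CUT_OFFSET):
    """Return fragment lengths after cutting seq at every occurrence of site."""
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(cut - start)
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(len(seq) - start)
    return fragments


allele_1 = "AGAGCCTCAACTTGAATTCGTTTAGTAA"
allele_2 = "AGAGCCTCAACTTGAATTTGTTTAGTAA"

print("Allele 1 fragments:", digest(allele_1))  # two fragments: site present
print("Allele 2 fragments:", digest(allele_2))  # one fragment: site absent
```

The different fragment patterns are what a gel would show; the underlying difference is the single C/T base inside the recognition site.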
“Genetic linkage analysis”
Co-segregation of disease gene in “multiplex
families” with alleles of polymorphic DNA
“markers” (initially RFLPs)
“Microsatellites” (SSLPs; STRPs, SSRs)
[multi-allelic; ~ 1/30,000 bp; mostly used for
linkage analysis, forensics]
ggctgcacacacacacacacacacacacatgctt
ggctgcacacacacacacacacacacatgctt
ggctgcacacacacacacacacacatgctt
ggctgcacacacacacacacacatgctt
ggctgcacacacacacacacatgctt
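Microsatellite alleles like those above differ only in the number of repeat units, so scoring one amounts to counting repeats. The sketch below builds hypothetical alleles with the same flanking sequence as the slide (the repeat counts 8 to 12 are chosen for illustration, not read off the slide) and counts the longest run of "ca".

```python
import re


def longest_ca_run(seq):
    """Return the number of repeat units in the longest run of 'ca'."""
    runs = re.findall(r"(?:ca)+", seq.lower())
    return max((len(run) // 2 for run in runs), default=0)


# Hypothetical alleles: same flanks as on the slide, illustrative repeat counts.
alleles = {n: "ggctg" + "ca" * n + "tgctt" for n in (8, 9, 10, 11, 12)}

for n, seq in alleles.items():
    print(seq, "->", longest_ca_run(seq), "repeats")
```

Because many repeat lengths exist at one locus, most people are heterozygous, which is what makes these markers informative for following chromosomes through families.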
Can follow “segregation” of ancestral
“haplotypes” of linked marker alleles
along a chromosome through families
Recombination events prune marker
haplotypes, defining “genetic interval” that
must contain the disease gene
Single-Nucleotide Polymorphisms (SNPs)
[bi-allelic; ~1/50-300 bp; mostly used for
association analysis]
SNP1 Allele 1 CCGAGATCCAGAAATCCTGAACATAA
SNP1 Allele 2 CTGAGATCCAGAAATCCTGAACATAA
SNP2 Allele 1 CCGAGATCCAGAAATCCTGAACATAA
SNP2 Allele 2 CCGAGATCCAGAAAGCCTGAACATAA
• Occurrence/allele frequencies differ in different ethnic
groups/populations
• Can be in genes (~4,000,000) or not (~8,000,000), can result in
amino acid substitutions or not
• Each occurs in local context (haplotype) of surrounding SNPs
(in example above, SNP2 is on background of SNP1 C allele)
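The two SNPs in the example can be located by comparing each variant sequence with the reference base by base. This is a small sketch using the sequences from the slide; the function name and the 0-based position convention are choices made for the example.

```python
def find_snps(reference, variant):
    """Return (position, ref_base, alt_base) for each mismatch (0-based)."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be the same length")
    return [(i, r, v) for i, (r, v) in enumerate(zip(reference, variant)) if r != v]


reference = "CCGAGATCCAGAAATCCTGAACATAA"  # SNP1 allele 1 / SNP2 allele 1
snp1_alt = "CTGAGATCCAGAAATCCTGAACATAA"   # SNP1 allele 2
snp2_alt = "CCGAGATCCAGAAAGCCTGAACATAA"   # SNP2 allele 2

print("SNP1:", find_snps(reference, snp1_alt))
print("SNP2:", find_snps(reference, snp2_alt))
```

The SNP2 variant sequence still carries C at the SNP1 position, which is the sense in which SNP2 sits on the background of the SNP1 C allele.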
Haplotype Map of Human Genome
International HapMap Project
•Recombination breaks macro-patterns of polymorphic
genotypes on the same chromosome into haplotypes
•Recombination is not truly random, so very close
polymorphism genotypes on the same chromosome
cluster into ~10-50 kb haplotype blocks in which SNP
alleles are in linkage disequilibrium (marker alleles
within blocks tend to be co-inherited, because
recombination within blocks is uncommon)
•Blocks smaller in African than Caucasian or Asian
pops. because African pop. is more ancient
•HapMap genotyped SNPs in different populations to
characterize haplotype block distributions
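Linkage disequilibrium between two SNPs is usually summarized by D, D' and r². The sketch below uses the standard definitions, which the slides do not state explicitly, and computes all three from counts of the four possible two-SNP haplotypes. The counts are invented, and both SNPs are assumed to be polymorphic in the sample.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r_squared) from counts of the four haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B  # deviation from independence
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r_squared


# Hypothetical haplotype counts: alleles A and B usually travel together.
d, d_prime, r2 = linkage_disequilibrium(45, 5, 5, 45)
print(f"D = {d:.2f}, D' = {d_prime:.2f}, r^2 = {r2:.2f}")
```

Within a haplotype block r² between neighbouring SNPs is typically high, which is why genotyping one "tag" SNP can stand in for its neighbours.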
Copy-Number Variants (CNVs)
[bi-allelic]
Basically are common genomic deletions, hundreds to
tens of thousands of nucleotides in size
May be detected by LD with local SNP patterns:
Allele      --1---1---1----1---2----1----2----1-----1----2----1----1----1----1--
Allele      --2---2---2----1---1----2----2----1-----1----2----2----2----1----2--
CNV Allele  --1---1--[ deletion ]--1----2---
• Tens of thousands known
• Like SNPs, occurrence/allele frequencies differ in different
ethnic groups/populations
• Individually most are rare (< 1%), collectively common
• Can be in genes or not, can include genes
• NOT commonly definitively causal for human disease
1000 Genomes Project, UK10K Project
International projects to sequence 1000/10000
genomes from different ethnic groups
• Catalog human genetic variations (particularly SNPs, indels)
– ~60,000,000 SNPs now known
– Essential for sequence-based analysis of rare variants that
may be causal for common diseases
Common, Complex Diseases
• Asthma
• Autism
• Obesity
• Preterm birth
• Cleft lip/palate
• IBD
• Diabetes
• Cancers
• Common traits like height
Common, Complex Diseases
Utility of Experimental Approaches
[Figure: GWAS, re-sequencing, and linkage placed by risk allele frequency (rare to common) and effect size, OR (small to large)]
Hypothesis-Driven Approaches
Candidate genes
Depends on:
- biological hypothesis (biological candidate)
- positional hypothesis / information (positional
candidate)
• Sometimes successful in Mendelian disorders
• Low yield in polygenic, multifactorial
(“complex”) disorders: pathogenic sequence
variants not obvious, often present in normal
individuals
• Most hypotheses wrong!
Candidate Gene Association Study
Concept:
Causal disease variation in gene suggested by
known biology ‘tagged’ by nearby polymorphic
DNA markers; test for co-occurrence.
Because:
DNA sequence variations very close together
on the same piece of DNA will tend to not be
separated by recombination over long periods,
and so will be non-randomly co-inherited even
on a population basis (“linkage
disequilibrium”).
Most hypotheses wrong!
Candidate Gene Association Studies
• Compares SNP allele frequencies in cases
versus controls (“case-control” study design)
• Easy statistics (Fisher exact test, Chi-square)
• Must Bonferroni correct for multiple-testing
• Must ethnically match cases and controls
• Easy, cheap
• Most powerful for common risk alleles
• Can detect common alleles with small allele-specific
effects (i.e. “complex”, polygenic traits)
• Most common published type of “genetic
study”
• Most hypotheses wrong!
• Most (~96%) such published studies wrong!!
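The "easy statistics" of a case-control comparison can be sketched with nothing beyond the standard library. The example below runs a 1-degree-of-freedom chi-square test on allele counts and then applies a Bonferroni correction. The counts and the number of SNPs tested are invented for illustration. The p-value uses the identity that, for 1 df, the upper tail of the chi-square distribution equals erfc(sqrt(chi2 / 2)).

```python
import math


def allelic_chi_square(case_a1, case_a2, control_a1, control_a2):
    """Chi-square (1 df, no continuity correction) on a 2x2 allele-count table."""
    n = case_a1 + case_a2 + control_a1 + control_a2
    row_cases = case_a1 + case_a2
    row_controls = control_a1 + control_a2
    col_a1 = case_a1 + control_a1
    col_a2 = case_a2 + control_a2
    chi2 = n * (case_a1 * control_a2 - case_a2 * control_a1) ** 2 / (
        row_cases * row_controls * col_a1 * col_a2)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail for 1 df
    return chi2, p_value


# Hypothetical allele counts: 100 cases and 100 controls (200 alleles each).
chi2, p = allelic_chi_square(120, 80, 90, 110)
n_snps_tested = 50  # assumed number of SNPs tested in the same study
bonferroni_alpha = 0.05 / n_snps_tested

print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
print("nominally significant:", p < 0.05)
print("significant after Bonferroni:", p < bonferroni_alpha)
```

In this made-up example the result passes the nominal 0.05 threshold but not the corrected one, which is the situation the slide warns about when the true number of tests is not counted.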
Three Fatal Flaws in Gene-by-Gene
Case-Control Design
• Must apply multiple-testing correction; true
denominator often not known
• Must ethnically match cases & controls;
otherwise, differences in allele frequencies
may reflect different genetic backgrounds of
cases vs. controls
• Positive studies result in publication bias
“Population stratification” and false-positive
case-control genetic association studies
[Figure: Population 1 and Population 2 (blue/green just indicates overall genetic background), disease, and an admixed study population 1/2 from which cases and controls are drawn; “Prof. Wizard’s Case-Control Study” announces “Eureka!”]
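A small numerical example shows how stratification manufactures an association. In the sketch below all counts are invented: the allele has no effect on disease in either population (odds ratio 1 within each), but population 1 has both a higher allele frequency and more cases, so pooling the two produces a large spurious odds ratio.

```python
def odds_ratio(case_a1, case_a2, control_a1, control_a2):
    """Allelic odds ratio from allele counts in cases and controls."""
    return (case_a1 * control_a2) / (case_a2 * control_a1)


# Hypothetical allele counts as (A alleles, a alleles).
# Population 1: allele A frequency 0.8 in both cases and controls.
pop1_cases, pop1_controls = (256, 64), (64, 16)
# Population 2: allele A frequency 0.2 in both cases and controls.
pop2_cases, pop2_controls = (16, 64), (64, 256)

or_pop1 = odds_ratio(*pop1_cases, *pop1_controls)
or_pop2 = odds_ratio(*pop2_cases, *pop2_controls)

# Pooled "admixed" study: most cases come from population 1,
# most controls from population 2.
pooled_cases = (pop1_cases[0] + pop2_cases[0], pop1_cases[1] + pop2_cases[1])
pooled_controls = (pop1_controls[0] + pop2_controls[0],
                   pop1_controls[1] + pop2_controls[1])
or_pooled = odds_ratio(*pooled_cases, *pooled_controls)

print(f"OR within population 1: {or_pop1:.2f}")
print(f"OR within population 2: {or_pop2:.2f}")
print(f"OR in pooled study:     {or_pooled:.2f}")
```

The pooled odds ratio reflects only the different ancestry mix of cases and controls, not any effect of the allele.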
Hypothesis-Free Approaches
Genome-Wide Association Studies (GWAS)
Relatively recent approach (>300 published):
•Genotype hundreds of thousands to millions of
SNPs across genome using microarrays;
extremely expensive
•Case-control or family-based (trio) design
•Requires no hypotheses about pathogenesis; can
discover new genes
•Can discover common alleles with small effects
•Can provide very fine localization
Hypothesis-free approaches
Genome-wide association studies (GWAS)
• Study self-contained; can apply appropriate multiple
testing correction
- “Genomewide significance” P < 5 × 10^-8
• Still requires ethnic matching of cases and controls
- Can correct for population stratification by
“Principal components” analysis
- Can correct for residual “Genomic inflation
factor” by “genomic control”
• Can discover new, unknown genes; power similar to
candidate gene case-control study
• Case-control “associations” require independent
confirmation
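Two of the corrections listed above reduce to simple arithmetic. The sketch below makes two assumptions that the slides do not state. First, the genomewide threshold is treated as a Bonferroni correction of 0.05 for about one million independent tests, a common justification for 5 × 10^-8. Second, the genomic inflation factor is taken as the median observed chi-square statistic divided by the median of a 1-df chi-square distribution (about 0.455). The list of test statistics is invented.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def genomic_inflation_factor(chi2_stats):
    """Lambda: median observed chi-square over its expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Deflate test statistics by lambda (never inflate if lambda < 1)."""
    lam = max(genomic_inflation_factor(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats]


threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"genomewide threshold = {threshold:.1e}")

# Hypothetical per-SNP chi-square statistics from a small scan.
observed = [0.1, 0.5, 1.0, 0.6, 0.3]
lam = genomic_inflation_factor(observed)
print(f"lambda = {lam:.3f}")
print("corrected:", [round(x, 3) for x in genomic_control(observed)])
```

A lambda well above 1 suggests residual stratification or other systematic inflation; principal components analysis addresses the cause, while genomic control only rescales the statistics.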
The Genomewide Association Study (GWAS)
Manolio TA. N Engl J Med 2010;363:166-176.
Meta-Analysis of Multiple Genomewide Association
Studies
Genome-Wide Association Studies
“Manhattan plot”: per-SNP -log10(P values) across the genome for association of SNP allele frequency differences between patients with generalized vitiligo versus controls (all Caucasian)
Genome-Wide Association Studies
• Very large number of SNPs tested (500,000 –
2,000,000) presents huge multiple-testing problem;
requires at least ~1000 cases and ~1000 controls
• Many SNPs in linkage disequilibrium (i.e.
correlated); simple Bonferroni correction too strict
(assumes independence)
•“Significant” associations require confirmation by
independent follow-up association study of
specific SNPs to reduce multiple-testing
complexity
Personalized Medicine
The case of the ‘missing heritability’
• Disease risk genes found by GWAS
account for only a small fraction of genetic risk
>Type 1 diabetes-- ~100 genes, ~70% of genetic risk
50% of risk due to HLA class II
• Are there a virtually unlimited number of additional genes,
each conferring small additional risk?
>Maybe
• Have we under-estimated fraction of genetic risk already
accounted for?
>Maybe. GWAS misses rare risk alleles
• Have we over-estimated total genetic component of risk?
>Maybe, but not ten-fold
Hypotheses of Common, “Complex”
Disease
• Common disease, common variant hypothesis
(Reich & Lander, 2001)
versus
• Rare variant hypothesis (Pritchard, 2001;
Pritchard and Cox, 2002)
Complex Diseases
Utility of Experimental Approaches
[Figure: GWAS, re-sequencing, and linkage placed by risk allele frequency (rare to common) and effect size, OR (small to large)]
Combined hypothesis-based and
hypothesis-free approaches
Deep re-sequencing
• High-throughput DNA sequencing
• Biological candidate genes
• GWAS signals (specific genes or genes
within regions)
• Must distinguish potentially causal variants
from non-pathological variation (1000
Genomes Project data will help)
• Prioritize for follow-up functional analyses
Hypothesis-free approach
Exome/Genome sequencing
• High-throughput DNA sequencing
- Genome
- Exome (1% of genome)
• Must distinguish potentially causal variants
from non-pathological variation (1000
Genomes Project data will help)
- Predict based on Mendelian inheritance
- Compare across unrelated families
• Prioritize for follow-up functional analyses
Exome Sequencing in Mendelian Diseases
Method
Exome = Gene coding regions; ~ 30 Mb (1% of genome)
How Do You Find Disease Genes?
Exome/Genome Sequencing in Mendelian
Diseases
There is a lot of genomic ‘noise’!!
Variant Filtering in Exome/Genome
Sequencing
• Missense (non-synonymous) substitutions
- Most rare (<1%) missense may be deleterious
• Nonsense, frameshift mutations
• Splice junction mutations
• Exonic splice enhancer mutations
• INDELs, CNVs, translocations
• Regulatory Feature variants
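The filtering logic can be sketched as a pass over annotated variants. Everything in the example below is hypothetical: the record fields, the consequence labels, the gene names, and the 1% frequency cutoff (taken from the "rare (<1%)" note above). Real pipelines work from annotated VCF files with many more criteria. The second function illustrates comparing across unrelated families by keeping only genes with a surviving variant in every family.

```python
# Consequence classes treated as potentially causal (labels are illustrative).
DAMAGING = {"missense", "nonsense", "frameshift", "splice_junction"}


def filter_variants(variants, max_allele_freq=0.01, consequences=DAMAGING):
    """Keep rare variants whose predicted consequence is in the given set."""
    return [v for v in variants
            if v["allele_freq"] < max_allele_freq
            and v["consequence"] in consequences]


def genes_shared_by_all(families):
    """Genes with at least one surviving variant in every family."""
    gene_sets = [{v["gene"] for v in filter_variants(fam)} for fam in families]
    return set.intersection(*gene_sets) if gene_sets else set()


# Hypothetical variant calls from two unrelated affected families.
family_1 = [
    {"gene": "GENE_A", "consequence": "missense", "allele_freq": 0.0002},
    {"gene": "GENE_B", "consequence": "synonymous", "allele_freq": 0.0001},
    {"gene": "GENE_C", "consequence": "missense", "allele_freq": 0.12},
]
family_2 = [
    {"gene": "GENE_A", "consequence": "frameshift", "allele_freq": 0.0},
    {"gene": "GENE_D", "consequence": "nonsense", "allele_freq": 0.0005},
]

print(filter_variants(family_1))
print(genes_shared_by_all([family_1, family_2]))
```

Variants that survive such filters are candidates only; they still need to be prioritized for functional follow-up.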
How Do You Find Disease Genes?
Exome/Genome Sequencing in Mendelian
Diseases
Filtering Schemes
How Do You Find Disease Genes?
Exome/Genome Sequencing in Mendelian
Diseases
Exome sequencing is rapidly becoming a
fairly routine clinical test, costing ~$1000,
ordered in lieu of tens of thousands of dollars
worth of functional clinical tests in a patient
one believes might have a genetic (principally
single-gene Mendelian) cause for their
disorder.
Who will do the interpretation of the data, how
will “variants of unknown significance” (VUS)
be addressed, and what will that cost?