
Bioinformatics
SNPs and haplotypes
Kristel Van Steen, PhD, ScD
Université de Liège - Institut Montefiore
2008-2009
Acknowledgements
Parts of these slides have been adapted or
taken over from existing course notes and
online material:
Practical: Heather Cordell
Slides: Stuart M Brown
Outline



Practical in R on genetic association
analysis
SNPs and Haplotypes
A tour in FBAT
Genetic Association Analysis in R
Computer Practical Exercise

Heather Cordell
http://www.staff.ncl.ac.uk/heather.cordell/WTACcasecon2007.html

Using R for


Case-control association
Gene-gene interactions (future class)
SNPs and Haplotypes
A gentle introduction
of relevant issues
Mutations create Alleles
•Mutations occur randomly throughout the DNA
•Most have no phenotypic effect (non-coding regions,
equivalent codons, similar AAs)
•Some damage the function of a protein or regulatory
element
•A very few provide an evolutionary advantage
Human Alleles



The OMIM (Online Mendelian Inheritance in Man)
database at the NCBI tracks all human mutations with
known phenotypes.
It contains a total of about 2,000 genetic diseases [and
another ~11,000 genetic loci with known phenotypes - but
not necessarily known gene sequences]
It is designed for use by physicians:
• can search by disease name
• contains summaries from clinical studies
Population Genetics



Chromosome pairs segregate and recombine in every
generation.
Every allele of every gene has its own independent
evolutionary history (and future!)
Frequencies of various alleles differ in different subpopulations of people.
SNPs




Single nucleotide polymorphisms (SNPs) are DNA
sequence variations occurring when a single nucleotide
(A, C, T, G) in the genome is altered.
The inherited allelic variation must have >1% population
frequency.
SNPs can occur in both coding and non-coding regions,
making up 90% of all human genetic variation
Frequency: roughly one SNP every 100 to 300 bases along the
roughly 3-billion-base human genome
Remark: Some definitions include methylated and
deaminated dinucleotides
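A minimal sketch of the >1% frequency criterion, assuming diploid genotypes are given as two-letter strings such as "AG". The function names and the sample counts are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as 'AG'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_snp(genotypes, threshold=0.01):
    """True if at least two alleles are seen and the rarest exceeds the threshold."""
    freqs = allele_frequencies(genotypes)
    return len(freqs) >= 2 and min(freqs.values()) > threshold


# Hypothetical sample: 97 AA and 3 AG individuals (200 chromosomes in total)
sample = ["AA"] * 97 + ["AG"] * 3
freqs = allele_frequencies(sample)
print(freqs)           # allele G is carried on 3 of 200 chromosomes
print(is_snp(sample))  # minor allele frequency 1.5% is above the 1% cut-off
```

Note that frequencies are counted per chromosome, so 100 diploid individuals contribute 200 alleles.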
Distribution of SNPs and Power
SNPs are Very Common



SNPs are very common in the human
population.
Between any two people, there is an
average of one SNP every 1000 bases.
Most of these have no phenotypic effect



Fewer than 1% of all human SNPs affect protein
function (most fall in non-coding regions)
Selection against missense mutations (think about
what would happen to dominant lethal mutations)
Some are alleles of genes.
Why are SNPs Important?


Alleles of health related genes
Genetic Markers that are linked to every gene
(and to non-transcribed loci that may also
affect health)

Fast, cheap, accurate genotypes

Population diversity & history


Genetic Association studies in populations
Pharmacogenomics
Genome Sequencing finds SNPs



The Human Genome Project involves
sequencing DNA cloned from a number
of different people.
Even in a library made from one
person’s DNA, the homologous
chromosomes have SNPs
This inevitably leads to the discovery of
SNPs - any single base sequence
difference
We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the
human genome, providing an average density on available sequence of one SNP every 1.9 kilobases.
These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone
overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly
available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall
within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP.
Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard
population genetic model of human history. This high-density SNP map provides a public resource for
defining haplotype variation across the genome, and should help to identify biomedically important
genes for diagnosis and therapy.
GenBank has a dbSNP
“As of Mar. 2007, dbSNP has submissions for
31,035,607 human SNPs”

It is possible to search dbSNP by BLAST comparisons to a
target sequence
>gnl|dbSNP|rs1042574_allelePos=51 total len = 101 |taxid = 9606|snpClass = 1
Length = 101
Score = 149 bits (75), Expect = 3e-33
Identities = 79/81 (97%)
Strand = Plus / Plus
Query: 1489 ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt 1548
            ||||||||||||||||||||||||||||||||||||||||||||||| || |||||||||
Sbjct: 1    ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt 60

Query: 1549 cactaaggacttaaaataaaa 1569
            |||||||||||||||||||||
Sbjct: 61   cactaaggacttaaaataaaa 81

If a matching SNP is found, then it can be directly located on the
Genome map
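As a sketch, the alignment above can be re-checked in a few lines. The two sequences are copied from the BLAST output; the IUPAC table is a subset of the standard ambiguity codes, and the `compare` helper is a hypothetical name. It recounts the identities and flags whether each mismatch is compatible with an ambiguity code in the dbSNP entry (here `y`, meaning C or T, at position 51, which is the reported allelePos).

```python
# Subset of the standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at",
    "k": "gt", "m": "ac", "n": "acgt",
}

# Sequences copied from the two alignment blocks shown above
query = ("ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt"
         "cactaaggacttaaaataaaa")
sbjct = ("ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt"
         "cactaaggacttaaaataaaa")


def compare(query, sbjct):
    """Return (identities, mismatches); each mismatch is
    (position in subject, query base, subject base, compatible with IUPAC code)."""
    mismatches = []
    for pos, (q, s) in enumerate(zip(query, sbjct), start=1):
        if q != s:
            compatible = q in IUPAC.get(s, "")
            mismatches.append((pos, q, s, compatible))
    return len(query) - len(mismatches), mismatches


identities, mismatches = compare(query, sbjct)
print(f"Identities = {identities}/{len(query)}")
for pos, q, s, compatible in mismatches:
    print(pos, q, s, "matches SNP allele" if compatible else "true mismatch")
```

BLAST counts the ambiguity code as a mismatch, which is why the identities are 79/81 even though the query base at position 51 is one of the two SNP alleles.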
Linkage


Meiosis (sexual cell
division) involves a
process of crossing
over, which gives new
combinations of alleles
Genes that are located
close to each other on
the chromosome rarely
show recombination of
alleles
HapMap Project




The HapMap Project tests linkage between SNPs in
various sub-populations.
For a group of linked SNPs recombination may be
rare over tens of thousands of bases
A few "tag SNPs" can be used to identify genotypes
for groups of linked SNPs
Makes it possible to survey the whole genome with
fewer markers (one third to one tenth as many)
Haplotype




Linkage is common in the human population,
particularly in genetically isolated sub-populations.
A group of alleles for neighboring genes on a segment
of a chromosome are very often inherited together.
Such a combination of linked alleles is known as a
haplotype.
When linked alleles occur together in a population more often
than expected by chance, this is called linkage disequilibrium (LD).
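Linkage disequilibrium is usually quantified from haplotype and allele frequencies. The sketch below uses the common textbook measures D, D' and r² for two biallelic loci; the function name and the example frequencies are hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # deviation from independence
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0  # keeps the sign of D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical example: both alleles at 50%, AB haplotype seen on 40% of chromosomes
d, d_prime, r2 = ld_stats(p_ab=0.40, p_a=0.5, p_b=0.5)
print(round(d, 4), round(d_prime, 4), round(r2, 4))
```

Under independence the AB haplotype would be expected at 0.5 × 0.5 = 0.25, so D = 0.15 here; D = 0 corresponds to linkage equilibrium.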
Haplotype Map of the Human
Genome
Goals:
• Define patterns of genetic variation across human genome
• Guide selection of SNPs efficiently to “tag” common variants
• Public release of all data (assays, genotypes)
Phase I: 1.3 M markers in 269 people
Phase II: +2.8 M markers in 270 people
HapMap Samples


90 Yoruba individuals (30 parent-parent-offspring
trios) from Ibadan, Nigeria (YRI)
90 individuals (30 trios) of European descent from
Utah (CEU)

45 Han Chinese individuals from Beijing (CHB)

45 Japanese individuals from Tokyo (JPT)
Recombination hotspots: widespread, shaping LD structure
(figure: region 7q21)
Common Haplotypes


For a single locus in a population, 55 percent of
people may have one version of a haplotype, 30
percent may have another, 8 percent may have a
third, and the rest may have a variety of less common
haplotypes.
These haplotype blocks may contain 5-20 SNPs
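A small sketch of how such haplotype frequencies could be tabulated, assuming phased haplotypes are already available as strings of SNP alleles. The five-SNP haplotypes and their counts are invented to mirror the 55/30/8 percent example.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Return [(haplotype, frequency)] sorted from most to least common."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return [(hap, n / total) for hap, n in counts.most_common()]


# Hypothetical block of 5 SNPs observed on 100 chromosomes
chromosomes = (["ACGTA"] * 55 + ["ACGCA"] * 30 + ["TCGTA"] * 8
               + ["TTGTA"] * 4 + ["ACATG"] * 3)
hap_freqs = haplotype_frequencies(chromosomes)
for hap, freq in hap_freqs:
    print(hap, f"{freq:.0%}")
```

In practice the phase is often unknown and has to be inferred from genotypes (or from trios, as in the HapMap samples); this sketch skips that step.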
Common Haplotypes


All of these haplotypes can be identified by
genotyping 1-3 "tag SNPs"
Tag SNPs that contain most of the information about
the patterns of human genetic variation are estimated
to be about 300,000 to 600,000, which is far fewer
than the 10 million common SNPs.
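One simple way to pick tag SNPs is a greedy search on pairwise r². This is a sketch under stated assumptions, not the HapMap procedure itself: alleles are coded 0/1 on phased chromosomes, the r² ≥ 0.8 threshold is a commonly used convention, and the SNP names and data are made up.

```python
def r_squared(x, y):
    """r^2 between two biallelic SNPs coded 0/1 on the same chromosomes."""
    n = len(x)
    p_x = sum(x) / n
    p_y = sum(y) / n
    p_xy = sum(a * b for a, b in zip(x, y)) / n
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    return 0.0 if denom == 0 else (p_xy - p_x * p_y) ** 2 / denom


def greedy_tag_snps(snps, threshold=0.8):
    """Pick tags until every SNP is a tag or has r^2 >= threshold with a tag."""
    untagged = set(snps)
    tags = []
    while untagged:
        # Choose the SNP that captures the most still-untagged SNPs
        # (sorted() makes ties resolve by name, so the result is reproducible)
        best = max(
            sorted(untagged),
            key=lambda s: sum(
                r_squared(snps[s], snps[t]) >= threshold for t in untagged
            ),
        )
        captured = {t for t in untagged
                    if r_squared(snps[best], snps[t]) >= threshold}
        captured.add(best)  # a tag always covers itself
        tags.append(best)
        untagged -= captured
    return tags


# Hypothetical data: 5 SNPs on 8 chromosomes, forming two groups in perfect LD
snps = {
    "snp1": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp2": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp3": [1, 1, 1, 1, 0, 0, 0, 0],
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp5": [0, 1, 0, 1, 0, 1, 0, 1],
}
tags = greedy_tag_snps(snps)
print(tags)
```

Here two tags stand in for five SNPs, which is the same kind of reduction the slide describes at genome scale.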
Applications of HapMap

Pick better SNPs for genotyping study



Choose SNPs with high heterozygosity in target
population
Whole genome coverage with reduced set of
"tag SNPs" (capture all "common variants")
Interpret genotyping results


What genes are in LD with this SNP?
What coding variants and putative functional variants
are in LD with this SNP?
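To illustrate "high heterozygosity in target population": expected heterozygosity under Hardy-Weinberg proportions is 1 minus the sum of squared allele frequencies. The SNP names and frequencies below are hypothetical.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity under Hardy-Weinberg: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical allele frequencies of three candidate SNPs in the target population
candidates = {
    "snpA": [0.50, 0.50],
    "snpB": [0.95, 0.05],
    "snpC": [0.70, 0.30],
}
ranked = sorted(
    candidates,
    key=lambda name: expected_heterozygosity(candidates[name]),
    reverse=True,
)
for name in ranked:
    print(name, round(expected_heterozygosity(candidates[name]), 3))
```

A biallelic SNP is most informative when both alleles are near 50%, where expected heterozygosity reaches its maximum of 0.5.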
Example:
Complement Factor H - AMD
rs380390
SNP Testing




Genotyping
SNPs are permanent features of genomic DNA
May be homozygous or heterozygous
Many different technologies are available
Genotyping Technologies





Sequencing (whole genome or targeted)
PCR (allele specific primers)
Oligonucleotide ligation
Primer extension (incorporate labeled nucleotides)
Hybridization (microarray)
TaqMan - rtPCR
•4 oligos must be designed and tested for each SNP
•Fast & cheap for lots of samples
Primer Extension
Oligonucleotide Ligation
(ABI)
can multiplex 48 SNPs
Preliminary data from Affy 10K SNP
Microarrays



Screening large numbers of SNP markers on a
sample of genomic DNA is one highly promising
application for microarray technology.
Many other “high-throughput” SNP genotyping
technologies are under development.
Affymetrix 1 million SNP product on sale now!
Comparison of Methods?




Array-based methods can cover the whole genome
PCR (& variants) are cheaper for defined numbers of
SNPs on lots of samples
Whole genome: may be too much data
• false positives
• privacy concerns
Whole genome may work for discovery research, but
clinical applications favor targeted assays
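The false-positive concern can be made concrete with a little arithmetic. This sketch assumes independent tests at a nominal significance level of 0.05 and uses the Bonferroni correction as one simple (conservative) remedy; the one-million figure matches the array size mentioned earlier.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of null markers passing the threshold by chance."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_snps = 1_000_000  # e.g. a 1 million SNP array
fp = expected_false_positives(n_snps)
threshold = bonferroni_threshold(n_snps)
print(f"expected false positives at 0.05: {fp:,.0f}")
print(f"Bonferroni per-SNP threshold: {threshold:.1e}")
```

A targeted assay with a few dozen SNPs needs a far less stringent threshold, which is one reason clinical applications favor it.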
Pharmacogenomics

The use of DNA sequence information
to measure and predict the reaction of
individuals to drugs.

Personalized drugs

Faster clinical trials

Fewer drug side effects
Some Gene Products Interact
with Drugs

There are proteins that chemically activate
or inactivate drugs.

Other proteins can directly enhance or block
a drug's activity.

There are also genes that control side effects
Example

10% of African Americans have
polymorphic alleles of glucose-6-phosphate dehydrogenase that lead to
haemolytic anemia when they are
given the anti-malarial drug primaquine.
Collect Drug Response Data

These drug response phenotypes are associated
with a set of specific gene alleles.

Identify populations of people who show specific
responses to a drug.

In early clinical trials, it is possible to identify
people who react well and react poorly.
Make Genetic Profiles




Scan these populations with a large number of SNP
markers.
Find markers linked to drug response phenotypes.
It is interesting, but not necessary, to identify the
exact genes involved.
Can work with “associated populations”; does not
require detailed information on disease in the family
history (pedigree).
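Scanning markers for association with a drug response can be sketched as an allelic chi-square test on a 2x2 table per SNP (responders versus non-responders, allele 1 versus allele 2). The marker names and allele counts are invented, and this is a simplified stand-in for what dedicated tools (such as the R practical) provide.

```python
import math


def allelic_chi_square(a, b, c, d):
    """Chi-square test (1 df) on a 2x2 table of allele counts.

    Rows: responders (a, b) and non-responders (c, d);
    columns: allele 1 and allele 2. Returns (chi2, p_value).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # table is degenerate, no test possible
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts per marker: (resp. A1, resp. A2, non-resp. A1, non-resp. A2)
markers = {
    "marker1": (60, 40, 40, 60),
    "marker2": (50, 50, 50, 50),
}
results = {name: allelic_chi_square(*counts) for name, counts in markers.items()}
for name, (chi2, p) in sorted(results.items(), key=lambda item: item[1][1]):
    print(name, round(chi2, 2), f"{p:.4f}")
```

With many markers the p-values would still need a multiple-testing correction before any marker is called associated.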
Huge Database Problem


Physicians collect tons of data
• patient age, sex, weight, blood pressure, family
disease history, date of symptom onset
• Cancer data: tumor size, location, stage, etc.
• Data specific to each type of disease
Now integrate thousands (or 100K’s) of SNPs that are
correlated with some of these clinical factors in
complex relationships
Use the Profiles



Genetic profiles of new patients can then be
used to prescribe drugs more effectively &
avoid adverse reactions.
Can also speed clinical trials by testing on
those who are likely to respond well.
Can "rescue" drugs that don't work well on
everybody, or that have bad side effects on
a few.
Real World Applications




Most of the major pharmaceutical
companies are currently collecting
pharmacogenomic data in their clinical
trials.
Data is yet to be published.
Genetic indications for drug use are
becoming available.
Plan to sell the drug with the gene test
Multi-locus SNP Profiles



There will be a few hundred to a few
thousand SNPs linked to medically
important alleles in the next ~10 years.
Haplotypes will reduce the number that
need to be screened (one SNP gives
information about a group of linked
genes)
Some genes will turn out to be involved
in many important pathways
Will People Want This
Information??


Genetic determinism and possible
discrimination.
Even a simple test to see what drug you
should take could reveal information
about your risk of cancer or heart
disease.
A tour in FBAT testing
A tour in Python
Homework Assignment 4
(R)
check website for exercise and
supplementary info: due 28 Oct
Homework Assignment 6
(FBAT)
check website: due 4 Nov