Single Nucleotide Polymorphism Anshu Bhardwaj Research Fellow

Single Nucleotide
Polymorphism
Anshu Bhardwaj
Research Fellow
Centre for Cellular & Molecular Biology
Hyderabad
8th November, 2003
Single Nucleotide Polymorphism
Single base-pair differences occurring in a
population with a frequency of >1%
...C C A T T G A C...
...G G T A A C T G...
...C C G T T G A C...
...G G C A A C T G...
(A/G substitution at the third base of the upper strand, shown with the complementary strand)
SNPs can be found in:
• NON-CODING REGION:
  * 5’ and 3’ UTRs
  * Introns
  * Splice sites
• CODING REGION:
  * Non-synonymous: amino acid substitution
  * Synonymous: silent
Single base-pair differences
occurring in a population with a
frequency of >1%
MUTATION
POLYMORPHISM
GENOTYPIC FREQUENCY
Relative distribution of genotypes in a population for a particular locus
ALLELIC FREQUENCY
The relative abundance of an allele of a particular gene with reference to its other alleles

Location    MM (%)   MN (%)   NN (%)   p       q
Greenland   83.5     15.6     0.9      0.913   0.087

Let p = f(M) and q = f(N). Thus, p = f(MM) + ½ f(MN) and q = f(NN) + ½ f(MN).
For Greenland: p = 0.835 + ½ (0.156) = 0.913 and q = 0.009 + ½ (0.156) = 0.087.
GENOTYPIC FREQUENCY: Relative distribution of genotypes in a population for a particular locus

Genotype   # of Individuals   Genotypic frequency
MM         5118               5118/6129 = 83.5%
MN         956                956/6129 = 15.6%
NN         55                 55/6129 = 0.9%
Total      6129
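The genotype counts above are enough to reproduce both kinds of frequency. The following is a minimal Python sketch, not part of the original slides; the function names are illustrative. It counts alleles directly (each MM individual carries two M alleles, each MN individual one of each), which is equivalent to p = f(MM) + ½ f(MN).

```python
def genotype_frequencies(counts):
    """Fraction of individuals carrying each genotype."""
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}


def allele_frequencies(n_mm, n_mn, n_nn):
    """Return (p, q) = (f(M), f(N)) by counting alleles in diploid individuals."""
    total_alleles = 2 * (n_mm + n_mn + n_nn)
    p = (2 * n_mm + n_mn) / total_alleles
    q = (2 * n_nn + n_mn) / total_alleles
    return p, q


# Counts taken from the table on the slide
counts = {"MM": 5118, "MN": 956, "NN": 55}
freqs = genotype_frequencies(counts)
p, q = allele_frequencies(counts["MM"], counts["MN"], counts["NN"])

for genotype, f in freqs.items():
    print(f"{genotype}: {f:.1%}")
print(f"p = {p:.3f}, q = {q:.3f}")
```

Because every allele is either M or N, p + q always equals 1, which is a quick sanity check on any table of this kind.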
WHY SNPs?
• SNPs are distributed non-randomly throughout the genome
• On average, a significant SNP is found about once per 1 kb of the human genome, resulting in approximately 3 million SNPs
• Large number
• Unambiguous assay techniques
• High levels of polymorphism in the population
• Most phenotypic differences arise from SNPs in genes, but these form only a small fraction of the total number
dbSNP DENSITY DISTRIBUTION IN HUMAN
• Mean Density :
0.001765 SNPs per base (17.652 SNPs per 10 kb)
• Mean Spacing :
566.5118 bases per SNP
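Mean density and mean spacing are reciprocals of each other, and the "1 SNP per kb, about 3 million SNPs" estimate follows from the genome size. A small Python check of that arithmetic is sketched below; the genome size of 3 × 10^9 bases is an assumed round figure, not a value given on the slides.

```python
MEAN_SPACING_BP = 566.5118        # bases per SNP, from the slide

# Density is the reciprocal of spacing
density_per_base = 1 / MEAN_SPACING_BP
density_per_10kb = density_per_base * 10_000

GENOME_SIZE_BP = 3_000_000_000    # assumption: roughly 3 Gb haploid human genome
snps_at_1_per_kb = GENOME_SIZE_BP // 1_000

print(f"{density_per_base:.6f} SNPs per base")
print(f"{density_per_10kb:.3f} SNPs per 10 kb")
print(f"{snps_at_1_per_kb:,} SNPs at one SNP per kb")
```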
SNP Discovery
• SNP discovery refers to the initial identification of new SNPs
• The established method is electrophoresis-based DNA sequencing with subsequent data analysis. Some indirect discovery techniques (e.g., dHPLC, SSCP) only indicate that a SNP (or other mutation) exists
• DNA sequencing of multiple individuals is used to determine the position and type of polymorphism
SNP Validation
• SNP validation refers to genetic validation, the process of ensuring that the SNP is not due to sequencing error
• Confirmation of SNPs found in discovery
• Larger numbers of individual samples are used to get statistical data on occurrence in the population
THE EXPERIMENTAL APPROACH
• RESTRICTION FRAGMENT LENGTH POLYMORPHISM (RFLP)
• SINGLE-STRAND CONFORMATION POLYMORPHISM (SSCP)
• DENATURING HIGH-PERFORMANCE LIQUID CHROMATOGRAPHY (dHPLC)
• HYBRIDIZATION METHOD
• MALDI-TOF METHOD
• SEQUENCING FOLLOWED BY ALIGNMENT
IN SILICO SNP PREDICTION
POLYBAYES
SEAN SNP Prediction Program
SNP Finder
Restriction Fragment Length Polymorphisms
Botstein et al (1980)
CHANGES IN MIGRATION PATTERNS THAT REPRESENT ALLELIC VARIATION
[Figure: RFLP schematic. Panel A: Homolog 1 (3 Kb fragment), Homolog 2 (1 Kb and 2 Kb fragments), and the probe. Panel B: Homolog 1 & 2, 3 Kb. Panel C: Homolog 1 & 2, 2 Kb and 1 Kb.]
CAN BE USED TO DETECT SNPs DIFFERENTIALLY IN HOMOZYGOUS &
HETEROZYGOUS INDIVIDUALS
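The idea behind RFLP detection can be sketched in a few lines of Python: a SNP that destroys a restriction site changes the fragment lengths, so the two homozygotes and the heterozygote give different band patterns. Everything specific here is a hypothetical illustration, not taken from the slides: the enzyme (EcoRI, which cuts G^AATTC), the 3 kb sequences, and the simplification that the probe detects every fragment.

```python
ECORI_SITE = "GAATTC"   # EcoRI recognition sequence (illustrative enzyme choice)
CUT_OFFSET = 1          # EcoRI cuts after the first G: G^AATTC


def digest(seq, site=ECORI_SITE, offset=CUT_OFFSET):
    """Return fragment lengths after cutting seq at every occurrence of site."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + offset)
        pos = seq.find(site, pos + 1)
    edges = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(edges, edges[1:])]


def bands(*homologs):
    """Distinct fragment lengths seen on the gel for one individual."""
    return sorted({n for h in homologs for n in digest(h)}, reverse=True)


# Hypothetical 3 kb region; the homologs differ by a single base (G vs A)
left, right = "C" * 999, "C" * 1995
homolog_1 = left + "GAGTTC" + right   # SNP destroys the site: one 3 kb fragment
homolog_2 = left + "GAATTC" + right   # intact site: 1 kb + 2 kb fragments

print("homozygous 1/1:", bands(homolog_1, homolog_1))
print("homozygous 2/2:", bands(homolog_2, homolog_2))
print("heterozygous 1/2:", bands(homolog_1, homolog_2))
```

The heterozygote shows all three bands, which is why the method distinguishes homozygous from heterozygous individuals.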
MALDI-TOF METHOD
Matrix-assisted laser desorption ionization-time of flight
[Figure: instrument schematic showing the laser source, drift region, and detector]
SEQUENCING METHOD: POLYBAYES
BAYESIAN INFERENCE ENGINE TO CALCULATE THE PROBABILITY THAT A GIVEN SITE IS POLYMORPHIC
• FRAGMENT CLUSTERING
• PARALOGUE IDENTIFICATION
• MULTIPLE ALIGNMENT
SNP DETECTION IN REDUNDANT SEQUENCE DATA
• SEQUENCE CLUSTERING
• CLUSTER REFINEMENT
• MULTIPLE ALIGNMENT
• SNP DETECTION
The PolyBayes Approach
• Use genomic sequence as reference
– cluster and align all available sequences
– remove repeats/paralogs
• Use Bayesian statistics to
– distinguish polymorphic sites from artifacts
– estimate likelihood
Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier L, Kwok P-Y, Gish WR. A general approach to single-nucleotide polymorphism discovery. Nature Genet. 1999; 23:452-456.
1. Known repeat sequences are masked using RepeatMasker
2. FRAGMENT CLUSTERING
   (a) WU-BLAST is used to search against dbEST
   (b) Sequence traces are processed with PHRED to obtain base calls and base-quality values
   (c) Each distinct group of matching ESTs is registered as a cluster
3. Each cluster member is pairwise aligned to the genomic anchor sequence with CROSS_MATCH
PARALOGUE IDENTIFICATION
1. Paralogous sequences may give rise to false SNP predictions and cause difficulties during marker development
2. Calculate the probability PNAT that a cluster member is derived from the genomic region
3. Distinguish between
   – less accurate sequences that nevertheless originate from the same underlying genomic location
   – more accurate sequences with high-quality discrepancies that are likely to be paralogous
4. Cluster members with PNAT below a threshold value PNAT,MIN are treated as paralogous and removed
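\[ D_{\mathrm{NAT}} = L \cdot P_{\mathrm{POLY},2} + E, \qquad P_{\mathrm{POLY},2} = 0.001 \]
\[ D_{\mathrm{PAR}} = L \cdot P_{\mathrm{PAR}} + E, \qquad P_{\mathrm{PAR}} = 0.02 \]
d = number of observed discrepancies
\[ P(\mathrm{MODEL}_{\mathrm{NAT}} \mid d) = \frac{1}{1 + e^{(D_{\mathrm{NAT}} - D_{\mathrm{PAR}})} \left( D_{\mathrm{PAR}} / D_{\mathrm{NAT}} \right)^{d}} \]
\[ P_{\mathrm{NAT,MIN}} = 0.75 \]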
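The paralogue filter can be tried out numerically. The sketch below is a minimal Python illustration, assuming the posterior has the form 1 / (1 + e^(DNAT − DPAR) · (DPAR/DNAT)^d), which is what a Poisson model for the discrepancy count with equal prior weight on the two models gives. Treating L as the alignment length and E as the expected number of sequencing errors is an interpretation, and the example values L = 500 and E = 2 are hypothetical.

```python
import math

P_POLY_2 = 0.001   # pairwise polymorphism rate (value from the slide)
P_PAR = 0.02       # pairwise paralogous difference rate (value from the slide)
P_NAT_MIN = 0.75   # threshold from the slide


def p_native(d, length, expected_errors):
    """Posterior probability that a cluster member is native, given d discrepancies."""
    d_nat = length * P_POLY_2 + expected_errors
    d_par = length * P_PAR + expected_errors
    # log of e^(D_NAT - D_PAR) * (D_PAR / D_NAT)^d, computed in log space
    log_ratio = (d_nat - d_par) + d * math.log(d_par / d_nat)
    if log_ratio > 700:          # avoid overflow in math.exp for very large d
        return 0.0
    return 1.0 / (1.0 + math.exp(log_ratio))


# Hypothetical alignment: 500 bases, 2 expected sequencing errors
for d in (0, 2, 5, 6, 12):
    p = p_native(d, length=500, expected_errors=2.0)
    verdict = "keep" if p >= P_NAT_MIN else "remove as paralogous"
    print(f"d = {d:2d}  P(native) = {p:.4f}  -> {verdict}")
```

With these assumed inputs, a member with a handful of discrepancies is kept, while one with many more discrepancies than sequencing error alone would explain falls below the 0.75 threshold.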
MULTIPLE ALIGNMENT
The Bayesian calculation at each site of the alignment takes into account:
1. Depth of coverage
2. The base-quality values of the sequences
3. The a priori expected rate of polymorphic sites in the region
• PSNP: PROBABILITY THAT THE SITE IS POLYMORPHIC
• DISTRIBUTION OF PROBABILITY SCORES EXHIBITS A HIGH LEVEL OF SPECIFICITY
• THRESHOLD VALUE: PSNP = 0.4
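To make the three inputs concrete, here is a deliberately simplified Bayesian calculation in Python. It is a toy model for illustration only, not the PolyBayes algorithm: it compares "monomorphic site plus sequencing errors" against "biallelic site with each read drawn from either allele with probability ½". The prior of 0.003 and all function names are assumptions; base qualities are converted with the standard PHRED relation e = 10^(−Q/10).

```python
from itertools import combinations

BASES = "ACGT"
P_SNP_THRESHOLD = 0.4   # threshold quoted on the slide


def error_prob(phred_q):
    """Convert a PHRED quality value to an error probability."""
    return 10 ** (-phred_q / 10)


def base_likelihood(observed, true_base, err):
    """P(observed base | true base), errors spread evenly over the other 3 bases."""
    return 1 - err if observed == true_base else err / 3


def p_snp(column, prior_poly=0.003):
    """Posterior probability that one alignment column is polymorphic.

    column: list of (base, phred_quality) pairs, one per aligned read.
    prior_poly: assumed a priori rate of polymorphic sites (hypothetical value).
    """
    mono = 0.0
    for true_base in BASES:                      # 4 monomorphic hypotheses
        like = 1.0
        for base, q in column:
            like *= base_likelihood(base, true_base, error_prob(q))
        mono += (1 - prior_poly) / 4 * like
    poly = 0.0
    for x, y in combinations(BASES, 2):          # 6 biallelic hypotheses
        like = 1.0
        for base, q in column:
            e = error_prob(q)
            like *= 0.5 * base_likelihood(base, x, e) + 0.5 * base_likelihood(base, y, e)
        poly += prior_poly / 6 * like
    return poly / (mono + poly)


all_a = [("A", 40)] * 8
split = [("A", 40)] * 4 + [("G", 40)] * 4
one_low_quality_g = [("A", 40)] * 7 + [("G", 10)]

for name, col in [("all A", all_a), ("4 A + 4 G", split), ("7 A + 1 low-quality G", one_low_quality_g)]:
    p = p_snp(col)
    print(f"{name}: P_SNP = {p:.6f}, called SNP: {p >= P_SNP_THRESHOLD}")
```

Even in this toy version, a single low-quality mismatch is explained as a sequencing error, while several high-quality reads supporting a second base push the probability close to 1.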
THE POLYBAYES SOFTWARE
OTHER SNP PREDICTION & SNP FINDING SOFTWARE
• SEAN: Search for localized SNPs and predict SNPs (http://zebrafish.doc.ic.ac.uk/Sean/)
• SNP Finder: For analyzing user-submitted trace data (http://gai.nci.nih.gov/)
SIGNIFICANCE OF SNPs
• IN DISEASE DIAGNOSIS
• IN FINDING PREDISPOSITION TO DISEASES
• IN DRUG DISCOVERY & DEVELOPMENT
• IN DRUG RESPONSES
• IN INVESTIGATION OF MIGRATION PATTERNS
ALL THESE ASPECTS WILL HELP TAILOR MEDICATION & DIAGNOSIS TO THE INDIVIDUAL
SNP Screening
Two different screening strategies
- Many SNPs in a few individuals
- A few SNPs in many individuals
• Different strategies will require different tools
• Important in determining markers for complex genetic states
SNP genotyping methods for detecting genes contributing to susceptibility or resistance to multifactorial diseases and adverse drug reactions:
=> case-control association analysis

case:     ...GCCGTTGAC...
          ...GCCATTGAC...
control:  ...GCCATTGAC...
          ...GCCATTGAC...

Compared between cases and controls:
allele frequency: A %, G %
genotype frequency: AA %, AG %, GG %
haplotype frequency: SNP1, SNP2, SNP3
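The simplest form of a case-control comparison is an allelic test on a 2 × 2 table of allele counts. The Python sketch below uses the standard Pearson chi-square formula for a 2 × 2 table; the allele counts are invented for illustration and do not come from the slides.

```python
def allelic_chi_square(case_a, case_g, control_a, control_g):
    """Pearson chi-square (1 degree of freedom) for a 2x2 table of allele counts."""
    n = case_a + case_g + control_a + control_g
    numerator = n * (case_a * control_g - case_g * control_a) ** 2
    denominator = ((case_a + case_g) * (control_a + control_g)
                   * (case_a + control_a) * (case_g + control_g))
    return numerator / denominator


def odds_ratio_g(case_a, case_g, control_a, control_g):
    """Odds of carrying allele G in cases relative to controls."""
    return (case_g * control_a) / (case_a * control_g)


# Hypothetical allele counts (2 alleles per individual, 50 cases and 50 controls)
case_a, case_g = 60, 40
control_a, control_g = 80, 20

chi2 = allelic_chi_square(case_a, case_g, control_a, control_g)
odds = odds_ratio_g(case_a, case_g, control_a, control_g)
CRITICAL_5_PERCENT = 3.841   # chi-square critical value, 1 df, alpha = 0.05

print(f"G frequency: cases {case_g / (case_a + case_g):.0%}, "
      f"controls {control_g / (control_a + control_g):.0%}")
print(f"chi-square = {chi2:.3f}, odds ratio for G = {odds:.2f}")
print("significant at 5%:", chi2 > CRITICAL_5_PERCENT)
```

The same table layout extends to genotype counts (a 2 × 3 table) or haplotype counts, at the cost of more degrees of freedom.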
HAPLOTYPE
A set of closely linked genetic markers present
on one chromosome which tend to be inherited
together (not easily separable by recombination)
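Individual   DNA sequence       SNP haplotype   Phenotype
1            GATATTCGTACGGA-T   AG-             BLACK EYE
2            GATGTTCGTACTGAAT   GTA             BROWN EYE
3            GATATTCGTACGGA-T   AG-             BLACK EYE
4            GATATTCGTACGGAAT   AGA             BLUE EYE
5            GATGTTCGTACTGAAT   GTA             BROWN EYE
6            GATGTTCGTACTGAAT   GTA             BROWN EYE

Variable sites: positions 4 (A/G) and 12 (G/T) are SNPs; position 15 is a single-base insertion/deletion (A/-).

Haplotypes
AG-  2/6 (BLACK EYE)
GTA  3/6 (BROWN EYE)
AGA  1/6 (BLUE EYE)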
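The haplotypes in this example can be recovered mechanically: find the alignment columns where the six sequences differ, then read off the bases at those columns for each individual. A minimal Python sketch follows; the function names are illustrative, and it assumes the sequences are already aligned and phased, as they are on the slide.

```python
from collections import Counter

# The six aligned sequences and phenotypes from the slide, in order
sequences = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGA-T",
    "GATATTCGTACGGAAT",
    "GATGTTCGTACTGAAT",
    "GATGTTCGTACTGAAT",
]
phenotypes = ["BLACK EYE", "BROWN EYE", "BLACK EYE", "BLUE EYE", "BROWN EYE", "BROWN EYE"]


def variable_columns(seqs):
    """0-based indices of alignment columns where not all sequences agree."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def haplotypes(seqs):
    """Concatenate the bases at the variable columns for each sequence."""
    cols = variable_columns(seqs)
    return ["".join(s[i] for i in cols) for s in seqs]


cols = variable_columns(sequences)
haps = haplotypes(sequences)
hap_counts = Counter(haps)

print("variable positions (1-based):", [i + 1 for i in cols])
for hap, n in hap_counts.most_common():
    seen = sorted({p for h, p in zip(haps, phenotypes) if h == hap})
    print(f"{hap}: {n}/{len(sequences)} ({', '.join(seen)})")
```

In this small data set each haplotype maps to exactly one phenotype, which is the pattern the slide is illustrating.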
HAPLOTYPE CORRELATION WITH PHENOTYPE
• The “haplotype-centric” approach combines the information of adjacent SNPs into composite multilocus haplotypes.
• Haplotypes are not only more informative but also capture the regional linkage disequilibrium (LD) information, which is assumed to make them robust and powerful.
• Associating haplotype frequencies with the frequencies of the phenotypes of interest in the population will help in utilizing the maximum potential of SNPs as markers.
ADVANTAGES:
1. SNPs ARE THE MOST FREQUENT FORM OF DNA VARIATION
2. THEY ARE THE DISEASE-CAUSING MUTATIONS IN MANY GENES
3. THEY ARE ABUNDANT & HAVE LOW MUTATION RATES
4. EASY TO SCORE
5. MAY WORK AS THE NEXT GENERATION OF GENETIC MARKERS
LIMITATIONS:
1. EXPERIMENTAL DETECTION OF SNPs REQUIRES IMPLEMENTATION OF EXPENSIVE TECHNOLOGIES
2. NEED FOR LARGE POPULATION DATASETS FOR ASSOCIATION STUDIES
Some important SNP database resources
1. dbSNP (http://www.ncbi.nlm.nih.gov/SNP/)
   LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/list.cgi)
2. TSC (http://snp.cshl.org/)
3. SNPper (http://snpper.chip.org/bio/)
4. JSNP (http://snp.ims.u-tokyo.ac.jp/search.html)
5. GeneSNPs (http://www.genome.utah.edu/genesnps/)
6. HGVbase (http://hgvbase.cgb.ki.se/)
7. PolyPhen (http://dove.embl-heidelberg.de/PolyPhen/)
   OMIM (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)
8. Human SNP database (http://www-genome.wi.mit.edu/snp/human/)
Feb. 25. 2003 SI Hung