Genotyping of James Watson from Low

Download Report

Transcript Genotyping of James Watson from Low

Genotyping of James Watson’s genome
from Low-coverage Sequencing Data
Sanjiv Dinakar
and
Yözen Hernández
Genetic Variation

Variations in the genome are called Single Nucleotide
Polymorphisms (SNPs).

Most of the genetic material between individuals is the same.

SNPs make up < 1% of the human genome.
Genotypes




The possible values for a
SNP are called alleles.
We consider only “bi-allelic”
SNPs (SNPs with exactly two
alleles in the human
population).
Each person has two alleles
for each SNP.
A “genotype” is said to be
“homozygous” if the same
allele is present in both
copies and “heterozygous” if
both alleles are present.
A Heterozygous genotype
Shotgun Sequencing
“Shotgun sequencing” is the most popular method for sequencing genomes.

In shotgun sequencing, the DNA molecule is broken into many small, random
fragments and the DNA sequence of each fragment is determined. These
fragments are called “reads”.

The reference genome is then searched for each of these reads, to find the
position in the genome, which the read most likely came from.

Since the position of SNPs within the reference genome is known, the alleles
can be extracted from the reads.

James Watson's Genome





~74.4 million reads from James Watson's genome are freely
available on the internet (generated by Wheeler et al.)
These reads were generated using technology from 454 Life
Sciences.
Wheeler used “DNA microarrays” from AffyMetrix, to genotype
James Watson, but not all known SNPs were found.
Microarrays are expensive and can quickly become obsolete
due to the reference genome changing.
We used different methods to determine the genotypes, based
on the reads, and compared our results to those of Wheeler.
Our method




We need...

to have an already sequenced genome (reference genome) to compare
this individual against.

to have a number of fragments derived from the individual's genome

to be able to locate where the fragments belong in the genome.

to know the location and alleles of the SNPs in the genome.

to determine whether both copies of the SNP are covered
The HapMap project stores a database of the alleles and
position for most known SNPs.
The Human Genome Project, has created a reference genome,
based on multiple volunteers.
If we're sure we've got both copies covered, then determining
the genotype is easy.
Problems


The same copy may be sequenced much more often than the
other or reads may be mapped to the wrong position in the
genome.
Sequencing and mapping accuracy affect genotype
determination. Using older methods, the genotype cannot be
determined with a high confidence if the coverage is low.


It is estimated about 13x coverage is needed to accurate
genotyping.
Can we determine genotypes with high accuracy from lowcoverage data? How?
Read Mapping


In order to map the reads to the reference
genome, Nucmer was used.
Nucmer uses the following procedure to map
a read to the genome:
1. Find MUMs between the read and the reference
genome, using suffix trees.
2. Cluster matches into closely grouped sets,
dropping inconsistent matches.
3. Align the region between MUMs in each cluster.
Read mapping




After using Nucmer to map all the reads, we filtered Nucmer's
output.
We removed:

all reads which were mapped to more than one position in the genome.

all reads which had more than 10 errors (substitutions, insertions,
deletions) and less than 90% of the read covered by a cluster.
Using simulated reads (reads generated randomly from the
reference genome, using ReadSim), this method achieved a
0.37% false positive rate.
Our accuracy was further improved by removing reads which
gave an invalid allele.
SNP Genotyping



Once we have mapped the reads, we know what SNPs each read contains,
but how do we find a SNP's genotype?
Binomial distribution: there are only two alleles per SNP, so the allele
combinations for a heterozygous SNP follow a binomial distribution.
Using a binomial distribution, we can infer possible genotypes.
20
18
16
14
12
10
8
6
4
2
0
1
2
3
4
5
6
7
8
9
10 11 12
13 14 15 16 17 18 19 20 21
Locating and identifying SNPs





There is a tremendous number of SNPs to search through and this search
needs to be done for each read.
A binary search algorithm is used to find the first SNP within the read. We
want to find the SNP contained in the read with the smallest position.
Consecutive SNPs which follow the first SNP are included if they are
contained in the read.
The base found at the SNP position is extracted, and counted if it is a valid
allele. Otherwise, we assume there was an error in the read and the entire
read is thrown out (because it indicates a possibly mismapped read).
Every time an allele is found, we update the heterozygous and
corresponding heterozygous genotype probabilities, and move on to the next
SNP.
Short Binary search example
Read start:15928924
Read end: 15929185
List of SNPs
14431347
...
...
16010486
15700361
Look in the upper half...
15700375
16010486
15928971
This is it!
Look at consecutive SNPs to see if any more are in the read...
“Calling” Genotypes


There may be many reads which contain the same SNP. We count the
number of times an allele appears, and use known frequencies to calculate
the genotype probability.
After all the reads have been processed, we can calculate the posterior
probabilities using Bayes' Theorem:
P(G|R)=P(G)*P(R|G)/P(R)
Where P( G ) is given by Hardy-Weinberg proportions for known
population allele frequencies f0 and f1 (i.e., P( G = {0,0} ) = f02
and P( G = {0,1} ) = 2 * f0 *f1).
P(R|G={0,1})=(1/2)^|R|
P(R|G={0,0})=P(error)^|R(1)|*(1-P(error))^|R(0)|
P(R|G={1,1})=P(error)^|R(0)|*(1-P(error))^|R(1)|
P(R)= P(R|G={0,1})*P( G={0,1} )+ P(R|G={0,0})*P( G={0,0} )+ P(R|G={1,1})*P( G={1,1})
Example

A SNP is overlapped with six reads. Four of the reads have the 0 allele, two of the reads have
the 1 allele.

In the population, the 0 allele occurs 75% of the time, the 1 allele occurs 25% of the time.

Assume errors occur uniformly with a rate of 1%
|R|=6, |R(1)|=2, |R(0)|=4
P( G = {0,1} )=2*0.75*0.25=0.375
P( G = {0,0} )=0.75*0.75=0.5625
P( G = {1,1} )=0.25*0.25=0.0625
P(R|G = {0,1} )=0.5^6=0.015625
P(R|G = {0,0} )=0.01^2*0.99^4=0.00009605
P(R|G = {1,1} )=0.01^4*0.99^2=0.000000009801
P(R)=0.015625*0.375+0.000096059601*0.5625+0.000000009801*0.0625=0.005913
P(G = {0,1}|R)=0.375*0.015625/0.005913=0.9909
P(G = {0,0}|R)=0.5625*0.00009605/0.005913=0.009137
P(G = {1,1}|R)=0.0625*0.000000009801/0.005913=.0000001036
Results



We compared our results with those of the team which
sequenced Watson's genome.
Our binomial distribution calculations gave better results, calling
more SNPs and giving better accuracy than Wheeler et al. did.
There are probabilistic models which are far superior to the
simple binomial distribution, and which are more realistic
biologically. We are currently working on implementing Hidden
Markov Models to solve this problem.
Questions??