Glanville fritillary butterfly genomics and genetics


Rainer Lehtonen
PhD, Genomics and genetics project leader
Metapopulation Research Group
Department of Biological and Environmental Sciences, University of Helsinki





Background
Genome project
Genome assembly >> Panu Somervuo
Some NGS applications
Conclusions


The Glanville fritillary is an internationally recognized metapopulation model system in ecological and evolutionary studies
Studied since 1991 in the Åland Islands in Finland
Data available from different populations:
- Fragmented landscape vs. continuous
- Isolated vs. metapopulation
- Large vs. small
- Same vs. different population history

Field studies, indoor & outdoor cage + laboratory experiments, controlled crosses, molecular studies
DNA (+RNA) SAMPLES: INSTITUTE OF BIOTECHNOLOGY
SEQUENCE DATA PRODUCTION: INSTITUTE OF BIOTECH, KAROLINSKA INSTITUTE
QC + ASSEMBLY: INSTITUTE OF BIOTECH, DEP COMPUTER SCI
ASSEMBLY VALIDATION (ref g): INSTITUTE OF BIOTECH, DEP COMPUTER SCI
ANNOTATION + PUBLICATION: EBI, ENSEMBL GENOMES
GENOME ANALYSIS: EBI, OTHER GENOME PROJECTS
VARIATION IN THE GENOME: INSTITUTE OF BIOTECH, DEP COMPUTER SCI
GENETIC TOOLS: FIMM, BIOMEDICUM HKI, INSTITUTE OF BIOTECH, ILLUMINA INC.
NEXT-GEN SEQUENCING: 454, SOLiD3, SOLEXA
REF DNA + RNA SAMPLES
EST ASSEMBLY
ESTs
GENOME ASSEMBLY
NEXT-GEN RE-SEQUENCING: SOLiD4/SOLEXA
CROSSES/POP POOLS/INDS
MAPPING TO REF GENOME
VARIATION
REF GENOME
GENETIC MAP (MARKER LOCATIONS)
GENETIC VARIATION
GENE EXPRESSION
GENOME ANNOTATION
DATA FROM OTHER SOURCES
PLATFORM FOR LARGE SCALE TARGETED GENOTYPING
GENOTYPING OF LARGE POPULATION SAMPLES (>50K)
Sample                                Aim                                             Platform          Read type                     Read length  Runs to be done
------------------------------------------------------------------------------------------------------------------------------------------------------------------
RNA, pool used in RNAseq              Gene start sites, gene 5’ variation             SOLiD4            Pair-end                      50+25        1/4
Amp DNA, 4 crosses                    Construction of genetic map                     SOLiD4            Single read, RAD tag library  50+25        3
Amp DNA, pool ~30 ind                 SNPs & other genetic variation                  SOLiD4            Pair-end                      50+25        1
RNA, pooled pop samples from 5+1 pop  Variation in 5+2 pop, SNPs in ESTs, expression  SOLiD4            Pair-end                      50+25        1(-2)
DNA from selected individuals         Pgi & flanking genes + Sdhd, Hsp70              SureSelect + 454  Single read                   400          1/4

Sanger seq
25.-26.3.2010, Heliconius Genome Meeting
RAD-tag (Restriction-site Associated DNA) sequencing, also known as “deep sequencing of a reduced representation library”
Example: Construction of a high-density genetic map:
* 4 controlled Spain-Finland crosses
* Parents and 50 individuals from each family to be sequenced
A genetic or linkage map defines the order of and distance between markers, based on the recombination frequency in meiosis (1 cM = 1% recombination rate)
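The map-distance definition above can be sketched in a few lines of Python. This is only an illustration: the offspring counts are hypothetical, and Haldane's mapping function is added as a commonly used correction for double crossovers that the talk itself does not mention.

```python
import math


def recombination_frequency(recombinants, total):
    """Fraction of scored offspring that are recombinant between two markers."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need 0 <= recombinants <= total and total > 0")
    return recombinants / total


def map_distance_cm(r):
    """Linear rule from the slide: 1 cM = 1% recombination."""
    return 100.0 * r


def haldane_cm(r):
    """Haldane's mapping function (an assumption here, not from the talk)."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical family: 5 recombinants among 50 sequenced offspring.
r = recombination_frequency(5, 50)
print(f"r = {r:.2f}")
print(f"linear distance  = {map_distance_cm(r):.1f} cM")
print(f"Haldane distance = {haldane_cm(r):.2f} cM")
```

With 50 offspring per family, a single recombinant already corresponds to 2 cM, so the resolution between closely linked markers is limited by family size.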
SureSelect (Agilent) Target Enrichment + deep sequencing with 454
Example: Population comparison of Pgi + flanking genes (+ some others) in a sample of 24 individuals or pools
Nathan A et al., PLoS ONE 2008
Now: 500M reads, 50 bp each
150-200 bp pair-end library: 50 bp seq (SNP1) + 25 bp seq (SNP2)
Average fragment size
Enzyme  454 Glanville gContigs  Heliconius
NcoI    13.3                    14
XhoI    11.5                    4
EcoRI   4.5                     2
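The Glanville fritillary averages appear to follow from dividing the assembly size by the number of restriction sites. The sketch below checks this using the 320 Mbp 454-Newbler assembly and the site counts from the in silico restriction analysis in this talk. Reading the result in kb is an inference, since the slide gives no unit.

```python
# Assembly size and site counts are taken from the in silico analysis slide.
ASSEMBLY_BP = 320_000_000
SITES = {"NcoI": 24_064, "XhoI": 27_788, "EcoRI": 70_474}


def average_fragment_kb(assembly_bp, n_sites):
    """Mean spacing between restriction sites, in kb (assumed unit)."""
    return assembly_bp / n_sites / 1000.0


for enzyme, n_sites in SITES.items():
    print(f"{enzyme}: {average_fragment_kb(ASSEMBLY_BP, n_sites):.1f} kb")
```

Rounded to one decimal, the three values agree with the 13.3, 11.5 and 4.5 shown for the Glanville contigs.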
Mappable reads
• Restriction site > 250 bp from the end of a gContig
• Targets = 2x sites
• 454-Newbler assembly: 320 Mbp (out of a ~550 Mbp genome) in 220K contigs (>500 bp)
• Expected number of SNPs: 1/300 bp, read length 50+25 bp

Enzyme  Site    #sites  #mappable  #targets  #exp SNPs
------------------------------------------------------
NcoI*   ccatgg  24,064     38,880    48,128     12,032
XhoI    ctcgag  27,788     45,925    55,576     13,894
EcoRI   gaattc  70,474    117,293   140,948     35,237
BspHI*  tcatga  66,967    110,731   133,934     33,483
NdeI    catatg  73,629    121,628   147,258     36,814
*The most probable combination > ~45,000 SNPs

• Reads have to be unique
• 10-20x coverage/individual (>~5000x on average)
• Heavy data filtering needed > probably only 30-50% of data is usable
In silico restriction analysis made by Panu Somervuo, MRG
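The counting behind such a table can be sketched as follows. This is not the analysis that was actually run; it is a minimal illustration under stated assumptions: each site yields two targets (one read on each side), a target counts as mappable when more than 250 bp of contig lie on that side of the site, and the expected SNP count is targets x 75 bp / 300 bp. The toy contigs and all function names are hypothetical.

```python
import re


def find_sites(seq, site):
    """0-based start positions of `site` in `seq` (case-insensitive).

    All enzymes in the table have palindromic sites, so scanning one
    strand is enough.
    """
    pattern = re.compile(f"(?={re.escape(site.lower())})")
    return [m.start() for m in pattern.finditer(seq.lower())]


def count_targets(contigs, site, min_flank=250):
    """Return (number of sites, number of mappable targets)."""
    n_sites = 0
    n_mappable = 0
    for seq in contigs:
        for pos in find_sites(seq, site):
            n_sites += 1
            left_flank = pos
            right_flank = len(seq) - (pos + len(site))
            # One potential read on each side of the cut site.
            n_mappable += int(left_flank > min_flank)
            n_mappable += int(right_flank > min_flank)
    return n_sites, n_mappable


def expected_snps(n_sites, read_bp=75, snp_spacing_bp=300):
    """Targets (2x sites) times read length, at one SNP per 300 bp."""
    return 2 * n_sites * read_bp // snp_spacing_bp


# Toy contigs (hypothetical): each has one NcoI site and one usable side.
toy = ["a" * 300 + "ccatgg" + "a" * 100, "ccatgg" + "t" * 400]
print(count_targets(toy, "ccatgg"))

# Site counts from the slide reproduce its expected SNP numbers.
for enzyme, n in [("NcoI", 24_064), ("BspHI", 66_967)]:
    print(enzyme, expected_snps(n))
```

Summing the NcoI and BspHI estimates gives 45,515, in line with the ~45,000 SNPs quoted for the starred combination.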
Max 55K 120-mer oligos
Glanville fritillary butterfly SureSelect target enrichment (10x tiling):
• To identify “lethal” haplotypes associated with a known homozygous genotype
• To define the structure and variation of the hypervariable Pgi gene
• To design tag-SNPs for large-scale genotyping
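A rough capacity estimate follows from the two numbers above. The sketch assumes that "10x tiling" means every target base is covered by ten overlapping baits, so consecutive baits start 12 bp apart. That reading is an assumption, and edge effects at the ends of each region are ignored.

```python
def max_target_bp(n_baits=55_000, bait_len=120, tiling=10):
    """Approximate total target length coverable by the bait library."""
    step = bait_len / tiling  # offset between consecutive bait starts
    return int(n_baits * step)


print(f"{max_target_bp() / 1000:.0f} kb of target at 10x tiling")
```

Under these assumptions the full library would cover roughly 660 kb of target sequence.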
Hypothesis-driven sampling: compare samples (24) from different populations with different tag-SNP genotype frequencies
> Hardy-Weinberg equilibrium
> Hardy-Weinberg disequilibrium
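One common way to tell the two cases apart is a chi-square goodness-of-fit test on genotype counts at a biallelic SNP. The sketch below uses hypothetical counts for 24 individuals. It is not taken from the talk, and with samples this small an exact test would usually be preferred.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (allele frequency p, expected genotype counts, chi-square)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes given")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, expected, chi2


CRITICAL_1DF_05 = 3.84  # chi-square critical value, 1 d.f., alpha = 0.05

# Hypothetical samples of 24 individuals each.
for counts in [(6, 12, 6), (10, 4, 10)]:
    p, expected, chi2 = hwe_chi_square(*counts)
    verdict = "disequilibrium" if chi2 > CRITICAL_1DF_05 else "equilibrium"
    print(counts, f"p = {p:.2f}", f"chi2 = {chi2:.2f}", verdict)
```

The second sample has a strong deficit of heterozygotes relative to the Hardy-Weinberg expectation of 12, which is the kind of signal this sampling design aims to detect.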

Cinxia Sure Select
TCMID_72 - Tas_pooli_Cinxia Sure Select_13-16
TCMID_71
TCMID_70
TCMID_69 - Tas_pooli_Cinxia Sure Select_E3
TCMID_68 - Tas_pooli_Cinxia Sure Select_D3
TCMID_67 - Tas_pooli_Cinxia Sure Select_5
TCMID_66 - Tas_pooli_Cinxia Sure Select_4
TCMID_65 - Tas_pooli_Cinxia Sure Select_3
TCMID_64 - Tas_pooli_Cinxia Sure Select_2
TCMID_63 - Tas_pooli_Cinxia Sure Select_1
TCMID_62 - Tas_pooli_Cinxia Sure Select_C3
TCMID_61 - Tas_pooli_Cinxia Sure Select_B3
TCMID_60 - Tas_pooli_Cinxia Sure Select_A3
TCMID_59 - Tas_pooli_Cinxia Sure Select_A2
TCMID_58 - Tas_pooli_Cinxia Sure Select_H1
TCMID_57 - Tas_pooli_Cinxia Sure Select_G1
TCMID_56 - Tas_pooli_Cinxia Sure Select_F1
TCMID_55 - Tas_pooli_Cinxia Sure Select_E1
TCMID_54 - Tas_pooli_Cinxia Sure Select_D1
TCMID_53 - Tas_pooli_Cinxia Sure Select_C1
TCMID_52 - Tas_pooli_Cinxia Sure Select_B1
TCMID_51 - Tas_pooli_Cinxia Sure Select_A1
TCMID_50 - Tas_pooli_Cinxia Sure Select_6,9-12 +7+8
TCMID_3 - Tas_pooli_Cinxia Sure Select_F3
[Figure: reads (total 337 635) and bases in kbp (total 128 555 kbp) per sample]
¼ 454 Titanium run: 444-12197 kb/sample = 15-406x coverage
Figure by Pia Laine, Institute of Biotechnology, University of Helsinki
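The coverage range quoted for the quarter run can be sanity-checked with simple arithmetic. The target size is not stated on the slide. The sketch below back-calculates it from the two endpoints and also derives the mean read length from the totals. Function names are illustrative only.

```python
def fold_coverage(bases_kb, target_kb):
    """Average depth: sequenced bases divided by target length."""
    return bases_kb / target_kb


def implied_target_kb(bases_kb, coverage):
    """Target length implied by a yield and its reported coverage."""
    return bases_kb / coverage


# Endpoints quoted on the slide: 444-12197 kb/sample = 15-406x coverage.
low = implied_target_kb(444, 15)
high = implied_target_kb(12_197, 406)
print(f"implied target size: {low:.1f}-{high:.1f} kb")

# Totals from the figure: 337 635 reads and 128 555 kbp.
mean_read_bp = 128_555_000 / 337_635
print(f"mean read length: {mean_read_bp:.0f} bp")
```

Both endpoints point to a target of about 30 kb per sample, and the mean read length of roughly 380 bp is in line with the 400 bp single reads planned for the SureSelect + 454 experiment.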
Our very preliminary result: ~40% of the data comes from the target
Data from Agilent
Sampsa Hautaniemi, Marko Laakso,
Sirkku Karinen, Rainer Lehtonen




Whole genome sequencing is doable for a “non-genome” oriented research group
Most of the work is in data filtering and analysis
Tools for data management and analysis are under strong development
Downstream efforts need to be compatible with available genome data