Using individual scale de novo assembly to identify mutations causing a phenotypic trait Scott Geib USDA-ARS Daniel K Inouye Pacific Basin Agricultural Research Center Hilo HI Ceratitis capitata: the Mediterranean.


Using individual scale de novo assembly to identify mutations causing a phenotypic trait
Scott Geib
USDA-ARS
Daniel K Inouye Pacific Basin Agricultural Research Center
Hilo HI
Ceratitis capitata: the Mediterranean fruit fly
• Not Drosophila
• Pest infesting over 300 different hosts
• Threat to the U.S.; not officially established in CA
• But infestations are found periodically
• California produces $18 BILLION of agriculture and represents almost half of total agriculture in the US
• Cost $50-200k in quarantine efforts
• Sterile insect male release is part of the eradication program
SIT strain genetics
• SIT (Sterile Insect Technique) strain
• Reared by the billions (in Hawaii and Guatemala)
• Is a genetic sexing strain
• Carries two sex-linked traits
• Temperature sensitive lethal (tsl): females die at increased temperature
• White pupae (wp): female pupae are white
• Females are killed in the egg stage, which halves the cost of rearing
• Desire to replicate this in other species
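SIT strain
[Figure: chromosome diagram. Females: X, X; chromosome 5 pair, each copy carrying tsl and wp. Males: X; translocated Y/5 and 5/Y carrying tsl+ and wp+; chromosome 5 carrying tsl and wp.]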
Wildtype lab line
[Figure: chromosome diagram. Females: X, X; chromosome 5 pair, each copy carrying tsl+ and wp+. Males: X, Y; chromosome 5 pair, each copy carrying tsl+ and wp+.]
The making of a sterile male
• Colony produces eggs
• Females are selected out (eggs incubated in water bath at 32 degrees Celsius)
• Eggs hatch and larvae feed
• Larvae pupate, are dyed and irradiated
• Flies are shipped, eclose and are distributed by plane
Los Angeles Basin Release Regions
Detections over time
• Despite efforts to keep flies out, they are consistently detected each year
2014 Detections
Goals
• Identify causative mutations for white pupae and temperature sensitive lethal traits in medfly
Approach
1. Genetic cross with inbred lab line to isolate the trait in a wild-type background
2. Construct a linkage map and perform QTL analysis to identify the region of the genome associated with the mutation
3. Through whole genome sequencing of individuals, characterize potential causative mutations within the highly associated regions
4. Confirm the mutation and re-create the phenotype through generation of CRISPR mutants
1. Isolate trait using a genetic cross
2. Generate genome wide variant data and high resolution linkage map
A genome assembly exists as a result of the i5k (5000 insect genomes) project (ALLPATHS-LG; BCM)
A genotyping-by-sequencing (GBS) approach
Elshire RJ, Glaubitz JC, Sun Q, et al. A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. Orban L, ed. PLoS ONE. 2011;6:e19379. doi:10.1371/journal.pone.0019379.
• F4 white and brown pupa individuals targeted for sequencing (in addition to P-gen and parents of F4)
• 285 samples total
• Sequenced 1 x 75 bp HiSeq, ~1 million reads/sample
• Map to reference, identify SNPs (~80,000 identified)
• Use SNPs across mapping population to calculate linkage map
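The last step above, turning SNP genotypes into map distances, can be illustrated with a toy sketch. It is not the mapping software or cross design used here: it assumes each marker has already been coded per individual as 0 or 1 for parental origin (a simplified backcross-like coding), and the marker vectors are invented for illustration. It estimates the recombination fraction between two markers and converts it to centimorgans with the Haldane mapping function.

```python
import math

def recombination_fraction(marker_a, marker_b):
    """Fraction of individuals whose parental-origin code differs between two markers.

    Genotypes are coded 0/1 per individual; None marks a missing call and
    that individual is skipped for this marker pair.
    """
    pairs = [(a, b) for a, b in zip(marker_a, marker_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no individuals genotyped at both markers")
    recombinants = sum(1 for a, b in pairs if a != b)
    return recombinants / len(pairs)

def haldane_cm(r):
    """Haldane map distance (centimorgans) for a recombination fraction r in [0, 0.5)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)

# Invented genotype vectors for ten individuals (illustration only).
marker_a = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
marker_b = [0, 0, 1, 1, 0, 1, 1, 1, 1, 0]

r = recombination_fraction(marker_a, marker_b)
print(f"r = {r:.2f}, distance = {haldane_cm(r):.1f} cM")
```

Markers with small pairwise distances are grouped and ordered into linkage groups; real mapping tools also handle genotyping error and missing data, which this sketch ignores apart from skipping missing calls.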
• We placed 80% of genome into linkage groups
• Generated 6 major linkage groups (matching expected autosomal and sex chromosome number)
80% of the genome “super scaffolded” into 12 pieces
QTL LOD score
By assessing the phenotype of each individual (white or brown pupae), we can calculate a score based on linkage disequilibrium.
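As a rough illustration of what a LOD score measures for a binary trait, the sketch below compares two models at a single marker: one where each genotype class has its own probability of white pupae, and one where all individuals share a single probability. The LOD is the log10 ratio of the two maximized likelihoods. The QTL software used is not named in these slides, so the function names and the toy data are assumptions, not the actual analysis.

```python
import math
from collections import Counter

def bernoulli_loglik(successes, total):
    """Log10-likelihood of the observed outcomes at the maximum-likelihood rate."""
    if total == 0 or successes in (0, total):
        return 0.0  # rate of 0 or 1 fits the data exactly
    p = successes / total
    return successes * math.log10(p) + (total - successes) * math.log10(1 - p)

def lod_score(genotypes, phenotypes):
    """LOD = log10 L(per-genotype white rate) - log10 L(one shared white rate)."""
    total = Counter()
    white = Counter()
    for genotype, phenotype in zip(genotypes, phenotypes):
        if genotype is None:  # skip missing calls
            continue
        total[genotype] += 1
        if phenotype == "white":
            white[genotype] += 1
    alternative = sum(bernoulli_loglik(white[g], total[g]) for g in total)
    null = bernoulli_loglik(sum(white.values()), sum(total.values()))
    return alternative - null

# Toy marker that separates the phenotypes perfectly.
linked = lod_score(["AA"] * 5 + ["AB"] * 5, ["white"] * 5 + ["brown"] * 5)
# Toy marker with no relationship to phenotype.
unlinked = lod_score(["AA", "AA", "AB", "AB"], ["white", "brown", "white", "brown"])
print(f"linked LOD = {linked:.2f}, unlinked LOD = {unlinked:.2f}")
```

Scanning this score across all mapped markers gives a profile whose peak marks the region most strongly associated with the trait.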
QTL LOD score
• Scaffolds associated with chromosome 5 had highest scoring loci
• This is a draft assembly, so there are many scaffolds on this chromosome
• Peak at Scaffold43, position 1,353,742
3. Identification of putative causative mutation
• Linkage map and QTL identify the relative location in the genome associated with the white pupae mutation.
• Putative region smaller than 1 Mb on a single scaffold of the draft genome assembly (Scaffold43:1353742)
• Most GBS loci are in non-coding regions
• Utilize whole genome re-sequencing of individuals and interrogation of this region of the genome to identify the putative mutation.
• Compare variant finding approaches as a user
• Mapping based approach (GATK)
• Assembly graph approach (DISCOVAR/DISCOVAR de novo)
• Generating data for DISCOVAR/GATK
• Individual flies were subjected to PCR-free library prep methods (~500 ng DNA / fly or less)
• Size selection to target a 450 bp fragment
• For each sample, 2 x 250 bp sequencing on HiSeq 2500 Rapid Run, approximately 70M read clusters per library
• Six F4 flies (3 white and 3 brown) sequenced on a single Rapid Run to ~60X coverage
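The relationship between read count, read length, and depth can be sketched with simple arithmetic. This is a naive estimate that assumes every base is usable and uses the ~450 Mb genome size quoted elsewhere in these slides; the optional fragment-length cap (an assumption added here) accounts for paired reads that overlap when the fragment is shorter than two read lengths. It is not expected to reproduce the reported coverage exactly, since trimming, duplicates, and the genome size estimate all shift the result.

```python
def estimated_coverage(read_pairs, read_length_bp, genome_size_bp,
                       fragment_length_bp=None):
    """Naive sequencing depth: sequenced bases divided by genome size.

    If fragment_length_bp is given, the bases counted per pair are capped at
    the fragment length, because overlapping mates re-read the same bases.
    """
    bases_per_pair = 2 * read_length_bp
    if fragment_length_bp is not None:
        bases_per_pair = min(bases_per_pair, fragment_length_bp)
    return read_pairs * bases_per_pair / genome_size_bp

raw = estimated_coverage(70e6, 250, 450e6)
capped = estimated_coverage(70e6, 250, 450e6, fragment_length_bp=450)
print(f"raw: {raw:.1f}x, capped at fragment length: {capped:.1f}x")
```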
• DISCOVAR de novo as an individual scale assembler
• Contig assembly only, no scaffolding
• 6 individual assemblies
• Medfly genome size ~450 Mb
Sample | # read pairs (estimated coverage) | Frag insert size (determined from sequence data) | Scaffold N50 | Mean base quality | Starting DNA amount (ng) | Starting DNA peak size
W3M | 79.6 (42x) | 480 | 28826 | 35.3 | 178 | 5.3
W5M | 90 (47x) | 485 | 40279 | 35.3 | 214 | 13.5
W7M | 141.7 (74x) | 480 | 18817 | 35.3 | 94 | 8
B44F | 88.5 (46x) | 500 | 37117 | 35 | 274 | 3.8
B47F | 69.8 (36x) | 470 | 29943 | 35.3 | 286 | 6.3
B56F | 69.1 (36x) | 480 | 6454 | 35.3 | 304 | 8.9
• DISCOVAR de novo as an individual scale assembler
• Contig N50 ranged from 6.4 kb to over 40 kb
• While smaller than suggested by DISCOVAR, 40 kb is much larger than contigs derived from “standard ALLPATHS” assembly
• Able to anchor the assemblies to the reference draft genome
• Genotype by comparing graph structure between assemblies.
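For readers unfamiliar with the N50 statistic quoted above, a minimal sketch of the standard definition follows: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The contig lengths in the example are invented.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")

# Invented contig lengths (bp), illustration only.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it says nothing about correctness; a misjoined assembly can have a high N50.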
Possibility of post-assembly scaffolding
• Another goal of this project was to test DISCOVAR for de novo assembly of other insect species (lower cost/better contig quality than ALLPATHS?)
• Utilizing existing jumping libraries, we can scaffold to similar size as ALLPATHS (SSPACE)
• Looking at utility of Hi-C datasets to superscaffold
• DISCOVAR + Hi-C = chromosome scale assembly (?)
• Maybe need some pre-scaffolding?
• Combine with linkage data?
• Received Hi-C data this week ...
• DISCOVAR as a variant caller
• Currently it is difficult to pull out variants across the entire genome from a DISCOVAR de novo analysis (in the works)
• Today, focused specifically within Scaffold43, surrounding the linked loci (from QTL analysis)
• Example graph structure of 6 genomes together across small region:
• White pupae: homozygous alternative
• Brown: homozygous for reference
• One brown was heterozygous (.44/.50)
• Comparing results of these 6 genomes to the
reference and to each other at genome-wide scale
• Using GATK and DISCOVAR
• Look at the genotype data
• Verifying linkage map
• Comparison of QTL score to variants discriminating phenotypes from WGS (100 kb window)
[Figure: number of SNPs with homozygous calls in all white pupae F4s and an opposing call (homozygous or heterozygous) in the brown pupae samples]
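The count plotted above can be sketched as a simple filter followed by binning. The sketch assumes genotype calls are available as allele pairs per sample, treats a SNP as discriminating when all white-pupae samples share one homozygous genotype and no brown-pupae sample has that genotype, and bins positions into 100 kb windows. The record layout and the toy SNPs are hypothetical, and missing calls are not handled.

```python
from collections import Counter

def is_discriminating(white_calls, brown_calls):
    """True if all white samples share one homozygous genotype that no brown sample has."""
    reference = sorted(white_calls[0])
    if reference[0] != reference[1]:
        return False  # white genotype must be homozygous
    if any(sorted(call) != reference for call in white_calls):
        return False  # white samples must agree with each other
    return all(sorted(call) != reference for call in brown_calls)

def count_per_window(snps, window=100_000):
    """Count discriminating SNPs per window, keyed by window start position."""
    counts = Counter()
    for position, white_calls, brown_calls in snps:
        if is_discriminating(white_calls, brown_calls):
            counts[position // window * window] += 1
    return dict(counts)

# Hypothetical SNPs: (position, white-pupae calls, brown-pupae calls).
toy_snps = [
    (120_500, [("T", "T")] * 3, [("C", "C"), ("C", "C"), ("C", "T")]),  # discriminating
    (150_000, [("A", "A")] * 3, [("A", "A"), ("G", "G"), ("G", "G")]),  # a brown matches white
    (199_999, [("G", "G")] * 3, [("A", "A")] * 3),                      # discriminating
    (250_000, [("A", "A"), ("A", "G"), ("A", "A")], [("G", "G")] * 3),  # whites disagree
]
window_counts = count_per_window(toy_snps)
print(window_counts)
```

Windows with a high count of such SNPs should line up with the QTL peak if the linkage map and the re-sequencing data agree.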
• Identify possible assembly issues from reference assembly
• A chromosome 5 (linked) scaffold (white males only)
• An unlinked scaffold, potential scaffolding error
• Back to variant calling at a more detailed scale
• Using GATK and DISCOVAR, called variants across Scaffold43 (~3.3 Mb)
• Variant impact analyzed using SnpEff and current NCBI RefSeq annotation set
• Overall, some consistency between variants called:
• DISCOVAR called 87,542 variants (76k SNPs / 11k INDELs)
• GATK called 106,873 variants (85k SNPs / 22k INDELs)
• 61,105 identical between methods
• DISCOVAR found several very large insertions (100s of bp) not identified by GATK; all were in non-coding regions.
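The overlap between the two call sets can be put on a common scale with a few ratios. The sketch below takes only the three counts reported above and derives the share of each caller's variants that the other also found, plus the Jaccard index (shared calls divided by the union). The function name and the returned keys are illustrative choices, not part of either tool.

```python
def overlap_summary(n_a, n_b, n_shared):
    """Summarize agreement between two call sets from their sizes and overlap."""
    if n_shared > min(n_a, n_b):
        raise ValueError("shared count cannot exceed either call set")
    union = n_a + n_b - n_shared
    return {
        "fraction_of_a": n_shared / n_a,
        "fraction_of_b": n_shared / n_b,
        "jaccard": n_shared / union,
    }

# Counts reported for Scaffold43: DISCOVAR, GATK, identical between methods.
summary = overlap_summary(87_542, 106_873, 61_105)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With per-variant records in hand, the same numbers follow from set operations on (scaffold, position, ref, alt) tuples, which also makes it easy to list the calls unique to each method.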
• Generating “short list” of putative mutations
• Making some assumptions (disclaimer)
• Mutation is in coding region
• Not accounting for non-coding mutations that may impact gene expression or regulation
• Very little regulatory info available for this non-model genome
• Overall, found five major mutations that:
• Consistent between phenotypes
• Caused major impact
• Non-synonymous mutations
• Frameshift mutations and/or premature stop codons
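The short-listing criteria above amount to a conjunction of filters. The sketch below expresses them over a hypothetical variant record: the `Variant` fields, the scaffold name, and the toy records are invented for illustration, and the effect labels are Sequence Ontology style terms of the kind annotation tools report. It is not the filtering code actually used.

```python
from dataclasses import dataclass

# Effect terms treated as "major impact" in this sketch (an assumption).
MAJOR_EFFECTS = {"missense_variant", "frameshift_variant", "stop_gained"}

@dataclass(frozen=True)
class Variant:
    scaffold: str
    position: int
    effect: str                      # annotation term for the predicted effect
    in_coding_region: bool
    consistent_with_phenotype: bool  # genotype separates white from brown samples

def shortlist(variants):
    """Keep coding, phenotype-consistent variants with a major predicted effect."""
    return [
        v for v in variants
        if v.in_coding_region
        and v.consistent_with_phenotype
        and v.effect in MAJOR_EFFECTS
    ]

# Invented records, illustration only.
toy_variants = [
    Variant("toy_scaffold", 100, "frameshift_variant", True, True),
    Variant("toy_scaffold", 200, "synonymous_variant", True, True),
    Variant("toy_scaffold", 300, "missense_variant", True, False),
    Variant("toy_scaffold", 400, "intergenic_region", False, True),
    Variant("toy_scaffold", 500, "stop_gained", True, True),
]
kept = shortlist(toy_variants)
print([v.position for v in kept])
```

The coding-region requirement is the stated simplifying assumption; dropping it would keep regulatory candidates at the cost of a much longer list.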
Scaffold | Position | Ref | Alt
Scaffold 43 | 800831 | C | T
Scaffold 43 | 837972 | C | C
Scaffold 43 | 1576424 | G | A
Scaffold 43 | 2259830 | AAC | A
Scaffold 43 | 2262779 | C | A
Scaffold 43 | 2410888 | alleles are A and the long sequence below (Ref/Alt assignment not recoverable)
ACAACAGGCATGCCAGCAAGTTGT
GGCCGTCTTCCAACAACATGCTGCT
ACAACTACAACAGCCAAATGACGAG
CCCGCCGTTGCAGCCTCAGCACCAG
CCAAGGCTACATATGCAACACTGCG
ACATGCGATGGTTGTAGAGGCGCA
AGC
4. Demonstrate mutations
• Using the union of linkage mapping, whole genome variant calling, structural impact, and RNA-seq, identified a prioritized “short list” of mutations causing white pupae and temperature sensitive lethal
• Verify mutations through Sanger sequencing and SNP assays
• Re-create phenotype through CRISPR-Cas targeted editing (currently underway)
Conclusions
• A combination of approaches allows identification of putative mutations causing phenotypic traits
• Graph-based variant calling seems to be of similar quality to mapping-based approaches in a non-model system (difficult to validate without a gold set of variants)
• Potential advantage of getting a de novo assembly: identify novel structure of the specific genome and not carry over errors of the reference assembly
• Cost is higher, but comparable to 30x coverage of standard HiSeq data.
Acknowledgements
Pacific Basin Agricultural Research Center
Geib Lab
Sheina Sim
Bernarda Calla
Steve Tam
Brian Hall
Teddy DeRego
APHIS-PPQ
Norman Barr, Raul Ruiz
DISCOVAR Group
David Jaffe
Funding source/Resources:
USDA Farm Bill
USDA ARS
Moana HPC Cluster
NSF XSEDE Consortium
Questions?