Software for population genetics
Structure: J. K. Pritchard et al.
Geneclass 2: S. Piry et al.
Structure
• Identification of genetic clusters
• Identification of subclustering within
breeds or relationships between breeds
• Breed assignment of unknown samples
to reference set
Structure
Identification of genetic clusters
• Bayesian likelihood method of identifying
K clusters
• K: number of clusters/populations –
provided by user or inferred by
Structure
Structure
Ancestry models
• No admixture model: Each individual originates from
one of the K populations
• Admixture models: Each individual has genomic
fractions of more than one of the K populations
• Linkage model: admixture model, but linked loci are
more likely to originate from the same population.
• Prior information model: user pre-defines (some of)
the clusters
• NB: the model is also determined by the type of data
one has!!
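The difference between the no admixture and admixture models can be pictured as a per-individual membership vector over the K clusters. The sketch below is only an illustration: the vectors and the helper names `is_valid_membership` and `is_admixed` are invented here and are not part of Structure.

```python
# Toy membership vectors for K = 3 clusters (values invented for illustration).
no_admixture = [1.0, 0.0, 0.0]      # individual comes entirely from cluster 0
admixture = [0.6, 0.3, 0.1]         # genome split across several clusters


def is_valid_membership(q, tol=1e-9):
    """A membership vector holds non-negative fractions that sum to 1."""
    return all(x >= 0 for x in q) and abs(sum(q) - 1.0) < tol


def is_admixed(q, tol=1e-9):
    """True when more than one cluster contributes a non-zero fraction."""
    return sum(1 for x in q if x > tol) > 1


for name, q in [("no admixture", no_admixture), ("admixture", admixture)]:
    print(name, q, "valid:", is_valid_membership(q), "admixed:", is_admixed(q))
```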
Structure
Ancestry models and input data
• Dominant markers: no admixture model.
– AA and Aa cannot be distinguished so only a
'present' or 'absent' genotype is available.
– AFLP, RFLP etc
• Sequence data, Y chrom or mtDNA
haplotypes: linkage model. Consider this as a
single locus with many alleles.
Structure
allele frequency models
• Correlated allele frequencies:
frequencies in different populations are
likely to be similar (due to migrations or
shared ancestry).
• Independent allele frequencies: allele
frequencies are independent draws from
a distribution specified by a factor λ
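As a rough sketch of the independent model, the code below assumes the distribution is a symmetric Dirichlet with every parameter equal to λ, and draws one frequency vector per population. The function name, the seed, and the values of K and the allele count are illustrative choices, not settings taken from Structure.

```python
import random


def draw_allele_frequencies(n_alleles, lam=1.0, rng=random):
    """Draw one frequency vector from a symmetric Dirichlet(lam, ..., lam).

    Uses the standard construction: independent Gamma(lam, 1) draws,
    normalised so that they sum to 1.
    """
    draws = [rng.gammavariate(lam, 1.0) for _ in range(n_alleles)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
K = 3          # number of populations (illustrative)
n_alleles = 4  # alleles at one locus (illustrative)

# Independent model: every population gets its own unrelated draw.
freqs = [draw_allele_frequencies(n_alleles, lam=1.0, rng=rng) for _ in range(K)]
for k, p in enumerate(freqs):
    print(f"population {k}:", [round(x, 3) for x in p])
```

With λ = 1 every frequency vector is equally likely; under the correlated model the draws for different populations would instead be tied to a shared set of frequencies.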
Structure
Determining the K
• How to estimate the number of
populations / clusters in your dataset?
– Fully resolving all the groups in your data
(high K): testing all K values until highest
likelihood values are reached.
– Determining the rough relations (low K)
– Trial and error
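One simple way to organise the "test all K values" approach is to repeat the run several times at each K and compare mean log-likelihoods. The sketch below assumes the log-likelihood of each run has already been copied out of the program output; all numbers are invented for illustration, and `best_k` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical log-likelihood values from repeated runs at each K.
# These numbers are invented for illustration only.
log_likelihoods = {
    1: [-5200.0, -5198.5, -5201.2],
    2: [-4900.3, -4903.1, -4899.8],
    3: [-4750.6, -4752.0, -4749.9],
    4: [-4751.4, -4760.2, -4755.7],
}


def best_k(runs_by_k):
    """Return the K whose runs have the highest mean log-likelihood."""
    return max(runs_by_k, key=lambda k: mean(runs_by_k[k]))


for k in sorted(log_likelihoods):
    print(k, round(mean(log_likelihoods[k]), 1))
print("K with highest mean log-likelihood:", best_k(log_likelihoods))
```

In practice the likelihood often keeps creeping upward or plateaus, so the highest value is a starting point for judgement rather than an automatic answer.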
Structure
running parameters
• Likelihood method: the program
optimizes its own internal parameters.
– Startup configuration can have a very low
probability, so Structure needs a learning
run: the burn-in (10,000-100,000 replicates)
• Actual run: enough replicates to obtain
statistically sound results (depending on
your dataset) ~50,000 (?)
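The role of the burn-in can be shown with a generic sampler. The sketch below is not Structure's algorithm: it is a plain random-walk Metropolis sampler on a standard normal target, started from a deliberately improbable value, with the burn-in and run lengths chosen only for illustration.

```python
import math
import random


def metropolis_standard_normal(start, n_steps, rng):
    """Random-walk Metropolis sampler whose target is a standard normal."""
    x = start
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-1.0, 1.0)
        # Log of the acceptance ratio for a standard normal target.
        log_ratio = (x * x - proposal * proposal) / 2.0
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain


rng = random.Random(42)
burnin = 1000      # iterations discarded (illustrative value)
n_kept = 20000     # iterations kept for the estimate (illustrative value)

# Deliberately poor starting point, far from where the target has its mass.
chain = metropolis_standard_normal(start=50.0, n_steps=burnin + n_kept, rng=rng)
kept = chain[burnin:]

print("mean of first 100 iterations:", round(sum(chain[:100]) / 100, 2))
print("mean after burn-in:", round(sum(kept) / len(kept), 2))
```

The early iterations still reflect the starting point, which is why they are thrown away before any summary is computed.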
Geneclass 2
breed assignment
• Software for Genetic assignment and
first-generation Migrant Detection
• S. Piry, A. Alapetite, J.-M. Cornuet, D.
Paetkau, L. Baudouin, A. Estoup
• INRA, Fr.
• Journal of Heredity 2004:95(6): 536-9
Geneclass 2
breed assignment
• Infers the probability that each reference
population is the origin of a sampled
individual, on the basis of multilocus
genotypic data.
• Haploid or diploid or mix.
• Likelihood criteria
– Genetic distances
– Allele frequencies
– Bayesian algorithm
• Monte Carlo resampling
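To make the allele-frequency criterion concrete, the sketch below scores a diploid genotype against each reference population and picks the most likely one. It is a minimal illustration, not Geneclass 2's implementation: the function names, the toy breeds and allele sizes, and the `missing_freq` value used for alleles absent from a reference are all assumptions.

```python
import math
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies at one locus from a list of (allele1, allele2)."""
    counts = Counter(a for g in genotypes for a in g)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def log_likelihood(individual, reference, missing_freq=0.01):
    """Log-likelihood of a diploid multilocus genotype in one population.

    Per locus: p^2 for a homozygote, 2*p*q for a heterozygote.
    missing_freq is an assumed stand-in for alleles absent from the reference.
    """
    total = 0.0
    for locus, (a1, a2) in individual.items():
        freqs = allele_frequencies(reference[locus])
        p = freqs.get(a1, missing_freq)
        q = freqs.get(a2, missing_freq)
        total += math.log(p * p if a1 == a2 else 2 * p * q)
    return total


def assign(individual, references):
    """Return the reference population with the highest log-likelihood."""
    return max(references, key=lambda pop: log_likelihood(individual, references[pop]))


# Toy reference set: two breeds, two loci, allele sizes invented for illustration.
references = {
    "BreedA": {"M1": [(120, 120), (120, 122), (120, 120)],
               "M2": [(200, 204), (200, 200), (204, 204)]},
    "BreedB": {"M1": [(126, 126), (126, 128), (128, 128)],
               "M2": [(200, 204), (204, 204), (204, 208)]},
}
unknown = {"M1": (120, 122), "M2": (200, 204)}
print(assign(unknown, references))
```

Multiplying per-locus probabilities assumes the loci are independent, which is the usual working assumption for unlinked microsatellites.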
Two examples…
• Products of protected geographical origin (PGI)
• Vitellone dell'Appennino Centrale
– Allowed breeds: Chianina, Romagnola, Marchigiana
– Not allowed: Piedmontese, Maremmana, Pezzata Rossa Italiana,
Italian Brown, Italian Friesian, Charolais, Limousin, Belgian Blue
• Veau du Limousin
– Allowed breeds: Limousin, Blonde d'Aquitaine, Bazadaise
– Not allowed: Holstein, Friesian, Fries-Hollands, Belgian Blue, Maine-Anjou, Normande, Bretonne Pie Noire, Charolais, Hereford,
Aberdeen Angus, Gasconne, Aubrac, Salers, Montbéliarde,
Simmental, Piedmontese, Swiss Brown, Pirinaica
Objective?
• Identify a representative sample from a
batch
• Traceability
• Fraud?
• Protection of the (cultural, economic)
integrity of the product
How?
• Typing with microsatellites.
• Compare patterns / allele frequencies with
reference set.
• Reference library: product of EU diversity
project Resgen:
– ~45 breeds (still adding)
– 20 animals per breed
– 30 microsatellite markers
Input file layout (screenshot labels): Title, Marker order, Populations, Genotypes (allele1allele2)
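The "allele1allele2" notation means both alleles of a genotype are written as one fused code. A small sketch of splitting such a code is shown below; the three-digits-per-allele width and the helper name `split_genotype` are assumptions for illustration and should be matched to the actual file.

```python
def split_genotype(code, digits_per_allele=3):
    """Split a fused genotype code such as '120124' into (120, 124).

    digits_per_allele is an assumption; adjust it to match the actual file.
    """
    if len(code) != 2 * digits_per_allele:
        raise ValueError(f"expected {2 * digits_per_allele} digits, got {code!r}")
    return int(code[:digits_per_allele]), int(code[digits_per_allele:])


print(split_genotype("120124"))
```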
Optimization
• No need to type all 30 microsatellites
• Product specific level of marker information
• Geneclass 2 option: self-identification
– Isolate breeds involved in the product (allowed or
not allowed)
– Infer the level of successful self-identification per
marker
– Rank the markers in order of level of information
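The ranking idea can be sketched as a leave-one-out test per marker: each reference animal is removed from its own breed, assigned using that single marker, and the fraction assigned back correctly is the marker's score. This is an illustration of the principle only, not the procedure Geneclass 2 runs; the function names, the toy data, and the `missing_freq` value are assumptions.

```python
import math
from collections import Counter


def locus_log_likelihood(genotype, others, missing_freq=0.01):
    """Log-likelihood of one diploid genotype given other genotypes at that locus."""
    counts = Counter(a for g in others for a in g)
    total = sum(counts.values())
    a1, a2 = genotype
    # Alleles not seen in the reference get an assumed small frequency.
    p = counts[a1] / total if counts[a1] else missing_freq
    q = counts[a2] / total if counts[a2] else missing_freq
    return math.log(p * p if a1 == a2 else 2 * p * q)


def self_identification_rate(samples, marker):
    """Fraction of animals assigned back to their own breed using one marker.

    Each animal is left out of its own breed's reference before scoring.
    Ties go to the alphabetically first breed.
    """
    breeds = sorted({b for b, _ in samples})
    correct = 0
    for i, (breed, genotypes) in enumerate(samples):
        scores = {}
        for candidate in breeds:
            others = [g[marker] for j, (b, g) in enumerate(samples)
                      if b == candidate and j != i]
            scores[candidate] = locus_log_likelihood(genotypes[marker], others)
        if max(scores, key=scores.get) == breed:
            correct += 1
    return correct / len(samples)


def rank_markers(samples, markers):
    """Markers sorted from most to least informative."""
    return sorted(markers, key=lambda m: self_identification_rate(samples, m),
                  reverse=True)


# Toy data (invented): M1 separates the breeds, M2 is identical everywhere.
samples = [
    ("BreedA", {"M1": (120, 120), "M2": (200, 204)}),
    ("BreedA", {"M1": (120, 122), "M2": (200, 204)}),
    ("BreedA", {"M1": (120, 120), "M2": (200, 204)}),
    ("BreedB", {"M1": (126, 128), "M2": (200, 204)}),
    ("BreedB", {"M1": (126, 126), "M2": (200, 204)}),
    ("BreedB", {"M1": (126, 128), "M2": (200, 204)}),
]
print(rank_markers(samples, ["M2", "M1"]))
```

Typing can then start from the top of the ranked list and stop once adding further markers no longer improves the identification rate.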
Identification success in Italian set
(Bar chart: % correctly self-identified in the Italian set per marker, y-axis 0% to 16%. Markers on the x-axis: INRA5, ETH152, SPS115, INRA35, INRA63, INRA32, MM12, TGLA126, ILSTS6, HEL1, HEL5, CSRM60, ETH225, HEL13, HEL9, HAUT27, INRA37, ILSTS5, BM1818, ETH185, TGLA53, INRA23, ETH3, BM1824, BM2113, CSSM66, HAUT24, TGLA122, TGLA227, ETH10.)
Additive successful identification of Italian set compared to full ref set
(Chart: y-axis 0% to 100%.)
Identification success in French set
(Bar chart: % correctly self-identified in the French set per marker, y-axis 0% to 25%. Markers on the x-axis: ILSTS5, TGLA126, INRA5, SPS115, INRA35, HEL13, ETH3, INRA63, HEL1, BM1824, ETH225, INRA37, BM1818, HEL9, MM12, ETH152, ETH10, CSRM60, ILSTS6, INRA32, BM2113, HAUT24, HEL5, TGLA53, TGLA227, INRA23, ETH185, HAUT27, TGLA122, CSSM66.)
Additive successful identification of French set compared to full ref set
(Chart: y-axis 0% to 100%.)
Additive identification success
(Line chart: y-axis 0% to 100%, x-axis 0 to 30. Series:
– Random marker order; French dataset
– Fr optimized marker order; French dataset
– Fr optimized marker order; Italian dataset
– It optimized marker order; Italian dataset
– Fr optimized marker order; Italian dataset_2 class
– Fr optimized marker order set; French dataset_2 class)
Conclusions
• Breed assignment of unknown samples to a
(large) reference set is quite successful
• Optimizing marker order for each question
greatly decreases the amount of typing
necessary.
• For a more detailed picture of relationships,
data can be analyzed in Structure
Exercise
• 37 unknown samples (file exercise.txt)
• Use the reference set (file reference.txt)
to assign breed names to the samples
• Play with the loci to see the effect of
different markers on the solution
Solution