Software for population genetics
Structure: J. K. Pritchard et al.
Geneclass 2: S. Piry et al.

Structure
• Identification of genetic clusters
• Identification of subclustering within breeds or relationships between breeds
• Breed assignment of unknown samples to a reference set

Structure
Identification of genetic clusters
• Bayesian likelihood method of identifying K clusters
• K: number of clusters/populations
  – provided by the user or inferred by Structure

Structure
Ancestry models
• No admixture model: each individual originates from one of the K populations
• Admixture model: each individual has genomic fractions from more than one of the K populations
• Linkage model: admixture model, but linked loci are more likely to originate from the same population
• Prior information model: the user pre-defines (some of) the clusters
• NB: the model is also determined by the type of data one has!

Structure
Ancestry models and input data
• Dominant markers: no admixture model
  – AA and Aa cannot be distinguished, so only a 'present' or 'absent' genotype is available
  – AFLP, RFLP etc.
• Sequence data, Y chromosome or mtDNA haplotypes: linkage model. Consider this as a single locus with many alleles.

Structure allele frequency models
• Correlated allele frequencies: frequencies in different populations are likely to be similar (due to migration or shared ancestry)
• Independent allele frequencies: allele frequencies are independent draws from a distribution specified by a factor λ

Structure
Determining K
• How to estimate the number of populations / clusters in your dataset?
  – Fully resolving all the groups in your data (high K): test all K values until the highest likelihood values are reached
  – Determining the rough relations (low K)
  – Trial and error

Structure running parameters
• Likelihood method: the program optimizes its own internal parameters.
  – The startup configuration can have a very low probability, so Structure needs a learning run: the burn-in (10,000-100,000 replicates)
• Actual run: enough replicates to obtain statistically sound results (depending on your dataset), ~50,000 (?)

Geneclass 2 breed assignment
• Software for genetic assignment and first-generation migrant detection
• S. Piry, A. Alapetite, J.-M. Cornuet, D. Paetkau, L. Baudouin, A. Estoup
• INRA, France
• Journal of Heredity 2004; 95(6): 536-539

Geneclass 2 breed assignment
• Infers the probability that each reference population is the origin of a sampled individual, on the basis of multilocus genotype data
• Haploid, diploid or mixed data
• Likelihood criteria
  – Genetic distances
  – Allele frequencies
  – Bayesian algorithm
• Monte Carlo resampling

Two examples…
• Products with protected geographical indication (PGI)
• Vitellone dell'Appennino Centrale
  – Allowed breeds: Chianina, Romagnola, Marchigiana
  – Not allowed: Piedmontese, Maremmana, Pezzata Rossa Italiana, Italian Brown, Italian Friesian, Charolais, Limousin, Belgian Blue
• Veau du Limousin
  – Allowed breeds: Limousin, Blonde d'Aquitaine, Bazadaise
  – Not allowed: Holstein, Friesian, Fries-Hollands, Belgian Blue, Maine-Anjou, Normande, Bretonne Pie Noir, Charolais, Hereford, Aberdeen Angus, Gasconne, Aubrac, Salers, Montbéliarde, Simmental, Piedmontese, Swiss Brown, Pirenaica

Objective?
• Identify a representative sample from a batch
• Traceability
• Fraud?
• Protection of the (cultural, economic) integrity of the product

How?
• Typing with microsatellites
• Compare patterns / allele frequencies with the reference set
• Reference library: product of the EU diversity project Resgen:
  – ~45 breeds (still adding)
  – 20 animals per breed
  – 30 microsatellite markers

[Figure: input file layout, labelled: title, marker order, populations, genotypes (allele1allele2)]

Optimization
• No need to type all 30 microsatellites
• Product-specific level of marker information
• Geneclass 2 option: self-identification
  – Isolate the breeds involved in the product (allowed or not allowed)
  – Infer the level of successful self-identification per marker
  – Rank the markers in order of level of information

[Figure: Identification success in Italian set. Y-axis: % correctly self-identified in Italian set (0-16%). X-axis markers, in extraction order: INRA5, ETH152, SPS115, INRA35, INRA63, INRA32, MM12, TGLA126, ILSTS6, HEL1, HEL5, CSRM60, ETH225, HEL13, HEL9, HAUT27, INRA37, ILSTS5, BM1818, ETH185, TGLA53, INRA23, ETH3, BM1824, BM2113, CSSM66, HAUT24, TGLA122, TGLA227, ETH10]

[Figure: Additive successful identification of Italian set compared to full reference set. Y-axis: 0-100%]

[Figure: Identification success in French set. Y-axis: % correctly self-identified in French set (0-25%). X-axis markers, in extraction order: ILSTS5, TGLA126, INRA5, SPS115, INRA35, HEL13, ETH3, INRA63, HEL1, BM1824, ETH225, INRA37, BM1818, HEL9, MM12, ETH152, ETH10, CSRM60, ILSTS6, INRA32, BM2113, HAUT24, HEL5, TGLA53, TGLA227, INRA23, ETH185, HAUT27, TGLA122, CSSM66]

[Figure: Additive successful identification of French set compared to full reference set. Y-axis: 0-100%]

[Figure: Additive identification success. Y-axis: 0-100%. X-axis: 0-30. Series: random marker order, French dataset; Fr optimized marker order, French dataset; Fr optimized marker order, Italian dataset; It optimized marker order, Italian dataset; Fr optimized marker order, Italian dataset_2 class; Fr optimized marker order set, French dataset_2 class]

Conclusions
• Breed assignment of unknown samples to a (large) reference set is quite successful
• Optimizing the marker order for each question greatly decreases the amount of typing necessary
• For a more detailed picture of relationships, the data can be analyzed in Structure

Exercise
• 37 unknown samples (file exercise.txt)
• Use the reference set (file reference.txt) to assign breed names to the samples
• Play with the loci to see the effect of different markers on the solution

Solution
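
As a rough illustration of the allele-frequency idea behind breed assignment (not Geneclass 2's actual implementation, and not the answer to the exercise), the sketch below scores one diploid genotype against each reference breed. Everything in it is an assumption made for illustration: the data layout, the function names, the toy breeds and loci, the Hardy-Weinberg genotype probabilities, and the floor frequency MISSING_FREQ used for alleles never seen in a breed.

```python
import math
from collections import Counter

# Assumed floor frequency for an allele that never occurs in a reference
# breed; without it a single unseen allele would give log(0).
MISSING_FREQ = 0.01


def allele_frequencies(reference):
    """Map {breed: [{locus: (a1, a2)}, ...]} to {breed: {locus: {allele: freq}}}."""
    freqs = {}
    for breed, animals in reference.items():
        counts = {}
        for genotype in animals:
            for locus, alleles in genotype.items():
                counts.setdefault(locus, Counter()).update(alleles)
        freqs[breed] = {
            locus: {allele: n / sum(c.values()) for allele, n in c.items()}
            for locus, c in counts.items()
        }
    return freqs


def log_likelihood(genotype, breed_freqs, loci=None):
    """Log-probability of a genotype in one breed, assuming Hardy-Weinberg
    proportions and independent loci. `loci` optionally restricts the markers."""
    total = 0.0
    for locus, (a1, a2) in genotype.items():
        if loci is not None and locus not in loci:
            continue
        table = breed_freqs.get(locus, {})
        p = table.get(a1, MISSING_FREQ)
        q = table.get(a2, MISSING_FREQ)
        # Homozygote: p^2, heterozygote: 2pq.
        total += math.log(p * p if a1 == a2 else 2 * p * q)
    return total


def assign(genotype, freqs, loci=None):
    """Return the best-scoring breed and the score of every breed."""
    scores = {breed: log_likelihood(genotype, f, loci) for breed, f in freqs.items()}
    return max(scores, key=scores.get), scores


# Invented toy data: two hypothetical breeds, two hypothetical loci,
# alleles coded as fragment sizes.
reference = {
    "breed_A": [
        {"locus1": (100, 100), "locus2": (200, 202)},
        {"locus1": (100, 102), "locus2": (200, 200)},
    ],
    "breed_B": [
        {"locus1": (104, 104), "locus2": (206, 206)},
        {"locus1": (102, 104), "locus2": (204, 206)},
    ],
}
unknown = {"locus1": (100, 100), "locus2": (200, 200)}

freqs = allele_frequencies(reference)
best, scores = assign(unknown, freqs)
print(best, scores)
```

Passing a subset of marker names through the `loci` argument, for example `assign(unknown, freqs, loci={"locus1"})`, mimics the last step of the exercise: dropping or adding markers and watching how the scores change.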