PowerPoint presentation - The Genome Analysis Centre


Added value of whole-genome sequence data to genomic predictions in dairy cattle

Rianne van Binsbergen 1,2, Mario Calus 1, Chris Schrooten 3, Fred van Eeuwijk 2, Roel Veerkamp 1, Marco Bink

1 Animal Breeding & Genetics Centre, Wageningen UR (NL)
2 Biometris, Wageningen UR (NL)
3 CRV (cattle breeding company), Arnhem (NL)

Genomic Prediction in agricultural species

Reference population:
1) Estimate effects for each SNP (w)
2) Generate a prediction equation that combines all the marker genotypes with their effects to predict the breeding value of each individual

Apply prediction equation to a group of individuals that have genotypes but not phenotypes
→ Estimated genomic breeding values
→ Select the best individuals for breeding

Each SNP represented by a variable (x), which takes the values 0 [A A], 1 [A B], 2 [B B]
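The two steps above (estimate one effect w per SNP in the reference population, then apply the prediction equation to genotyped candidates) can be sketched as follows. This is a minimal illustration only: ridge regression stands in for the estimation step, the data are simulated, and the function names and the `ridge` value are hypothetical, not taken from the slides.

```python
import numpy as np

def estimate_snp_effects(X_ref, y_ref, ridge=1.0):
    """Estimate a mean and one effect (w) per SNP by ridge regression.

    X_ref: (n_individuals, n_snps) genotypes coded 0/1/2.
    y_ref: (n_individuals,) phenotypes of the reference population.
    ridge: shrinkage parameter (hypothetical value, not from the slides).
    """
    X = np.asarray(X_ref, dtype=float)
    y = np.asarray(y_ref, dtype=float)
    col_means = X.mean(axis=0)
    Xc = X - col_means                      # centre genotypes per SNP
    mu = y.mean()
    # Solve (Xc'Xc + ridge * I) w = Xc'(y - mu)
    lhs = Xc.T @ Xc + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(lhs, Xc.T @ (y - mu))
    return mu, col_means, w

def predict_gebv(X_cand, mu, col_means, w):
    """Prediction equation: GEBV_i = mu + sum_j (x_ij - mean_j) * w_j."""
    return mu + (np.asarray(X_cand, dtype=float) - col_means) @ w

# Simulated toy data (assumption: 200 reference animals, 50 SNPs).
rng = np.random.default_rng(0)
X_ref = rng.integers(0, 3, size=(200, 50))
true_w = rng.normal(0.0, 0.5, size=50)
y_ref = X_ref @ true_w + rng.normal(0.0, 1.0, size=200)

mu, col_means, w = estimate_snp_effects(X_ref, y_ref, ridge=1.0)

# Candidates have genotypes but no phenotypes.
X_cand = rng.integers(0, 3, size=(20, 50))
gebv = predict_gebv(X_cand, mu, col_means, w)
best = np.argsort(gebv)[::-1][:5]           # indices of the 5 highest GEBV
```

Selecting on `best` corresponds to the last arrow of the slide; in practice the shrinkage parameter would be derived from variance components rather than fixed by hand.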

Advantages:

● Select at early age (before phenotypes available)
● Save costs to phenotype candidates
● Increase accuracy of predicted Breeding Values

Goddard & Hayes (2009) Nature Reviews Genetics 10:381

One seminal paper on Genomic Prediction

Simulation Study
● Dense marker maps
● SNP markers at 1 cM density
● Prediction Accuracy
  ○ Least Squares method: 0.32
  ○ Genomic BLUP method: 0.73
  ○ Bayesian methods (A, B): 0.85

Conclusion: “selection on genetic values predicted from markers could substantially increase the rate of genetic gain in animals and plants, especially if combined with reproductive techniques to shorten the generation interval”

Another (seminal) paper on Genomic Prediction

“In the case of whole-genome sequence data, the polymorphisms that are causing the genetic differences between the individuals are among those being analyzed.”
● Higher accuracy in genomic predictions since causal mutation is included (assumption)
● No dependency on LD
  ○ Persistency across generations
  ○ Genomic prediction across breeds

Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps
T. H. E. Meuwissen,* B. J. Hayes† and M. E. Goddard†,‡
“Only few SNPs were useful for predicting the trait [because they were in linkage disequilibrium (LD) with mutations causing variation in the trait] while many SNPs were not useful.”

Genomic predictions from whole-genome sequence data

● Tremendous increase in number of SNPs (more noise)
● Large (sequence) data are required

Solution
● Sequence core set of individuals (e.g. founders)
● Impute whole-genome sequence genotypes of other individuals

Accuracy of imputation to whole-genome sequence data was generally high for imputation from 777K SNP panel
Van Binsbergen, et al. Genet Sel Evol 2014 (in press)

This presentation:

First results of genomic prediction with imputed whole-genome sequence data for 5503 bulls with accurate phenotypes

Dataset: SNP genotypes & trait phenotypes

Genotyped animals: 5503 Holstein Friesian bulls, 777K SNP genotypes (Illumina BovineHD BeadChip)

Imputation reference: 1000 bull genomes project, 429 bulls (multiple breeds), 28M SNP genotypes

Imputation: Beagle v4 software

Result: 5503 Holstein Friesian bulls, 12M SNP genotypes (MAF > 0.005, imputation accuracy > 0.05)
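The two thresholds on this slide (MAF > 0.005 and imputation accuracy > 0.05) can be sketched as a simple SNP filter. This is an illustration under assumptions: genotypes are taken to be 0/1/2 allele counts, a per-SNP imputation accuracy vector is assumed to be available (the slides do not say how it was computed), and the function names are hypothetical.

```python
import numpy as np

def minor_allele_frequency(genotypes):
    """MAF per SNP from an (n_individuals, n_snps) matrix of 0/1/2 allele counts."""
    freq = np.asarray(genotypes, dtype=float).mean(axis=0) / 2.0
    return np.minimum(freq, 1.0 - freq)

def filter_snps(genotypes, imputation_accuracy, maf_min=0.005, acc_min=0.05):
    """Keep SNPs that pass both the MAF and the imputation-accuracy threshold."""
    maf = minor_allele_frequency(genotypes)
    keep = (maf > maf_min) & (np.asarray(imputation_accuracy) > acc_min)
    return np.asarray(genotypes)[:, keep], keep

# Toy example: 4 animals, 4 SNPs.
toy_genotypes = np.array([
    [0, 0, 2, 1],
    [0, 1, 2, 1],
    [0, 0, 2, 2],
    [0, 1, 2, 0],
])
toy_accuracy = np.array([0.9, 0.8, 0.9, 0.01])  # hypothetical per-SNP values
filtered, keep = filter_snps(toy_genotypes, toy_accuracy)
```

In the toy data the first and third SNP are monomorphic and the fourth fails the accuracy threshold, so only the second SNP is kept.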

De-regressed progeny based proofs (DRP 1) and associated effective daughter contributions (EDC 2)
● Somatic cell score (SCS)
● Interval between first and last insemination (IFL)
● Protein yield (PY)

1 VanRaden et al. 2009 (J Dairy Sci)
2 VanRaden and Wiggans 1991 (J Dairy Sci)

Prediction reliability

= squared correlation between original phenotype (DRP) and estimated genetic values (GEBV)

Both genotype sets cover the same 5503 Holstein Friesian bulls:
● 777K SNP genotypes (Illumina BovineHD BeadChip)
● 12M SNP genotypes (MAF > 0.005, imputation accuracy > 0.05)

Each set is split into:
● Training population: 4322 old bulls
● Validation population: 1181 young bulls

Validation population
● Youngest bulls with EDC
● Mainly sons of bulls in training population
● Mimics breeding practice
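The reliability definition and the old/young split can be sketched as below. The squared correlation follows the definition on this slide; the split by birth year is an assumption (the slides only say "old" and "young" bulls), and all names and numbers are hypothetical toy values.

```python
import numpy as np

def prediction_reliability(drp, gebv):
    """Squared Pearson correlation between DRP and GEBV."""
    drp = np.asarray(drp, dtype=float)
    gebv = np.asarray(gebv, dtype=float)
    r = np.corrcoef(drp, gebv)[0, 1]
    return r ** 2

def split_old_young(birth_years, n_validation):
    """Return (training indices, validation indices); the youngest animals validate."""
    order = np.argsort(np.asarray(birth_years), kind="stable")  # oldest first
    return order[:-n_validation], order[-n_validation:]

# Toy example with hypothetical birth years.
birth_years = np.array([2001, 1999, 2005, 2003])
train_idx, val_idx = split_old_young(birth_years, n_validation=1)

# Reliability is evaluated on validation animals only; toy values here.
toy_drp = np.array([1.0, 2.0, 3.0, 4.0])
toy_gebv = np.array([1.1, 1.9, 3.2, 3.8])
toy_reliability = prediction_reliability(toy_drp, toy_gebv)
```

With the numbers from the slide, `n_validation` would be 1181 and the remaining 4322 bulls would form the training set.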

Genomic prediction – 2 methods

GBLUP
● Genome-enabled best linear unbiased prediction
● Distribution of QTL effects assumed to be close to the infinitesimal model (all SNPs have an equally small effect)
● Build a genomic relationship matrix to model the variance-covariance structure

BSSVS
● Bayes stochastic search variable selection
● Large number of SNPs with tiny (close to zero) effects and a few SNPs with moderate effects (= mixture of two Normal distributions)
● Implementation via Markov chain Monte Carlo (MCMC) simulation algorithms (computer intensive)
● 3 chains of 60,000 cycles (10,000 cycles burn-in)

Calus M (2014). Right-hand-side updating for fast computing of genomic breeding values. Genetics Selection Evolution 46(1): 24.
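The GBLUP idea of building a genomic relationship matrix and using it to predict validation animals can be sketched as follows. This is not the implementation used for the results in this talk: the relationship matrix follows a commonly used allele-frequency scaling, EDC weights are ignored, the variance ratio `lam` is a hypothetical fixed value, and the data are simulated.

```python
import numpy as np

def genomic_relationship_matrix(genotypes):
    """G = Z Z' / (2 * sum_j p_j (1 - p_j)), with Z the centred 0/1/2 genotypes."""
    X = np.asarray(genotypes, dtype=float)
    p = X.mean(axis=0) / 2.0                 # allele frequency per SNP
    Z = X - 2.0 * p                          # centre each SNP column
    scale = 2.0 * np.sum(p * (1.0 - p))
    return Z @ Z.T / scale

def gblup_predict(G, y_train, train_idx, val_idx, lam=1.0):
    """Predict validation animals from training phenotypes.

    lam is the residual-to-genetic variance ratio (hypothetical value).
    """
    y = np.asarray(y_train, dtype=float)
    mu = y.mean()
    G_tt = G[np.ix_(train_idx, train_idx)]   # training x training block
    G_vt = G[np.ix_(val_idx, train_idx)]     # validation x training block
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train_idx)), y - mu)
    return mu + G_vt @ alpha

# Simulated toy data (assumption: 60 animals, 300 SNPs).
rng = np.random.default_rng(1)
n_animals, n_snps = 60, 300
X = rng.binomial(2, 0.3, size=(n_animals, n_snps))
w = rng.normal(0.0, 0.2, size=n_snps)
y = X @ w + rng.normal(0.0, 1.0, size=n_animals)

G = genomic_relationship_matrix(X)
train_idx = np.arange(50)
val_idx = np.arange(50, 60)
gebv_val = gblup_predict(G, y[train_idx], train_idx, val_idx, lam=1.0)
```

Because G has one row and column per animal, its size does not grow with the number of SNPs, although building it still requires one pass over all genotypes.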

Computation

GBLUP
● 777K SNP: HPC – 1 node, ~ 3 hours, ~ 32 GB RAM
● 12M SNP: HPC – 12 nodes, ~ 6 hours, ~ 600 GB RAM

BSSVS (per MCMC chain; 3 chains of 60,000 cycles, 10,000 cycles burn-in)
● 777K SNP: Windows – 1 CPU, ~ 5 days, ~ 1.6 GB RAM
● 12M SNP: HPC – 1 node, ~ 50 days, ~ 32 GB RAM

Hardware
● Windows 7 Enterprise desktop pc: 32 CPU – 8 GB RAM/CPU (clock speed 2.60 GHz)
● HPC Linux cluster: Normal nodes – 64 GB/node (2.60 GHz); 2 fat nodes – 1 TB RAM/node (2.20 GHz)
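A back-of-envelope calculation gives a feel for why memory becomes a bottleneck with sequence data. The sketch below computes the size of a dense genotype matrix for 5503 bulls; the storage widths (1 byte and 8 bytes per genotype) are assumptions for illustration, the SNP counts are the rounded figures from the slides, and nothing here is claimed about how the software used in this study actually stores genotypes.

```python
def dense_matrix_gb(n_rows, n_cols, bytes_per_value):
    """Size in GB (10^9 bytes) of a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * bytes_per_value / 1e9

n_bulls = 5503
panels = {"777K SNP": 777_000, "12M SNP": 12_000_000}

# Size for 1-byte integers and 8-byte floating point values.
sizes = {
    name: {width: dense_matrix_gb(n_bulls, n_snps, width) for width in (1, 8)}
    for name, n_snps in panels.items()
}

for name, by_width in sizes.items():
    print(name, {width: round(gb, 1) for width, gb in by_width.items()})
```

Going from 777K to 12M SNPs multiplies the footprint by roughly 15, whatever the storage width.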

Results: Prediction Reliability

[Figure: bar chart of prediction reliability (y-axis from 0.0 to 0.6) for SCS, IFL and PY. Series: BovineHD GBLUP, BovineHD BSSVS, Sequence GBLUP, Sequence BSSVS *]

BSSVS: Average over 3 chains of 60,000 cycles (10,000 cycles burn-in)
* Based on 45,000 cycles


BSSVS: Convergence & SNP effects

[Figure: panels "Trace of variance of SNP effects" and "Bayes Factor for SNP effects", each shown for 777K SNP and 12M SNP]

3 chains of 60,000 cycles (10,000 cycles burn-in); Sequence: 45,000 cycles
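Handling several MCMC chains with a burn-in period can be sketched as below: discard the first 10,000 cycles of each chain, average the remainder, and compare chains. The Gelman-Rubin statistic used here is a widely used convergence diagnostic and is an addition for illustration; the slides only show trace plots. The traces are simulated stand-ins, not results from this study.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an (n_chains, n_samples) array."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)     # B: between-chain variance
    var_hat = (n - 1) / n * within + between / n
    return np.sqrt(var_hat / within)

# Simulated traces standing in for "variance of SNP effects" (assumption).
rng = np.random.default_rng(2)
n_chains, n_cycles, burn_in = 3, 60_000, 10_000
chains = rng.normal(1.0, 0.1, size=(n_chains, n_cycles))

kept = chains[:, burn_in:]                  # drop the burn-in cycles
posterior_mean = kept.mean(axis=1).mean()   # average over the 3 chains
rhat = gelman_rubin(kept)

# A chain stuck at a different level inflates the statistic.
shifted = kept + np.array([[0.0], [0.0], [0.5]])
rhat_shifted = gelman_rubin(shifted)
```

Values close to 1 suggest the chains sample the same distribution; clearly larger values, as for `rhat_shifted`, point to the kind of convergence problem discussed for the sequence data.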

Suitability of BSSVS model?

● Large number of SNPs with tiny and a few SNPs with moderate effects
  ○ Sequence data: Really large number of SNPs with tiny effects → Captures too much signal?

● Another Bayesian Prediction Model: Bayes-C
  ○ Large number of SNPs with NO effect and a few SNPs with moderate effects
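The difference between the two priors can be sketched by drawing SNP effects from each. Under the BSSVS-style prior every SNP gets an effect, most of them from a narrow Normal distribution; under the Bayes-C-style prior most SNPs get exactly zero. The mixing proportion `pi` and the standard deviations are hypothetical values chosen for illustration, not the settings used in this study.

```python
import numpy as np

def draw_bssvs_effects(n_snps, rng, pi=0.01, sd_large=1.0, sd_small=0.1):
    """Mixture of two Normals: N(0, sd_large^2) with prob. pi, else N(0, sd_small^2)."""
    in_large = rng.random(n_snps) < pi
    sd = np.where(in_large, sd_large, sd_small)
    return rng.normal(0.0, sd), in_large

def draw_bayesc_effects(n_snps, rng, pi=0.01, sd_large=1.0):
    """Spike and slab: N(0, sd_large^2) with prob. pi, else exactly zero."""
    in_large = rng.random(n_snps) < pi
    effects = np.where(in_large, rng.normal(0.0, sd_large, n_snps), 0.0)
    return effects, in_large

rng = np.random.default_rng(3)
n_snps = 100_000
bssvs_effects, bssvs_large = draw_bssvs_effects(n_snps, rng)
bayesc_effects, bayesc_large = draw_bayesc_effects(n_snps, rng)

# Share of the total squared effect carried by the many "tiny" SNPs.
tiny_share_bssvs = np.sum(bssvs_effects[~bssvs_large] ** 2) / np.sum(bssvs_effects ** 2)
tiny_share_bayesc = np.sum(bayesc_effects[~bayesc_large] ** 2) / np.sum(bayesc_effects ** 2)
```

With these toy settings the tiny-effect SNPs carry a sizeable share of the total under the two-Normal mixture and none under the spike-and-slab prior, which illustrates the "captures too much signal?" question when the number of SNPs becomes very large.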

Concentrate on single chromosome (BTA 6)

[Figure: MCMC convergence; panels for BSSVS and Bayes-C, each shown for 777K SNP and 12M SNP]

Concentrate on single chromosome (BTA 6)

[Figure: Signal of QTL effects; panels for BSSVS and Bayes-C, each shown for 777K SNP and 12M SNP]

Reliability estimates

            BSSVS    Bayes-C
BovineHD    0.328    0.328
Sequence    0.324    0.325

Conclusions

● Genomic prediction using sequence data becomes reality
  ○ However, sequence data requires intensive computation → Need for faster algorithms
● Use of sequence data did not improve prediction reliability
  ○ Convergence issues with BSSVS → Longer chains may yield better results
● BSSVS slightly better compared to GBLUP
● Preliminary results for BTA6 hint that the Bayes-C method may work better (than BSSVS) for sequence data

Next Steps: Did we bet on the wrong horse - named BSSVS?
● Review choice of priors in BSSVS model
● Apply Bayes-C model to whole genome sequence data

Thanks!

Acknowledgments

1000 bull genomes project

(www.1000bullgenomes.com)

De-regressed proofs (DRP)

$$DRP = PA + (EBV - PA) \cdot \frac{EDC_{EBV}}{EDC_{prog}}$$

where PA is the parent average, EBV the estimated breeding value, and EDC the effective daughter contribution.

VanRaden et al. 2009 (J Dairy Sci)

Effective daughter contribution (EDC)

$$EDC_{EBV} = \alpha \cdot \frac{REL_{EBV}}{1 - REL_{EBV}}, \qquad \alpha = \frac{4 - h^2}{h^2}$$

$$EDC_{prog} = EDC_{EBV} - EDC_{PA}$$

where $REL_{EBV}$ is the published reliability of the EBV, and $EDC_{PA}$ is based on the reliability of the parents, $(REL_{sire} + REL_{dam})/4$.

VanRaden and Wiggans 1991 (J Dairy Sci)
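The DRP and EDC relations on this slide can be worked through numerically as follows. One step is an assumption: the slide only says that the parent-average EDC is "based on reliability of parents", so the sketch converts the parent-average reliability (REL_sire + REL_dam)/4 to an EDC with the same formula as for the EBV. Function names and input values are hypothetical.

```python
def edc_from_reliability(rel, h2):
    """EDC = alpha * REL / (1 - REL), with alpha = (4 - h2) / h2."""
    alpha = (4.0 - h2) / h2
    return alpha * rel / (1.0 - rel)

def deregressed_proof(ebv, pa, rel_ebv, rel_sire, rel_dam, h2):
    """Return (DRP, EDC_prog) for one bull."""
    edc_ebv = edc_from_reliability(rel_ebv, h2)
    rel_pa = (rel_sire + rel_dam) / 4.0
    # Assumption: same reliability-to-EDC conversion for the parent average.
    edc_pa = edc_from_reliability(rel_pa, h2)
    edc_prog = edc_ebv - edc_pa
    drp = pa + (ebv - pa) * edc_ebv / edc_prog
    return drp, edc_prog

# Hypothetical bull: EBV 2.0, parent average 1.0, heritability 0.25.
drp, edc_prog = deregressed_proof(
    ebv=2.0, pa=1.0, rel_ebv=0.9, rel_sire=0.99, rel_dam=0.41, h2=0.25
)
```

Since EDC_EBV is larger than EDC_prog, the ratio exceeds 1 and the DRP moves further from the parent average than the EBV does, which is the de-regression effect.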