Efficient downstream data analyses of massive whole genome datasets
Timo Kanninen, BC Platforms Inc, Director, Research
The Company

BC Platforms Inc
• BC Platforms is a Finnish bioinformatics software company since 1997
  – Born from a pioneering diabetes project of MIT (Eric Lander), Millennium Pharmaceuticals and a unique patient cohort in Finland (Botnia project)
• Pioneer of data management and analysis software and bioinformatics services for genomic research
• The company focuses on three market segments:
  – Academics, Healthcare, Pharma / industry
• All segments face the same challenge: integrating, managing and analyzing growing, complex big data in a systematic and secure manner
• Headquarters and R&D are located in Espoo, Finland. The company has sales offices in London (UK), Vancouver (Canada), Basel (Switzerland) and San Diego (USA).
• We have delivered solutions to 18 countries and have over 80 customer sites.

Customer cities and sales representation

Examples of our references
Strong customer base with satisfied customers

USA, Canada, Australia
• Johns Hopkins, Asthma and Allergy Center, Baltimore
• Emory University, Atlanta
• NIHGRI, USA
• Penn State University, PA
• Dartmouth College, NH
• Hospital for Sick Children, Toronto
• Diamantina Institute, University of Brisbane, Australia

Europe
• CMM, Karolinska Institute, Sweden
• Imperial College, London, UK
• King's College, London, UK
• CIGMR, University of Manchester, UK
• University of Leeds, UK
• Trinity College, Dublin, Ireland
• GSF München, Germany
• University of Bonn, Germany
• Max Planck Institute, Germany
• Gulbenkian Institute, Portugal
• Lund University, Sweden
• Universities of Helsinki, Tampere and Turku, Finland

Our platform solutions are used in 18 countries at over 50 leading institutions globally.
©BC Platforms 2014

BC is the key player in many genomics related programs
• EU FP7 / IMI projects
• eMERGE (Electronic Medical Record and Genomics Network), USA
• NUgene
• BioVU

The Platform

Key features of the BC|ENTERPRISE Platform
• Data integration
  – Data integration needed for research and healthcare applications
  – Access to Electronic Healthcare Records (EHR)
  – Access to various different data sources in different places
  – Access to on-line data collected from patients
• Data security
  – Complies with US (HIPAA), EU and other laws on patient data security
  – Modifications to comply with additional requirements are possible
• Efficient management and analyses of massive datasets
  – Data amounts are growing in three dimensions
  – BigData features: tiled datasets, distributed data analyses, compression, virtual filesystem etc
  – Bioinformatics workflows (imputing, alignment/variation calling etc)
• Application programming interface (API)
  – For adding 3rd party applications

BC|ENTERPRISE PLATFORM (architecture diagram, labels as extracted)
• RESEARCH APPLICATIONS, HEALTHCARE APPLICATIONS (diagnostics, prediction models etc), INDUSTRY APPLICATIONS
• BC|DATA etc, BC|GENOME etc
• Application Programming Interface
• LOCAL SERVER: Data Catalog (pointers to data objects), Central SQL database
• VIRTUAL FILESYSTEM: Local storage, Virtual files, UCSC reference genome, Hosted storage
• DATA INTEGRATION LAYER: Mirth Connect etc
• EXTERNAL DATA SOURCES: EHR, External NGS & Genotyping data, On-line patient health info
• COMPUTING CLUSTER

Different types of data managed in one point

Supported data types
• Next generation sequence data
• FASTQ, BAM and VCF files
• Genotype and -omics data
• SNP arrays, microsatellites, variations (indels)
• Copy Number Variation (CNV)
• Expression, RNA-seq, metabolomics, methylation, etc
• Clinical data and pedigrees:
• Vendor independent communication via standard interfaces
• HL7, DICOM, XML, XDS etc.
• Case Record Forms (CRF) and laboratory values
• Genealogy
• Annotations
• References, maps, etc

www.bcplatforms.com

Scalable workgroup solution
• Secure data and analysis results sharing
  – Dataset, subset, and user specific limitations to data access
  – Log files (full audit trail, data trail, edit trail)
  – Data analyses can be performed inside the system, without exporting data
• Local or global collaborating environment
  – Tool for small and large research workgroups
  – Local server inside the firewall for local projects
  – Hosted server connected to the internet for global collaborating projects
• Scalable, multiuser database and data analysis system
  – Both the database and the data analysis can be distributed to multiple servers to handle very large projects

Professional support
• User support and training
  – Email, phone and webex support
  – On-site training session when the system is taken into use, webex training for new people
• Software updates
  – Support for selected new analysis tools and new versions of existing tools
  – Support for new genotyping chips, sequencing devices etc
  – Updates of annotations (dbSNP maps, 1000G references etc)
• Technical maintenance and problem solving
  – Maintenance work
  – Complex problem solving
  – Hardware expansions, migration to a new server etc

Information Security

Introduction and background
• Changes in legal environment and client awareness
  – Attitude change on societal level through data security breaches, patient consent issues etc.
  – New and stricter laws regarding the handling of personally identifiable, sensitive information:
    • HIPAA privacy rule in the U.S., new EU directives etc
  – New security policies of universities
• Changes in the operative environment
  – Requirements imposed by research collaboration and partner networks
  – Data integration needs due to closer connection with clinical use and the clinical user environment
  – Biobank networks
• Changes in the research data itself
  – Challenges of data management and analysis in a systematic and secure manner in the face of growing amounts and growing complexity of data
  – Sensitivity of NGS data

“More and more time is spent on administrative and legal issues of data management instead of data analysis”

General rules and safeguards
• Objectives
  – Ensure the confidentiality, integrity and availability of data
  – Identify and protect against reasonably anticipated threats to security and integrity
  – Protect against reasonably anticipated, impermissible uses or disclosures
• Measures
  – Risk analysis and management
  – Administrative safeguards
  – Security personnel
  – Security awareness and training of personnel
  – Physical safeguards
  – Facility access and control
  – Workstation and device security
  – System security
  – Network security

Typical minimum system requirements
• Access controls: only personnel who need to see sensitive information to accomplish their work are allowed to see it.
• User access rights and auditing (personal user IDs, strong passwords, reports on login successes and failures, automatic logoffs etc.)
• Different user profiles for different uses of the system
• Different access rights to different types of data
• Log file information
• What data has been viewed, when and by whom
• Data encryption and virus protection
• Encrypted internet browser connection (https)
• Security updates
• Data backup and disaster recovery plans

More advanced system requirements #1
• Access controls: broader use of the principle of least privilege
  – Tendency to use more fine-grained user profiles (e.g. a dedicated account for making backups)
  – Tighter access controls for remote use, e.g. VPN
• Log file information
  – Regular inspections of audit logs for unauthorized access as a further administrative safeguard
• Separation of the application and the database onto different servers
• Hardening of physical access, operating system and database
  – Hardened installation requirements (following vendor security recommendations)

More advanced system requirements #2
• Data encryption
  – Encryption of server disks
  – Application level encryption: sensitive data stored encrypted in the database
• Security updates and virus protection
  – Security updates for underlying database software
  – Anti-virus software required on all servers
• Backup encryption

Data Integration
(BC|ENTERPRISE platform architecture diagram)

Simple data integration
• Attaching distant data sources to the federated database
  – Data with a predefined structure is stored in a file
  – This file is configured to be visible to BCFS (URL, etc)
  – A data gateway server (virtual server etc) can also be used
  – When the file is changed, it is fetched and cached in the local database
  – By using drop and load functions, operations are very fast
• SQL queries
  – SQL queries can join data from different sources
  – Queries performed through the BC system trigger data updates automatically
  – For queries done via ODBC, tables need to be refreshed manually

Federated database
• IBM DB2 SQL database federation feature
  – Facilitates integration of large data sources
  – Data sources can be SQL databases or files
  – Needs direct block storage access to the data
  – Instead of copying the whole file to the local database, the SQL query engine transfers only the data needed to perform the query
  – Implementation is transparent to the SQL database, whether it is used through the BC system or directly with ODBC
  – Works together with the simple integration model

DataPump
• Automated data upload
  – Java software (can be installed on any OS)
  – Polls the content of the configured folder
  – Uploads all new files to the BC server using HTTPS (encrypted connection)
  – Works behind firewalls: most organisations keep the outbound HTTPS port open
  – Java code available for auditing by the customer
  – Polling interval can be a few seconds, depending on the size of the data file
• Data processing
  – When data is uploaded, the BC server can perform the required preprocessing and write the data to the database
  – Log files are kept of all DataPump operations and of changes made to the data

Healthcare data integration
• Mirth Connect
  – Swiss Army knife of healthcare integration engines, specifically designed for HL7 message integration
  – Open source, commercial support available
• Connect to any system over any protocol
  – MLLP, HTTP, Database, Email, PDF, JMS, TCP/IP, Web Services (SOAP), File System, (S)FTP, RTF, DICOM
  – Support for custom connectors written using Java or JavaScript
• Easily transform, filter, and route your data
  – HL7 v2.x, HL7 v3, CDA, CCR, DICOM, X12, Delimited Text, CCD, XML, NCPDP, EDI, Raw ASCII or
Binary

Scalable Data Analysis

Relevant research by BC Platforms

Scalable Data Analysis
(BC|ENTERPRISE platform architecture diagram)

Using academic algorithms instead of developing your own …
… but solving the major problems:
• lack of scalability
• need for direct access to original data (data security)
• data format related issues

Distributed data analysis
• Support for 30+ academic analysis packages
  – Data format conversions
  – Easy www interface for running analyses
  – New/custom analysis packages can be added
• Efficient, distributed data analysis
  – General framework for splitting analysis tasks into small segments
  – Massively parallel data analysis
• Support for different calculation environments
  – Linux boxes/servers
  – Institution calculation cluster
  – Cloud

Some supported analysis methods
o Case-control analysis: PLINK v1/v2, PLINK multivariate, Haploview, WG-Permer, METAL, HAPPY, QCtool, EPACTS, GCTA
o Linkage analysis: Merlin, Allegro, GENEHUNTER, Simwalk2, Solar
o Epistasis: PLINK, BEAM
o Imputing: MaCH, IMPUTE, BEAGLE, Minimac, ShapeIT 2
o Association analysis of probabilistic genotypes: SNPtest, MaCH2QTL, ProbABEL
o TDT analysis: FBAT / PBAT, QTDT
o Haplotyping: PHASE, fastPHASE
o CNV analysis: PLINK CNV, PennCNV, QuantiSNP
o Next-generation sequencing (NGS): VAT, PLINK/SEQ, ANNOVAR, Granvil, SKAT, PolyPhen2, SIFT, SAMtools, BWA and BWA-SW, GATK, data export/import in the Variant Call Format (VCF), EPACTS, SNPeff, PLATYPUS
o Mendelian checking on inheritance patterns: PLINK, PEDCHECK, Merlin
o Population stratification: Eigensoft (Eigenstrat, smartPCA), PLINK, STRUCTURE, RELPAIR
o Pedigree drawing and analysis: Cranefoot, Pelican, Madeline v2
o Scripting: R scripts, GenABEL, SAS scripts
o Result visualization: Haploview, UCSC browser, GWAS central, WGAviewer
o Data format conversions: Mega2
o Data export: ASCII, SPSS, STATA, Excel (SYLK)

BigData analyses
• Analyses of BigData
  – When the number of subjects with whole genome data (genotypes or sequences) exceeds, say, 50 000, new approaches are needed
  – Data needs to be tiled, and tiles need to be analyzed in parallel
• Workflows
  – Distributed imputing workflows (IMPUTE, SHAPEIT2, MaCH, BEAGLE)
  – Distributed BWA/GATK workflow
  – Custom workflows can be added
• Data export in different formats
  – Data can be exported in many different formats for command line use
  – New export formats can be easily added by us or by the customer

Some large projects
• eMERGE network data analysis
  – 75 000 subjects, 32 M imputed variants / subject = 2400 billion variants
• MIMOmics (EU FP7 funded collaboration project)
  – Budget 15 MEUR, 20 European universities
  – BC co-leads: “Data integration and distributed data analysis”
  – -omics data (GWAS, WES, WGS, lipidomics, N-glycans, NMR, metabolomics, glycomics, telomere, expression, miRNA, methylation)
• SUMMIT (IMI project)
  – Budget 35 MEUR, 6 pharma companies, 19 universities, BC as database vendor
  – Prediction model for diabetes type 2 complications
  – 70 000 patients with various types of data

Tiled Dataset
(Figure: grid of tiles, genomic windows in columns and subject blocks in rows)
• Columns: Chr 1 (Mbp) 0-3, 3-6, 6-9, ... Chr 22 (Mbp) 40-43, 43-46, 46-49
• Rows (SUBJECT): 1-5 000, 5 001-10 000, 10 001-15 000, ..., 495 000-500 000
• 500 000 subjects * 3 Mbp window => 100 000 segments

Distributed/parallel data analysis
(Figure: TILED DATASET feeding a CALCULATION CLUSTER)
• Each calculation node handles one tile at a time!
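The tiling arithmetic on the slides above can be sketched in Python. This is a minimal illustration under stated assumptions, not BC Platforms' implementation: the names `Tile`, `make_tiles`, `analyze_tile` and `run_all` are hypothetical, the block size of 5 000 subjects is read off the tiled dataset figure, and a genome length of roughly 3 000 Mbp is assumed in order to reproduce the 100 000 segments quoted on the slide. A local thread pool stands in for the calculation cluster and its job queue.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tile:
    """One rectangle of the dataset: a genomic window times a block of subjects."""
    chrom: str
    start_mbp: int
    end_mbp: int
    first_subject: int
    last_subject: int


def make_tiles(chrom_lengths_mbp: Dict[str, int], n_subjects: int,
               window_mbp: int = 3, subjects_per_tile: int = 5000) -> List[Tile]:
    """Cut the (genome x subjects) matrix into tiles (hypothetical helper)."""
    tiles = []
    for chrom, length in chrom_lengths_mbp.items():
        for start in range(0, length, window_mbp):
            end = min(start + window_mbp, length)
            for first in range(1, n_subjects + 1, subjects_per_tile):
                last = min(first + subjects_per_tile - 1, n_subjects)
                tiles.append(Tile(chrom, start, end, first, last))
    return tiles


def analyze_tile(tile: Tile) -> int:
    """Placeholder for a real analysis tool run on one tile.

    Here it only returns subjects x window size, so the results can be checked.
    """
    n_subjects = tile.last_subject - tile.first_subject + 1
    return n_subjects * (tile.end_mbp - tile.start_mbp)


def run_all(tiles: List[Tile], workers: int = 4) -> List[int]:
    """Each worker handles one tile at a time; results come back in tile order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_tile, tiles))


if __name__ == "__main__":
    # Assumption: the whole genome treated as one ~3 000 Mbp stretch.
    slide_tiles = make_tiles({"genome": 3000}, n_subjects=500_000)
    print(len(slide_tiles))  # 1 000 windows x 100 subject blocks = 100000

    # Toy dataset small enough to run the placeholder analysis quickly.
    toy_tiles = make_tiles({"chr1": 9, "chr22": 7}, n_subjects=12_000)
    print(sum(run_all(toy_tiles)))
```

Because every tile is independent, the same segmentation could be handed to a cluster scheduler instead of a thread pool; only `run_all` would change.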
Distributed BigData analysis
(Figure: workflow between researcher, calculation cluster and filesystem, labels as extracted)
• RESEARCHER: Select data analysis tool, Select data (genetic and other), Define data analysis parameters, Define used parallelization
• CALCULATION CLUSTER: Job segmenting, Job queue, Data, Results
• VIRTUAL, OBJECT BASED FILESYSTEM
• More than 30 academic statistical genomics tools supported
• Own tools are easy to add
• Existing calculation clusters can be used

Applications
(BC|ENTERPRISE platform architecture diagram)

Modular BC architecture
• Data integration platform
  – BC|ENTERPRISE: integration of genomic, clinical (Electronic Patient Record etc) and other data sources, connections to storage solutions, calculation clusters etc
• Sample information management
  – BC|SAMPLE: simple LIMS optimized for research and biobanking
• Clinical data collection
  – BC|CLIN & CRF editor for online data collecting, importing patient registers
  – Madeline pedigree drawing tool, support for 3rd party tools (Progeny, PASS)
• Sequence, genotype, -omics and other related information
  – BC|DATA: scalable, secure LifeScience data warehouse
  – BC|GENOME: advanced analysis tool for genetic epidemiology
• Biobanks

Sample data management – BC|SAMPLE
© 2012 all rights reserved

Simple layout – easy navigation
• Tables and Filters
• Import, edit, report
• Flexible table structure
• Quick search
• Spreadsheet interface
• On-line editing
• Alerts

Data model editor

Freezers, boxes, and plates
• Track sample location
• Container hierarchy
• Automatic placement of new samples
• Easy editing
• Many container types
Label editor – robust and platform independent
• Easy drag'n'place editing
• 1D/2D barcodes
• Label content fetched from the database
• Print labels for multiple samples in one go

Phenotype/clinical data collection and management – BC|CLIN

CRF (Case Record Form) Editor

Data import wizard or data entry

Queries and export
© 2011 all rights reserved

3rd party pedigree drawing – PASS software and Progeny

Pedigree editor – Madeline v2

Genome data management and analysis – BC|GENOME

Queue system for large data uploads and analysis

Example: 1000 Genomes imputation using MaCH

Association: define cases and controls ...
… and analysis parameters …
... and retrieve and visualize results

Clinical Genomics (3rd party App by Euformatics)

Application Programming Interface
(BC|ENTERPRISE platform architecture diagram)

End of presentation