Efficient downstream data analyses of massive whole genome datasets Timo Kanninen BC Platforms Inc Director, Research.


Efficient downstream data analyses of massive whole
genome datasets
Timo Kanninen
BC Platforms Inc
Director, Research
The Company
BC Platforms Inc
• BC Platforms is a Finnish bioinformatics software company, founded in 1997
– Born from a pioneering diabetes project of MIT (Eric Lander), Millennium Pharmaceuticals and a unique patient cohort in Finland (Botnia project)
• Pioneer of data management and analysis software and bioinformatics services for genomic research
• The company focuses on three market segments:
– Academia, healthcare, pharma / industry
• All segments share the same challenge: integrating, managing and analyzing growing, complex big data in a systematic and secure manner
• Headquarters and R&D are located in Espoo, Finland. The company has sales offices in London (UK), Vancouver (Canada), Basel (Switzerland) and San Diego (USA).
• We have delivered solutions to 18 countries and have over 80 customer sites.
Customer cities and sales representation
Examples of our references
Strong customer base with satisfied customers
USA, Canada, Australia
• Johns Hopkins, Asthma and Allergy Center, Baltimore
• Emory University, Atlanta
• NIHGRI, USA
• Penn State University, PA
• Dartmouth College, NH
• Hospital for Sick Children, Toronto
• Diamantina Institute, University of Brisbane, Australia
Europe
• CMM, Karolinska Institute, Sweden
• Imperial College, London, UK
• King's College, London, UK
• CIGMR, University of Manchester, UK
• University of Leeds, UK
• Trinity College, Dublin, Ireland
• GSF München, Germany
• University of Bonn, Germany
• Max Planck Institute, Germany
• Gulbenkian Institute, Portugal
• Lund University, Sweden
• Universities of Helsinki, Tampere and Turku, Finland
Our platform solutions are used in 18 countries at over 50 leading institutions globally.
©BC Platforms 2014
BC is the key player in many genomics related programs
EU FP7 / IMI projects
eMERGE (Electronic Medical Record and Genomics Network), USA
NUgene
BioVU
The Platform
Key features of the BC|ENTERPRISE Platform
• Data integration
– Data integration needed for research and healthcare applications
– Access to Electronic Healthcare Records (EHR)
– Access to various data sources in different locations
– Access to on-line data collected from patients
• Data security
– Complies with US (HIPAA), EU and other laws on patient data security
– Can be modified to comply with additional requirements
• Efficient management and analyses of massive datasets
– Data amounts are growing in three dimensions
– BigData features: tiled datasets, distributed data analyses, compression, virtual filesystem etc
– Bioinformatics workflows (imputation, alignment/variant calling etc)
• Application programming interface (API)
– For adding 3rd party applications
[Figure: BC|ENTERPRISE platform architecture. Application layer: research, healthcare and industry applications (BC|DATA, BC|GENOME, diagnostics, prediction models etc) on top of the Application Programming Interface. Local server: Data Catalog (pointers to data objects), central SQL database and virtual filesystem (local storage, hosted storage, virtual files, UCSC reference genome). Data integration layer: Mirth Connect etc. External data sources: EHR, external NGS and genotyping data, on-line patient health info. Computing cluster.]
Different types of data managed in one place
Supported data types
• Next generation sequence data
– FASTQ, BAM and VCF files
• Genotype and -omics data
– SNP arrays, microsatellites, variations (indels)
– Copy Number Variation (CNV)
– Expression, RNA-seq, metabolomics, methylation etc
• Clinical data and pedigrees
– Vendor independent communication via standard interfaces (HL7, DICOM, XML, XDS etc)
– Case Record Forms (CRF) and laboratory values
– Genealogy
• Annotations
– References, maps etc
www.bcplatforms.com
Scalable workgroup solution
• Secure data and analysis results sharing
– Dataset, subset, and user specific limitations to data access
– Log files (full audit trail, data trail, edit trail)
– Data analyses can be performed inside the system, without exporting the data
• Local or global collaboration environment
– Tool for small and large research workgroups
– Local server inside the firewall for local projects
– Hosted server connected to the internet for global collaboration projects
• Scalable, multiuser database and data analysis system
– Both the database and the data analysis can be distributed to multiple servers to handle very large projects
Professional support
• User support and training
– Email, phone and WebEx support
– On-site training session at deployment, WebEx training for new users
• Software updates
– Support for selected new analysis tools and new versions of existing tools
– Support for new genotyping chips, sequencing devices etc
– Updates of annotations (dbSNP maps, 1000G references etc)
• Technical maintenance and problem solving
– Maintenance work
– Complex problem solving
– Hardware expansions, migration to new server etc
Information Security
Introduction and background
• Changes in legal environment and client awareness
– Attitude change on societal level through data security breaches, patient consent issues etc.
– New and stricter laws regarding the handling of personally identifiable, sensitive information:
• HIPAA privacy rule in the U.S., new EU directives etc
– New security policies of universities
• Changes in the operative environment
– Requirements imposed by research collaboration and partner networks
– Data integration needs due to closer connection with clinical use and clinical user environment
– Biobank networks
• Changes in the research data itself
– Challenges of data management and analysis in a systematic and secure manner in the face of growing amounts and growing complexity of data
– Sensitivity of NGS data
“More and more time is spent on administrative and legal issues of data management instead of data analysis”
General rules and safeguards
• Objectives
• Ensure the confidentiality, integrity and availability of data
• Identify and protect against reasonably anticipated threats to security and integrity
• Protect against reasonably anticipated, impermissible uses or disclosures
• Measures
• Risk analysis and management
• Administrative safeguards
• Security personnel
• Security awareness and training of personnel
• Physical safeguards
• Facility access and control
• Workstation and device security
• System security
• Network security
Typical minimum system requirements
• Access controls
– Only personnel who need sensitive information to do their work are allowed to see it
• User access rights and auditing (personal user IDs, strong passwords, reports on login successes and failures, automatic logoffs etc.)
• Different user profiles for different uses of the system
• Different access rights to different types of data
• Log file information
• What data has been viewed, when and by whom
• Data encryption and virus protection
• Encrypted internet browser connection (https)
• Security updates
• Data backup and disaster recovery plans
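The access-control and log-file requirements above can be illustrated with a small sketch. Everything in it is hypothetical (the `AuditedStore` class, the profile names and the record layout are assumptions made for illustration, not part of any BC Platforms product): each view attempt is checked against the data types permitted for the user's profile, and every attempt is written to an audit trail recording what was viewed, when and by whom.

```python
from datetime import datetime, timezone


class AuditedStore:
    """Hypothetical record store with per-profile access rights and an audit trail."""

    def __init__(self, records, permissions):
        self.records = records          # record_id -> (data_type, value)
        self.permissions = permissions  # profile -> set of allowed data types
        self.audit_log = []             # one entry per view attempt

    def view(self, user, profile, record_id):
        data_type, value = self.records[record_id]
        allowed = data_type in self.permissions.get(profile, set())
        # Log every attempt: who, when, what, and whether it was allowed.
        self.audit_log.append({
            "user": user,
            "time": datetime.now(timezone.utc).isoformat(),
            "record": record_id,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"profile {profile!r} may not view {data_type} data")
        return value
```

Logging denied attempts as well as successful ones is a deliberate choice here: it gives the regular log inspections something to find.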
More advanced system requirements #1
• Access controls
– Broader use of the principle of least privilege
• Tendency to use more fine-grained user profiles (e.g. a dedicated account for making backups)
– Tighter access controls for remote use, e.g. VPN
• Log file information
– Regular inspections of audit logs for unauthorized access as a further administrative safeguard
• Separation of the application and database to different servers
• Hardening of physical access, operating system and database
– Hardened installation requirements (following vendor security recommendations)
More advanced system requirements #2
• Data encryption
– Encryption of server disks
– Application level encryption: sensitive data stored encrypted in the database
• Security updates and virus protection
– Security updates for underlying database software
– Anti-virus software required in all servers
• Backup encryption
Data Integration
[Figure: BC|ENTERPRISE platform architecture diagram]
Simple data integration
• Attaching remote data sources to the federated database
– Data with a predefined structure is stored in a file
– The file is configured to be visible to BCFS (URL etc)
– A data gateway server (virtual server etc) can also be used
– When the file changes, it is fetched and cached in the local database
– Using drop and load functions makes these operations very fast
• SQL queries
– SQL queries can join data from different sources
– Queries performed through the BC system refresh the data automatically
– For queries made via ODBC, tables need to be refreshed manually
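The fetch-on-change and drop-and-load steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BCFS implementation: the source is a local CSV file, the cache is an SQLite database, and `refresh_cache` and the `state` dictionary are hypothetical names.

```python
import csv
import hashlib
import sqlite3
from pathlib import Path


def file_digest(path: Path) -> str:
    """Fingerprint the file content so that changes can be detected."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def refresh_cache(db: sqlite3.Connection, path: Path, table: str, state: dict) -> bool:
    """Reload `table` from `path` only if the file changed. Returns True on reload."""
    digest = file_digest(path)
    if state.get(table) == digest:
        return False  # cached copy is still current
    with path.open(newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        rows = list(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    with db:  # commits the bulk load on success
        db.execute(f'DROP TABLE IF EXISTS "{table}"')
        db.execute(f'CREATE TABLE "{table}" ({cols})')
        db.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
    state[table] = digest
    return True
```

Dropping and bulk-loading the whole table avoids row-by-row comparison with the previous version, which is presumably why the drop and load approach is fast. Once several files are cached this way, an ordinary SQL join can combine them.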
Federated database
• IBM DB2 SQL database federation feature
– Facilitates integration of large data sources
– Data sources can be SQL databases or files
– Needs direct block storage access to the data
– Instead of copying the whole file to the local database, the SQL query engine transfers only the data needed to perform the query
– The implementation is transparent at the SQL level, whether used through the BC system or directly with ODBC
– Works together with the simple integration model
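The idea of transferring only the data a query needs can be shown with a conceptual sketch. It does not use or imitate the DB2 federation API: `RemoteSource` is a hypothetical stand-in for a remote database (an in-memory SQLite database here), and the `genotype` table layout is an assumption. The filter is evaluated at the source, so only matching rows are shipped.

```python
import sqlite3


class RemoteSource:
    """Stand-in for a remote database; counts the rows that leave it."""

    def __init__(self, rows):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE genotype (subject TEXT, chrom INTEGER, pos INTEGER, gt TEXT)"
        )
        self._db.executemany("INSERT INTO genotype VALUES (?, ?, ?, ?)", rows)
        self.rows_transferred = 0

    def fetch(self, where="1=1", params=()):
        """Evaluate the filter at the source and ship only the matching rows."""
        rows = self._db.execute(
            f"SELECT subject, chrom, pos, gt FROM genotype WHERE {where}", params
        ).fetchall()
        self.rows_transferred += len(rows)
        return rows


def genotypes_in_region(source, chrom, start, end):
    """Federated-style query: push the region filter down to the remote side."""
    return source.fetch("chrom = ? AND pos BETWEEN ? AND ?", (chrom, start, end))
```

Comparing `rows_transferred` after a region query with the total table size shows the saving relative to copying the whole source first.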
DataPump
• Automated data upload
– Java software (can be installed on any OS)
– Polls the content of a configured folder
– Uploads all new files to the BC server using HTTPS (encrypted connection)
– Works behind firewalls: most organisations keep the outbound HTTPS port open
– Java code available for auditing by the customer
– Polling interval can be a few seconds, depending on the size of the data files
• Data processing
– When data is uploaded, the BC server can perform the required preprocessing and write the data to the database
– Log files are kept of all DataPump operations and of the changes made to the data
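The polling logic described above can be sketched in a few lines. DataPump itself is Java software; this Python version is only an illustration, and `poll_once`, `run` and the `upload` callback are hypothetical names. The actual HTTPS transfer is left to the callback, so no server endpoint is assumed.

```python
import time
from pathlib import Path
from typing import Callable, List, Set


def poll_once(folder: Path, seen: Set[str], upload: Callable[[Path], None]) -> List[str]:
    """Upload every file in `folder` not handled before; return the names uploaded."""
    uploaded = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.name not in seen:
            upload(path)          # e.g. an HTTPS POST to the server
            seen.add(path.name)   # mark as done only after a successful upload
            uploaded.append(path.name)
    return uploaded


def run(folder: Path, upload: Callable[[Path], None],
        interval_s: float = 5.0, cycles: int = 3) -> None:
    """Poll the folder a fixed number of times, sleeping between cycles."""
    seen: Set[str] = set()
    for _ in range(cycles):
        for name in poll_once(folder, seen, upload):
            print(f"uploaded {name}")
        time.sleep(interval_s)
```

Because a file is added to `seen` only after `upload` returns, a failed upload raises before the file is marked and it is retried on the next cycle. A production tool would also need to persist `seen` across restarts and avoid picking up files that are still being written.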
Healthcare data integration
• Mirth Connect
– Swiss Army knife of healthcare integration engines, specifically designed for HL7 message integration
– Open source, commercial support available
– Connects to any system over any protocol
– MLLP, HTTP, Database, Email, PDF, JMS, TCP/IP, Web Services (SOAP), File System, (S)FTP, RTF, DICOM
– Support for custom connectors written in Java or JavaScript
– Easily transform, filter and route your data
– HL7 v2.x, HL7 v3, CDA, CCR, DICOM, X12, Delimited Text, CCD, XML, NCPDP, EDI, Raw ASCII or Binary
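To give a feel for the HL7 v2.x messages that such an integration engine transforms and routes, here is a minimal parsing sketch. It is not Mirth Connect code and does not use its API; the function names and the sample message are made up for illustration. It relies only on the basic HL7 v2 layout: segments separated by carriage returns, fields by `|` and components by `^`.

```python
def parse_hl7_v2(message: str) -> dict:
    """Split an HL7 v2 message into {segment_id: [field lists, one per segment]}."""
    segments = {}
    for line in message.replace("\n", "\r").split("\r"):
        if not line.strip():
            continue
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields)
    return segments


def patient_summary(message: str) -> dict:
    """Extract patient ID (PID-3) and name (PID-5) from the first PID segment."""
    # For non-MSH segments, list index n corresponds to field n (index 0 is the segment ID).
    pid = parse_hl7_v2(message)["PID"][0]
    family, given = (pid[5].split("^") + [""])[:2]
    return {"patient_id": pid[3].split("^")[0], "family": family, "given": given}


sample = (
    "MSH|^~\\&|LAB|HOSP|BC|BC|20140101||ORU^R01|1|P|2.3\r"
    "PID|1||12345^^^HOSP^MR||DOE^JOHN||19800101|M\r"
)
print(patient_summary(sample))
```

The sketch ignores escape sequences, repetitions and custom delimiters declared in the MSH segment, all of which a real integration engine handles.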
Scalable Data Analysis
Relevant research by BC Platforms
Scalable Data Analysis
[Figure: BC|ENTERPRISE platform architecture diagram]
Using academic algorithms instead of developing your own …
… but solving their major problems:
lack of scalability
need for direct access to the original data (data security)
data format related issues
Distributed data analysis
• Support for 30+ academic analysis packages
– Data format conversions
– Easy web interface for running analyses
– New/custom analysis packages can be added
• Efficient, distributed data analysis
– General framework for splitting analysis tasks into small segments
– Massively parallel data analysis
• Support for different calculation environments
– Linux boxes/servers
– Institution calculation cluster
– Cloud
Some supported analysis methods
o Case-control analysis: PLINK v1/v2, PLINK multivariate, Haploview, WG-Permer, METAL, HAPPY, QCtool, EPACTS, GCTA
o Linkage analysis: Merlin, Allegro, GENEHUNTER, Simwalk2, Solar
o Epistasis: PLINK, BEAM
o Imputation: MaCH, IMPUTE, BEAGLE, Minimac, SHAPEIT2
o Association analysis of probabilistic genotypes: SNPtest, MaCH2QTL, ProbABEL
o TDT analysis: FBAT / PBAT, QTDT
o Haplotyping: PHASE, fastPHASE
o CNV analysis: PLINK CNV, PennCNV, QuantiSNP
o Next-generation sequencing (NGS): VAT, PLINK/SEQ, ANNOVAR, Granvil, SKAT, PolyPhen2, SIFT, SAMtools, BWA and BWA-SW, GATK, data export/import in the Variant Call Format (VCF), EPACTS, SNPeff, PLATYPUS
o Mendelian checking on inheritance patterns: PLINK, PEDCHECK, Merlin
o Population stratification: Eigensoft (Eigenstrat, smartPCA), PLINK, STRUCTURE, RELPAIR
o Pedigree drawing and analysis: Cranefoot, Pelican, Madeline v2
o Scripting: R scripts, GenABEL, SAS scripts
o Result visualization: Haploview, UCSC browser, GWAS central, WGAviewer
o Data format conversions: Mega2
o Data export: ASCII, SPSS, STATA, Excel (SYLK)
BigData analyses
• Analyses of BigData
– When the number of subjects with whole genome data (genotypes or sequences) exceeds roughly 50 000, new approaches are needed
– Data needs to be tiled, and the tiles need to be analyzed in parallel
• Workflows
– Distributed imputation workflows (IMPUTE, SHAPEIT2, MaCH, BEAGLE)
– Distributed BWA/GATK workflow
– Custom workflows can be added
• Data export in different formats
– Data can be exported in many different formats for command line use
– New export formats can be easily added by us, or by the customer
Some large projects
• eMERGE network data analysis
– 75 000 subjects, 32 M imputed variants / subject = 2400 billion variants
• MIMOmics (EU FP7 funded collaboration project)
– Budget 15 MEUR, 20 European universities
– BC co-leads: “Data integration and distributed data analysis”
– -omics data (GWAS, WES, WGS, lipidomics, N-glycans, NMR, metabolomics, glycomics, telomere, expression, miRNA, methylation)
• SUMMIT (IMI project)
– Budget 35 MEUR, 6 pharma companies, 19 universities, BC as database vendor
– Prediction model for type 2 diabetes complications
– 70 000 patients with various types of data
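Tiled Dataset
[Figure: the dataset as a grid of tiles. Columns are 3 Mbp genomic windows (Chr 1: 0-3, 3-6, 6-9, ... through Chr 22: 40-43, 43-46, 46-49 Mbp); rows are blocks of 5 000 subjects (1-5 000, 5 001-10 000, 10 001-15 000, ..., 495 001-500 000).]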
500 000 subjects * 3 Mbp window => 100 000 segments
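The segment count can be checked with a short sketch. It assumes blocks of 5 000 subjects (as in the tiling figure) and, as a simplification, treats the genome as one continuous axis of about 3 000 Mbp rather than tiling each chromosome separately. That gives 100 subject blocks × 1 000 windows. The function names are hypothetical.

```python
from math import ceil


def tile_grid(n_subjects, genome_mbp, subjects_per_tile=5_000, window_mbp=3):
    """Yield (first_subject, last_subject, start_mbp, end_mbp) for every tile."""
    for s in range(0, n_subjects, subjects_per_tile):
        for w in range(0, genome_mbp, window_mbp):
            yield (s + 1, min(s + subjects_per_tile, n_subjects),
                   w, min(w + window_mbp, genome_mbp))


def tile_count(n_subjects, genome_mbp, subjects_per_tile=5_000, window_mbp=3):
    """Number of tiles: subject blocks times genomic windows."""
    return ceil(n_subjects / subjects_per_tile) * ceil(genome_mbp / window_mbp)


# Assumed genome size of 3 000 Mbp; 100 subject blocks x 1 000 windows.
print(tile_count(500_000, 3_000))
```

Tiling per chromosome, as the figure does, would add a few partial windows at chromosome ends, so the real count would differ slightly from this round figure.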
Distributed/parallel data analysis
[Figure: tiles of the tiled dataset are dispatched to the nodes of the calculation cluster.]
Each calculation node handles one tile at a time!
Distributed BigData analysis
[Figure: workflow. The researcher selects the data analysis tool and the data (genetic and other), then defines the analysis parameters and the parallelization to use. The calculation cluster segments the job and queues the segments; data and results are exchanged with the virtual, object based filesystem.]
More than 30 academic statistical genomics tools supported
Own tools are easy to add
Existing calculation clusters can be used
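The job segmenting and job queue pattern can be sketched as follows. This is an illustration only: threads stand in for cluster nodes, and `run_distributed` and the `analyse` callback are hypothetical names, not the BC scheduler. One job is queued per tile and each worker takes one tile at a time until the queue is empty.

```python
import queue
import threading


def run_distributed(tiles, analyse, n_workers=4):
    """Queue one job per tile and let workers drain the shared queue."""
    jobs = queue.Queue()
    for tile in tiles:
        jobs.put(tile)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                tile = jobs.get_nowait()  # each worker handles one tile at a time
            except queue.Empty:
                return                    # no jobs left
            outcome = analyse(tile)
            with lock:                    # collect results safely across workers
                results[tile] = outcome

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because workers pull jobs rather than receiving a fixed share up front, fast nodes naturally process more tiles than slow ones, which matters when tile run times vary.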
Applications
[Figure: BC|ENTERPRISE platform architecture diagram]
Modular BC architecture
• Data integration platform
– BC|ENTERPRISE – integration of genomic, clinical (Electronic Patient Record etc) and other data sources, connections to storage solutions, calculation clusters etc
• Sample information management
– BC|SAMPLE – simple LIMS optimized for research and biobanking
• Clinical data collection
– BC|CLIN & CRF editor for online data collection, importing patient registers
– Madeline pedigree drawing tool, support for 3rd party tools (Progeny, PASS)
• Sequence, genotype, -omics and other related information
– BC|DATA – scalable, secure life science data warehouse
– BC|GENOME – advanced analysis tool for genetic epidemiology
• Biobanks
Sample data management – BC|SAMPLE
Simple layout – easy navigation
Tables and Filters
Import, edit, report
Flexible table structure
Quick search
Spreadsheet interface
On-line editing
Alerts
Data model editor
Freezers, boxes, and plates
Track sample location
Container hierarchy
Automatic placement of new samples
Easy editing
Many container types
Label editor – robust and platform independent
Easy drag'n'place editing
1D/2D barcodes
Label content fetched from the database
Print labels for multiple samples in one go
Phenotype/clinical data collection and management – BC|CLIN
CRF (Case Record Form) Editor
Data import wizard or data entry
Queries and export
3rd party pedigree drawing – PASS software and Progeny
Pedigree editor – Madeline v2
Genome data management and analysis – BC|GENOME
Queue system for large data uploads and analysis
Example: 1000 Genomes imputation using MaCH
Association: define cases and controls ...
... and analysis parameters ...
... and retrieve and visualize results
Clinical Genomics (3rd party App by Euformatics)
Application Programming Interface
[Figure: BC|ENTERPRISE platform architecture diagram]
End of presentation