A Vision for Managing Big Data @ UC Davis: A Data Science Institute

Big Data: Why it matters

Patrice KOEHL, Department of Computer Science, Genome Center, UC Davis

The three I’s of Big Data

Big Data is:

- Ill-defined (what is it?)
- Immediate (we need to do something about it now)
- Intimidating (what if we don’t?)

(loosely adapted from Forbes)

Big Data: Volume

Units (each step is a factor of 1000):

Byte
Kilobyte (KB) = 1000 bytes
Megabyte (MB) = 1000 KB
Gigabyte (GB) = 1000 MB
Terabyte (TB) = 1000 GB
Petabyte (PB) = 1000 TB
Exabyte (EB) = 1000 PB
Zettabyte (ZB) = 1000 EB
Yottabyte (YB) = 1000 ZB

Examples:

One page of text: 30 KB
One song: 5 MB
One movie: 5 GB
6 million books: 1 TB
55 storeys of DVD: 1 PB
Data up to 2003: 5 EB
Data in 2011: 1.8 ZB
NSA data center: 1 YB

The same factor-of-1000 steps on a time scale: 1 s, 20 mins, 11 days, 30 years, 300 centuries, 30 million years, 30 billion years, ….
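The time analogy on the Volume slide can be checked with a few lines of arithmetic. The sketch below starts from one second and multiplies by 1000 at each step; the function names and the choice of a 365.25-day year are assumptions made for this illustration, not part of the original slides. The slide rounds the results (about 17 minutes becomes 20 mins, about 32 years becomes 30 years).

```python
# Illustrative sketch: scale one second by successive factors of 1000,
# mirroring the 1000-fold steps between byte units.
SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumption: 365.25-day year


def describe(seconds: float) -> str:
    """Return a rough human-readable duration for a number of seconds."""
    if seconds < SECONDS_PER_MINUTE:
        return f"{seconds:.0f} s"
    if seconds < SECONDS_PER_DAY:
        return f"{seconds / SECONDS_PER_MINUTE:.0f} minutes"
    if seconds < SECONDS_PER_YEAR:
        return f"{seconds / SECONDS_PER_DAY:.1f} days"
    return f"{seconds / SECONDS_PER_YEAR:,.0f} years"


def time_ladder(steps: int = 7) -> list[str]:
    """Durations for 1000**0, 1000**1, ... seconds."""
    return [describe(1000.0 ** k) for k in range(steps)]


for label in time_ladder():
    print(label)
```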

Big Data: Volume, Velocity

One minute in the digital world (Intel, 2013):

- 204 million e-mails sent
- 640 TB of IP data transferred
- 50 GB of data generated at the Large Hadron Collider
- 3+ million searches launched
- 6 million users connected
- 30 hours of video uploaded
- 1.3 million videos viewed
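To get a feel for velocity, the per-minute figures quoted from Intel (2013) can be rescaled to per-second and per-day rates. This is only a back-of-the-envelope sketch: the dictionary keys and helper names are hypothetical, and it assumes the per-minute rates hold constant over a whole day.

```python
# Illustrative sketch: rescale the per-minute figures from the slide.
# Assumption: the rates are constant over a whole day.
PER_MINUTE = {
    "e-mails sent": 204e6,
    "TB of IP data transferred": 640,
    "GB generated at the LHC": 50,
    "hours of video uploaded": 30,
}

SECONDS_PER_MINUTE = 60
MINUTES_PER_DAY = 24 * 60


def per_second(amount_per_minute: float) -> float:
    """Convert a per-minute amount to a per-second amount."""
    return amount_per_minute / SECONDS_PER_MINUTE


def per_day(amount_per_minute: float) -> float:
    """Convert a per-minute amount to a per-day amount."""
    return amount_per_minute * MINUTES_PER_DAY


for name, amount in PER_MINUTE.items():
    print(f"{name}: {per_second(amount):,.2f} per second, "
          f"{per_day(amount):,.0f} per day")
```

Under that assumption, 640 TB per minute corresponds to 921,600 TB per day, which is a little under one exabyte.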

Big Data: Volume, Velocity, Variety

Text, numbers, images, sound

Big Data: Challenges

- Volume and Velocity
- Variety
  - Structured, Unstructured….
  - Images, Sound, Numbers, Tables,…
- Security
- Reliability, Integrity, Validity

Big Data: Challenges

Large N:

“Any dataset that is collected by a scientist whose data collection skills are far superior to her analysis skills”

Computing issues:

- Data transfer
- Scalability of algorithms
- Memory limitations
- Distributed computing

Big Data: Challenges

Visualization issues:

The black screen problem

(Matloff, 2013)

Big Data: Challenges

Large P, “small” N:

Rule of thumb: N/P > 5. What if it does not hold anymore?

- Curse of dimensionality (all data points seem equidistant)
- Nonlinearity
- Dimension reduction
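The remark that all data points seem equidistant can be illustrated numerically. The sketch below draws uniform random points in a unit hypercube and measures the relative contrast, (max - min) / min, of the distances from one query point to the others. The function name, the uniform distribution, and the sample sizes are assumptions chosen for this illustration; they are not taken from the slides.

```python
import math
import random


def relative_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """(max - min) / min of distances from one query point to random points.

    Points are drawn uniformly from the unit hypercube [0, 1]^dim
    (an assumption made for this illustration).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


# The contrast shrinks as the dimension grows: the nearest and the
# farthest neighbours end up at almost the same distance.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(200, dim), 3))
```

In low dimension the farthest point is many times farther away than the nearest one; in high dimension the two distances differ by only a small fraction, which is what makes nearest-neighbour reasoning unreliable when P is large.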

Big Data: Challenges and Opportunities

- Fourth Paradigm: data driven science
  (diagram labels: Data, Basic, Knowledge, Translational, Societal Benefit)
- Holistic approaches to major research efforts
- New paradigms in computing
- Digital Humanities

Big Data: Enabling Dreams

- Understanding the physics of “Dark Energy”
- How the brain works: from neurons to cognition
- A holistic view of natural ecosystems
- Understanding climate changes
- From genotype to phenotype
- Precision medicine
- Big Humanities
- ….

Big Data Dreams: Genomics

Genomics: Sequencing costs

Figure: sequencing cost charts with logarithmic cost axes, one running from $10,000 down to $0.01 and one from $100,000,000 down to $100.

Source: http://www.genome.gov

Genomics: Game changing technologies

Illumina HiSeq 2000

- Capable of 600 Gb per run -> 1,000+ Gb
- 55 Gb/day
- 6 billion paired-end reads
- <$4,000 per human/plant genome
- <$200 per transcriptome
- Multiplex 384 pathogen isolates/lane -> $10 (+ $50 library construction)/isolate
- Challenges: Library preparation & data analysis

Gary Schroth (Illumina): “A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.”

Genomics @ UC Davis

Massively parallel DNA sequencing

- 2 Illumina Genome Analyzers
- 1 Illumina HiSeq 2000, 2 MiSeq
- 1 Roche 454 Junior
- 1 Pacific Biosciences RS

GoldenGate SNP genotyping

iScan, BeadArray & BeadExpress

Cancer Genomics: Molecular Diagnostics

Genomics: actual costs

“A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.”

Gary Schroth (Illumina)

Assembling 22 Gb conifer genome:

Data:

- 16 billion pair reads (100 bases)

Processing:

- 10 days for error correction
- 11 days for assembling “super-reads”
- 60 days to build contigs/scaffolds
- 8 days to fill in gaps

http://www.homolog.us/blogs/2013/05/11/steven-salzberg-at-bog13-assembling-22gb-conifer-genome/
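The conifer assembly figures can be added up to see where the real cost lies. The sketch below sums the processing stages and estimates sequencing depth. It assumes that "16 billion" counts individual 100-base reads; if it counts read pairs instead, the raw base count and the coverage double. The variable names are hypothetical.

```python
# Illustrative sketch: back-of-the-envelope totals for the conifer assembly.
# Assumption: 16 billion individual reads of 100 bases each
# (if the figure counts read pairs, double raw_bases and coverage).
GENOME_SIZE_BASES = 22e9
READ_COUNT = 16e9
READ_LENGTH_BASES = 100

STAGE_DAYS = {
    "error correction": 10,
    "assembling super-reads": 11,
    "building contigs/scaffolds": 60,
    "filling in gaps": 8,
}

total_days = sum(STAGE_DAYS.values())
raw_bases = READ_COUNT * READ_LENGTH_BASES
coverage = raw_bases / GENOME_SIZE_BASES

print(f"Total processing time: {total_days} days")
print(f"Raw sequence: {raw_bases / 1e12:.1f} trillion bases")
print(f"Approximate coverage: {coverage:.1f}x")
```

Under these assumptions the processing adds up to 89 days of computation, which is the point of the slide: generating the sequence is no longer the bottleneck, analysing it is.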

Social Consequences of Commodity Sequencing

- The danger of misuse
  - predict sensitivities to various industrial or environmental agents
  - discrimination by employers?
- The impact of information that is likely to be incomplete
  - an indication of a 25 percent increase in the risk of cancer?
- Reversal of knowledge paradigm
- Are the "products" of the Human Genome Project to be patented and commercialized?
  - Myriad Genetics and BRCA1/2
- How to educate about genetic research and its implications?

How to Approach Big Data