
Experiences with a large-memory HP cluster –
performance on benchmarks and genome codes
Craig A. Stewart ([email protected])
Executive Director, Pervasive Technology Institute
Associate Dean, Research Technologies
Associate Director, CREST
Robert Henschel
Manager, High Performance Applications, Research Technologies/PTI
William K. Barnett
Director, National Center for Genome Analysis Support
Associate Director, Center for Applied Cybersecurity Research, PTI
Thomas G. Doak
Department of Biology
Indiana University
License terms
• Please cite this presentation as: Stewart, C.A., R. Henschel, W. K. Barnett, T.G. Doak. 2011. Experiences with a large-memory HP cluster – performance on benchmarks and genome codes. Presented at: HP-CAST 17 – HP Consortium for Advanced Scientific and Technical Computing World-Wide User Group Meeting. Renaissance Hotel, 515 Madison Street, Seattle WA, USA, November 12th 2011. http://hdl.handle.net/2022/13879
• Portions of this document that originated from sources outside IU are shown here and used by permission or under licenses indicated within this document. Items indicated with a © are under copyright and may not be reused without permission from the holder of copyright, except where license terms noted on a slide permit reuse.
• Except where otherwise noted, the contents of this presentation are copyright 2011 by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work – and to remix – to adapt the work – under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
The human genome project was just the start
• More sequencing will help us:
  – Understand the basic building blocks and mechanisms of life
  – Improve crop yields or disease resistance by genetic modification, or add new nutrients to crops
  – Understand disease variability by mapping the genetic variation of diseases such as cancer, or by studying the human microbiome and how it interacts with us and our conditions
  – Create personalized treatments for various illnesses by understanding human genetic variability, or create gene therapies as new treatments
  – Really begin to understand genome variability as a population issue, as well as having at hand the genome of one particular individual
Evolution of sequencers over time

Genome sequencer model | Year introduced | Raw image data per run | Data products | Sequence per run | Read length
Doctoral student (hard working) as sequencer | Circa 1980s | n.a. | Several exposed films/day on a good day | 2 Kbp | 100-200 nt
ABI 3730 | 2002 | 0.03 GB | 2 GB/day | 60 Kbp | 800 nt
454 Titanium | 2005 | 39 GB | 9 GB/day | 500 Mbp | 400 nt
Illumina-Solexa G1 | 2006 | 600 GB | 100 GB/day | 50 Gbp | 300 nt
ABI SOLiD 4 | 2007 | 680 GB | 25 GB/day | 70 Gbp | 90 nt
Illumina HiSeq 2000 | 2010 | 600 GB | 150 GB/day | 200 Gbp | 200 nt
Cost of sequencing over time

Date | Cost per Mb of DNA sequence | Cost per human genome
March 2002 | $3,898.64 | $70,175,437
April 2004 | $1,135.70 | $20,442,576
April 2006 | $651.81 | $11,732,535
April 2008 | $15.03 | $1,352,982
April 2010 | $0.35 | $31,512
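A quick calculation makes the trend concrete: over these eight years the cost per human genome fell by a factor of more than 2,000, halving roughly every nine months. A minimal sketch of that arithmetic (the halving-time framing is ours, not a figure from the slide):

```python
# Quick arithmetic on the cost-per-genome column above.  The "halving time"
# framing is our own summary of the slide's numbers, not from the slide itself.
from math import log2

cost_2002 = 70_175_437   # cost per human genome, March 2002 (USD)
cost_2010 = 31_512       # cost per human genome, April 2010 (USD)
months = 8 * 12 + 1      # March 2002 to April 2010

halvings = log2(cost_2002 / cost_2010)                               # ~11.1 halvings
print(f"cost halved roughly every {months / halvings:.1f} months")   # ~8.7 months
```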
Mason – an HP ProLiant DL580 G7 cluster
• 16-node cluster
• 10GE interconnect
  – Cisco Nexus 7018
  – Compute nodes are oversubscribed 4:1
  – This is the same switch that we use for the data center and other 10G-connected equipment
• Quad-socket nodes
  – 8-core Xeon L7555, 1.87 GHz base frequency
  – 32 cores per node
  – 512 GB of memory per node
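For later comparison with the HPL results, the theoretical peak of this configuration can be estimated directly from the specifications above. The sketch below assumes 4 double-precision floating-point operations per core per cycle and the 1.87 GHz base clock; both the per-cycle figure and the choice of base rather than turbo clock are our assumptions, not vendor-published peak numbers.

```python
# Back-of-the-envelope peak for Mason from the specs above.  Assumes
# 4 double-precision FLOPs per core per cycle and the 1.87 GHz base clock;
# both are assumptions rather than vendor-published peak figures.

NODES = 16
CORES_PER_NODE = 32
CLOCK_GHZ = 1.87
DP_FLOPS_PER_CYCLE = 4

peak_node_gflops = CORES_PER_NODE * CLOCK_GHZ * DP_FLOPS_PER_CYCLE
peak_cluster_tflops = peak_node_gflops * NODES / 1000

print(f"per node: ~{peak_node_gflops:.0f} GFLOP/s")      # ~239 GFLOP/s
print(f"16 nodes: ~{peak_cluster_tflops:.2f} TFLOP/s")   # ~3.83 TFLOP/s
```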
Why 512 GB – sweet spot in RAM requirements

Application | Genome / genome size | RAM required (per node)
ABySS | Various plant genomes | 7.8 GB RAM per node (distributed memory parallel code) [based on McCombie lab at CSHL]
SOAPdenovo | Panda / 2.3 Gbp | 512 GB
SOAPdenovo | Human gut metagenome / est. 4 Gbp | 512 GB
Velvet | Human genome | 150 GB
Velvet | Honeybee / ~300 Mbp | 128 GB
Velvet | Daphnia / 200 Mbp | > 512 GB [based on runs by Lynch lab at IU]
Velvet | Duckweed / 150 Mbp | > 512 GB [based on McCombie lab at CSHL]

Coverage | Maximum assemblable genome (Gbp) | Percentile of plant genome size distribution | Percentile of animal genome size distribution
20x | 1.3 | 32 | 44
40x | 0.6 | 16 | 15
60x | 0.4 | 7 | 9
Largest genome that can be assembled on a computer with 512 GB of memory, assuming maximum memory usage and k-mers of 20 bp.

“Memory is a problem that admits only money as a solution” – David Moffett
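The "maximum assemblable genome" rows imply a roughly constant memory cost per sequenced base. The sketch below fits that constant to the 20x row (about 20 bytes per base of sequence) and inverts the relationship; the constant is derived from the slide's own numbers, not a published Velvet parameter.

```python
# Invert the memory/coverage relationship implied by the table above.
# The bytes-per-sequenced-base constant is fitted to the 20x row
# (512 GB / (1.3 Gbp * 20)) and is an assumption derived from the slide,
# not a published Velvet parameter.

RAM_BYTES = 512e9
BYTES_PER_SEQUENCED_BASE = RAM_BYTES / (1.3e9 * 20)   # ~19.7 bytes

def max_assemblable_gbp(ram_bytes, coverage):
    """Largest genome (in Gbp) whose assembly fits in ram_bytes at a given coverage."""
    return ram_bytes / (coverage * BYTES_PER_SEQUENCED_BASE) / 1e9

for cov in (20, 40, 60):
    print(f"{cov}x coverage: ~{max_assemblable_gbp(RAM_BYTES, cov):.2f} Gbp")
# -> 1.30, 0.65, 0.43 Gbp, in line with the 1.3 / 0.6 / 0.4 column above
```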
Community trust matters

Application | Year initially published | Number of citations as of August 2010
de Bruijn graph methods
ABySS | 2009 | 783
EULER | 2001 | 1870
SOAPdenovo | 2010 | 254
Velvet | 2008 | 1420
Overlap/layout/consensus
Arachne 2 | 2003 | 578
Celera Assembler | 2000 | 3579
Newbler | 2005 | 999
Right now the codes most trusted by the community require large amounts of memory in a single namespace. More side-by-side testing may lead to more trust in the distributed-memory codes, but for now…
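One reason the trusted codes above want a single large memory space: a de Bruijn graph assembler must hold every distinct k-mer of the read set as a graph node. The toy sketch below makes that scaling visible; real assemblers use 2-bit base encodings and compact hash tables rather than a plain Python set.

```python
# Toy illustration of why de Bruijn graph assemblers are memory hungry:
# every distinct k-mer in the read set becomes a graph node that must be
# held in RAM.  Real assemblers use 2-bit encodings and compact hash
# tables; a plain Python set is used here only to make the scaling visible.

def distinct_kmers(reads, k=20):
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers

toy_reads = ["ACGTACGTGGTCAACGTTAGCATG",   # a real dataset holds billions of
             "CGTACGTGGTCAACGTTAGCATGA"]   # bases, not two short reads
print(len(distinct_kmers(toy_reads, k=20)))   # number of graph nodes to store
```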
The National Center for Genome Analysis Support
• Dedicated to supporting life science researchers who need computational support for genomics analysis
• Initially funded by the National Science Foundation Advances in Biological Informatics (ABI) program, grant no. 1062432
• A Cyberinfrastructure Service Center affiliated with the Pervasive Technology Institute at Indiana University (http://pti.iu.edu)
• Provides support for genomics analysis software on supercomputers customized for genomics studies, including Mason and systems which are part of XSEDE
• Provides distributions of hardened versions of popular codes
• Particularly dedicated to genome assembly codes such as:
  – de Bruijn graph methods: SOAPdenovo, Velvet, ABySS
  – consensus methods: Celera, Newbler, Arachne 2
• For more information, see http://ncgas.org
Benchmark overview
• High Performance Computing Challenge Benchmark (HPCC)
  – http://icl.cs.utk.edu/hpcc/
• SPEC OpenMP
  – http://www.spec.org/omp2001/
• SPEC MPI
  – http://www.spec.org/mpi2007/
High Performance Computing Challenge benchmark
Innovative Computing Laboratory at the University of Tennessee (Jack Dongarra and Piotr Luszczek)
• Announced at Supercomputing 2004
• Version 1.0 available June 2005
• Current: Version 1.4.1
• Our results are not yet published, because we are unsatisfied with the 8-node HPCC runs. This is likely due to our 4:1 oversubscription of the switch.
High Performance Computing Challenge benchmark
• Raw results, 1 to 16 nodes

# Nodes | # CPUs | # Cores | G-HPL (TFLOP/s) | G-PTRANS (GB/s) | G-FFTE (GFLOP/s) | G-RandomAccess (Gup/s) | G-STREAM (GB/s) | EP-STREAM (GB/s) | EP-DGEMM (GFLOP/s) | Random Ring Bandwidth (GB/s) | Random Ring Latency (usec) | % HPL Peak
16 | 64 | 512 | 3.383 | 5.112 | 17.962 | 0.245 | 1084.682 | 2.119 | 7.158 | 0.010 | 229.564 | 82.59
8 | 32 | 256 | 1.608 | 2.648 | 8.938 | 0.1534 | 549.184 | 2.145 | 7.136 | 0.011 | 169.079 | 78.51
4 | 16 | 128 | 0.847 | 1.575 | 5.356 | 0.123 | 267.297 | 2.088 | 7.141 | 0.014 | 119.736 | 82.66
2 | 8 | 64 | 0.424 | 3.545 | 10.095 | 0.152 | 137.790 | 2.153 | 7.128 | 0.083 | 71.988 | 82.85
1 | 4 | 32 | 0.222 | 6.463 | 11.542 | 0.225 | 66.936 | 2.092 | 7.157 | 0.324 | 3.483 | 86.78
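The "% HPL Peak" column can be reproduced from the G-HPL values and a theoretical peak. The numbers are consistent with a peak computed at 4 double-precision FLOPs per core per cycle and a 2.0 GHz clock; that 2.0 GHz figure is inferred from the table itself (the L7555 base clock is 1.87 GHz), so treat the sketch below as a reconstruction under assumptions rather than the exact method used on the slide.

```python
# Reproduce the "% HPL Peak" column from the G-HPL numbers above.  The
# column is consistent with a peak of 4 DP FLOPs/cycle at 2.0 GHz per
# core; the 2.0 GHz figure is inferred from the table (the L7555 base
# clock is 1.87 GHz), so treat it as an assumption.

CORES_PER_NODE = 32
CLOCK_GHZ = 2.0
DP_FLOPS_PER_CYCLE = 4

g_hpl_tflops = {16: 3.383, 8: 1.608, 4: 0.847, 2: 0.424, 1: 0.222}

for nodes, tflops in g_hpl_tflops.items():
    peak_tflops = nodes * CORES_PER_NODE * CLOCK_GHZ * DP_FLOPS_PER_CYCLE / 1000
    print(f"{nodes:2d} nodes: {100 * tflops / peak_tflops:5.2f} % of peak")
# -> 82.59, 78.52, 82.71, 82.81, 86.72 -- matching the last column above
#    to within the rounding of the published G-HPL values
```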
High Performance Computing Challenge benchmark
• HPL efficiency from 1 to 16 nodes
  – Highlighting our issue at 8 nodes
  – However, for a 10GE system, not so bad!
[Figure: HPL efficiency (%) versus number of nodes (1, 2, 4, 8, 16); the y-axis spans roughly 74% to 88%, with a visible dip at 8 nodes.]
SPEC benchmarks
• High Performance Group (HPG) of the Standard Performance Evaluation Corporation (SPEC)
• Robust framework for measurement
• Industry / education mix
• Result review before publication
• Fair use policy and its enforcement
• Concept of reference machine, base / peak runs, different datasets
SPEC OpenMP
• Evaluate performance of OpenMP applications (single node)
• Benchmark consists of 11 applications; medium and large datasets available
• Our results (Large and Medium): http://www.spec.org/omp/results/res2011q3/
The SPEC OpenMP application suite

310.wupwise_m and 311.wupwise_l | quantum chromodynamics
312.swim_m and 313.swim_l | shallow water modeling
314.mgrid_m and 315.mgrid_l | multi-grid solver in 3D potential field
316.applu_m and 317.applu_l | parabolic/elliptic partial differential equations
318.galgel_m | fluid dynamics analysis of oscillatory instability
330.art_m and 331.art_l | neural network simulation of adaptive resonance theory
320.equake_m and 321.equake_l | finite element simulation of earthquake modeling
332.ammp_m | computational chemistry
328.fma3d_m and 329.fma3d_l | finite-element crash simulation
324.apsi_m and 325.apsi_l | temperature, wind, distribution of pollutants
326.gafort_m and 327.gafort_l | genetic algorithm code
SPEC OpenMP Medium

Benchmark | Base Ref Time | Base Run Time (HT off) | Base Ratio (HT off) | Base Run Time (HT on) | Base Ratio (HT on)
310.wupwise_m | 6000 | 46.7 | 128583 | 37.3 | 160774
312.swim_m | 6000 | 84.1 | 71337 | 74.2 | 80847
314.mgrid_m | 7300 | 96.9 | 75332 | 87.7 | 83222
316.applu_m | 4000 | 29.9 | 133967 | 26.1 | 153288
318.galgel_m | 5100 | 115.0 | 44387 | 114.0 | 44802
320.equake_m | 2600 | 52.3 | 49691 | 47.9 | 54295
324.apsi_m | 3400 | 44.7 | 76128 | 46.5 | 73134
326.gafort_m | 8700 | 98.2 | 88571 | 109.0 | 79651
328.fma3d_m | 4600 | 88.3 | 52108 | 92.8 | 49543
330.art_m | 6400 | 31.1 | 205935 | 32.9 | 194318
332.ammp_m | 7000 | 152.0 | 45953 | 161.0 | 43469
SPECompMbase2001 | | | 78307 | | 80989
HyperThreading beneficial (overall base score 80989 with HT on vs. 78307 with HT off).
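The SPECompMbase2001 figure is the geometric mean of the eleven per-benchmark base ratios, which is easy to check from the table. A minimal sketch, using the hyperthreading-off column above:

```python
# The overall SPECompMbase2001 score is the geometric mean of the eleven
# per-benchmark base ratios; recomputing it from the HT-off column above.
from math import prod

ht_off_ratios = [128583, 71337, 75332, 133967, 44387, 49691,
                 76128, 88571, 52108, 205935, 45953]

score = prod(ht_off_ratios) ** (1 / len(ht_off_ratios))
print(round(score))   # ~78307, matching the reported HT-off score
```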
SPEC OpenMP Medium
System Comparison
[Figure: SPEC OMPM2001 scores (0 to 200,000) for seven systems: HP Integrity, 32 threads, Itanium 2, 1.5 GHz, Sep. 03; HP Integrity, 64 threads, Itanium 2, 1.5 GHz, Sep. 03; Sun SPARC, 64 threads, SPARC64, 2.5 GHz, Jul. 08; HP DL580, 32 threads, Xeon, 1.8 GHz, Apr. 11; IBM p570, 32 threads, Power6, 4.7 GHz, May 07; SGI UV, 64 threads, Xeon, 2.2 GHz, Mar. 10; IBM p595, 128 threads, Power6, 5 GHz, Jun. 08.]
SPEC OpenMP Large

Benchmark | Base Ref Time | Base Run Time (HT off) | Base Ratio (HT off) | Base Run Time (HT on) | Base Ratio (HT on)
311.wupwise_l | 9200 | 203 | 723729 | 211 | 697727
313.swim_l | 12500 | 602 | 331955 | 628 | 318338
315.mgrid_l | 13500 | 518 | 416695 | 528 | 409378
317.applu_l | 13500 | 562 | 384230 | 590 | 366221
321.equake_l | 13000 | 575 | 361456 | 542 | 383513
325.apsi_l | 10500 | 271 | 620871 | 286 | 587380
327.gafort_l | 11000 | 391 | 450003 | 359 | 490814
329.fma3d_l | 23500 | 1166 | 322462 | 941 | 399786
331.art_l | 25000 | 290 | 1377258 | 277 | 1445765
SPECompLbase2001 | | | 493152 | | 504788
HyperThreading beneficial (overall base score 504788 with HT on vs. 493152 with HT off).
SPEC MPI
• Evaluate performance of MPI applications across the whole cluster
• Benchmark consists of 12 applications; medium and large datasets available
• Our results are not yet published, as we are still lacking a 16-node run
  – We spent a lot of time on the 8-node run
  – We think we know the source of the problem; we just have not yet been able to fix it
  – The problem is the result of rational configuration choices, and it does not impact our primary intended uses of the system
SPEC MPI
• Scalability study, 1 to 8 nodes – preliminary results, not yet published by SPEC
[Figure: SPEC score (0 to 35) versus number of nodes (1, 2, 4, 8) for three systems: IU HP DL580, Intel Endeavor, and Intel Atlantis.]
Endeavor: Intel Xeon X5560, 2.80 GHz, IB, Feb 2009
Atlantis: Intel Xeon X5482, 3.20 GHz, IB, Mar 2009
IU/HP: Intel Xeon L7555, 1.87 GHz, 10 GigE
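A scalability study like the one above is usually summarized as speedup and parallel efficiency relative to the single-node score. The scores in the sketch below are placeholders (the slide shows its values only graphically), so substitute measured SPEC MPI2007 results before drawing conclusions.

```python
# How a node-count scalability study is typically summarized: speedup and
# parallel efficiency relative to the single-node score.  The scores below
# are placeholders -- the slide shows its values only graphically -- so
# substitute measured SPEC MPI2007 results before drawing conclusions.

scores = {1: 5.0, 2: 9.5, 4: 18.0, 8: 30.0}   # hypothetical SPEC scores per node count

base = scores[1]
for nodes, score in sorted(scores.items()):
    speedup = score / base
    efficiency = speedup / nodes
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```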
Early users overview
• Metagenomic Sequence Analysis
• Genome Assembly and Annotation
• Genome Informatics for Animals and Plants
• Imputation of Genotypes and Sequence Alignment
• Daphnia Population Genomics
Metagenomic Sequence Analysis
Yuzhen Ye's Lab (IUB School of Informatics)
• Environmental sequencing
  – Sampling DNA sequences directly from the environment
  – Since the sequences consist of DNA fragments from hundreds or even thousands of species, the analysis is far more difficult than traditional sequence analysis that involves only one species.
• Assembling metagenomic sequences and extracting genes from the assembled dataset
• Dynamic programming is used to find the optimal mapping of consecutive contigs out of the assembly
• Since the number of contigs is enormous for most metagenomic datasets, a large-memory computing system is required to perform the dynamic programming algorithm so that the task can be completed in polynomial time
Genome Assembly and Annotation
Michael Lynch's Lab (IUB Department of Biology)
• Assembles and annotates genomes in the Paramecium aurelia species complex in order to eventually study the evolutionary fates of duplicate genes after whole-genome duplication. This project has also been performing RNA-seq on each genome, which is currently being used to aid genome annotation and will later be used to detect expression differences between paralogs.
• The assembler used is based on an overlap-layout-consensus method instead of a de Bruijn graph method (like some of the newer assemblers). It is more memory intensive: it requires performing pairwise alignments between all pairs of reads.
• The annotation of the genome assemblies involves programs such as GMAP, GSNAP, PASA, and Augustus. To use these programs, we need to load millions of RNA-seq and EST reads into memory and map them back to the genome.
Genome Informatics for Animals and Plants
Genome Informatics Lab (Don Gilbert) (IUB Department of Biology)
• This project finds genes in animals and plants, using the vast amounts of new gene information coming from next-generation sequencing technology. These improvements are applied to newly deciphered genomes: an environmental sentinel animal, the waterflea (Daphnia); the agricultural pest insect the pea aphid; the evolutionarily interesting jewel wasp (Nasonia); and the chocolate bean tree (Theobroma cacao), which will bring genomics insights to sustainable agriculture of cacao.
• Large-memory compute systems are needed for biological genome and gene transcript assembly because assembling genomic DNA or gene RNA sequence reads (billions of fragments) into full genomic or gene sequences requires a minimum of 128 GB of shared memory, and more depending on the dataset. These programs build graph matrices of sequence alignments in memory.
Imputation of Genotypes and Sequence Alignment
Tatiana Foroud's Lab (IUPUI Department of Medical and Molecular Genetics)
• Studies complex disorders using imputation of genotypes, typically for genome-wide association studies, as well as sequence alignment and post-processing of whole-genome and whole-exome sequencing.
• Requires analysis of markers in a genetic region (such as a chromosome) in several hundred representative individuals genotyped for the full reference panel of SNPs, with extrapolation of the inferred haplotype structures.
• More memory allows the imputation algorithms to evaluate haplotypes across much broader genomic regions, reducing or eliminating the need to partition chromosomes into segments. This results in imputed genotypes with both increased accuracy and speed, allowing improved evaluation of detailed within-study results as well as communication and collaboration (including meta-analysis) with other researchers using the disease study results.
Daphnia Population Genomics
Michael Lynch's Lab (IUB Department of Biology)
• This project involves whole-genome shotgun sequencing of more than 20 additional diploid genomes, each with a genome size of >200 megabases. With each genome sequenced to over 30x coverage, the full project involves both the mapping of reads to a reference genome and the de novo assembly of each individual genome.
• The genome assembly of millions of small reads often requires excessive amounts of memory; for that we once turned to Dash at SDSC. With Mason now online at IU, we have been able to run our assemblies and analysis programs here at IU.
Mason as an example of effective campus bridging
• The goal of campus bridging is to make local, regional, and national cyberinfrastructure facilities appear as if they were peripherals to your laptop
• Mason is designed for a specific set of tasks that drive a different configuration than XSEDE (the eXtreme Science and Engineering Discovery Environment – http://xsede.org/)
• For more information on campus bridging: http://pti.iu.edu/campusbridging/
Key points
• Increased amounts of data and decreased k-mer lengths are driving growing demands for data analysis in genome assembly
• The codes the biological community trusts are the codes they trust. Over time, testing may enable more use of distributed-memory codes. But for now, if we want to serve the biological community most effectively, we need to implement systems that match their research needs now.
• In the performance analysis of Mason we found two outcomes of note:
  – There is a problem in our switch configuration, still not sorted out, that is causing odd HPL results; we will continue to work on it.
  – The summary result on hyperthreading is "sometimes it helps, sometimes not"
• If we as a community are frustrated by the attention that senior administrators give to placement on the Top500 list, and by how that affects system configuration, we need to take more time to publish SPEC and/or HPCC benchmark results.
  – Much of the time this may mean "we got what we expected." But more data will make it easier to identify and understand results we don't expect.
• By implementing Mason – a lot of memory with some processors attached to it – we have enabled research that would otherwise not be possible.
Absolutely Shameless Plugs
• XSEDE12: Bridging from the eXtreme to the campus and beyond – July 16-20, 2012 | Chicago
  – The XSEDE12 Conference will be held at the beautiful InterContinental Chicago (Magnificent Mile) at 505 N. Michigan Ave. The hotel is in the heart of Chicago's most interesting tourist destinations and best shopping.
• Watch for Calls for Participation – coming early January
• And please visit the XSEDE and IU displays in the SC11 Exhibition Hallway!
Thanks
• Thanks for the invitation: Dr. Frank Baetke, Eva-Marie Markert, and HP
• Thanks to HP, particularly James Kovach, for partnership efforts over many years, including the implementation of Mason.
• Staff of the Research Technologies Division of University Information Technology Services, affiliated with the Pervasive Technology Institute, who led the implementation of Mason and benchmarking activities at IU: George Turner, Robert Henschel, David Y. Hancock, Matthew R. Link
• Our many collaborators in the Pervasive Technology Institute, particularly the co-PIs of NCGAS: Michael Lynch, Matthew Hahn, and Geoffrey C. Fox
• Those involved in campus bridging activities: Guy Almes, Von Welch, Patrick Dreher, Jim Pepin, Dave Jent, Stan Ahalt, Bill Barnett, Therese Miller, Malinda Lingwall, Maria Morris, Gabrielle Allen, Jennifer Schopf, Ed Seidel
• All of the IU Research Technologies and Pervasive Technology Institute staff who have contributed to the development of IU's advanced cyberinfrastructure and its support
• NSF for funding support (Awards 040777, 1059812, 0948142, 1002526, 0829462, 1062432, and OCI-1053575 – which supports the Extreme Science and Engineering Discovery Environment)
• Lilly Endowment, Inc. and the Indiana University Pervasive Technology Institute
• Any opinions presented here are those of the presenter and do not necessarily represent the opinions of the National Science Foundation or any other funding agencies
Thank you!
• Questions and discussion?