Clouds Web2.0 and Multicore for
Data Intensive Computing
LSU Baton Rouge LA
March 14 2008
Geoffrey Fox
Community Grids Laboratory, School of Informatics
Indiana University
http://www.infomall.org/multicore
[email protected], http://www.infomall.org
Abstract of Clouds Web2.0 and Multicore for Data Intensive Computing


We discuss the macroscopic and microscopic drivers for next
generation grids.
Clouds could support infrastructure at two to three orders of
magnitude larger scale than conventional data centers. This will
drive simple hardware and software architectures exploiting
virtual machines and "too much computing".
• Namely that multicore chips will offer so much performance that we need
not cobble together heterogeneous resources but rather can deploy simple
powerful systems.


Data analysis and data mining will be critical for both science and
commodity applications.
We study the parallelization of a class of data mining algorithms
on current multicore systems and contrast programming models
from MPI to MapReduce.
e-moreorlessanything







‘e-Science is about global collaboration in key areas of science,
and the next generation of infrastructure that will enable it.’ from
its inventor John Taylor, Director General of Research Councils
UK, Office of Science and Technology
e-Science is about developing tools and technologies that allow
scientists to do ‘faster, better or different’ research
Similarly e-Business captures an emerging view of corporations as
dynamic virtual organizations linking employees, customers and
stakeholders across the world.
This generalizes to e-moreorlessanything, including presumably e-Education and e-MardiGras ….
A deluge of data of unprecedented and inevitable size must be
managed and understood.
People (see Web 2.0), computers, data (including sensors and
instruments) must be linked.
On demand assignment of experts, computers, networks and
storage resources must be supported
Applications, Infrastructure,
Technologies






This field is confused by inconsistent use of terminology; I define:
Web Services, Grids and (aspects of) Web 2.0 (Enterprise 2.0) are
technologies
Grids could be everything (Broad Grids implementing some sort
of managed web) or reserved for specific architectures like OGSA
or Web Services (Narrow Grids)
These technologies combine and compete to build electronic
infrastructures termed e-infrastructure or Cyberinfrastructure
e-moreorlessanything is an emerging application area of broad
importance that is hosted on these infrastructures (e-infrastructure
or Cyberinfrastructure)
e-Science, or perhaps better e-Research, is a special case of e-moreorlessanything
Relevance of Web 2.0





Web 2.0 can help e-Science in many ways
Its tools (web sites) can enhance scientific collaboration,
i.e. effectively support virtual organizations, in
different ways from grids
The popularity of Web 2.0 can provide high quality
technologies and software that (due to large
commercial investment) can be very useful in e-Science
and preferable to Grid or Web Service solutions
The usability and participatory nature of Web 2.0 can
bring science and its informatics to a broader audience
Web 2.0 can even help the emerging challenge of using
multicore chips i.e. in improving parallel computing
programming and runtime environments
“Best Web 2.0 Sites” -- 2006
Extracted from http://web2.wsj2.com/
See http://www.seomoz.org/web2.0 for the May 2007 list
 All important capabilities for e-Science:
 Social Networking
 Start Pages
 Social Bookmarking
 Peer Production News
 Social Media Sharing
 Online Storage (Computing)
MSI-CIEC Web 2.0 Research Matching Portal










Portal supporting tagging and linkage of Cyberinfrastructure Resources:
• NSF (and other agencies via grants.gov) Solicitations and Awards
• Feeds such as SciVee and NSF
• Researchers on NSF Awards
• User and Friends
• TeraGrid Allocations
• Search for linked people, grants etc.
• Could also be used to support matching of students and faculty for REUs etc.
[Screenshots: MSI-CIEC Portal Homepage and Search Results]
Web 2.0 Systems like Grids have Portals, Services, Resources

Captures the incredible development of interactive Web
sites enabling people to create and collaborate
Web 2.0 and Web Services




I once thought Web Services were inevitable but this is no longer
clear to me
Web services are complicated, slow and non-functional
• WS-Security is unnecessarily slow and pedantic
(canonicalization of XML)
• WS-RM (Reliable Messaging) seems to have poor adoption
and doesn’t work well in collaboration
• WSDM (distributed management) specifies a lot
There are de facto Web 2.0 standards like Google Maps and
powerful suppliers like Google/Microsoft which “define the
architectures/interfaces”
One can easily combine SOAP (Web Service) based
services/systems with HTTP messages but dominance of “lowest
common denominator” suggests additional structure/complexity
of SOAP will not easily survive
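A minimal sketch of the “lowest common denominator” HTTP style referred to above (this example is not from the talk; the endpoint URL and response format are assumptions): a single plain GET replaces the SOAP envelope, WS-Addressing headers and canonicalized XML signatures of the WS-* stack.

using System;
using System.Net;

class RestClientSketch
{
    static void Main()
    {
        // Hypothetical REST endpoint; a Web 2.0 service is "just" a URL plus an
        // agreed payload format (JSON, RSS, ATOM ...), not a WSDL contract.
        string url = "http://example.org/api/geocode?q=Baton+Rouge,LA&format=json";

        using (var client = new WebClient())
        {
            // One HTTP GET and the payload comes back as text to be parsed.
            string json = client.DownloadString(url);
            Console.WriteLine(json);
        }
    }
}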
Distribution of APIs and Mashups per
Protocol
[Chart: number of APIs and number of mashups per protocol – REST, SOAP, XML-RPC, REST + XML-RPC, REST + XML-RPC + SOAP, REST + SOAP, JS, other – for services such as Google Maps, del.icio.us, 411sync, Yahoo! Search, Yahoo! Geocoding, Virtual Earth, Technorati, Netvibes, Yahoo! Images, Trynt, Yahoo! Local, Amazon ECS, Google Search, Flickr, eBay, YouTube, Amazon S3 and live.com. SOAP is quite a small fraction.]
Too much Computing?



Historically both grids and parallel computing have tried to
increase computing capabilities by
• Optimizing performance of codes at cost of re-usability
• Exploiting all possible CPUs such as graphics coprocessors and “idle cycles” (across administrative domains)
• Linking central computers together such as NSF/DoE/DoD
supercomputer networks without clear user requirements
The next crisis in the technology area will be the opposite problem –
commodity chips will be 32-128 way parallel in 5 years’ time
and we currently have no idea how to use them on commodity
systems – especially on clients
• Only 2 releases of standard software (e.g. Office) in this
time span so need solutions that can be implemented in
next 3-5 years
Intel RMS analysis: Gaming and Generalized decision
support (data mining) are ways of using these cycles
Intel’s Projection
Too much Data to the Rescue?




Multicore servers have clear “universal parallelism” as many
users can access and use machines simultaneously
Maybe also need application parallelism (e.g. datamining) as
needed on client machines
Over the next years, we will of course be submerged in the data
deluge
• Scientific observations for e-Science
• Local (video, environmental) sensors
• Data fetched from the Internet defining users’ interests
Maybe data-mining of this “too much data” will use up the
“too much computing” both for science and commodity PC’s
• The PC will use this data(-mining) to be an intelligent user
assistant?
• Must have highly parallel algorithms
What are Clouds?

Clouds are “Virtual Clusters” (maybe “Virtual Grids”)
of possibly “Virtual Machines”
• They may cross administrative domains or may “just be a
single cluster”; the user cannot and does not want to know

Clouds support access to (lease of) computer instances
• Instances accept data and job descriptions (code) and return
results that are data and status flags


Each Cloud is a “Narrow” (perhaps internally
proprietary) Grid
When does the Cloud concept work?
• Parameter searches, LHC style data analysis ..
• Common case (most likely success case for clouds) versus
corner case?


Clouds can be built from Grids
Grids can be built from Clouds
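When the cloud concept works (parameter searches, LHC-style data analysis), the programming model can be as simple as the dispatch loop sketched below. This is not a real cloud API: ICloudInstance and SubmitJob are hypothetical placeholders standing in for whatever lease/submit interface a particular cloud exposes.

using System;
using System.Collections.Generic;

// Hypothetical lease/submit contract: send code plus data, get back results or status flags.
interface ICloudInstance
{
    string SubmitJob(string code, double parameter);
}

class ParameterSearchSketch
{
    // Embarrassingly parallel sweep: one parameter value per leased instance.
    static List<string> Run(IList<ICloudInstance> leasedInstances)
    {
        List<string> results = new List<string>();
        for (int i = 0; i < leasedInstances.Count; i++)
        {
            double parameter = 0.1 * i;   // a point on the search grid
            results.Add(leasedInstances[i].SubmitJob("a.out", parameter));
        }
        return results;
    }
}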
Raw Data → Data → Information → Knowledge → Wisdom → Decisions
Information and Cyberinfrastructure
[Diagram: a traditional Grid with exposed services (S) and filter services (fs) connected through a Sensor or Data Interchange Service to Discovery Clouds, Filter Clouds and Filter Services, a Compute Cloud, a Storage Cloud, a Database, other Services and other Grids.]
Clouds and Grids






Clouds are meant to help user by simplifying interface to
computing
Clouds are meant to help CIO and CFO by simplifying system
architecture enabling larger (factor of 100) more cost effective
data centers
Clouds support green computing by supporting remote locations
where operations, including power, are cheaper
Clouds are like Grids in many ways but a cloud is built as an “ab
initio” system whereas Grids are built from existing
heterogeneous systems (with heterogeneity exposed)
The low level interoperability architecture of services has failed –
the WS-* do not work. However one only needs these if linking
heterogeneous systems. Clouds do not need low level
interoperability but rather expose high level interfaces
Clouds are very, very loosely coupled; services are loosely coupled
Technical Questions about Clouds I


What is performance overhead?
• On individual CPU
• On system including data and program transfer
What is the cost gain?
• From size efficiency; “green” location

Is Cloud Security adequate: can clouds be
trusted?

Can one do parallel computing on clouds?
• Looking at “capacity” not “capability” i.e. lots of
modest sized jobs
• The Marine Corps will use Petaflop machines – they just
need ssh and a.out
Technical Questions about Clouds II

How is data-compute affinity tackled in clouds?
• Co-locate data and compute clouds?
• Lots of optical fiber i.e. “just” move the data?

What happens in clouds when demand for resources
exceeds capacity – is there a multi-day job input queue?
• Are there novel cloud scheduling issues?


Do we want to link clouds (or ensembles defined as
atomic clouds); if so, how and with what protocols?
Is there an intranet cloud, e.g. “cloud in a box” software
to manage a personal (cores on my future 128-core
laptop), department or enterprise cloud?
MSI Challenge Problem






There are > 330 MSIs – Minority Serving Institutions
• 2 examples
ECSU (Elizabeth City State University) is a small state university
in North Carolina
• HBCU with 4000 students
• Working on PolarGrid (Sensors in Arctic/Antarctic linked to
“TeraGrid”)
Navajo Tech in Crown Point NM is a community college with
technology leadership for the Navajo Nation
• “Internet to the Hogan and Dine Grid” links Navajo
communities by wireless
• Wish to integrate TeraGrid science into Navajo Nation
education curriculum
Current Grid technology is too complicated, especially if you are
not an R1 institution
Hard to deploy campus grids broadly into MSIs
Clouds could provide virtual campus resources?
Where did Narrow Grids and Web Services go wrong?
 Interoperability Interfaces will be for data not for
infrastructure
• Google, Amazon, TeraGrid, European Grids will not
interoperate at the resource or compute (processing) level
but rather at the data streams flowing in and out of
independent Grid clouds
• Data focus is consistent with Semantic Grid/Web but it is not
clear if the latter has learnt the usability message of Web 2.0
 Lack of detailed standards in Web 2.0 is preferable to industry,
which can get proprietary advantage inside its clouds
 One needs to share computing, data and people in e-moreorlessanything; Grids initially focused on computing but
data and people are more important
 eScience is healthy as is e-moreorlessanything
 Most Grids are solving the wrong problem at the wrong point in the stack
with a complexity that makes friendly usability difficult
Superior (from broad usage)
technologies of Web 2.0
Mash-ups can replace Workflow
Gadgets can replace Portlets
UDDI replaced by user generated
registries
Mashups v Workflow?


Mashup Tools are reviewed at
http://blogs.zdnet.com/Hinchcliffe/?p=63
Workflow Tools are reviewed by Gannon and Fox
http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-overview.pdf



• Both include scripting in PHP, Python, sh etc. as both implement distributed programming at the level of services
• Mashups use all types of service interfaces and perhaps do not have the potential robustness (security) of the Grid service approach
• Mashups are typically “pure” HTTP (REST); see the sketch below
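As a concrete illustration of “distributed programming at the level of services”, here is a minimal mashup-style composition sketch (not from the slides): two hypothetical REST feeds are fetched over plain HTTP and merged into one result, which is essentially the pattern that graphical tools like Yahoo Pipes and Popfly automate. The URLs are placeholders.

using System;
using System.Net;

class MashupSketch
{
    // Fetch two hypothetical feeds and combine them -- the "pipe" pattern
    // that mashup tools wrap in a graphical interface.
    static void Main()
    {
        using (var client = new WebClient())
        {
            string newsFeed  = client.DownloadString("http://example.org/news.rss");
            string quakeFeed = client.DownloadString("http://example.org/earthquakes.rss");

            // A real mashup would parse the XML/JSON and join on location, time, etc.;
            // here we simply concatenate the raw payloads to show the composition step.
            string combined = newsFeed + "\n" + quakeFeed;
            Console.WriteLine("Combined feed length: {0} characters", combined.Length);
        }
    }
}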
Major Companies entering mashup area



Web 2.0 Mashups (by definition the largest market) are likely to
drive composition tools for Grid and web
Recently we see Mashup tools like Yahoo Pipes and Microsoft
Popfly which have familiar graphical interfaces
Currently only simple examples but tools could become powerful
Yahoo Pipes
Web 2.0 Mashups
and APIs


http://www.programmableweb.com/ has (March 13 2008) 2857 Mashups and 670 Web 2.0 APIs, with GoogleMaps the most often used in Mashups
This is the Web 2.0 UDDI (service registry)
The List of Web 2.0 API’s






• Each site has its API and features listed
• Divided into broad categories
• Only a few are used a lot (60 APIs used in 10 or more mashups)
• RSS feed of new APIs
• Google Maps dominates but Amazon EC2/S3 is growing in popularity
• Interesting that there is no such eScience site; are we not building interoperable (reusable) services?
Grid-style portal as used in Earthquake Grid
The Portal is built from portlets – providing user interface fragments for each service – that are composed into the full interface; it uses OGCE technology, as does the planetary science VLAB portal with the University of Minnesota
QuakeSim has a typical Grid technology portal
Such server-side portlet-based approaches to portals are being challenged by client-side gadgets from Web 2.0
Portlets are aggregated on the server using Java, analogous to JSP, JSF
Gadgets are aggregated on the client using Javascript, analogous to “classic” DHTML
Mashups can still be totally server side like workflow
Note Web 2.0 is more than a user interface
Now to Portals
Note the many competitions powering Web 2.0
Mashup and Gadget Development
Portlets v. Google Gadgets





Portals for Grid Systems are built using portlets with
software like GridSphere integrating these on the
server-side into a single web-page
Google (at least) offers the Google sidebar and Google
home page which support Web 2.0 services and do not
use a server side aggregator
Google is more user friendly!
The many Web 2.0 competitions are an interesting model
for promoting development in the world-wide
distributed collection of Web 2.0 developers
I guess the Web 2.0 model will win!
Typical Google Gadget Structure
Google Gadgets are an example of
Start Page (Web 2.0 term for portals)
technology
See http://blogs.zdnet.com/Hinchcliffe/?p=8

<Module> <ModulePrefs title="…" /> <Content type="html">
… Lots of HTML and JavaScript </Content> </Module>
Portlets build User Interfaces by combining fragments in a standalone Java Server
Google Gadgets build User Interfaces by combining fragments with JavaScript on the client
The Ten areas covered by the 60 core WS-* Specifications
WS-* Specification Area – Typical Grid/Web Service Examples
1: Core Service Model: XML, WSDL, SOAP
2: Service Internet: WS-Addressing, WS-MessageDelivery; Reliable Messaging WSRM; Efficient Messaging MOTM
3: Notification: WS-Notification, WS-Eventing (Publish-Subscribe)
4: Workflow and Transactions: BPEL, WS-Choreography, WS-Coordination
5: Security: WS-Security, WS-Trust, WS-Federation, SAML, WS-SecureConversation
6: Service Discovery: UDDI, WS-Discovery
7: System Metadata and State: WSRF, WS-MetadataExchange, WS-Context
8: Management: WSDM, WS-Management, WS-Transfer
9: Policy and Agreements: WS-Policy, WS-Agreement
10: Portals and User Interfaces: WSRP (Remote Portlets)
WS-* Areas and Web 2.0
WS-* Specification Area – Web 2.0 Approach
1: Core Service Model: XML becomes optional but still useful; SOAP becomes JSON, RSS, ATOM; WSDL becomes REST with API as GET, PUT etc.; Axis becomes XmlHttpRequest
2: Service Internet: No special QoS. Use JMS or equivalent?
3: Notification: Hard with HTTP without polling – JMS perhaps?
4: Workflow and Transactions (no Transactions in Web 2.0): Mashups, Google MapReduce; Scripting with PHP, JavaScript ….
5: Security: SSL, HTTP Authentication/Authorization; OpenID is Web 2.0 Single Sign on
6: Service Discovery: http://www.programmableweb.com
7: System Metadata and State: Processed by application – no system state – Microformats are a universal metadata approach
8: Management==Interaction: WS-Transfer style protocols, GET, PUT etc.
9: Policy and Agreements: Service dependent. Processed by application
10: Portals and User Interfaces: Start Pages, AJAX and Widgets (Netvibes), Gadgets
Web 2.0 can also help address
long-standing difficulties with
parallel programming
environments
Use workflow or mashups to compose services
instead of building libraries
Service Aggregated Linked Sequential Activities
SALSA Team
Geoffrey Fox
Xiaohong Qiu
Seung-Hee Bae
Huapeng Yuan
Indiana University
Technology Collaboration
George Chrysanthakopoulos
Henrik Frystyk Nielsen
Microsoft
Application Collaboration
Cheminformatics
Rajarshi Guha
David Wild
Bioinformatics
Haixu Tang
Demographics (GIS)
Neil Devadasan
IU Bloomington and IUPUI
GOALS: Increasing number of cores
accompanied by continued data deluge.
Develop scalable parallel data mining
algorithms with good multicore and
cluster performance; understand
software runtime and parallelization
method. Use managed code (C#) and
package algorithms as services to
encourage broad use assuming
experts parallelize core algorithms.
CURRENT RESULTS: Microsoft CCR supports MPI,
dynamic threading and via DSS a Service model of
computing; detailed performance measurements
Speedups of 7.5 or above on 8-core systems for
“large problems” with deterministic annealed (avoid
local minima) algorithms for clustering, Gaussian
Mixtures, GTM (dimensional reduction) etc.
SALSA
General Problem Classes
N data points E(x) in D dimensional space OR
points with dissimilarity δij defined between them
Unsupervised Modeling
• Find clusters without prejudice
• Model distribution as clusters formed from
Gaussian distributions with general shape
• Both can use multi-resolution annealing
Dimensional Reduction/Embedding
• Given vectors, map into lower dimension space
“preserving topology” for visualization: SOM and GTM
• Given δij, associate data points with vectors in a
Euclidean space with Euclidean distance approximately
δij: MDS (can anneal) and Random Projection
Data Parallel over N data points E(x)
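For reference, the MDS criterion mentioned above is usually written as a stress function to be minimized; this standard formulation, with weights w_ij, is added here for clarity and is not spelled out on the slide:
Stress(X) = Σ_{i<j} w_ij ( d_ij(X) − δij )², where d_ij(X) = || X_i − X_j || is the Euclidean distance between the mapped points X_i and X_j.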
SALSA
N data points E(x) in D dimensional space: minimize F by EM, where
F = −T Σ_{x=1..N} a(x) ln{ Σ_{k=1..K} g(k) exp[ −0.5 (E(x) − Y(k))² / (T s(k)) ] }
Deterministic Annealing Clustering (DAC)
• a(x) = 1/N or generally p(x) with Σ p(x) = 1
• g(k) = 1 and s(k) = 0.5
• T is the annealing temperature, varied down from ∞ with final value of 1
• Vary cluster center Y(k)
• K starts at 1 and is incremented by the algorithm
• My 4th most cited article but little used; probably as no good software compared to simple K-means
SALSA
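The EM iteration implied by this formula is simple to sketch. The following is a minimal single-threaded illustration of one DAC sweep (a(x) = 1/N, g(k) = 1, s(k) = 0.5). It is not the SALSA production code, which is C# with CCR and normalizes the exponentials more carefully; variable names are illustrative.

using System;

class DacSketch
{
    // One EM iteration of Deterministic Annealing Clustering at temperature T.
    // E: N x D data points, Y: K x D cluster centers (updated in place).
    static void EmStep(double[][] E, double[][] Y, double T)
    {
        int N = E.Length, K = Y.Length, D = E[0].Length;
        double[][] newY = new double[K][];
        double[] weight = new double[K];
        for (int k = 0; k < K; k++) newY[k] = new double[D];

        for (int x = 0; x < N; x++)
        {
            // Soft assignment p(k|x) proportional to exp(-0.5 |E(x)-Y(k)|^2 / (T * s(k))), s(k)=0.5
            double[] p = new double[K];
            double sum = 0;
            for (int k = 0; k < K; k++)
            {
                double d2 = 0;
                for (int j = 0; j < D; j++) { double diff = E[x][j] - Y[k][j]; d2 += diff * diff; }
                p[k] = Math.Exp(-0.5 * d2 / (T * 0.5));
                sum += p[k];
            }
            for (int k = 0; k < K; k++)
            {
                p[k] /= sum;                 // production code rescales exponents to avoid underflow
                weight[k] += p[k];
                for (int j = 0; j < D; j++) newY[k][j] += p[k] * E[x][j];
            }
        }
        // M step: centers move to probability-weighted means; annealing lowers T outside this routine.
        for (int k = 0; k < K; k++)
            for (int j = 0; j < D; j++)
                Y[k][j] = newY[k][j] / Math.Max(weight[k], 1e-12);
    }
}

In the parallel implementation the loop over the data points x is decomposed across threads or MPI processes, which is exactly the pattern whose overheads the following slides measure.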
Deterministic Annealing Clustering of Indiana Census Data
Decrease temperature (distance scale) to discover more clusters
[Plot: clusters emerging as the distance scale Temperature^0.5 is decreased]
Deterministic Annealing F({Y}, T)
• Solve linear equations for each temperature
• Nonlinearity removed by approximating with the solution at the previous higher temperature
[Sketch: F({Y}, T) versus configuration {Y}]
• Minimum evolving as temperature decreases
• Movement at fixed temperature going to local minima if not initialized “correctly”
N data points E(x) in D dimensional space: minimize F by EM, where
F = −T Σ_{x=1..N} a(x) ln{ Σ_{k=1..K} g(k) exp[ −0.5 (E(x) − Y(k))² / (T s(k)) ] }
The same formula covers four related algorithms:
Deterministic Annealing Clustering (DAC)
• a(x) = 1/N or generally p(x) with Σ p(x) = 1
• g(k) = 1 and s(k) = 0.5
• T is the annealing temperature, varied down from ∞ with final value of 1
• Vary cluster center Y(k); K starts at 1 and is incremented by the algorithm
Deterministic Annealing Gaussian mixture models (DAGM)
• a(x) = 1
• g(k) = { P_k / (2π σ(k)²)^(D/2) }^(1/T)
• s(k) = σ(k)² (taking the case of a spherical Gaussian)
• T is the annealing temperature, varied down from ∞ with final value of 1
• Vary Y(k), P_k and correlation matrix σ(k); K starts at 1 and is incremented by the algorithm
Traditional Gaussian mixture models (GM)
• As DAGM but set T = 1 and fix K
Generative Topographic Mapping (GTM)
• a(x) = 1 and g(k) = (1/K)(β/2π)^(D/2)
• s(k) = 1/β and T = 1
• Y(k) = Σ_{m=1..M} W_m φ_m(X(k)) with fixed φ_m(X) = exp(−0.5 (X − μ_m)²/σ²)
• Vary W_m and β but fix the values of M and K a priori
• Y(k), E(x), W_m are vectors in the original high-dimensional (D) space; X(k) and μ_m are vectors in the 2-dimensional mapped space
DAGTM: Deterministic Annealed Generative Topographic Mapping
• GTM has several natural annealing versions based on either DAC or DAGM (using identical formulae for the Gaussian mixtures): under investigation
SALSA

We implement micro-parallelism using Microsoft CCR
(Concurrency and Coordination Runtime) as it supports both MPI rendezvous
and dynamic (spawned) threading style of parallelism
http://msdn.microsoft.com/robotics/

CCR Supports exchange of messages between threads using named ports
and has primitives like:
 FromHandler: Spawn threads without reading ports
 Receive: Each handler reads one item from a single port
 MultipleItemReceive: Each handler reads a prescribed number of items of
a given type from a given port. Note items in a port can be general
structures but all must have same type.
 MultiplePortReceive: Each handler reads one item of a given type from
multiple ports.

CCR has fewer primitives than MPI but can implement MPI collectives
efficiently

Use DSS (Decentralized System Services) built in terms of CCR for service
model

DSS has ~35 µs and CCR a few µs overhead
SALSA
Multicore Matrix Multiplication
(dominant linear algebra in GTM)
Speedup = Number of cores/(1+f)
f = (Sum of Overheads)/(Computation per core)
Computation ∝ Grain Size n · # Clusters K
Overheads are:
• Synchronization: small with CCR
• Load Balance: good
• Memory Bandwidth Limit: → 0 as K → ∞
• Cache Use/Interference: important
• Runtime Fluctuations: dominant at large n, K
All our “real” problems have f ≤ 0.05 and speedups on 8-core systems greater than 7.6
[Plot: execution time in seconds versus block size (1 to 10,000) for 4096×4096 matrix multiplication on 1 core and 8 cores; parallel overhead ≈ 1%.]
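As a quick check (this arithmetic is not on the slide): with 8 cores and f = 0.05 the speedup formula above gives 8/(1 + 0.05) ≈ 7.6, matching the quoted 8-core speedups.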
Parallel GTM Performance
[Plot: fractional overhead f (0 to about 0.14) versus 1/(grain size n) for 4096 interpolating clusters, with points labelled n = 500, 100 and 50.]
SALSA
MPI Exchange Latency in µs (20-30 µs computation between messaging)
Messaging CCR versus MPI: C# v. C v. Java

Machine | OS | Runtime | Grains | Parallelism | MPI Exchange Latency (µs)
Intel8c:gf12 (8 core, 2.33 GHz, in 2 chips) | Redhat | MPJE (Java) | Process | 8 | 181
Intel8c:gf12 | Redhat | MPICH2 (C) | Process | 8 | 40.0
Intel8c:gf12 | Redhat | MPICH2: Fast | Process | 8 | 39.3
Intel8c:gf12 | Redhat | Nemesis | Process | 8 | 4.21
Intel8c:gf20 (8 core, 2.33 GHz) | Fedora | MPJE | Process | 8 | 157
Intel8c:gf20 | Fedora | mpiJava | Process | 8 | 111
Intel8c:gf20 | Fedora | MPICH2 | Process | 8 | 64.2
Intel8b (8 core, 2.66 GHz) | Vista | MPJE | Process | 8 | 170
Intel8b | Fedora | MPJE | Process | 8 | 142
Intel8b | Fedora | mpiJava | Process | 8 | 100
Intel8b | Vista | CCR (C#) | Thread | 8 | 20.2
AMD4 (4 core, 2.19 GHz) | XP | MPJE | Process | 4 | 185
AMD4 | Redhat | MPJE | Process | 4 | 152
AMD4 | Redhat | mpiJava | Process | 4 | 99.4
AMD4 | Redhat | MPICH2 | Process | 4 | 39.3
AMD4 | XP | CCR | Thread | 4 | 16.3
Intel (4 core) | XP | CCR | Thread | 4 | 25.8
Intel8b: 8 Core – CCR overhead in µs versus number of parallel computations

Number of Parallel Computations | 1 | 2 | 3 | 4 | 7 | 8
Dynamic (spawned) threads:
Pipeline | 1.58 | 2.44 | 3 | 2.94 | 4.5 | 5.06
Shift | – | 2.42 | 3.2 | 3.38 | 5.26 | 5.14
Two Shifts | – | 4.94 | 5.9 | 6.84 | 14.32 | 19.44
Rendezvous (MPI style):
Pipeline | 2.48 | 3.96 | 4.52 | 5.78 | 6.82 | 7.18
Shift | – | 4.46 | 6.42 | 5.86 | 10.86 | 11.74
Exchange as Two Shifts | – | 7.4 | 11.64 | 14.16 | 31.86 | 35.62
Exchange (CCR custom) | – | 6.94 | 11.22 | 13.3 | 18.78 | 20.16
SALSA
[Plot: time in microseconds (up to about 30) versus stages in millions (0-10) for AMD Exchange, AMD Exchange as 2 Shifts and AMD Shift.]
Overhead (latency) of AMD4 PC with 4 execution threads on MPI style Rendezvous
Messaging for Shift and Exchange implemented either as two shifts or as custom CCR
pattern
[Plot: time in microseconds (up to about 70) versus stages in millions (0-10) for Intel Exchange, Intel Exchange as 2 Shifts and Intel Shift.]
Overhead (latency) of Intel8b PC with 8 execution threads on MPI style Rendezvous
Messaging for Shift and Exchange implemented either as two shifts or as custom
CCR pattern
[Plot a): scaled runtime (runtime divided by Grain Size n · # Clusters K) versus number of threads (one per core, 1-8) on Intel 8b, Vista, C# CCR, 1 cluster, for 10,000, 50,000 and 500,000 datapoints per thread; 8 cores (threads) and 1 cluster show a memory bandwidth effect.]
[Plot b): the same measurement for 80 clusters; 80 clusters show a cache/memory bandwidth effect.]
[Plots: standard deviation of runtime divided by runtime versus number of threads (one per core, 1-8) for 80 clusters with 10,000, 50,000 and 500,000 datapoints per thread; Intel 8a, XP, C# CCR (top) and Intel 8c, Redhat, C with locks (bottom). This is the average of the standard deviation of the run time of the 8 threads between synchronization (messaging) points.]





Early implementations of our clustering algorithm
showed large fluctuations due to the cache line
interference effect (false sharing)
We have one thread on each core, each calculating a sum of the
same complexity and storing the result in a common array A,
with different cores using different array locations
Thread i storing its sum in A(i) is separation 1 – no memory
access interference but cache line interference
Thread i storing its sum in A(X*i) is separation X
Serious degradation if X < 8 (64 bytes) with Windows


Note A is a double (8 bytes)
Less interference effect with Linux – especially Red Hat
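A self-contained sketch (not the original benchmark) of the experiment described above: each of 8 threads accumulates into its own slot of a shared double array, and the slot separation is varied. With separation 1 the slots share a 64-byte cache line (false sharing); with separation 8 each slot owns a cache line. The iteration count and the Math.Sqrt “work” are illustrative.

using System;
using System.Diagnostics;
using System.Threading;

class FalseSharingDemo
{
    // Each thread repeatedly accumulates into its own slot of a shared array.
    // With separation = 1 the slots share a 64-byte cache line; with
    // separation = 8 (8 doubles = 64 bytes) each slot gets its own line.
    static void Run(int nThreads, int separation)
    {
        double[] A = new double[nThreads * separation + 1];
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++)
        {
            int slot = i * separation;
            workers[i] = new Thread(delegate()
            {
                for (int iter = 1; iter <= 10000000; iter++)
                    A[slot] += Math.Sqrt(iter);   // same work for every thread
            });
        }
        Stopwatch sw = Stopwatch.StartNew();
        foreach (Thread t in workers) t.Start();
        foreach (Thread t in workers) t.Join();
        sw.Stop();
        Console.WriteLine("separation {0}: {1} ms", separation, sw.ElapsedMilliseconds);
    }

    static void Main()
    {
        Run(8, 1);   // threads share cache lines -> slower, especially on Windows
        Run(8, 8);   // one cache line per thread -> faster
    }
}

Timing the two calls should reproduce the qualitative pattern of the table that follows: a large gap between separation 1 and separation 8 on the Windows runs.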
Time in µs versus thread array separation (unit is 8 bytes, the size of a double); Mean and Std/Mean over the threads

Machine | OS | Run Time | Sep 1: Mean, Std/Mean | Sep 4: Mean, Std/Mean | Sep 8: Mean, Std/Mean | Sep 1024: Mean, Std/Mean
Intel8b | Vista | C# CCR | 8.03, .029 | 3.04, .059 | 0.884, .0051 | 0.884, .0069
Intel8b | Vista | C# Locks | 13.0, .0095 | 3.08, .0028 | 0.883, .0043 | 0.883, .0036
Intel8b | Vista | C | 13.4, .0047 | 1.69, .0026 | 0.66, .029 | 0.659, .0057
Intel8b | Fedora | C | 1.50, .01 | 0.69, .21 | 0.307, .0045 | 0.307, .016
Intel8a | XP | C# CCR | 10.6, .033 | 4.16, .041 | 1.27, .051 | 1.43, .049
Intel8a | XP | C# Locks | 16.6, .016 | 4.31, .0067 | 1.27, .066 | 1.27, .054
Intel8a | XP | C | 16.9, .0016 | 2.27, .0042 | 0.946, .056 | 0.946, .058
Intel8c | Red Hat | C | 0.441, .0035 | 0.423, .0031 | 0.423, .0030 | 0.423, .032
AMD4 | WinSrvr | C# CCR | 8.58, .0080 | 2.62, .081 | 0.839, .0031 | 0.838, .0031
AMD4 | WinSrvr | C# Locks | 8.72, .0036 | 2.42, .01 | 0.836, .0016 | 0.836, .0013
AMD4 | WinSrvr | C | 5.65, .020 | 2.69, .0060 | 1.05, .0013 | 1.05, .0014
AMD4 | XP | C# CCR | 8.05, .010 | 2.84, .077 | 0.84, .040 | 0.840, .022
AMD4 | XP | C# Locks | 8.21, .006 | 2.57, .016 | 0.84, .007 | 0.84, .007
AMD4 | XP | C | 6.10, .026 | 2.95, .017 | 1.05, .019 | 1.05, .017

Note: measurements at a separation X of 8 and X = 1024 (and values between 8 and 1024, not shown) are essentially identical.
Measurements at X = 7 (not shown) are higher than those at 8 (except for Red Hat, which shows essentially no enhancement at X < 8).
The effects are due to co-location of thread variables in a 64-byte cache line; aligning the array with cache-line boundaries avoids them.
Parallel Generative Topographic Mapping GTM
Reduce dimensionality preserving
topology and perhaps distances
Here project to 2D
GTM Projection of PubChem: 10,926,94 compounds in a 166-dimension binary property space takes 4 days on 8 cores. A 64×64 mesh of GTM clusters interpolates PubChem. Could usefully use 1024 cores! David Wild will use this for a GIS-style 2D browsing interface to chemistry.
[Figures: Linear PCA v. nonlinear GTM on 6 Gaussians in 3D (PCA is Principal Component Analysis); GTM projection of 2 clusters of 335 compounds in 155 dimensions.]
SALSA
“Main Thread” and Memory M
[Diagram: the main thread with memory M exchanges MPI/CCR/DSS messages with other nodes and spawns subsidiary threads 0-7, each with its own memory m0-m7.]
Subsidiary threads t with memory mt

Use Data Decomposition as in classic distributed memory
but use shared memory for read variables. Each thread
uses a “local” array for written variables to get good cache
performance

Multicore and Cluster use same parallel algorithms but
different runtime implementations; algorithms are
 Accumulate matrix and vector elements in each process/thread
 At iteration barrier, combine contributions (MPI_Reduce)
 Linear Algebra (multiplication, equation solving, SVD)
SALSA
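A minimal sketch of the accumulate-then-reduce pattern just described (assumed structure, not the SALSA code; Parallel.For from the later Task Parallel Library is used only for brevity): each thread sums into its own local array, and the per-thread contributions are combined after the barrier, the shared-memory analogue of MPI_Reduce.

using System;
using System.Threading.Tasks;

class AccumulateReduceSketch
{
    // Each thread owns a local accumulation array (good cache behaviour, no false
    // sharing); a reduction combines them once all threads reach the barrier.
    static double[] ParallelAccumulate(double[][] data, int nThreads, int vectorLength)
    {
        double[][] local = new double[nThreads][];

        Parallel.For(0, nThreads, t =>
        {
            local[t] = new double[vectorLength];
            // Data decomposition: thread t handles points t, t + nThreads, ...
            for (int x = t; x < data.Length; x += nThreads)
                for (int j = 0; j < vectorLength; j++)
                    local[t][j] += data[x][j];            // "written" variables stay local
        });

        // Iteration barrier reached (Parallel.For joins); now reduce, as MPI_Reduce would.
        double[] global = new double[vectorLength];
        for (int t = 0; t < nThreads; t++)
            for (int j = 0; j < vectorLength; j++)
                global[j] += local[t][j];
        return global;
    }
}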


• Micro-parallelism uses low latency CCR threads or MPI processes
• Services can be used where loose coupling is natural:
 Input data
 Algorithms:
 PCA
 DAC, GTM, GM, DAGM, DAGTM – both for the complete algorithm and for each iteration
 Linear Algebra used inside or outside the above
 Metric embedding: MDS, Bourgain, Quadratic Programming ….
 HMM, SVM ….
 User interface: GIS (Web Map Service) or equivalent
SALSA
DSS Service Measurements
[Plot: average run time in microseconds (0-350) versus number of round trips (1 to 10,000).]
Measurements of Axis 2 show about 500 microseconds – DSS is 10 times better

This class of data mining does/will parallelize well on current/future multicore
nodes

Several engineering issues for use in large applications
 How to take CCR in multicore node to cluster (MPI or cross-cluster CCR?)
 Need high performance linear algebra for C# (PLASMA from UTenn)
 Access linear algebra services in a different language?
 Need equivalent of Intel C Math Libraries for C# (vector arithmetic – level 1
BLAS)
 Service model to integrate modules

Need access to a ~ 128 node Windows cluster

Future work is more applications; refine current algorithms such as DAGTM

New parallel algorithms
 Clustering with pairwise distances but no vector spaces
 Bourgain Random Projection for metric embedding
 MDS Dimensional Scaling with EM-like SMACOF and deterministic annealing
 Support use of Newton’s Method (Marquardt’s method) as an EM alternative
 Later HMM and SVM
SALSA