Clouds, Web 2.0 and Multicore for Data Intensive Computing. LSU, Baton Rouge LA, March 14 2008. Geoffrey Fox, Community Grids Laboratory, School of Informatics, Indiana University. http://www.infomall.org/multicore, [email protected], http://www.infomall.org
Abstract
We discuss the macroscopic and microscopic drivers for next generation grids. Clouds could support infrastructure at two to three orders of magnitude larger scale than conventional data centers. This will drive simple hardware and software architectures exploiting virtual machines and "too much computing":
• Namely that multicore chips will offer so much performance that we need not cobble together heterogeneous resources but rather can deploy simple powerful systems.
Data analysis and data mining will be critical applications for both science and commodity applications. We study the parallelization of a class of data mining algorithms on current multicore systems and contrast programming models from MPI to MapReduce.

e-moreorlessanything
'e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.' – from its inventor John Taylor, Director General of Research Councils UK, Office of Science and Technology. e-Science is about developing tools and technologies that allow scientists to do 'faster, better or different' research. Similarly e-Business captures an emerging view of corporations as dynamic virtual organizations linking employees, customers and stakeholders across the world. This generalizes to e-moreorlessanything, including presumably e-Education and e-MardiGras. A deluge of data of unprecedented and inevitable size must be managed and understood. People (see Web 2.0), computers and data (including sensors and instruments) must be linked. On-demand assignment of experts, computers, networks and storage resources must be supported.

Applications, Infrastructure, Technologies
This field is confused by inconsistent use of terminology; I define: Web Services, Grids and (aspects of) Web 2.0 (Enterprise 2.0) are technologies. Grids could be everything (Broad Grids implementing some sort of managed web) or reserved for specific architectures like OGSA or Web Services (Narrow Grids). These technologies combine and compete to build electronic infrastructures termed e-infrastructure or Cyberinfrastructure. e-moreorlessanything is an emerging application area of broad importance that is hosted on those infrastructures. e-Science, or perhaps better e-Research, is a special case of e-moreorlessanything.

Relevance of Web 2.0
Web 2.0 can help e-Science in many ways. Its tools (web sites) can enhance scientific collaboration, i.e. effectively support virtual organizations, in different ways from Grids. The popularity of Web 2.0 can provide high quality technologies and software that (due to large commercial investment) can be very useful in e-Science and preferable to Grid or Web Service solutions. The usability and participatory nature of Web 2.0 can bring science and its informatics to a broader audience. Web 2.0 can even help the emerging challenge of using multicore chips, i.e.
in improving parallel computing programming and runtime environments.

"Best Web 2.0 Sites" – 2006, from http://web2.wsj2.com/; see http://www.seomoz.org/web2.0 for the May 2007 list. All are important capabilities for e-Science: Social Networking, Start Pages, Social Bookmarking, Peer Production News, Social Media Sharing, Online Storage (Computing).

MSI-CIEC Web 2.0 Research Matching Portal
A portal supporting tagging and linkage of Cyberinfrastructure resources: NSF (and other agencies via grants.gov) solicitations and awards; feeds such as SciVee and NSF; researchers on NSF awards; users and friends; TeraGrid allocations. Search for linked people, grants etc. Could also be used to support matching of students and faculty for REUs etc.
[Screenshots: MSI-CIEC Portal homepage and search results]

Web 2.0 Systems, like Grids, have Portals, Services and Resources. Web 2.0 captures the incredible development of interactive Web sites enabling people to create and collaborate.

Web 2.0 and Web Services
I once thought Web Services were inevitable, but this is no longer clear to me. Web Services are complicated, slow and non functional:
• WS-Security is unnecessarily slow and pedantic (canonicalization of XML)
• WS-RM (Reliable Messaging) seems to have poor adoption and doesn't work well in collaboration
• WSDM (distributed management) specifies a lot
There are de facto Web 2.0 standards like Google Maps and powerful suppliers like Google/Microsoft which "define the architectures/interfaces". One can easily combine SOAP (Web Service) based services/systems with HTTP messages, but the dominance of the "lowest common denominator" suggests the additional structure/complexity of SOAP will not easily survive.

Distribution of APIs and Mashups per Protocol
[Chart: number of APIs and number of Mashups per protocol (REST, SOAP, XML-RPC, REST+XML-RPC, REST+XML-RPC+SOAP, REST+SOAP, JS, Other) for services including Google Maps, del.icio.us, 411sync, Yahoo! Search, Yahoo! Geocoding, Virtual Earth, Technorati, Netvibes, Yahoo! Images, Trynt, Yahoo! Local, Amazon ECS, Google Search, Flickr, eBay, YouTube, Amazon S3, live.com] SOAP is quite a small fraction.

Too much Computing?
Historically both Grids and parallel computing have tried to increase computing capabilities by:
• Optimizing performance of codes at the cost of re-usability
• Exploiting all possible CPUs, such as graphics coprocessors and "idle cycles" (across administrative domains)
• Linking central computers together, such as NSF/DoE/DoD supercomputer networks, without clear user requirements
The next crisis in the technology area will be the opposite problem – commodity chips will be 32-128 way parallel in 5 years' time and we currently have no idea how to use them on commodity systems, especially on clients.
• Only 2 releases of standard software (e.g. Office) in this time span, so we need solutions that can be implemented in the next 3-5 years
Intel's RMS analysis: gaming and generalized decision support (data mining) are ways of using these cycles. [Figure: Intel's Projection]

Too much Data to the Rescue?
Multicore servers have clear "universal parallelism" as many users can access and use machines simultaneously. Maybe we also need application parallelism (e.g. data mining) as needed on client machines. Over the next years we will of course be submerged in the data deluge:
• Scientific observations for e-Science
• Local (video, environmental) sensors
• Data fetched from the Internet defining users' interests
Maybe data mining of this "too much data" will use up the "too much computing", both for science and commodity PCs.
• The PC will use this data(-mining) to be an intelligent user assistant?
• Must have highly parallel algorithms
What are Clouds?
Clouds are "virtual clusters" (maybe "virtual Grids") of possibly "virtual machines".
• They may cross administrative domains or may "just be a single cluster"; the user cannot and does not want to know.
Clouds support access to (lease of) computer instances.
• Instances accept data and job descriptions (code) and return results that are data and status flags.
Each Cloud is a "Narrow" (perhaps internally proprietary) Grid. When does the Cloud concept work?
• Parameter searches, LHC style data analysis ...
• Common case (the most likely success case for clouds) versus corner case?
Clouds can be built from Grids; Grids can be built from Clouds.

Information and Cyberinfrastructure
[Diagram: Raw Data → Data → Information → Knowledge → Wisdom → Decisions. Services (S) and filter services (fs) are composed into Discovery Clouds, Filter Clouds, Compute Clouds, Storage Clouds and Databases, linked to other Grids and other Services and to a Sensor or Data Interchange Service; a traditional Grid has exposed services.]

Clouds and Grids
Clouds are meant to help the user by simplifying the interface to computing. Clouds are meant to help the CIO and CFO by simplifying system architecture, enabling larger (factor of 100) more cost effective data centers. Clouds support green computing by supporting remote locations where operations, including power, are cheaper. Clouds are like Grids in many ways, but a Cloud is built as an "ab initio" system whereas Grids are built from existing heterogeneous systems (with the heterogeneity exposed). The low level interoperability architecture of services has failed – the WS-* specifications do not work. However one only needs these if linking heterogeneous systems; Clouds do not need low level interoperability but rather expose high level interfaces. Clouds are very, very loosely coupled; services are loosely coupled.

Technical Questions about Clouds I
What is the performance overhead?
• On an individual CPU
• On the system, including data and program transfer
What is the cost gain?
• From size efficiency; "green" location
Is Cloud security adequate: can clouds be trusted?
Can one do parallel computing on clouds?
• Looking at "capacity" not "capability", i.e. lots of modest sized jobs
• The Marine Corps will use Petaflop machines – they just need ssh and a.out

Technical Questions about Clouds II
How is data-compute affinity tackled in clouds?
• Co-locate data and compute clouds?
• Lots of optical fiber, i.e. "just" move the data?
What happens in clouds when the demand for resources exceeds capacity – is there a multi-day job input queue?
• Are there novel cloud scheduling issues?
Do we want to link clouds (or ensembles defined as atomic clouds); if so, how and with what protocols?
Is there an intranet cloud, e.g. "cloud in a box" software to manage a personal (cores on my future 128 core laptop), department or enterprise cloud?
MSI Challenge Problem
There are > 330 MSIs – Minority Serving Institutions. Two examples:
ECSU (Elizabeth City State University) is a small state university in North Carolina.
• HBCU with 4000 students
• Working on PolarGrid (sensors in the Arctic/Antarctic linked to the TeraGrid)
Navajo Tech in Crown Point NM is a community college with technology leadership for the Navajo Nation.
• "Internet to the Hogan" and the Dine Grid link Navajo communities by wireless
• Wish to integrate TeraGrid science into the Navajo Nation education curriculum
Current Grid technology is too complicated, especially if you are not an R1 institution; it is hard to deploy campus grids broadly into MSIs. Could Clouds provide virtual campus resources?

Where did Narrow Grids and Web Services go wrong?
Interoperability interfaces will be for data, not for infrastructure.
• Google, Amazon, TeraGrid and the European Grids will not interoperate at the resource or compute (processing) level but rather at the data streams flowing in and out of independent Grid clouds
• The data focus is consistent with the Semantic Grid/Web, but it is not clear the latter has learnt the usability message of Web 2.0
The lack of detailed standards in Web 2.0 is preferable to industry, who can gain proprietary advantage inside their clouds. One needs to share computing, data and people in e-moreorlessanything; Grids initially focused on computing, but data and people are more important. e-Science is healthy, as is e-moreorlessanything. Most Grids are solving the wrong problem at the wrong point in the stack, with a complexity that makes friendly usability difficult. Superior (from broad usage) technologies of Web 2.0:
• Mash-ups can replace Workflow
• Gadgets can replace Portlets
• UDDI is replaced by user generated registries

Mashups v Workflow?
Mashup tools are reviewed at http://blogs.zdnet.com/Hinchcliffe/?p=63; workflow tools are reviewed by Gannon and Fox at http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-overview.pdf. Both include scripting in PHP, Python, sh etc., as both implement distributed programming at the level of services. Mashups use all types of service interfaces and perhaps do not have the potential robustness (security) of the Grid service approach; mashups are typically "pure" HTTP (REST).

Major companies are entering the mashup area. Web 2.0 Mashups (by definition the largest market) are likely to drive composition tools for the Grid and the Web. Recently we see mashup tools like Yahoo Pipes and Microsoft Popfly which have familiar graphical interfaces. Currently there are only simple examples, but the tools could become powerful.
[Screenshot: Yahoo Pipes]

Web 2.0 Mashups and APIs
http://www.programmableweb.com/ has (March 13 2008) 2857 Mashups and 670 Web 2.0 APIs, with Google Maps the most often used in Mashups. This is the Web 2.0 UDDI (service registry). The list of Web 2.0 APIs: each site has an API and its features, divided into broad categories; only a few are used a lot (60 APIs are used in 10 or more mashups); there is an RSS feed of new APIs. Google Maps dominates, but Amazon EC2/S3 is growing in popularity. It is interesting that there is no such e-Science site; are we not building interoperable (reusable) services?
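As a concrete illustration of why "pure" HTTP (REST) interfaces dominate mashups, a REST call is a single HTTP GET that any script can issue. The sketch below uses only the standard .NET WebClient; the URL and its parameters are hypothetical placeholders rather than a specific registered API, whereas the equivalent SOAP invocation would need a WSDL-generated proxy plus the relevant WS-* machinery.

using System;
using System.Net;

class RestMashupSketch
{
    static void Main()
    {
        // Hypothetical REST geocoding endpoint; a real mashup would substitute
        // one of the APIs catalogued at http://www.programmableweb.com/
        string url = "http://api.example.org/geocode?q=Baton+Rouge,LA&format=xml";

        using (var client = new WebClient())
        {
            // One GET; the payload comes back as plain XML or JSON text.
            string response = client.DownloadString(url);
            Console.WriteLine(response);
        }
    }
}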
Grid-style portal as used in the Earthquake Grid
The portal is built from portlets, providing user interface fragments for each service that are composed into the full interface. QuakeSim has a typical Grid technology portal; it uses OGCE technology, as does the planetary science VLAB portal with the University of Minnesota. Such server side portlet-based approaches to portals are being challenged by client side gadgets from Web 2.0. Portlets are aggregated on the server using Java, analogous to JSP and JSF; gadgets are aggregated on the client using JavaScript, analogous to "classic" DHTML. Mashups can still be totally server side, like workflow. Note that Web 2.0 is more than a user interface. Note also the many competitions powering Web 2.0 mashup and gadget development.

Portlets v. Google Gadgets
Portals for Grid systems are built using portlets, with software like GridSphere integrating these on the server side into a single web page. Google (at least) offers the Google sidebar and the Google home page, which support Web 2.0 services and do not use a server side aggregator. Google is more user friendly! The many Web 2.0 competitions are an interesting model for promoting development in the world-wide distributed collection of Web 2.0 developers. I guess the Web 2.0 model will win!

Typical Google Gadget Structure
Google Gadgets are an example of Start Page (the Web 2.0 term for portals) technology; see http://blogs.zdnet.com/Hinchcliffe/?p=8. A gadget is an XML Module whose Content section wraps the HTML and JavaScript:
<Module>
  <ModulePrefs title="..." />
  <Content type="html">
    <![CDATA[
      … Lots of HTML and JavaScript
    ]]>
  </Content>
</Module>
Portlets build user interfaces by combining fragments in a standalone Java server; Google Gadgets build user interfaces by combining fragments with JavaScript on the client.

The Ten Areas covered by the 60 core WS-* Specifications
WS-* Specification Area – Typical Grid/Web Service Examples
1: Core Service Model – XML, WSDL, SOAP
2: Service Internet – WS-Addressing, WS-MessageDelivery; Reliable Messaging WSRM; Efficient Messaging MOTM
3: Notification – WS-Notification, WS-Eventing (Publish-Subscribe)
4: Workflow and Transactions – BPEL, WS-Choreography, WS-Coordination
5: Security – WS-Security, WS-Trust, WS-Federation, SAML, WS-SecureConversation
6: Service Discovery – UDDI, WS-Discovery
7: System Metadata and State – WSRF, WS-MetadataExchange, WS-Context
8: Management – WSDM, WS-Management, WS-Transfer
9: Policy and Agreements – WS-Policy, WS-Agreement
10: Portals and User Interfaces – WSRP (Remote Portlets)

WS-* Areas and Web 2.0
WS-* Specification Area – Web 2.0 Approach
1: Core Service Model – XML becomes optional but still useful; SOAP becomes JSON, RSS, ATOM; WSDL becomes REST with the API as GET, PUT etc.; Axis becomes XmlHttpRequest
2: Service Internet – No special QoS; use JMS or equivalent?
3: Notification – Hard with HTTP without polling; JMS perhaps?
4: Workflow and Transactions (no Transactions in Web 2.0) – Mashups, Google MapReduce, scripting with PHP, JavaScript ...
5: Security – SSL, HTTP authentication/authorization; OpenID is the Web 2.0 single sign-on
6: Service Discovery – http://www.programmableweb.com
7: System Metadata and State – Processed by the application; no system state; microformats are a universal metadata approach
8: Management == Interaction – WS-Transfer style protocols, GET, PUT etc.
9: Policy and Agreements – Service dependent;
processed by the application
10: Portals and User Interfaces – Start Pages, AJAX and Widgets (Netvibes), Gadgets

Web 2.0 can also help address long standing difficulties with parallel programming environments: use workflow or mashups to compose services instead of building libraries.

Service Aggregated Linked Sequential Activities (SALSA)
Team: Geoffrey Fox, Xiaohong Qiu, Seung-Hee Bae, Huapeng Yuan (Indiana University). Technology collaboration: George Chrysanthakopoulos, Henrik Frystyk Nielsen (Microsoft). Application collaboration: Cheminformatics – Rajarshi Guha, David Wild; Bioinformatics – Haixu Tang; Demographics (GIS) – Neil Devadasan (IU Bloomington and IUPUI).
GOALS: The increasing number of cores is accompanied by a continued data deluge. Develop scalable parallel data mining algorithms with good multicore and cluster performance; understand the software runtime and parallelization method. Use managed code (C#) and package algorithms as services to encourage broad use, assuming experts parallelize the core algorithms.
CURRENT RESULTS: Microsoft CCR supports MPI, dynamic threading and, via DSS, a service model of computing; detailed performance measurements. Speedups of 7.5 or above on 8-core systems for "large problems" with deterministically annealed (avoiding local minima) algorithms for clustering, Gaussian mixtures, GTM (dimension reduction) etc.

SALSA General Problem Classes
N data points E(x) in a D dimensional space, OR points with a dissimilarity δ(i,j) defined between them.
Unsupervised Modeling
• Find clusters without prejudice
• Model the distribution as clusters formed from Gaussian distributions with general shape
• Both can use multi-resolution annealing
Dimensional Reduction/Embedding
• Given vectors, map into a lower dimension space "preserving topology" for visualization: SOM and GTM
• Given δ(i,j), associate data points with vectors in a Euclidean space with Euclidean distance approximately δ(i,j): MDS (can anneal) and Random Projection
All are data parallel over the N data points E(x).

Deterministic Annealing Clustering (DAC)
N data points E(x) in a D dimensional space; minimize F by EM, where

F = -T \sum_{x=1}^{N} a(x) \ln \Big\{ \sum_{k=1}^{K} g(k) \exp\big[-0.5\,(E(x)-Y(k))^2 / (T\,s(k))\big] \Big\}

• a(x) = 1/N or generally p(x) with Σ_x p(x) = 1
• g(k) = 1 and s(k) = 0.5
• T is the annealing temperature, varied down from ∞ with a final value of 1
• Vary the cluster center Y(k)
• K starts at 1 and is incremented by the algorithm
• My 4th most cited article, but little used; probably because there is no good software compared to simple K-means

Deterministic Annealing Clustering of Indiana Census Data
[Figure: decrease the temperature (the distance scale is T^0.5) to discover more clusters]

Deterministic Annealing
F({Y}, T): solve linear equations for each temperature; the nonlinearity is removed by approximating with the solution at the previous higher temperature. The configuration {Y} minimum evolves as the temperature decreases; movement at fixed temperature goes to local minima if not initialized "correctly".
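For reference, minimizing F at fixed temperature T gives the standard deterministic annealing EM iteration; a sketch in the slide's notation, for general a(x), g(k) and s(k):

\[
P(k \mid x) = \frac{g(k)\,\exp\!\big[-0.5\,(E(x)-Y(k))^2/(T\,s(k))\big]}
                   {\sum_{k'=1}^{K} g(k')\,\exp\!\big[-0.5\,(E(x)-Y(k'))^2/(T\,s(k'))\big]},
\qquad
Y(k) \leftarrow \frac{\sum_{x=1}^{N} a(x)\,P(k \mid x)\,E(x)}{\sum_{x=1}^{N} a(x)\,P(k \mid x)} .
\]

As T is lowered the association probabilities P(k|x) sharpen and clusters split at critical temperatures, which is why K can start at 1 and be incremented by the algorithm.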
N data points E(x) in a D dimensional space; minimize F by EM, with F as above. This single form covers a family of algorithms, distinguished by the choices of a(x), g(k), s(k) and T:

Deterministic Annealing Clustering (DAC)
• a(x) = 1/N or generally p(x) with Σ_x p(x) = 1
• g(k) = 1 and s(k) = 0.5
• T is the annealing temperature, varied down from ∞ with a final value of 1
• Vary the cluster center Y(k), but one can calculate the weight P_k and correlation matrix s(k) = σ(k)² (even for a space-k dependent matrix σ(k)) using IDENTICAL formulae to the Gaussian mixtures
• K starts at 1 and is incremented by the algorithm

Deterministic Annealing Gaussian Mixture models (DAGM)
• a(x) = 1
• g(k) = {P_k / (2π σ(k)²)^{D/2}}^{1/T}
• s(k) = σ(k)² (taking the case of a spherical Gaussian)
• T is the annealing temperature, varied down from ∞ with a final value of 1
• Vary Y(k), P_k and σ(k)
• K starts at 1 and is incremented by the algorithm

Traditional Gaussian mixture models (GM)
• As DAGM, but set T = 1 and fix K

Generative Topographic Mapping (GTM)
• a(x) = 1 and g(k) = (1/K)(β/2π)^{D/2}
• s(k) = 1/β and T = 1
• Y(k) = Σ_{m=1}^{M} W_m φ_m(X(k)), with fixed φ_m(X) = exp(-0.5 (X - μ_m)² / σ²)
• Vary W_m but fix the values of M and K a priori
• Y(k), E(x) and W_m are vectors in the original high D dimensional space; X(k) and μ_m are vectors in the 2 dimensional mapped space

DAGTM: Deterministic Annealed Generative Topographic Mapping
• GTM has several natural annealing versions based on either DAC or DAGM: under investigation

We implement micro-parallelism using Microsoft CCR (Concurrency and Coordination Runtime), as it supports both the MPI rendezvous and the dynamic (spawned) threading styles of parallelism (http://msdn.microsoft.com/robotics/). CCR supports the exchange of messages between threads using named ports and has primitives like:
• FromHandler: spawn threads without reading ports
• Receive: each handler reads one item from a single port
• MultipleItemReceive: each handler reads a prescribed number of items of a given type from a given port. Note that items in a port can be general structures, but all must have the same type.
• MultiplePortReceive: each handler reads one item of a given type from multiple ports.
CCR has fewer primitives than MPI but can implement MPI collectives efficiently (see the sketch below). We use DSS (Decentralized System Services), built in terms of CCR, for the service model. DSS has ~35 µs overhead and CCR a few µs.
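A minimal sketch of how these primitives might be combined, assuming the Microsoft.Ccr.Core types named above (Port, Dispatcher, DispatcherQueue, Arbiter); the exact overloads vary between CCR releases, and PartialSum is a hypothetical stand-in for per-core work. Eight spawned handlers post partial results to a port and one MultipleItemReceive handler plays the role of an MPI reduction:

using System;
using System.Threading;
using Microsoft.Ccr.Core;   // CCR ships with Microsoft Robotics Studio / the CCR & DSS Toolkit

class CcrReductionSketch
{
    static double PartialSum(int id)
    {
        // Hypothetical per-core computation (e.g. a partial cluster sum).
        double sum = 0;
        for (int n = 1; n <= 1000000; n++) sum += 1.0 / (n + id);
        return sum;
    }

    static void Main()
    {
        using (var dispatcher = new Dispatcher(8, "salsa"))           // pool of 8 threads, one per core
        using (var queue = new DispatcherQueue("work", dispatcher))
        {
            var results = new Port<double>();
            var finished = new ManualResetEvent(false);

            // Dynamic (spawned) threading style: FromHandler spawns work without reading a port.
            for (int i = 0; i < 8; i++)
            {
                int id = i;
                Arbiter.Activate(queue, Arbiter.FromHandler(() => results.Post(PartialSum(id))));
            }

            // Rendezvous style: fire once when all 8 items have arrived on the port (a reduction).
            Arbiter.Activate(queue, Arbiter.MultipleItemReceive(false, results, 8,
                (double[] partials) =>
                {
                    double total = 0;
                    foreach (double p in partials) total += p;
                    Console.WriteLine("Reduced sum = " + total);
                    finished.Set();
                }));

            finished.WaitOne();
        }
    }
}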
Multicore Matrix Multiplication (the dominant linear algebra in GTM)
Speedup = Number of cores / (1 + f), where f = (Sum of Overheads) / (Computation per core); the computation per core scales with the grain size n times the number of clusters K. The overheads are:
• Synchronization: small with CCR
• Load balance: good
• Memory bandwidth limit: tends to 0 as K increases
• Cache use/interference: important
• Runtime fluctuations: dominant for large n and K
All our "real" problems have f ≤ 0.05 and speedups on 8 core systems greater than 7.6.
[Figure: execution time in seconds versus block size for 4096x4096 matrix multiplication on 1 core and 8 cores; the parallel overhead is about 1%]
[Figure: Parallel GTM performance – fractional overhead f versus 1/(grain size n) for 4096 interpolating clusters, with n = 500, 100, 50 marked]

MPI Exchange Latency in µs (with 20-30 µs computation between messaging)
Machine / OS / Runtime / Grains / Parallelism / MPI Exchange Latency (µs)
Intel8c:gf12 (8 core, 2.33 GHz, in 2 chips) / Redhat / MPJE (Java) / Process / 8 / 181
Intel8c:gf12 / Redhat / MPICH2 (C) / Process / 8 / 40.0
Intel8c:gf12 / Redhat / MPICH2: Fast / Process / 8 / 39.3
Intel8c:gf12 / Redhat / Nemesis / Process / 8 / 4.21
Intel8c:gf20 (8 core, 2.33 GHz) / Fedora / MPJE / Process / 8 / 157
Intel8c:gf20 / Fedora / mpiJava / Process / 8 / 111
Intel8c:gf20 / Fedora / MPICH2 / Process / 8 / 64.2
Intel8b (8 core, 2.66 GHz) / Vista / MPJE / Process / 8 / 170
Intel8b / Fedora / MPJE / Process / 8 / 142
Intel8b / Fedora / mpiJava / Process / 8 / 100
Intel8b / Vista / CCR (C#) / Thread / 8 / 20.2
AMD4 (4 core, 2.19 GHz) / XP / MPJE / Process / 4 / 185
AMD4 / Redhat / MPJE / Process / 4 / 152
AMD4 / Redhat / mpiJava / Process / 4 / 99.4
AMD4 / Redhat / MPICH2 / Process / 4 / 39.3
AMD4 / XP / CCR / Thread / 4 / 16.3
Intel4 (4 core) / XP / CCR / Thread / 4 / 25.8

Messaging: CCR versus MPI, C# v. C v. Java
CCR overheads on Intel8b (8 cores), in µs, versus the number of parallel computations (1, 2, 3, 4, 7, 8):
Dynamic spawned threads
• Pipeline: 1.58, 2.44, 3, 2.94, 4.5, 5.06
• Shift: –, 2.42, 3.2, 3.38, 5.26, 5.14
• Two Shifts: –, 4.94, 5.9, 6.84, 14.32, 19.44
Rendezvous MPI style
• Pipeline: 2.48, 3.96, 4.52, 5.78, 6.82, 7.18
• Shift: –, 4.46, 6.42, 5.86, 10.86, 11.74
• Exchange as Two Shifts: –, 7.4, 11.64, 14.16, 31.86, 35.62
• Exchange (CCR custom): –, 6.94, 11.22, 13.3, 18.78, 20.16

[Figure: overhead (latency) in microseconds versus stages (millions) for the AMD4 PC with 4 execution threads on MPI style rendezvous messaging for Shift and Exchange, implemented either as two shifts or as a custom CCR pattern]
[Figure: the same for the Intel8b PC with 8 execution threads]
[Figure a: scaled runtime (runtime divided by grain size n times the number of clusters K) on Intel 8b, Vista, C# CCR with 1 cluster, versus the number of threads (one per core), for 10,000, 50,000 and 500,000 data points per thread; 8 cores (threads) and 1 cluster show the memory bandwidth effect]
[Figure b: the same for 80 clusters; 80 clusters show the cache/memory bandwidth effect]
[Figure: standard deviation of runtime, Intel 8a, XP, C# CCR, 80 clusters, versus the number of threads, for 10,000, 50,000 and 500,000 data points per thread]
[Figure: standard deviation of runtime, Intel 8c, Redhat, C with locks, 80 clusters, versus the number of threads]
These plot the average of the standard deviation of the run time of the 8 threads between synchronization (messaging) points.

Cache Line Interference
Early implementations of our clustering algorithm showed large fluctuations due to the cache line interference effect (false sharing). We have one thread on each core, each calculating a sum of the same complexity and storing the result in a common array A, with different cores using different array locations:
• Thread i stores its sum in A(i): separation 1 – no memory access interference, but cache line interference
• Thread i stores its sum in A(X*i): separation X
There is serious degradation if X < 8 (64 bytes) with Windows; note A is a double (8 bytes). There is less interference effect with Linux, especially Red Hat. (A sketch of the experiment is given below.)

Time in µs versus thread array separation (unit is 8 bytes); for each separation the mean and (std dev)/mean are given:
Machine / OS / Run Time / Separation 1 / Separation 4 / Separation 8 / Separation 1024
Intel8b / Vista / C# CCR / 8.03, .029 / 3.04, .059 / 0.884, .0051 / 0.884, .0069
Intel8b / Vista / C# Locks / 13.0, .0095 / 3.08, .0028 / 0.883, .0043 / 0.883, .0036
Intel8b / Vista / C / 13.4, .0047 / 1.69, .0026 / 0.66, .029 / 0.659, .0057
Intel8b / Fedora / C / 1.50, .01 / 0.69, .21 / 0.307, .0045 / 0.307, .016
Intel8a / XP / C# CCR / 10.6, .033 / 4.16, .041 / 1.27, .051 / 1.43, .049
Intel8a / XP / C# Locks / 16.6, .016 / 4.31, .0067 / 1.27, .066 / 1.27, .054
Intel8a / XP / C / 16.9, .0016 / 2.27, .0042 / 0.946, .056 / 0.946, .058
Intel8c / Red Hat / C / 0.441, .0035 / 0.423, .0031 / 0.423, .0030 / 0.423, .032
AMD4 / WinSrvr / C# CCR / 8.58, .0080 / 2.62, .081 / 0.839, .0031 / 0.838, .0031
AMD4 / WinSrvr / C# Locks / 8.72, .0036 / 2.42, 0.01 / 0.836, .0016 / 0.836, .0013
AMD4 / WinSrvr / C / 5.65, .020 / 2.69, .0060 / 1.05, .0013 / 1.05, .0014
AMD4 / XP / C# CCR / 8.05, 0.010 / 2.84, 0.077 / 0.84, 0.040 / 0.840, 0.022
AMD4 / XP / C# Locks / 8.21, 0.006 / 2.57, 0.016 / 0.84, 0.007 / 0.84, 0.007
AMD4 / XP / C / 6.10, 0.026 / 2.95, 0.017 / 1.05, 0.019 / 1.05, 0.017

Note that the measurements at a separation X of 8 and at X = 1024 (and the values between 8 and 1024, not shown) are essentially identical. Measurements at 7 (not shown) are higher than those at 8 (except for Red Hat, which shows essentially no enhancement at X < 8). As the effects are due to the co-location of thread variables in a 64 byte cache line, align the array with cache boundaries.
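A minimal sketch of the experiment just described (not the SALSA code itself): each of eight threads repeatedly accumulates into a shared double array, and the separation parameter controls whether the per-thread slots share a 64 byte cache line. Timing separation 1 against separation 8 reproduces the false sharing penalty; the thread count and iteration count here are illustrative choices.

using System;
using System.Diagnostics;
using System.Threading;

class FalseSharingSketch
{
    const int Threads = 8;           // one thread per core, as in the experiment above
    const int Iterations = 10000000;

    static double Run(int separation)
    {
        // Shared array A; thread i writes only to A[separation * i].
        double[] A = new double[Threads * separation];
        Thread[] workers = new Thread[Threads];
        var timer = Stopwatch.StartNew();

        for (int i = 0; i < Threads; i++)
        {
            int slot = separation * i;
            workers[i] = new Thread(() =>
            {
                double sum = 0;
                for (int n = 1; n <= Iterations; n++)
                    A[slot] = sum += 1.0 / n;   // repeated writes to the shared array
            });
            workers[i].Start();
        }
        foreach (var w in workers) w.Join();

        timer.Stop();
        return timer.Elapsed.TotalMilliseconds;
    }

    static void Main()
    {
        // Separation 1: eight doubles share one 64 byte cache line, so writes collide (false sharing).
        // Separation 8 or more: each thread's slot sits in its own cache line.
        Console.WriteLine("separation 1:    {0:F1} ms", Run(1));
        Console.WriteLine("separation 8:    {0:F1} ms", Run(8));
        Console.WriteLine("separation 1024: {0:F1} ms", Run(1024));
    }
}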
Parallel Generative Topographic Mapping (GTM)
GTM reduces dimensionality, preserving topology and perhaps distances; here we project to 2D. A GTM projection of PubChem – 10,926,940 compounds in a 166 dimensional binary property space – takes 4 days on 8 cores, with a 64x64 mesh of GTM clusters interpolating PubChem. We could usefully use 1024 cores! David Wild will use this for a GIS style 2D browsing interface to chemistry.
[Figure: linear PCA (Principal Component Analysis) versus nonlinear GTM on 6 Gaussians in 3D]
[Figure: GTM projection of 2 clusters of 335 compounds in 155 dimensions]

[Diagram: a "main thread" with memory M exchanges MPI/CCR/DSS messages with other nodes; subsidiary threads 0-7 each have their own local memory m0-m7]
Use data decomposition as in classic distributed memory, but use shared memory for read variables. Each thread uses a "local" array for written variables to get good cache performance. Multicore and cluster use the same parallel algorithms but different runtime implementations; the algorithms accumulate matrix and vector elements in each process/thread and, at an iteration barrier, combine the contributions (MPI_Reduce); the rest is linear algebra (multiplication, equation solving, SVD). A sketch of this thread-local accumulation pattern is given below.

Micro-parallelism uses low latency CCR threads or MPI processes; services can be used where loose coupling is natural. The components are:
• Input data
• Algorithms: PCA, DAC, GTM, GM, DAGM, DAGTM – both for the complete algorithm and for each iteration; linear algebra used inside or outside the above; metric embedding (MDS, Bourgain, Quadratic Programming ...); HMM, SVM ...
• User interface: GIS (Web Map Service) or equivalent

DSS Service Measurements
[Figure: average run time in microseconds versus the number of round trips (1 to 10,000) for DSS services]
Measurements of Axis 2 show about 500 microseconds; DSS is 10 times better.

This class of data mining does, and will, parallelize well on current and future multicore nodes. There are several engineering issues for use in large applications:
• How to take CCR in a multicore node to a cluster (MPI or cross-cluster CCR?)
• Need high performance linear algebra for C# (PLASMA from UTenn)
• Access linear algebra services in a different language?
• Need the equivalent of the Intel C Math Libraries for C# (vector arithmetic – level 1 BLAS)
• Service model to integrate modules
• Need access to a ~128 node Windows cluster
Future work is more applications and refining current algorithms such as DAGTM. New parallel algorithms:
• Clustering with pairwise distances but no vector spaces
• Bourgain Random Projection for metric embedding
• MDS Dimensional Scaling with EM-like SMACOF and deterministic annealing
• Support the use of Newton's Method (Marquardt's method) as an EM alternative
• Later: HMM and SVM
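A minimal sketch of the data decomposition pattern described above (thread-local accumulation followed by a combine step at the iteration barrier), using plain .NET threads rather than the CCR or MPI runtimes benchmarked in the slides; the array sizes and the simple modulo "cluster assignment" are illustrative assumptions:

using System;
using System.Threading;

class LocalAccumulateSketch
{
    // Shared, read-only data: N points in D dimensions, decomposed by index range across cores.
    const int N = 80000, D = 3, Cores = 8, K = 10;
    static readonly double[][] points = new double[N][];

    static void Main()
    {
        var rng = new Random(1);
        for (int i = 0; i < N; i++)
        {
            points[i] = new double[D];
            for (int d = 0; d < D; d++) points[i][d] = rng.NextDouble();
        }

        // One "local" accumulation array per thread: no write sharing, good cache behaviour.
        double[][,] localSums = new double[Cores][,];
        Thread[] workers = new Thread[Cores];

        for (int t = 0; t < Cores; t++)
        {
            int id = t;
            localSums[id] = new double[K, D];
            workers[id] = new Thread(() =>
            {
                // Each thread owns a contiguous block of the data (data decomposition).
                for (int i = id * (N / Cores); i < (id + 1) * (N / Cores); i++)
                {
                    int k = i % K;                      // stand-in for a real cluster assignment
                    for (int d = 0; d < D; d++)
                        localSums[id][k, d] += points[i][d];
                }
            });
            workers[id].Start();
        }
        foreach (var w in workers) w.Join();            // the "iteration barrier"

        // Combine contributions, playing the role of MPI_Reduce on a cluster.
        double[,] global = new double[K, D];
        for (int t = 0; t < Cores; t++)
            for (int k = 0; k < K; k++)
                for (int d = 0; d < D; d++)
                    global[k, d] += localSums[t][k, d];

        Console.WriteLine("Combined sum for cluster 0, dimension 0: " + global[0, 0]);
    }
}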