Transcript
Introduction to Cloud Computing
http://net.pku.edu.cn/~course/cs402/2009/
彭波 (Peng Bo), [email protected]
School of Electronics Engineering and Computer Science, Peking University
6/30/2009

Outline
- What is cloud computing?
- What is large-scale data processing?
- What are the goals and contents of this course?

Cloud Computing

What is Cloud Computing?
1. First write down your own opinion about "cloud computing", whatever comes to mind.
2. Questions: What? Who? Why? How? Pros and cons?
3. The most important question: what is its relation to me?

Cloud Computing is…
- No locally installed software: access everything over the Internet (it really is extremely convenient)
- Power: large-scale data processing
- Appeal for startups: cost efficiency
- Software as platform: SaaS, PaaS
- Utility computing
- Cons: security, data lock-in

Software as a Service (SaaS)
A model of software deployment whereby a provider licenses an application to customers for use as a service on demand.

Platform as a Service (PaaS)
For developing web applications and services, PaaS provides a complete Internet-based integrated environment covering development, testing, deployment, operation, and maintenance. In particular, it has a multi-tenant architecture from the start: users do not need to handle multi-user concurrency themselves; the platform solves it, including concurrency management, scalability, failure recovery, and security.

Utility Computing
"Pay-as-you-go": it is like plugging into a wall socket, where the voltage you get is the same as what Microsoft gets, you simply use less and pay less. The goal of utility computing is to give computing resources the same kind of service model: a user can tap the computing resources of a Fortune 500 company, just use less and pay less. This is an important aspect of cloud computing.

Cloud Computing is…
[Figure-only slide]

Key Characteristics
- The illusion of infinite computing resources available on demand.
- The elimination of an up-front commitment by cloud users (no startup capital outlay).
- The ability to pay for computing resources on a short-term basis as needed (billing in small time slices; the report notes that earlier utility computing failed in practice on exactly this point). A worked example of this billing arithmetic closes this section.
Enablers: very large datacenters, large-scale software infrastructure, and operational expertise.

Why now?
The practice of running very large-scale datacenters, together with new technology trends and business models: pay-as-you-go computing.

Key Players
- Amazon Web Services
- Google App Engine
- Microsoft Windows Azure

Key Applications
- Mobile interactive applications. Tim O'Reilly believes the future belongs to services that deliver information to users in real time, and mobile is bound to be key. Running the back end in a datacenter is the natural model, especially for mashup-style services.
- Parallel batch processing. Large-scale data processing is a natural fit for cloud computing; MapReduce and Hadoop play an important role here. Moving data into and out of the cloud is a major cost, and Amazon has begun hosting large public datasets for free.
- The rise of analytics. Transaction-based database applications are still growing, but analytics applications are growing rapidly, driven by data mining, user behavior analysis, and similar workloads.
- Extension of compute-intensive desktop applications. Compute-intensive tools such as MATLAB and Mathematica now have cloud computing extensions.

Cloud Computing = Silver Bullet?
On March 7 [2009], Google Docs leaked files belonging to a large number of users. US privacy advocates petitioned the government to take measures against Google and require it to strengthen the security of its cloud computing products.

Problem of Data Lock-in; Challenges
[Figure/table-only slide]

Some other Voices
"The interesting thing about Cloud Computing is that we've redefined Cloud Computing to include everything that we already do. . . . I don't understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads."
Larry Ellison, quoted in the Wall Street Journal, September 26, 2008
"It's stupidity. It's worse than stupidity: it's a marketing hype campaign. Somebody is saying this is inevitable — and whenever you hear somebody saying that, it's very likely to be a set of businesses campaigning to make it true."
Richard Stallman, quoted in The Guardian, September 29, 2008

What does it matter to ME?!
What would you want to do with 1,000 PCs, or even 100,000 PCs?

Cloud is coming…
Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law.
"The datacenter is a computer": parallelism everywhere; massive, scalable, reliable; resource management; data management; programming models and tools.
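To make the pay-as-you-go arithmetic above concrete, here is a small worked example (the rate is an assumption, roughly Amazon EC2's 2009 price of about $0.10 per small-instance hour): renting 1,000 machines for 1 hour costs 1000 × 1 h × $0.10/h = $100, exactly what renting 1 machine for 1,000 hours costs. This "cost associativity" is why eliminating the up-front commitment matters: a startup can get its answer 1,000 times faster at no extra cost, something owned infrastructure cannot offer.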
Large-Scale Data Processing

Happening everywhere!
[Figure: data is being generated everywhere: molecular biology (microarray chips, cancer studies), network traffic (fiber optics, spam, ~300M/day), simulations (Millennium), particle events (particle colliders, the LHC, ~1B events at ~1M/sec)]

[Photographs of the LHC; credit: Maximilien Brice, © CERN]

How much data?
- The Internet Archive holds 2 PB of data and grows by 20 TB/month.
- Google processes 20 PB a day (2008).
- "All words ever spoken by human beings": ~5 EB.
- CERN's LHC will generate 10-15 PB a year.
- The Sanger Institute anticipates 6 PB of data in 2009.
- "640K ought to be enough for anybody."

NERSC user George Smoot wins the 2006 Nobel Prize in Physics
Smoot and Mather's 1992 COBE experiment showed the anisotropy of the cosmic microwave background (CMB): an image of the universe at 400,000 years old.

The Current CMB Map (source: J. Borrill, LBNL)
- Unique imprint of primordial physics through the tiny anisotropies in temperature and polarization.
- Extracting these Kelvin fluctuations from inherently noisy data is a serious computational challenge.

Evolution of CMB Data Sets: Cost > O(Np^3)
Experiment       | Nt      | Np     | Nb     | Limiting Data | Notes
COBE (1989)      | 2x10^9  | 6x10^3 | 3x10^1 | Time          | Satellite, workstation
BOOMERanG (1998) | 3x10^8  | 5x10^5 | 3x10^1 | Pixel         | Balloon, 1st HPC/NERSC (4yr)
WMAP (2001)      | 7x10^10 | 4x10^7 | 1x10^3 | ?             | Satellite, analysis-bound
Planck (2007)    | 5x10^11 | 6x10^8 | 6x10^3 | Time/Pixel    | Satellite, major HPC/DA effort
POLARBEAR (2007) | 8x10^12 | 6x10^6 | 1x10^3 | Time          | Ground, NG-multiplexing
CMBPol (~2020)   | 10^14   | 10^9   | 10^4   | Time/Pixel    | Satellite, early planning/design
(data compression)

Example: Wikipedia Anthropology
Kittur, Suh, Pendleton (UCLA, PARC), "He Says, She Says: Conflict and Coordination in Wikipedia", CHI, 2007. An increasing fraction of edits are for work only indirectly related to articles.
Experiment: download the entire revision history of Wikipedia (4.7M pages, 58M revisions, 800 GB) and analyze editing patterns and trends.
Computation: Hadoop on a 20-machine cluster.

Example: Scene Completion
Hays, Efros (CMU), "Scene Completion Using Millions of Photographs", SIGGRAPH, 2007.
Image database grouped by semantic content: 30 different Flickr.com groups, 2.3M images total (396 GB).
Select the candidate images most suitable for filling the hole: classify images with the gist scene detector [Torralba]; color similarity; local context matching.
Computation: index images offline; 50 min. scene matching, 20 min. local matching, 4 min. compositing; reduced to 5 minutes total by using 5 machines.
Extension: Flickr.com has over 500 million images …

Example: Web Page Analysis
Fetterly, Manasse, Najork, Wiener (Microsoft, HP), "A Large-Scale Study of the Evolution of Web Pages", Software-Practice & Experience, 2004.
Experiment: use a web crawler to gather 151M HTML pages weekly, 11 times, generating 1.2 TB of log information; analyze page statistics and change frequencies.
Systems challenge: "Moreover, we experienced a catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl."

[Figure: a sequencer produces short reads (e.g. AATGCTTACTATGCGGGCCCCTT) from a subject genome; the reads must be mapped back to a reference]

DNA Sequencing
The genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G. Bacteria: ~5 million bp; humans: ~3 billion bp.
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300 bp): shorter reads, but much higher throughput. The per-base error rate is estimated at 1-2% (Simpson et al., 2009).
Recent studies of entire human genomes used 3.3 billion (Wang et al., 2008) and 4.0 billion (Bentley et al., 2008) 36 bp reads, ~144 GB of compressed sequence data.

[Figure: subject reads such as CTATGCGGGC, TCTAGATGCT, and GCTTATCTAT aligned at their matching offsets against the reference sequence CGGTCTAGATGCTTATCTATGCGGGCCCCTT]

Example: Bioinformatics
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
Evaluated running time on a local 24-core cluster; running time increases linearly with the number of reads. (A toy sketch of the underlying problem follows.)
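The core computation here is read mapping: finding where each short read occurs in the reference. As a toy illustration only (brute-force exact matching, not CloudBurst's MapReduce-based algorithm, and ignoring the 1-2% sequencing errors discussed above), here is a minimal Java sketch using the reads and reference from the alignment figure:

```java
import java.util.ArrayList;
import java.util.List;

// Toy version of the read-mapping problem: report every offset at which a
// short read occurs exactly in the reference. Real mappers such as CloudBurst
// also tolerate mismatches/indels and index the reference instead of scanning.
public class NaiveReadMapper {

    // All 0-based offsets of `read` within `reference` (exact matches only).
    static List<Integer> align(String reference, String read) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int i = reference.indexOf(read); i >= 0;
                 i = reference.indexOf(read, i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) {
        String reference = "CGGTCTAGATGCTTATCTATGCGGGCCCCTT"; // from the figure
        String[] reads = { "CTATGCGGGC", "TCTAGATGCT", "GCTTATCTAT" };
        for (String read : reads) {
            System.out.println(read + " -> " + align(reference, read));
        }
    }
}
```

Even this toy version makes the scaling problem visible: with billions of reads against a 3-billion-bp reference, a brute-force scan is hopeless, which is exactly why indexing and MapReduce-style parallelism are needed.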
Example: Data Mining
Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang: PFP: Parallel FP-Growth for Query Recommendation. RecSys 2008: 107-114.
A crawl of del.icio.us yields a bipartite graph covering 802,739 web pages and 1,021,107 tags.

Large-Scale Data Processing + Cloud Computing: An Example
The data processing task: word frequency counting, i.e. counting how many times each word occurs in a document collection. Try it on these collections:
- In early 2006 we crawled 870 million distinct web pages in China, about 2 TB in total.
- Commercial search engines such as Google and Yahoo collect on the order of 100+ billion pages.
How do we process data at this scale?

Divide and Conquer
[Diagram: the "Work" is partitioned into w1, w2, w3; a "worker" processes each partition, producing r1, r2, r3; the partial results are combined into the final "Result"]

What is MapReduce?
A programming model for parallel/distributed computing.
[Diagram: input → split → map → shuffle → reduce → output]

The typical problem solved by MapReduce:
- Read in records of key/value data.
- Map: extract something of interest from each record.
- Shuffle: exchange and redistribute the intermediate results so that all values for the same key arrive at the same node.
- Reduce: aggregate, summarize, filter, etc.
- Write out the results.
The two user-supplied functions are:
map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes one input key/value pair and emits intermediate key/value pairs.
reduce(out_key, list(intermediate_value)) -> list(out_value)
  Merges all values for one key, computes over them, and emits the combined result (usually just one output value).

Word Frequencies in Web Pages
Input: one document per record. The user implements a map function with
  key = document URL
  value = document contents
map emits (potentially many) key/value pairs: for each word occurring in the document, emit the record <word, "1">.

Example continued:
The MapReduce runtime (library) collects all records with the same key together (shuffle/sort). The user implements a reduce function that computes over all the values for one key: it sums them. reduce emits <key, sum>. (A complete Hadoop version in Java appears after the runtime figure below.)

MapReduce Runtime System
[Figure-only slide: architecture of the MapReduce runtime]
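Putting the map and reduce pieces of the word-count example together, here is what the job looks like in Hadoop. This is essentially the standard WordCount example from the Hadoop documentation (Hadoop 0.20-era org.apache.hadoop.mapreduce API); treat it as a sketch to orient the labs, not as the exact code used in this course.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: for every word in the input record, emit <word, 1>
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: sum all the counts for one word and emit <word, sum>
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note the combiner: it runs the reducer on each map node's local output before the shuffle, which sharply cuts network traffic. This is safe for word count because addition is associative and commutative.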
History of Hadoop
- 2004: initial versions of what are now the Hadoop Distributed File System (HDFS) and MapReduce, implemented by Doug Cutting and Mike Cafarella.
- December 2005: Nutch ported to the new framework; Hadoop runs reliably on 20 nodes.
- January 2006: Doug Cutting joins Yahoo!.
- February 2006: the Apache Hadoop project is officially started to support the standalone development of MapReduce and HDFS.
- March 2006: formation of the Yahoo! Hadoop team.
- April 2006: sort benchmark run on 188 nodes in 47.9 hours.
- May 2006: Yahoo! sets up a 300-node Hadoop research cluster; sort benchmark run on 500 nodes in 42 hours (on better hardware than the April benchmark).
- October 2006: research cluster reaches 600 nodes.
- December 2006: sort times of 1.8 hours on 20 nodes, 3.3 hours on 100 nodes, 5.2 hours on 500 nodes, and 7.8 hours on 900 nodes.
- January 2007: research cluster reaches 900 nodes.
- April 2007: two research clusters of 1,000 nodes each.
- September 2008: Hadoop scaled to 4,000 nodes at Yahoo!.

From Theory to Practice
1. scp the data to the cluster
2. Move the data into HDFS
3. Develop code locally
4. Submit the MapReduce job (4a: go back to step 3 as needed)
5. Move the results out of HDFS
6. scp the results from the cluster
You, on your workstation, interact with the Hadoop cluster throughout; see the command sketch below.
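A minimal sketch of those six steps as shell commands, assuming the WordCount jar from the sketch above; the hostname and paths are hypothetical, and the output file name part-r-00000 assumes the Hadoop 0.20 new-API naming.

```sh
# 1. Copy the raw data onto the cluster's gateway machine (hostname hypothetical)
scp corpus.txt you@gateway:corpus.txt

# 2. Move the data into HDFS (run on the gateway)
hadoop fs -put corpus.txt input/corpus.txt

# 3-4. Develop and build WordCount locally, then submit the job
hadoop jar wordcount.jar WordCount input output

# 5. Move the results out of HDFS
hadoop fs -get output/part-r-00000 results.txt

# 6. Copy the results back from the cluster
scp you@gateway:results.txt .
```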
Course Goals and Content

Course goals:
- Master the MapReduce programming model and its runtime environment.
- Master the basic methods for parallelizing algorithms under the MapReduce model.
- Understand the implementation techniques behind MapReduce's distributed runtime.
- Understand the state of the art and the key problems of large-scale data processing and algorithm parallelization in cloud computing.
- Develop the habit of thinking about problems in parallel.

Course content:
Lecture 1: Course introduction: cloud computing; the MapReduce environment. Introduces cloud computing with large-scale data processing as the backdrop; MapReduce is the platform for lectures and labs and the center of the course.
Lecture 2: MapReduce principles; the inverted index problem. Discusses the basic principles of MapReduce starting from functional languages; analyzes the inverted index problem and its MapReduce implementation.
Lecture 3: Foundations of parallel and distributed systems; the PageRank problem. Introduces the design of large-scale parallel and distributed systems; analyzes the PageRank problem and its MapReduce implementation.
Lecture 4: MapReduce system design and implementation; the clustering problem. Analyzes the system design and trade-offs of MapReduce; analyzes the clustering problem and its MapReduce implementation.
Lecture 5: Higher-level applications of MapReduce; the frequent itemset mining problem. Surveys applications and developments on top of MapReduce; analyzes the frequent itemset mining problem and its MapReduce implementation.
Lecture 6: Project discussion. Discussion of the course projects.
Lecture 7: Invited talk. A researcher or engineer from academia or industry.
Lecture 8: Project presentations. Students present their course projects.

Grading Policy
Assignments (30%):
- Lab 1: Introduction to Hadoop, Eclipse
- Lab 2: A Simple Inverted Index
- Lab 3: PageRank over Wikipedia Corpus
- Lab 4: Clustering the Netflix Movie Data
Readings (20%):
- Hw1: read an introduction to distributed systems and an introduction to MapReduce programming
- Hw2: read MapReduce [1]
- Hw3: read GFS [2]
- Hw4: read Pig Latin [3]
Course project (50%)

Course Prerequisites
- Fluency in one programming language.
- Lots of Java programming practice.

Teachers and Resources
- Instructors: 彭波 (Peng Bo), 闫宏飞 (Yan Hongfei); teaching assistant: 陈日闪 (Chen Rishan)
- Course website: http://net.pku.edu.cn/~course/cs402/2009/
- Discussion group: http://groups.google.com/group/cs402pku
- Hadoop homepage: http://hadoop.apache.org/core/
- Resources: http://net.pku.edu.cn/~course/cs402/2008/resource.html

Homework
- Register and form groups of 3-4 people in preparation for the course project; groups spanning different specializations are very welcome.
- Lab 1: Introduction to Hadoop, Eclipse (http://net.pku.edu.cn/~course/cs402/2009/)
- Reading 1: introduction to distributed systems; introduction to parallel programming.
  http://code.google.com/edu/parallel/dsd-tutorial.html
  http://code.google.com/edu/parallel/mapreduce-tutorial.html

Summary
Cloud computing brings:
- the possibility of using unlimited resources on demand, anytime and anywhere;
- the possibility of building and deploying applications that automatically scale to tens of thousands of computers;
- the possibility of building and running programs that deal with prodigious volumes of data;
- …
How do we make it real? With a distributed file system, a distributed computing framework, and more.

Q&A

References
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. Bolton Landing, NY, USA: ACM Press, 2003.
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, Canada: ACM, 2008.

Google App Engine
- App Engine handles HTTP(S) requests, nothing else. Think RPC: request in, processing, response out. This works well for the web and AJAX, and also for other services.
- App configuration is dead simple.
- No performance tuning needed: everything is built to scale to an "infinite" number of apps, requests/sec, and storage capacity.
- The APIs are simple, stupid.

App Engine Architecture
[Diagram: requests/responses flow into a Python VM process running the app and the standard library on a read-only file system; stateless APIs (urlfetch, mail, images) sit alongside stateful APIs (datastore, memcache)]

Microsoft Windows Azure
[Figure-only slide]

Amazon Web Services
- Amazon's infrastructure (auto scaling, load balancing)
- Elastic Compute Cloud (EC2): scalable virtual private server instances
- Simple Storage Service (S3)
- Simple Queue Service (SQS): messaging
- SimpleDB: database
- Flexible Payments Service, Mechanical Turk, CloudFront, etc.

Amazon Web Services (continued)
- A very flexible, lower-level offering (closer to the hardware), which means more possibilities and higher performance.
- Runs the platform you provide (machine images) and supports all major web languages.
- Industry-standard services, so you can move off AWS easily.
- Requires much more work and a longer time to market: deployment scripts, configuring images, etc. Various libraries and GUI plug-ins do help make AWS easier.

Price of Amazon EC2
[Table-only slide: EC2 pricing]