Automated Large-Scale Data Partitioning and Reassembly



Subproject: Automated Large-Scale Data Partitioning and Reassembly

賴智錦, Department of Electrical Engineering, National University of Kaohsiung

Second-Quarter Results

What was expected to be done

• Understand and become familiar with how existing distributed languages and tools handle distributed operators and data reduction/assembly.
• Design algorithms for data partitioning and reassembly.
• Simulate the system on a single machine in C or a similar language.
• Refine the data partitioning and reassembly algorithms.

What was achieved

• Understood the execution principles of the distributed tool Hadoop MapReduce.
• Performed a single-machine system simulation in Java.

Any difficulties

Suitable test examples are needed in order to carry out the algorithm design.


Future tasks

• Strengthen our understanding of other approaches to distributed data partitioning/reassembly.
• Gain a deeper understanding of MapReduce principles and applications.
• Try to develop cloud-computing programs that follow the MapReduce model.

Comments

• Whether the characteristics of the test examples are sufficient to demonstrate the benefits of cloud computing.
• How to resolve the difficulties encountered during the single-machine simulation.


Fig. 1. MapReduce data flow with a single reduce task
Fig. 2. MapReduce data flow with multiple reduce tasks
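For the multiple-reduce-task flow of Fig. 2, the map output must be partitioned by key so that all pairs for one user reach the same reduce task. The sketch below (class and method names are ours, not from the project code) shows the routing rule; the formula mirrors Hadoop's default HashPartitioner:

```java
/* Sketch of how map output keys are routed to reduce tasks (cf. Fig. 2).
 * Every record with the same key lands on the same reducer, so each
 * reducer sees a complete group. */
public class PartitionSketch {
    public static int partition(int userId, int numReduceTasks) {
        // Integer.hashCode(x) == x; the mask keeps the value non-negative.
        return (Integer.hashCode(userId) & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (int user : new int[] { 951709, 585247, 2625420 })
            System.out.println(user + " -> reducer " + partition(user, reducers));
    }
}
```

With a single reduce task (Fig. 1) the rule degenerates to sending every key to reducer 0.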

Operating environment

• Virtual machine image: hadoop_centos.vmx (Trend Micro)
• Machine name: hadoop
• Network: DHCP
• VM software: VMplayer


Average Rating MapReduce Example

• Data set: Netflix Prize (17,770 files, 480,189 users)
• Goal: calculate the average movie rating per user
• Execute a MapReduce task over the dataset on a single node
• Source: http://archive.ics.uci.edu/ml/datasets/Netflix+Prize
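The example's logic can be followed without Hadoop at all. The single-node sketch below (plain Java; class and method names are ours, not from the project code) performs the same map-style parsing of "userID, rating, date" lines and reduce-style averaging per user:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/* Minimal single-node sketch of the average-rating job: the "map" step
 * parses "userID, rating, date" lines (skipping each file's "movieID:"
 * header), accumulating (rating sum, count) per user just like the
 * IntArrayWritable pairs; the "reduce" step emits the per-user average. */
public class AverageRatingSketch {

    public static Map<Integer, int[]> mapPhase(String[] lines) {
        Map<Integer, int[]> perUser = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.endsWith(":")) continue; // movie header
            String[] f = trimmed.split(",");
            int user = Integer.parseInt(f[0].trim());
            int rating = Integer.parseInt(f[1].trim());
            int[] acc = perUser.computeIfAbsent(user, k -> new int[2]);
            acc[0] += rating; // rating sum
            acc[1] += 1;      // rating count
        }
        return perUser;
    }

    public static Map<Integer, Float> reducePhase(Map<Integer, int[]> perUser) {
        Map<Integer, Float> out = new LinkedHashMap<>();
        for (Map.Entry<Integer, int[]> e : perUser.entrySet())
            out.put(e.getKey(), ((float) e.getValue()[0]) / e.getValue()[1]);
        return out;
    }

    public static void main(String[] args) {
        String[] sample = {
            "7:",
            "951709, 2, 2001-11-04",
            "585247, 1, 2003-12-19",
            "2625420, 2, 2004-06-03"
        };
        // Each user appears once in this sample, so each average equals the rating.
        System.out.println(reducePhase(mapPhase(sample)));
    }
}
```

On the real dataset the map phase runs over 17,770 movie files in parallel, but the per-user (sum, count) accumulation is the same.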


Average Rating MapReduce Example

• HadoopDriver.java: the main (driver) program.
• UserRatingMapper.java: Mapper, emits each user's rating.
• AverageValueReducer.java: Reducer, computes the average rating for each user.
• IntArrayWritable.java: declaration of the shared Writable array type.


HadoopDriver.java

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HadoopDriver {
    public static void main(String[] args) {
        /* Require args to contain the paths */
        if (args.length != 1 && args.length != 2) {
            System.err.println("Error! Usage: \n"
                    + "HadoopDriver <input path> <output path>\n"
                    + "HadoopDriver <job conf file>");
            System.exit(1);
        }
        JobClient client = new JobClient();
        JobConf conf = null;
        if (args.length == 2) {
            conf = new JobConf(HadoopDriver.class);
            /* UserRatingMapper outputs (IntWritable, IntArrayWritable(Writable[2])) */
            conf.setMapOutputKeyClass(IntWritable.class);
            conf.setMapOutputValueClass(IntArrayWritable.class);
            /* AverageValueReducer outputs (IntWritable, FloatWritable) */
            conf.setOutputKeyClass(IntWritable.class);
            conf.setOutputValueClass(FloatWritable.class);
            /* Pull input and output Paths from the args */
            conf.setInputPath(new Path(args[0]));
            conf.setOutputPath(new Path(args[1]));
            /* Set the Mapper, Combiner, and Reducer classes */
            conf.setMapperClass(UserRatingMapper.class);
            conf.setCombinerClass(UserRatingMapper.class);
            conf.setReducerClass(AverageValueReducer.class);
            conf.set("mapred.child.java.opts", "-Xmx2048m");
        } else {
            conf = new JobConf(args[0]);
        }
        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

UserRatingMapper.java

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/* Sketch reconstructed from the emit format shown on the dataset slides:
 * each "userID, rating, date" line becomes (userID, (rating, 1)). The class
 * also implements Reducer so it can serve as the combiner configured in
 * HadoopDriver, merging partial (rating sum, count) pairs per user. */
public class UserRatingMapper extends MapReduceBase implements Mapper, Reducer {
    public void map(WritableComparable key, Writable value,
            OutputCollector output, Reporter reporter) throws IOException {
        String line = value.toString().trim();
        /* Skip blank lines and the "movieID:" header line of each file */
        if (line.length() == 0 || line.endsWith(":")) return;
        String[] fields = line.split(",");
        IntWritable userId = new IntWritable(Integer.parseInt(fields[0].trim()));
        IntWritable rating = new IntWritable(Integer.parseInt(fields[1].trim()));
        output.collect(userId,
                new IntArrayWritable(new Writable[] { rating, new IntWritable(1) }));
    }

    /* Combiner step: sum the partial (rating sum, count) pairs for a user */
    public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
        int sum = 0, count = 0;
        while (values.hasNext()) {
            Writable[] pair = ((IntArrayWritable) values.next()).get();
            sum += ((IntWritable) pair[0]).get();
            count += ((IntWritable) pair[1]).get();
        }
        output.collect(key, new IntArrayWritable(
                new Writable[] { new IntWritable(sum), new IntWritable(count) }));
    }
}

AverageValueReducer.java

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** This class reads a list of IntArrayWritable (rating sum, count) pairs and
 * emits the average value under the same key.
 *
 * Not much to it.
 *
 * @author Daniel Jackson, Scott Griffin */
public class AverageValueReducer extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
        int sum = 0, count = 0;
        IntArrayWritable ratingInput = null;
        Writable[] inputArray = null;
        while (values.hasNext()) {
            ratingInput = (IntArrayWritable) values.next();
            inputArray = ratingInput.get();
            sum += ((IntWritable) inputArray[0]).get();
            count += ((IntWritable) inputArray[1]).get();
        }
        /* Average = rating sum / rating count for this user */
        output.collect(key, new FloatWritable(((float) sum) / count));
    }
}
/* Copyright @ 2008 California Polytechnic State University
 * Licensed under the Creative Commons
 * Attribution 3.0 License - http://creativecommons.org/licenses/by/3.0/ */

IntArrayWritable.java

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

/* Must subclass ArrayWritable if it is to be the input to a Reduce function,
 * because the valueClass is not written to the output. Wish there was
 * some documentation which said that...
 * @author Daniel */
public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable(Writable[] values) { super(IntWritable.class, values); }
    public IntArrayWritable() { super(IntWritable.class); }
    public IntArrayWritable(Class valueClass, Writable[] values) { super(IntWritable.class, values); }
    public IntArrayWritable(Class valueClass) { super(IntWritable.class); }
    public IntArrayWritable(String[] strings) { super(strings); }
}
/* Copyright @ 2008 California Polytechnic State University
 * Licensed under the Creative Commons
 * Attribution 3.0 License - http://creativecommons.org/licenses/by/3.0/ */

NetFlix dataset (Input → Mapper)

Input: a "movieID:" header, then "UserID, Rating value, Date" lines:

7:
951709, 2, 2001-11-04
585247, 1, 2003-12-19
2625420, 2, 2004-06-03
2322468, 3, 2003-11-12
2056324, 2, 2002-11-10
1969230, 4, 2003-06-01

Emit: UserID → (Rating value, Count):

951709 → (2, 1)
585247 → (1, 1)
2625420 → (2, 1)
2322468 → (3, 1)
2056324 → (2, 1)
1969230 → (4, 1)

NetFlix dataset (Reducer → Output)

Input: each UserID with the list of (Rating value, Count) pairs collected for it
across all movie files, e.g. 951709 → (2, 1), …

Emit: UserID → rating sum / count:

951709 → 2.5
585247 → 2
2625420 → 2.4
2322468 → 3.5
2056324 → 1.5
1969230 → 3
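One property worth noting about these (rating sum, count) pairs: they can be merged in any grouping without changing the final average, which is what makes it safe for HadoopDriver to install a combiner that pre-aggregates map output. A minimal sketch (names are ours, not from the project code):

```java
/* Partial (rating sum, count) pairs merge associatively, so a combiner can
 * fold them on the map side and the reducer still gets the right average. */
public class CombineSketch {
    /* Merge two partial (sum, count) pairs into one. */
    public static int[] merge(int[] a, int[] b) {
        return new int[] { a[0] + b[0], a[1] + b[1] };
    }

    public static float average(int[] pair) {
        return ((float) pair[0]) / pair[1];
    }

    public static void main(String[] args) {
        int[] r1 = {2, 1}, r2 = {3, 1}, r3 = {4, 1}; // three single ratings
        // ((r1+r2)+r3) and (r1+(r2+r3)) both give (sum, count) = (9, 3).
        System.out.println(average(merge(merge(r1, r2), r3))); // prints 3.0
    }
}
```

Averaging averages directly would not have this property; that is why the pairs carry the sum and the count separately until the final division.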

NetFlix dataset: 500 files. Total time = 18:17; total size = 54.9 MB

NetFlix dataset: 1000 files. Total time = 34:14; total size = 98.3 MB

Result: output file for the 500-file NetFlix dataset

Result: output file for the 1000-file NetFlix dataset