Transcript: Automated Large-Scale Data Partitioning and Integration
Subproject: Automated Large-Scale Data Partitioning and Integration
Department of Electrical Engineering, National University of Kaohsiung, 賴智錦
第二季成果 (Second-Quarter Results)
What was expected to be done
• Understand how existing distributed languages and tools handle distributed operators and data reduction/assembly.
• Design algorithms for data partitioning and reassembly.
• Simulate a single-node system in C or a similar language.
• Revise the data partitioning and reassembly algorithms.
What was achieved
• Understood the execution principles of MapReduce in the distributed tool Hadoop.
• Simulated a single-node system in Java.
Any difficulties
• Suitable test examples are needed in order to carry out the algorithm design.
Future tasks
• Strengthen understanding of other distributed data partitioning/reassembly approaches.
• Gain a deeper understanding of MapReduce principles and applications.
• Try to develop cloud-computing programs based on the MapReduce model.
Comments
• Whether the characteristics of the test examples suffice to demonstrate the benefits of cloud computing.
• How the difficulties encountered during single-node simulation were resolved.
Fig. 1. MapReduce data flow with a single reduce task
Fig. 2. MapReduce data flow with multiple reduce tasks
Operating Environment
• Virtual machine: hadoop_centos.vmx (Trend Micro), run under VMplayer
• Machine name: hadoop
• Network: DHCP
Average Rating MapReduce Example
• Data set: Netflix Prize (17,770 files, 480,189 users)
• Goal: calculate the average movie rating per user
• Execute a MapReduce task over the dataset on a single node
• Source: http://archive.ics.uci.edu/ml/datasets/Netflix+Prize
• HadoopDriver.java: main program.
• UserRatingMapper.java: Mapper that emits each user's rating.
• AverageValueReducer.java: Reducer that computes the average rating for each user.
• IntArrayWritable.java: declaration of the shared (global) Writable array type.
HadoopDriver.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HadoopDriver {
    public static void main(String[] args) {
        /* Require args to contain the paths */
        if (args.length != 1 && args.length != 2) {
            System.err.println("Error! Usage: \n" + "HadoopDriver
HadoopDriver.java
            conf.setOutputKeyClass(IntWritable.class);
            conf.setOutputValueClass(FloatWritable.class);
            /* Pull input and output Paths from the args */
            conf.setInputPath(new Path(args[0]));
            conf.setOutputPath(new Path(args[1]));
            /* Set to use Mapper and Reducer classes */
            conf.setMapperClass(UserRatingMapper.class);
            conf.setCombinerClass(UserRatingMapper.class);
            conf.setReducerClass(AverageValueReducer.class);
            conf.set("mapred.child.java.opts", "-Xmx2048m");
        } else {
            conf = new JobConf(args[0]);
        }
        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
AverageValueReducer.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/* This class reads a list of IntWritables, and emits the average value under
 * the same key.
 *
 * Not much to it.
 *
 * @author Daniel Jackson, Scott Griffin */
public class AverageValueReducer extends MapReduceBase
        implements Reducer<WritableComparable, IntArrayWritable, WritableComparable, FloatWritable> {
AverageValueReducer.java
    public void reduce(WritableComparable key, Iterator<IntArrayWritable> values,
            OutputCollector<WritableComparable, FloatWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0, count = 0;
        IntArrayWritable ratingInput = null;
        Writable[] inputArray = null;
        while (values.hasNext()) {
            ratingInput = (IntArrayWritable) values.next();
            inputArray = ratingInput.get();
            sum += ((IntWritable) inputArray[0]).get();
            count += ((IntWritable) inputArray[1]).get();
        }
        output.collect(key, new FloatWritable(((float) sum) / count));
    }
}

/* Copyright @ 2008 California Polytechnic State University
 * Licensed under the Creative Commons
 * Attribution 3.0 License - http://creativecommons.org/licenses/by/3.0/ */
IntArrayWritable.java

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

/* Must subclass ArrayWritable if it is to be the input to a Reduce function
 * because the valueClass is not written to the output. Wish there was
 * some documentation which said that...
 * @author Daniel */
public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable(Writable[] values) {
        super(IntWritable.class, values);
    }
    public IntArrayWritable() {
        super(IntWritable.class);
    }
    public IntArrayWritable(Class valueClass, Writable[] values) {
        super(IntWritable.class, values);
    }
    public IntArrayWritable(Class valueClass) {
        super(IntWritable.class);
    }
    public IntArrayWritable(String[] strings) {
        super(strings);
    }
}

/* Copyright @ 2008 California Polytechnic State University
 * Licensed under the Creative Commons
 * Attribution 3.0 License - http://creativecommons.org/licenses/by/3.0/ */
NetFlix dataset (Input → Mapper)

Input (MovieID:, then UserID, Rating value, Date):
7:
951709, 2, 2001-11-04
585247, 1, 2003-12-19
2625420, 2, 2004-06-03
2322468, 3, 2003-11-12
2056324, 2, 2002-11-10
1969230, 4, 2003-06-01

Emit (UserID, Rating value, Rating count):
951709, 2, 1
585247, 1, 1
2625420, 2, 1
2322468, 3, 1
2056324, 2, 1
1969230, 4, 1
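The mapper step above can be sketched in plain Java without Hadoop. This is an illustrative sketch, not the project's UserRatingMapper; it assumes the input has already been reduced to "UserID, Rating, Date" lines (the "MovieID:" header lines of the Netflix files are handled elsewhere). Each line becomes a (UserID, [rating, count=1]) pair, so a downstream combiner or reducer can sum both fields independently.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the map phase: one (UserID, [rating, 1]) pair per input line. */
public class RatingMapSketch {
    public static List<Map.Entry<Integer, int[]>> map(String[] lines) {
        List<Map.Entry<Integer, int[]>> emitted = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            int userId = Integer.parseInt(fields[0].trim());
            int rating = Integer.parseInt(fields[1].trim());
            // Emit the rating together with a count of 1; the date field is ignored.
            emitted.add(new SimpleEntry<>(userId, new int[]{rating, 1}));
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, int[]>> out =
            map(new String[]{"951709, 2, 2001-11-04", "585247, 1, 2003-12-19"});
        System.out.println(out.get(0).getKey() + " -> " + out.get(0).getValue()[0]); // prints 951709 -> 2
    }
}
```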
NetFlix dataset (Reducer → Output)

Input (UserID, list of [Rating value, Rating count] pairs grouped by key):
951709, [2, 1], …
585247, [1, 1], …
2625420, [2, 1], …
2322468, [3, 1], …
2056324, [2, 1], …
1969230, [4, 1], …

Emit (UserID, rating value sum / count sum):
951709, 2.5
585247, 2
2625420, 2.4
2322468, 3.5
2056324, 1.5
1969230, 3
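The reduce step shown above is equally simple to sketch without Hadoop. The names below are illustrative, not the project's AverageValueReducer, but the logic mirrors it: sum one user's (rating, count) pairs and divide.

```java
import java.util.List;

/** Plain-Java sketch of the reduce phase: average one user's (rating, count) pairs. */
public class AverageReduceSketch {
    public static float reduce(List<int[]> ratingCountPairs) {
        int sum = 0, count = 0;
        for (int[] pair : ratingCountPairs) {
            sum += pair[0];   // accumulated rating values
            count += pair[1]; // accumulated rating counts
        }
        return ((float) sum) / count;
    }

    public static void main(String[] args) {
        // e.g. ratings 2 and 3 for one user average to 2.5
        System.out.println(reduce(List.of(new int[]{2, 1}, new int[]{3, 1}))); // prints 2.5
    }
}
```

Carrying the explicit count (rather than a bare rating) is what lets the same class double as a combiner-friendly aggregation: partial sums from different nodes can be merged by adding both fields.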
NetFlix dataset: 500. Total time = 18:17; total size = 54.9 MB.
NetFlix dataset: 1000. Total time = 34:14; total size = 98.3 MB.
Result
NetFlix dataset: 500 output file
NetFlix dataset: 1000 output file