PowerPoint Presentation


Cloud Computing Era (Practice)
Phoenix Liau, Trend Micro

Three Major Trends to Change the World

Cloud Computing Big Data Mobile

What is cloud computing? The definition from the U.S. National Institute of Standards and Technology (NIST):

Essential Characteristics Service Models Deployment Models

Using an as-a-service business model, cloud computing delivers scalable and elastic IT-related capabilities to users through Internet technologies.

It’s About the Ecosystem Structured, Semi-structured

Enterprise Data Warehouse Cloud Computing

SaaS PaaS IaaS

Generate Big Data Lead Business Insights create Competition, Innovation, Productivity

What is Big Data?
• A set of files
• A database
• A single file

What is the problem?
• Getting the data to the processors becomes the bottleneck
• Quick calculation
  – Typical disk data transfer rate: 75 MB/sec
  – Time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
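The quick calculation can be reproduced in a few lines of Python. This is only a sketch: the function name `transfer_minutes` is hypothetical, and it assumes binary units (1 GB = 1024 MB), which the slide does not state. With decimal units the result is slightly lower, still in the 22-minute range.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float, mb_per_gb: int = 1024) -> float:
    """Minutes needed to stream size_gb from a single disk at rate_mb_per_s."""
    seconds = size_gb * mb_per_gb / rate_mb_per_s
    return seconds / 60


# Figures taken from the slide: 100 GB at 75 MB/sec.
minutes = transfer_minutes(100, 75)
print(f"{minutes:.1f} minutes")  # prints "22.8 minutes"
```

A single disk is the limiting factor here, which is why the rest of the deck turns to spreading both data and computation across many machines.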

The Era of Big Data – Are You Ready?
• Businesses are driving the growth of big data. Storing huge volumes of data capably, managing them efficiently, and turning them into business value are major challenges for enterprises.

• Overwhelming quantities of big data will challenge enterprise storage infrastructure and data center architecture, causing chain reactions in database storage, data mining, business intelligence, cloud computing, and computing applications.

• Data for business analysis
  – 2011: multi-terabyte (TB)
  – 2020: 35.2 ZB (1 ZB = 1 billion TB)

Who Needs It?

When to use?

Enterprise Database
• Ad-hoc Reporting (<1sec)
• Multi-step Transactions
• Lots of Inserts/Updates/Deletes

Hadoop
• Affordable Storage/Compute
• Unstructured or Semi-structured
• Resilient Auto Scalability

Hadoop!

• Apache Hadoop project
  – Inspired by Google's MapReduce and Google File System papers
• Open-source, flexible and available architecture for large-scale computation and data processing on a network of commodity hardware
• Open-source software + commodity hardware
  – IT cost reduction

Hadoop Core
• MapReduce (Java)
• HDFS (Java)

Word Count Example
• Map input: Key: offset, Value: line
  0: The cat sat on the mat
  22: The aardvark sat on the sofa
• Map output: Key: word, Value: count
• Reduce output: Key: word, Value: sum of count

The Hadoop Ecosystems

The Ecosystem is the System
• Hadoop has become the kernel of the distributed operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache

Relation Map

• Hue (Web Console)
• Oozie (Job Workflow & Scheduling)
• Mahout (Data Mining)
• Sqoop/Flume (Data Integration)
• Pig/Hive (Analytical Language)
• MapReduce Runtime (Distributed Programming Framework)
• HBase (Column NoSQL DB)
• Hadoop Distributed File System (HDFS)

Zookeeper – Coordination Framework


What is ZooKeeper?
• A centralized service for
  – Maintaining configuration information
  – Providing distributed synchronization
• A set of tools to build distributed applications that can safely handle partial failures
• ZooKeeper was designed to store coordination data
  – Status information
  – Configuration
  – Location information

Flume / Sqoop – Data Integration Framework


What's the problem with data collection?
• Data collection is currently a priori and ad hoc
• A priori: you decide what you want to collect ahead of time
• Ad hoc: each kind of data source goes through its own collection path

What is Flume (and how can it help?)
• A distributed data collection service
• It efficiently collects, aggregates, and moves large amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• One-stop solution for data collection of all formats

Flume: High-Level Overview
• Logical Node
• Source
• Sink

Flume Architecture

[Diagram: logs are read by Flume nodes, which deliver them to HDFS]

Sqoop
• Easy, parallel database import/export
• What do you want to do?
  – Import data from an RDBMS into HDFS
  – Export data from HDFS back into an RDBMS

[Diagram: HDFS ↔ Sqoop ↔ RDBMS]

Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
...

$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...

Pig / Hive – Analytical Language


Why Hive and Pig?

• Although MapReduce is very powerful, it can also be complex to master
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
• Many organizations have programmers who are skilled at writing code in scripting languages
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
  – Hive was initially developed at Facebook, Pig at Yahoo!

Hive
• Developed by Facebook
• What is Hive?
  – An SQL-like interface to Hadoop
• Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
  – MapReduce for execution
  – HDFS for storage
• Hive Query Language
  – Basic SQL: Select, From, Join, Group-By
  – Equi-Join, Multi-Table Insert, Multi-Group-By
  – Batch query

SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid;

Hive

SQL → Hive → MapReduce

Pig
• Initiated by Yahoo!
• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• A simple way to write MapReduce programs
• Easy to understand
• Easy to debug

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

Pig

Script → Pig → MapReduce

Hive vs. Pig

Language
  Hive: HiveQL (SQL-like)
  Pig: Pig Latin, a scripting language

Schema
  Hive: Table definitions that are stored in a metastore
  Pig: A schema is optionally defined at runtime

Programmatic Access
  Hive: JDBC, ODBC
  Pig: PigServer

WordCount Example
• Input
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
• The reduce just sums up the values
  <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
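The map, shuffle and reduce steps of this example can be simulated in plain Python without a cluster. This is a minimal single-process sketch, not Hadoop's API: the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, and the shuffle that Hadoop performs between the two phases is modelled here as a simple group-and-sort on the key.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit <word, 1> for every whitespace-separated token in the line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key and order the keys, as happens between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Sum up all the counts emitted for one word."""
    return word, sum(counts)


lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
result = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(result)
# [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and each reducer only sees the keys routed to it.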

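WordCount Example in MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits <word, 1> for every token in an input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}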

WordCount Example by Pig
A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;

WordCount Example by Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;

The Story So Far

[Diagram: Hive (SQL), Pig (Script) and MapReduce (Java) sit on top of HDFS (Java); Sqoop links HDFS to an RDBMS (SQL) and Flume links it to a file system (POSIX)]

Hbase – Column NoSQL DB


Structured-data vs Raw-data

HBase
• Inspired by Google's Bigtable
• Coordinated by ZooKeeper
• Low latency
• Random reads and writes
• Distributed key/value store
• Simple API
  – PUT
  – GET
  – DELETE
  – SCAN

HBase – Data Model
• Cells are "versioned"
• Table rows are sorted by row key
• Region: a row range [start-key:end-key]
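The three properties of the data model (versioned cells, rows sorted by row key, regions as row ranges) can be illustrated with a toy in-memory table. This is a sketch under stated assumptions and not the HBase client API: the class `TinyTable` and its methods are hypothetical, timestamps are passed in by hand, and the row range is treated as half-open (start key inclusive, end key exclusive).

```python
import bisect
from typing import Dict, List, Optional, Tuple


class TinyTable:
    """Toy model of a table with sorted row keys and versioned cells."""

    def __init__(self) -> None:
        self._keys: List[str] = []  # row keys, kept sorted
        self._rows: Dict[str, Dict[str, List[Tuple[int, str]]]] = {}

    def put(self, row: str, column: str, value: str, ts: int) -> None:
        """Store a new version of a cell; older versions are kept."""
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(reverse=True)  # newest version first

    def get(self, row: str, column: str, ts: Optional[int] = None) -> Optional[str]:
        """Return the newest value, or the newest one at or before ts."""
        for version_ts, value in self._rows.get(row, {}).get(column, []):
            if ts is None or version_ts <= ts:
                return value
        return None

    def scan(self, start_key: str, end_key: str) -> List[str]:
        """Return the row keys in [start_key, end_key), in sorted order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return self._keys[lo:hi]


table = TinyTable()
table.put("row2", "mycf:col1", "val3", ts=1)
table.put("row1", "mycf:col1", "val1", ts=1)
table.put("row1", "mycf:col1", "val1-new", ts=2)

print(table.get("row1", "mycf:col1"))        # val1-new
print(table.get("row1", "mycf:col1", ts=1))  # val1
print(table.scan("row1", "row2"))            # ['row1']
```

Because the row keys stay sorted, a range scan is just a slice of the key list, which is the same reason a region can be described by nothing more than its start and end keys.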

Hbase – workflow

HBase Examples
hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'

©2011 Cloudera, Inc. All Rights Reserved.

Oozie – Job Workflow & Scheduling


What is Oozie?
• A Java web application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
• Triggered by
  – Time
  – Data

Job 1 Job 2 Job 3 Job 4 Job 5

Mahout – Data Mining


What is Mahout?
• Machine-learning tool
• Distributed and scalable machine learning algorithms on the Hadoop platform
• Makes building intelligent applications easier and faster

Mahout Use Cases
• Yahoo: Spam Detection
• Foursquare: Recommendations
• SpeedDate.com: Recommendations
• Adobe: User Targeting
• Amazon: Personalization Platform

Use Case Example
• Predict what the user likes based on
  – His/her historical behavior
  – Aggregate behavior of people similar to him/her
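The idea of predicting preferences from a user's own history plus the behavior of similar users can be sketched as a small user-based collaborative filter. This is illustrative Python, not Mahout's API: the function names, the choice of cosine similarity, and the toy ratings are all assumptions made for this example.

```python
from math import sqrt
from typing import Dict, List

Ratings = Dict[str, Dict[str, float]]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two users' rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(user: str, ratings: Ratings, top_n: int = 1) -> List[str]:
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    history = ratings[user]
    scores: Dict[str, float] = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(history, other_ratings)
        for item, rating in other_ratings.items():
            if item not in history:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical toy data: bob's history closely matches alice's, carol's does not.
ratings: Ratings = {
    "alice": {"book_a": 5, "book_b": 4},
    "bob": {"book_a": 5, "book_b": 4, "book_c": 5},
    "carol": {"book_d": 5},
}

print(recommend("alice", ratings))  # ['book_c']
```

The pairwise similarity step grows quadratically with the number of users, which is the kind of workload that motivates running such algorithms on a Hadoop cluster.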

Conclusion

Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop ecosystem

Recap – Hadoop Ecosystem

• Hue (Web Console)
• Oozie (Job Workflow & Scheduling)
• Mahout (Data Mining)
• Sqoop/Flume (Data Integration)
• Pig/Hive (Analytical Language)
• MapReduce Runtime (Distributed Programming Framework)
• HBase (Column NoSQL DB)
• Hadoop Distributed File System (HDFS)

Trend Micro Cloud Antivirus Case Study

Collaboration in the underground

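Explosive growth of network threats
Malware variants of every kind, spam, downloads from unknown sources and other threats from the Internet evade detection by traditional security systems and keep growing explosively, posing a serious information security threat.

New Unique Malware Discovered

1M unique malware samples every month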

New Design Concept for Threat Intelligence
• CDN / xSP
• Human Intelligence
• Honeypot
• Web Crawler
• Trend Micro Mail Protection
• Trend Micro Web Protection
• Trend Micro Endpoint Protection

150M+ Worldwide Endpoints/Sensors

Challenges We Face
The concept is great, but...

6 TB of data and 15B lines of logs are received daily. This becomes the Big Data challenge!

Issues to Address
Raw Data → Information → Threat Intelligence/Solution
• Volume: Infinite
• Time: No Delay
• Target: Keep Changing Threats

SPN High Level Architecture

[Architecture diagram. Labels: SPN Feedback; HTTP POST; Web Pages; HTTP Download; CDN Log; L4 Log Receiver; SPAM Log Receiver; Log Post Processing; Ad hoc Query (Pig); MapReduce; HBase; Hadoop Distributed File System (HDFS); Feedback Information; Circus (Ambari); Lumber Jack; Tracking Logging System (TLS); Malware Classification Correlation Platform; Global Object Cache (GOC); Message Bus; Email Reputation Service; Web Reputation Service; File Reputation Service]

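Trend Micro Big Data Processing Capacity
Data volume the cloud antivirus service must handle every day:
• 8.5 billion Web Reputation queries
• 3 billion Email Reputation queries
• 7 billion File Reputation queries
• 6 TB of raw logs collected from around the world
• Connections from 150 million endpoint devices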

Trend Micro: Web Reputation Services Technology Process Operation Trend Micro Products / Technology CDN Cache High Throughput Web Service Hadoop Cluster Web Crawling User Traffic | Honeypot Akamai Rating Server for Known Threats Unknown & Prefilter Page Download Threat Analysis

8 billion/day

40% filtered

4.8 billion/day

82% filtered

860 million/day

99.98% filtered
Machine Learning, Data Mining

25,000 malicious URLs/day

Block a malicious URL within 15 minutes of it going online!

Big Data Cases

Line Data on HBase
• Line data
  – MODEL: ->
  – INDEX: -> <[property in model>
• User: -> , <->
• Consistency in HBase
• Contact model: use column qualifier to store
• Support range query (e.g. message box)

Pig at LinkedIn

LinkedIn - Pig Example
• views = LOAD '/data/awesome' USING VoldemortStorage();
• views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage('date.range', 'num.days=90;days.ago=1');

Facebook Messages

Facebook Open Source Stack
• Memcached --> App Server Cache
• ZooKeeper --> Small Data Coordination Service
• HBase --> Database Storage Engine
• HDFS --> Distributed FileSystem
• Hadoop --> Asynchronous Map-Reduce Jobs

Questions?


Thank you!
