Energy Issues in Data Analytics
Domenico Talia
Carmela Comito
Università della Calabria & CNR-ICAR
Italy
[email protected]
2
Motivations for Taking Care of Data
• Data is everywhere (big, complex, real-time, unstructured).
• Putting data at the center of research on energy issues may bring benefits. (Today the focus is on algorithms.)
• Cost metrics for data management operations (communication, storage, access, query, analysis) will help professionals and users save energy in data-intensive applications.
• Energy-scalable data management is important for sustainable data science.
3
Data Availability or Data Deluge?
• Every life process today is data intensive.
• The information stored in digital data archives is enormous and
its size is still growing very rapidly.
4
Data Availability or Data Deluge?
• Some decades ago the main problem was a shortage of information. Now the challenge is
  • the very large volume of information to deal with, and
  • the associated complexity of processing it and extracting significant and useful parts or summaries.
[Figure labels: Complex, Big]
5
Problems …
• Bigger and more complex
problems must be solved
by using large-scale distributed
computing systems.
• DATA SOURCES are
larger and larger and ubiquitous
(Web, sensor networks, mobile
devices, telescopes, …).
Big Data
…and
• Even where accessible, much data in many fields cannot be read by humans,
so
• The huge amount of data available today requires smart data analysis techniques to help people deal with it,
and
• Scalable algorithms, techniques, and systems are needed (time and energy scalability).
6
7
Data: From Storing to Analysis
• Storing data is not the only major problem.
• A key issue is analyzing, mining, and processing data to make it useful.
Source: The Economist
Towards Models for Energy-aware Data Management
• The main focus today is on energy-aware algorithms, tasks, and applications.
• The other side of the coin is data and the cost of operating on it.
• Abstract energy-cost models for exchanging, accessing, and transforming data are primary elements of energy-aware data management at large scale.
• They are useful for sustainable data science.
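The idea of an abstract energy-cost model for data operations can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the operation names follow the three categories named on the slide (exchanging, accessing, transforming), while the linear per-megabyte form, the `DataOperation` type, and the coefficient values are placeholder assumptions, not measurements or a model from the talk.

```python
from dataclasses import dataclass

# Hypothetical per-operation coefficients (joules per megabyte).
# Placeholder values for illustration only; not measured data.
JOULES_PER_MB = {
    "exchange": 2.0,
    "access": 0.5,
    "transform": 4.0,
}


@dataclass(frozen=True)
class DataOperation:
    kind: str        # one of the keys in JOULES_PER_MB
    size_mb: float   # amount of data the operation touches


def estimate_energy(ops, coefficients=JOULES_PER_MB):
    """Return the estimated energy (J) of a sequence of data operations,
    assuming cost grows linearly with the data size of each operation."""
    total = 0.0
    for op in ops:
        if op.size_mb < 0:
            raise ValueError("size_mb must be non-negative")
        total += coefficients[op.kind] * op.size_mb
    return total


pipeline = [
    DataOperation("exchange", 10.0),
    DataOperation("access", 10.0),
    DataOperation("transform", 10.0),
]
print(estimate_energy(pipeline))  # 2.0*10 + 0.5*10 + 4.0*10 = 65.0
```

A real model would likely replace the linear form with device-specific, empirically fitted functions; the sketch only shows how data characteristics, rather than algorithm steps, can be the inputs of the estimate.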
8
An Example:
Energy-aware Mining of Data
• We evaluated the energy cost of analyzing data with some well-known data mining techniques on mobile devices.
• Our main interest was how the energy consumed by the same technique changes as the dimensions of the data change.
• Tests with different
  • data set dimensions,
  • numbers of attributes,
  • numbers of classes.
9
10
Data Mining Techniques
• Energy characterization of data mining techniques running on mobile devices
  • k-means (data clustering)
  • J48 (data classification)
  • Apriori (association rules)
• Common performance parameters
  • Number of instances (data set size)
  • Number of attributes
• Algorithm-specific performance parameters
  • k-means: number of clusters
  • J48: decision tree size
  • Apriori: number of rules, minimum support, and minimum confidence
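A parameter sweep over these performance parameters can be sketched as follows for k-means. This is a minimal sketch under several assumptions: it uses a plain pure-Python Lloyd's k-means on synthetic random data (not the implementation or data sets used in the talk), and it records wall-clock time as a stand-in, because energy measurement depends on device-specific battery APIs. The function names `kmeans` and `sweep` are illustrative.

```python
import random
import time


def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def sweep(instance_counts, attribute_counts, cluster_counts, seed=0):
    """Time k-means for every combination of the three parameters."""
    rng = random.Random(seed)
    results = []
    for n in instance_counts:
        for d in attribute_counts:
            # Synthetic data: n instances with d numeric attributes each.
            data = [[rng.random() for _ in range(d)] for _ in range(n)]
            for k in cluster_counts:
                start = time.perf_counter()
                kmeans(data, k)
                elapsed = time.perf_counter() - start
                results.append((n, d, k, elapsed))
    return results


for n, d, k, seconds in sweep([200, 400], [4, 8], [2, 4]):
    print(f"instances={n} attributes={d} clusters={k} time={seconds:.4f}s")
```

On a phone, the timing calls would be replaced or complemented by readings of battery charge before and after each run.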
k-means (1)
Increasing the number of instances, with different numbers of produced clusters
11
k-means (2)
Increasing the number of attributes, with different numbers of produced clusters
12
Apriori (1)
Increasing the number of instances, with different numbers of attributes
13
Apriori (2)
Increasing the data set size, with different numbers of rules
14
Apriori (3)
Increasing the data set size, with different minimum confidence values
15
J48
Increasing the number of instances, with different numbers of attributes
[Figure: energy consumption (Joules), CPU %, and time (sec) versus number of instances (1620, 3601, 6341, 10826), with one series each for Attr_55, Attr_38, Attr_16, and Attr_8]
16
Results on different devices
• Results obtained with different smartphones
  • Sony Xperia P: 1 GHz dual-core ARM processor and 1 GB RAM
  • HTC Hero: 528 MHz Qualcomm processor and 288 MB RAM
17
18
Results on different devices
• Results obtained with different smartphones
  • Sony Xperia P: 1 GHz dual-core ARM processor and 1 GB RAM
  • HTC Hero: 528 MHz Qualcomm processor and 288 MB RAM
  • Samsung Galaxy ACE: 800 MHz Qualcomm processor and 512 MB RAM
19
20
Concluding Remarks
• Data-intensive applications call for energy-cost models based on data characteristics.
• Such models should be developed for sensors, smartphones, HPC servers, and clouds and, in general, for large-scale computing systems.
• Sustainable data center services and applications may benefit from these models.
• Preliminary experiments provide useful data.
21
Data Sets
• Census (http://archive.ics.uci.edu/ml/datasets/Census+Income)
  • Used with k-means
  • Data set size: 14 MB
  • Number of instances: 244348
  • Number of attributes: 11
• Census_disc (http://archive.ics.uci.edu/ml/datasets/Census+Income)
  • Used with Apriori
  • Data set size: 19 MB
  • Number of instances: 333011
  • Number of attributes: 11
• Covertype (http://archive.ics.uci.edu/ml/datasets/Covertype)
  • Used with J48
  • Data set size: 14.5 MB
  • Number of instances: 114556
  • Number of attributes: 55
22
Measured resource usage per method, algorithm, and data set size
Columns: RAM memory (MB), virtual memory (MB), CPU (%), battery charge depletion (mAh), energy consumption (J), time (s). "---" marks a run with no measurements.

Association Rules (Rule Induction): Apriori on CENSUS_DISC.arff

Size (MB)  RAM    Virtual  CPU    Battery  Energy   Time
0.1        15.86  95.19    96.92  0        0        6
0.2        16.97  105.36   98.03  0        0        12
0.4        18.06  104.95   98.24  0        0        26
0.8        19.87  102.75   98.13  2.7      35.964   73
1.6        23.32  103.99   96.87  13.5     179.82   300
3.2        26.92  100.01   95.44  23.3     310.356  3960
6.4        ---    ---      ---    ---      ---      ---

Classification (Trees): J48 on COVERTYPE.arff
Sizes listed: 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4 MB (rows are in source order; the row-to-size pairing is not preserved in the transcript)

RAM    Virtual  CPU    Battery  Energy     Time
19.47  104.94   96.23  13.4     178.488    300
20.15  104.92   98.21  29.8     396.936    540
23.87  105.6    97.43  59.4     791.208    2040
27.68  103.87   97.36  194.64   2592.6048  8160
---    ---      ---    ---      ---        ---

Clustering (Instance-based/Lazy Learning): K-Means on CENSUS.arff
Sizes listed: 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4 MB (rows are in source order; the row-to-size pairing is not preserved in the transcript)

RAM    Virtual  CPU    Battery  Energy   Time
16.73  96.56    98.03  6.75     89.91    55
17.95  102.05   97.65  8.1      107.892  150
19.72  102.16   97.02  18.9     251.748  300
23.08  101.86   97.97  18.9     251.748  600
26.4   95.96    97.82  43.2     575.424  1320
---    ---      ---    ---      ---      ---
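The energy column appears to be derived from the battery charge depletion column: each pair of values in the table is consistent with E = Q × 3.6 × 3.7, that is, charge in mAh converted to coulombs and multiplied by a 3.7 V battery voltage. The 3.7 V figure is inferred from the table values and is not stated in the slides. A minimal Python sketch of the conversion under that assumption (the function name `mah_to_joules` is illustrative):

```python
# Assumed nominal battery voltage, inferred from the table values
# (not stated in the slides).
NOMINAL_VOLTAGE_V = 3.7
COULOMBS_PER_MAH = 3.6  # 1 mAh = 0.001 A * 3600 s


def mah_to_joules(charge_mah, voltage_v=NOMINAL_VOLTAGE_V):
    """Convert battery charge depletion (mAh) to energy (J) at a fixed voltage."""
    return charge_mah * COULOMBS_PER_MAH * voltage_v


# Battery depletion values taken from the Apriori rows of the table.
for mah in (2.7, 13.5, 23.3):
    print(f"{mah} mAh -> {mah_to_joules(mah):.3f} J")
```

Treating the voltage as constant is a simplification, since the actual battery voltage drops as the battery discharges.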
