CPE 619
Workloads:
Types, Selection, Characterization
Aleksandar Milenković
The LaCASA Laboratory
Electrical and Computer Engineering Department
The University of Alabama in Huntsville
http://www.ece.uah.edu/~milenka
http://www.ece.uah.edu/~lacasa
Part II: Measurement Techniques and Tools
Measurements are not to provide numbers but insight
- Ingrid Bucher

- Measure computer system performance
  - Monitor the system that is being subjected to a particular workload
  - How to select an appropriate workload
- In general, a performance analyst should know:
  1. What are the different types of workloads?
  2. Which workloads are commonly used by other analysts?
  3. How are the appropriate workload types selected?
  4. How is the measured workload data summarized?
  5. How is the system performance monitored?
  6. How can the desired workload be placed on the system in a controlled manner?
  7. How are the results of the evaluation presented?
Types of Workloads

benchmark v. trans. To subject (a system) to a series of tests in order to obtain prearranged results not available on competitive systems. – S. Kelly-Bootle, The Devil's DP Dictionary

- Test workload – denotes any workload used in a performance study
- Real workload – one observed on a system while it is being used
  - Cannot be repeated (easily)
  - May not even exist (proposed system)
- Synthetic workload – has characteristics similar to the real workload
  - Can be applied in a repeated manner
  - Relatively easy to port; relatively easy to modify without affecting operation
  - No large real-world data files; no sensitive data
  - May have built-in measurement capabilities
- Benchmark == Workload
  - Benchmarking is the process of comparing two or more systems using workloads
Test Workloads for Computer Systems

- Addition instructions
- Instruction mixes
- Kernels
- Synthetic programs
- Application benchmarks
Addition Instructions

- Early computers had the CPU as the most expensive component
  - System performance == processor performance
- CPUs supported few operations; the most frequent one was addition
- A computer with a faster addition instruction performed better
  - Run many addition operations as the test workload
- Problem
  - More operations, not only addition
  - Some operations are more complicated than others
Instruction Mixes

- The number and complexity of instructions increased
  - Additions were no longer sufficient
- Could measure instructions individually, but they are used in different amounts
  => Measure relative frequencies of various instructions on real systems
  => Use them as weighting factors to get the average instruction time
- Instruction mix – specification of various instructions coupled with their usage frequency
- Use the average instruction time to compare different processors
- Often use the inverse of the average instruction time
  - MIPS – Million Instructions Per Second
  - MFLOPS – Million Floating-Point Operations Per Second
- Gibson mix: developed by Jack C. Gibson in 1959 for IBM 704 systems
Example: Gibson Instruction Mix (1959; IBM 650 and IBM 704)

 1. Load and Store                31.2
 2. Fixed-Point Add/Sub            6.1
 3. Compares                       3.8
 4. Branches                      16.6
 5. Float Add/Sub                  6.9
 6. Float Multiply                 3.8
 7. Float Divide                   1.5
 8. Fixed-Point Multiply           0.6
 9. Fixed-Point Divide             0.2
10. Shifting                       4.4
11. Logical And/Or                 1.6
12. Instructions not using regs    5.3
13. Indexing                      18.0
    Total                        100
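To show how a mix is used as weighting factors, the sketch below computes the weighted average instruction time and the resulting MIPS rating. The per-class instruction times are invented for illustration, and the mix is collapsed to four classes; only the frequencies echo the Gibson-mix idea.

```c
/* Sketch: using an instruction mix as weighting factors to obtain the
   average instruction time and a MIPS rating.  The per-class times are
   hypothetical; the frequencies loosely follow the Gibson mix. */
#include <stdio.h>

int main(void)
{
    /* usage frequency (%) and assumed execution time (microseconds) per class */
    double freq[]    = { 31.2, 18.0, 16.6, 34.2 }; /* load/store, indexing, branches, rest */
    double time_us[] = { 12.0, 10.0,  8.0, 15.0 }; /* hypothetical instruction times */
    int n = 4;

    double avg = 0.0;
    for (int i = 0; i < n; i++)
        avg += (freq[i] / 100.0) * time_us[i];     /* weighted average instruction time */

    printf("average instruction time = %.2f us\n", avg);
    printf("rating = %.3f MIPS\n", 1.0 / avg);     /* 1 instruction/us == 1 MIPS */
    return 0;
}
```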
Problems with Instruction Mixes

- In modern systems, instruction time is variable depending upon
  - Addressing modes, cache hit rates, pipelining
  - Interference with other devices during processor-memory access
  - Distribution of zeros in the multiplier
  - Times a conditional branch is taken
- Mixes do not reflect special hardware such as page table lookups
- Only represent the speed of the processor
  - The bottleneck may be in other parts of the system
Kernels

- Pipelining, caching, address translation, … made computer instruction times highly variable
  - Cannot use individual instructions in isolation
  - Instead, use higher-level functions
- Kernel = the most frequent function (kernel = nucleus)
- Commonly used kernels: Sieve, Puzzle, Tree Searching, Ackermann's Function, Matrix Inversion, and Sorting
- Disadvantages
  - Do not make use of I/O devices
  - Ad-hoc selection of kernels (not based on real measurements)
Synthetic Programs

- Proliferation of computer systems, OS emerged, changes in applications
  - No longer processing-only apps; I/O became important too
- Use simple exerciser loops
  - Make a number of service calls or I/O requests
  - Compute the average CPU time and elapsed time for each service call
  - Easy to port and distribute (Fortran, Pascal)
- First exerciser loop by Buchholz (1969)
  - Called it a synthetic program
- May have built-in measurement capabilities
Example of Synthetic Workload Generation Program (Buchholz, 1969)
Synthetic Programs

- Advantages
  - Quickly developed and given to different vendors
  - No real data files
  - Easily modified and ported to different systems
  - Have built-in measurement capabilities
  - Measurement process is automated
  - Repeated easily on successive versions of the operating systems
- Disadvantages
  - Too small
  - Do not make representative memory or disk references
  - Mechanisms for page faults and disk cache may not be adequately exercised
  - CPU-I/O overlap may not be representative
  - Not suitable for multi-user environments because loops may create synchronizations, which may result in better or worse performance
Application Workloads

- For special-purpose systems, may be able to run representative applications as a measure of performance
  - E.g.: airline reservation
  - E.g.: banking
- Make use of the entire system (I/O, etc.)
- Issues may be
  - Input parameters
  - Multiuser behavior
- Only applicable when specific applications are targeted
  - For a particular industry: Debit-Credit for banks
Benchmarks

- Benchmark = workload
  - Kernels, synthetic programs, and application-level workloads are all called benchmarks
  - Instruction mixes are not called benchmarks
  - Some authors try to restrict the term benchmark only to a set of programs taken from real workloads
- Benchmarking is the process of performance comparison of two or more systems by measurements
  - Workloads used in the measurements are called benchmarks
Popular Benchmarks

- Sieve
- Ackermann's Function
- Whetstone
- Linpack
- Dhrystone
- Lawrence Livermore Loops
- SPEC
- Debit-Credit Benchmark
- TPC
- EEMBC
Sieve (1 of 2)

- Sieve of Eratosthenes (finds primes)
- Write down all numbers from 1 to n
- Strike out multiples of k for k = 2, 3, 5, … up to sqrt(n)
  - In steps of the remaining numbers

Sieve (2 of 2)
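The second Sieve slide is a program listing. A minimal C sketch of the kernel follows, assuming the commonly used array size of 8190; the actual benchmark repeats the loop many times so that the run time becomes measurable.

```c
/* Minimal sketch of the Sieve of Eratosthenes kernel in C
   (the original slide shows a similar listing; details may differ). */
#include <stdio.h>
#include <string.h>

#define N 8190                 /* classic benchmark size (assumption) */

int main(void)
{
    static char flags[N + 1];
    int count = 0;

    memset(flags, 1, sizeof(flags));          /* assume every number is prime */
    for (int i = 2; i * i <= N; i++) {
        if (flags[i]) {
            for (int k = i * i; k <= N; k += i)
                flags[k] = 0;                 /* strike out multiples of i */
        }
    }
    for (int i = 2; i <= N; i++)
        if (flags[i])
            count++;
    printf("%d primes found up to %d\n", count, N);
    return 0;
}
```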
Ackermann's Function (1 of 2)

- Assesses the efficiency of procedure-calling mechanisms
- Ackermann's Function has two parameters and is defined recursively
- Average execution time per call, the number of instructions executed, and the amount of stack space required for each call are used to compare various systems
- Benchmark is to call Ackermann(3, n) for values of n = 1 to 6
  - Return value is 2^(n+3) - 3, which can be used to verify the implementation
  - Number of calls: (512 x 4^(n-1) - 15 x 2^(n+3) + 9n + 37) / 3
    - Can be used to compute the time per call
  - Depth of recursion is 2^(n+3) - 4; stack space doubles when n increases by 1

Ackermann's Function (2 of 2) (Simula)
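The second slide shows a Simula listing. A minimal C sketch of the same kernel is given below, assuming the usual two-argument definition; it also counts calls so the result can be checked against the formulas above.

```c
/* Minimal C sketch of the Ackermann kernel (the original slide uses Simula).
   Computes Ackermann(3, n); the result should equal 2^(n+3) - 3. */
#include <stdio.h>

static long calls = 0;            /* number of calls, to compare against the formula */

long ackermann(long m, long n)
{
    calls++;
    if (m == 0)
        return n + 1;
    if (n == 0)
        return ackermann(m - 1, 1);
    return ackermann(m - 1, ackermann(m, n - 1));
}

int main(void)
{
    for (long n = 1; n <= 6; n++) {
        calls = 0;
        long r = ackermann(3, n);
        printf("Ackermann(3,%ld) = %ld (expected %ld), calls = %ld\n",
               n, r, (1L << (n + 3)) - 3, calls);
    }
    return 0;
}
```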
Whetstone

- Set of 11 modules designed to match observed instruction frequencies in ALGOL programs
  - Array addressing, arithmetic, subroutine calls, parameter passing
- Ported to Fortran; most popular in C, …
- Many variations of Whetstone exist, so take care when comparing results
- Problems – specific kernel
  - Only valid for small, scientific (floating-point) apps that fit in the cache
  - Does not exercise I/O
LINPACK

- Developed by Jack Dongarra (1983) at ANL
- Programs that solve dense systems of linear equations
  - Many floating-point adds and multiplies
  - Core is the Basic Linear Algebra Subprograms (BLAS), called repeatedly
  - Usually solves a 100x100 system of equations
- Represents mechanical engineering applications on workstations
  - Drafting to finite element analysis
  - High computation speed and good graphics processing
Dhrystone

- Pun on Whetstone
- Intent to represent systems programming environments
- Most common version was in C, but many versions exist
- Low nesting depth and few instructions in each call
- Large amount of time spent copying strings
- Mostly integer performance, with no floating-point operations
Lawrence Livermore Loops

- 24 vectorizable, scientific tests
- Floating-point operations
  - Physics and chemistry apps spend about 40-60% of execution time performing floating-point operations
- Relevant for: fluid dynamics, airplane design, weather modeling
SPEC

- Systems Performance Evaluation Cooperative (SPEC) (http://www.spec.org)
  - Non-profit, founded in 1988 by leading HW and SW vendors
  - Aim: ensure that the marketplace has a fair and useful set of metrics to differentiate candidate systems
  - Product: "fair, impartial and meaningful benchmarks for computers"
- Initially focused on CPUs: SPEC89, SPEC92, SPEC95, SPEC CPU 2000, SPEC CPU 2006
- Now, many suites are available
- Results are published on the SPEC web site
SPEC (cont’d)

- Benchmarks aim to test "real-life" situations
  - E.g., SPECweb2005 tests web server performance by performing various types of parallel HTTP requests
  - E.g., SPEC CPU tests CPU performance by measuring the run time of several programs such as the compiler gcc and the chess program crafty
- SPEC benchmarks are written in a platform-neutral programming language (usually C or Fortran); interested parties may compile the code using whatever compiler they prefer for their platform, but may not change the code
  - Manufacturers have been known to optimize their compilers to improve performance of the various SPEC benchmarks
SPEC Benchmark Suites (Current)

- SPEC CPU2006: combined performance of CPU, memory, and compiler
  - CINT2006 ("SPECint"): tests integer arithmetic, with programs such as compilers, interpreters, word processors, chess programs, etc.
  - CFP2006 ("SPECfp"): tests floating-point performance, with physical simulations, 3D graphics, image processing, computational chemistry, etc.
- SPECjms2007: Java Message Service performance
- SPECweb2005: PHP and/or JSP performance
- SPECviewperf: performance of an OpenGL 3D graphics system, tested with various rendering tasks from real applications
- SPECapc: performance of several 3D-intensive popular applications on a given system
- SPEC OMP V3.1: for evaluating performance of parallel systems using OpenMP (http://www.openmp.org) applications
- SPEC MPI2007: for evaluating performance of parallel systems using MPI (Message Passing Interface) applications
- SPECjvm98: performance of a Java client system running a Java virtual machine
- SPECjAppServer2004: a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers
- SPECjbb2005: evaluates the performance of server-side Java by emulating a three-tier client/server system (with emphasis on the middle tier)
- SPEC MAIL2001: performance of a mail server, testing SMTP and POP protocols
- SPECpower_2008: evaluates the energy efficiency of server systems
- SPEC SFS97_R1: NFS file server throughput and response time
SPEC CPU Benchmarks
SPEC CPU2006 Speed Metrics

- Run and reporting rules – guidelines required to build, run, and report on the SPEC CPU2006 benchmarks
  - http://www.spec.org/cpu2006/Docs/runrules.html
- Speed metrics
  - SPECint_base2006 (required Base result); SPECint2006 (optional Peak result)
  - SPECfp_base2006 (required Base result); SPECfp2006 (optional Peak result)
- The elapsed time in seconds for each of the benchmarks is measured, and the ratio to the reference machine (a Sun UltraSparc II system at 296 MHz) is calculated
- The SPECint_base2006 and SPECfp_base2006 metrics are calculated as the geometric mean of the individual ratios
  - Each ratio is based on the median execution time from three validated runs
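As a sketch of how the speed metric is formed from the per-benchmark ratios, the following C fragment takes a few reference times and median run times and computes their geometric mean. The numbers are invented for illustration only, not actual SPEC data.

```c
/* Sketch of the SPEC speed-metric calculation: geometric mean of
   reference-time / measured-time ratios.  Times below are hypothetical. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double ref[]  = { 9770.0, 8050.0, 10490.0 };  /* hypothetical reference times (s) */
    double meas[] = { 1050.0,  940.0,  1270.0 };  /* hypothetical median run times (s) */
    int n = 3;

    double log_sum = 0.0;
    for (int i = 0; i < n; i++) {
        double ratio = ref[i] / meas[i];          /* individual SPECratio */
        log_sum += log(ratio);
    }
    printf("speed metric (geometric mean) = %.2f\n", exp(log_sum / n));
    return 0;
}
```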
SPEC CPU2006 Throughput Metrics

- Rate metrics
  - SPECint_rate_base2006 (required Base result); SPECint_rate2006 (optional Peak result)
  - SPECfp_rate_base2006 (required Base result); SPECfp_rate2006 (optional Peak result)
- Select the number of concurrent copies of each benchmark to be run (e.g., = #CPUs)
  - The same number of copies must be used for all benchmarks in a base test
  - This is not true for the peak results, where the tester is free to select any combination of copies
- The "rate" calculated for each benchmark is a function of:
  (number of copies run * reference factor for the benchmark) / elapsed time in seconds
  which yields a rate in jobs/time
- The rate metrics are calculated as a geometric mean of the individual SPECrates, using the median result from three runs
Debit-Credit (1/3)

- Application-level benchmark
- Was the de-facto standard for transaction processing systems
- Retail bank wanted 1,000 branches, 10,000 tellers, and 10,000,000 accounts online with a peak load of 100 TPS
- Performance in TPS such that 95% of all transactions complete with 1 second or less of response time (from arrival of the last bit to sending of the first bit)
- Each TPS of claimed performance requires 10 branches, 100 tellers, and 100,000 accounts
  - A system claiming 50 TPS performance should run: 500 branches; 5,000 tellers; 5,000,000 accounts
Debit-Credit (2/3)
Debit-Credit (3/3)

- Metric: price/performance ratio
- Performance: throughput in terms of TPS such that 95% of all transactions provide one second or less response time
- Response time: measured as the time interval between the arrival of the last bit from the communications line and the sending of the first bit to the communications line
- Cost: total expenses for a five-year period on purchase, installation, and maintenance of the hardware and software in the machine room
  - Cost does not include expenditures for terminals, communications, application development, or operations
- Pseudo-code definition of Debit-Credit
  - See Figure 4.5 in the book
TPC

- Transaction Processing Performance Council (TPC)
  - Mission: create realistic and fair benchmarks for transaction processing
  - For more info: http://www.tpc.org
- Benchmark types
  - TPC-A (1985)
  - TPC-C (1992) – complex query environment
  - TPC-H – models ad-hoc decision support (unrelated queries, no local history to optimize future queries)
  - TPC-W – transactional Web benchmark (simulates the activities of a business-oriented transactional Web server)
  - TPC-App – application server and Web services benchmark (simulates activities of a B2B transactional application server operating 24/7)
- Metric: transactions per second; also includes response time (throughput performance is measured only when response time requirements are met)
EEMBC

- Embedded Microprocessor Benchmark Consortium (EEMBC, pronounced "embassy")
  - Non-profit consortium supported by member dues and license fees
  - Real-world benchmark software helps designers select the right embedded processors for their systems
  - Standard benchmarks and methodology ensure fair and reasonable comparisons
  - EEMBC Technology Center manages development of new benchmark software and certifies benchmark test results
  - For more info: http://www.eembc.com/
- 41 kernels used in different embedded applications
  - Automotive/Industrial
  - Consumer
  - Digital Entertainment
  - Java
  - Networking
  - Office Automation
  - Telecommunications
The Art of Workload Selection

- Workload is the most crucial part of any performance evaluation
  - An inappropriate workload will result in misleading conclusions
- Major considerations in workload selection
  - Services exercised by the workload
  - Level of detail
  - Representativeness
  - Timeliness
Services Exercised

- SUT = System Under Test
- CUS = Component Under Study
Services Exercised (cont’d)

- Do not confuse the SUT with the CUS
- Metrics depend upon the SUT: MIPS is OK for comparing two CPUs but not for two timesharing systems
- Workload depends upon the system
- Examples:
  - CPU: instructions
  - System: transactions
  - Transactions are not good for comparing CPUs, and vice versa
  - Two systems identical except for the CPU:
    - Comparing systems: use transactions
    - Comparing CPUs: use instructions
- Multiple services: exercise as complete a set of services as possible
Example: Timesharing Systems

- Hierarchy of interfaces and corresponding workloads:
  - Applications -> application benchmark
  - Operating system -> synthetic program
  - Central processing unit -> instruction mixes
  - Arithmetic logical unit -> addition instruction
Example: Networks

- Application: user applications, such as mail, file transfer, http, …
  - Workload: frequency of various types of applications
- Presentation: data compression, security, …
  - Workload: frequency of various types of security and (de)compression requests
- Session: dialog between the user processes on the two end systems (initiate, maintain, disconnect)
  - Workload: frequency and duration of various types of sessions
- Transport: end-to-end aspects of communication between the source and the destination nodes (segmentation and reassembly of messages)
  - Workload: frequency, sizes, and other characteristics of various messages
- Network: routes packets over a number of links
  - Workload: the source-destination matrix, the distance, and characteristics of packets
- Datalink: transmission of frames over a single link
  - Workload: characteristics of frames, length, arrival rates, …
- Physical: transmission of individual bits (or symbols) over the physical medium
  - Workload: frequency of various symbols and bit patterns
Example: Magnetic Tape Backup System

- Backup System
  - Services: back up files, back up changed files, restore files, list backed-up files
  - Factors: file-system size, batch or background process, incremental or full backups
  - Metrics: backup time, restore time
  - Workload: a computer system with files to be backed up; vary frequency of backups
- Tape Data System
  - Services: read/write to the tape, read tape label, auto-load tapes
  - Factors: type of tape drive
  - Metrics: speed, reliability, time between failures
  - Workload: a synthetic program generating representative tape I/O requests
Magnetic Tape System (cont’d)

- Tape Drives
  - Services: read record, write record, rewind, find record, move to end of tape, move to beginning of tape
  - Factors: cartridge or reel tapes, drive size
  - Metrics: time for each type of service (for example, time to read a record and to write a record), speed (requests/time), noise, power dissipation
  - Workload: a synthetic program exerciser generating various types of requests in a representative manner
- Read/Write Subsystem
  - Services: read data, write data (as digital signals)
  - Factors: data-encoding technique, implementation technology (CMOS, TTL, and so forth)
  - Metrics: coding density, I/O bandwidth (bits per second)
  - Workload: read/write data streams with varying patterns of bits
Magnetic Tape System (cont’d)

- Read/Write Heads
  - Services: read signal, write signal (electrical signals)
  - Factors: composition, inter-head spacing, gap sizing, number of heads in parallel
  - Metrics: magnetic field strength, hysteresis
  - Workload: read/write currents of various amplitudes, tapes moving at various speeds
Level of Detail

- Workload description varies from least detailed to a time-stamped list of all requests
- 1) Most frequent request
  - Examples: addition instruction, Debit-Credit, kernels
  - Valid if one service is much more frequent than the others
- 2) Frequency of request types
  - List various services, their characteristics, and frequency
  - Examples: instruction mixes
  - Context sensitivity
    - A service may depend on the services required in the past
    - => Use sets of services (group individual service requests)
    - E.g., caching is a history-sensitive mechanism
Level of Detail (Cont)

- 3) Time-stamped sequence of requests (trace)
  - May be too detailed
  - Not convenient for analytical modeling
  - May require exact reproduction of component behavior
- 4) Average resource demand
  - Used for analytical modeling
  - Similar services grouped in classes
- 5) Distribution of resource demands
  - Used if the variance is large
  - Used if the distribution impacts the performance
- Workloads used in simulation and analytical modeling
  - Non-executable: used in analytical/simulation modeling
  - Executable: can be executed directly on a system
Representativeness

- Workload should be representative of the real application
- How do we define representativeness? The test workload and the real workload should have the same:
  - Arrival rate: the arrival rate of requests should be the same as, or proportional to, that of the real application
  - Resource demands: the total demands on each of the key resources should be the same as, or proportional to, those of the application
  - Resource usage profile: relates to the sequence and the amounts in which different resources are used
Timeliness

- Workloads should follow the changes in usage patterns in a timely fashion
- Difficult to achieve: users are a moving target
  - New systems => new workloads
- Users tend to optimize their demand
  - Use those features that the system performs efficiently
  - E.g., fast multiplication => higher frequency of multiplication instructions
- Important to monitor user behavior on an ongoing basis
Other Considerations in Workload Selection

- Loading Level: a workload may exercise a system to its
  - Full capacity (best case)
  - Beyond its capacity (worst case)
  - At the load level observed in the real workload (typical case)
  - For procurement purposes => typical
  - For design => best to worst, all cases
- Impact of External Components
  - Do not use a workload that makes an external component the bottleneck => all alternatives in the system would give equally good performance
- Repeatability
  - Workload should be such that the results can be easily reproduced without too much variance
Summary

- Services exercised determine the workload
- Level of detail of the workload should match that of the model being used
- Workload should be representative of the real system's usage in the recent past
- Loading level, impact of external components, and repeatability are other criteria in workload selection
Workload Characterization

Workload Characterization Techniques

Speed, quality, price. Pick any two. – James M. Wallace

- Want a repeatable workload so that systems can be compared under identical conditions
  - Hard to do in a real-user environment
- Instead
  - Study the real-user environment
  - Observe key characteristics
  - Develop a workload model
  => Workload Characterization
Terminology

- Assume the system provides services
- User (workload component, workload unit) – entity that makes service requests at the SUT interface
  - Applications: mail, editing, programming, …
  - Sites: workload at different organizations
  - User sessions: complete user sessions from login to logout
- Workload parameters – the measured quantities, service requests, or resource demands used to model or characterize the workload
  - Ex: instructions, packet sizes, source or destination of packets, page reference pattern, …
Choosing Parameters

- The workload component should be at the SUT interface
- Each component should represent as homogeneous a group as possible; combining very different users into a site workload may not be meaningful
- Better to pick parameters that depend upon the workload and not upon the system
  - Ex: response time of email is not good (depends upon the system)
  - Ex: email size is good (depends upon the workload)
- Several characteristics are of interest
  - Arrival time, duration, quantity of resources demanded
    - Ex: network packet size
  - Have significant impact (exclude if little impact)
    - Ex: type of Ethernet card
Techniques for Workload Characterization

- Averaging
- Specifying dispersion
- Single-parameter histograms
- Multi-parameter histograms
- Principal-component analysis
- Markov models
- Clustering
Averaging

- Mean: x̄ = (1/n) Σ x_i
- Standard deviation: s = sqrt( (1/(n-1)) Σ (x_i - x̄)² )
- Coefficient of Variation: C.O.V. = s / x̄
- Mode (for categorical variables): most frequent value
- Median: 50-percentile
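A small C sketch of these statistics, using invented sample values, is shown below.

```c
/* Sketch: computing the averaging statistics for a set of observations.
   The sample data is invented for illustration. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void)
{
    double x[] = { 2.0, 4.0, 4.0, 5.0, 7.0, 9.0 };   /* hypothetical CPU times */
    int n = sizeof(x) / sizeof(x[0]);

    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += x[i];
    double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++) ss += (x[i] - mean) * (x[i] - mean);
    double sdev = sqrt(ss / (n - 1));      /* sample standard deviation */
    double cov  = sdev / mean;             /* coefficient of variation */

    qsort(x, n, sizeof(x[0]), cmp_double); /* sort to find the median */
    double median = (n % 2) ? x[n / 2] : 0.5 * (x[n / 2 - 1] + x[n / 2]);

    printf("mean=%.3f sdev=%.3f C.O.V.=%.3f median=%.3f\n", mean, sdev, cov, median);
    return 0;
}
```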
Case Study: Program Usage in Educational Environments

- High Coefficient of Variation
Characteristics of an Average Editing Session

- Reasonable variation
Techniques for Workload Characterization

- Averaging
- Specifying dispersion
- Single-parameter histograms
- Multi-parameter histograms
- Principal-component analysis
- Markov models
- Clustering
Single-Parameter Histograms

- n buckets x m parameters x k components values
- Use only if the variance is high
- Ignores correlation among parameters
  - E.g., short jobs have low CPU time and a small number of disk I/O requests; with single-parameter histograms, we may generate a workload with low CPU time and a large number of I/O requests – something that is not possible in real systems
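The lost-correlation problem can be seen in a small sketch: when CPU time and I/O count are sampled from two independent histograms, the generator can emit combinations that never occur in the measured data. The bucket probabilities below are invented for illustration.

```c
/* Sketch: sampling a workload from two independent single-parameter
   histograms.  Correlation between the parameters is lost. */
#include <stdio.h>
#include <stdlib.h>

/* pick a bucket index according to a discrete probability distribution */
static int pick(const double *p, int n)
{
    double r = (double)rand() / RAND_MAX, acc = 0.0;
    for (int i = 0; i < n; i++) {
        acc += p[i];
        if (r <= acc) return i;
    }
    return n - 1;
}

int main(void)
{
    const double p_cpu[3] = { 0.6, 0.3, 0.1 };  /* buckets: low, medium, high CPU time */
    const double p_io[3]  = { 0.6, 0.3, 0.1 };  /* buckets: few, some, many disk I/Os */

    for (int j = 0; j < 10; j++) {
        int cpu = pick(p_cpu, 3);
        int io  = pick(p_io, 3);                /* independent draw: correlation lost */
        printf("job %d: cpu bucket %d, io bucket %d\n", j, cpu, io);
    }
    return 0;
}
```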
Multi-parameter Histograms

- A joint histogram over several parameters preserves the correlations among them
- Difficult to plot joint histograms for more than two parameters
Techniques for Workload Characterization

- Averaging
- Specifying dispersion
- Single-parameter histograms
- Multi-parameter histograms
- Principal-component analysis
- Markov models
- Clustering
Principal-Component Analysis

- Goal is to reduce the number of factors
- PCA transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components
Principal Component Analysis (cont’d)

- Key idea: use a weighted sum of parameters to classify the components
- Let x_ij denote the ith parameter for the jth component:
  y_j = Σ_{i=1..n} w_i x_ij
- Principal component analysis assigns the weights w_i such that the y_j provide the maximum discrimination among the components
- The quantity y_j is called the principal factor
- The factors are ordered: the first factor explains the highest percentage of the variance
Principal Component Analysis (cont’d)

- Given a set of n parameters {x1, x2, …, xn}, the PCA produces a set of factors {y1, y2, …, yn} such that:
- 1) The y's are linear combinations of the x's:
  y_i = Σ_{j=1..n} a_ij x_j
  Here, a_ij is called the loading of variable x_j on factor y_i
- 2) The y's form an orthogonal set, that is, their inner product is zero:
  <y_i, y_j> = Σ_k a_ik a_kj = 0
  This is equivalent to stating that the y_i's are uncorrelated with each other
- 3) The y's form an ordered set such that y1 explains the highest percentage of the variance in resource demands
Finding Principal Factors

- Find the correlation matrix
- Find the eigenvalues of the matrix and sort them in the order of decreasing magnitude
- Find the corresponding eigenvectors
  - These give the required loadings
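For the two-parameter case worked out on the following slides, the whole procedure fits in a few lines. The sketch below uses invented packet counts; for a 2x2 correlation matrix [[1, r], [r, 1]] the eigenvalues are simply 1 + r and 1 - r.

```c
/* Sketch of principal-component analysis for two parameters, following the
   steps on this slide: compute the correlation, then take the eigenvalues
   of the 2x2 correlation matrix.  The sample data is invented. */
#include <math.h>
#include <stdio.h>

#define N 6

int main(void)
{
    double xs[N] = { 80, 60, 90, 70, 65, 85 };   /* hypothetical packets sent */
    double xr[N] = { 75, 55, 95, 65, 60, 80 };   /* hypothetical packets received */

    /* means and standard deviations */
    double ms = 0, mr = 0;
    for (int i = 0; i < N; i++) { ms += xs[i]; mr += xr[i]; }
    ms /= N; mr /= N;
    double ss = 0, sr = 0, sxy = 0;
    for (int i = 0; i < N; i++) {
        ss  += (xs[i] - ms) * (xs[i] - ms);
        sr  += (xr[i] - mr) * (xr[i] - mr);
        sxy += (xs[i] - ms) * (xr[i] - mr);
    }
    ss = sqrt(ss / (N - 1)); sr = sqrt(sr / (N - 1));

    /* correlation coefficient; the correlation matrix is [[1, r], [r, 1]] */
    double r = (sxy / (N - 1)) / (ss * sr);

    /* eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r,
       with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2) */
    printf("correlation r = %.3f\n", r);
    printf("eigenvalues: %.3f and %.3f\n", 1 + r, 1 - r);
    printf("first factor explains %.1f%% of the variance\n", 100.0 * (1 + r) / 2.0);
    return 0;
}
```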
Principal Component Analysis Example

- xs – packets sent, xr – packets received
Principal Component Analysis (cont’d)

- 1) Compute the mean and standard deviation of each variable
Principal Component Analysis (cont’d)

- Similarly, compute the mean and standard deviation of the other variable
Principal Component Analysis (cont’d)

- 2) Normalize the variables to zero mean and unit standard deviation. The normalized values xs' and xr' are given by:
  xs' = (xs - x̄s) / s_s,  xr' = (xr - x̄r) / s_r
Principal Component Analysis (cont’d)

- 3) Compute the correlation among the variables
- 4) Prepare the correlation matrix
Principal Component Analysis (cont’d)

- 5) Compute the eigenvalues of the correlation matrix by solving the characteristic equation
  - The eigenvalues are 1.916 and 0.084
Principal Component Analysis (cont’d)

- 6) Compute the eigenvectors of the correlation matrix. The eigenvector q1 corresponding to λ1 = 1.916 is defined by the relationship:
  C q1 = λ1 q1
  which yields q11 = q21
Principal Component Analysis (cont’d)

- Restricting the length of the eigenvectors to one gives q1 = (1/√2, 1/√2) and q2 = (1/√2, -1/√2)
- 7) Obtain the principal factors by multiplying the eigenvectors by the normalized vectors:
  y1 = (xs' + xr')/√2,  y2 = (xs' - xr')/√2
Principal Component Analysis (cont’d)

- 8) Compute the values of the principal factors (last two columns of the table)
- 9) Compute the sum and the sum of squares of the principal factors
  - The sum must be zero
  - The sums of squares give the percentage of variation explained
Principal Component Analysis (cont’d)

- The first factor explains 32.565/(32.565+1.435) = 95.7% of the variation
- The second factor explains only 4.3% of the variation and can, thus, be ignored
Techniques for Workload Characterization

- Averaging
- Specifying dispersion
- Single-parameter histograms
- Multi-parameter histograms
- Principal-component analysis
- Markov models
- Clustering
Markov Models

- Sometimes it is important to know not just the number of each type of request but also the order of requests
- If the next request depends upon the previous request, then a Markov model can be used
  - Actually, more general: if the next state depends only upon the current state
Markov Models (cont’d)

- Example: a process transitions between CPU, disk, and terminal
- Transition matrices can also be used for application transitions
  - E.g., P(Link | Compile)
- Used to specify page-reference locality
  - P(Reference module i | Referenced module j)
Transition Probability

- Given the same relative frequency of requests of different types, it is possible to realize that frequency with several different transition matrices
  - Each matrix may result in a different performance of the system
  - If order is important, measure the transition probabilities directly on the real system
- Example: two packet sizes: small (80%), large (20%)
Transition Probability (cont’d)

- Option #1: An average of four small packets is followed by an average of one big packet, e.g., ssssbssssbssss
- Option #2: Eight small packets followed by two big packets, e.g., ssssssssbbssssssssbb
- Option #3: Generate a random number x; if x < 0.8, generate a small packet; otherwise generate a large packet
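A small C sketch of these three generators follows. All three produce the same 80%/20% size mix, but the orderings differ, which is exactly why the transition structure matters.

```c
/* Sketch: three request generators with the same 80%/20% packet-size mix
   but different orderings (per the options above). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i;

    /* Option #1: deterministic pattern ssssb repeated */
    for (i = 0; i < 20; i++) putchar((i % 5 == 4) ? 'b' : 's');
    putchar('\n');

    /* Option #2: eight small packets followed by two big packets */
    for (i = 0; i < 20; i++) putchar((i % 10 >= 8) ? 'b' : 's');
    putchar('\n');

    /* Option #3: independent random draws, P(small) = 0.8 */
    for (i = 0; i < 20; i++) {
        double x = (double)rand() / RAND_MAX;
        putchar(x < 0.8 ? 's' : 'b');
    }
    putchar('\n');
    return 0;
}
```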
Techniques for Workload Characterization

- Averaging
- Specifying dispersion
- Single-parameter histograms
- Multi-parameter histograms
- Principal-component analysis
- Markov models
- Clustering
Clustering

- May have a large number of components
  - Cluster such that components within a cluster are similar to each other
  - Then can study one member to represent its component class
- Ex: 30 jobs characterized by CPU time and disk I/O, grouped into five clusters (scatter plot of disk I/O vs. CPU time)
Clustering Steps

1. Take a sample
2. Select parameters
3. Transform, if necessary
4. Remove outliers
5. Scale observations
6. Select a distance metric
7. Perform clustering
8. Interpret results
9. Change parameters and repeat steps 3-7
10. Select representative components
1) Sampling

- Usually there are too many components to do clustering analysis on all of them
  - That's why we are doing clustering in the first place!
- Select a small subset
  - If chosen carefully, it will show behavior similar to the rest
- May choose randomly
  - However, if interested in a specific aspect, may choose to cluster only the "top consumers"
  - E.g., if interested in a disk, only do clustering analysis on components with high I/O
2) Parameter Selection

- Many components have a large number of parameters (resource demands)
  - Some are important, some are not
  - Remove the ones that do not matter
- Two key criteria: impact on performance and variance
  - If a parameter has no impact, omit it
  - If a parameter has little variance, omit it
- Method: redo clustering with one less parameter
  - Count the number of components that change cluster membership; if not many change, remove the parameter
- Principal component analysis: identify the parameters with the highest variance
3) Transformation

- If the distribution is skewed, may want to transform the measure of the parameter
- Ex: one study measured CPU time
  - Two programs taking 1 and 2 seconds are as different as two programs taking 10 and 20 milliseconds
  => Take the ratio of CPU times, not the difference
- (More in Chapter 15)
4) Outliers

- Data points with extreme parameter values
- Can significantly affect the max or min (or mean or variance)
- For normalization (scaling, next step), their inclusion/exclusion may significantly affect the outcome
- Exclude only those that do not consume a significant portion of resources
  - E.g., disk backup may make a large number of disk I/O requests; it should not be excluded if backup is done frequently (e.g., several times a day) but may be excluded if done once a month
5) Data Scaling

- Final results depend upon the relative ranges of the parameters
  - Typically scale so that the relative ranges are equal
- Different ways of doing this (next slides)
5) Data Scaling (cont’d)

- Normalize to zero mean and unit variance:
  x'_ik = (x_ik - x̄_k) / s_k
- Weights:
  x'_ik = w_k x_ik
  where w_k is proportional to the relative importance, or w_k = 1/s_k
- Range normalization – change from [x_min,k, x_max,k] to [0, 1]:
  x'_ik = (x_ik - x_min,k) / (x_max,k - x_min,k)
  - Affected by outliers
5) Data Scaling (cont’d)

- Percentile normalization
  - Scale so that 95% of the values fall between 0 and 1
  - Less sensitive to outliers
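A small C sketch of three of these scalings applied to one parameter is shown below; the sample values are invented.

```c
/* Sketch: zero-mean/unit-variance, weight (1/s_k), and range normalization
   applied to one parameter.  Sample values are invented. */
#include <math.h>
#include <stdio.h>

#define N 5

int main(void)
{
    double x[N] = { 3.0, 7.0, 2.0, 9.0, 4.0 };   /* hypothetical CPU times */

    double mean = 0, s = 0, min = x[0], max = x[0];
    for (int i = 0; i < N; i++) {
        mean += x[i];
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    mean /= N;
    for (int i = 0; i < N; i++) s += (x[i] - mean) * (x[i] - mean);
    s = sqrt(s / (N - 1));

    for (int i = 0; i < N; i++) {
        double z     = (x[i] - mean) / s;          /* zero mean, unit variance */
        double w     = x[i] / s;                   /* weight w_k = 1/s_k */
        double range = (x[i] - min) / (max - min); /* range normalization to [0,1] */
        printf("x=%4.1f z=%6.3f weighted=%6.3f range=%5.3f\n", x[i], z, w, range);
    }
    return 0;
}
```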
6) Distance Metric

- Map each component to an n-dimensional space and see which components are close to each other
- Euclidean distance between two components {x_i1, x_i2, …, x_in} and {x_j1, x_j2, …, x_jn}:
  d(i,j) = sqrt( Σ_k (x_ik - x_jk)² )
- Weighted Euclidean distance
  - Assign weights a_k to the n parameters:
    d(i,j) = sqrt( Σ_k a_k (x_ik - x_jk)² )
  - Used if the values are not scaled or if the parameters differ significantly in importance
6) Distance Metric (cont’d)

- Chi-square distance
  - Used in distribution fitting
  - Need to use normalized values, or the relative sizes influence the chi-square distance measure
- Overall, the Euclidean distance is the most commonly used
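A minimal C sketch of the plain and weighted Euclidean distances between two components follows; the parameter values and weights are invented.

```c
/* Sketch: Euclidean and weighted Euclidean distance between two components. */
#include <math.h>
#include <stdio.h>

#define NPARAM 2

/* plain Euclidean distance */
double euclid(const double *xi, const double *xj)
{
    double d = 0;
    for (int k = 0; k < NPARAM; k++)
        d += (xi[k] - xj[k]) * (xi[k] - xj[k]);
    return sqrt(d);
}

/* weighted Euclidean distance with per-parameter weights a_k */
double weuclid(const double *xi, const double *xj, const double *a)
{
    double d = 0;
    for (int k = 0; k < NPARAM; k++)
        d += a[k] * (xi[k] - xj[k]) * (xi[k] - xj[k]);
    return sqrt(d);
}

int main(void)
{
    double p1[NPARAM] = { 2.0, 4.0 };   /* {CPU time, disk I/O} of component 1 */
    double p2[NPARAM] = { 5.0, 2.0 };   /* {CPU time, disk I/O} of component 2 */
    double a[NPARAM]  = { 1.0, 0.5 };   /* hypothetical importance weights */

    printf("euclidean = %.3f, weighted = %.3f\n", euclid(p1, p2), weuclid(p1, p2, a));
    return 0;
}
```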
7) Clustering Techniques

- Goal: partition components into groups so that the members of a group are as similar as possible and different groups are as dissimilar as possible
- Statistically, the intra-group variance should be as small as possible, and the inter-group variance should be as large as possible
  - Total variance = intra-group variance + inter-group variance
7) Clustering Techniques (cont’d)

- Nonhierarchical techniques: start with an arbitrary set of k clusters and move members until the intra-group variance is minimum
- Hierarchical techniques:
  - Agglomerative: start with n clusters and merge
  - Divisive: start with one cluster and divide
- Two popular techniques:
  - Minimum spanning tree method (agglomerative)
  - Centroid method (divisive)
Clustering Techniques: Minimum Spanning Tree Method

1. Start with k = n clusters.
2. Find the centroid of the ith cluster, i = 1, 2, …, k.
3. Compute the inter-cluster distance matrix.
4. Merge the nearest clusters.
5. Repeat steps 2 through 4 until all components are part of one cluster.
Minimum Spanning Tree Example (1/5)

- Workload with 5 components (programs) and 2 parameters (CPU time, disk I/O)
- Measure CPU and I/O for each of the 5 programs
Minimum Spanning Tree Example (2/5)

- Step 1): Consider 5 clusters, with the ith cluster containing only the ith program
- Step 2): The centroids are {2,4}, {3,5}, {1,6}, {4,3}, and {5,2}
  (scatter plot of programs a-e: disk I/O vs. CPU time)
Minimum Spanning Tree Example (3/5)

- Step 3): Compute the Euclidean distances between the clusters
- Step 4): Find the minimum distance and merge the nearest clusters
  (scatter plot: the nearest pairs, a-b and d-e, are merged)
Minimum Spanning Tree Example (4/5)

- The centroid of AB is {(2+3)/2, (4+5)/2} = {2.5, 4.5}; the centroid of DE is {4.5, 2.5}
- Again find the minimum inter-cluster distance and merge
  (scatter plot with the new centroids marked by x)
Minimum Spanning Tree Example (5/5)

- The centroid of ABC is {(2+3+1)/3, (4+5+6)/3} = {2, 5}
- Find the minimum distance, merge, and stop once all components are in one cluster
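A compact C sketch of these merge steps, applied to the five programs of this example (a{2,4}, b{3,5}, c{1,6}, d{4,3}, e{5,2}), is shown below; the printed merge distances correspond to the heights at which clusters join in the dendrogram on the next slide.

```c
/* Sketch of the minimum-spanning-tree (agglomerative) clustering steps:
   repeatedly merge the two clusters whose centroids are nearest. */
#include <math.h>
#include <stdio.h>

#define N 5

int main(void)
{
    double cx[N] = { 2, 3, 1, 4, 5 };    /* CPU time of each cluster centroid */
    double cy[N] = { 4, 5, 6, 3, 2 };    /* disk I/O of each cluster centroid */
    int    sz[N] = { 1, 1, 1, 1, 1 };    /* number of programs in each cluster */
    int    alive = N;

    while (alive > 1) {
        /* find the pair of live clusters with the smallest centroid distance */
        int bi = -1, bj = -1;
        double best = 1e30;
        for (int i = 0; i < N; i++) {
            if (!sz[i]) continue;
            for (int j = i + 1; j < N; j++) {
                if (!sz[j]) continue;
                double d = hypot(cx[i] - cx[j], cy[i] - cy[j]);
                if (d < best) { best = d; bi = i; bj = j; }
            }
        }
        printf("merge clusters %d and %d at distance %.3f\n", bi, bj, best);
        /* merge bj into bi: the new centroid is the size-weighted mean */
        cx[bi] = (cx[bi] * sz[bi] + cx[bj] * sz[bj]) / (sz[bi] + sz[bj]);
        cy[bi] = (cy[bi] * sz[bi] + cy[bj] * sz[bj]) / (sz[bi] + sz[bj]);
        sz[bi] += sz[bj];
        sz[bj] = 0;
        alive--;
    }
    return 0;
}
```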
Representing Clustering

- The spanning tree is drawn as a dendrogram
  - Each branch is a cluster; the height shows where clusters merge
- Can obtain clusters for any allowable distance
  - Ex: at a distance of 3, the clusters are {a, b, c} and {d, e}
  (dendrogram over the components a, b, c, d, e)
Nearest Centroid Method

1. Start with k = 1.
2. Find the centroid and intra-cluster variance for the ith cluster, i = 1, 2, …, k.
3. Find the cluster with the highest variance and arbitrarily divide it into two clusters
   - Find the two components that are farthest apart; assign the other components according to their distance from these points
   - Place all components below the centroid in one cluster and all components above this hyperplane in the other
   - Adjust the points in the two new clusters until the inter-cluster distance between the two clusters is maximum
4. Set k = k + 1. Repeat steps 2 through 4 until k = n.
Interpreting Clusters

- Clusters with small populations may be discarded
  - If they use few resources
  - If a cluster with 1 component uses 50% of the resources, it cannot be discarded!
- Name clusters, often by resource demands
  - Ex: "CPU bound" or "I/O bound"
- Select one or more components from each cluster as a test workload
  - Can make the number selected proportional to cluster size, total resource demands, or other criteria
Problems with Clustering
Problems with Clustering (Cont)

- Goal: minimize variance
- The results of clustering are highly variable; there are no rules for:
  - Selection of parameters
  - Distance measure
  - Scaling
- Labeling each cluster by functionality is difficult
  - In one study, editing programs appeared in 23 different clusters
- Requires many repetitions of the analysis
Homework #2

- Read chapters 4, 5, 6
- Read the documents in the /doc directory
  - performance.measurements.txt
  - papi.README.ver2.s07.txt
- Submit answers to exercises 6.1 and 6.2
- Due: Wednesday, January 23, 2008, 12:45 PM
- Submit by email to the instructor with subject "CPE619-HW2"
- Name the file as: FirstName.SecondName.CPE619.HW2.doc