Transcript Slide 1

Causing Incoherencies:
Parallel Sorting Algorithms and a Study of
L1 and L2 Cache Behavior on a Multicore Architecture
PRESENTED BY
MOHAMMED ALHARBI
MOHAMMED ALEISA
BAKRI AWAJI
Instructor: Prof. Gita Alaghband
Outline
 Motivation
 Our implementation in details
 Contributions
 Experiments
 Evaluation
 Related work
 Conclusion
 Challenges
 What did we learn?
Motivation
 Our motivation in this project is to cause incoherence by
simulating three sorting algorithms (bubble sort, quick
sort, insertion sort).
 But the big question is:
why would we want to cause incoherencies?
 Coherence is needed to meet an architectural assumption
held by the software designer. The bad program design
identified by this project demonstrates what happens
when the coherence assumption is ignored.
 As we know, using the cores of a multicore processor
on shared data can cause coherence problems.
 We need to learn how the architecture reacts with
shared data when using a multicore processor.
Motivation (Important Questions)
How do we use the sorting algorithms in such detail to
demonstrate our project?
The sorting algorithms simply provide the necessary traces and
overlapping reads and writes to cause coherence issues.
Why are we choosing sorting algorithms instead of
applications?
Many applications use sorting algorithms of various kinds. A
sorting algorithm can be an entire application (utility). Sorting
algorithms provide a clear source of coherence issues when
executed in parallel on the same data.
Motivation (Important Questions)
The sorting algorithms are usually presented as sequential
algorithms; why are they related to cache and multicore
architecture?
Sorting algorithms are frequently executed on multicore architectures
and make heavy use of caches. The algorithms, again, simply
provide functional traces that result in coherence issues.
Why are we choosing these sorting algorithms?
They are commonly known and applied, and they provide opportunities to
examine incoherence.
How do we count the misses and hits?
Misses are counted as compulsory (never had the data to begin with),
conflict (fighting over the same slot in a direct-mapped architecture), and, in a
way, coherence (in the form of updates that invalidate blocks). A hit can
be either a read hit or a write hit.
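This miss classification can be sketched in code. The following is an illustrative Python model we add here (the class and names are our own assumptions, not the project's actual simulator): a direct-mapped cache where a miss on a never-before-seen block is compulsory and a miss on a previously seen block is a conflict.

```python
class DirectMappedCache:
    def __init__(self, cache_size, block_size):
        self.block_size = block_size
        self.num_slots = cache_size // block_size
        self.slots = [None] * self.num_slots  # block number held in each slot
        self.seen_blocks = set()              # for compulsory-miss detection

    def access(self, address):
        block = address // self.block_size
        index = block % self.num_slots        # direct-mapped placement
        if self.slots[index] == block:
            return "hit"
        # Miss: compulsory if the block was never loaded, conflict otherwise
        # (this toy model folds capacity misses into "conflict").
        kind = "compulsory" if block not in self.seen_blocks else "conflict"
        self.seen_blocks.add(block)
        self.slots[index] = block
        return kind
```

With the deck's parameters (64-byte cache, 32-byte blocks) there are only two slots, so two blocks that share a slot evict each other on alternating accesses.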
Our Implementation in details
 Simulating three sorting algorithms to study:
 Causing incoherence in the L1 cache by applying
coherence (invalidate policy) with the write-through
policy or the write-back policy
 Measuring: read hits, write hits, coherence
misses, conflict misses, compulsory misses
 Sorting algorithms (bubble sort, quick sort, insertion
sort)
 Input: a long data array, for example:
7 3 2 1 5 4 6
Our Implementation in details
 Simulating two different sorting algorithms
 For example (bubble sort vs. quick sort) in parallel
on two cores with the same array, using the write-through
policy with the invalidation policy.
 L2 is shared.
 Showing in figures the fighting over the same data
between the two.
Our Implementation in details
For example: causing incoherence (write-through policy) in case of invalidation,
running two different algorithms at the same time on the same array.
[Figure: quick sort on core 2 (searching) and bubble sort on core 1 (swapping
the values 7, 3); each core has a private L1, both share L2, with a bus snooper
and RAM; updates propagate with the write-through policy.]
Our Implementation in details
For example: causing incoherence (write-through policy) in case of invalidation.
[Figure: core 1 (bubble sort) has swapped the values to 3, 7 in its L1 while
RAM still holds 7, 3; the update is propagating through the bus snooper with
the write-through policy.]
Our Implementation in details
For example: causing incoherence (write-through policy) in case of invalidation.
[Figure: the write-through update has reached L2 and RAM, which now hold 3, 7;
the bus snooper broadcasts the invalidation toward the other core's L1.]
Our Implementation in details
For example: causing incoherence (write-through policy) in case of invalidation,
running two different algorithms at the same time with the same data.
[Figure: final state of the two cores, their private L1 caches, the shared L2,
the bus snooper, and RAM after the write-through update.]
Analysis (Scenario of Invalidation with Write-Through)
 For example: we apply the bubble sort algorithm on core 1 and the quick sort
algorithm on core 2.
 The array is placed first in main memory.
 The whole array, held in two blocks, is then sent from main memory to the L2
cache, and then each L1 cache receives the same blocks.
 For example: on the first data access, quick sort on core 2 is searching
while bubble sort on core 1 is swapping, which means core 1 wants to
write; so core 1 updates the value in the L2 cache and then in main
memory by using the write-through policy.
 After that, core 1 broadcasts a request on the snoopy bus to
invalidate the same data on the other core.
 Hence core 2 takes a read miss on the data, so it needs to update its data from L2.
 This means a cache coherence problem happens on each data access.
The two algorithms fight over the same array, which causes
duplicated data, lost data or wrong sorting, and flushed copies.
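The scenario above can be modeled with a toy two-core sketch. This is our simplification for illustration (a dict per L1, a dict for the shared L2), not the project's simulator: a write goes through to L2 and the snooper invalidates the other core's copy, so that core's next read misses.

```python
class System:
    def __init__(self, data):
        self.l2 = dict(data)        # shared L2 (RAM backing not modeled)
        self.l1 = [dict(), dict()]  # one private L1 per core

    def read(self, core, addr):
        if addr in self.l1[core]:
            return "read hit", self.l1[core][addr]
        # Read miss -- a coherence miss when it follows an invalidation.
        self.l1[core][addr] = self.l2[addr]
        return "read miss", self.l1[core][addr]

    def write(self, core, addr, value):
        self.l1[core][addr] = value
        self.l2[addr] = value              # write-through to the shared L2
        self.l1[1 - core].pop(addr, None)  # snooped invalidation of the other L1
```

For example, after both cores read an element and core 0 (bubble sort) swaps it, core 1's copy is invalidated and its next access is a coherence (read) miss served from L2.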
Contribution
 Bubble sort algorithm
 Quick sort Algorithm
 Insertion sort Algorithm
 Trace
Contribution
 Bubble Sort:
compares the numbers in pairs from left to right,
exchanging them when necessary. The first number is compared
to the second and, if it is larger, they are exchanged.
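The description above can be sketched as a short Python bubble sort that also counts compares and swaps, the operations that drive the trace (an illustrative sketch we add here, not the project's trace generator):

```python
def bubble_sort(a):
    a = list(a)
    compares = swaps = 0
    for end in range(len(a) - 1, 0, -1):   # largest element settles each pass
        for i in range(end):
            compares += 1
            if a[i] > a[i + 1]:            # larger of the pair bubbles right
                a[i], a[i + 1] = a[i + 1], a[i]
                swaps += 1
    return a, compares, swaps
```

On the example array 7 3 2 1 5 4 6 this performs 21 compares and 10 swaps, which is why its trace file is the largest of the three.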
Contribution
 Bubble Sort
Contribution
 Quick Sort :
Given an array of n elements (e.g., integers):
 If array only contains one element, return
 Else
 pick one element to use as pivot.
 Partition elements into two sub-arrays:
 Elements less than or equal to pivot
 Elements greater than pivot
 Quick sort the two sub-arrays
 Return results
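The steps above can be sketched directly in Python (a textbook sketch following the same partition rule, not necessarily the project's exact implementation):

```python
def quick_sort(a):
    if len(a) <= 1:                # an array of at most one element: return it
        return list(a)
    pivot, rest = a[0], a[1:]      # pick one element to use as the pivot
    lesser  = [x for x in rest if x <= pivot]   # elements <= pivot
    greater = [x for x in rest if x > pivot]    # elements > pivot
    return quick_sort(lesser) + [pivot] + quick_sort(greater)
```

Because each element is visited far fewer times than in bubble sort, this matches the much smaller quick-sort trace file reported later in the deck.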
Contribution
Quick Sort :
Contribution
 Insertion Sort :
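The insertion-sort slide is a figure in the original deck; for reference, a standard textbook version can be sketched as below (our sketch, not the project's exact code):

```python
def insertion_sort(a):
    a = list(a)
    for i in range(1, len(a)):
        key = a[i]                       # next element to insert
        j = i - 1
        while j >= 0 and a[j] > key:     # shift larger elements right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key                   # drop key into its sorted position
    return a
```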
Our Experiments
 Simulating different sorting algorithms in parallel on two
cores with the same data array to study the behavior of
the cache.
 Case 0: Bubble sort vs. Insertion sort
 Case 1: Bubble sort vs. Quick sort
 Case 2: Quick sort vs. Insertion sort
Our Milestones

# | Case | Cache Policy | Coherence Policy | Status
1 | Bubble sort vs. Quick sort | Write through | Invalidation | Done
2 | Insertion sort vs. Bubble sort | Write through | Invalidation | Done
3 | Quick sort vs. Insertion sort | Write through | Invalidation | Done
4 | Insertion sort vs. Bubble sort | Write back | Invalidation | In progress
5 | Bubble sort vs. Quick sort | Write back | Invalidation | In progress
Our Experiments
 In our experiments we studied the cache behavior at both the
higher and lower levels.
 We measured the hit and miss rates in the cache.
 These measurements reflect the incoherence that
occurred as a result of applying the invalidation
policy.
Our Experiments (Parameters)
We used the same parameters in all cases:
 The input data: the same array
 Coherence policy: Invalidate
 Cache size: 64 bytes
 Block size: 32 bytes
 Number of cores: 2
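With the parameters above, a quick computation (a sketch we add for illustration) shows how tight the configuration is: a 64-byte cache with 32-byte blocks holds only two blocks, and the 64-byte input array occupies exactly two blocks, so the array exactly fills each L1.

```python
cache_size = 64   # bytes, from the parameters above
block_size = 32   # bytes
array_size = 64   # bytes (the input array)

blocks_in_cache = cache_size // block_size   # slots available per L1
blocks_in_array = array_size // block_size   # blocks the array occupies

print(blocks_in_cache, blocks_in_array)      # the array exactly fills the cache
```

This tightness is what makes the two cores fight so visibly over the same blocks.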
Our Experiments
 The data type: a one-dimensional array with a size of 64 bytes.
 The trace file is generated by the code, e.g.:
0x00000003 1 5
0x00000003 0 5
 The Bubble Sort trace file size = 126 KB.
 The Insertion Sort trace file size = 62 KB.
 The Quick Sort trace file size = 23 KB.
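A trace line such as `0x00000003 1 5` can be parsed as below. The field meanings assumed here (hex address, read/write flag, data value) are our illustration; the deck does not spell out the format.

```python
def parse_trace_line(line):
    # Assumed format: hex address, read/write flag (e.g. 0 = read, 1 = write),
    # and the data value -- these field meanings are an assumption.
    addr_s, flag_s, value_s = line.split()
    return int(addr_s, 16), int(flag_s), int(value_s)
```

Feeding each parsed line to the simulated caches in order reproduces the interleaved accesses of the two algorithms.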
Our Experiments (result)
3- Measuring the coherence miss rate for all cases on 2 cores

Case | Coherence Misses
Bubble sort vs. Insertion sort | 1574
Bubble sort vs. Quick sort | 352
Quick sort vs. Insertion sort | 675

Coherence Miss = Coherence Write Miss + Coherence Read Miss
Our Experiments (chart)
 Measuring the coherence miss rate for all cases on 2 cores
[Chart: coherence misses per case — Bubble sort vs. Insertion sort: 1574;
Bubble sort vs. Quick sort: 352; Quick sort vs. Insertion sort: 675.]
Our Experiments (analysis)
 This figure shows the incoherence in our simulator.
 How? The incoherence happened because of the invalidation policy.
 It happens because both algorithms fight over the same data.
 As we see in the chart, the coherence miss rate is highest in the first
case. Why?
 The array state left by the bubble sort can be helpful or wasteful for the
insertion sort in the same case, which increases the data accessed.
 The insertion sort's behavior can increase or decrease the number of
iterations of the bubble sort because of the
wrong sorting caused by the fighting over the same data.
Our Experiments (result)
3- Measuring the read coherence miss rate and the write
coherence miss rate for all cases on 2 cores

Case | Read Coherence Misses | Write Coherence Misses
Case 0 (Bubble vs. Insertion) | 944 | 1352
Case 1 (Bubble vs. Quick) | 1736 | 5204
Case 2 (Quick vs. Insertion) | 700 | 1169
Our Experiments (chart)
 Measuring the read coherence miss rate and the write
coherence miss rate for all cases on 2 cores
[Chart: read vs. write coherence misses per case — the bar values are 5204,
1736, 1352, 1169, 944, and 700 across the three cases (Bubble vs. Insertion,
Bubble vs. Quick, Quick vs. Insertion), with write misses exceeding read
misses in each case.]
Our Experiments (analysis)
 1- This figure breaks the coherence misses of the previous figure into
write coherence misses and read coherence misses for all cases.
 2- As we can see, write coherence misses are higher than read coherence
misses. Why?
 3- Because of the incoherence caused by invalidation, each
algorithm did a lot of swapping.
 4- That happens because both algorithms fight over the same
data.
Our Experiments (result)
1- Measuring the hit and miss rates with write through for all
cases using invalidation

All Cases (Hit/Miss) - Write Through
Case | Hit | Miss
Bubble vs. Insertion | 14909 | 3151
Bubble vs. Quick | 11042 | 707
Quick vs. Insertion | 7136 | 1353
Our Experiments (Chart)
Measuring the hit and miss rates with write through for all
cases using invalidation
[Chart: hits and misses per case — Bubble vs. Insertion: 14909 hits, 3151
misses; Bubble vs. Quick: 11042 hits, 707 misses; Quick vs. Insertion: 7136
hits, 1353 misses.]
Our Experiments (analysis)
 Measuring the cache performance.
 The hit rates in all cases are greater than the miss rates.
 The reason: write hits occur more often because of the swapping
operation.
 The highest hit rate is shown by the bubble sort.
 This algorithm has more 'comparing and swapping' operations than the
other sorting algorithms, and it is not an efficient algorithm.
Our Experiments (Result)
2- Measuring the hit and miss rates with write through on
each core for each case using invalidation
(to show the impact of the fighting over the same data at the higher
level).

Bubble Sort vs. Insertion Sort
Core | Hit | Miss
Core 0 (Bubble Sort) | 6702 | 1826
Core 1 (Insertion Sort) | 3048 | 1324

Bubble Sort vs. Quick Sort
Core | Hit | Miss
Core 0 (Bubble Sort) | 7489 | 500
Core 1 (Quick Sort) | 1140 | 454

Quick Sort vs. Insertion Sort
Core | Hit | Miss
Core 0 (Quick Sort) | 7489 | 952
Core 1 (Insertion Sort) | 1140 | 400
Our Experiments (charts)
Measuring the hit and miss rates with write through
on each core for each case using invalidation
[Charts: per-core hits and misses — case 0 (Bubble vs. Insertion): Core 0
6702 hits / 1826 misses, Core 1 3084 hits / 1324 misses; case 1 (Bubble vs.
Quick): Core 0 7489 hits / 500 misses, Core 1 1140 hits / 454 misses; case 2
(Quick vs. Insertion): Core 0 7489 hits / 952 misses, Core 1 1140 hits / 400
misses.]
Our Experiments (analysis)
 These figures show the high rates of hits and misses in the cache for
each case.
 That happens because both algorithms fight over the
same data at the higher level.
 The hit and miss rates are driven by the incoherence
caused by invalidation.
 The array state left by the first algorithm can be helpful or wasteful for the
second algorithm in the same case. It can reduce or increase the
data accesses.
 The first algorithm's behavior can increase or decrease the number of sorting
iterations of the second algorithm because of the
wrong sorting caused by the fighting over the same data.
Evaluation
 In our evaluation we studied the impact of varying
parameters on cache optimization:
 Different block sizes with a constant cache size
 To measure the coherence misses.
 To measure the hit and miss rates at both levels of cache.
 Different cache sizes with a constant block size
 To measure the conflict misses at both levels of cache.
Evaluation (block size)
[Charts: coherence misses per case for two block sizes. Block size = 64,
cache size = 64: Bubble vs. Insertion 2935; the other two cases 911 and 695
(per-case assignment unclear in the source chart). Block size = 32, cache
size = 64: Bubble vs. Insertion 1574; Bubble vs. Quick 352; Quick vs.
Insertion 675.]
Evaluation(analysis)
 We increased the parameter value for block size and we
constant the cache size with write through for all cases.
Why ? to study the impact of causing incoherence
 These two figures show the increase of coherence misses
since we increased the block size .Why ?
 Because we put the same data that means the
invalidation applied many times which showed the
fighting of data
 The block size vector parameter plays the role of
impacting on cache.

Evaluation (Block size)
[Charts: hits and misses with write through for two block sizes. Block size
= 64, cache size = 64: hits 14909 / 11042 / 7136 for the three cases, with
misses 5872, 1392, and 1824 (per-case assignment of the misses unclear in
the source chart). Block size = 32, cache size = 64: Bubble vs. Insertion
14909 hits / 3151 misses; Bubble vs. Quick 11042 hits / 707 misses; Quick
vs. Insertion 7136 hits / 1353 misses.]
Evaluation (analysis)
 We increased the block size with a constant cache size,
with write through, for all
cases. Why? To study the impact of this change on
cache optimization.
 These two figures show the same hit rates and
increased misses as we increased the block size,
 because of the increase in invalidation messages.
Evaluation (Block size)
[Charts: per-core hits and misses for case 1 (Bubble Sort vs. Quick Sort).
Block size = 64 bytes, L1 cache size = 64 bytes: Bubble 7347 hits / 940
misses; Quick 854 hits / 537 misses. Block size = 32 bytes, L1 cache size =
64 bytes: Core 0 (Bubble Sort) 7489 hits / 500 misses; Core 1 (Quick Sort)
1140 hits / 454 misses.]
Evaluation (analysis)
 We increased the parameters values for block size and we
constant the cache size with write through for specific
cases. Why ? to study the impact of this change on the
higher level optimization.
 These two figures show the different of hits and
increased misses since we increased the block size .
 Here we studied each algorithms behavior as we see hits
of the bubble is high because has much more cycles than
another which clearly seen in its trace file
 The block size vector parameter plays the role of
impacting on L1 cache .
Evaluation (Conflict vs. Cache Size)
[Charts: conflict misses per case for two cache sizes with block size = 32.
Cache size = 32: values 1982, 1572, and 1260 across the three cases. Cache
size = 64: values 710, 673, and 350 across the three cases (per-case
assignment unclear in the source charts).]
Evaluation (analysis)
 These two figures show that when the cache size
increases, the conflict miss rate decreases. Why?
 In figure 1, the block size is 32 bytes, and when the
level 1 cache size equals the block size, the conflict miss
rate increases since the cache fits only one
block.
 In figure 2, the block size is 32 bytes
and the level 1 cache size is twice the block size, so
the conflict miss rate decreases since the cache
fits a larger number of blocks.
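The effect described above can be checked with a one-line index calculation (our illustrative sketch, assuming a direct-mapped cache as the slides do): a block's slot is its block number modulo the number of slots.

```python
def slot(address, block_size, cache_size):
    # Direct-mapped index: block number modulo the number of slots.
    return (address // block_size) % (cache_size // block_size)

# Cache size == block size (figure 1): one slot, so blocks 0 and 1 collide.
print(slot(0, 32, 32), slot(32, 32, 32))   # both map to slot 0

# Cache size == 2 x block size (figure 2): the two blocks get distinct slots.
print(slot(0, 32, 64), slot(32, 32, 64))   # slot 0 and slot 1
```

With a single slot, every access to the other block evicts the resident one, which is exactly the conflict-miss inflation seen in figure 1.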
Related Work
 The effect of false sharing on parallel algorithm
performance depends on many factors such
as block size, access pattern, and coherence
policies [9].
 The impact of false sharing is a main factor in
performance when comparing the optimal policy, traditional
coherence policies, and the new merge
facility [9].
Conclusion
 Coherence is needed to meet an architectural
assumption held by the software designer.
 Flushing data an extra time prevents data loss and
duplicated data and improves cache performance.
 Invalidation messages increase when we change the
block size with a static cache size.
Future Work
 Using the update coherence policy with write through
and write back.
 Executing algorithms in parallel on more cores and
counting the false sharing.
 Using larger data sizes.
Challenges
 Clarifying the project idea to the class.
 It was our first time simulating caches in software.
 The implementation required a large effort in a short
time.
 The write back policy and the quick sort algorithm were
quite complicated.
 We read a lot of papers to find work related to our
project, because our area is broad.
What did we learn?
 How the architecture reacts with the software.
 How to pick a small feature and turn it into a big research project.
 From the comprehensive questions in the lab assignments,
we learned how to analyze our simulation
performance.
References
[1] Prabhu, Gurpur M. "Computer Architecture Tutorial." 2 Feb. 2003. Accessed 05 Apr. 2014. http://www.cs.iastate.edu/~prabhu/Tutorial/title.html
[2] Gita Alaghband (2014). CSC 5593 Graduate Computer Architecture, Lecture 2.
[3] Shaaban, Muhammed A. "EECC 550 Winter 2010 Home Page." RIT, 27 Nov. 2010. Accessed 07 Apr. 2014. <http://people.rit.edu/meseec/eecc550-winter2010/>
[4] Guanjun Jiang; Du Chen; Binbin Wu; Yi Zhao; Tianzhou Chen; Jingwei Liu, "CMP Thread Assignment Based on Group Sharing L2 Cache," Eighth International Conference on Embedded Computing / Scalable Computing and Communications (SCALCOM-EMBEDDEDCOM'09), pp. 298-303, 25-27 Sept. 2009.
[5] Kruse and Ryba (2001). Mergesort and Quicksort. Retrieved from www.cs.bu.edu/fac/gkollios/cs113/Slides/quicksort.ppt
[6] Wei Zhang (2010). Multicore Architecture. Retrieved from https://www.pdffiller.com/en/project/16525498.htm?form_id=11909329
[7] J. Hennessy, D. Patterson. Computer Architecture: A Quantitative Approach (4th ed.). Morgan Kaufmann, 2011.
[8] D. Patterson, J. Hennessy. Computer Organization and Design (5th ed.). Morgan Kaufmann, 2011.
[9] W. Bolosky and M. Scott. False sharing and its effect on shared memory performance. In Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), San Diego, CA, September 1993.