Transcript Slide 1

Hash Objects – Why Use
Them?
Carolyn Cunnison
SAS Technical Training Specialist
Copyright © 2008, SAS Institute Inc. All rights reserved.
Agenda
 What are a Hash objects?
 When should I use them?
 Some sample code.
Copyright © 2008, SAS Institute Inc. All rights reserved.
What are HASH objects?
• Hash object can be thought of as rows of keys and data
loaded into memory.
Keys
3
Copyright © 2008, SAS Institute Inc. All rights reserved.
Data
Data
...
Advantages of Hash Objects
 Values can be hard-coded or loaded from a SAS
data set.
 Keys and data can be a mixture of character and
numeric.
 Provides in-memory data storage and retrieval.
 Does not require that data be sorted.
 Is sized dynamically.
Copyright © 2008, SAS Institute Inc. All rights reserved.
When to Use Hash Objects
(1) Joining tables
 I cut my processing time by 90% using hash
tables - You can do it too!; Jennifer K. WarnerFreeman
• http://www.nesug.info/Proceedings/nesug07/bb/bb1
6.pdf
 Jennifer took an existing Proc SQL join which
took between 2 and 4 hours to run. When she
rewrote the program to use Hash tables, the
program ran in 11 minutes.
Copyright © 2008, SAS Institute Inc. All rights reserved.
When to Use Hash Objects
(2) Summary-less summarization
 Hash-Crash and Beyond; Paul Dortman et al
• http://www2.sas.com/proceedings/forum2008/0372008.pdf
 Compared PROC SUMMARY with NWAY option
to Hash Object
proc summary data = input nway ;
class k1 k2 ; var num ;
output out = summ_sum (drop = _:) sum =
sum ;
 The Hash Object did “the job more than twice as
fast at the same time utilizing ⅓ the memory”
Copyright © 2008, SAS Institute Inc. All rights reserved.
When to Use Hash Objects
(3) Dynamically output to multiple files
 Paul Dortman paper (continued)
 Use a Hash table instead of the following:
Data out1 out2;
Set tablein;…
If id = 1 then output out1;
Else if id = 2 then output out2;
Copyright © 2008, SAS Institute Inc. All rights reserved.
When to Use Hash Objects
(4) Removing data extremes
 Knowledge Base Sample 25990
• http://support.sas.com/kb/25/990.html
 Removes top and bottom 10% of data values.
Copyright © 2008, SAS Institute Inc. All rights reserved.
When to Use Hash Objects
(5) Perform data sampling without Proc Surveyselect
 Better Hashing in SAS9.2; Robert Ray and
Jason Secosky
• http://support.sas.com/rnd/base/datastep/dot/betterhashing-sas92.pdf
 Select observations from a table without
replacement.
 Perform sampling and data manipulation in one
step.
Copyright © 2008, SAS Institute Inc. All rights reserved.
Terminology
Partial list of methods:
10
Objects
Methods
HASH
Definedata
Definekey
Definedone
Find
HITER
First
Last
Next
Prev
Copyright © 2008, SAS Institute Inc. All rights reserved.
Add - a row
Remove - a row
Replace - data for key
Delete - hash table
Business Scenario
You need to read orion.product_list and then look up
information in the orion.supplier table.
Structure of orion.product_list
Structure of orion.supplier
11
Loading Data from a SAS Data Set
data supplier_info;
drop rc;
length Supplier_Name $40 Supplier_Address $ 45
Country $ 2;
if _N_=1 then do;
declare hash S(dataset:'orion.supplier');
S.definekey('Supplier_ID');
S.definedata('Supplier_Name',
'Supplier_Address','Country');
S.definedone();
call missing(Supplier_Name,
Supplier_Address,Country);
end;
set orion.product_list;
rc=S.find();
if rc=0;
run;
12
p305d02
Execution
Partial HASH Object S
KEY:
Supplier
_ID
DATA:
Supplier_
Name
50
Scandinavian
Clothing A/S
109
Petterson AB
316
Prime Sports
Ltd
.
.
.
.
.
.
3298
A Team Sports
DATA:
Supplier_
Address
Kr.
Augusts
Gate 13
Blasieholmstorg
1
9
Carlisle
Place
.
.
.
2687
Julie
Ann Ct
DATA:
Country
NO
SE
GB
.
.
.
US
data supplier_info;
drop rc;
length Supplier_Name $40
Supplier_Address $ 45
Country $ 2;
if _N_=1 then do;
declare hash S(dataset:'orion.supplier');
S.definekey('Supplier_ID');
S.definedata('Supplier_Name',
'Supplier_Address',
'Country');
S.definedone();
call missing(Supplier_Name,
Supplier_Address,
Country);
end;
set orion.product_list;
rc=S.find();
if rc=0;
run;
Partial PDV
Supplier_
Name
Supplier_
Address
Country
Product
_ID
.
13
Product_
Name
Supplier
_ID
.
. . .D
rc _N_
.
1
...
Execution
Partial HASH Object S
KEY:
Supplier
_ID
DATA:
Supplier_
Name
50
Scandinavian
Clothing A/S
109
Petterson AB
316
Prime Sports
Ltd
.
.
.
.
.
.
3298
A Team Sports
DATA:
Supplier_
Address
Kr.
Augusts
Gate 13
Blasieholmstorg
1
9
Carlisle
Place
.
.
.
2687
Julie
Ann Ct
DATA:
Country
NO
SE
GB
.
.
.
US
data supplier_info;
drop rc;
length Supplier_Name $40
Supplier_Address $ 45
Country $ 2;
if _N_=1 then do;
declare hash S(dataset:'orion.supplier');
S.definekey('Supplier_ID');
S.definedata('Supplier_Name',
'Supplier_Address',
'Country');
S.definedone();
call missing(Supplier_Name,
Supplier_Address,
Country);
end;
set orion.product_list;
rc=S.find();
if rc=0;
run;
Partial PDV
Supplier_
Name
Supplier_
Address
Country
Product
_ID
210200100009
14
Product_
Name
Kids Sweat Round
Neck,Large Logo
Supplier
_ID
3298
. . .D
rc _N_
0
6
...
Execution
Partial HASH Object S
KEY:
Supplier
_ID
DATA:
Supplier_
Name
50
Scandinavian
Clothing A/S
109
Petterson AB
316
Prime Sports
Ltd
.
.
.
.
.
.
3298
A Team Sports
DATA:
Supplier_
Address
Kr.
Augusts
Gate 13
Blasieholmstorg
1
9
Carlisle
Place
.
.
.
2687
Julie
Ann Ct
DATA:
Country
NO
SE
GB
.
.
.
US
data supplier_info;
drop rc;
length Supplier_Name $40
Supplier_Address $ 45
Country $ 2;
if _N_=1 then do;
declare hash S(dataset:'orion.supplier');
S.definekey('Supplier_ID');
S.definedata('Supplier_Name',
'Supplier_Address',
'Country');
S.definedone();
call missing(Supplier_Name,
Supplier_Address,
Country);
end;
set orion.product_list;
rc=S.find();
if rc=0;
run;
Partial PDV
Supplier_
Name
A Team Sports
15
Supplier_
Address
Country
2687 Julie Ann Ct
US
Product
_ID
210200100009
Product_
Name
Kids Sweat Round
Neck,Large Logo
Supplier
_ID
3298
. . .D
rc _N_
0
6
...
Execution
Partial HASH Object S
KEY:
Supplier
_ID
DATA:
Supplier_
Name
50
Scandinavian
Clothing A/S
109
Petterson AB
316
Prime Sports
Ltd
.
.
.
.
.
.
3298
A Team Sports
DATA:
Supplier_
Address
Kr.
Augusts
Gate 13
Blasieholmstorg
1
9
Carlisle
Place
.
.
.
2687
Julie
Ann Ct
DATA:
Country
NO
SE
GB
.
.
.
US
data supplier_info;
drop rc;
length Supplier_Name $40
Supplier_Address $ 45
Country $ 2;
if _N_=1 then do;
declare hash S(dataset:'orion.supplier');
S.definekey('Supplier_ID');
S.definedata('Supplier_Name',
'Supplier_Address',
'Country');
S.definedone();
call missing(Supplier_Name,
Supplier_Address,
True
Country);
end;
set orion.product_list;
rc=S.find();
if rc=0;
run;
Partial PDV
Supplier_
Name
A Team Sports
16
Supplier_
Address
2687 Julie Ann Ct
Country
US
Product
_ID
210200100009
Product_
Name
Kids Sweat Round
Neck,Large Logo
Supplier
_ID
3298
. . .D
rc _N_
0
6
...
Execution
Partial HASH Object S
KEY:
Supplier
_ID
DATA:
Supplier_
Name
50
Scandinavian
Clothing A/S
109
Petterson AB
316
Prime Sports
Ltd
.
.
.
.
.
.
3298
A Team Sports
DATA:
Supplier_
Address
Kr.
Augusts
Gate 13
Blasieholmstorg
1
9
Carlisle
Place
.
.
.
2687
Julie
Ann Ct
DATA:
Country
NO
SE
GB
.
.
.
US
data supplier_info;
drop rc;
length Supplier_Name $40
Supplier_Address $ 45
Country $ 2;
if _N_=1 then do;
Implicit OUTPUT;
declare hash S(dataset:'orion.supplier');
S.definekey('Supplier_ID');
Implicit RETURN;
S.definedata('Supplier_Name',
'Supplier_Address',
'Country');
S.definedone();
call missing(Supplier_Name,
Supplier_Address,
Country);
end;
set orion.product_list;
rc=S.find();
if rc=0;
run;
Partial PDV
Supplier_
Name
A Team Sports
17
Supplier_
Address
2687 Julie Ann Ct
Country
US
Product
_ID
210200100009
Product_
Name
Kids Sweat Round
Neck,Large Logo
Supplier
_ID
3298
. . .D
rc _N_
0
6
...
Execution
Partial HASH Object S
KEY:
Supplier
_ID
DATA:
Supplier_
Name
50
Scandinavian
Clothing A/S
109
Petterson AB
316
Prime Sports
Ltd
.
.
.
.
.
.
3298
A Team Sports
DATA:
Supplier_
Address
Kr.
Augusts
Gate 13
Blasieholmstorg
1
9
Carlisle
Place
.
.
.
2687
Julie
Ann Ct
DATA:
Country
NO
SE
GB
.
.
.
US
data supplier_info;
drop rc;
length Supplier_Name $40
Supplier_Address $ 45
Country $ 2;
if _N_=1 then do;
declare hash S(dataset:'orion.supplier');
S.definekey('Supplier_ID');
S.definedata('Supplier_Name',
'Supplier_Address',
'Country');
S.definedone();
call missing(Supplier_Name,
Supplier_Address,
Country);
end;
set orion.product_list;
rc=S.find();
if rc=0;
run;
Continue until EOF
Partial PDV
Supplier_
Name
A Team Sports
18
Supplier_
Address
2687 Julie Ann Ct
Country
US
Product
_ID
210200100009
Product_
Name
Kids Sweat Round
Neck,Large Logo
Supplier
_ID
3298
. . .D
rc _N_
0
6
...
Results
proc print data=supplier_info(obs=10);
var Product_ID Supplier_ID Supplier_Name
Supplier_Address Country;
title "Product Information";
run;
Partial PROC PRINT Output
Product Information
Obs
Product_ID
Supplier_ID
1
2
3
4
5
6
7
8
9
10
210200100009
210200100017
210200200022
210200200023
210200300006
210200300007
210200300052
210200400020
210200400070
210200500002
3298
3298
6153
6153
1303
1303
1303
1303
1303
772
19
Supplier_Name
Supplier_Address
A Team Sports
A Team Sports
Nautlius SportsWear Inc
Nautlius SportsWear Inc
Eclipse Inc
Eclipse Inc
Eclipse Inc
Eclipse Inc
Eclipse Inc
AllSeasons Outdoor Clothing
2687 Julie Ann Ct
2687 Julie Ann Ct
56 Bagwell Ave
56 Bagwell Ave
1218 Carriole Ct
1218 Carriole Ct
1218 Carriole Ct
1218 Carriole Ct
1218 Carriole Ct
553 Cliffview Dr
Country
US
US
US
US
US
US
US
US
US
US
Could I do the same thing with a MERGE ?
Yes. But ……
• Would have to sort both tables.
• Reading from disk is slower than reading from
memory.
20
What about data size ?
Scalability of Table Lookup Techniques, Rick
Langston
 http://support.sas.com/resources/papers/proceedi
ngs09/037-2009.pdf
 Compared Hash table, Sort/Merge, Indexing,
Proc SQL and Proc Format as table lookup
techniques.
 Hash object processing was successful up to
around 1,900,000 rows and then ran out of
memory.
21
Did you know that….
PROC SQL sometimes uses hashing to join tables.
• Possible processing methods are:
sqxjsl - Step Loop Join (Cartesian product)
sqxjm - Merge Join
sqxjndx- Index Join
sqxjhsh- Hash Join
•
22
To view the method used:
Proc sql _method;
The HITER object
• The HITER object must point to a HASH object.
• Read the HITER using the following methods.
23
Conclusion
 Hash and Hiter objects are very flexible.
 Data has to fit into memory.
 Results will depend on your data, your
environment, and what you are trying to do.
 You have to benchmark.
Copyright © 2008, SAS Institute Inc. All rights reserved.
Want to know more?
• SAS Programming III: Advanced Techniques and Efficiencies
https://support.sas.com/edu/schedules.html?ctry=ca&id=279
• Also available as Live Web course.
25
Copyright © 2008, SAS Institute Inc. All rights reserved.
Questions?
Copyright © 2008, SAS Institute Inc. All rights reserved.