Sort with SAS and Big Data - Washington University in St


SORTING WITH SAS LONG, VERY LONG AND LARGE, VERY LARGE DATA
Aldi Kraja
Division of Statistical Genomics
SAS seminar series
June 02, 2008
SORT AND MERGE EXAMPLE

data a;
  input id m1 $ m2 $ m3 $ DNAreserve;
  datalines;
1 1/1 1/2 1/1 12
2 1/2 1/1 2/2 14
3 2/2 1/1 1/1 15
4 1/2 1/2 1/2 16
5 1/1 2/2 1/1 15
;
run;
proc sort data=a;
  by id;
run;
SORT AND MERGE EXAMPLE (CONT.)

data b;
  input id age sex SBP DBP;
  datalines;
1 23 1 128 95
2 25 2 115 84
3 30 1 120 85
4 27 1 130 90
5 35 2 122 82
;
run;
proc sort data=b;
  by id;
run;
SORT AND MERGE EXAMPLE (CONT.)

data ab;
  merge a (in=in1) b (in=in2);
  by id;
  if in1 and in2;
run;
proc print data=ab;
  title "A and B merged";
run;

A and B merged                                Monday, June 2, 2008

Obs  id  m1   m2   m3   DNAreserve  age  sex  SBP  DBP
  1   1  1/1  1/2  1/1          12   23    1  128   95
  2   2  1/2  1/1  2/2          14   25    2  115   84
  3   3  2/2  1/1  1/1          15   30    1  120   85
  4   4  1/2  1/2  1/2          16   27    1  130   90
  5   5  1/1  2/2  1/1          15   35    2  122   82

EXAMPLE 2: JOIN TABLES WITH SQL

proc sql;
  create table sqlab as
  select *
  from a, b
  where a.id=b.id;
quit;

proc print data=sqlab;
  title "SQL joined tables";
run;
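
Because both tables carry an id column, select * makes PROC SQL warn that the variable id already exists in the output table. A minimal sketch of the same inner join with the columns listed explicitly follows. The tables a_demo and b_demo, and the output name sqlab_explicit, are hypothetical miniatures introduced only so the sketch runs on its own.

```sas
/* Hypothetical miniature tables, not the a and b of the slides */
data a_demo;
  input id m1 $ DNAreserve;
  datalines;
1 1/1 12
2 1/2 14
3 2/2 15
;
run;

data b_demo;
  input id age SBP;
  datalines;
1 23 128
2 25 115
4 27 130
;
run;

proc sql;
  /* Name each column once so id is not selected from both tables */
  create table sqlab_explicit as
  select a.id, a.m1, a.DNAreserve, b.age, b.SBP
  from a_demo as a inner join b_demo as b
    on a.id = b.id
  order by a.id;
quit;

proc print data=sqlab_explicit;
  title "SQL inner join with explicit columns";
run;
```

Only ids present in both tables are kept, which mirrors the "if in1 and in2" condition of the DATA step merge.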

TIME:
Merge:
- sorting a:
    real time 0.01 seconds
    cpu time 0.01 seconds
- sorting b:
    real time 0.01 seconds
    cpu time 0.01 seconds
- merge:
    real time 0.01 seconds
    cpu time 0.01 seconds

NOTE: PROCEDURE SQL used (Total process time):
      real time 0.01 seconds
      cpu time 0.01 seconds

With data this small the timings are indistinguishable. Test with large and long data to see whether there is any advantage in using PROC SQL.
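
One way to run such a test is sketched below. The dataset names, the row count in &n, and the generated variables are all assumptions made for illustration, and no timings are claimed here. The log produced with FULLSTIMER is what should be compared.

```sas
options fullstimer;          /* detailed real time, CPU time and memory in the log */

%let n = 1000000;            /* hypothetical row count; raise it to stress the server */

/* Two synthetic tables with the same ids */
data big_a;
  call streaminit(2008);
  do id = 1 to &n;
    shuffle = rand('uniform');                /* random key, used only to scramble rows */
    DNAreserve = ceil(20 * rand('uniform'));
    output;
  end;
run;

data big_b;
  call streaminit(602);
  do id = 1 to &n;
    shuffle = rand('uniform');
    age = 20 + floor(40 * rand('uniform'));
    output;
  end;
run;

/* Scramble the row order so neither approach starts from sorted input */
proc sort data=big_a; by shuffle; run;
proc sort data=big_b; by shuffle; run;

/* Approach 1: sort both tables, then merge */
proc sort data=big_a out=sorted_a; by id; run;
proc sort data=big_b out=sorted_b; by id; run;

data merged_ab;
  merge sorted_a (in=in1) sorted_b (in=in2);
  by id;
  if in1 and in2;
run;

/* Approach 2: let PROC SQL join the scrambled tables directly */
proc sql;
  create table sql_ab as
  select a.id, a.DNAreserve, b.age
  from big_a as a, big_b as b
  where a.id = b.id;
quit;
```

For a fair comparison, add the two PROC SORT times by id to the merge time before comparing with the PROC SQL step.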
EXAMPLE 3: SORT FLAGS (IN THE DESCRIPTOR PORTION OF A DATASET)

 The CONTENTS Procedure

Data Set Name    WORK.A      Observations    5
Member Type      DATA        Variables       5

Sort Information
Sortedby     id
Validated    YES
EXAMPLE 3: SORT FLAGS (CONT.)

data one (sortedby=id);
  input id;
  datalines;
1
4
3
5
2
;
run;
proc contents data=one;
  title " data one with option sortedby=id ";
run;
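
The sort flag can also be read programmatically. The sketch below uses the ATTRC function with the SORTEDBY and SORTLVL attributes; the dataset names raw_demo, asserted_demo and sorted_demo are hypothetical. According to the SAS documentation, SORTLVL is WEAK when the order was only asserted by the user (as with the SORTEDBY= option) and STRONG when it was established by the software (as with PROC SORT).

```sas
data raw_demo;
  input id;
  datalines;
1
4
3
5
2
;
run;

/* Sort order asserted by the user only; the rows are NOT actually sorted */
data asserted_demo (sortedby=id);
  set raw_demo;
run;

/* Sort order established by PROC SORT */
proc sort data=raw_demo out=sorted_demo;
  by id;
run;

/* Print the sort flag stored in each descriptor to the log */
data _null_;
  length ds $ 41 sortedby $ 200 sortlvl $ 8;
  do ds = 'work.asserted_demo', 'work.sorted_demo';
    dsid = open(ds);
    if dsid > 0 then do;
      sortedby = attrc(dsid, 'SORTEDBY');   /* BY variables recorded in the descriptor */
      sortlvl  = attrc(dsid, 'SORTLVL');    /* who established the sort order */
      put ds= sortedby= sortlvl=;
      rc = close(dsid);
    end;
  end;
run;
```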
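EXAMPLE 3: SORT FLAGS (CONT.)

proc sort data=one;
  by id;
run;
/* BY-group processing requires sorted or indexed input */
data two;
  set one;
  by id;
run;
/* Two alternative ways to create the index; run only one of them, */
/* because an index named id can be created only once on a dataset */
proc sql;
  create index id on one(id);
quit;
proc datasets nolist;
  modify one;
  index create id;
run;
quit;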
SORTING LARGE DATA ON MANY KEYS

Problems:
- Disk space or temporary space may be inadequate
- The time needed may be quite long
- The software or the operating system may not work correctly while sorting large data
- The WORK directory is normally located under /tmp on a server. If the data to be sorted are 3 GB and /tmp is set to 1 GB, can SAS do the SORT?
- What if 8 jobs run in parallel on the same server with 8 processors, each trying to SORT a different very large and long dataset for a different purpose?
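
Before launching a large sort it helps to check where the temporary files will go and which memory limits apply. The sketch below only reads the current settings and writes them to the log; nothing in it is specific to the server described on this slide. The WORK location itself is chosen when SAS starts, through the -work invocation option.

```sas
/* Physical path of the WORK library for this session */
%put NOTE: WORK path = %sysfunc(pathname(work));

/* Location used for utility files */
%put NOTE: UTILLOC   = %sysfunc(getoption(utilloc));

/* Memory the sort may use, and the overall session memory limit */
%put NOTE: SORTSIZE  = %sysfunc(getoption(sortsize));
%put NOTE: MEMSIZE   = %sysfunc(getoption(memsize));
```

Comparing the free space under the reported paths with the size of the dataset gives a first answer to whether the sort can complete.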
EXAMPLE 4: TAGSORT OPTION

data a;
  input pedid id m1 $ m2 $ m3 $ DNAreserve;
  datalines;
1 1 1/1 1/2 1/1 12
1 2 1/2 1/1 2/2 14
1 3 2/2 1/1 1/1 15
2 6 1/2 1/2 1/2 16
2 5 1/1 2/2 1/1 15
2 4 2/2 2/2 1/2 12
;
run;
proc sort tagsort data=a nodupkey out=sorted_a;
  by pedid id;
run;
TAGSORT
- Introduced in version 6.07
- Can produce important improvements in clock time, but increases the CPU time
- Internally, the sort stores only the sort keys and the observation numbers in the temporary files
- These sort keys and observation numbers are the “tags” of TAGSORT
- At the end of the sort, the tags are used to retrieve the entire records from the original dataset, now in sorted order
- Potential gains when the dataset is very large
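
The idea behind the tags can be imitated by hand in three DATA/PROC steps. This is only an illustration of the principle, not how PROC SORT implements TAGSORT internally; the dataset ped_demo and the variables obsno and p are hypothetical names.

```sas
data ped_demo;
  input pedid id m1 $ DNAreserve;
  datalines;
1 1 1/1 12
2 6 1/2 16
1 3 2/2 15
2 4 2/2 12
1 2 1/2 14
2 5 1/1 15
;
run;

/* Step 1: keep only the sort keys plus the observation number (the "tag") */
data tags (keep=pedid id obsno);
  set ped_demo (keep=pedid id);
  obsno = _n_;
run;

/* Step 2: sort the narrow tag file instead of the wide records */
proc sort data=tags;
  by pedid id;
run;

/* Step 3: walk the tags in sorted order and fetch each full record */
/* by direct access; the step ends when the tag file is exhausted   */
data sorted_demo (drop=obsno);
  set tags (keep=obsno);
  p = obsno;                 /* POINT= variable, never written to the output */
  set ped_demo point=p;
run;

proc print data=sorted_demo;
  title "Records retrieved in tag order";
run;
```

Step 3 shows where the extra cost comes from: every record is fetched by random access rather than read sequentially.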

EXAMPLE 5: GENESTAR PROJECT PROBLEM
8 large text files
- Read into SAS
- 8 SAS datasets

Dataset dimensions shown on the slide:
S1-S400 by 1,044,977
S1-S400 by 1,044,977
S1-S149 by 1,044,977
S1-S687 by 1,044,977

The data are very large
GENESTAR PROJECT PROBLEM
A. Split the data for each subject into a new dataset
   d1-d3236
B. Split the data for each subject into 25 chromosomes
   d1c1-d1c25 ……..
   d3236c1-d3236c25
- Transpose markers in batches of 200 markers at a time and place the data together for a chromosome
- Finally, with proc append, place together subjects of the same chromosome.

Started:
Subject marker geno genocall
1       m1     1/1  0.7560
1       m2     1/2  0.76899
………………

Ended:
Subject m1      m2       …
1       0.7560  0.76899
2       0.9999  0.98999
………………
Subject m1   m2   …
1       1/1  1/3
2       1/2  3/3
………………
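
The long-to-wide step can be sketched with PROC TRANSPOSE. The slides do not say which procedure was used for the batches of 200 markers, so this is one possible approach rather than the author's program. The dataset long_demo is hypothetical, with values loosely following the slide.

```sas
data long_demo;
  input subject marker $ geno $ genocall;
  datalines;
1 m1 1/1 0.7560
1 m2 1/2 0.76899
2 m1 1/2 0.9999
2 m2 3/3 0.98999
;
run;

/* One row per subject, one column per marker, holding the genotype */
proc transpose data=long_demo out=wide_geno (drop=_name_);
  by subject;          /* input must be grouped by subject */
  id marker;           /* marker names become column names */
  var geno;
run;

/* Same layout for the genotype call score */
proc transpose data=long_demo out=wide_call (drop=_name_);
  by subject;
  id marker;
  var genocall;
run;

proc print data=wide_geno; title "Genotypes, one row per subject"; run;
proc print data=wide_call; title "Genotype calls, one row per subject"; run;
```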
SORT IN THE GENESTAR PROJECT
sas -memsize 16G pgm.sas &

MPRINT(SORTIT): proc sort data=in1.rawdataf8 nodupkey out=a (keep=barcoden) ;
SYMBOLGEN: Macro variable BYL resolves to barcoden
MPRINT(SORTIT): by barcoden ;
MPRINT(SORTIT): run;
NOTE: There were 718126154 observations read from the data set IN1.MYDATA.
ERROR: Insufficient memory.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: SAS set option OBS=0 and will continue to check

SORT ON LARGE DATA, IS IT NECESSARY?
- I resolved the problem in the following way: a) I removed every other variable from the data and kept only the BY variable; b) only after a) did the sort with NODUPKEY work.
- In addition, where I had another similar sort, I removed the sort and used steps that do the same thing without sorting.
- Only now does the program no longer run out of memory, which suggests that SAS had no limit on the number of observations, but that the limit was on memory use on our server (the sort needed more than 16 GB of memory) ???
- (32/64-bit issues and -memsize 0)
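
The slides do not show which steps replaced the sort. One possible way to obtain the distinct key values without sorting is a DATA step hash object, sketched below under the assumption that the set of distinct keys fits in memory. The dataset barcodes_demo and the hash name seen are hypothetical.

```sas
data barcodes_demo;
  input barcoden other;
  datalines;
103 1
101 2
103 3
102 4
101 5
;
run;

/* Write each barcoden the first time it is seen; no sort is involved */
data unique_barcodes (keep=barcoden);
  if _n_ = 1 then do;
    declare hash seen ();            /* in-memory table of keys already met */
    seen.defineKey('barcoden');
    seen.defineData('barcoden');
    seen.defineDone();
  end;
  set barcodes_demo (keep=barcoden); /* read only the key variable */
  if seen.check() ne 0 then do;      /* nonzero return code: key not stored yet */
    rc = seen.add();
    output;
  end;
run;

proc print data=unique_barcodes;
  title "Distinct keys without sorting";
run;
```

Unlike PROC SORT with NODUPKEY, the output keeps the keys in order of first appearance rather than in sorted order.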

EXAMPLE 6, SORT WITH SQL

proc sql;
  create table sql_a as
  select *
  from a
  order by pedid, id;
quit;


EXAMPLE 7: MERGE WITH INDEX WITHOUT SORTING DATA

proc contents data=a;
  title "a is not sorted";
run;
proc contents data=b;
  title "b is not sorted";
run;
data a_index (index=(id));
  set a;
run;
data b_index (index=(id));
  set b;
run;
data final;
  set b_index;
  set a_index key=id;
run;
proc print data=final;
  title "Merged data based on index= id";
run;
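
When an id in the driving dataset has no match in the indexed dataset, the KEY= lookup fails and the variables from the indexed dataset keep their previous values. A minimal sketch that checks the automatic variable _IORC_ to keep only matched rows follows; the datasets a_demo, b_demo and final_demo are hypothetical miniatures. Only the dataset read with KEY= needs the index.

```sas
data a_demo (index=(id));
  input id m1 $;
  datalines;
3 2/2
1 1/1
2 1/2
;
run;

data b_demo;
  input id age;
  datalines;
2 25
4 27
1 23
;
run;

data final_demo;
  set b_demo;                      /* read sequentially, in stored order */
  set a_demo key=id / unique;      /* indexed lookup on the current id */
  if _iorc_ = 0 then output;       /* 0 means a matching id was found */
  else _error_ = 0;                /* no match: reset the error flag, skip the row */
run;

proc print data=final_demo;
  title "Index lookup keeping matched ids only";
run;
```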
PROBLEMS WITH INDEXING
- Indexing can be faster than sorting
- The difference can be significant in large data
- SAS creates an extra file for the index, and this file can be large. For example, for a 1.2 GB dataset SAS may create an index file of ~340 MB
- Advantage: a dataset indexed on many variables can be used as if it were sorted on any one of those variables
- Both PROC DATASETS and PROC SQL can create an index, for example:
  proc datasets library=work; modify a; index create idlist=(pedid id); run; quit;

READINGS:
- Paul M. Dorfman. QuickSorting an array. Paper 96-26.
- Paul M. Dorfman. Table look-up by direct addressing: key indexing – Bitmapping – Hashing. Paper 8-26.
- Robert Patten, Lands’ End, Dodgeville, WI. Randomly Selecting Observations. Paper 075-29. http://www2.sas.com/proceedings/sugi29/075-29.pdf
