SQL for IR Staff: A crash course for novices in

Download Report

Transcript SQL for IR Staff: A crash course for novices in

SQL for IR Staff
(You can do it…We can help! ©)
A crash course for novices in
appropriately selecting and
transforming data from relational
databases using
SAS® Proc SQL
October 19, 2008 SAIR @ Nashville, TN.
Evan Davies
The College of
William and Mary
Chartered 1693
Dennis Catley
Why are you here?
• You may have some combination of:
• Enough training to run someone else’s query and to
wonder about its syntax…
• Any knowledge gained from point & click environments,
help files, or ‘SQL for dummies’-type books…
• The need to join two or more tables to select data for
analysis…
• The desire to return to work with some new skills to
justify the SAIR trip!
SAS and SQL Knowledge Evaluation
• The reality is
that it is hard to
assess where
anyone fits
along any
learning
continuum.
STARTUP
• Find the workshop datasets and programs at
their storage location;
• Invoke SAS interactive session;
• Explain interactive windows and program flow;
• Explain the necessary LIBNAME statement to
target the sample data sets;
• Execute a simple print job to confirm operations
on everyone’s computer
Proc SQL under
• SQL first proposed by IBM in 1970s;
popularized and standardized in late-1980s.
• Incorporated into SAS system starting with
version 6.18 (experimental).
• SAS Proc SQL has a few more features than
ANSI SQL and is slightly different from other
flavors.
• Wholly integrated into the SAS programming
environment; SAS capable of addressing nonSAS relational databases natively.
Reasons for a Crash Course
• You often have no dedicated instruction in
a small I.R. shop.
• Your I.T. staff provide limited support and
are sometimes puzzled by your requests.
• Available (i.e. free) resources don’t
concentrate on selecting data for analysis
and often use business sales examples.
Mango chutney, anyone?
More Reasons…
• You may be transitioning into an institutional
RDBMS, like Banner or Peoplesoft, with a
data warehouse or mart being attached to it.
• Your drag & drop SQL generator or prewritten programs may be capable of
answering questions, but you are not sure
how to modify the queries and you spend
time dumping data into intermediate
applications for more processing.
Today’s Effort
• We will concentrate on selecting data because your job
focus is institutional research and analysis.
• We will use IR-related examples.
• We can’t teach you all about your own data or your
campus access methods.
• We have lots of material to cover during the workshop,
so we may have to defer questions occasionally.
• SQL wasn’t built in a day. Your mastery of it will not be
achieved in a day…
Concept Bridging
An IR data
selection task
Common SQL
learning
roadblock
A resulting table
with valid data
Code to
produce
desired result
Concept Bridging Topics
1.
2.
3.
4.
5.
6.
7.
8.
Terminology is the primary key
Lose your mind in a single query
Conditions are optimal
Groups, having, functions
Cohorts versus survivors
Keys to the kingdom
Just passing through
The power of sub-queries
An Important FYI
• Data are never changed when you run a
SQL query…..never.
• You may bring the institution’s computing
services to a halt, or dim the lights for a
square mile, but the data are fine.
Terminology is the primary key
The IR professional’s need:
• Knowing the proper names for the parts of the SQL
statement is the first building block of SQL learning. It
serves to:
– Help you understand their functions, different forms,
and relationships,
– Help you plan your data selection process,
– Help you talk about problems more precisely to
the DBAs, data warehouse staff, and/or the IT
group (who are immersed in buckets of data and
often need brought into your ‘problem arena’)
SQL Concepts Explained
• Table – a collection of data, a SAS data set,
ORACLE table, or excel spreadsheet, for example
• Database – most often a collection of related
tables
• Columns – you can think in the Excel sense
• Rows – again, think Excel
• DBA – Data Base Analyst (your friend if you
behave, and stay away from Cartesian joins).
• Metadata – information about data, info about a
particular table for instance
At VT I have a superb group of data warehouse people who serve a
similar role for me but are more data and information literate than a
typical DBA.
Those types can be invaluable!!! (know how to kiss up)
Tables & Rows & Columns
(oh my!)
The 4 records below are a Table.
columns
rows
Seinfeld
Kramer
Benes
Costanza
T
T
T
P
M
M
F
M
Associate
Assistant
Professor
Instructor
PHD
MFA
MM
MARCH
Columns - think vertical…..rows - think horizontal
Lose your mind in a single query
The IR reality:
• Rapid demand for data from relatively new sources (like
warehouses and changing RDBMS),
• the ad-hoc specialty nature of many requests,
• your own novice SQL status,
• concerns about the logical relationships between tables
• and concerns about the cleanliness and completeness of data
all are legitimate reasons to:
• develop and test queries in iterative stages and steps, and to
• follow good programming practices like using indentation,
white spacing, and documentation
Introducing SQL
• SQL Stands for Structured Query
Language
• Typically built via as a SELECT-FROMWHERE construction
• Repeat after me… “SELECT, FROM,
WHERE”
• Again… “SELECT, FROM, WHERE”
• Is it SQL (see-kwil) or SQL (ess-que-ell)?
Who cares?
Select, From & Where Explained
• SELECT – is used to choose which
columns you desire, & in what order you
want them to appear
• FROM – references the table(s) that
contains the columns you need
• WHERE – is employed to select which
rows you want to include
Jumping Off to Select Statements
Sample group of faculty records
(FACULTY table)
Lname
tenure_code gender rank
SSN
---------------------------------------------------Catley
N
M
Lecturer
123456789
Davies
T
M
Assistant 987654321
Superman
P
M
Professor 111222333
Seinfeld
T
M
Associate 444555666
Kramer
T
M
Instructor 777888999
Benes
T
F
Professor 999888777
Costanza
N
M
Instructor 666555444
Newman
P
M
Assistant 333222111
Susan
N
F
Associate 123454321
Select all columns and rows from a table
Proc sql;
Select *
from faculty;
Quit;
Note there is no WHERE clause. SELECT and FROM are
required. All other statements, including WHERE, are
optional.
Seldom will you NOT use a WHERE clause, you typically are
looking for a subset of both rows and columns
Result set from prior query
Lname
tenure_code gender rank
SSN
---------------------------------------------------Catley
N
M
Lecturer
123456789
Davies
T
M
Assistant 987654321
Harriman
P
M
Professor 111222333
Seinfeld
T
M
Associate 444555666
Kramer
T
M
Instructor 777888999
Benes
T
F
Professor 999888777
Costanza
N
M
Instructor 666555444
Newman
P
M
Assistant 333222111
Susan
N
F
Associate 123454321
The * syntax with no WHERE clause results in all
columns and rows being selected. The * symbol
means select all columns.
The Where Clause
The where clause in SQL has 2 main functions.
• Filtering the rows chosen, i.e. providing
row selection criteria.
• Providing join commonality between 2 or
more tables (keys will be discussed later)
Filtering Rows – a Use of the WHERE Clause
SELECT lname,
tenure_code
FROM faculty
WHERE tenure_code = ‘T’;
SELECT is choosing AND ordering columns, and WHERE is
choosing which rows to keep.
SQL uses single quotes around text values (most database systems
will also accept double quotes). Numeric values should not be
enclosed in quotes.
Coding for Readability
• You will notice indentation used when we
write our SQL code. Is it syntactically
required?
• No -- it is not, but it makes the code much
easier to read and to understand!!
Result set of previous query
Lname
tenure_code
----------------------Davies
T
Seinfeld
T
Kramer
T
Benes
T
What’s one change we may wish to see in this result set?
We Now Want All Tenure-track
SELECT lname
,
tenure_code
FROM faculty
WHERE tenure_code in ( ‘T’,’P’)
ORDER BY lname;
Note the addition of an ORDER BY, and use of the IN
operator. Note also that we could override the default
ORDER with a DESC option, and move the Z’s to the top.
order by lname desc;
Result set after ordering by lname
Lname
tenure_code
---------------------Benes
T
Davies
T
Harriman
P
Kramer
T
Newman
P
Seinfeld
T
Sample group of faculty records
(FACULTY table)
Lname
tenure_code gender rank
dept
---------------------------------------------------Catley
N
M
Lecturer
Accounting
Davies
T
M
Assistant Biology
Superman
P
M
Professor Biology
Seinfeld
T
M
Associate Biology
Kramer
T
M
Instructor Accounting
Benes
T
F
Professor Accounting
Newman
N
F
Associate Accounting
Could We Get a List of Depts?
(from the faculty table)
SELECT dept
FROM faculty
ORDER BY dept ;
dept
-------------------Accounting
Accounting
Accounting
Accounting
Biology
Biology
Biology
How do we solve this inadequacy?
Distinct Keyword
select distinct dept
from faculty
order by dept ;
The DISTINCT keyword removes duplicate items from the result
set.
Jumping Off to Conditions
Conditions are optimal
The IR Professional’s need:
• Available logical operators, to select data precisely:
• IN and NOT IN (require comma separators)
• EXISTS, IS NULL, NOT EQUAL TO (<>) , LIKE (%)
Note: In some versions of SQL the <> operator may be written
as !=
Roadblock: if your intuition tells you that there is a better way to
construct your WHERE clause, then there likely is. This is where
your SQL friends, DBAs, and Data Warehouse contacts come in.
Ask them. Ask us, we’re an email or phone call away!
The LIKE Operator
The LIKE (%) condition is used to specify a search for a
pattern in a column.
Samples
The following SQL statement will return persons with first names that
start with a ‘P':
SELECT *
FROM Persons
WHERE FirstName LIKE ‘P%‘;
The following SQL statement will return persons with first names that
end with an 'a':
SELECT *
FROM Persons
WHERE FirstName LIKE '%a' ;
BETWEEN …. AND
The BETWEEN ... AND operator selects a range of data between two
values. These values can be numbers, text, or dates.
Roadblock: The BETWEEN...AND operator is treated differently in
different databases. NOT BETWEEN can be used as well.
SELECT *
FROM Persons
WHERE LastName NOT BETWEEN ‘Catley' AND ‘Clemente' ;
Jumping Off to Groups and
Functions
GROUP, HAVING, Functions
The IR Professional’s need:
•Grouping data allows flexible aggregation and summary
data. (an IR must!)
•SQL provides 2 related statements for summarizing data.
GROUP BY and HAVING.
•The HAVING clause is reserved for aggregate functions. It
is usually placed near the end of a SQL block which also
enhances code readability. (think of HAVING as a final
filter)
The Having Clause
Lname
gender
salary
department
---------------------------------------Davies
M
88000
Biology
Seinfeld
M
77000
Accounting
Catley
M
45000
Accounting
Benes
F
60000
Biology
Susan
F
70000
Biology
Wilson
F
45000
Accounting
Kelly
F
50000
Accounting
Given this table called FACULTY
We’d like to know which departments have a mean salary
for males or females above $50,000.
Sample of Having Clause
SELECT dept
,
gender
,
avg(salary)
FROM faculty
GROUP BY dept ,
gender
HAVING avg(salary) > 50000;
Results of prior query
dept
gender
----------------------------------------Accounting
M
61000
Biology
F
65000
Biology
M
88000
What can we do to improve this?
A couple quick improvements
SELECT dept
,
gender
,
count(*) as faculty
,
avg(salary) as Mean_salary
FROM faculty
GROUP BY dept ,
gender
HAVING avg(salary) > 50000
ORDER BY dept ,
gender;
Two summary stats used, COUNT and AVG
Result of prior Query
dept
gender faculty Mean_salary
------------------------------------------------------Accounting M
2
61000
Biology
F
2
65000
Biology
M
1
88000
Be careful!
SELECT dept
,
gender
,
avg(salary)
FROM faculty
HAVING avg(salary) > 50000
A having clause used with no group by…..any guesses?
Result of prior query
dept
gender
----------------------------Biology
M
62142.86
Accounting
M
62142.86
Accounting
M
62142.86
Biology
F
62142.86
Biology
F
62142.86
Accounting
F
62142.86
Accounting
F
62142.86
What a mess!!! GROUP BY is usually coupled with
HAVING.
Be Careful 2 (the sql sequel)
SELECT dept
,
gender
,
count(*) as faculty ,
avg(salary) as Mean_salary
FROM faculty
GROUP BY dept
HAVING avg(salary) > 50000
ORDER BY dept ,
gender ;
Grouping by dept alone may not be what you want?
Result of Prior Query
dept
gender
faculty
Mean_salary
------------------------------------------------Accounting
F
4
54250
Accounting
F
4
54250
Accounting
M
4
54250
Accounting
M
4
54250
Biology
F
3
72666.67
Biology
F
3
72666.67
Biology
M
3
72666.67
Always examine what you expect to get versus what you *DO* get !
Some of the Common Aggregate Functions
AVG(Column)
Returns the average value of a column
COUNT (Column) Returns the number of rows (without a NULL value) of a
column
COUNT(*)
Returns the number of selected rows
FIRST(Column)
Returns the value of the first record in the specified field
LAST(Column)
Returns the value of the last record in the specified field
MAX(Column)
Returns the highest value of a column
MIN(Column)
Returns the lowest value of a column
SUM(Column)
Returns the total sum of a column
Who has the Highest Salary in Each Department?
SELECT lname
,
dept
,
gender
,
salary
FROM faculty
GROUP BY dept
HAVING salary = max(salary)
ORDER BY dept ;
This is basically evaluating each row against a summary stat
from the query.
Results of Prior Query
lname
dept
gender salary
---------------------------------Seinfeld Accounting
M
77000
Davies
Biology
M
88000
Functions
SELECT dept
,
gender
,
count(*) as faculty
,
avg(salary) as Mean_salary
FROM salaries
WHERE substr(dept,1,1) = 'A'
GROUP BY dept ,
gender
HAVING avg(salary) > 50000
ORDER BY dept ,
gender ;
Table Aliases
• A table alias is a temporary alternate name for a table, a
nickname of sorts.
• You can specify table aliases in the FROM clause
• The AS keyword is optional, pros include maybe
enhanced readability, cons include the added keystrokes
• Table aliases are used in joins to qualify column names
so that the correct columns will be processed.
• Table aliases MUST be used when multiple (2 or more)
tables have the same column names, but are optional
otherwise
Storing Results
Proc sql;
Create table Engineering as
SELECT a.lname
,
a.retire_sum ,
b.dept
FROM faculty_age a ,
faculty_depts b
WHERE a.lname = b.lname
and college = ‘Engineering’;
Quit;
Note the use of table ‘aliases’ to assign a short
reference to each table being used in a join.
Jumping Off to Joins..
After the Break!
Cohorts vs. Survivors
The IR professional’s need:
• IR data selection generally operates on a
‘cohort’ model. Much SQL instruction is
geared toward developing ‘survivor’
datasets. While efficient, the survivor
model is not useful for many of our
purposes.
• We concatenate (union) rather than joining
datasets for many analytical purposes, like
year-to-year studies.
Cohorts vs. Survivors
Explicit Join form of the select statement
SELECT
Source1 variables , Source2 variables
FROM
Source1
JOIN
Source2
ON
Join key criteria (columns)
WHERE
Row selection criteria (rows)
Cohorts vs. Survivors
• A cohort group is the result of an
OUTER join of the Left, Right, or
Full type.
• A survivor group is the result of
an INNER join.
The Simple Explanation
Cohorts vs. Survivors
• Inner join – Retrieves only matching rows that “survive”
the join selection criteria.
• Outer Left join – The first dataset mentioned (left of the
join keyword) is the cohort and the second (right) is
transactional. All rows from the cohort are returned with
information, as available, added from the transaction
when rows from both tables match on the key column(s).
• Outer Right join – The second dataset mentioned is the
cohort and the first is the transactional.
• Outer Full join – The resulting table has cohort records
from both data sets, including “matching” rows as well
as “non-matching” rows.
Survivors
R
L
A
B
C
D
E
FR
SO
JR
SR
UN
INNER
JOIN
A
B
D
E
FR
SO
SR
UN
VA
MA
TX
VA
A VA
B MA
D TX
E VA
G AR
Cohorts
R
L
A
B
C
D
E
FR
SO
JR
SR
UN
OUTER
LEFT
JOIN
A FR VA
B SO MA
C JR
D SR TX
E UN VA
A VA
B MA
D TX
E VA
G AR
Cohorts
R
L
A
B
C
D
E
OUTER
RIGHT
JOIN
FR
SO
JR
SR
UN
A
B
D
E
G
FR
SO
SR
UN
VA
MA
TX
VA
AR
A VA
B MA
D TX
E VA
G AR
Cohorts
R
L
A
B
C
D
E
OUTER
FULL
JOIN
FR
SO
JR
SR
UN
A
B
C
D
E
G
FR
SO
JR
SR
UN
VA
MA
TX
VA
AR
A VA
B MA
D TX
E VA
G AR
Cohorts vs. Survivors
Implicit Join form of the select statement
*acquire admissions h.s. region tables;
SELECT
L.*, R.hs_name, R.hs_type, R.hs_recruit_region
FROM
admissions_applicant_200410 L ,
admissions_geo_2004 R
WHERE
L. high_school _ code = R.geo_hs_code
and
L.term_code = '200410'
No JOIN ON statement ! Can’t do a SAS outer join with this syntax…
Cohorts vs. Survivors
Select expanded to join three datasets
*acquire admissions cohort and h.s. region from freeze tables;
SELECT
*
FROM
admissions L
LEFT JOIN
gorsgeo
ResultRtable from first join
ON
L.sbgi_code = R.gorsgeo_code
LEFT JOIN
Admissions_officer A
ON
L.sbgi_code = A.sbgi_code_key
WHERE
L.term = '200410' and R.gorsgeo_status_ind IS NULL
Cohorts vs. Survivors
Concatenation of data tables
create table alldata as
select
a.pidm_key as pidm,
a.reports_frp
from ssn_cohort a
OUTER UNION
CORRESPONDING
select
b.pidm, b.reports_frp
from HHMI b ;
SQL assumes that your variable names
and data characteristics match…
1
2
3
4
4
5
6
1
2
3
4
4
5
6
Cohorts vs. Survivors
Non-duped concatenation of data tables
create table alldata as
select
a.pidm_key as pidm,
a.reports_frp
from ssn_cohort a
UNION CORRESPONDING
select
b.pidm, b.reports_frp
from HHMI b ;
All row information
must be the same…
1
2
3
4
4
5
6
1
2
3
4
5
6
Jumping Off to Join Statements
Keys to the kingdom
The IR professional’s need:
• You may need to manipulate variables to make
them successful join keys. Those variables may
not be the obvious primary key for the table in
which they reside, and they currently may not
have the necessary values for a successful join.
• There are attributes about data tables and keys
that can materially affect how quickly and
successfully you can use them in join
operations. Discovering this meta-data is worth
the time and effort it will take. Tools can help.
Keys to the kingdom
Let’s review the notion of keys
• Keys are always made up of variables
(columns), not observations (rows). Keys and
variables are always associated with the join
criteria, not the row selection criteria, in SQL.
• A key column is any variable that can reliably
join together tables of data that are somehow
related. Sometimes more than one column must
be combined to form the equivalent key column
for another table’s key. Occasionally, data must
be transformed in order to form an equivalent
key column.
Keys to the kingdom
Key and Variable Transformation
• The ‘select *’ syntax is only marginally useful.
While it saves you from typing in column names,
using it brings redundancy if you want to alias or
transform variables, and may result in system renamed variables.
• Use aliases to shorten, rename, or otherwise
better identify columns for joining to other tables.
• You may have to re-label an aliased variable in
your analysis package with a label command.
Keys to the kingdom
Key and Variable Transformation
• There are many arithmetic and text-related
operations that you can use on columns in the
selection process to make them match other
keys, or to transform them. Frequently used are:
• Substringing:
Select substr(course,1,4) as course_abbrev
• Scanning:
Select = scan(packlevlhr,2,'{') as hrs_earn
• Addition:
Select sat_verb+sat_math as sat_total
Key and Variable Transformation
• Any arithmetic operation including ():
Select (s_contrib+f_contrib)/(aid_awarddiscret_aid) as contrib_ratio
• Concatenation:
Select course_name || course_number ||
course_section as course_key
• Direct Substitution:
Select ‘Registered’ as student_status
Creating a New Column from Scratch
Proc SQL;
Select lname ,
‘tenure-track’ as faculty_type
from faculty
where tenure_code in ( ‘T’,’P’);
Quit;
Additional Select Control Syntax
Proc SQL;
Select lname label = ‘Last Name’ ,
‘tenure-track’ as faculty_type format = $25.
from faculty
where tenure_code in ( ‘T’,’P’);
Quit;
• Note: faculty_type could also be set to a single flag variable
such as ‘T’, and using a format = $1.
Jumping Off to Keys and Variable
Transformation
Keys to the kingdom
Using a case statement to transform a key
SELECT
course_number,
course_identification,
case when academic_period = '200410' then '20039'
when academic_period = '200420' then '20041'
when academic_period = '200510' then '20049'
else '?'
end as term,
actual_enrollment
FROM schedule_offering
WHERE
school in (‘UG' , ‘GA')
and faculty_sts = ‘TR'
Keys to the kingdom
Other case statements
case
when sexappl in (' ','N')
then ‘U’
else sexappl
end as gender,
case
when bethn IS NULL
then ‘NR’
else bethn
end as ethnic_desc,
case
when reg_dapp = ‘999’ AND
app_code <> ‘UNCL’ then ‘200’
when ug_appst = '200001' then '300‘
when ug_appst = '200006' then '400'
when lw_appst = '200007' then '500'
when gs_mapp = '200009' then ‘700‘
else '???‘
end as init_enroll_status
Keys to the kingdom
The notion of keys
• Tables can have primary keys. A primary key
has two special conditions – It is not allowed to
be empty (null) and it is not allowed to have
repeating values, which assists in referential
integrity. Not all tables have keys that meet
these criteria, nor should that be expected as a
norm in IR data joins.
• Keys can be indexed, which means the join
operation runs faster when those keys are
referenced. Conversely, if they are not
referenced, the join can be very, very slow.
Keys to the kingdom
Key findings
• You need to develop a plan for finding out about
primary keys and indexed key variables in large
data tables to avoid long execution times.
• Your data base analyst might help you with
information about “explain plans” and SQL
developer tools that may be located within your
primary analysis package or available
separately. TOAD is a highly useful tool for an
Oracle-based RDBMS.
Keys to the kingdom
Keys to the kingdom
Keys to the kingdom
Keys to the kingdom
Just Passing Through
The IR professional’s need:
• You may have direct connections through
SQLnet to the processor of your relational
database as well as the computing power of
your particular analysis package. Recognizing
this fact and managing that connection is
important for faster, and sometimes, any results.
• There may be differences between SQL
implementations in the two locations.
Just Passing Through
Remote connection or ‘Pass-through’
• The concept of ‘pass-through’ refers to
connections made between your computer
(and analysis application) and the server
where your RDBMS keeps all the data.
DATA
Just Passing Through
Remote connection or ‘Pass-through’
• When you join tables that are all on your
computer, joins are made promptly.
SQL
RESULT
Just Passing Through
Remote connection or ‘Pass-through’
• When you join tables that are all on your
server, joins are made rapidly.
SQL
RESULT
Just Passing Through
Remote connection or ‘Pass-through’
• When you join tables that are in both places,
joins are made s l o w l y
SQL
SQL
FCP
All join processing initially produces a full "Cartesian product".
Just Passing Through
‘Pass-through’ solution #1
• To the greatest extent possible, define
your first table as a select statement that
uses tables already on the server.
SQL
RESULT
Just Passing Through
‘Pass-through’ solution #1
SELECT test_score1 , test_score2 , test_score1 +
test_score2 as totsat, r.person_key
FROM
test_slot L
right join
admissions_application R
on
l.person_key = r.person_key
WHERE
r.term = 200510 and r.level = ‘UG’
Just Passing Through
‘Pass-through’ solution #2
• Ask your DBA for table space on your
server to transfer cohort files for joins
SQL
RESULT
Just Passing Through
‘Pass-through’ solution #3
select test_score1 as verbal,
test_score2 as math, person_key
from
test_slot
where person_key in (‘238540’, ‘342453’
‘223472’, ‘349432’, ‘546326’, ‘743324’)
• Most systems will allow 800~1000 values in
an ‘IN’ statement before balking.
Just Passing Through
Pass-through Awareness
• Proprietary RDBMS implementations of SQL
can differ in the way they store and compare
values, especially dates and null values. If
you know there are values which should join,
and they do not, check for ‘store and
compare’ issues.
• For example, Oracle-based date values and
operator symbols are different from those
used in SAS and MS_Access.
Just Passing Through
Pass-through awareness
• Where to_date(controlling_date,'DD-MON-RRRR')
= to_date('19-MAR-2001','DD-MON-RRRR')
• Where control_date LE ‘19-MAR-2001’
•
= <=
>=
<>
• Where stu_id in (‘23453’ ‘64354’ ‘43564’)
• Where stu_id in (‘23453’,‘64354’,‘43564’)
Just Passing Through
SAS Pass-through Code
proc sql;
connect to odbc as mydb (datasrc="ODSP" user=esdav2
password=xyz);
create table person as select * from
connection to mydb
(
select r.person_uid, tax_id, full_name_lfmi, birth_date,
nation_of_citizenship, visa_type_desc
from person l
right join
admissions_application r
on l.person_uid = r.person_uid
where r.academic_period = 200510
);
quit;
The power of sub-queries
The IR professional’s need:
• Using sub-queries reduces processing time by
reducing the number of rows available to join in
the main join.
• The Effective-Dated Row concept in many
RDBMS tables make using a sub-query an
effective way to select only valid rows.
• Using sub-queries may help you keep selections
on the server side of a pass-through connection.
The power of sub-queries
• AKA ‘inner query’ – all you need is a pair of parentheses!
• A sub-query is just a select query expression that is
nested within another query expression.
• They are particularly useful when forming cohorts for
study, where part of the cohort criteria is in a separate
table or tables from the main data of interest. They can
return a single value or multiple values to the main query.
• Often used in WHERE clauses and HAVING
expressions, but can occur just about anywhere.
Start slow and build on the use of non-correlated sub-queries.
Develop a small sample of your large tables, and test your sub-query
there. Then you can move on to correlated sub-queries.
The logic of sub-queries
SELECT
L.key , L.status , R.gender
FROM
Student L ,
Demogr R
WHERE
L.key = R.key
and
L.term = 200410
and R.visa = ‘F1’
L
R
JOIN
The logic of sub-queries
SELECT
L.key , L.status , R.gender
FROM
Student L ,
(SUBQUERY from Demogr) R
WHERE
L.key = R.key
and
L.term = 200410
and R.visa = ‘F1’
L
R
JOIN
The logic of sub-queries
SELECT
L.key , L.status , R.gender
FROM
Student L ,
(SUBQUERY from Demogr) R
WHERE
L.key = R.key
and
L.term = 200410
R
R
L
JOIN
The logic of sub-queries
SELECT
L.key , L.status , R.gender
FROM
Student L ,
(SELECT key, gender from
Demogr
WHERE visa = ‘F1’ ) R
WHERE
L.key = R.key
and
L.term = 200410
R
L
Notice that the subquery took the place of a table…
JOIN
The power of sub-queries
SELECT
s.person_key, m.*
FROM demographics m
WHERE
m.person_key =
( SELECT
person_key
from registration
where academic_period = 200420
and registration_ind = ‘Y’ ) s
and m.legacy_status in (‘M’, ‘F’, ‘P’)
Notice that this sub-query took the place of a where criteria equality…
The power of sub-queries
select a.fac_id ,
(a.age + a.yos) as retire_sum ,
b.dept
from faculty
a ,
faculty_depts b
where a.fac_id =
(select distinct academicfac.fac_id
from academicfac
where academicfac.serv_award = ‘Dist’)
and a.fac_id = b.fac_id
;
The sub-query first builds a table of one column, faculty ids, which
belong to faculty who have won at least one Distinguished Service
award. Then this sub-group joined to the first table, which is then
joined to the department table.
The power of sub-queries
Selecting effective-dated data
• Effective-dated rows occur in many places
in modern databases, like address tables,
term-domicile tables, and course
distribution requirement tables.
• An ED table contains one or more timedated rows per key variable, with a single
valid (active) row at a given time. There
may be rows with future time dates, and
not all time-periods have a separate row!
The power of CORRELATED sub-queries
Effective-dated CORRELATED sub-query
• SELECT
z.identity_key, A.student _status_desc,
FROM
identity z, student A
WHERE
z.identity_key = A.student _key
and
A.student _term_eff =
(
SELECT max(b. student _term_eff)
FROM student B
WHERE B.student _key = A.student _key
and B.student_term_eff <= 200510
)
Notice that this sub-query is a join equality in the where statement…
Jumping Off to Sub-queries
Recap of Concept Bridging Topics
1.
2.
3.
4.
5.
6.
7.
8.
Terminology is the primary key
Lose your mind in a single query
Conditions are optimal
Groups, having, functions
Cohorts versus survivors
Keys to the kingdom
Just passing through
The power of sub-queries
SQL for IR Staff
(You can do it…We hope we helped! ©)
Thanks for attending. This presentation is
available after 10/25/2008 at:
http://web.wm.edu/ir/conferencepres.html
[email protected]
Evan Davies
[email protected]
Dennis Catley
The College of
William and Mary
Chartered 1693
SAS is a registered trademark of SAS Institute Inc, Cary, NC 27513