Scraping the Web with SAS


www.OASUS.ca
Tom Kari
Tom Kari Consulting
TASS, March 7 2014
“Come out of the desert of ignorance to the OASUS of knowledge”
Google is wonderful, but…
• The first page is full of junk!
• I can’t tell how many pages I’m getting from each site.
• I KNOW the page I want is in here somewhere, how can I find it?
• I’m not using SAS when I use Google!
• How can I keep ALL the results to analyze?
The Basics
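data URL_Retrieval_Results;
    length HTML_Rec $32767;
    /* FILENAME is a global statement; the URL access method reads the page over HTTP */
    filename HTML_In url "http://www.dolphinsdance.ca";
    infile HTML_In lrecl=32767;
    input;                  /* read the next line of HTML into the input buffer */
    HTML_Rec = _infile_;    /* copy the whole buffer into a character variable */
run;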
The Process
1. What goes in the reference to Google?
2. Get results from Google
3. How do I find the web sites listed by Google?
4. Extract the web sites
5. Figure out how to get 1000 web site listings
6. Post-process the results (SAS data management)
1. How to send a search to Google?
• In Internet Explorer:
  • F12 to open Developer Tools
  • Network → Start Capturing
  • Enter your search string
  • Stop Capturing
  • Dig around in the results
http://www.google.ca/s?gs_rn=14&gs_ri=psy-ab&cp=41&gs_id=a&xhr=t&q=beautiful%20vacation%20resort%20puerto%20vallarta&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv.47008514,d.dmg&fp=5ad817295c2c0080&biw=1123&bih=374&tch=1&ech=1&psi=8xOlUdWjBOT84AO1iYCwDw.1369773041400.1
http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta&start=1
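The short form of the URL can be assembled from a plain-text search phrase. The following is a minimal sketch, not code from the talk: the variable names `phrase`, `query`, and `url` and the macro variable `SearchURL` are made up for illustration, and only blanks are encoded (other special characters would need further handling).

```sas
/* Sketch: build the short Google search URL from a search phrase. */
data _null_;
    length phrase $200 query $600 url $800;
    phrase = 'beautiful vacation resort puerto vallarta';
    /* Collapse repeated blanks, trim, then turn each blank into a plus sign */
    query = translate(strip(compbl(phrase)), '+', ' ');
    /* Single quotes keep SAS from treating &start as a macro variable */
    url = cats('http://www.google.ca/search?q=', query, '&start=1');
    put url=;
    /* Hypothetical macro variable holding the finished URL */
    call symputx('SearchURL', url);
run;
```

Because the stored value contains an ampersand, a later step would probably want to reference it as `"%superq(SearchURL)"` rather than `"&SearchURL"`, so that the macro processor does not try to resolve `&start`.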
2. Get Results from Google
data GoogleResults;
    length HTML_Rec $32767;    /* 32,767 bytes: the longest character variable SAS allows */
    filename HTML_In url
        "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";
    infile HTML_In lrecl=32767;
    input;
    HTML_Rec = _infile_;
run;
3. How do I find the web sites listed by Google?
<div id="res"><div id="topstuff"></div><div id="search"><div id="ires"><ol><li class="g"><h3 class="r"><a href="/url?q=http://www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-Puerto_Vallarta.html&amp;sa=U&amp;ei=bhmlUbyUHPKw0QHk1YFg&amp;ved=0CCYQFjAAOAE&amp;usg=AFQjCNFLqCMjy4b4raYjbA8nvqHjJARGlA">
Dreams <b>Puerto Vallarta Resort</b> &amp; Spa - All-inclusive <b>Resort</b> Reviews <b>...</b></a></h3><div class="s"><div class="kv" style="margin-bottom:2px"><cite>www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_ <b>Puerto</b>_<b>Vallarta</b>_<b>Resort</b>_Spa-<b>Puerto</b>_<b>Vallarta</b>.html</cite><span class="flc"> - <a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:gaglProuhbkJ:http://www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-
3. How do I find the web sites listed by Google? (cont’d)
The magic of PRX routines!
"Pattern matching enables you to search for and extract multiple matching
patterns from a character string in one step. Pattern matching also enables
you to make several substitutions in a string in one step. You do this by using
the PRX functions and CALL routines in the DATA step.
For example, you can search for multiple occurrences of a string and replace
those strings with another string. You can search for a string in your source file
and return the position of the match. You can find words in your file that are
doubled."
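A small sketch of the operations the quotation mentions, run on a made-up string rather than on Google output. The variable names are illustrative only; `PRXMATCH` and `PRXCHANGE` are the standard DATA step functions.

```sas
/* Sketch: find a doubled word, then remove the duplicate. */
data _null_;
    length text fixed $60;
    text = 'the the resort in Puerto Vallarta';
    /* PRXMATCH returns the position of the first match, or 0 if there is none */
    pos = prxmatch('/\b(\w+) \1\b/', text);
    /* PRXCHANGE with -1 replaces every occurrence; $1 is the captured word */
    fixed = prxchange('s/\b(\w+) \1\b/$1/', -1, text);
    put pos= fixed=;
run;
```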
4. Extract the web sites
data GoogleHTMLResult;
    retain prxid;
    /* Compile the pattern once, on the first iteration */
    if _n_=1 then
        prxid=prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o');
    length HTML_Rec $32767;
    filename HTML_In url
        "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";
    infile HTML_In lrecl=32767;
    input;
    HTML_Rec = _infile_;
    /* Locate the first match in this record; HTML_Pos is 0 when there is none */
    call prxsubstr(prxid,HTML_Rec,HTML_Pos,HTML_Len);
    if HTML_Pos > 0 then CiteData=substr(HTML_Rec,HTML_Pos,HTML_Len);
    output;
run;
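`CALL PRXSUBSTR` reports only the first match in a record. When one line of HTML holds several result links, a loop over `CALL PRXNEXT` can pull out each of them. The sketch below is an illustration under assumptions, not the talk's code: it uses a hand-made two-link string with `example.com` URLs in place of a live Google page, and the data set name `AllMatches` is invented.

```sas
/* Sketch: extract every matching URL from one record with CALL PRXNEXT. */
data AllMatches;
    length HTML_Rec $500 CiteData $200;
    prxid = prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/');
    /* Made-up stand-in for a line of Google result HTML */
    HTML_Rec = '<h3 class="r"><a href="/url?q=http://example.com/a&amp;sa=U">A</a></h3>'
            || '<h3 class="r"><a href="/url?q=http://example.com/b&amp;sa=U">B</a></h3>';
    start = 1;
    stop = length(HTML_Rec);
    /* PRXNEXT moves start past each match; pos is 0 when no match remains */
    call prxnext(prxid, start, stop, HTML_Rec, pos, len);
    do while (pos > 0);
        CiteData = substr(HTML_Rec, pos, len);
        output;
        call prxnext(prxid, start, stop, HTML_Rec, pos, len);
    end;
    keep CiteData;
run;
```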
5. Figure out how to get 1000 web site listings
Quirks to remember
• Many characters can’t appear in Google search strings, so they must be encoded (spaces to +, etc.)
• Ampersands in your URL need %nrstr; otherwise SAS treats them as macro variable references
• To use a new URL infile in SAS, you need a new data step. This is easy with a macro loop.
• Every now and then it fails: “ERROR: Invalid reply received from the HTTP server. Use the debug option for more info.” Beats me!
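The macro loop mentioned above might look roughly like the following. This is a hedged sketch, not the talk's “Example 4”: the macro name `GetPages`, the data sets `OnePage` and `AllPages`, and the assumption that the `start` parameter advances by 10 per page are all mine, and whether Google answers such requests is not something the slides settle.

```sas
/* Sketch: one FILENAME and one data step per results page. */
%macro GetPages(pages=3);
    %local p start;
    %do p = 1 %to &pages;
        /* Assumed paging scheme: 0, 10, 20, ... */
        %let start = %eval((&p - 1) * 10);
        /* %nrstr protects the literal "&start=" and the real &start follows it */
        filename HTML_In url
            "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start=)&start";
        data OnePage;
            length HTML_Rec $32767;
            infile HTML_In lrecl=32767;
            input;
            HTML_Rec = _infile_;
            Page = &p;      /* remember which results page the record came from */
        run;
        /* Accumulate the pages; AllPages is created on the first pass */
        proc append base=AllPages data=OnePage;
        run;
        filename HTML_In clear;
    %end;
%mend GetPages;

%GetPages(pages=3)
```

With ten listings per page, 1000 listings would mean `pages=100`. If `AllPages` is left over from an earlier run, it would need to be deleted first, since `PROC APPEND` adds to whatever is already there.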
5. Figure out how to get 1000 web site listings (cont’d)
Code is in “Example 4 Extract 1000 URLs”
6. Post-process the results
• Count how many times each URL appears
• For each unique URL, retain the page and index where it first appears
• Create a nice-looking HTML page
• Code is in “Example 5 Post-processed”
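The first three steps could be sketched as below. This is not the talk's “Example 5”; the data set `Extracted`, its variables `URL`, `Page`, and `Idx`, the sample rows, and the output file name are all hypothetical stand-ins for whatever the extraction step produces.

```sas
/* Hypothetical extraction output: one row per listing */
data Extracted;
    length URL $200;
    input URL $ Page Idx;
    datalines;
http://example.com/a 1 1
http://example.com/b 1 2
http://example.com/a 2 4
;
run;

/* Sort so the earliest appearance of each URL comes first */
proc sort data=Extracted;
    by URL Page Idx;
run;

/* One row per URL: occurrence count plus page and index of first appearance */
data URL_Summary;
    set Extracted;
    by URL;
    retain FirstPage FirstIdx;
    if first.URL then do;
        Count = 0;
        FirstPage = Page;
        FirstIdx = Idx;
    end;
    Count + 1;
    if last.URL then output;
    keep URL Count FirstPage FirstIdx;
run;

/* Write the summary as an HTML page */
ods html file='url_summary.html';
proc print data=URL_Summary noobs;
run;
ods html close;
```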
Appendix B: PRX parse strings
prxid=prxparse('parse string');
/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o
Callouts on the slide: outer control, non-captured group, any-of, one or more, as-is, escaped, grouping
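One way to read the parse string is to build it from its parts. The sketch below is my own decomposition, not the slide's: the variable names and the sample HTML are invented, and the mapping of parts to the slide's callouts is a best guess. As transcribed here, the character class contains no hyphen or percent sign, so the sample URL avoids both.

```sas
/* Sketch: assemble the pattern from labelled pieces and try it on a sample. */
data _null_;
    length lookbehind urlchars lookahead $100 sample $300 found $200;
    /* Text that must precede the match but is not returned */
    lookbehind = '(?<=<h3 class="r"><a href="\/url\?q=)';
    /* One or more characters from the set in brackets; '' is a quoted apostrophe */
    urlchars = '[[:alnum:]\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+';
    /* Text that must follow the match but is not returned */
    lookahead = '(?=&amp)';
    /* The slashes are the outer delimiters of the pattern */
    prxid = prxparse(cats('/', lookbehind, urlchars, lookahead, '/'));
    sample = '<li class="g"><h3 class="r"><a href="/url?q=http://www.example.com/page.html&amp;sa=U">';
    call prxsubstr(prxid, sample, pos, len);
    if pos > 0 then found = substr(sample, pos, len);
    put pos= len= found=;
run;
```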