Data-Mining the Web Using Perl Burt L. Monroe Director, Quantitative Social Science Initiative Department of Political Science The Pennsylvania State University.

Data-Mining the Web
Using Perl
Burt L. Monroe
Director, Quantitative Social Science Initiative
Department of Political Science
The Pennsylvania State University
Data-Mining the Web

Examples
• Election Returns in Luxembourg
Luxembourg Official Election Results, 2004
http://qssi.psu.edu/files/luxembourg.pl
• Parliamentary Speech
The Congressional Record
How’d You Do That?

There are several programming languages
with “straightforward” facilities for doing
this. Most notably,
• Perl
• Python
• Java

I’m going to talk about Perl, because
• it’s the most established
• it’s the one I know

It appears that Python may be preferable,
but that’s for someone else to say.
What’s Perl?


Open source (free / flexible / extensible / a little
wild and woolly – like Linux, R) programming
language.
It is very very good at processing text.
• note, webpages are just texts.
• note, datasets (like a flat spreadsheet or Stata file) are
just texts.
• Social scientists might have some use for turning one
into the other, no?

It has very useful facilities for building
• Spiders
• Scrapers
• (and “agents”, “robots”, “crawlers”, etc.)
What’s a Spider?


A spider is a program designed to
automatically gather webpages.
If, for example, you want to
automatically download all of the
speeches delivered in Congress
today – without manually clicking on
every one, cutting and pasting, etc.
– you might want to build a spider.
What’s a scraper?


A scraper (or “screen-scraper”)
extracts the information you want –
whatever you consider to be data –
from a given webpage.
If you want to know who said
“health” and how many times, you
might want to build a scraper.
BEWARE!

Spiders (and other similar types of programs –
“robots”, “crawlers”) can be put to nefarious use:
• appropriating copyrighted materials
• extracting email addresses for spammers
• overwhelming servers to create “denial of service”
• generally violating a site’s “terms of service” or
“acceptable use policy”
If you are not careful to use legal and ethical
good practices, you can
• be denied access to a website altogether
• get yourself or the university sued or even subjected to
criminal penalties
Perl


Open-source
Cross-platform
• (Windows – I recommend “ActivePerl” from
http://www.activestate.com)

There are many websites with resources:
• http://www.cpan.org (Comprehensive Perl
Archive Network)
• http://www.perlmonks.org (PerlMonks)
• http://www.perl.org
• http://perl.oreilly.com (O’Reilly Publishing)

Lots of mailing lists, etc.
Books

Basics of Perl
• The best books are put out by O’Reilly Publishing and
are generally known by the animal on the cover.
• Learning Perl (the Llama)
or, Learning Perl on Win32 Systems (the Gecko)
• Programming Perl (the Camel)

Web-mining
• Perl & LWP (the Blesbok, apparently)
• Spidering Hacks

These books, and some others, are or will be
available in the “QuaSSI Library” (in Pond 216).
Running Perl

For machines with approved ActivePerl
installations in Pond ...
• Perl is located in c:/Perl/

For today,
• we will operate entirely in the directory c:/Perl/eg/
• To get there,
open Programs -> Accessories -> Command Prompt
At the prompt, type c:
Type cd \Perl\eg
(In your particular installation, or in a Mac, or
something like Unix on high performance
computing, these details will be different.)
The First Perl Program

Go to the QuaSSI website for the example
scripts for today’s workshop:
• http://qssi.psu.edu/files/howdy.pl




Right-click on the first script, “howdy.pl”,
and save it to c:\Perl\eg\
Open up the text-editor WinEdt (you could
use almost anything) and then open
howdy.pl
That’s a complete Perl program.
Note: that’s all a program is – a text file.
Running a Perl Program



Go back to your command prompt.
Type perl -w howdy.pl
(The -w tells perl to give you
warnings about what might be
wrong if the program is broken.
It goes before the script name.)
Modifying a program






Go back to WinEdt
Edit the text between the quotation
marks to say something new
Click File -> Save
Go back to the command prompt
Hit the up arrow (to get the last
command, perl -w howdy.pl)
Look at that – you’re a programmer!
Break the program





Go back to WinEdt
Delete the semicolon at the end of
the line
Save the file
Go back to the command prompt and
run the program, with -w, again
What happened?
Perl at 30,000 feet


Much of the next set of slides is
stolen shamelessly from Andy
Lester’s “Perl at 10,000 Feet” at
www.petdance.com
(I’m skipping even more than he
did.)
Some generalities about Perl




Statements in Perl are, or usually can be,
constructed in a fairly natural English-like
way.
There are many ways to do any one thing.
The syntax can be offputting and hard to
read, especially at first. It is easy to
“obfuscate” Perl code and this is
sometimes done intentionally.
Main syntax rule: end every statement with ;
Data Types






Scalars
Arrays and Lists
Hashes
References
Filehandles
Objects
Scalars

Numbers
• Generally decimal floating point
• (Can be made integer, octal,
hexadecimal)

Strings
• Can contain any character
• Can be empty (the null string): ""
• Can be arbitrarily large
Strings

Single-quoted
• characters are as shown, with only two exceptions:
a single quote in a single-quoted string requires \'
a backslash in a single-quoted string requires \\
Double-quoted
• the string is interpolated: variables and control sequences are replaced by their values.

For example
• $foo = "myfile";
• $datafile = "$foo.txt";
• will result in the variable $datafile holding the string "myfile.txt"

Another example
• print 'Howdy\n'; will print:
 Howdy\n
• print "Howdy\n"; will print
 Howdy
• (\n is a control sequence, standing for “new line”).
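A minimal sketch of the two quoting styles side by side. The variable $literal is added here only for contrast; it is not in the original slides.

```perl
use strict;
use warnings;

my $foo      = "myfile";
my $datafile = "$foo.txt";      # double quotes: $foo is interpolated
my $literal  = '$foo.txt';      # single quotes: taken exactly as typed

print "$datafile\n";            # prints myfile.txt
print $literal, "\n";           # prints $foo.txt
print 'It\'s a backslash: \\', "\n";   # the two single-quote exceptions
```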
Scalar operators

Math
• *, /, % (for modulo), ** (for exponentiation),
etc.

Strings
• x to repeat the thing on the left
"b" x 10 gives "bbbbbbbbbb"
• . concatenates strings
("na" x 16) . " Batman!" gives ...
Perl knows to convert when mixing these
two types:
• "3" * 4 gives 12
• "3" . 4 gives "34"
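The operators above, gathered into one runnable sketch. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $power  = 2 ** 10;                    # exponentiation: 1024
my $rest   = 17 % 5;                     # modulo: 2
my $row    = "b" x 10;                   # repetition: bbbbbbbbbb
my $theme  = ("na" x 16) . " Batman!";   # repetition, then concatenation
my $number = "3" * 4;                    # string used as a number: 12
my $string = "3" . 4;                    # number used as a string: 34

print "$power $rest $row\n$theme\n$number $string\n";
```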
Comparing Scalars






Comparison        Numeric   String
Equal             ==        eq
Not equal         !=        ne
Less than         <         lt
Greater than      >         gt
Less / equal      <=        le
Greater / equal   >=        ge

8 < 25        TRUE!
"8" lt "25"   FALSE!
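A short sketch of why the two sets of operators matter. The sorting operators <=> (numeric) and cmp (string) are not on the slide; they are added here because sorting is where the difference usually bites.

```perl
use strict;
use warnings;

my $numeric = (8 < 25)      ? "true" : "false";   # compared as numbers
my $string  = ("8" lt "25") ? "true" : "false";   # compared character by character

my @values         = (8, 25, 100);
my @sorted_numeric = sort { $a <=> $b } @values;  # 8 25 100
my @sorted_string  = sort { $a cmp $b } @values;  # 100 25 8

print "8 < 25 is $numeric; \"8\" lt \"25\" is $string\n";
print "numeric: @sorted_numeric\nstring:  @sorted_string\n";
```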
Variables


A sign (“sigil”), followed by a letter or underscore, followed by
letters, digits, or underscores.
Sign determines the type:
• $foo is a scalar
• @foo is an array
• %foo is a hash

Variables default to global (they apply in all parts of your
program). This can be problematic.
• my $var declares a variable that exists only within the current
“block” of code. This is the usual construction.
• local $var is different: it gives a global variable a temporary
value that lasts until the current block ends.
• the very common use strict; at the beginning of code requires
every variable to be declared (creates more
syntax errors, but prevents more whoppers that could blow
everything up.)
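A small sketch of block scope with my under use strict. The variable $count and the misspelling in the comment are made up for illustration.

```perl
use strict;
use warnings;

my $count = 1;                 # declared with my, as use strict requires

{
    my $count = 99;            # a separate variable, visible only in this block
    print "inside the block: $count\n";
}

print "outside the block: $count\n";   # the outer variable is unchanged

# With use strict in force, a typo such as
#     $cuont = 2;
# stops the program at compile time instead of quietly creating a new global.
```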
Lists and Arrays





A list is an ordered set of (usually)
scalars.
An array is a variable holding a list.
my @foo = (1,2,3);
my @bar = ("elephant", 3.14);
Can be constructed as lists of scalar
variables:
• my @data = ($name, $address, $SSN);
Using Arrays

Elements are indexed, from 0.
• my @animals = ("frog", "bear", "elephant");
• print $animals[2]; # prints elephant
• Note: element is a scalar, so $ rather than @

Subsections are “slices”.
• my @mammals = @animals[1,2];

Lots of functions for
• using as a stack (moving things on and off the right or left side
of the array).
• sorting
• joining two arrays
• splitting a scalar string into an array



my $sentence = "This is my sentence.";
my @words = split(" ", $sentence);
# now @words contains ("This", "is", "my", "sentence.")
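The array functions listed above, sketched in one place. The extra animal names are made up for illustration.

```perl
use strict;
use warnings;

my @animals = ("frog", "bear", "elephant");

push @animals, "zebra";            # add to the right end
my $first = shift @animals;        # remove from the left end: frog
unshift @animals, "aardvark";      # add to the left end
my $last = pop @animals;           # remove from the right end: zebra

my @sorted   = sort @animals;                 # alphabetical order
my @combined = (@sorted, "newt", "owl");      # lists flatten when joined
my $line     = join(", ", @combined);         # array to string
my @words    = split(" ", "This is my sentence.");

print "$line\n";
print scalar(@words), " words; the last is $words[-1]\n";
```

Note that split on whitespace leaves punctuation attached, so the last word is still "sentence." with its period.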
Programming Controls

Control structures
• if / elsif / else (Perl has no “then”)
• while
• do {} while
• do {} until
• for ()
• foreach () # loops over a list
Errors / warnings
• die "message" kills the program and prints
message.
• warn "message" prints message and keeps
going.
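A sketch combining foreach, if / elsif / else, while, and warn. The ages and the three counters are made up for illustration.

```perl
use strict;
use warnings;

my @ages = (39, -2, 17, 65);
my ($minors, $adults, $seniors) = (0, 0, 0);

foreach my $age (@ages) {
    if ($age < 0) {
        warn "Skipping impossible age $age\n";   # report and keep going
        next;
    }
    elsif ($age < 18) { $minors++ }
    elsif ($age < 65) { $adults++ }
    else              { $seniors++ }
}

my $n = 0;
while ($n < 3) { $n++ }            # repeats until the condition fails

print "minors: $minors, adults: $adults, seniors: $seniors\n";
```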
Hashes


“Associative arrays”
A set of
• values (any scalar), indexed by
• keys (strings)

Example
• my %info;
• $info{ "name" } = "Burt Monroe";
• $info{ "age" } = 39;

With hashes and arrays you can create almost
any arbitrary data structure (even arrays of
arrays, arrays of hashes, hashes of arrays, etc.)
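A sketch of a simple hash and a hash of arrays. The party names and vote counts are made up for illustration, and the square brackets (which build a reference to a list) go slightly beyond what the slides cover.

```perl
use strict;
use warnings;

my %info;
$info{"name"} = "Burt Monroe";
$info{"age"}  = 39;

# A hash of arrays: each key points to a reference to a list.
my %votes = (
    "Party A" => [120, 95, 210],
    "Party B" => [80, 130, 45],
);

my %total;
foreach my $party (sort keys %votes) {
    $total{$party} += $_ for @{ $votes{$party} };   # sum that party's list
    print "$party: $total{$party}\n";
}

print "No email stored\n" unless exists $info{"email"};
```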
File Handling


open() function opens a file for
processing.
Prefix the filename to define how
• "<" for input from existing file (read)
• ">" to create for output (write; replaces any existing file)
• ">>" to append to a file (that may not yet
exist)
open (IN, "<myfile.txt") or die
"Can't open myfile.txt: $!";
Can then use <> to read from the file. For the
above, that would be <IN>.
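A sketch of all three modes in sequence: write, append, then read back. The file name is hypothetical, and the script deletes the file when it finishes.

```perl
use strict;
use warnings;

my $file = "myfile_sketch.txt";    # hypothetical file name

open(OUT, ">$file") or die "Can't create $file: $!";
print OUT "first line\n";
print OUT "second line\n";
close(OUT);

open(OUT, ">>$file") or die "Can't append to $file: $!";
print OUT "third line\n";
close(OUT);

my @lines;
open(IN, "<$file") or die "Can't open $file: $!";
while (my $line = <IN>) {          # <IN> returns one line per pass
    chomp $line;                   # strip the trailing newline
    push @lines, $line;
}
close(IN);

print scalar(@lines), " lines read; last was '$lines[-1]'\n";
unlink $file;                      # clean up the scratch file
```

Newer Perl code often uses the three-argument form instead, open(my $in, "<", $file), which keeps the mode separate from the file name.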
Matching string patterns using
regular expressions




This is where much of the power of Perl lies.
m/pattern/ will check the default variable ($_) for
pattern.
$var =~ m/pattern/; will check $var for pattern.
If the pattern is in $var, then
• $var =~ m/pattern/ is TRUE.

If you “group” part of the pattern and it is present,
• $var =~ m/(pattern)/ is true, AND now a variable named $1
contains the text that the group matched.
• Group more pieces of the pattern and the matches are stored
in $2, $3, etc.

This only grabs the *first* match. To grab all, say
• my @matches = ($var =~ m/(pattern)/g);
• This will store every match in the array @matches.
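A sketch of a plain match, a grouped match, and a /g match. The sentence in $var is made up for illustration.

```perl
use strict;
use warnings;

my $var = "Mr. Smith said health 3 times; Ms. Jones said health 12 times.";

if ($var =~ m/health/) {
    print "The text mentions health\n";
}

# Two groups: the first capture lands in $1, the second in $2.
my ($who, $times);
if ($var =~ m/(\w+) said health (\d+) times/) {
    ($who, $times) = ($1, $2);
    print "$who: $times\n";
}

# With /g in list context, every match of the group is returned.
my @matches = ($var =~ m/(\d+)/g);
print "All numbers: @matches\n";
```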
What’s a “regular expression”?

Combination of
any literal   character, number, etc.
.             any single character
*             zero or more of the previous
+             one or more of the previous
?             zero or one of the previous
[aeiou]       character class – this is the vowels
^             beginning of the line
$             end of the line
\b            word boundary
\d \D         digit / non-digit
\s \S         space / non-space
\w \W         word character / non-word character
|             or – match this or that
()            grouping
See handout for more.
Examples





Romeo|Juliet                 “Romeo” or “Juliet”
\d\d\d-\d\d\d\d              a phone number
(\d\d\d-)?\d\d\d-\d\d\d\d    phone #, maybe w/ area code
\b[aeiou]\w+                 a word starting w/ a vowel
\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b    email address (needs case-insensitive matching to catch lowercase)
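A sketch applying three of these patterns to a made-up line of text. Two small departures from the slide, both assumptions of this sketch: the area-code group is written (?:...) so it groups without capturing, and the @ is escaped as \@ to keep Perl from reading it as an array.

```perl
use strict;
use warnings;

my $text = "Romeo: call 555-1234. Juliet: call 814-555-9876 "
         . "or write to juliet\@example.org.";

my @names  = $text =~ m/(Romeo|Juliet)/g;
my @phones = $text =~ m/((?:\d\d\d-)?\d\d\d-\d\d\d\d)/g;

# The /i flag makes the A-Z classes match lowercase letters too.
my @emails = $text =~ m/(\b[A-Z0-9._%-]+\@[A-Z0-9.-]+\.[A-Z]{2,4}\b)/gi;

print "names:  @names\n";
print "phones: @phones\n";
print "emails: @emails\n";
```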
Modules


Hundreds of modules / packages
available through CPAN.
ActivePerl gives a GUI for installing
them in its “Perl Package Manager”.
A basic Perl example

Counting words.
• counter1.pl
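The contents of counter1.pl are not shown here, so the following is only a sketch of how a word counter might look, using a hash keyed by word. The sample text is made up, and the s/// substitution goes slightly beyond the operators covered earlier.

```perl
use strict;
use warnings;

my $text = "Health care costs rose. Health, said the senator, is wealth.";

my %count;
foreach my $word (split(" ", lc $text)) {
    $word =~ s/[^a-z']//g;          # strip punctuation, keep letters
    next if $word eq "";
    $count{$word}++;
}

# Most frequent first; ties broken alphabetically.
foreach my $word (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$word\t$count{$word}\n";
}
```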
Grabbing from the web


The basic idea is simply to have Perl
act as an “agent”, in the way a
browser like Explorer or Firefox does
-- requesting and interpreting
webpages.
There are a few basic modules that
can do this.
LWP::Simple

lwpsimpleget.pl
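The contents of lwpsimpleget.pl are not shown here, so this is only a sketch of the idea, assuming LWP::Simple is installed from CPAN. It requests nothing unless a URL is given on the command line; the script name in the usage message is hypothetical.

```perl
use strict;
use warnings;

my $url = shift @ARGV;          # nothing is requested unless a URL is given

if (defined $url) {
    require LWP::Simple;                 # CPAN module, loaded only when needed
    my $html = LWP::Simple::get($url);   # returns undef on failure
    die "Could not fetch $url\n" unless defined $html;
    printf "Fetched %d characters from %s\n", length($html), $url;
}
else {
    print "Usage: perl fetch_sketch.pl URL\n";
}
```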
LWP::UserAgent


More elaborate than LWP::Simple.
I’m going to skip that one today, but
it’s covered in detail in the main
books
• Perl & LWP
• Spidering Hacks

Pretty much all of the functionality
has been wrapped more intuitively
into ...
WWW::Mechanize

mechanizeget.pl
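The contents of mechanizeget.pl are not shown here, so this is only a sketch, assuming WWW::Mechanize is installed from CPAN. It fetches one page and lists the links on it, and does nothing without a URL on the command line; the script name in the usage message is hypothetical.

```perl
use strict;
use warnings;

my $url = shift @ARGV;          # nothing is requested unless a URL is given

if (defined $url) {
    require WWW::Mechanize;     # CPAN module, loaded only when needed
    my $mech = WWW::Mechanize->new(autocheck => 1);   # die on failed requests
    $mech->get($url);
    my $title = $mech->title;
    print "Title: ", (defined $title ? $title : "(none)"), "\n";
    foreach my $link ($mech->links) {
        print $link->url_abs, "\n";      # each link as an absolute URL
    }
}
else {
    print "Usage: perl mech_sketch.pl URL\n";
}
```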
Scraping


At its base, this is just extracting
information from the page(s) you
download.
Simple example:
• freshair.pl
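The contents of freshair.pl are not shown here. As a sketch of the general idea, the following pulls two columns out of a made-up HTML fragment that stands in for a downloaded page. The table layout, class names, and numbers are all assumptions of the sketch.

```perl
use strict;
use warnings;

# A made-up fragment standing in for a downloaded page.
my $html = <<'HTML';
<table>
<tr><td class="commune">Commune A</td><td class="votes">31,402</td></tr>
<tr><td class="commune">Commune B</td><td class="votes">9,870</td></tr>
</table>
HTML

my @rows;
while ($html =~ m{<td class="commune">(.*?)</td><td class="votes">([\d,]+)</td>}g) {
    my ($commune, $votes) = ($1, $2);
    $votes =~ tr/,//d;                  # 31,402 becomes 31402
    push @rows, "$commune\t$votes";     # one tab-separated data row
}

print "$_\n" for @rows;
```

The (.*?) group is a non-greedy match: it takes as little text as possible, so it stops at the first closing tag rather than the last.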
Your agent can interact ...


For example, what if the webpage
involves a form ...
Example
• abstracts.pl

You can authenticate with username
and password, run through proxy
servers, and so on.
Spiders

Type 1 Requester
• Requests a few items with known urls from a website.

Type 2 Requester
• Requests a few items, then requests (some set of) pages to
which those items link.

Type 3 Requester
• Starts at a given url, and then requests everything linked,
everything linked by that, etc. at the same host server. The
idea here is usually to download an entire website.

Type 4 Requester
• Starts at a given url, requests everything linked anywhere,
everything linked by that, etc. until it, perhaps, visits the
entire web.

YOU – I am talking to YOU – in all likelihood have no
business writing Type 3 or Type 4 spiders. These can easily
go seriously awry causing mayhem of many sorts. Write
only spiders with known finite scope.
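One way to keep a Type 2 requester's scope finite, sketched on a made-up page that stands in for one already downloaded: follow links only on one host, never the same link twice, and never more than a fixed number. The host name, links, and cap are hypothetical, and nothing is actually requested.

```perl
use strict;
use warnings;

my $host      = "www.example.org";     # hypothetical site
my $max_pages = 5;                     # hard cap on requests

# Stand-in for a page that has already been downloaded.
my $page = <<'HTML';
<a href="http://www.example.org/speech1.html">One</a>
<a href="http://www.example.org/speech2.html">Two</a>
<a href="http://elsewhere.example.com/ad.html">Ad</a>
<a href="http://www.example.org/speech1.html">One again</a>
HTML

my (%seen, @queue);
while ($page =~ m{href="(http://([^/"]+)[^"]*)"}g) {
    my ($link, $link_host) = ($1, $2);
    next unless $link_host eq $host;   # stay on the one site
    next if $seen{$link}++;            # never request a page twice
    last if @queue >= $max_pages;      # known, finite scope
    push @queue, $link;
}

foreach my $link (@queue) {
    print "would request $link\n";
    # sleep 1;   # when really fetching, pause between requests
}
```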
Back to the Luxembourg Miner

Commune-level election results from
Luxembourg.
• luxembourg.pl
More on Scraping


All of the examples above scrape / parse using
regular expressions.
More structured data like HTML is often better (or
only) handled with more specialized tools:
• HTML::TokeParser
• HTML::TreeBuilder

There are modules for scraping from XML,
spreadsheets, databases, Word docs, PDFs.