Introduction to Python
Parsing HTML
Topic 3, Chapter 7
Network Programming
Kansas State University at Salina
Picking information from an HTML page
A difficult problem
HTML defines page layout, not content (an advantage of XML)
Very useful because of the volume of data available
If the format of the page changes, your program is broken.
HTML
Definition: a token is one piece of information in an HTML-formatted page
HTML tag (usually only relates to formatting)
URL or image reference
Textual information
Must look at several tokens to determine the context of the data
The start-tag, end-tag structure ( <TABLE> … </TABLE> ) leads parsing code to use finite state machines and stacks.
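A minimal sketch of the stack idea, written for Python 3. It assumes tokens are dictionaries in the layout shown on the Tokens slide; the token list is hand-written and the function name text_with_context is hypothetical, for illustration only.

```python
# Hand-written tokens in the dictionary layout shown on the Tokens slide
# (assumption: only the 'type', 'name' and 'data' keys matter here).
tokens = [
    {'type': 'StartTag', 'name': 'table', 'data': []},
    {'type': 'StartTag', 'name': 'tr', 'data': []},
    {'type': 'StartTag', 'name': 'td', 'data': []},
    {'type': 'Characters', 'data': 'Tim Bower'},
    {'type': 'EndTag', 'name': 'td', 'data': []},
    {'type': 'EndTag', 'name': 'tr', 'data': []},
    {'type': 'EndTag', 'name': 'table', 'data': []},
]

def text_with_context(tokens):
    """Return (path, text) pairs, where path names the tags open at that point."""
    stack = []   # currently open tags, innermost last
    found = []
    for token in tokens:
        if token['type'] == 'StartTag':
            stack.append(token['name'])
        elif token['type'] == 'EndTag':
            # Pop back to the matching start tag; ignore stray end tags.
            if token['name'] in stack:
                while stack.pop() != token['name']:
                    pass
        elif token['type'] == 'Characters':
            found.append(('/'.join(stack), token['data']))
    return found

for path, text in text_with_context(tokens):
    print(path, '->', text)
```

The stack gives each piece of text its context: the same string inside table/tr/td and inside head/title can be told apart without looking back through earlier tokens.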
Tokens
<HTML>
<HEAD>
<TITLE> Tim Bower </TITLE>
</HEAD>
<BODY BGCOLOR="lightyellow">
<TABLE> <TR>
<TD>
<H1>Tim Bower</H1>
{'data': [], 'type': 'StartTag', 'name': u'html'}
{'data': [], 'type': 'StartTag', 'name': u'head'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'title'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'title'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'head'}
{'data': u'\n\n', 'type': 'SpaceCharacters'}
{'data': [(u'bgcolor', u'lightyellow')],
'type': 'StartTag', 'name': u'body'}
{'data': u' \n\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'table'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'tbody'}
{'data': [], 'type': 'StartTag', 'name': u'tr'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'td'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'h1'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': [], 'type': 'EndTag', 'name': u'h1'}
Two main programming strategies
The call-back approach (HTMLParser, shown in the textbook)
Define your own class that extends the HTMLParser class
Nice use of inheritance and polymorphism
Pass the HTML page to the parser and it calls methods of your class as needed to process the start-tags, data elements, end-tags and a few other miscellaneous items.
The document tree approach
The parser builds a tree (a data structure object) based on the page contents
You iterate through the tree, or a list of tokens taken from the tree, looking for the desired data.
HTMLParser
# Python 2 (in Python 3 the module is html.parser)
import sys
from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        self.title = ''
        self.readingtitle = 0
        HTMLParser.__init__(self)

    # Called by the parser for each start tag
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.readingtitle = 1

    # Called by the parser for text between tags
    def handle_data(self, data):
        if self.readingtitle:
            self.title += data

    # Called by the parser for each end tag
    def handle_endtag(self, tag):
        if tag == 'title':
            print "*** %s ***" % self.title
            self.readingtitle = 0

fd = open(sys.argv[1])
tp = TitleParser()
tp.feed(fd.read())
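For readers on Python 3, a rough equivalent of the same call-back idea, assuming the standard html.parser module. The page is an inline string rather than a file named on the command line, so the sketch runs on its own.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ''
        self.reading_title = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive in lowercase
        if tag == 'title':
            self.reading_title = True

    def handle_data(self, data):
        # Collect text only while inside <title> ... </title>
        if self.reading_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.reading_title = False

# Illustrative page based on the Tokens slide
page = "<HTML><HEAD><TITLE> Tim Bower </TITLE></HEAD><BODY></BODY></HTML>"
tp = TitleParser()
tp.feed(page)
print("*** %s ***" % tp.title.strip())
```

Here the title is kept on the parser object and printed by the caller, which makes the class easier to reuse than printing inside handle_endtag.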
Argh! HTMLParser is fragile and hard to debug.
Traceback (most recent call last):
  File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\Topic 3 - Web\weatherParser.py", line 258, in <module>
    parser.feed(data)
  File "C:\Python25\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python25\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python25\lib\HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "C:\Python25\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "C:\Python25\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 120, column 477
html5lib
Found on the Python Package Index
Install setuptools, then use Python to install html5lib (see the README file). Both are on K-State Online.
Advantages:
Robust, standards-based parser
Filtering data after the page is parsed is easier to follow and debug than the call-back approach
Disadvantage:
Documentation of the API for traversing the tree is limited
html5lib Usage
Build the tree:
import html5lib
from html5lib import treebuilders

p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
f = open("weather.html", "r")
dom_tree = p.parse(f)
f.close()
Loop through tokens:
from html5lib import treewalkers

walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)
passtags = [u'a', u'h1', u'h2', u'h3', u'h4', u'em',
            u'strong', u'br', u'img',
            u'dl', u'dt', u'dd']
for token in stream:
    # Skip formatting tags that are not of interest
    if 'name' in token and token['name'] in passtags:
        continue
    print token
The DOM tree alternative
The DOM tree may be used directly.
Not documented with html5lib, but the xml.dom package is standard with Python.
DOM trees are normally used with XML, but html5lib can make a DOM tree from HTML.
Walk through the tree by examining the child nodes of each node. With knowledge of the page structure, you may be able to go almost directly to the desired information.
See chapter 8 and the posted file DOMtry.py.
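A small sketch of walking a DOM tree by its child nodes, written for Python 3 with the standard xml.dom.minidom module. This is not the posted DOMtry.py; it assumes a short, well-formed page so the standard XML parser accepts it, whereas a real HTML page would get its tree from html5lib's "dom" tree builder. The helper name show is hypothetical.

```python
from xml.dom import minidom

# Assumption: a small, well-formed page that the XML parser can read directly.
page = ("<html><head><title>Tim Bower</title></head>"
        "<body><table><tr><td><h1>Tim Bower</h1></td></tr></table></body></html>")
dom_tree = minidom.parseString(page)

def show(node, depth=0):
    """Print element names and non-blank text, indented by depth in the tree."""
    if node.nodeType == node.ELEMENT_NODE:
        print('  ' * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        print('  ' * depth + repr(node.data.strip()))
    for child in node.childNodes:
        show(child, depth + 1)

show(dom_tree.documentElement)

# Knowing the page structure, go almost directly to the desired data.
h1 = dom_tree.getElementsByTagName('h1')[0]
heading = h1.firstChild.data
print(heading)
```

The recursive walk is useful for discovering the structure of an unfamiliar page; once the structure is known, a direct lookup such as getElementsByTagName is usually shorter.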
html5lib tokens
The stream of tokens is iterated like a list
Each token is a dictionary
token['data'] is one of:
  String (unicode) for text
  Empty list for a tag with no attributes
  List of (name, value) tuples for a tag's formatting attributes
token['type'] is one of StartTag, EndTag, Characters, SpaceCharacters
token['name'] is the tag name of a start or end tag (table, tr, td, h1, br, ul, li, … )
See the example tokens on the earlier Tokens slide
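To illustrate the list-of-tuples form of token['data'], here is a short Python 3 sketch that looks up one attribute of a start tag. It assumes the token layout shown on the Tokens slide; other html5lib versions may store attributes differently, and get_attr is a hypothetical helper, not part of html5lib.

```python
def get_attr(token, attr_name, default=None):
    """Return the value of an attribute of a StartTag token, or default."""
    if token.get('type') != 'StartTag':
        return default
    # Assumption: attributes are stored as a list of (name, value) tuples.
    for name, value in token['data']:
        if name == attr_name:
            return value
    return default

# Tokens copied from the Tokens slide (u'' prefixes dropped for Python 3)
body = {'data': [('bgcolor', 'lightyellow')], 'type': 'StartTag', 'name': 'body'}
text = {'data': 'Tim Bower', 'type': 'Characters'}

print(get_attr(body, 'bgcolor'))
print(get_attr(text, 'bgcolor', 'n/a'))
```

Checking the token type first matters because token['data'] is a plain string for Characters tokens, and iterating over a string would not yield (name, value) pairs.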
html5lib token parsing
doingTitle = False
for token in stream:
    # Only start and end tags have a name
    if 'name' in token:
        if token['name'] in passtags:
            continue
        else:
            tName = token['name']
    tType = token['type']
    if tType == 'StartTag':
        if tName == u'title':
            title = ''
            doingTitle = True
    if tType == 'EndTag':
        if tName == u'title':
            print "*** %s ***\n" % title
            doingTitle = False
    if tType == 'Characters':
        # Collect text only between the title start and end tags
        if doingTitle:
            title += token['data']
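The loop above can be tried without html5lib installed by feeding it hand-written tokens. The following Python 3 sketch wraps the same logic in a function; the name extract_title is hypothetical, and the sample tokens are copied from the Tokens slide.

```python
def extract_title(stream):
    """Return the text found between the title start and end tags."""
    doing_title = False
    title = ''
    for token in stream:
        t_type = token['type']
        t_name = token.get('name')   # None for character tokens
        if t_type == 'StartTag' and t_name == 'title':
            title = ''
            doing_title = True
        elif t_type == 'EndTag' and t_name == 'title':
            doing_title = False
        elif t_type == 'Characters' and doing_title:
            title += token['data']
    return title

# Sample tokens copied from the Tokens slide (u'' prefixes dropped)
sample = [
    {'data': [], 'type': 'StartTag', 'name': 'head'},
    {'data': [], 'type': 'StartTag', 'name': 'title'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': [], 'type': 'EndTag', 'name': 'title'},
    {'data': [], 'type': 'EndTag', 'name': 'head'},
    {'data': [], 'type': 'StartTag', 'name': 'h1'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': [], 'type': 'EndTag', 'name': 'h1'},
]

print("*** %s ***" % extract_title(sample))
```

The doing_title flag is a two-state finite state machine: text inside the h1 element is ignored because the machine has already left the "in title" state.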