Agenda
• Overview of the project
• Resources
CS172 Project
• Crawling
• Indexing
• Ranking
Phase 1 Options
• Web data
– You need to come up with your own crawling strategy
• Twitter data
– You can use a third-party library for the Twitter Streaming API
– Still needs some web crawling
Crawling
• The Frontier holds the URLs waiting to be crawled (e.g., www.cs.ucr.edu, www.cs.ucr.edu/~vagelis) and exposes getNext() and addAll(List<URL>).
• The crawl loop (a sketch follows below):
1. Download contents of page
2. Parse the downloaded file to extract the links on the page
3. Clean and Normalize the extracted links
4. Store extracted links in the Frontier (addAll)
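A minimal sketch of this loop in Java (the class name Frontier and the helper methods download, extractLinks, and cleanAndNormalize are illustrative placeholders; each step is discussed in the following slides):

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Illustrative frontier: a FIFO queue of URLs plus a "seen" set so each URL is crawled once.
class Frontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    public void addAll(List<String> urls) {
        for (String url : urls) {
            if (seen.add(url)) {   // enqueue only URLs we have not seen before
                queue.add(url);
            }
        }
    }

    public String getNext() {
        return queue.poll();       // returns null when the frontier is empty
    }
}

public class CrawlerSketch {
    public static void main(String[] args) {
        Frontier frontier = new Frontier();
        frontier.addAll(List.of("http://www.cs.ucr.edu"));   // seed URL

        String url;
        while ((url = frontier.getNext()) != null) {
            String html = download(url);                          // 1. download contents of page
            List<String> links = extractLinks(html);              // 2. parse the page to extract links
            List<String> cleaned = cleanAndNormalize(links, url); // 3. clean and normalize the links
            frontier.addAll(cleaned);                             // 4. store the links in the Frontier
        }
    }

    // Placeholders only; the following slides discuss each of these steps.
    static String download(String url) { return ""; }
    static List<String> extractLinks(String html) { return List.of(); }
    static List<String> cleanAndNormalize(List<String> links, String base) { return List.of(); }
}
```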
1. Download File Contents
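One way to do the download step with only the standard library (a sketch; error handling and politeness delays are left out):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DownloadPage {
    // Downloads the raw HTML of a page as a single string.
    public static String download(String address) throws IOException {
        URL url = new URL(address);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Prints the raw HTML, tags and all -- this is what the next step has to parse.
        System.out.println(download("http://www.cs.ucr.edu"));
    }
}
```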
2. Parsing HTML to extract links
• This is what you will see when you download a page: raw HTML. Notice the HTML tags.
2. Parsing HTML file
• Write your own parser
– One suggestion: parse the HTML file as XML. Two parsing methods:
– SAX (Simple API for XML)
– DOM (Document Object Model)
• Use an existing library
– JSoup (http://jsoup.org/). Can also be used to download the page (see the sketch below).
– HTML Parser (http://htmlparser.sourceforge.net/)
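As an illustration of the library route, a small JSoup sketch that both downloads a page and extracts its links (the selector a[href] matches every anchor tag that has an href attribute):

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinkExtractor {
    public static void main(String[] args) throws IOException {
        // JSoup downloads the page and builds a DOM, tolerating malformed HTML.
        Document doc = Jsoup.connect("http://www.cs.ucr.edu").get();

        // Select every <a> tag that has an href attribute.
        for (Element link : doc.select("a[href]")) {
            String raw = link.attr("href");          // the href exactly as written in the page
            String absolute = link.attr("abs:href"); // resolved against the page's URL
            System.out.println(raw + " -> " + absolute);
        }
    }
}
```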
2. Parsing HTML file
• Things to think about
– How do you handle malformed HTML?
A browser can still display it, but how will your parser handle it?
3. Clean extracted URLs
• Some URL entries seen while crawling www.cs.ucr.edu:
– /intranet/
– /inventthefuture.html
– systems.engr.ucr.edu
– news/e-newsletter.html
– http://www.engr.ucr.edu/sendmail.html
– http://ucrcmsdev.ucr.edu/oucampus/de.jsp?user=D01002&site=cmsengr&path=%2Findex.html
– /faculty/
– /
– /about/
– #main
– http://www.pe.com/local-news/riverside-county/riverside/riverside-headlines-index/20120408riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104
3. Clean extracted URLs
What to avoid
• Parse only http links (avoid ftp, https, or any other protocol)
• Avoid duplicates
– Bookmarks: #main – bookmarks should be stripped off
– Self paths: /
• Avoid downloading pdfs or images
– /news/GraphenePublicationsIndex.pdf
– It's ok to download them, but you cannot parse them.
• Take care of invalid characters in URLs
– Space: www.cs.ucr.edu/vagelis hristidis
– Ampersand: www.cs.ucr.edu/vagelis&hristidis
– These characters should be encoded, otherwise you will get a MalformedURLException
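A sketch of these checks in one helper method (the set of skipped file extensions and the space encoding are only examples; extend them as needed):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCleaner {
    // Returns a cleaned URL, or null if the URL should be skipped.
    public static String clean(String url) {
        // Strip bookmarks: everything after '#' refers to the same page.
        int hash = url.indexOf('#');
        if (hash >= 0) {
            url = url.substring(0, hash);
        }
        if (url.isEmpty() || url.equals("/")) {
            return null;                        // self path, or empty after stripping the bookmark
        }
        // Keep only http links (skip https, ftp, mailto, ...); relative URLs are handled
        // by the normalization step before this check.
        if (!url.startsWith("http://")) {
            return null;
        }
        // Skip files we cannot parse for links.
        String lower = url.toLowerCase();
        if (lower.endsWith(".pdf") || lower.endsWith(".jpg")
                || lower.endsWith(".png") || lower.endsWith(".gif")) {
            return null;
        }
        // Encode characters that are invalid in URLs, e.g. spaces.
        url = url.replace(" ", "%20");
        // Final sanity check with java.net.URL.
        try {
            new URL(url);
        } catch (MalformedURLException e) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(clean("http://www.cs.ucr.edu/vagelis hristidis")); // space gets encoded
        System.out.println(clean("#main"));                                   // null: bookmark only
        System.out.println(clean("ftp://ftp.example.com/file.txt"));          // null: not http
    }
}
```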
Normalize Links Found on the Page
• Relative URLs:
– These URLs have no host address.
– E.g., while crawling www.cs.ucr.edu/faculty you find URLs such as:
– Case 1: /find_people.php
• A “/” at the beginning means the path starts from the root of the host (www.cs.ucr.edu in this case).
– Case 2: all
• No “/” means the path is relative to the current path.
• Normalize them (respectively) to
– www.cs.ucr.edu/find_people.php
– www.cs.ucr.edu/faculty/all
Clean extracted URLs
• Different parts of the URL:
http://www.pe.com:8080/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533
– Protocol: http
– Port: 8080
– Host: www.pe.com
– Path: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece
– Query: ssimg=532988
– Bookmark: ssStory533
java.net.URL
Has methods that can separate the different parts of the URL:
getProtocol: http
getHost: www.pe.com
getPort: -1 (no port was specified in the URL)
getPath: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece
getQuery: ssimg=532988
getFile: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988
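For example, a small program exercising these getters on the URL from the previous slide (the comments show the expected output):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlParts {
    public static void main(String[] args) throws MalformedURLException {
        URL url = new URL("http://www.pe.com/local-news/riverside-county/riverside/"
                + "riverside-headlines-index/"
                + "20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece"
                + "?ssimg=532988#ssStory533104");

        System.out.println("getProtocol: " + url.getProtocol()); // http
        System.out.println("getHost: " + url.getHost());         // www.pe.com
        System.out.println("getPort: " + url.getPort());         // -1 (no explicit port in the URL)
        System.out.println("getPath: " + url.getPath());         // the path, without query or bookmark
        System.out.println("getQuery: " + url.getQuery());       // ssimg=532988
        System.out.println("getFile: " + url.getFile());         // path + "?" + query
        System.out.println("getRef: " + url.getRef());           // ssStory533104 (the bookmark)
    }
}
```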
Normalizing with java.net.URL
• You can normalize URLs with simple string manipulations and with methods from the java.net.URL class.
• A snippet for normalizing “Case 1” root-relative URLs is sketched below.
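The following is a sketch rather than the original snippet: it handles Case 1 by prepending the protocol and host, and falls back to the URL(URL context, String spec) constructor for Case 2 and already-absolute links:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class NormalizeUrl {
    // Normalizes a link found on `pageUrl` to an absolute URL.
    public static String normalize(String pageUrl, String link) throws MalformedURLException {
        URL base = new URL(pageUrl);
        if (link.startsWith("/")) {
            // Case 1: root-relative -- the path starts from the root of the host.
            return base.getProtocol() + "://" + base.getHost() + link;
        }
        // Case 2 (and absolute links): let java.net.URL resolve the link against the page URL.
        return new URL(base, link).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(normalize("http://www.cs.ucr.edu/faculty/", "/find_people.php"));
        // http://www.cs.ucr.edu/find_people.php
        System.out.println(normalize("http://www.cs.ucr.edu/faculty/", "all"));
        // http://www.cs.ucr.edu/faculty/all
    }
}
```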
Crawler Ethics
• Some websites don’t want crawlers swarming
all over them.
• Why?
– Increases load on the server
– Private websites
– Dynamic websites
–…
Crawler Ethics
• How does a website tell you (the crawler) if and what is off limits?
• Two options
– Site-wide restrictions: robots.txt
– Webpage-specific restrictions: meta tag
Crawler Ethics
robots.txt
• A file called “robots.txt” in
the root directory of the
website
• Example:
http://www.about.com/robots.txt
• Format:
User-Agent: <crawler name>
Disallow: <paths the crawler must not follow>
Allow: <paths the crawler may follow>
Crawler Ethics
robots.txt
• What should you do?
– Before starting on a new website, check if robots.txt exists.
– If it does, download it and parse it for all inclusions and exclusions for the “generic crawler”, i.e. User-Agent: *
– Don’t crawl anything in the exclusion list, including sub-directories
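A deliberately simplified sketch of this check (real robots.txt files also use wildcards, Allow rules, and Crawl-delay, so treat this only as a starting point):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RobotsTxt {
    private final List<String> disallowed = new ArrayList<>();

    // Downloads http://<host>/robots.txt and records Disallow rules for User-Agent: *.
    public RobotsTxt(String host) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("http://" + host + "/robots.txt").openStream(), StandardCharsets.UTF_8))) {
            boolean genericAgent = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    genericAgent = line.substring("user-agent:".length()).trim().equals("*");
                } else if (genericAgent && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (!path.isEmpty()) {
                        disallowed.add(path);
                    }
                }
            }
        } catch (IOException e) {
            // No robots.txt (or it could not be read): nothing is recorded as disallowed.
        }
    }

    // True if the crawler is allowed to fetch this path (excluded prefixes cover sub-directories).
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsTxt robots = new RobotsTxt("www.about.com");
        System.out.println(robots.isAllowed("/intranet/"));
    }
}
```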
Crawler Ethics
Website Specific: Meta tags
• Some webpages have one of the following meta-tag entries:
• <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
• <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
• <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
• Options:
– INDEX or NOINDEX
– FOLLOW or NOFOLLOW
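With JSoup, for instance, the robots meta tag can be checked like this (a sketch; the page URL is only a placeholder):

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RobotsMetaTag {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.cs.ucr.edu").get();

        boolean index = true;   // may we add this page to our index?
        boolean follow = true;  // may we follow the links on this page?

        // Look for <META NAME="ROBOTS" CONTENT="..."> (name and content compared case-insensitively).
        for (Element meta : doc.select("meta[name]")) {
            if (meta.attr("name").equalsIgnoreCase("robots")) {
                String content = meta.attr("content").toUpperCase();
                if (content.contains("NOINDEX"))  index = false;
                if (content.contains("NOFOLLOW")) follow = false;
            }
        }
        System.out.println("index=" + index + ", follow=" + follow);
    }
}
```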
Twitter data collecting
• Collecting through Twitter Streaming API
– https://dev.twitter.com/docs/platform-objects/tweets, where you can check
the data schema.
– Rate limit: you will get up to 1% of the whole Twitter traffic, so you can collect about 4.3M tweets per day (about 2 GB)
– You need to have a Twitter account for that. Check https://dev.twitter.com/
Third-party library
• Twitter4j for Java.
• Libraries that support other languages are also available.
• Well documented, with code examples, e.g., http://twitter4j.org/en/code-examples.html
Important Fields
• You should save at least the following fields:
– Text
– Timestamp
– Geolocation
– User of the tweet
– Links
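A sketch of pulling these fields out of the stream with Twitter4j (assuming Twitter4j 4.x; the OAuth placeholders must be replaced with the keys from your own Twitter account, and the println calls stand in for whatever storage format you choose):

```java
import twitter4j.GeoLocation;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.URLEntity;
import twitter4j.conf.ConfigurationBuilder;

public class TweetCollector {
    public static void main(String[] args) {
        // Credentials come from your Twitter developer account (placeholders here).
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("CONSUMER_KEY")
                .setOAuthConsumerSecret("CONSUMER_SECRET")
                .setOAuthAccessToken("ACCESS_TOKEN")
                .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");

        TwitterStream stream = new TwitterStreamFactory(cb.build()).getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // The fields listed above; replace the printing with your own storage.
                String text = status.getText();
                java.util.Date timestamp = status.getCreatedAt();
                GeoLocation geo = status.getGeoLocation();      // may be null
                String user = status.getUser().getScreenName();
                System.out.println(timestamp + " @" + user + ": " + text
                        + (geo != null ? " (" + geo.getLatitude() + "," + geo.getLongitude() + ")" : ""));
                for (URLEntity url : status.getURLEntities()) {
                    System.out.println("  link: " + url.getExpandedURL());
                }
            }

            @Override
            public void onException(Exception ex) {
                ex.printStackTrace();
            }
        });

        stream.sample();   // the ~1% sample of the full Twitter stream
    }
}
```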
Crawl links in Tweets
• Tweets may contain links.
– They may contain useful information, e.g., links to news articles.
• After collecting the tweets, use another process to crawl the links (see the sketch below).
– Crawling is slower, so you may not want to crawl a link right after you get the tweet.
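One simple way to decouple the two steps, sketched below with a BlockingQueue (the class and method names are illustrative): the streaming listener only enqueues each link, and a separate thread crawls the queued links at its own pace.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TweetLinkCrawler {
    // Shared queue: the tweet collector puts links in, the crawler thread takes them out.
    private static final BlockingQueue<String> linkQueue = new LinkedBlockingQueue<>();

    // Called from the streaming listener (e.g., inside onStatus): enqueue and return immediately.
    public static void submitLink(String url) {
        linkQueue.offer(url);
    }

    public static void main(String[] args) {
        Thread crawler = new Thread(() -> {
            try {
                while (true) {
                    String url = linkQueue.take();   // blocks until a link is available
                    // Download and parse the linked page here, reusing the web crawler code.
                    System.out.println("crawling " + url);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        crawler.start();

        // Example hand-off (in practice this call comes from the tweet collector).
        submitLink("http://example.com/news-article");
    }
}
```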