Using Search Engines and Web Crawlers in Social Science Research
Mike Thelwall
Head, Statistical Cybermetrics Research Group
University of Wolverhampton, UK
http://linkanalysis.wlv.ac.uk
RC33 August 2004

Link Analysis in Social Science Research

Use to study web phenomena
    E.g. NGO web site interlinking
    E.g. university web site interlinking
Use to study offline phenomena with web aspects
    E.g. scholarly communication
    E.g. the perception of news events
The web is a free, accessible massive data source for information about many aspects of life

What use is hyperlink data to qualitative researchers?

Part of a mixed methodology
    Numbers to back up theories
    To obtain samples of types of Web pages for qualitative analyses
    Background information on how the Web is used

Quick example 1: UK university interlinking with geographic clusters indicated

Quick example 2: Asia-Pacific university interlinking. {Research with Alastair Smith, VUW, NZ}

Quick example 3: Geographic interlinking trends for UK universities.

Talk overview

A social science approach for link analysis
Data collection with commercial search engines
Data collection and analysis with SocSciBot

A social science approach for link analysis 1: Preliminary steps

1. Formulate an appropriate research question, taking into account existing knowledge of web structure
2. Conduct a pilot study
3. Identify web pages or sites that are appropriate to address a research question
4. Collect link data from a commercial search engine or a personal crawler
    taking appropriate safeguards to ensure that the results obtained are accurate
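
As an illustration of the kind of link data step 4 refers to, here is a minimal sketch of how a personal crawler might pull the off-site links out of one downloaded page. It uses only the Python standard library. The class and function names, the page content, and the choice to treat "a different host name" as "a different site" are all assumptions made for this example, not a description of any particular crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the absolute targets of <a href="..."> tags in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def external_links(base_url, html):
    """Return links whose host differs from the host of base_url.

    Assumption: host name equality stands in for "same site".
    Links without a host (e.g. mailto:) are dropped.
    """
    parser = LinkExtractor(base_url)
    parser.feed(html)
    source_host = urlparse(base_url).hostname
    return [
        url for url in parser.links
        if urlparse(url).hostname not in (None, source_host)
    ]


# Hypothetical page content, for illustration only.
page = '<a href="/about.html">About</a> <a href="http://www.db.dk/">DB</a>'
print(external_links("http://linkanalysis.wlv.ac.uk/index.html", page))
```

A real study would also need the safeguards mentioned in step 4, such as checking for duplicate pages, which this sketch does not attempt.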

A social science approach for link analysis 2: Validation

5. Partially validate the link count results through correlation tests
6. Partially validate the interpretation of the results through a link classification exercise or web author interviews
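
The correlation test in step 5 could be sketched as follows. The slides do not say which test to use; a Spearman rank correlation is assumed here as one possible choice, since it does not require the link counts to be normally distributed. All numbers are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data: inlink counts for five sites and a hypothetical
# external measure (e.g. a research rating) for the same sites.
inlinks = [120, 45, 300, 45, 80]
external_measure = [3.1, 2.0, 4.5, 2.4, 2.9]
print(round(spearman(inlinks, external_measure), 3))
```

A significant correlation with an independent measure only partially validates the link counts, which is why step 6 adds a classification exercise or interviews.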

A social science approach for link analysis 3: Reporting

8. Report results with an interpretation consistent with link classification exercise
    include either a detailed description of the classification or exemplars to illustrate the categories
9. Report the limitations of the study and parameters used in data collection and processing

Link data from commercial search engines

Commercial search engines can give information about the existence of links in the web
    Can be used for data collection
    Advanced interfaces are usually needed, or special commands

Google

Can find all links to a given web page with the link: command
E.g. link:http://www.siswo.uva.nl/rc33/

Yahoo! site-specific searches

Yahoo! allows searching for links between pairs of web sites/web spaces
    E.g. linkdomain:db.dk +site:ac.uk returns web pages in the ac.uk domain that link to the db.dk site
[Diagram: a page at …ac.uk/… linking to a page at …db.dk/…]
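
When many pairs of sites are involved, the query strings can be generated rather than typed. The sketch below only builds the text of each query, using the linkdomain: and site: syntax exactly as given on the slide. It does not contact any search engine, and it does not assume the syntax is still accepted. The function names are hypothetical.

```python
from itertools import permutations


def link_query(target_domain, source_domain):
    """Build a query for pages in source_domain that link to target_domain."""
    return f"linkdomain:{target_domain} +site:{source_domain}"


def pairwise_queries(domains):
    """Return one query per ordered (source, target) pair of distinct domains."""
    return {
        (source, target): link_query(target, source)
        for source, target in permutations(domains, 2)
    }


# The two domains used in the slide's example.
for pair, query in pairwise_queries(["ac.uk", "db.dk"]).items():
    print(pair, query)
```

Ordered pairs are used because links are directional: pages in ac.uk linking to db.dk are a different set from pages in db.dk linking to ac.uk.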

SocSciBot

Personal crawler for link research
Available free at socscibot.wlv.ac.uk
Crawls sets of web sites and analyses the links between them, producing:
    Link lists
    Link counts
    Network diagrams
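
To show how a link list relates to link counts, here is a minimal sketch that tallies links between sites from a list of (source URL, target URL) pairs. This is not SocSciBot's file format or code. The input layout, the function names, the example hosts, and the use of the host name as the "site" are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Use the URL's host name as a stand-in for 'site' (an assumption)."""
    return urlparse(url).hostname


def intersite_link_counts(link_list):
    """Count links between distinct sites from (source_url, target_url) pairs."""
    counts = Counter()
    for source_url, target_url in link_list:
        source, target = site_of(source_url), site_of(target_url)
        # Skip links without a host and links that stay within one site.
        if source and target and source != target:
            counts[(source, target)] += 1
    return counts


# Invented link list using reserved .example host names.
links = [
    ("http://www.uni-a.example/p1.html", "http://www.uni-b.example/"),
    ("http://www.uni-a.example/p2.html", "http://www.uni-b.example/x"),
    ("http://www.uni-b.example/", "http://www.uni-a.example/"),
    ("http://www.uni-a.example/p1.html", "http://www.uni-a.example/p2.html"),
]
for (source, target), n in sorted(intersite_link_counts(links).items()):
    print(source, "->", target, n)
```

The resulting counts are directed, so they could serve as edge weights when drawing a network diagram of the sites.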

Reprise: Link Analysis in Social Science Research

Use to study web phenomena
    E.g. NGO web site interlinking
    E.g. university web site interlinking
Use to study offline phenomena with web aspects
    E.g. scholarly communication
    E.g. the perception of news events
The web is a free, accessible massive data source for information about many aspects of life
But don’t forget the need for validation!