
Link analysis as a social science
technique
Mike Thelwall
Statistical Cybermetrics Research Group
University of Wolverhampton, UK
http://cybermetrics.wlv.ac.uk/
Link Analysis Manifesto

Links are:



A wonderful new source of information about
relationships between people, organisations
and information
An easy to collect data source
But:

Results should be interpreted with care
Talk Structure



Part 1: Academic link analysis, mainly
from an information science perspective
Part 2: Software demonstration
Part 3: A social science link analysis
methodology
Link Analysis: Motivation



Individual hyperlinks reflect concrete creation
reasons such as connections between web page
contents or creators
Counts of large numbers of hyperlinks may
reflect wider underlying social processes
Links may reflect phenomena that have
previously been difficult to study, opening up
new research areas

E.g. informal scholarly communication
Part 1: Academic Hyperlink
Analysis


To map patterns of communication between
researchers in a country based upon university
web sites
Patterns of communication are also mapped based
upon journal citations or journal title words



Provides useful information about the structure and
evolution of research fields
Can identify previously unknown field connections
Web analysis could illustrate wider and more
current patterns
Data Collection


Web crawler
AltaVista advanced queries, e.g. Links from
Wolves Uni. to Oxford Uni.
domain:wlv.ac.uk AND linkdomain:ox.ac.uk

Google link queries
Find links to specific URLs, e.g. links to the Institute
home page
link:www.oii.ox.ac.uk
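
The query patterns above can be generated mechanically for a list of sites. The following is a minimal sketch, assuming the `domain:`, `linkdomain:` and `link:` operators behave as shown on this slide. The function names are hypothetical, and the code only builds query strings; it does not contact any search engine.

```python
from itertools import permutations


def altavista_link_query(source_domain, target_domain):
    """Query string for pages in source_domain that link to target_domain."""
    return f"domain:{source_domain} AND linkdomain:{target_domain}"


def google_link_query(url):
    """Query string for pages that link to one specific URL."""
    return f"link:{url}"


def all_pair_queries(domains):
    """One query per ordered (source, target) pair of distinct domains."""
    return {
        (source, target): altavista_link_query(source, target)
        for source, target in permutations(domains, 2)
    }


queries = all_pair_queries(["wlv.ac.uk", "ox.ac.uk"])
print(queries[("wlv.ac.uk", "ox.ac.uk")])
print(google_link_query("www.oii.ox.ac.uk"))
```

Ordered pairs are used because links from A to B and links from B to A are separate counts.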

Types of link count

Direct link counts
Co-inlink counts
Inter-site links only
B and C are co-inlinked
Co-outlink counts
D and E are co-outlinked
[Figure: link diagram with nodes A to F]
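
The two indirect count types can be sketched in code. This is a minimal illustration, assuming the usual definitions: two sites are co-inlinked when some third site links to both, and co-outlinked when both link to the same third site. The example edges are hypothetical and are not taken from the slide's diagram.

```python
from collections import defaultdict
from itertools import combinations


def co_inlink_counts(links):
    """For each pair of targets, count the sources that link to both."""
    targets_by_source = defaultdict(set)
    for source, target in links:
        if source != target:  # ignore self-links
            targets_by_source[source].add(target)
    counts = defaultdict(int)
    for targets in targets_by_source.values():
        for pair in combinations(sorted(targets), 2):
            counts[pair] += 1
    return dict(counts)


def co_outlink_counts(links):
    """For each pair of sources, count the targets that both link to."""
    # Reversing every link turns the co-outlink problem into co-inlinks.
    return co_inlink_counts([(target, source) for source, target in links])


# Hypothetical inter-site links as (source, target) pairs.
links = [("A", "B"), ("A", "C"), ("D", "F"), ("E", "F")]
print(co_inlink_counts(links))   # {('B', 'C'): 1}
print(co_outlink_counts(links))  # {('D', 'E'): 1}
```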
Alternative Document Models

A method to ignore multiple similar links

E.g., domain ADM: count links between
domains instead of pages
[Figure: pages P1, P2 and P3 on www.scit.wlv.ac.uk; pages P4, P5 and P6 on www.oii.ox.ac.uk]
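
The effect of switching document model can be shown with a short sketch. It assumes that the domain ADM treats all pages sharing a host name as one document, so repeated links between the same two domains count once. The page URLs and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Host name of a URL, used as the document under the domain ADM."""
    return urlsplit(url).netloc.lower()


def count_links(page_links, model="page"):
    """Count unique (source, target) links under the chosen document model."""
    unique = set()
    for source, target in page_links:
        if model == "domain":
            source, target = domain_of(source), domain_of(target)
        unique.add((source, target))
    return len(unique)


# Hypothetical page-level links between the two domains.
page_links = [
    ("http://www.scit.wlv.ac.uk/p1.html", "http://www.oii.ox.ac.uk/p4.html"),
    ("http://www.scit.wlv.ac.uk/p2.html", "http://www.oii.ox.ac.uk/p5.html"),
    ("http://www.scit.wlv.ac.uk/p3.html", "http://www.oii.ox.ac.uk/p6.html"),
]
print(count_links(page_links, "page"))    # 3
print(count_links(page_links, "domain"))  # 1
```

Three page-level links collapse to a single domain-level link, which is how the model ignores multiple similar links.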
Some Inter-University
Hyperlink Patterns
Mainly for the UK and Europe
Citation-Style Hyperlink
Analysis

Citation counts are known to be reasonable
indicators of research quality but is the same true
for inlink counts?


Counts of links to universities within a country can
correlate significantly with measures of research
productivity
The significance of this result is in giving
‘permission’ to investigate the use of inter-university
links for researching scholarly communication
Most links are only loosely
related to research

90% of links between UK university sites have
some connection with scholarly activity,
including teaching and research


But less than 1% are equivalent to citations
So link counts do not measure research
dissemination but are more a natural by-product
of scholarly activity


Cannot use link counts to assess research
Can use link counts to track an aspect of
communication
Links to UK universities against
their research productivity
The reason for the
strong correlation is
the quantity of Web
publication, not its
quality
This is different to
citation analysis
Universities tend to link to
neighbours
Universities
cluster
geographically
Language is a factor in
international interlinking
English is the dominant language for Web sites in
the Western EU
In a typical country, 50% of pages are in the
national language(s) and 50% in English
Non-English-speaking countries extensively interlink in
English
(Research with Rong Tang & Liz Price)

Can map patterns of international
communication
Counts of links between EU universities in Swedish are represented by arrow thickness.
Counts of links between EU universities in French are represented by arrow thickness.
Which language???
Which language???
Linking patterns vary enormously
by discipline



No evidence of a significant geographic trend
Disciplinary differences in the extent of
interlinking: e.g., history Web use is very low,
chemistry Web use is very high
Individual research projects can have an
enormous impact upon individual departments


E.g. Arts web sites are often for specific exhibitions
or for digital media projects
Links are not frequent enough to reliably reveal
patterns of interdisciplinarity
The next slide is a (Kamada-Kawai)
network of the interlinking of the “top” 5
universities in AEAN countries (Asia and
Europe) with arrows representing at least
100 links and universities not connected
removed.
(Research with Han Woo Park)

Clustering using links
Background: Power laws in
Academic Webs

Academic Webs have a topology dominated by
power laws, including




Counts of links to pages (inlink counts)
Counts of links from pages (outlink counts)
Groups of interconnected pages
Power laws mean that


Link creation obeys the ‘rich get richer’ law
“Communities” of pages or sites are rarely pure but
tend to multiply overlap
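
The 'rich get richer' idea can be illustrated with a toy simulation. The slide does not name a model; this sketch assumes simple preferential attachment, in which each new page links to one existing page chosen with probability proportional to its current inlink count plus one. The function name and parameters are hypothetical.

```python
import random
from statistics import median


def simulate_rich_get_richer(n_pages, seed=0):
    """Return inlink counts after growing a toy web one page at a time."""
    rng = random.Random(seed)
    inlinks = [0, 0]  # start with two unlinked pages
    # Each page appears once in `tickets`, plus once per inlink it has,
    # so a uniform draw from `tickets` favours already popular pages.
    tickets = [0, 1]
    for new_page in range(2, n_pages):
        target = rng.choice(tickets)
        inlinks[target] += 1
        tickets.append(target)    # extra ticket for the page that gained a link
        inlinks.append(0)
        tickets.append(new_page)  # baseline ticket for the new page
    return inlinks


counts = simulate_rich_get_richer(2000)
print("most inlinked page:", max(counts))
print("median inlinks:", median(counts))
```

A few pages end up with far more inlinks than the typical page, which is the kind of skew a power law describes.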
Page Outlinks
Topological component sizes:
“pure link communities”
Community Identification
Algorithm: “Impure communities”


Can apply to pages, directories and domains
Gives complementary results: a “layered
approach”
[Figure: two log-log plots of frequency against community size, one for the page model (k = 32) and one for the directory model (k = 32)]
Stretching links further: co-inlinks, co-outlinks

More interlinked does not imply more similar


Can use any type of link to look for similar sites


For the UK academic Web, about 42% of domains
connected by links alone host similar disciplines, and
about 43% connected by links, co-inlinks and co-outlinks
Over 100 times more domains are co-inlinked or co-outlinked than are directly linked
Links in any form are less than 50% reliable as
indicators of subject similarity
Summary

Studies of the relatively restricted
subdomain of university web sites


Produce direct research results
For Web Information Retrieval (e.g. search
engines), they also


Help refine methodologies
Help build intuition about web structure
Part 2: Software Demonstration

SocSciBot
Web crawler for social sciences research
SocSciBot Tools
Link analyser for SocSciBot data
Cyclist
Search engine with some corpus linguistics capability
(e.g. word frequency lists for each site)
http://socscibot.wlv.ac.uk/
Part 3: A General Social Science
Link Analysis Methodology

A general framework for using link counts in
social sciences research



For research into link creation or
Together with other sources, for research into other
online or offline phenomena
Applicable when there are enough links relevant
to the research question to count


For collections of large web sites or
For large collections of small web sites
Nine stages for a research project
1. Formulate an appropriate research question, taking into account existing knowledge of web structure
2. Conduct a pilot study
3. Identify web pages or sites that are appropriate to address the research question
Nine stages for a research project
4. Collect link data from a commercial search engine or a personal crawler, taking appropriate accuracy safeguards
5. Apply data cleansing techniques to the links, if possible, and select an appropriate counting method
6. Partially validate the link count results through correlation tests, if possible
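
The correlation test in stage 6 might look like the sketch below. The slides do not say which test to use; Spearman's rank correlation is assumed here because link counts are typically highly skewed. All numbers are made up for illustration and are not results from any study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical inlink counts and an external measure for five sites.
inlink_counts = [120, 450, 90, 800, 300]
external_measure = [2.1, 3.9, 1.5, 5.2, 4.0]
print(round(spearman(inlink_counts, external_measure), 3))
```

A significant correlation with an independent measure supports the link counts only partially, which is why a link classification exercise follows.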
Nine stages for a research project
7. Partially validate the interpretation of the results through a link classification exercise
8. Report results with an interpretation consistent with the link classification exercise, including either a detailed description of the classification or exemplars to illustrate the categories
9. Report the limitations of the study and parameters used in data collection and processing
Interpreting link counts

For most research, need to be able to place an
interpretation on link counts



E.g. A links to B more than C, therefore…
A is inlinked more than B therefore…
Do links ‘measure’ visibility, luminosity,
authority, information exports/imports,
communication, impact, online impact, quality,
importance, interpersonal communication,
nothing, random actions,…?
Interpreting link counts

Classifying random samples of links can
help decide how to interpret them


E.g. Links predominantly reflect…
Correlation tests are also useful as a form of
triangulation

E.g. Link counts associate with…
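
Drawing the random sample and summarising the classification can be sketched as follows. The category names and the links are hypothetical, the classification itself is done by a human, and the fixed seed is only there to make the sample reproducible.

```python
import random
from collections import Counter


def sample_links(links, sample_size, seed=0):
    """Draw a reproducible random sample of links for manual classification."""
    rng = random.Random(seed)
    return rng.sample(links, min(sample_size, len(links)))


def category_shares(labels):
    """Proportion of classified links that fall in each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Hypothetical (source, target) links collected in an earlier stage.
links = [(f"site{i}.example", "target.example") for i in range(500)]
to_classify = sample_links(links, 100)

# Labels a human coder might assign to the sampled links (made up).
labels = ["teaching", "research", "research", "other"]
print(category_shares(labels))
```

If one category dominates the sample, that supports a statement of the form "links predominantly reflect…".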
The theoretical perspective for
link counting

In order to be able to reliably interpret link
counts, all links should be created




individually and independently,
by humans,
through equivalent gravity judgments (e.g., about the
quality of the information in the target page).
Additionally, links to a site should target pages
created by the site owner or somebody else
closely associated with the site.
Summary



Link counts are an information source that
may reveal new insights into online and
offline phenomena
Can be used in conjunction with other data
sources to address many research questions
With existing tools, are relatively easy to
use in research