Mining Web Site's Clusters from Link Topology and Site Hierarchy
彭金隆
December 20, 2005
Outline

Introduction
HITS & Web Cluster Identification
ReHITS - A Recursive HITS Algorithm
Experimental Results and Discussion
Qualitative comparison with related algorithms
Introduction

This work belongs to the category of web structure mining
• focuses on web subgraph identification

Because it does not assume that the union of the
clusters forms the whole web site,
the approach is also called cluster
mining
Introduction


Conjecture: bipartite graphs also form
good approximations of logically
coherent clusters in web sites
Proposal: extend a link analysis
algorithm, the HITS algorithm,
to mine these clusters
Introduction

Authority pages
• Pages that many other pages point to

Hub pages
• Pages that point to other pages

A page can be an authority, a hub, or both
Introduction


The HITS algorithm is an iterative
algorithm which was originally
proposed to identify authorities and
hubs of bipartite subgraphs in web
search results.
This work extends the HITS algorithm
and proposes an algorithm called ReHITS
HITS & Web Cluster Identification


HITS is a link analysis algorithm; Google's PageRank is a related but different one
The HITS algorithm contains two
main steps:
• Step 1: Computing authority scores
• Step 2: Computing hub scores

The algorithm iterates Step 1 and 2
until the authority scores and the
hub scores converge.
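The two steps can be sketched as follows, using the commonly cited HITS update rules: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. This is only a minimal sketch. The function name `hits`, the `links` dictionary layout, the L2 normalization, and the convergence tolerance are assumptions made for illustration, not details taken from the slides.

```python
import math


def _normalize(scores):
    """Scale the scores in place to unit L2 norm (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm > 0:
        for page in scores:
            scores[page] /= norm


def hits(links, max_iterations=100, tolerance=1e-10):
    """links maps each page to the set of pages it links to (hypothetical layout)."""
    pages = set(links)
    for targets in links.values():
        pages |= set(targets)
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(max_iterations):
        # Step 1: authority score = sum of hub scores of pages linking in.
        new_auth = dict.fromkeys(pages, 0.0)
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        _normalize(new_auth)
        # Step 2: hub score = sum of authority scores of pages linked to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, ())) for p in pages}
        _normalize(new_hub)
        # Stop once both score vectors have stopped changing.
        change = sum(abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p])
                     for p in pages)
        auth, hub = new_auth, new_hub
        if change < tolerance:
            break
    return auth, hub


example_links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "x": {"a1"}}
auth, hub = hits(example_links)
```

In this toy graph, `a1` is linked from three pages and `a2` from two, so `a1` ends up with the higher authority score, while `h1` and `h2` score higher as hubs than `x`.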
HITS & Web Cluster Identification

For multiple clusters in a web site,
we note that, due to the differences
in link semantics and page context,
they have their own specific characteristics
compared with communities
that emerge on the Web.
ReHITS - A Recursive HITS
Algorithm

Apply the HITS algorithm to the whole web site
• The authorities and hubs of the densest cluster(s) can
be identified.

Remove the authorities and hubs, as well as the other
vertices in the cluster
Apply the HITS algorithm again to the remaining pages
to explore the presence of more clusters
Repeat until no more clusters can be identified
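The recursive procedure above can be sketched as a loop that runs HITS, peels off one cluster, and repeats on what is left. The slides do not say how the members of a cluster are chosen, so the rule used here (keep every page whose authority or hub score is at least a fraction `threshold` of the top score) is purely an assumption, as are the names `rehits` and `_hits` and the fixed iteration count.

```python
import math


def _hits(links, pages, iterations=100):
    """Plain HITS with L2 normalization, run for a fixed number of rounds."""
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values()))
            if norm > 0:
                for p in scores:
                    scores[p] /= norm
    return auth, hub


def rehits(links, threshold=0.5):
    """Return clusters in the order they are peeled off (densest first)."""
    remaining = set(links)
    for targets in links.values():
        remaining |= set(targets)
    clusters = []
    while remaining:
        # Restrict the graph to the pages that have not been removed yet.
        sub = {p: set(links.get(p, ())) & remaining for p in remaining}
        if not any(sub.values()):
            break  # no links left, so no more clusters can be identified
        auth, hub = _hits(sub, remaining)
        top_auth = max(auth.values())
        top_hub = max(hub.values())
        # Assumed membership rule: pages scoring close to the top authority
        # or the top hub form the cluster.
        cluster = {p for p in remaining
                   if auth[p] >= threshold * top_auth
                   or hub[p] >= threshold * top_hub}
        clusters.append(cluster)
        remaining -= cluster
    return clusters


site = {
    "h1": {"a1", "a2", "a3"}, "h2": {"a1", "a2", "a3"}, "h3": {"a1", "a2", "a3"},
    "g1": {"b1"}, "g2": {"b1"},
}
clusters = rehits(site)
```

On this toy site the densely linked `h*`/`a*` pages dominate the first HITS run and are removed first; the smaller `g*`/`b1` group only surfaces on the second run.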

For example, the following authorities
http://www.abc.edu/doc/java/classes/index.html
http://www.abc.edu/doc/java/index.html
http://www.abc.edu/doc/proj/Slides/index0.html
http://www.abc.edu/doc/proj/Slides/App/page.html

will be grouped and form two potential
subtrees as
Potential Subtree 1: /doc/java/
Potential Subtree 2: /doc/proj/Slides/
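One grouping rule that reproduces this example is sketched below: take the directory of each authority URL and keep only the directories that are not nested inside another one. The slides do not state the exact rule, so this is an assumption, and the function name `potential_subtrees` is hypothetical.

```python
from urllib.parse import urlparse


def potential_subtrees(urls):
    """Return the topmost directories that directly contain an authority page."""
    # Directory of each URL, with a trailing slash (e.g. "/doc/java/").
    dirs = {urlparse(u).path.rsplit("/", 1)[0] + "/" for u in urls}
    # Drop any directory that lies inside another directory in the set.
    roots = [d for d in dirs
             if not any(other != d and d.startswith(other) for other in dirs)]
    return sorted(roots)


authorities = [
    "http://www.abc.edu/doc/java/classes/index.html",
    "http://www.abc.edu/doc/java/index.html",
    "http://www.abc.edu/doc/proj/Slides/index0.html",
    "http://www.abc.edu/doc/proj/Slides/App/page.html",
]
subtrees = potential_subtrees(authorities)
# subtrees == ["/doc/java/", "/doc/proj/Slides/"]
```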
Experimental Results and
Discussion
The datasets used in our experiment
contain all the web pages retrieved
from the web sites, except for:
• media files (e.g., *.jpg)
• dynamically generated pages (e.g., *.php)
• other resources accessed with protocols
other than http (e.g., ftp)
Experimental Results and
Discussion


The graph data structure of each of
the web sites was then created by
extracting the hyperlinks embedded
in the <A> tags.
Note that all JavaScript in the web
pages is discarded
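A minimal sketch of this extraction step, using only the Python standard library, might look like the following. The names `AnchorCollector`, `out_links`, and `SKIPPED_SUFFIXES` are hypothetical, the suffix list only covers the two examples given in the slides, and the restriction to links within the same host is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Only the examples named in the slides; a real list would be longer.
SKIPPED_SUFFIXES = (".jpg", ".php")


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag (tag names arrive lowercased)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def out_links(page_url, html):
    """Return the set of same-site http pages that page_url links to."""
    parser = AnchorCollector()
    parser.feed(html)
    site = urlparse(page_url).netloc
    links = set()
    for href in parser.hrefs:
        target = urlparse(urljoin(page_url, href))
        # Skip other protocols (ftp, mailto, javascript:) and other hosts.
        if target.scheme != "http" or target.netloc != site:
            continue
        # Skip media files and dynamically generated pages.
        if target.path.lower().endswith(SKIPPED_SUFFIXES):
            continue
        links.add(target._replace(fragment="").geturl())
    return links


sample_html = (
    '<A HREF="java/index.html">Java</A>'
    '<a href="logo.jpg">logo</a>'
    '<a href="ftp://www.abc.edu/pub/">ftp</a>'
    '<a href="javascript:void(0)">menu</a>'
)
edges = out_links("http://www.abc.edu/doc/index.html", sample_html)
```

Calling `out_links` for every retrieved page yields the adjacency sets of the site graph; here only the link to `java/index.html` survives the filters.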
Experimental Results and
Discussion

Characteristics of identified clusters
1. The identified clusters include system documentation, research project
descriptions, and personal web sites.
2. The personal home page and the theorem summary page
of this thesis were identified as the strongest
hubs.
3. The discrepancy is mainly due to the different
design of hyperlinks as navigation aids.
4. /pub/ has a giant tree structure without
any obvious authorities or hubs. This also
implies that our algorithm is not effective at
extracting groups of logically related web pages
that follow a tree structure.
Qualitative comparison with related
algorithms

The other algorithm does not provide
a way to represent the contents of
the clusters, whereas ours does so using
the authority and hub pairs.
Q&A