Mining Web Site's Clusters from Link Topology and Site Hierarchy
彭金隆
December 20, 2005

Outline
• Introduction
• HITS & Web Cluster Identification
• ReHITS - A Recursive HITS Algorithm
• Experimental Results and Discussion
• Qualitative comparison with related algorithms

Introduction
• This work falls in the category of web structure mining, with a focus on web subgraph identification.
• The union of the clusters is not assumed to form the whole web site, so the approach is also called cluster mining.
• Conjecture: bipartite graphs also form good approximations of logically coherent clusters in web sites.
• Proposal: extend a link analysis algorithm, the HITS algorithm, to mine the clusters.
• Authority pages: pages that are pointed to by many other web pages.
• Hub pages: pages that point to other web pages.
• A web page can be an authority, a hub, or both.
• The HITS algorithm is an iterative algorithm that was originally proposed to identify the authorities and hubs of bipartite subgraphs in web search results.
• This work extends the HITS algorithm and proposes an algorithm called ReHITS.

HITS & Web Cluster Identification
• Google's ranking is also based on link analysis (PageRank, an algorithm related to HITS).
• The HITS algorithm contains two main steps:
  Step 1: Computing authority scores
  Step 2: Computing hub scores
• The algorithm iterates Steps 1 and 2 until the authority scores and the hub scores converge.
• For multiple clusters within a web site, differences in link semantics and page context give the clusters specific characteristics compared with the communities that emerge in the Web.
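As a rough illustration of the two HITS steps listed above, the following is a minimal Python sketch of the standard HITS iteration. In standard HITS, a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The function name `hits`, the dictionary-based graph, the L2 normalization, and the tolerance `tol` are assumptions made for this example; the slides do not specify them.

```python
import math


def _normalize(scores):
    """Scale scores in place to unit L2 norm (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm > 0:
        for page in scores:
            scores[page] /= norm


def hits(links, max_iter=100, tol=1e-9):
    """Standard HITS on a graph given as {page: iterable of linked pages}.

    Returns (authority_scores, hub_scores) as two dictionaries.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(max_iter):
        # Step 1: authority score = sum of hub scores of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Step 2: hub score = sum of authority scores of pages linked to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}

        _normalize(new_auth)
        _normalize(new_hub)

        # Stop once neither score vector changes noticeably.
        delta = max(
            (abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p]) for p in pages),
            default=0.0,
        )
        auth, hub = new_auth, new_hub
        if delta < tol:
            break

    return auth, hub


# Hypothetical toy graph: three hub pages pointing at two authority pages.
example_links = {"h1": ["a", "b"], "h2": ["a", "b"], "h3": ["a"]}
example_auth, example_hub = hits(example_links)
```

In this toy graph, page `a` receives the highest authority score because all three hubs point to it, and `h1` and `h2` score higher as hubs than `h3` because they point to both authorities.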
ReHITS - A Recursive HITS Algorithm
• Apply the HITS algorithm to the whole web site; the authorities and hubs of the densest cluster(s) can then be identified.
• Remove the authorities and hubs, as well as the other vertices in the cluster.
• Apply the HITS algorithm again to the remaining pages to explore the presence of more clusters.
• Repeat until no more clusters can be identified.

For example, the following authorities
http://www.abc.edu/doc/java/classes/index.html
http://www.abc.edu/doc/java/index.html
http://www.abc.edu/doc/proj/Slides/index0.html
http://www.abc.edu/doc/proj/Slides/App/page.html
will be grouped and form two potential subtrees:
Potential Subtree 1: /doc/java/
Potential Subtree 2: /doc/proj/Slides/

Experimental Results and Discussion
• The datasets used in our experiment contain all the web pages retrieved from the web sites, except for media files (e.g., *.jpg), dynamically generated pages (e.g., *.php), and resources accessed with protocols other than http (e.g., ftp).
• The graph data structure of each web site was then created by extracting the hyperlinks embedded in the <A> tags. All JavaScript in the web pages was discarded.
• Characteristics of identified clusters: system documentation, research project descriptions, and personal web sites.
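Returning to the ReHITS procedure described earlier (apply HITS, remove the identified cluster, and repeat on what remains), the loop can be sketched in Python as follows. This is only a sketch under stated assumptions: the slides do not say how authorities and hubs are selected from the scores or exactly when the recursion stops, so the `threshold` parameter (a page counts as an authority or hub if its score is at least half of the top score), the `max_clusters` cap, and the stopping rule (no links left) are hypothetical choices, as are the function names.

```python
import math


def hits(links, iterations=50):
    """Standard HITS on {page: set of linked pages}; returns (auth, hub)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub


def rehits(links, threshold=0.5, max_clusters=20):
    """Repeatedly run HITS and peel off the dominant cluster.

    ASSUMPTION: a page belongs to the current cluster if its authority or
    hub score is at least `threshold` times the top score of that kind.
    Returns a list of (authorities, hubs) pairs, densest cluster first.
    """
    remaining = {p: set(targets) for p, targets in links.items()}
    clusters = []
    # ASSUMPTION: stop when no links are left (or the cap is reached).
    while len(clusters) < max_clusters and any(remaining.values()):
        auth, hub = hits(remaining)
        top_auth, top_hub = max(auth.values()), max(hub.values())
        authorities = {p for p, s in auth.items() if s > 0 and s >= threshold * top_auth}
        hubs = {p for p, s in hub.items() if s > 0 and s >= threshold * top_hub}
        clusters.append((sorted(authorities), sorted(hubs)))
        # Remove the cluster's pages and every link that touches them.
        cluster = authorities | hubs
        remaining = {
            p: {t for t in targets if t not in cluster}
            for p, targets in remaining.items()
            if p not in cluster
        }
    return clusters


# Hypothetical site with a dense cluster (h*, a*) and a sparser one (g*, b1).
site = {
    "h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "h3": {"a1", "a2"},
    "g1": {"b1"}, "g2": {"b1"},
}
found = rehits(site)
```

On this toy site the first pass isolates the denser bipartite cluster, and the second pass, run on the remaining pages, finds the sparser one. Each cluster is reported as an authority and hub pair.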
• Personal home: the theorem summary page of this thesis was identified as the strongest hub.
• The discrepancy is mainly due to the different design of hyperlinks as navigation aids.
• /pub/ possesses a giant tree structure without any obvious authorities or hubs. This also implies that our algorithm is not effective at extracting groups of logically related web pages that follow a tree structure.

Qualitative comparison with related algorithms
• The other algorithm does not provide a way to represent the contents of the clusters, whereas we do so using the authority and hub pairs.

Q&A