Transcript Chapter 10

Chapter 10 Link Analysis

Introduction

• Airline route maps are useful
  – The information can tell you about both history and politics
• Call Detail Records tell us about relationships between people
• The Web is based on (hyper)links between documents
• It is claimed that there are no more than 6 degrees of separation between any two people
• Link Analysis is the data mining technique that addresses relationships and connections
• Link Analysis is based on Graph Theory

Introduction

• Effective in many situations
  – Identifying authoritative sources of information on the WWW by analyzing page links
  – Understanding physician referral patterns
  – Analyzing telephone call patterns
    • MCI Friends and Family
    • Could give out private info
      – "You know Mary Smith, who is also on MCI, so join MCI"
        » But your wife does not know Mary Smith
        » Far-fetched? Facebook does it all of the time!
  – Can identify fraud: calling card thieves call the same people
  – Can you think of other applications? Links?

Basic Graph Theory

• Graphs are an abstraction used to represent relationships
• Graphs consist of
  – Nodes (vertices), which are the things in the graph that have relationships
  – Edges, which are pairs of nodes connected by a relationship
• Visualization is a key characteristic of a graph
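As a rough illustration of nodes and edges, the sketch below stores an undirected graph as a map from each node to the set of its neighbors. The function name `build_graph` and the city pairs are made up for this example; they are not from the slides.

```python
from collections import defaultdict

def build_graph(edges):
    """Return {node: set of neighbors} for an undirected graph."""
    graph = defaultdict(set)
    for a, b in edges:
        # An undirected edge is recorded from both endpoints.
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

# Hypothetical edges: each pair of nodes is one relationship.
edges = [("LA", "Denver"), ("Denver", "Boston"), ("LA", "Chicago")]
graph = build_graph(edges)
print(sorted(graph["Denver"]))
```

A neighbor map like this is only one possible representation; an edge list or an adjacency matrix would carry the same information.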

Basic Graph Theory

• A path is an ordered sequence of nodes connected by edges
  – Flight segments (legs) such as LA – Denver – Boston
• A weighted graph is one in which the edges have weights associated with them
  – Example: weights give the support for the association between two products being purchased together
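The two definitions above can be sketched in a few lines. This is a minimal sketch, assuming edge weights are simple co-purchase counts; the helper names `is_path` and `co_purchase_weights` and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def is_path(edge_set, nodes):
    """True if every consecutive pair of nodes is joined by an (undirected) edge."""
    return all(frozenset(pair) in edge_set for pair in zip(nodes, nodes[1:]))

def co_purchase_weights(baskets):
    """Weight of edge (a, b) = number of baskets that contain both a and b."""
    weights = Counter()
    for basket in baskets:
        # Sort so each product pair always gets the same key.
        for a, b in combinations(sorted(set(basket)), 2):
            weights[(a, b)] += 1
    return weights

# Flight legs from the slide, stored as undirected edges.
flights = {frozenset(e) for e in [("LA", "Denver"), ("Denver", "Boston")]}
print(is_path(flights, ["LA", "Denver", "Boston"]))

# Made-up shopping baskets for illustration only.
baskets = [["bread", "milk"], ["bread", "milk", "eggs"], ["eggs", "milk"]]
weights = co_purchase_weights(baskets)
print(weights[("bread", "milk")])
```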

Graph Theory Classic Problems

1. Finding an Euler path in the graph, one that visits every edge exactly one time (Seven Bridges – edges are bridges and nodes are land). Simple rule: a connected graph has an Euler path exactly when at most 2 nodes (that is, 0 or 2) have odd degree.*
2. Finding the shortest path that visits each node in the graph exactly one time (Traveling Salesman)
  – A completely connected graph with n nodes has n! paths, and no known algorithm solves the problem exactly in time that is not exponential in n
  – No simple rule.

*No simple algorithm exists to determine a Hamiltonian path, one that visits each vertex exactly once
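Both problems can be tried out on small inputs. The sketch below applies the odd-degree rule to the Seven Bridges layout, then brute-forces the shortest path through every node by trying all n! orderings. The land-mass labels A to D, the function names, and the distance table are assumptions made for illustration, and the Euler check assumes the graph is connected.

```python
from collections import Counter
from itertools import permutations

def odd_degree_nodes(edges):
    """Nodes touched by an odd number of edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(n for n, d in degree.items() if d % 2 == 1)

def has_euler_path(edges):
    """Assumes a connected graph: an Euler path exists iff 0 or 2 nodes have odd degree."""
    return len(odd_degree_nodes(edges)) in (0, 2)

# Seven Bridges: A is the island, B and C the river banks, D the second land mass.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]
print(has_euler_path(bridges))

def shortest_path_all_nodes(dist):
    """Brute force: score every ordering of the nodes (n! candidates)."""
    return min(
        (sum(dist[a][b] for a, b in zip(order, order[1:])), order)
        for order in permutations(dist)
    )

# Hypothetical symmetric distances between three cities.
dist = {"A": {"B": 1, "C": 4}, "B": {"A": 1, "C": 2}, "C": {"A": 4, "B": 2}}
length, order = shortest_path_all_nodes(dist)
print(length, order)
```

The brute-force search is only practical for a handful of nodes, which is the point of the n! remark above.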

Directed vs Undirected Graphs

• Undirected graphs – edges between nodes go in both directions (A to B; B to A)
• Directed graphs – edges between nodes only go in one direction (A to B is different from B to A)
  – Ex: WWW
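To make the distinction concrete, here is a small sketch of a directed graph that tracks outgoing and incoming links separately. The page names and the function `link_tables` are invented for this example.

```python
from collections import defaultdict

def link_tables(links):
    """Return (outgoing, incoming) maps for directed edges given as (source, target)."""
    outgoing, incoming = defaultdict(set), defaultdict(set)
    for src, dst in links:
        # A directed edge is recorded once, in the direction it points.
        outgoing[src].add(dst)
        incoming[dst].add(src)
    return outgoing, incoming

# Hypothetical hyperlinks: "home" links to "about", but "about" does not link back.
links = [("home", "about"), ("home", "news"), ("news", "home")]
outgoing, incoming = link_tables(links)
print(sorted(outgoing["home"]), sorted(incoming["home"]))
```

In an undirected graph the two tables would be identical, so one neighbor map would be enough.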

Google – Directed Graph Example

• Web pages = nodes
• Hyperlinks = edges
• Spiders & Web crawlers keep the graph updated
• Kleinberg’s Algorithm
  – Hub – a page that links to many authorities
  – Authority – a page that is linked to by many hubs

Google – example continued

• Authority versus mere popularity
  – Ranking by the number of unrelated sites linking to a site yields popularity
  – Ranking by the number of subject-related hubs that point to a site yields authority
  – Helps to overcome the situation that often arises in popularity, where the real authority (e.g., Home Page) is ranked lower because of lack of popularity of links to it

Kleinberg Algorithm

• Search process:
  – Begins with text-based keyword matching that returns a root set of hundreds of good matches.
  – Identify candidate pages: add all pages linked to by the root set and a subset of pages that link to the root set. Leads to 1K-5K pages.
  – Rank hubs and authorities (see definitions): iterative algorithm that rewards hubs that are associated with strong authorities and authorities associated with strong hubs
    • Pages start with weight of 1, and hubs (authorities) are rewarded based on weights of associated authorities (hubs)
    • (see the pagerank example linked to on our schedule)
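The ranking step can be sketched as follows. This is a minimal sketch of the hub/authority iteration, not a full implementation of the search process: it skips the root-set and candidate-page steps, and the per-round normalization, the iteration count, and the page names are assumptions not stated on the slide.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the pages it links to. Returns (hub, authority) scores."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    # Every page starts with a weight of 1.
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authorities are rewarded by the hub weights of the pages linking to them.
        auth = dict.fromkeys(pages, 0.0)
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hubs are rewarded by the authority weights of the pages they link to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Assumption: rescale each round so the weights do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical candidate pages: two hubs pointing at two authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Here `a1` ends up as the strongest authority because both hubs point to it, and `h1` as the strongest hub because it points to both authorities.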

Link Analysis Applications

• Can use link analysis to identify fax machines
  – Fax machines generally call other fax machines
  – Can run iterative algorithms to propagate information on how each phone number is used
• At AT&T I used a non-link approach to identify voice vs. data vs. fax lines
  – Used call detail records to describe each phone number and used autodialer to generate training data
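One way such propagation could look is sketched below. This is an illustrative assumption, not the method used at AT&T or in the textbook: each unlabeled number repeatedly takes the average "fax score" of the numbers it exchanges calls with, while known numbers keep their labels. The function name, the 0.5 starting score, and the phone labels are all hypothetical.

```python
def propagate_fax_scores(calls, known, rounds=10):
    """calls: (caller, callee) pairs; known: {number: 1.0 if fax, 0.0 if not}."""
    neighbors = {}
    for a, b in calls:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Assumption: unlabeled numbers start at 0.5 (no evidence either way).
    score = {n: known.get(n, 0.5) for n in neighbors}
    for _ in range(rounds):
        updated = {}
        for n, nbrs in neighbors.items():
            if n in known:
                updated[n] = known[n]  # labeled numbers stay fixed
            else:
                updated[n] = sum(score[m] for m in nbrs) / len(nbrs)
        score = updated
    return score

# Made-up call records: "x" talks only to fax machines, "y" only to home lines.
calls = [("fax1", "x"), ("fax2", "x"), ("home1", "y"), ("home2", "y")]
known = {"fax1": 1.0, "fax2": 1.0, "home1": 0.0, "home2": 0.0}
score = propagate_fax_scores(calls, known)
print(score["x"], score["y"])
```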

Using Links to Generate Recommendations

• A grad student and I built a graph with nodes to represent movies and people
  – A link from a person to a movie indicated the review rating
  – Graph is very sparse: most people do not see most movies
  – Matched people to those with similar movie preferences and then filled in edges
    • Once more edges are filled in, it is easier to compute similarity between users, and the process iterates
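The match-then-fill idea can be sketched roughly as below. The slide does not say how similarity was measured or how edges were filled in, so the similarity formula (based on the mean rating gap over shared movies), the weighted-average fill, and all names and ratings here are assumptions for illustration.

```python
def similarity(a, b):
    """Similarity of two users over the movies both have rated (0.0 if none shared)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_gap = sum(abs(a[m] - b[m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + mean_gap)

def fill_in_edges(ratings):
    """Return a copy of ratings with missing person-movie edges estimated from similar users."""
    movies = {m for r in ratings.values() for m in r}
    filled = {user: dict(r) for user, r in ratings.items()}
    for user, own in ratings.items():
        for movie in movies - set(own):
            # Each other user who rated the movie votes, weighted by similarity.
            votes = [(similarity(own, other), other[movie])
                     for name, other in ratings.items()
                     if name != user and movie in other]
            total = sum(w for w, _ in votes)
            if total > 0:
                filled[user][movie] = sum(w * r for w, r in votes) / total
    return filled

# Hypothetical sparse ratings on a 1-5 scale.
ratings = {
    "ann": {"M1": 5, "M2": 1},
    "bob": {"M1": 5, "M2": 1, "M3": 4},
    "cat": {"M1": 1, "M2": 5, "M3": 1},
}
filled = fill_in_edges(ratings)
print(filled["ann"]["M3"])
```

Calling `fill_in_edges` again on its own output would correspond to the iteration mentioned above, since the denser graph gives more shared movies to compare.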

End of Chapter 10
