Transcript Slides

Multimedia Indexing and Dimensionality Reduction

Multimedia Data Management • The need to query and analyze vast amounts of multimedia data (e.g., images, sound tracks, video tracks) has increased in recent years.

• Joint Research from Database Management, Computer Vision, Signal Processing and Pattern Recognition aims to solve problems related to multimedia data management.

Multimedia Data • There are four major types of multimedia data: images, video sequences, sound tracks, and text.

• From the above, the easiest type to manage is text, since we can order, index, and search text using string management techniques, etc.

• Management of simple sounds is also possible by representing audio as signal sequences over different channels.

• Image retrieval has received a lot of attention in the last decade (CV and DBs). The main techniques can be extended and applied also for video retrieval.

Content-based Image Retrieval • Images were traditionally managed by first annotating their contents and then using text-retrieval techniques to index them.

• However, with the increase of information in digital image format, some drawbacks of this technique were revealed: • Manual annotation requires a vast amount of labor • Different people may perceive the contents of an image differently; thus no objective keywords for search are defined • A new research field was born in the 90's: Content-based Image Retrieval aims at indexing and retrieving images based on their visual contents.

Feature Extraction • The basis of Content-based Image Retrieval is to extract and index some visual features of the images. • There are general features (e.g., color, texture, shape, etc.) and domain-specific features (e.g., objects contained in the image).

• Domain-specific feature extraction can vary with the application domain and is based on pattern recognition. • On the other hand, general features can be used independently from the image domain.

Color Features • To represent the color of an image compactly, a histogram with k bins is used: colors are partitioned into k color groups according to their similarity, and the percentage of each group in the image is measured. • Images are thus transformed to k-dimensional points, and a distance metric (e.g., the Euclidean distance) in the k-dimensional space is used to measure the similarity between them.
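A rough sketch of this idea (not from the slides), assuming images arrive as H x W x 3 RGB arrays: each channel is quantized into a few ranges, the k-bin histogram is built as percentages, and two images are compared with the Euclidean distance. The bin count and the random toy images are illustrative assumptions.

```python
import numpy as np

def color_histogram(image, bins_per_channel=4):
    """Quantize RGB colors into k = bins_per_channel**3 groups and
    return the percentage of pixels falling in each group."""
    pixels = image.reshape(-1, 3)                      # H x W x 3 -> N x 3, values 0..255
    idx = (pixels // (256 // bins_per_channel)).astype(int)
    # combine the three per-channel bin indices into one color-group id
    groups = (idx[:, 0] * bins_per_channel + idx[:, 1]) * bins_per_channel + idx[:, 2]
    k = bins_per_channel ** 3
    hist = np.bincount(groups, minlength=k).astype(float)
    return hist / hist.sum()                           # the k-dimensional point

def histogram_distance(h1, h2):
    """Euclidean distance between two k-dimensional histogram points."""
    return float(np.linalg.norm(h1 - h2))

# toy usage: two random 32x32 'images'
img_a = np.random.randint(0, 256, (32, 32, 3))
img_b = np.random.randint(0, 256, (32, 32, 3))
print(histogram_distance(color_histogram(img_a), color_histogram(img_b)))
```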

Using Transformations to Reduce Dimensionality • In many cases the embedded dimensionality of a search problem is much lower than the actual dimensionality. • Some methods apply transformations on the data and approximate them with low-dimensional vectors. • The aim is to reduce dimensionality and at the same time maintain the data characteristics. • If d(a,b) is the distance between two objects a, b in the real (high-dimensional) space and d'(a',b') is their distance in the transformed low-dimensional space, we want d'(a',b') ≈ d(a,b).

Problem - Motivation • Given a database of documents, find documents containing "data", "retrieval" • Applications: Web; law and patent offices; digital libraries; information filtering

Problem - Motivation • Types of queries: boolean ('data' AND 'retrieval' AND NOT ...); additional features ('data' ADJACENT 'retrieval'); keyword queries ('data', 'retrieval') • How to search a large collection of documents?

Text – Inverted Files

Text – Inverted Files Q: space overhead?

A: mainly, the postings lists

Text – Inverted Files • how to organize the dictionary? • stemming – Y/N? Keep only the root of each word, e.g., inverted, inversion -> invert • insertions?

Text – Inverted Files • how to organize the dictionary? B-tree, hashing, TRIEs, PATRICIA trees, ... • stemming – Y/N? • insertions?

Text – Inverted Files • postings lists – more: Zipf distribution, e.g., the rank-frequency plot of the 'Bible' is roughly linear on log(freq) vs. log(rank) axes; freq ~ 1 / (rank * ln(1.78 V))

Text – Inverted Files • postings lists: Cutting+Pedersen – keep the first 4 entries in the B-tree leaves • how to allocate space: [Faloutsos+92] – geometric progression • compression (Elias codes) [Zobel+] – down to 2% overhead!

Conclusions: needs space overhead (2%-300%), but it is the fastest
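To make the structure concrete, here is a minimal, hypothetical inverted-index sketch (not from the slides): a dictionary maps each term to its postings list, and a boolean AND query intersects postings lists. The toy documents and helper names are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted postings list of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def boolean_and(index, *terms):
    """'data' AND 'retrieval': intersect the postings lists of the terms."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "data retrieval systems", 2: "database systems", 3: "information retrieval"}
index = build_inverted_index(docs)
print(boolean_and(index, "data", "retrieval"))   # -> [1]
```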

Text - Detailed outline • Text databases: problem; full text scanning; inversion; signature files (a.k.a. Bloom Filters); Vector model and clustering; information filtering and LSI

Vector Space Model and Clustering • Keyword (free-text) queries (vs Boolean) • each document: -> vector (HOW?) • each query: -> vector • search for 'similar' vectors

Vector Space Model and Clustering • main idea: each document is a vector of size d, where d is the number of different terms in the database (= vocabulary size). [figure: document 'indexing' shown as a sparse vector over the terms aaron ... data ... zoo]

Document Vectors • Documents are represented as "bags of words" OR as vectors. • A vector is like an array of floating points: it has direction and magnitude. • Each vector holds a place for every term in the collection. • Therefore, most vectors are sparse.

Document Vectors • One location for each word: the example collection has documents A–I and terms nova, galaxy, heat, h'wood, film, role, diet, fur. For instance, "Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, and "Heat" occurs 3 times in text A; "Hollywood" occurs 7 times in text I, "Film" occurs 5 times in text I, "Diet" occurs 1 time in text I, and "Fur" occurs 3 times in text I. [table: term frequencies for documents A–I]

We Can Plot the Vectors [figure: documents plotted in the 'Star' vs. 'Diet' term space: a doc about astronomy, a doc about movie stars, and a doc about mammal behavior]

Assigning Weights to Terms • Binary weights • Raw term frequency • tf x idf: recall the Zipf distribution; we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole

Binary Weights  Only the presence (1) or absence (0) of a term is included in the vector

docs   t1  t2  t3
D1      1   0   1
D2      1   0   0
D3      0   1   1
D4      1   0   0
D5      1   1   1
D6      1   1   0
D7      1   0   0
D8      1   0   0
D9      0   1   0
D10     1   1   0
D11     0   1   1

Raw Term Weights  The frequency of occurrence for the term in each document is included in the vector

docs   t1  t2  t3
D1      2   0   3
D2      1   0   0
D3      0   4   7
D4      3   0   0
D5      1   6   3
D6      3   5   0
D7      8   0   0
D8     10   0   0
D9      0   1   0
D10     3   5   0
D11     0   1   4

Assigning Weights • tf x idf measure: term frequency (tf) and inverse document frequency (idf) – a way to deal with the problems of the Zipf distribution • Goal: assign a tf x idf weight to each term in each document

tf x idf

w_ik = tf_ik * log(N / n_k)

where:
• T_k = term k
• tf_ik = frequency of term T_k in document D_i
• idf_k = inverse document frequency of term T_k in the collection C
• N = total number of documents in the collection C
• n_k = the number of documents in C that contain T_k
• idf_k = log(N / n_k)

Inverse Document Frequency • IDF provides high values for rare words and low values for common words. • For a collection of 10000 documents: log(10000/10000) = 0; log(10000/5000) = 0.301; log(10000/20) = 2.698; log(10000/1) = 4
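A small sketch of the tf x idf weight defined above, using base-10 logarithms as in the slide's examples; the helper names and the sample numbers fed to them are assumptions.

```python
import math

def idf(N, n_k):
    """idf_k = log10(N / n_k): high for rare terms, low for common ones."""
    return math.log10(N / n_k)

# the collection of 10,000 documents from the slide
for n_k in (10000, 5000, 20, 1):
    print(n_k, round(idf(10000, n_k), 3))    # 0.0, 0.301, 2.699, 4.0

def tf_idf(tf_ik, N, n_k):
    """w_ik = tf_ik * log10(N / n_k)."""
    return tf_ik * idf(N, n_k)

# e.g. a term occurring 3 times in a document and appearing in 20 of 10,000 docs
print(round(tf_idf(3, 10000, 20), 3))        # ~8.097
```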

Similarity Measures for document vectors (Q and D viewed as sets of terms):

• Simple matching (coordination level match): |Q ∩ D|
• Dice's Coefficient: 2 |Q ∩ D| / (|Q| + |D|)
• Jaccard's Coefficient: |Q ∩ D| / |Q ∪ D|
• Cosine Coefficient: |Q ∩ D| / (|Q|^1/2 * |D|^1/2)
• Overlap Coefficient: |Q ∩ D| / min(|Q|, |D|)
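These set-based coefficients are easy to compute directly; the sketch below treats Q and D as sets of terms (the function name and toy sets are assumptions, not from the slides).

```python
def similarity_coefficients(Q, D):
    """Set-based similarity measures between a query Q and a document D,
    each given as a set of terms."""
    Q, D = set(Q), set(D)
    inter = len(Q & D)
    return {
        "simple_matching": inter,                               # |Q ∩ D|
        "dice":    2 * inter / (len(Q) + len(D)),               # 2|Q ∩ D| / (|Q| + |D|)
        "jaccard": inter / len(Q | D),                          # |Q ∩ D| / |Q ∪ D|
        "cosine":  inter / (len(Q) ** 0.5 * len(D) ** 0.5),     # |Q ∩ D| / (|Q|^1/2 |D|^1/2)
        "overlap": inter / min(len(Q), len(D)),                 # |Q ∩ D| / min(|Q|, |D|)
    }

print(similarity_coefficients({"data", "retrieval", "system"},
                              {"data", "retrieval", "lung"}))
```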

tf x idf normalization • Normalize the term weights (so longer documents are not unfairly given more weight). • To normalize usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive:

w_ik = tf_ik * log(N / n_k) / sqrt( sum_{k=1..t} (tf_ik)^2 * [log(N / n_k)]^2 )

Vector space similarity (use the weights to compare the documents) • Now, the similarity of two documents is:

sim(D_i, D_j) = sum_{k=1..t} w_ik * w_jk

This is also called the cosine, or normalized inner product.

Computing Similarity Scores • Example: D1 = (0.8, 0.3), D2 = (0.2, 0.7), Q = (0.4, 0.8); cos(theta_1) = 0.74, cos(theta_2) = 0.98. [figure: D1, D2 and Q plotted in the unit square, with theta_1 the angle between Q and D1 and theta_2 the angle between Q and D2]

Vector Space with Term Weights and Cosine Matching • D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit), Q = (q_1, w_q1; q_2, w_q2; ...; q_t, w_qt)

sim(Q, D_i) = sum_{j=1..t} (w_qj * w_dij) / ( sqrt( sum_{j=1..t} (w_qj)^2 ) * sqrt( sum_{j=1..t} (w_dij)^2 ) )

Example with Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7):

sim(Q, D2) = (0.4*0.2 + 0.8*0.7) / sqrt( [(0.4)^2 + (0.8)^2] * [(0.2)^2 + (0.7)^2] ) = 0.64 / 0.65 ≈ 0.98

sim(Q, D1) = (0.4*0.8 + 0.8*0.3) / sqrt( [(0.4)^2 + (0.8)^2] * [(0.8)^2 + (0.3)^2] ) = 0.56 / 0.76 ≈ 0.74

[figure: Term A / Term B plane with Q, D1, D2 and the angles theta_1, theta_2]
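A small plain-Python sketch of this weighted cosine computation, reusing the Q, D1, D2 vectors from the example; the computed values come out to about 0.73 and 0.98, matching the slide's numbers up to rounding.

```python
import math

def cosine_sim(q, d):
    """sim(Q, D) = sum(w_qj * w_dj) / (sqrt(sum w_qj^2) * sqrt(sum w_dj^2))."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    return dot / (norm_q * norm_d)

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine_sim(Q, D1), 2))   # ~0.73 (the slide reports 0.74)
print(round(cosine_sim(Q, D2), 2))   # ~0.98
```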

Text - Detailed outline • Text databases: problem; full text scanning; inversion; signature files (a.k.a. Bloom Filters); Vector model and clustering; information filtering and LSI

Information Filtering + LSI

• [Foltz+,'92] Goal: users specify interests (= keywords); the system alerts them on suitable news documents • Major contribution: LSI = Latent Semantic Indexing – latent ('hidden') concepts

Information Filtering + LSI

Main idea • map each document into some 'concepts' • map each term into some 'concepts' • 'Concept': ~ a set of terms, with weights, e.g. "data" (0.8), "system" (0.5), "retrieval" (0.6) -> DBMS_concept

Information Filtering + LSI

Pictorially: term-document matrix (BEFORE)

        'data'  'system'  'retrieval'  'lung'  'ear'
TR1        1        1          1
TR2        1        1          1
TR3                                       1       1
TR4                                       1       1

Information Filtering + LSI

Pictorially: concept-document matrix and...

        'DBMS concept'  'medical concept'
TR1           1
TR2           1
TR3                            1
TR4                            1

Information Filtering + LSI

... and concept-term matrix

                    data  system  retrieval  lung  ear
'DBMS concept'        1      1         1
'medical concept'                                1    1

Information Filtering + LSI

Q: How to search, e.g., for 'system'?

Information Filtering + LSI

A: find the corresponding concept(s); and the corresponding documents data system 1 retrieval 1 'DBMS concept' 1 lung ear 'medical concept' 1 1 'DBMS concept' TR1 1 'medical concept' TR2 1 TR3 TR4 1 1

Information Filtering + LSI

Thus it works like an (automatically constructed) thesaurus: we may retrieve documents that DON’T have the term ‘system’, but they contain almost everything else (‘data’, ‘retrieval’)
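A minimal LSI-style sketch, assuming numpy and the small term-document matrix reconstructed above: documents and the query 'system' are mapped into concept space via the SVD, and the 'DBMS' documents score highly. The variable names and the choice k = 2 are assumptions, not part of the slides.

```python
import numpy as np

# term-document matrix from the example: rows = documents TR1..TR4,
# columns = terms (data, system, retrieval, lung, ear)
A = np.array([[1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # keep the two strongest 'concepts'
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# map the query ('system') into concept space, then score the documents
terms = ["data", "system", "retrieval", "lung", "ear"]
q = np.array([1.0 if t == "system" else 0.0 for t in terms])
q_concepts = Vtk @ q                    # query in concept space
doc_concepts = Uk * sk                  # documents in concept space
scores = doc_concepts @ q_concepts
# documents about the DBMS concept (TR1, TR2) score high; the medical ones score ~0
print(dict(zip(["TR1", "TR2", "TR3", "TR4"], scores.round(2))))
```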

SVD - Detailed outline

• Motivation • Definition - properties • Interpretation • Complexity • Case studies • Additional properties

SVD - Motivation

• problem #1: text - LSI: find 'concepts' • problem #2: compression / dimensionality reduction

SVD - Motivation

 problem #1: text - LSI: find ‘concepts’

SVD - Motivation

 problem #2: compress / reduce dimensionality

Problem - specs

• ~10^6 rows; ~10^3 columns; no updates; random access to any cell(s); small error: OK

SVD - Definition

A[n x m] = U[n x r] L[r x r] (V[m x r])^T

• A: n x m matrix (e.g., n documents, m terms) • U: n x r matrix (n documents, r concepts) • L: r x r diagonal matrix (strength of each 'concept') (r: rank of the matrix) • V: m x r matrix (m terms, r concepts)

SVD - Properties

THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U L V^T, where: • U, L, V: unique (*) • U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other): U^T U = I; V^T V = I (I: identity matrix) • L: the diagonal entries (singular values) are positive, and sorted in decreasing order

SVD - Example

A = U L V^T - example (rows: documents, the first four about CS, the last three about MD; columns: terms data, inf., retrieval, brain, lung):

A =
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

U =
0.18 0
0.36 0
0.18 0
0.90 0
0    0.53
0    0.80
0    0.27

L =
9.64 0
0    5.29

V^T =
0.58 0.58 0.58 0    0
0    0    0    0.71 0.71
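A quick numerical check of this example, assuming numpy: np.linalg.svd recovers the rank, the singular values 9.64 and 5.29, and (up to sign) the U and V^T shown above, and the orthonormality properties can be verified directly.

```python
import numpy as np

# the 7 x 5 document-term matrix from the example
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))                   # rank = number of non-zero singular values
print("rank:", r)                            # 2
print("singular values:", s[:r].round(2))    # [9.64 5.29]
print(U[:, :r].round(2))                     # columns ~ (0.18, 0.36, 0.18, 0.90, 0, 0, 0), ... (up to sign)
print(Vt[:r, :].round(2))                    # rows ~ (0.58, 0.58, 0.58, 0, 0), (0, 0, 0, 0.71, 0.71)
# properties: U^T U = I, and U L V^T reconstructs A
print(np.allclose(U.T @ U, np.eye(U.shape[1])))
print(np.allclose(U @ np.diag(s) @ Vt, A))
```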

SVD - Example

A = U L V^T - example: the two columns of U and the two rows of V^T correspond to the CS-concept and the MD-concept, respectively (same matrices as above).

SVD - Example

A = U L V^T - example: U is the doc-to-concept similarity matrix (each document's affinity to the CS-concept and the MD-concept).

SVD - Example

A = U L V^T - example: the diagonal of L gives the 'strength' of each concept (9.64 for the CS-concept, 5.29 for the MD-concept).

SVD - Example

A = U L V^T - example: V^T is the term-to-concept similarity matrix (data, inf., retrieval load on the CS-concept; brain, lung on the MD-concept).

SVD - Detailed outline

• Motivation • Definition - properties • Interpretation • Complexity • Case studies • Additional properties

SVD - Interpretation #1

'documents', 'terms' and 'concepts': • U: document-to-concept similarity matrix • V: term-to-concept similarity matrix • L: its diagonal elements give the 'strength' of each concept

SVD - Interpretation #2

 best axis to project on: (‘best’ = min sum of squares of projection errors)

SVD - interpretation #2

SVD gives the best axis to project on (minimum RMS error): v1. [figure: 2-D point cloud with the best-fit axis v1]

SVD - Interpretation #2

A = U L V^T - example: the first row of V^T, v1 = (0.58, 0.58, 0.58, 0, 0), is the best projection axis (same matrices as in the example above).

SVD - Interpretation #2

A = U L V^T - example: the first singular value (9.64) measures the variance ('spread') on the v1 axis.

SVD - Interpretation #2

A = U L V^T - example: U L gives the coordinates of the points (documents) on the projection axes.

SVD - Interpretation #2

More details. Q: how exactly is dimensionality reduction done?

SVD - Interpretation #2

More details. A: set the smallest singular values to zero; in the example, replace 5.29 by 0, keeping only the strongest concept (9.64).

SVD - Interpretation #2

A ≈ U L' V^T with L' = diag(9.64, 0): the approximation keeps only the first concept.

SVD - Interpretation #2

Equivalently, drop the zeroed concept entirely: keep only the first column of U (0.18, 0.36, 0.18, 0.90, 0, 0, 0), the first singular value 9.64, and the first row of V^T (0.58, 0.58, 0.58, 0, 0).

SVD - Interpretation #2

The resulting rank-1 approximation reproduces (approximately) the CS block of A and zeroes out the MD block:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0

SVD - Interpretation #2

Equivalent: 'spectral decomposition' of the matrix (the same A = U L V^T example as above).

SVD - Interpretation #2

Equivalent: 'spectral decomposition' of the matrix: A = [u1 u2] x diag(l1, l2) x [v1 v2]^T

SVD - Interpretation #2

Equivalent: 'spectral decomposition' of the (n x m) matrix:

A = sum_{i=1..r} l_i u_i v_i^T = l1 u1 v1^T + l2 u2 v2^T + ...

SVD - Interpretation #2

'spectral decomposition' of the (n x m) matrix: A = l1 u1 v1^T + l2 u2 v2^T + ... (r terms, where each u_i is n x 1 and each v_i^T is 1 x m)

SVD - Interpretation #2

approximation / dimensionality reduction: keep only the first few terms, assuming l1 >= l2 >= ... (Q: how many?). To map a data point x into the reduced space, use V^T: x' = V^T x.

SVD - Interpretation #2

A (heuristic - [Fukunaga]): keep 80-90% of the 'energy' (= the sum of squares of the l_i's), assuming l1 >= l2 >= ...
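A sketch of this energy heuristic on the same example matrix, assuming numpy; the thresholds 0.90 and 0.75 are illustrative choices.

```python
import numpy as np

def choose_k(singular_values, energy=0.9):
    """Smallest k whose leading singular values carry the desired
    fraction of the total 'energy' (sum of squares)."""
    sq = np.asarray(singular_values) ** 2
    cum = np.cumsum(sq) / sq.sum()
    return int(np.searchsorted(cum, energy) + 1)

# the 7 x 5 example matrix again
A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print((s**2 / np.sum(s**2)).round(2))      # energy fractions ~ [0.77 0.23 0. 0. 0.]
print(choose_k(s, energy=0.90))            # 2  (keep both concepts)
print(choose_k(s, energy=0.75))            # 1  (the rank-1 truncation shown earlier)

A1 = s[0] * np.outer(U[:, 0], Vt[0, :])    # l1 * u1 * v1^T
print(A1.round(1))                         # CS block reproduced; MD block zeroed out
```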

SVD - Interpretation #3

• finds non-zero 'blobs' in a data matrix: in the example, the two non-zero blocks of A (CS documents x CS terms, MD documents x MD terms) correspond to the two concepts found by the SVD.

SVD - Interpretation #3

Drill: find the SVD, 'by inspection'! Q: rank = ??

1 1 1 0 0
1 1 1 0 0
1 1 1 0 0
0 0 0 1 1
0 0 0 1 1
= ?? x ?? x ??

SVD - Interpretation #3

A: rank = 2 (2 linearly independent rows/cols), so U is 5 x 2, L is a 2 x 2 diagonal matrix, and V^T is 2 x 5 (entries still ??).

SVD - Interpretation #3

A: rank = 2 (2 linearly independent rows/cols); a candidate decomposition uses the column vectors (1, 1, 1, 0, 0) and (0, 0, 0, 1, 1) for both U and V – orthogonal??

SVD - Interpretation #3

• column vectors: they are orthogonal – but not unit vectors; normalizing gives (1/sqrt(3), 1/sqrt(3), 1/sqrt(3), 0, 0) and (0, 0, 0, 1/sqrt(2), 1/sqrt(2)).

SVD - Interpretation #3

and the singular values (here equal to the eigenvalues) are 3 and 2: A = U diag(3, 2) V^T, with the columns of U (and of V) being (1/sqrt(3))*(1, 1, 1, 0, 0) and (1/sqrt(2))*(0, 0, 0, 1, 1).

SVD - Interpretation #3

A: SVD properties: • the matrix product should give back matrix A • matrix U should be column-orthonormal, i.e., its columns should be unit vectors, orthogonal to each other • ditto for matrix V • matrix L should be diagonal, with positive values

SVD - Complexity

• O(n * m * m) or O(n * n * m) (whichever is less) • less work if we just want the eigenvalues, or if we want only the first k eigenvectors, or if the matrix is sparse [Berry] • Implemented in any linear algebra package (LINPACK, matlab, Splus, mathematica, ...)

Optimality of SVD

Def: The Frobenius norm of an n x m matrix M is ||M||_F = sqrt( sum_{i,j} M[i,j]^2 ).

(reminder) The rank of a matrix M is the number of independent rows (or columns) of M.

Let A = U L V^T and A_k = U_k L_k V_k^T (the SVD approximation of A), where U_k is n x k, L_k is k x k, and V_k is m x k; A_k is an n x m matrix of rank at most k.

Theorem [Eckart and Young]: Among all n x m matrices C of rank at most k, A_k is the one that minimizes the error: ||A - A_k||_F <= ||A - C||_F.

Kleinberg’s Algorithm

• Main idea: in many cases, when you search the web using some terms, the most relevant pages may not contain these terms (or contain them only a few times). Example: Harvard: www.harvard.edu

• Search Engines: yahoo, google, altavista • Authorities and hubs

Kleinberg’s algorithm

• Problem definition: given the web and a query, find the most 'authoritative' web pages for this query • Step 0: find all pages containing the query terms (root set) • Step 1: expand by one move forward and backward (base set)

Kleinberg’s algorithm

 Step 1: expand by one move forward and backward

Kleinberg’s algorithm

• on the resulting graph, give a high score (= 'authority') to nodes that many important nodes point to • give a high importance score ('hub') to nodes that point to good 'authorities'. [figure: hubs pointing to authorities]

Kleinberg’s algorithm

observations • recursive definition! • each node (say, the i-th node) has both an authoritativeness score a_i and a hubness score h_i

Kleinberg’s algorithm

Let E be the set of edges and A be the adjacency matrix: the (i, j) entry is 1 if the edge from i to j exists. Let h and a be [n x 1] vectors with the 'hubness' and 'authoritativeness' scores. Then: [figure: nodes k, l, m pointing to node i]

Kleinberg’s algorithm

Then: a_i = h_k + h_l + h_m, that is, a_i = the sum of h_j over all j such that the edge (j, i) exists; or: a = A^T h

Kleinberg’s algorithm

Symmetrically, for the 'hubness': h_i = a_n + a_p + a_q, that is, h_i = the sum of a_j over all j such that the edge (i, j) exists; or: h = A a. [figure: node i pointing to nodes n, p, q]

Kleinberg’s algorithm

In conclusion, we want vectors h and a such that: h = A a and a = A^T h.

Recall the SVD properties: C(2): A[n x m] v1[m x 1] = l1 u1[n x 1]; C(3): u1^T A = l1 v1^T

Kleinberg’s algorithm

In short, the solutions to h = A a, a = A^T h are the left and right singular vectors of the adjacency matrix A (equivalently, the principal eigenvectors of A A^T and A^T A). Starting from a random a' and iterating, we'll eventually converge. (Q: to which of all the eigenvectors? why?)

Kleinberg’s algorithm

(Q: to which of all the eigenvectors? why?) A: to the ones of the strongest eigenvalue, because of property B(5): (A^T A)^k v' ~ (constant) v1
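A minimal power-iteration sketch of these two update rules (a = A^T h, h = A a, with normalization at each step), assuming numpy; the toy 4-node graph and iteration count are assumptions.

```python
import numpy as np

def hits(A, iterations=50):
    """Iterate a = A^T h and h = A a, normalizing at each step.
    A[i, j] = 1 if there is an edge i -> j."""
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iterations):
        a = A.T @ h
        a /= np.linalg.norm(a)
        h = A @ a
        h /= np.linalg.norm(h)
    return h, a

# toy graph: pages 0 and 1 are hubs pointing to authorities 2 and 3
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
hubs, auths = hits(A)
print(hubs.round(2), auths.round(2))
```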

Kleinberg’s algorithm - results

Eg., for the query ‘java’: 0.328 www.gamelan.com

0.251 java.sun.com

0.190 www.digitalfocus.com (“the java developer”)

Kleinberg's algorithm - discussion • the 'authority' score can be used to find 'similar pages' to page p • closely related to 'citation analysis', social networks / 'small world' phenomena

google/page-rank algorithm

• closely related • The Web is a directed graph of connected nodes • imagine a particle randomly moving along the edges (*); compute its steady-state probabilities: that gives the PageRank of each page (the importance of this page) • (*) with occasional random jumps

PageRank Definition

• Assume a page A and pages T1, T2, ..., Tm that point to A. Let d be a damping factor, PR(A) the PageRank of A, and C(A) the out-degree of A. Then:

PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tm)/C(Tm) )

google/page-rank algorithm

• Compute the PR of each page ~ an identical problem: given a Markov Chain, compute the steady-state probabilities p1 ... p5. [figure: example graph with nodes 1-5]

Computing PageRank

• Iterative procedure • Also: navigate the web by randomly following links, or with probability p jump to a random page. Let A be the adjacency matrix (n x n) and d_i the out-degree of page i; then Prob(A_i -> A_j) = p * n^(-1) + (1 - p) * d_i^(-1) * A_ij, and A'[i,j] = Prob(A_i -> A_j).

google/page-rank algorithm

• Let A' be the transition matrix (= adjacency matrix, row-normalized: the sum of each row = 1). • Then the vector p = (p1, ..., p5) of steady-state probabilities satisfies p^T A' = p^T, i.e., A'^T p = p. [figure: 5-node example graph and its transition matrix, with entries 1 and 1/2]

google/page-rank algorithm

• A'^T p = p: thus, p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is row-normalized).
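A small power-iteration sketch of the random-surfer formulation above, assuming numpy. Note it uses the probability-normalized variant, dividing (1 - d) by n so the scores sum to 1; the toy 4-page graph, damping factor, and iteration count are assumptions.

```python
import numpy as np

def pagerank(A, d=0.85, iterations=100):
    """Power iteration: with probability d follow a random out-link,
    otherwise jump to a random page. A[i, j] = 1 if page i links to page j."""
    n = A.shape[0]
    out_deg = A.sum(axis=1)
    out_deg[out_deg == 0] = 1                 # crude guard against sink pages
    M = A / out_deg[:, None]                  # row-normalized transition matrix
    p = np.ones(n) / n
    for _ in range(iterations):
        p = (1 - d) / n + d * (M.T @ p)       # steady-state update
    return p / p.sum()

# toy 4-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
print(pagerank(A).round(3))
```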

Kleinberg/google - conclusions

SVD helps in graph analysis: • hub/authority scores: the strongest left and right singular vectors of the adjacency matrix • random walk on a graph: the steady-state probabilities are given by the strongest eigenvector of the transition matrix

Conclusions – so far

• SVD: a valuable tool • given a document-term matrix, it finds 'concepts' (LSI) • ... and can reduce dimensionality (KL)

Conclusions cont’d

• ... and can find fixed points or steady-state probabilities (google / Kleinberg / Markov Chains) • ... and can optimally solve over- and under-constrained linear systems (least squares)

References

  Brin, S. and L. Page (1998). Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl World Wide Web Conf.

J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.

Embeddings

• Given a metric distance matrix D, embed the objects in a k-dimensional vector space using a mapping F such that D(i, j) is close to D'(F(i), F(j)), where D' is some Lp measure. • Isometric mapping: exact preservation of distances. • Contractive mapping: D'(F(i), F(j)) <= D(i, j).

PCA  Intuition: find the axis that shows the greatest variation, and project all points into this axis f2 e1 e2 f1

SVD: The mathematical formulation • Normalize the dataset by moving the origin to the center of the dataset • Find the eigenvectors of the data (or covariance) matrix • These define the new space • Sort the eigenvalues in "goodness" order. [figure: original axes f1, f2 and eigenvector axes e1, e2]
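A small sketch of exactly these steps (center, covariance eigenvectors, sort, project), assuming numpy; the toy data and variable names are assumptions, not from the slides.

```python
import numpy as np

def pca(X, k):
    """Center the data, take the eigenvectors of the covariance matrix,
    sort them by eigenvalue, and project onto the top k axes."""
    Xc = X - X.mean(axis=0)                    # move the origin to the data center
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort by 'goodness' (variance explained)
    components = eigvecs[:, order[:k]]
    return Xc @ components, eigvals[order]

# toy data stretched along one direction: the first eigenvalue dominates
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
proj, eigvals = pca(X, k=1)
print(eigvals.round(2))                        # first value much larger than the second
```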

SVD Cont'd • Advantages: optimal dimensionality reduction (for linear projections) • Disadvantages: computationally expensive... but can be improved with random sampling; sensitive to outliers and non-linearities

FastMap

• What if we have a finite metric space (X, d)? Faloutsos and Lin (1995) proposed FastMap as a metric analogue to the KL-transform (PCA). Imagine that the points are in a Euclidean space. • Select two pivot points x_a and x_b that are far apart. • Compute a pseudo-projection of the remaining points along the "line" x_a x_b. • "Project" the points onto an orthogonal subspace and recurse.

Selecting the Pivot Points

The pivot points should lie along the principal axes, and hence should be far apart. • Select any point x_0. • Let x_1 be the point furthest from x_0. • Let x_2 be the point furthest from x_1. • Return (x_1, x_2). [figure: points x_0, x_1, x_2]

Pseudo-Projections

Given pivots (x_a, x_b), for any third point y, we use the law of cosines to determine the relation of y along x_a x_b:

d_by^2 = d_ay^2 + d_ab^2 - 2 c_y d_ab

The pseudo-projection for y is:

c_y = (d_ay^2 + d_ab^2 - d_by^2) / (2 d_ab)

[figure: triangle x_a, x_b, y with distances d_ab, d_ay, d_by and projection c_y]

"Project to orthogonal plane"

Given the coordinates c along x_a x_b, we can compute distances within the "orthogonal hyperplane" using the Pythagorean theorem:

d'(y', z')^2 = d(y, z)^2 - (c_z - c_y)^2

Using d'(., .), recurse until k features are chosen. [figure: points y, z and their projections y', z' on the hyperplane orthogonal to x_a x_b]
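A rough FastMap sketch under these definitions, assuming numpy: it picks far-apart pivots with the heuristic above, computes the pseudo-projection c_y via the cosine law, and updates the squared distances for the orthogonal hyperplane before recursing. The toy points and distance function are assumptions.

```python
import numpy as np

def fastmap(X, dist, k=2):
    """Map the objects X into k coordinates using only the distance function."""
    n = len(X)
    D2 = np.array([[dist(X[i], X[j]) ** 2 for j in range(n)] for i in range(n)])
    coords = np.zeros((n, k))
    for col in range(k):
        # pivots: any point, its farthest point, then that point's farthest point
        a = 0
        b = int(np.argmax(D2[a]))
        a = int(np.argmax(D2[b]))
        if D2[a, b] == 0:
            break                                # all remaining distances are zero
        d_ab = np.sqrt(D2[a, b])
        # pseudo-projection c_y = (d_ay^2 + d_ab^2 - d_by^2) / (2 d_ab)
        c = (D2[a] + D2[a, b] - D2[b]) / (2 * d_ab)
        coords[:, col] = c
        # distances in the orthogonal hyperplane: d'^2 = d^2 - (c_y - c_z)^2
        D2 = np.maximum(D2 - (c[:, None] - c[None, :]) ** 2, 0)
    return coords

# toy usage with Euclidean points (FastMap should roughly recover their geometry)
pts = [np.array(p, float) for p in [(0, 0), (10, 0), (0, 6), (10, 6), (5, 3)]]
emb = fastmap(pts, lambda x, y: float(np.linalg.norm(x - y)), k=2)
print(emb.round(2))
```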

Random Projections

• Based on the Johnson-Lindenstrauss lemma: for 0 < e < 1/2, any (sufficiently large) set S of M points in R^n, and k = O(e^-2 ln M), there exists a linear map f: S -> R^k such that

(1 - e) D(S, T) < D(f(S), f(T)) < (1 + e) D(S, T) for all S, T in S

• A random projection achieves this with constant probability.

Random Projection: Application • Set k = O(e^-2 ln M) • Select k random n-dimensional vectors (one approach is to select k Gaussian-distributed vectors with mean 0 and variance 1: N(0,1)) • Project the original points onto the k vectors. • The resulting k-dimensional space approximately preserves the distances with high probability.
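A sketch of this recipe, assuming numpy: project onto k random N(0,1) directions, scaled by 1/sqrt(k) so that squared distances are preserved in expectation. The dimensions, seed, and sample data are assumptions.

```python
import numpy as np

def random_project(X, k, seed=0):
    """Project n-dimensional points onto k random Gaussian directions
    (entries drawn from N(0, 1)), scaled by 1/sqrt(k)."""
    rng = np.random.default_rng(seed)
    n_dims = X.shape[1]
    R = rng.normal(loc=0.0, scale=1.0, size=(n_dims, k))
    return X @ R / np.sqrt(k)

# sanity check: a pairwise distance before and after projecting 1000-d points to 50-d
X = np.random.default_rng(1).normal(size=(20, 1000))
Y = random_project(X, k=50)
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(round(orig, 1), round(proj, 1))   # the two values should be reasonably close
```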

Random Projection • A very useful technique, especially when used in conjunction with another technique (for example SVD). • Use random projection to reduce the dimensionality from thousands to hundreds, then apply SVD to reduce dimensionality further.