Transcript Document

Re-evaluation of IR Systems

Note: Slides are taken from Prof. Ray Larson’s web site (www.sims.berkeley.edu/~ray/)

BBY 220

Yaşar Tonta

Hacettepe Üniversitesi, [email protected]

yunus.hacettepe.edu.tr/~tonta/ BBY220 Bilgi Erişim İlkeleri - SLAYT 1

Evaluation of IR Systems

• Precision vs. Recall
• Cutoff Points
• Test Collections/TREC
• Blair & Maron Study

BBY 220 - SLAYT 2

Evaluation

• Why Evaluate?

• What to Evaluate?

• How to Evaluate?

BBY 220 - SLAYT 3

Why Evaluate?

• Determine if the system is desirable
• Make comparative assessments
• Others?

- SLAYT 4 BBY 220

What to Evaluate?

• How much of the information need is satisfied.

• How much was learned about a topic.

• Incidental learning: – How much was learned about the collection.

– How much was learned about other topics.

• How inviting the system is.

- SLAYT 5 BBY 220

Relevance

• In what ways can a document be relevant to a query?

– Answer precise question precisely.

– Partially answer question.

– Suggest a source for more information.

– Give background information.

– Remind the user of other knowledge.

– Others ...

- SLAYT 6 BBY 220

Relevance

• How relevant is the document – for this user, for this information need.

• Subjective, but
• Measurable to some extent
  – How often do people agree a document is relevant to a query
• How well does it answer the question?

– Complete answer? Partial? – Background Information?

– Hints for further exploration?

- SLAYT 7 BBY 220

What to Evaluate?

What can be measured that reflects users’ ability to use the system? (Cleverdon 66)
– Coverage of Information
– Form of Presentation
– Effort required/Ease of Use
– Time and Space Efficiency
– Recall
  • proportion of relevant material actually retrieved
– Precision
  • proportion of retrieved material actually relevant

- SLAYT 8 BBY 220

Relevant vs. Retrieved

[Venn diagram: the Retrieved set and the Relevant set within the set of all docs]

BBY 220 - SLAYT 9

Precision vs. Recall

Precision = |RelRetrieved| / |Retrieved|

Recall = |RelRetrieved| / |Rel in Collection|

[Venn diagram: Retrieved and Relevant sets within all docs]

- SLAYT 10 BBY 220
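To make the two ratios concrete, here is a minimal Python sketch (not from the slides; the document ids and judgments are invented) that computes precision and recall for a single query from the set of retrieved documents and the set of relevant documents:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved -- set of document ids returned by the system
    relevant  -- set of document ids judged relevant for the query
    """
    rel_retrieved = retrieved & relevant          # relevant documents that were retrieved
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 of the 6 retrieved documents are relevant,
# and the collection contains 10 relevant documents in total.
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6"}
relevant = {"d2", "d3", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12"}
print(precision_recall(retrieved, relevant))      # (0.666..., 0.4)
```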

Why Precision and Recall?

Get as much of the good stuff as possible while at the same time getting as little junk as possible.

- SLAYT 11 BBY 220

Retrieved vs. Relevant Documents

Very high precision, very low recall

BBY 220


- SLAYT 12

Retrieved vs. Relevant Documents

Very low precision, very low recall (0 in fact)

BBY 220


- SLAYT 13

Retrieved vs. Relevant Documents

High recall, but low precision

BBY 220


- SLAYT 14

Retrieved vs. Relevant Documents

High precision, high recall (at last!)

BBY 220


- SLAYT 15

Precision/Recall Curves

• There is a tradeoff between Precision and Recall
• So measure Precision at different levels of Recall
• Note: this is an AVERAGE over MANY queries

BBY 220

[Plot: precision (y-axis) vs. recall (x-axis), with measured points along the curve]

- SLAYT 16

Precision/Recall Curves

• Difficult to determine which of these two hypothetical results is better:

BBY 220

[Plot: precision vs. recall points for the two hypothetical results]

- SLAYT 17

Precision/Recall Curves

BBY 220 - SLAYT 18

Document Cutoff Levels

• Another way to evaluate:
  – Fix the number of documents retrieved at several levels:
    • top 5
    • top 10
    • top 20
    • top 50
    • top 100
    • top 500
  – Measure precision at each of these levels
  – Take (weighted) average over results
• This is a way to focus on how well the system ranks the first k documents.

- SLAYT 19 BBY 220
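A minimal sketch of precision at fixed document cutoffs, assuming a hypothetical ranked list and set of relevance judgments (neither comes from the slides):

```python
def precision_at_k(ranked_list, relevant, k):
    """Precision over the top k documents of a ranked result list."""
    top_k = ranked_list[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k

# Hypothetical ranking and judgments
ranking = ["d3", "d17", "d9", "d21", "d4", "d8", "d12", "d25", "d30", "d40"]
relevant = {"d3", "d9", "d40", "d99"}

for k in (5, 10):
    print(k, precision_at_k(ranking, relevant, k))
# P@5 = 2/5 = 0.4, P@10 = 3/10 = 0.3
```

A weighted average of such values over several cutoff levels, and then over many queries, gives the kind of summary the slide describes.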

Problems with Precision/Recall

• Can’t know true recall value
  – except in small collections
• Precision/Recall are related
  – A combined measure sometimes more appropriate
• Assumes batch mode
  – Interactive IR is important and has different criteria for successful searches
  – We will touch on this in the UI section
• Assumes a strict rank ordering matters.

- SLAYT 20 BBY 220

Relation to Contingency Table

                       Doc is Relevant    Doc is NOT relevant
Doc is retrieved              a                    b
Doc is NOT retrieved          c                    d

• Accuracy: (a+d) / (a+b+c+d)
• Precision: a / (a+b)
• Recall: ?

• Why don’t we use Accuracy for IR?

– (Assuming a large collection)
– Most docs aren’t relevant
– Most docs aren’t retrieved
– Inflates the accuracy value

- SLAYT 21 BBY 220
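A small numeric sketch of the point above: with invented counts for a large collection, the huge d cell makes accuracy look excellent even though precision and recall are poor.

```python
# Contingency counts for one hypothetical query over a 1,000,000-document collection
a = 20        # relevant and retrieved
b = 80        # not relevant but retrieved
c = 180       # relevant but not retrieved
d = 999_720   # not relevant and not retrieved

accuracy = (a + d) / (a + b + c + d)   # 0.99974 -- looks nearly perfect
precision = a / (a + b)                # 0.20
recall = a / (a + c)                   # 0.10
print(accuracy, precision, recall)
```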

The E-Measure

Combine Precision and Recall into one number (van Rijsbergen 79):

E = 1 - (b² + 1) / (b²/R + 1/P)

Equivalently, E = 1 - 1 / (α/P + (1 - α)/R), with α = 1/(b² + 1)

P = precision
R = recall
b = measure of relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall.

BBY 220 - SLAYT 22
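A minimal sketch of the E measure as reconstructed above, with invented precision and recall values:

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E measure: E = 1 - (b^2 + 1) * P * R / (b^2 * P + R).

    b < 1 weights precision more heavily, b > 1 weights recall more heavily.
    """
    if precision == 0 or recall == 0:
        return 1.0
    return 1.0 - (b * b + 1) * precision * recall / (b * b * precision + recall)

print(e_measure(0.5, 0.4, b=1.0))   # balanced case
print(e_measure(0.5, 0.4, b=0.5))   # user cares twice as much about precision
```

With b = 1 this is one minus the harmonic mean of precision and recall, so E falls as the combined effectiveness rises.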

How to Evaluate?

Test Collections

BBY 220 - SLAYT 23

TREC

• Text REtrieval Conference/Competition
  – Run by NIST (National Institute of Standards & Technology)
  – 2001 was the 10th year; 11th TREC in November
• Collection: 5 gigabytes (5 CD-ROMs), >1.5 million docs
  – Newswire & full-text news (AP, WSJ, Ziff, FT, San Jose Mercury, LA Times)
  – Government documents (Federal Register, Congressional Record)
  – FBIS (Foreign Broadcast Information Service)
  – US Patents

- SLAYT 24 BBY 220

TREC (cont.)

• Queries + Relevance Judgments
  – Queries devised and judged by “Information Specialists”
  – Relevance judgments done only for those documents retrieved -- not the entire collection!

• Competition
  – Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
  – Results judged on precision and recall, going up to a recall level of 1000 documents

- SLAYT 25 BBY 220

Sample TREC queries (topics)

Number: 168
Topic: Financing AMTRAK

<desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK)

<narr> Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

- SLAYT 26 BBY 220

BBY 220 - SLAYT 27

TREC

• Benefits:
  – made research systems scale to large collections (pre-WWW)
  – allows for somewhat controlled comparisons
• Drawbacks:
  – emphasis on high recall, which may be unrealistic for what most users want
  – very long queries, also unrealistic
  – comparisons still difficult to make, because systems are quite different on many dimensions
  – focus on batch ranking rather than interaction
    • There is an interactive track.

- SLAYT 28 BBY 220

TREC is changing

• Emphasis on specialized “tracks”
  – Interactive track
  – Natural Language Processing (NLP) track
  – Multilingual tracks (Chinese, Spanish)
  – Filtering track
  – High-Precision
  – High-Performance
• http://trec.nist.gov/

- SLAYT 29 BBY 220

TREC Results

• Differ each year
• For the main track:
  – Best systems not statistically significantly different
  – Small differences sometimes have big effects
    • how good was the hyphenation model
    • how was document length taken into account
  – Systems were optimized for longer queries and all performed worse for shorter, more realistic queries

- SLAYT 30 BBY 220

What to Evaluate?

• Effectiveness
  – Difficult to measure
  – Recall and Precision are one way
  – What might be others?

BBY 220 - SLAYT 31

How Test Runs are Evaluated

Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} : 10 relevant documents

Ranked result list (* marks a relevant document):
1. d123 *
2. d84
3. d56 *
4. d6
5. d8
6. d9 *
7. d511
8. d129
9. d187
10. d25 *
11. d38
12. d48
13. d250
14. d113
15. d3 *

• First ranked doc is relevant, which is 10% of the total relevant. Therefore Precision at the 10% Recall level is 100%
• Next relevant gives us 66% Precision at the 20% Recall level
• Etc.

Examples from Chapter 3 in Baeza-Yates

- SLAYT 32 BBY 220

Graphing for a Single Query

[Plot: precision (0-100%) vs. recall (0-100%) for the single query above]

BBY 220 - SLAYT 33

Averaging Multiple Queries

P̄(r) = (1/Nq) · Σ_{i=1..Nq} Pi(r)

where P̄(r) is the average precision at recall level r, Nq is the number of queries, and Pi(r) is the precision at recall level r for the i-th query.

- SLAYT 34 BBY 220
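The computation walked through on the “How Test Runs are Evaluated” slide can be sketched in a few lines of Python; the ranked list and relevant set copy the slide’s example, everything else is illustrative:

```python
def precision_at_recall_points(ranking, relevant):
    """For each relevant document found, yield (recall, precision) at that rank."""
    found = 0
    points = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

for recall, precision in precision_at_recall_points(ranking, relevant):
    print(f"recall {recall:.0%}  precision {precision:.0%}")
# first line: recall 10%, precision 100% (d123 at rank 1), as on the slide
```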
Interpolation

Rq = {d3, d56, d129}

Ranked result list (* marks a relevant document):
1. d123
2. d84
3. d56 *
4. d6
5. d8
6. d9
7. d511
8. d129 *
9. d187
10. d25
11. d38
12. d48
13. d250
14. d113
15. d3 *

• First relevant doc is d56, which gives recall and precision of 33.3%
• Next relevant (d129) gives us 66% recall at 25% precision
• Next (d3) gives us 100% recall with 20% precision
• How do we figure out the precision at the 11 standard recall levels?

BBY 220 - SLAYT 35

Interpolation

rj, j = 0, 1, 2, ..., 10 is a reference to the j-th standard recall level

P(rj) = max { P(r) : rj ≤ r ≤ rj+1 }

i.e., the maximum known precision at any recall level between the j-th and the (j+1)-th

- SLAYT 36 BBY 220

Interpolation

• So, at recall levels 0%, 10%, 20%, and 30% the interpolated precision is 33.3%
• At recall levels 40%, 50%, and 60% interpolated precision is 25%
• And at recall levels 70%, 80%, 90% and 100%, interpolated precision is 20%
• Giving graph…

- SLAYT 37 BBY 220

Interpolation

[Plot: interpolated precision (0-100%) vs. recall (0-100%) for the query above]

BBY 220 - SLAYT 38
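A minimal sketch of the 11-point interpolation just described, using the (recall, precision) points from the Rq = {d3, d56, d129} example. It follows the common convention of taking the maximum precision at any recall greater than or equal to each standard level, which reproduces the numbers on the slide:

```python
def interpolate_11_point(points):
    """points: list of (recall, precision) pairs for one query.

    Returns interpolated precision at recall levels 0.0, 0.1, ..., 1.0,
    where P(r_j) is the maximum precision at any recall >= r_j.
    """
    levels = [j / 10 for j in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# (recall, precision) at each relevant document, as computed on the earlier slide
points = [(1 / 3, 1 / 3), (2 / 3, 0.25), (1.0, 0.20)]
for level, p in zip(range(0, 101, 10), interpolate_11_point(points)):
    print(f"{level}%: {p:.1%}")
# 0%-30%: 33.3%, 40%-60%: 25.0%, 70%-100%: 20.0%
```

Averaging these 11-value vectors over many queries gives the P̄(r) of the “Averaging Multiple Queries” slide.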
Problems with Precision/Recall

• Can’t know true recall value
  – except in small collections
• Precision/Recall are related
  – A combined measure sometimes more appropriate
• Assumes batch mode
  – Interactive IR is important and has different criteria for successful searches
  – We will touch on this in the UI section
• Assumes a strict rank ordering matters.

- SLAYT 39 BBY 220

Blair and Maron 1985

• A classic study of retrieval effectiveness
  – earlier studies were on unrealistically small collections
• Studied an archive of documents for a legal suit
  – ~350,000 pages of text
  – 40 queries
  – focus on high recall
  – Used IBM’s STAIRS full-text system
• Main Result:
  – The system retrieved less than 20% of the relevant documents for a particular information need; lawyers thought they had 75%
• But many queries had very high precision

- SLAYT 40 BBY 220

Blair and Maron, cont.

• How they estimated recall
  – generated partially random samples of unseen documents
  – had users (unaware these were random) judge them for relevance
• Other results:
  – two lawyers’ searches had similar performance
  – lawyers’ recall was not much different from paralegals’

- SLAYT 41 BBY 220

Blair and Maron, cont.

• Why recall was low
  – users can’t foresee exact words and phrases that will indicate relevant documents
    • “accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” …
    • differing technical terminology
    • slang, misspellings
  – Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

- SLAYT 42 BBY 220

Relationship between Precision and Recall

                       Doc is Relevant     Doc is NOT relevant
Doc is retrieved        N(ret ∧ rel)         N(ret ∧ ¬rel)         N(ret)
Doc is NOT retrieved    N(¬ret ∧ rel)        N(¬ret ∧ ¬rel)        N(¬ret)
                        N(rel)               N(¬rel)               N(tot)

Buckland & Gey, JASIS: Jan 1994

BBY 220 - SLAYT 44

Recall under various retrieval assumptions

[Plot: recall (0.0-1.0) vs. proportion of documents retrieved, under Perfect, Tangent Parabolic, Parabolic Recall, random, and Perverse retrieval assumptions; 1000 documents, 100 relevant]

BBY 220

Precision under various assumptions

[Plot: precision (0.0-1.0) vs. proportion of documents retrieved, under Perfect, Tangent Parabolic, Parabolic Recall, random, and Perverse retrieval assumptions; 1000 documents, 100 relevant]

BBY 220 - SLAYT 46

What to Evaluate?

• Effectiveness
  – Difficult to measure
  – Recall and Precision are one way
  – What might be others?

BBY 220 - SLAYT 47

Other Ways of Evaluating

• “The primary function of a retrieval system is conceived to be that of saving its users to as great an extent as possible, the labor of perusing and discarding irrelevant documents, in their search for relevant ones”

William S. Cooper (1968) “Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems”, American Documentation, 19(1).

BBY 220 - SLAYT 48

Other Ways of Evaluating

• If the purpose of a retrieval system is to rank the documents in descending order of their probability of relevance for the user, then maybe the sequence is important and can be used as a way of evaluating systems.
• How to do it?

- SLAYT 49 BBY 220

Query Types

• Only one relevant document is wanted
• Some arbitrary number n is wanted
• All relevant documents are wanted
• Some proportion of the relevant documents is wanted
• No documents are wanted? (Special case)

- SLAYT 50 BBY 220
Search Length and Expected Search Length

• Work by William Cooper in the late ’60s
• Issues with IR Measures:
  – Usually not a single measure
  – Assume “retrieved” and “not retrieved” sets without considering more than two classes
  – No built-in way to compare to purely random retrieval
  – Don’t take into account how much relevant material the user actually needs (or wants)

- SLAYT 51 BBY 220

Weak Ordering in IR Systems

• The assumption that there are two sets of “Retrieved” and “Not Retrieved” is not really accurate.
• IR Systems usually rank into many sets of equal retrieval weights
• Consider Coordinate-Level ranking…

- SLAYT 52 BBY 220

Weak Ordering

BBY 220 - SLAYT 53
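A hedged sketch of Cooper’s expected search length for a weakly ordered output, using the formulation ESL = j + i·s/(r+1) usually credited to Cooper; the group sizes and the number of wanted documents below are invented:

```python
def expected_search_length(groups, wanted):
    """Expected search length for a weakly ordered output.

    groups -- list of (relevant_count, nonrelevant_count) per tied rank group,
              in retrieval order
    wanted -- number of relevant documents the user wants

    ESL = j + i*s/(r+1), where j counts the non-relevant documents in fully
    examined earlier groups, and the final (partially examined) group holds
    r relevant and i non-relevant documents of which s relevant are still needed.
    """
    j = 0
    needed = wanted
    for r, i in groups:
        if needed <= r:                    # the search ends inside this group
            return j + i * needed / (r + 1)
        needed -= r                        # take every relevant doc in this group
        j += i                             # and wade through all its non-relevant docs
    return float("inf")                    # not enough relevant documents in the output

# Hypothetical weak ordering: 3 tied groups with (relevant, non-relevant) counts
groups = [(1, 2), (2, 3), (4, 10)]
print(expected_search_length(groups, wanted=2))   # 2 + 3*1/(2+1) = 3.0
```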
Filtering

• Characteristics of Filtering systems:
  – Designed for unstructured or semi-structured data
  – Deal primarily with text information
  – Deal with large amounts of data
  – Involve streams of incoming data
  – Filtering is based on descriptions of individual or group preferences – profiles. May be negative profiles (e.g. junk mail filters)
  – Filtering implies removing non-relevant material as opposed to selecting relevant.

- SLAYT 54 BBY 220

Filtering

• Similar to IR, with some key differences
• Similar to Routing – sending relevant incoming data to different individuals or groups is virtually identical to filtering with multiple profiles
• Similar to Categorization systems – attaching one or more predefined categories to incoming data objects – is also similar, but is more concerned with static categories (might be considered information extraction)

- SLAYT 55 BBY 220

Structure of an IR System

[Diagram, adapted from Soergel, p. 19: Search line – interest profiles & queries → formulating the query in terms of descriptors → Store 1: profiles/search requests. Storage line – documents & data → indexing (descriptive and subject) → Store 2: document representations. Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language). Comparison/matching of the two stores yields potentially relevant documents.]

BBY 220 - SLAYT 56

Structure of a Filtering System

[Diagram, adapted from Soergel, p. 19: individual or group users supply interest profiles → formulating the query in terms of descriptors → Store 1: profiles/search requests. Raw documents & data arrive as an incoming data stream → indexing/categorization/extraction → document surrogate stream. Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language). Comparison/filtering yields potentially relevant documents.]

BBY 220 - SLAYT 57

Major Differences between IR and Filtering

• IR recognizes inherent faults of queries
  – Filtering assumes profiles can be better than IR queries
• IR is concerned with collection and organization of texts
  – Filtering is concerned with distribution of texts
• IR is concerned with selection from a static database.
  – Filtering is concerned with a dynamic data stream
• IR is concerned with single interaction sessions
  – Filtering is concerned with long-term changes

- SLAYT 58 BBY 220

Contextual Differences

• In filtering the timeliness of the text is often of greatest significance
• Filtering often has a less well-defined user community
• Filtering often has privacy implications (how complete are user profiles? what do they contain?)
• Filtering profiles can (should?) adapt to user feedback
  – Conceptually similar to Relevance feedback

- SLAYT 59 BBY 220

Methods for Filtering

• Adapted from IR
  – E.g. use a retrieval ranking algorithm against incoming documents.
• Collaborative filtering
  – Individual and comparative profiles

- SLAYT 60 BBY 220
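As a toy illustration of the “adapted from IR” approach above, the sketch below scores an incoming document stream against a stored profile and keeps whatever clears a threshold; the profile terms, weights, and threshold are all invented:

```python
def score(profile, doc_terms):
    """Dot product of a weighted term profile with a document's term counts."""
    return sum(weight * doc_terms.get(term, 0) for term, weight in profile.items())

# Hypothetical user profile and delivery threshold
profile = {"railroad": 2.0, "subsidy": 1.5, "amtrak": 3.0}
threshold = 3.0

incoming = [
    {"amtrak": 2, "subsidy": 1},          # scores 7.5 -> delivered
    {"football": 3, "score": 2},          # scores 0.0 -> filtered out
]
for doc_terms in incoming:
    decision = "deliver" if score(profile, doc_terms) >= threshold else "discard"
    print(decision)
```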
TDT: Topic Detection and Tracking

• Intended to automatically identify new topics – events, etc. – from a stream of text

- SLAYT 61 BBY 220

Topic Detection and Tracking

Introduction and Overview
– The TDT3 R&D Challenge
– TDT3 Evaluation Methodology

Slides from “Overview NIST Topic Detection and Tracking - Introduction and Overview” by G. Doddington - http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt99/presentations/index.htm

BBY 220 - SLAYT 62

TDT Task Overview

• 5 R&D Challenges:
  – Story Segmentation
  – Topic Tracking
  – Topic Detection
  – First-Story Detection
  – Link Detection
• TDT3 Corpus Characteristics:
  – Two types of sources: text and speech
  – Two languages: English (30,000 stories), Mandarin (10,000 stories)
  – 11 different sources: 8 English (including ABC, CNN, VOA, APW, NYT), 3 Mandarin (including VOA, ZBN)
  – see http://www.itl.nist.gov/iaui/894.01/tdt3/tdt3.htm for details

BBY 220 - SLAYT 63

Preliminaries

A topic is … a seminal event or activity, along with all directly related events and activities.

A story is … a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.

- SLAYT 64 BBY 220

Example Topic

Title: Mountain Hikers Lost
– WHAT: 35 or 40 young Mountain Hikers were lost in an avalanche in France around the 20th of January.
– WHERE: Orres, France
– WHEN: January 1998
– RULES OF INTERPRETATION: 5. Accidents

- SLAYT 65 BBY 220

The Link Detection Task

To detect whether a pair of stories discuss the same topic.

[Diagram: two stories – same topic?]

• The topic discussed is a free variable.
• Topic definition and annotation is unnecessary.
• The link detection task represents a basic functionality, needed to support all applications (including the TDT applications of topic detection and tracking).
• The link detection task is related to the topic tracking task, with Nt = 1.

- SLAYT 66 BBY 220

Latent Semantic Indexing

• Latent Semantic Indexing (LSI)
• Issues in IR

BBY 220 - SLAYT 67

LSI Rationale

• The words that searchers use to describe their information needs are often not the same words used by authors to describe the same information.
• I.e., index terms and user search terms often do NOT match
  – Synonymy
  – Polysemy
• Following examples from Deerwester, et al. Indexing by Latent Semantic Analysis. JASIS 41(6), pp. 391-407, 1990

- SLAYT 68 BBY 220

LSI Rationale

[Table from Deerwester et al.: terms (access, document, retrieval, information, theory, database, indexing, computer) vs. documents D1-D3; x marks a term occurring in a document, * marks a query-term match, REL marks the relevant document]

Query: IDF in computer-based information lookup
Only matching words are “information” and “computer”
D1 is relevant, but has no words in the query…

BBY 220 - SLAYT 69

LSI Rationale

• Problems of synonyms
  – If not specified by the user, will miss synonymous terms
  – Is automatic expansion from a thesaurus useful?
  – Are the semantics of the terms taken into account?
• Is there an underlying semantic model of terms and their usage in the database?

- SLAYT 70 BBY 220

LSI Rationale

• Statistical techniques such as Factor Analysis have been developed to derive underlying meanings/models from larger collections of observed data
• A notion of semantic similarity between terms and documents is central for modelling the patterns of term usage across documents
• Researchers began looking at these methods that focus on the proximity of items within a space (as in the vector model)

- SLAYT 71 BBY 220

LSI Rationale

• Researchers (Deerwester, Dumais, Furnas, Landauer and Harshman) considered models using the following criteria
  – Adjustable representational richness
  – Explicit representation of both terms and documents
  – Computational tractability for large databases

- SLAYT 72 BBY 220
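A minimal sketch of the latent-semantic idea behind these slides, assuming NumPy is available: factor a small term-document count matrix with a truncated SVD and compare documents in the reduced space. The matrix is invented and is not the Deerwester example; documents with related vocabulary come out close in the latent space while unrelated ones do not.

```python
import numpy as np

# Hypothetical term-document count matrix (rows: terms, columns: documents d0..d3)
A = np.array([
    [2, 1, 0, 0],   # "information"
    [1, 2, 1, 0],   # "retrieval"
    [0, 1, 1, 0],   # "indexing"
    [0, 0, 0, 3],   # "anatomy"
], dtype=float)

# Truncated SVD: keep k latent dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T          # documents as rows in the k-dim latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc_vectors[0], doc_vectors[1]))      # related vocabulary: close in latent space
print(cosine(doc_vectors[0], doc_vectors[3]))      # no shared terms: far apart
```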
Clustering and Automatic Classification

• Clustering
• Automatic Classification
• Cluster-enhanced search

BBY 220 - SLAYT 73

Classification

• The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated.
• In document classification the items are grouped together because they are likely to be wanted together
  – For example, items about the same topic.

- SLAYT 74 BBY 220

Automatic Indexing and Classification

• Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.
• More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.
• Automatic classification attempts to automatically group similar documents using either:
  – A fully automatic clustering method.
  – An established classification scheme and a set of documents already indexed by that scheme.

- SLAYT 75 BBY 220

Background and Origins

• Early suggestion by Fairthorne
  – “The Mathematics of Classification”
• Early experiments by Maron (1961) and Borko and Bernick (1963)
• Work in Numerical Taxonomy and its application to Information Retrieval: Jardine, Sibson, van Rijsbergen, Salton (1970’s).
• Early IR clustering work more concerned with efficiency issues than semantic issues.

- SLAYT 76 BBY 220

Document Space has High Dimensionality

• What happens beyond three dimensions?
• Similarity still has to do with how many tokens are shared in common.
• More terms -> harder to understand which subsets of words are shared among similar documents.
• One approach to handling high dimensionality: Clustering

- SLAYT 77 BBY 220

Vector Space Visualization

BBY 220 - SLAYT 78

Cluster Hypothesis

• The basic notion behind the use of classification and clustering methods:
• “Closely associated documents tend to be relevant to the same requests.” – C.J. van Rijsbergen

- SLAYT 79 BBY 220

Classification of Classification Methods

• Class Structure
  – Intellectually Formulated
    • Manual assignment (e.g. Library classification)
    • Automatic assignment (e.g. Cheshire Classification Mapping)
  – Automatically derived from collection of items
    • Hierarchic Clustering Methods (e.g. Single Link)
    • Agglomerative Clustering Methods (e.g. Dattola)
    • Hybrid Methods (e.g. Query Clustering)

- SLAYT 80 BBY 220

Classification of Classification Methods

• Relationship between properties and classes
  – monothetic
  – polythetic
• Relation between objects and classes
  – exclusive
  – overlapping
• Relation between classes and classes
  – ordered
  – unordered

Adapted from Sparck Jones

- SLAYT 81 BBY 220

Properties and Classes

• Monothetic
  – Class defined by a set of properties that are both necessary and sufficient for membership in the class
• Polythetic
  – Class defined by a set of properties such that to be a member of the class some individual must have some number (usually large) of those properties, and that a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.

- SLAYT 82 BBY 220
Monothetic vs. Polythetic

[Table adapted from van Rijsbergen, ’79: individuals 1-8 vs. attributes A-H, with individuals 1-4 labelled Polythetic and individuals 5-8 labelled Monothetic]

BBY 220 - SLAYT 83

Exclusive vs. Overlapping

• Item can either belong exclusively to a single class
• Items can belong to many classes, sometimes with a “membership weight”

- SLAYT 84 BBY 220

Ordered vs. Unordered

• Ordered classes have some sort of structure imposed on them
  – Hierarchies are typical of ordered classes
• Unordered classes have no imposed precedence or structure and each class is considered on the same “level”
  – Typical in agglomerative methods

- SLAYT 85 BBY 220

Text Clustering

Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw

[Scatter plot of items along Term 1 and Term 2 axes]

BBY 220 - SLAYT 86

Text Clustering

• Finds overall similarities among groups of documents
• Finds overall similarities among groups of tokens
• Picks out some themes, ignores others

- SLAYT 88 BBY 220

Coefficients of Association

• Simple: |A ∩ B|
• Dice’s coefficient: 2|A ∩ B| / (|A| + |B|)
• Jaccard’s coefficient: |A ∩ B| / |A ∪ B|
• Cosine coefficient: |A ∩ B| / (|A|^1/2 · |B|^1/2)
• Overlap coefficient: |A ∩ B| / min(|A|, |B|)

- SLAYT 89 BBY 220

Pair-wise Document Similarity

[Table: term counts for documents A-D over the terms nova, galaxy, heat, h’wood, film, role, diet, fur]

How to compute document similarity?

- SLAYT 90 BBY 220
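A minimal sketch of two of the coefficients listed above (Dice on term sets, cosine on term-count vectors); the counts are invented stand-ins for the table whose cell values were lost in extraction:

```python
import math

def dice(a, b):
    """Dice's coefficient on term sets: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(u, v):
    """Cosine coefficient on term-count vectors (dicts term -> count)."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical counts standing in for the slide's A-D table
doc_a = {"nova": 1, "galaxy": 3, "heat": 1}
doc_b = {"galaxy": 5, "heat": 2, "film": 1}
print(dice(set(doc_a), set(doc_b)))     # overlap of the two vocabularies
print(cosine(doc_a, doc_b))             # similarity of the weighted vectors
```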
Another use of clustering

• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
• “Project” these onto a 2D graphical representation:

- SLAYT 91 BBY 220

Clustering Multi-Dimensional Document Space (image from Wise et al. 95)

BBY 220 - SLAYT 92

Clustering Multi-Dimensional Document Space (image from Wise et al. 95)

BBY 220 - SLAYT 93

Concept “Landscapes”

[Concept landscape visualization with regions labeled Disease, Pharmacology, Anatomy, Hospitals, Legal (e.g., Lin, Chen, Wise et al.)]

• Too many concepts, or too coarse
• Single concept per document
• No titles
• Browsing without search

- SLAYT 94 BBY 220

Clustering

• Advantages:
  – See some main themes
• Disadvantage:
  – Many ways documents could group together are hidden
• Thinking point: what is the relationship to classification systems and facets?

- SLAYT 95 BBY 220

Automatic Class Assignment

Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered; clusters are order-independent, usually based on an intellectually derived scheme

[Diagram: documents flow into a search engine that matches them against class pseudo-documents]

1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents.
3. Obtain ranked list.
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category.

- SLAYT 96 BBY 220
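Finally, a toy sketch of the class-assignment flow on the last slide: each intellectually derived class is represented by a pseudo-document, an incoming document is scored against every class, and it is assigned to the classes ranked over a threshold, or else to the top-ranked class. Class names, term weights, and the threshold are invented.

```python
def rank_classes(doc_terms, class_profiles):
    """Score a document against each class pseudo-document and rank the classes."""
    scores = {
        name: sum(weight * doc_terms.get(term, 0) for term, weight in profile.items())
        for name, profile in class_profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical class pseudo-documents
class_profiles = {
    "Transportation": {"railroad": 2.0, "amtrak": 2.0, "subsidy": 1.0},
    "Medicine": {"anatomy": 2.0, "hospital": 2.0},
}
doc = {"amtrak": 3, "subsidy": 1}

ranked = rank_classes(doc, class_profiles)
threshold = 1.0
assigned = [name for name, s in ranked if s > threshold] or [ranked[0][0]]
print(assigned)    # classes over the threshold, else the single top-ranked class
```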