Introduction to Information Retrieval


Outline

- What is the IR problem?
- How to organize an IR system? (or: the main processes in IR): indexing and retrieval
- System evaluation
- Some current research topics

The problem of IR

Goal = find documents relevant to an information need from a large document set.
(Diagram: the user's information need is turned into a query; the IR system retrieves an answer list from the document collection.)

Example

(Screenshot: a Google web search, as an everyday example of an IR system.)

IR problem

First applications: in libraries (1950s). Example record:
  ISBN: 0-201-12227-8
  Author: Salton, Gerard
  Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
  Publisher: Addison-Wesley
  Date: 1989
  Content: …
A record has external attributes (e.g. ISBN, author) and an internal attribute (the content).
- Search by external attributes = search in a DB
- IR: search by content

Phases of an IR system

(Architecture diagram: the user interface receives the user's need as text and collects user feedback; text processing produces a logical view of the documents and of the query; query manipulation refines the query, e.g. into a form like (expert OR proficient OR skillful) AND (design OR implement); the indexing / DB manager module builds the web search index over the document repository; retrieved documents are weighted and ranked before being returned.)

Main problems in IR

- Document and query indexing: how to best represent their contents?
- Query evaluation (or retrieval process): to what extent does a document correspond to a query?
- System evaluation: how good is a system? Are the retrieved documents relevant? (precision) Are all the relevant documents retrieved? (recall)

Document indexing

Goal = find the important meanings and create an internal representation.
Factors to consider:
- Accuracy in representing meanings (semantics)
- Exhaustiveness (cover all the contents)
- Facility for the computer to manipulate
What is the best representation of contents?

- Character string: not precise enough
- Word: good coverage, not precise
- Phrase: poorer coverage, more precise
- Concept: poor coverage, precise

(Diagram: from string to word to phrase to concept, coverage (recall) decreases while accuracy (precision) increases.)

Keywords

- Keywords are the most common way to represent meaning.
- However, modern IR systems (search engines such as Google) use more than simple keywords.

Keyword selection and weighting

How to select important keywords?

(Diagram: frequency and informativity curves plotted against word rank, from maximum to minimum.)

tf*idf weighting scheme

- tf = term frequency: frequency of a term/keyword in a document. The higher the tf, the higher the importance (weight) of the term for the document.
- df = document frequency: the number of documents containing the term; it reflects the distribution of the term.
- idf = inverse document frequency: the unevenness of the term's distribution in the corpus, i.e. the specificity of the term to a document. The more evenly the term is distributed, the less specific it is to a document.

weight(t,D) = tf(t,D) * idf(t)

Some common tf*idf schemes

- tf(t,D) = freq(t,D)
- tf(t,D) = log[freq(t,D)]
- tf(t,D) = log[freq(t,D)] + 1
- tf(t,D) = freq(t,D) / Max_t'[freq(t',D)]
- idf(t) = log(N/n), where n = number of documents containing t and N = number of documents in the corpus
- weight(t,D) = tf(t,D) * idf(t)
- Normalization: cosine normalization, division by the maximum, …
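A minimal sketch (illustrative, not from the slides) of the simplest scheme, tf(t,D) = freq(t,D) and idf(t) = log(N/n), applied to two toy documents:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute weight(t,D) = freq(t,D) * log(N/n) for every term of every document."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # n: docs containing t
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["comput", "architect", "comput"],
        ["comput", "network", "network"]]
print(tfidf_weights(docs))
# 'comput' appears in every document, so its idf (and weight) is 0
```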

Text processing in IR: Stopwords / Stoplist

- Function words do not bear useful information for IR: of, in, about, with, I, although, …
- Stoplist: contains stopwords that are not to be used as index terms
  - Prepositions
  - Articles
  - Pronouns
  - Some adverbs and adjectives
  - Some frequent words (e.g. "document")
- The removal of stopwords usually improves IR effectiveness
- A few "standard" stoplists are commonly used.
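A minimal sketch of stoplist filtering; the stoplist here is a tiny illustrative subset, not one of the standard lists:

```python
STOPLIST = {"of", "in", "about", "with", "i", "although", "the", "a", "an"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stoplist (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPLIST]

print(remove_stopwords("The removal of stopwords usually improves IR".split()))
# ['removal', 'stopwords', 'usually', 'improves', 'IR']
```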

Text processing in IR: Stemming

- Reason: different word forms may bear similar meanings (e.g. search, searching); create a "standard" representation for them
- Stemming: removing some endings of words, e.g. computer, compute, computes, computing, computed, computation -> comput

Porter algorithm

(Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137)
- Step 1: plurals and past participles
  - SSES -> SS: caresses -> caress
  - (*v*) ING -> : motoring -> motor
- Step 2: adj -> n, n -> v, n -> adj, …
  - (m>0) OUSNESS -> OUS: callousness -> callous
  - (m>0) ATIONAL -> ATE: relational -> relate
- Step 3:
  - (m>0) ICATE -> IC: triplicate -> triplic
- Step 4:
  - (m>1) AL -> : revival -> reviv
  - (m>1) ANCE -> : allowance -> allow
- Step 5:
  - (m>1) E -> : probate -> probat
  - (m>1 and *d and *L) -> single letter: controll -> control
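A minimal sketch using NLTK's implementation of the Porter stemmer (assuming NLTK is installed); it reproduces the computer/…/computation -> comput conflation shown on the stemming slide:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({w: stemmer.stem(w) for w in words})
# all six forms are reduced to the stem 'comput'
```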

Problems with Porter’s algorithm

- Grouping errors (distinct words conflated to the same stem):
  - organization, organ -> organ
  - police, policy -> polic
  - arm, army -> arm
  - (in Italian) matto, mattone -> matt
- Omissions (related words not conflated):
  - cylinder, cylindrical
  - create, creation
  - Europe, European

Sample text:

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter’s:

such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovins’s:

such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paice's:

such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Text processing in IR: Lemmatization

- Transform words to a standard form according to their syntactic category, e.g. verb + ing -> verb, noun + s -> noun
- Needs POS tagging
- More accurate than stemming, but needs more resources
- It is crucial to choose the stemming/lemmatization rules: noise vs. recognition rate, a compromise between precision and recall
  - light/no stemming: -recall, +precision
  - severe stemming: +recall, -precision
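A minimal sketch with NLTK's WordNet lemmatizer (assuming the WordNet data is available); the syntactic category has to be supplied, which is why POS tagging is needed first:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("searching", pos="v"))  # verb + ing -> verb: 'search'
print(lemmatizer.lemmatize("geniuses", pos="n"))   # noun + s -> noun: 'genius'
```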

Example

Sir Timothy John Berners-Lee OM KBE FRS FREng FRSA (born 8 June 1955) is an computer scientist and MIT professor credited with inventing the World Wide Web. On 25 December 1990 he implemented the first successful communication between an HTTP client and server via the Internet with the help of Robert Cailliau and a young student staff at CERN. He was ranked Joint First alongside Albert Hofmann in The Telegraph's list of 100 greatest living geniuses.[2] Berners-Lee is the director of the World Wide Web Consortium (W3C), which oversees the Web's continued development, the founder of the World Wide Web Foundation and he is a senior researcher and holder of the 3Com Founders Chair at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).


Text processing in IR: POS tagging

Sir/NNP Timothy/NNP John/NNP Berners-Lee/NNP OM/NNP KBE/NNP FRS/NNP FREng/NNP FRSA/NNP (/( born/VBN 8/NNP June/NNP 1955/CD )/) is/VBZ an/DT computer/NN scientist/NN and/CC MIT/NNP professor/NN credited/VBN with/IN inventing/VBG the/DT World/NNP Wide/JJ Web/NN ./.

On/IN 25/CD December/NNP 1990/CD he/PRP implemented/VBN the/DT first/JJ successful/JJ communication/NN between/IN an/DT HTTP/NNP client/NN and/CC server/NN via/IN the/DT Internet/NNP with/IN the/DT help/NN of/IN Robert/NNP Cailliau/NNP and/CC a/DT young/JJ student/NN staff/NN at/IN CERN/NNP ./.

He/PRP was/VBD ranked/VBN Joint/NNP First/JJ alongside/IN Albert/NNP Hofmann/NNP in/IN The/DT Telegraph's/NNP list/NN of/IN 100/CD greatest/JJS living/VBG geniuses/NNS ./.

[/( 2/CD ]/) Berners-Lee/NNP is/VBZ the/DT director/NN of/IN the/DT World/NNP Wide/JJ Web/NNP Consortium/NNP (/( W3C/NNP )/) ,/, which/WDT oversees/VBZ the/DT Web's/NNP continued/VBD development/NN ,/, the/DT founder/NN of/IN the/DT World/NNP Wide/JJ Web/NN Foundation/NN and/CC he/PRP is/VBZ a/DT senior/JJ researcher/NN and/CC holder/NN of/IN the/DT 3Com/NNP Founders/NNPS Chair/NN at/IN the/DT MIT/NNP Computer/NNP Science/NNP and/CC Artificial/NNP Intelligence/NNP Laboratory/NNP (/( CSAIL/NNP )/) ./.
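A minimal sketch of how Penn Treebank tags like the ones above can be produced with NLTK's default tagger (assuming the tokenizer and tagger models are downloaded; the exact tags may differ slightly from the slide):

```python
import nltk

sentence = "Berners-Lee is the director of the World Wide Web Consortium."
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# e.g. [('Berners-Lee', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('director', 'NN'), ...]
```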


Text processing in IR: Term (named entity) extraction

Terms extracted: Sir Timothy John Berners-Lee, computer scientist, MIT professor, World Wide Web, HTTP client, Robert Cailliau, Joint First, Albert Hofmann, greatest living geniuses, World Wide Web Consortium (W3C), World Wide Web Foundation, senior researcher, 3Com Founders Chair, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)

Document representation

- Each document is represented by a set of weighted keywords (terms): D1 -> {(t1, w1), (t2, w2), …}
- e.g. D1 -> {(comput, 0.2), (architect, 0.3), …}; D2 -> {(comput, 0.1), (network, 0.5), …}
- Bag of words model: word ordering does not matter; keywords are kept in lexicographic order
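A minimal sketch of this representation as a dictionary per document; the weights here are raw term frequencies for illustration, whereas a real system would use tf*idf values:

```python
from collections import Counter

def bag_of_words(tokens):
    """Represent a document as {term: weight}; word order is discarded."""
    return dict(Counter(tokens))

d1 = bag_of_words(["comput", "architect", "comput"])
print(sorted(d1.items()))  # [('architect', 1), ('comput', 2)]
```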

Bag of words model

Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes.

For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the

message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns.

In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.

sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel

Indexing

 How can documents be efficiently searched?


Inverted file

- text (character positions of each word: 1 6 12 16 18 25 29 36 40 45 54 58 66 70):
  That house has a garden. The garden has many flowers. The flowers are beautiful
- Inverted file: the vocabulary, with occurrences identifying the position of each term within the text:
  beautiful -> 70
  flowers -> 45, 58
  garden -> 18, 29
  house -> 6
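A minimal sketch of building such a positional inverted file; a real index would also drop stopwords such as "that" or "has", and the exact character offsets depend on how punctuation and spaces are counted:

```python
from collections import defaultdict
import re

def positional_inverted_file(text):
    """Map each term to the (1-based) character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        index[match.group().lower()].append(match.start() + 1)
    return index

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
index = positional_inverted_file(text)
print(index["house"], index["garden"])  # positions of 'house' and of both occurrences of 'garden'
```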

Block indexing

- text, divided into blocks 1-4:
  That house has a garden. The garden has many flowers. The flowers are beautiful
- Inverted file: vocabulary with block occurrences instead of exact positions:
  beautiful -> 4
  flowers -> 3
  garden -> 2
  house -> 1

Inverted files with term frequencies

  keyword    df_i   postings (D_k, tf_ik)
  computer    3     (D_7, 4), …
  database    2     (D_1, 3), …
  science     4     (D_2, 4), …
  system      1     (D_5, 2)

df_i = number of documents in which t_i is found; tf_ik = frequency of t_i in document D_k. The vocabulary with its df values forms the index file; the postings form the list of occurrences.

Ranking

- The documents to be retrieved are those that include the words of the query
  - How to retrieve these documents?
  - How to present them to the user?
- In modern search engines, ranking is based on:
  - Content
  - Link analysis

Content-based retrieval

- Matching score model:
  - Document D = a set of weighted keywords
  - Query Q = a set of non-weighted keywords
  - R(D, Q) = a measure of the relevance of a document, given the query

Boolean model

- Document = a logical conjunction of keywords: D = t1 AND t2 AND … AND tn; D can be expressed as a boolean vector (ti = 1 if ti is in D, else 0), e.g. (1, 1, x, …, x)
- Query = a Boolean expression of keywords, i.e. a set of conjunctive components, e.g. Q = (t1 AND t2) OR (t3 AND NOT t4); each conjunctive component can be expressed as a boolean vector, e.g. (x, x, 1, 0, x, …)
- R(D, Q) = 1 if D satisfies Q, i.e. if at least one conjunctive component of Q matches D
- Problems:
  - R is either 1 or 0, giving an unordered set of documents; a query may return too many or too few documents
  - End-users cannot manipulate Boolean operators correctly, e.g. "documents about kangaroos and koalas" (where the intended result is usually the union)
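A minimal sketch of Boolean matching, with the query given as a list of conjunctive components (negated terms are left out for brevity); the names are illustrative:

```python
def boolean_match(doc_terms, query_components):
    """R(D,Q) = 1 if at least one conjunctive component is fully contained in D."""
    return int(any(component <= doc_terms for component in query_components))

doc = {"t1", "t2", "t5"}
query = [{"t1", "t2"}, {"t3"}]        # (t1 AND t2) OR t3
print(boolean_match(doc, query))      # 1
```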

Vector space model

- Vector space = all the keywords encountered
- Document D = <a1, a2, a3, …, an>, where ai = weight of ti in D
- Query Q = <b1, b2, b3, …, bn>, where bi = weight of ti in Q
- R(D, Q) = Sim(D, Q)

The vector space model

(Diagram: document vectors d1 and d2 in the three-dimensional term space t1, t2, t3, separated by an angle θ.)

Example (if weights are frequency counts)

Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + 1T3
  Q  = 0T1 + 0T2 + 2T3
Is D1 or D2 the more relevant document?
(Diagram: the two document vectors and the query plotted in the T1, T2, T3 space.)

Some formulas for Sim

Dot product:  Sim(D,Q) = Σ_i (a_i * b_i)

Cosine:       Sim(D,Q) = Σ_i (a_i * b_i) / sqrt( (Σ_i a_i^2) * (Σ_i b_i^2) )

Dice:         Sim(D,Q) = 2 * Σ_i (a_i * b_i) / ( Σ_i a_i^2 + Σ_i b_i^2 )

Jaccard:      Sim(D,Q) = Σ_i (a_i * b_i) / ( Σ_i a_i^2 + Σ_i b_i^2 - Σ_i (a_i * b_i) )

(Diagram: document vector D and query vector Q in the t1-t2 plane.)

Cos-sim computation example

D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3

CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) ≈ 0.81
CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) ≈ 0.13

(Diagram: D1, D2 and Q in the term space t1, t2, t3, with the angles θ1 and θ2 between the query and each document.)
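A minimal sketch of the cosine computation, reproducing the numbers above:

```python
import math

def cos_sim(d, q):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(d, q))
    norms = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norms

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))  # 0.81 0.13
```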

Matrix representation

Rows are the terms of the term vector space; columns are the documents of the document space and the query:

        D1    D2    D3    …    Dm    Q
  t1    a11   a21   a31   …    am1   b1
  t2    a12   a22   a32   …    am2   b2
  t3    a13   a23   a33   …    am3   b3
  …     …     …     …     …    …     …
  tn    a1n   a2n   a3n   …    amn   bn

Implementation (space)

- The matrix is very sparse: a few hundred terms per document and a few terms per query, while the term space is large (~100k terms)
- Stored as: D1 -> {(t1, a1), (t2, a2), …} and t1 -> {(D1, a1), …}

Implementation (time)

- The implementation of the VSM with the dot product:
  - Naïve implementation: O(m*n)
  - Implementation using an inverted file. Given a query Q = {(t1, b1), (t2, b2)}:
    1. find the sets of related documents through the inverted file for t1 and t2
    2. calculate the score of the documents for each weighted query term: (t1, b1) -> {(D1, a1*b1), …}
    3. combine the sets and sum the weights
  - Complexity: O(|Q| * n)
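A minimal sketch of steps 1-3 with the dot product, given an inverted file that maps terms to (document, weight) postings; the data is illustrative:

```python
from collections import defaultdict

def retrieve(query, inverted_file):
    """Score documents by dot product using only the postings of the query terms."""
    scores = defaultdict(float)
    for term, b in query.items():                 # step 1: look up each query term
        for doc, a in inverted_file.get(term, []):
            scores[doc] += a * b                  # steps 2-3: accumulate a_i * b_i
    return sorted(scores.items(), key=lambda x: -x[1])

inverted_file = {"comput": [("D1", 0.2), ("D2", 0.1)],
                 "network": [("D2", 0.5)]}
print(retrieve({"comput": 1.0, "network": 1.0}, inverted_file))
# [('D2', 0.6), ('D1', 0.2)]
```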

Cosine similarity is the most common

Cosine: Sim(D,Q) = Σ_i (a_i * b_i) / ( sqrt(Σ_j a_j^2) * sqrt(Σ_j b_j^2) )

sqrt(Σ_j a_j^2) and sqrt(Σ_j b_j^2) can be used to normalize the weights after indexing, so that query evaluation reduces to a dot product of the normalized vectors.

Cosine computation


Content-based ranking

  Query evaluation result is a list of documents, sorted by their similarity to the query.

E.g.

doc1 0.67

doc2 0.65

doc3 0.54

…

System evaluation

- Efficiency: time, space
- Effectiveness:
  - How capable is the system of retrieving relevant documents?
  - Is one system better than another?
- Metrics often used (together):
  - Precision = retrieved relevant docs / retrieved docs
  - Recall = retrieved relevant docs / relevant docs
(Diagram: Venn diagram of the retrieved set and the relevant set; the intersection is the retrieved relevant documents.)
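A minimal sketch of computing the two metrics from a retrieved list and the set of relevant documents:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against the set of relevant docs."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc1", "doc3", "doc4", "doc7", "doc9"}
print(precision_recall(retrieved, relevant))  # (0.6, 0.6)
```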

General form of precision/recall

(Diagram: the general precision/recall curve, with precision on the y-axis and recall on the x-axis, both from 0 to 1.0.)
- Precision changes w.r.t. recall (it is not a fixed point)
- Systems cannot be compared at a single precision/recall point
- Average precision (over 11 points of recall: 0.0, 0.1, …, 1.0)

An illustration of P/R calculation

Ranked list (Rel? marks the relevant documents):
  Doc1  Y
  Doc2
  Doc3  Y
  Doc4  Y
  Doc5
  …
Assume there are 5 relevant docs in total.

(Plot: the precision/recall point after each retrieved document: (0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6).)

Some other measures

- Noise = retrieved irrelevant docs / retrieved docs
- Silence = non-retrieved relevant docs / relevant docs
- Noise = 1 - Precision; Silence = 1 - Recall
- Fallout = retrieved irrelevant docs / irrelevant docs
- Single-value measures:
  - F-measure = 2 * P * R / (P + R)
  - Average precision = average over 11 points of recall (R = 0, 0.1, 0.2, …, 1)
  - Precision at n documents (often used for Web IR)
  - Expected search length (number of irrelevant documents to read before obtaining n relevant docs)
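For instance, at the rank-5 cutoff of the illustration above (P = 0.6, R = 0.6), the F-measure is 2 * 0.6 * 0.6 / (0.6 + 0.6) = 0.6.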

Benchmark: Test corpus

- Compare different IR systems on the same test corpus
- A test corpus contains:
  - A set of documents
  - A set of queries
  - Relevance judgments for every document-query pair (the desired answers for each query)
- The results of a system are compared with the desired answers.

The TREC competitions

- Held once per year
- A set of documents and queries is distributed to the participants; the standard answers are unknown (April)
- Participants work (very hard) to construct and fine-tune their systems, and submit their answers (1000 per query) by the deadline (July)
- NIST people manually evaluate the answers and provide the correct answers (and a classification of IR systems) (July-August)
- TREC conference (November)

TREC evaluation methodology

- Known document collection (>100K documents) and query set (50 queries)
- Each participant submits 1000 documents for each query
- The first 100 documents from each participant are merged into a global pool
- Human relevance judgment of the global pool; all other documents are assumed to be irrelevant
- Evaluation of each system (on its 1000 answers)
- The relevance judgments are only partial, but stable enough for system ranking

Tracks (tasks)

- Ad Hoc track: given a document collection, different topics
- Routing (filtering): stable interests (user profile), incoming document flow
- CLIR: Ad Hoc, but with queries in a different language
- Web: a large set of Web pages
- Question-Answering: e.g. "When did Nixon visit China?"
- Interactive: put users into action with the system
- Spoken document retrieval
- Image and video retrieval
- Information tracking: new topic / follow-up

Impact of TREC

- Provides large collections for further experiments
- Allows comparing different systems/techniques on realistic data
- Develops new methodology for system evaluation
- Similar experiments are organized in other areas (NLP, machine translation, summarization, …)

Some techniques to improve IR effectiveness

- Interaction with the user (relevance feedback)
  - Keywords only cover part of the contents
  - The user can help by indicating relevant/irrelevant documents
- The use of relevance feedback to improve the query expression:
  Q_new = α*Q_old + β*Rel_d - γ*NRel_d
  where Rel_d = centroid of the relevant documents and NRel_d = centroid of the non-relevant documents
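A minimal sketch of this query update on term-weight dictionaries; the coefficient symbols lost in the slide are rendered as the usual α, β, γ of Rocchio-style feedback, and the values used here are illustrative:

```python
def relevance_feedback(q_old, rel_docs, nrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Q_new = alpha*Q_old + beta*centroid(rel_docs) - gamma*centroid(nrel_docs)."""
    def centroid(docs):
        terms = {t for d in docs for t in d}
        return {t: sum(d.get(t, 0.0) for d in docs) / len(docs) for t in terms}

    rel_c, nrel_c = centroid(rel_docs), centroid(nrel_docs)
    terms = set(q_old) | set(rel_c) | set(nrel_c)
    return {t: alpha * q_old.get(t, 0.0) + beta * rel_c.get(t, 0.0)
               - gamma * nrel_c.get(t, 0.0)
            for t in terms}

q_new = relevance_feedback({"comput": 1.0},
                           rel_docs=[{"comput": 0.2, "architect": 0.3}],
                           nrel_docs=[{"network": 0.5}])
print(q_new)  # 'architect' enters the query; 'network' gets a negative weight
```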

Effect of RF

(Diagram: relevant (R) and non-relevant (NR) documents around the original query Q; after feedback the new query Q_new moves toward the relevant documents, so the 2nd retrieval captures more of them than the 1st.)

Modified relevance feedback

- Users usually do not cooperate (e.g. AltaVista in its early years)
- Pseudo-relevance feedback (blind RF):
  - Use the top-ranked documents as if they were relevant: select m terms from the n top-ranked documents
  - One can usually obtain about a 10% improvement

Automatic Query expansion

- A query contains only part of the important words
- Add new (related) terms to the query:
  - Manually constructed knowledge base/thesaurus (e.g. WordNet):
    Q = information retrieval -> Q' = (information + data + knowledge + …) (retrieval + search + seeking + …)
  - Corpus analysis:
    - Two terms that often co-occur are related (mutual information)
    - Two terms that co-occur with the same words are related (e.g. "T-shirt" and "coat" both co-occur with "wear", …)


Global vs. local context analysis

- Global analysis: use the whole document collection to calculate term relationships
- Local analysis: use the query to retrieve a subset of documents, then calculate term relationships within it
  - Combines pseudo-relevance feedback and term co-occurrences
  - More effective than global analysis

Beyond keywords

- On-the-fly exercise: compose a query on Google for which none of the matches on page 1 is relevant (5-10 minutes!)

Maybe never; however…

Not now, but in the near future: did you hear about embedded systems and the Internet of Things?

Here is where keywords don’t work


Problems

- Negation ("don't")
- Mapping the user's terminology to technical terminology ("stomach upset" -> "causes adverse effect")
- Generic vs. specific terms ("is-a" relations): Tylenol is a pain killer

Some current research topics:

Go beyond keywords
- Keywords are not perfect representatives of concepts
  - Ambiguity: "table" = data structure or furniture?
  - Lack of precision: "operating", "system" are less precise than "operating_system"
- Suggested solutions:
  - Sense disambiguation and hypernym expansion (difficult due to the lack of contextual information)
  - Using compound terms (no complete dictionary of compound terms, variation in form)
  - Using noun phrases (syntactic patterns + statistics)
  - Mapping terminologies (technical vs. non-technical terms)

What if the search space is the Web?

- No stable document collection (spider, crawler)
- Invalid documents, duplication, etc.
- Huge number of documents (partial collection)
- Multimedia documents
- Great variation in document quality
- Multilingual problems
- …

Web servers


Market share


The web graph


Web search

(Screenshot: a Google results page for the query "miele": sponsored links for appliance and vacuum dealers, followed by about 7,310,000 results returned in 0.12 seconds, the first hits being miele.com, miele.co.uk, miele.de and miele.at.)

(Architecture diagram of a web search engine: spiders/crawlers, driven by a spidering-control module, fill a page repository; an indexing module builds text, structure and link indexes; users submit queries to the IR module, which returns ranked results.)

Is “needle in the haystack” the problem?

- What is the most difficult query: "Venice" or "Encyclopedia of mole dwarfs"?

THE PROBLEM IS TOO MANY, NOT TOO FEW!!


Link Analysis

In web IR, the position of a page within the web graph determines its relevance

PageRank----Idea

Every page has some number of forward links (outedges) and backlinks (inedges).
1. Generally, highly linked pages are more "important" than pages with few links.

PageRank----Idea

2. Backlinks coming from important pages convey more importance to a page. For example, if a web page has a link off the Yahoo home page, it may be just one link but it is a very important one.

A page has high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page has many backlinks and when a page has a few highly ranked backlinks.

PageRank----Definition

u: a web page
F_u: the set of pages u points to
B_u: the set of pages that point to u
N_u = |F_u|: the number of links from u
c: a factor used for normalization

R(u) = c * Σ_{v ∈ B_u} R(v) / N_v

The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges.
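A minimal sketch of this iteration (without the rank source E(u) introduced below); the small example graph is illustrative, not the one from the slides:

```python
def pagerank(links, iterations=50):
    """Iterate R(u) = c * sum of R(v)/N_v over the backlinks v of u.

    links maps each page to the pages it points to; ranks start uniform and
    c renormalizes them to sum to 1 after every iteration."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = dict.fromkeys(pages, 0.0)
        for v, outlinks in links.items():
            for u in outlinks:                     # v is a backlink of u
                new[u] += rank[v] / len(outlinks)  # contributes R(v) / N_v
        c = 1.0 / sum(new.values())                # normalization factor
        rank = {p: c * w for p, w in new.items()}
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}
print(pagerank(links))
```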

(Worked example on a small four-page graph: start from random initial weights and repeatedly apply the formula; the weight vector changes at each iteration.)

After several iterations the ranks settle at 0.22, 0.11, 0.22, 0.44. Why does it stop here?

PageRank----definition

A problem with the above definition: rank sink. If two web pages point to each other but to no other page, then during the iteration this loop will accumulate rank but never distribute it.

PageRank----definition

Definition modified:

R(u) = c * Σ_{v ∈ B_u} R(v) / N_v + c * E(u)

E(u) is some vector over the web pages (for example uniform, favorite pages, etc.) that corresponds to a source of rank. E(u) is a user-designed parameter.

PageRank----Random Surfer Model

The definition corresponds to the probability distribution of a random walk on the web graph. E(u) can be thought of as modelling a random surfer who periodically gets bored and jumps to a different page, so that the walk is not kept in a loop forever.

PageRank----Conclusion

- PageRank is a global ranking based on the web's graph structure
- PageRank uses backlink information to bring order to the web
- PageRank can be thought of as a random surfer model

Final remarks on IR

- IR is related to many areas (NLP, AI, databases, machine learning, user modeling, …) and applications (libraries, the Web, multimedia search, …)
- Relatively weak theories
- Very strong tradition of experiments
- Many remaining (and exciting) problems
- A difficult area: intuitive methods do not necessarily improve effectiveness in practice

Why is IR difficult

- Vocabulary mismatch
  - Synonymy: e.g. car vs. automobile
  - Polysemy: e.g. table
- Queries are ambiguous; they are partial specifications of the user's need
- Content representation may be inadequate and incomplete
- The user is the ultimate judge, but we don't know how the judge judges…
  - The notion of relevance is imprecise, and context- and user-dependent
- But how rewarding it is to gain a 10% improvement!

Beyond IR

- Information Extraction
- Question Answering
- Dialogue Management
(Diagram: these tasks move from the bag-of-words view toward more NLP!)

What is Information Extraction

- (In this talk) Information Extraction (IE) = identifying the instances of specified types of names/entities, relations and events from semi-structured or unstructured text, and creating a database
- For relations and events, this includes finding the participants and modifiers (date, time, location, etc.)
- In other words, we build a database with the information on a given relation or event:
  - people's jobs
  - people's whereabouts
  - merger and acquisition activity
  - disease outbreaks
  - experiment chains in scientific papers
  - …

Event Extraction: ‘Traditional’ IE

Example: "Barry Diller quit …"

Trigger: "quit" -> a "Personnel/End-Position" event
Arguments:
  Role = Person: Barry Diller
  Role = Organization: Vivendi Universal Entertainment
  Role = Position: Chief
  Role = Time-within: Wednesday (2003-03-04)

Information Extraction (IE) Pipeline

- Name/Nominal Extraction: "Barry Diller", "chief"
- Coreference Resolution: "Barry Diller" = "chief"
- Time Identification and Normalization: Wednesday (2003-03-04)
- Relation Extraction: "Vivendi Universal Entertainment" is located in "France"
- Event Extraction: "Barry Diller" is the Person of the End-Position event triggered by "quit"

Most recent applications: Knowledge Base Population

- Knowledge Base (KB): attributes (a.k.a. "slots") derived from Wikipedia infoboxes are used to create the reference KB
- Source collection: a large corpus of newswire and web documents (1 million docs) is provided for systems to discover information to expand and populate the KB

Entity Linking: Create Wiki Entries

Query = "James Parsons"
(The query entity is linked to its KB/Wikipedia entry, or to NIL if no corresponding entry exists.)