Introduction to Information Retrieval
Outline
What is the IR problem?
How to organize an IR system? (the main processes in IR)
- Indexing
- Retrieval
- System evaluation
Some current research topics
The problem of IR
Goal = find documents relevant to an information need from a large document collection. (Diagram: information need -> query -> IR system -> answer list; document collection -> retrieval.)
Example
(Screenshot: Google searching the Web.)
IR problem
First applications: in libraries (1950s). Example record:
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content:
Phases of an IR system
(Architecture diagram: the user's need is expressed as text through the user interface; text processing produces a logical view; query operations and indexing, handled by a DB manager module, build the index used for searching; retrieved documents are weighted and ranked before being returned, and user feedback loops back into query operations. Documents come from a document repository such as the web.)
Main problems in IR
Document and query indexing: how to best represent their contents?
Query evaluation (or retrieval process): to what extent does a document correspond to a query?
System evaluation: how good is a system?
- Are the retrieved documents relevant? (precision)
- Are all the relevant documents retrieved? (recall)
Document indexing
Goal = find the important meanings and create an internal representation.
Factors to consider:
- Accuracy in representing meanings (semantics)
- Exhaustiveness (cover all the contents)
- Ease of manipulation by computer
What is the best representation of contents?
- Character string: not precise enough
- Word: good coverage, not precise
- Phrase: lower coverage, more precise
- Concept: poor coverage, precise
(Trade-off: from string to word to phrase to concept, coverage/recall decreases while accuracy/precision increases.)
Keywords
Keywords are the most common way to represent meaning. However, modern IR systems (search engines like Google) use more than simple keywords.
Keyword selection and weighting
How to select important keywords?
(Figure: term frequency and informativity plotted against rank. Informativity is highest for mid-frequency terms; the most frequent terms, at ranks 1, 2, 3, ..., and the rarest terms are less useful as index terms.)
tf*idf weighting schema
tf = term frequency: frequency of a term/keyword in a document. The higher the tf, the higher the importance (weight) of the term for the doc.
df = document frequency: number of documents containing the term; it reflects the distribution of the term.
idf = inverse document frequency: the unevenness of the term's distribution in the corpus, i.e. the specificity of the term to a document. The more evenly a term is distributed, the less specific it is to any one document.
weight(t,D) = tf(t,D) * idf(t)
Some common tf*idf schemes
tf(t,D) = freq(t,D)
tf(t,D) = log[freq(t,D)]
tf(t,D) = log[freq(t,D)] + 1
tf(t,D) = freq(t,D) / Max_t'[freq(t',D)]
idf(t) = log(N/n), where n = #docs containing t and N = #docs in the corpus
weight(t,D) = tf(t,D) * idf(t)
Normalization: cosine normalization, division by the maximum, ...
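The weighting scheme above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the course: the toy documents and their tokenization are made up, tf is raw frequency, and idf(t) = log(N/n) as on the slide.

```python
import math

def tf_idf_weights(docs):
    """Compute weight(t, D) = tf(t, D) * idf(t) for tokenized documents.

    tf is the raw frequency of t in D; idf(t) = log(N / n), where N is
    the number of documents and n the number of documents containing t.
    """
    N = len(docs)
    # df: number of documents containing each term
    df = {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        weights.append({t: f * math.log(N / df[t]) for t, f in tf.items()})
    return weights

# Made-up toy collection, already stemmed
docs = [["comput", "architect", "comput"],
        ["comput", "network"],
        ["network", "system"]]
w = tf_idf_weights(docs)
# "comput" occurs in 2 of 3 docs: idf = log(3/2); its tf in doc 0 is 2
```

Note that a term occurring in every document gets idf = log(1) = 0, i.e. no discriminating power at all.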
Text processing in IR: Stopwords / Stoplist
Function words do not bear useful information for IR: of, in, about, with, I, although, ...
Stoplist: contains stopwords, which are not to be used as index terms:
- Prepositions
- Articles
- Pronouns
- Some adverbs and adjectives
- Some frequent words (e.g. document)
The removal of stopwords usually improves IR effectiveness. A few "standard" stoplists are commonly used.
Text processing in IR: Stemming
Reason: different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them.
Stemming: removing some endings of words, e.g.
computer, compute, computes, computing, computed, computation -> comput
Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137)
Step 1: plurals and past participles
  SSES -> SS   (caresses -> caress)
  ING ->       (motoring -> motor)
Step 2: adj -> n, n -> v, n -> adj, ...
  (m>0) OUSNESS -> OUS   (callousness -> callous)
  (m>0) ATIONAL -> ATE   (relational -> relate)
Step 3:
  (m>0) ICATE -> IC      (triplicate -> triplic)
Step 4:
  (m>1) AL ->            (revival -> reviv)
  (m>1) ANCE ->          (allowance -> allow)
Step 5:
  (m>1) E ->             (probate -> probat)
  (m>1 and *d and *L) -> single letter   (controll -> control)
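To make the suffix-stripping idea concrete, here is a deliberately tiny Python sketch. It is not Porter's algorithm (no measure m, no *v*/*d*/*L* conditions; the rule list is invented for illustration), just ordered suffix rewriting:

```python
# A tiny illustration of suffix stripping in the spirit of Porter's
# algorithm. NOT the real algorithm: there is no measure m and no letter
# conditions; the rule list below is invented for illustration.
RULES = [
    ("sses", "ss"),      # caresses -> caress
    ("ies", "i"),
    ("ational", "ate"),  # relational -> relate
    ("ing", ""),         # motoring -> motor
    ("ed", ""),
    ("s", ""),
]

def naive_stem(word):
    # Apply the first matching rule only; rule order matters.
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

print(naive_stem("caresses"))    # caress
print(naive_stem("motoring"))    # motor
print(naive_stem("relational"))  # relate
```

The real algorithm applies its five steps in sequence and guards each rule with conditions on the stem, which is what prevents over-stripping short words.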
Problems with Porter’s algorithm
Grouping errors:
- organization, organ -> organ
- police, policy -> polic
- arm, army -> arm
- (in Italian) matto, mattone -> matt
Omissions:
- cylinder, cylindrical
- create, creation
- Europe, European
Sample text:
Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
Porter’s:
such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret
Lovins’s:
such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
Paice’s :
such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret
Text processing in IR: Lemmatization
Transform words to a standard form according to their syntactic category, e.g. verb+ing -> verb, noun+s -> noun.
Needs POS tagging. More accurate than stemming, but needs more resources.
It is crucial to choose the stemming/lemmatization rules well: noise vs. recognition rate, a compromise between precision and recall.
- Light/no stemming: -recall, +precision
- Severe stemming: +recall, -precision
Example
Sir Timothy John Berners-Lee OM KBE FRS FREng FRSA (born 8 June 1955) is an computer scientist and MIT professor credited with inventing the World Wide Web. On 25 December 1990 he implemented the first successful communication between an HTTP client and server via the Internet with the help of Robert Cailliau and a young student staff at CERN. He was ranked Joint First alongside Albert Hofmann in The Telegraph's list of 100 greatest living geniuses.[2] Berners-Lee is the director of the World Wide Web Consortium (W3C), which oversees the Web's continued development, the founder of the World Wide Web Foundation and he is a senior researcher and holder of the 3Com Founders Chair at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
Text processing in IR: POS tagging
Sir/NNP Timothy/NNP John/NNP Berners-Lee/NNP OM/NNP KBE/NNP FRS/NNP FREng/NNP FRSA/NNP (/( born/VBN 8/NNP June/NNP 1955/CD )/) is/VBZ an/DT computer/NN scientist/NN and/CC MIT/NNP professor/NN credited/VBN with/IN inventing/VBG the/DT World/NNP Wide/JJ Web/NN ./.
On/IN 25/CD December/NNP 1990/CD he/PRP implemented/VBN the/DT first/JJ successful/JJ communication/NN between/IN an/DT HTTP/NNP client/NN and/CC server/NN via/IN the/DT Internet/NNP with/IN the/DT help/NN of/IN Robert/NNP Cailliau/NNP and/CC a/DT young/JJ student/NN staff/NN at/IN CERN/NNP ./.
He/PRP was/VBD ranked/VBN Joint/NNP First/JJ alongside/IN Albert/NNP Hofmann/NNP in/IN The/DT Telegraph's/NNP list/NN of/IN 100/CD greatest/JJS living/VBG geniuses/NNS ./.
[/( 2/CD ]/) Berners-Lee/NNP is/VBZ the/DT director/NN of/IN the/DT World/NNP Wide/JJ Web/NNP Consortium/NNP (/( W3C/NNP )/) ,/, which/WDT oversees/VBZ the/DT Web's/NNP continued/VBD development/NN ,/, the/DT founder/NN of/IN the/DT World/NNP Wide/JJ Web/NN Foundation/NN and/CC he/PRP is/VBZ a/DT senior/JJ researcher/NN and/CC holder/NN of/IN the/DT 3Com/NNP Founders/NNPS Chair/NN at/IN the/DT MIT/NNP Computer/NNP Science/NNP and/CC Artificial/NNP Intelligence/NNP Laboratory/NNP (/( CSAIL/NNP )/) ./.
Text processing in IR: Terms (named entities) extraction
Sir Timothy John Berners-Lee, computer scientist, MIT professor, World Wide Web, HTTP client, Robert Cailliau, Joint First, Albert Hofmann, greatest living geniuses, World Wide Web Consortium (W3C), World Wide Web Foundation, senior researcher, 3Com Founders Chair, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Document representation
Each document is represented by a set of weighted keywords (terms): D1 = {(t1, w1), (t2, w2), ...}, e.g.
D1 = {(comput, 0.2), (architect, 0.3), ...}
D2 = {(comput, 0.1), (network, 0.5), ...}
Bag of words model: word ordering does not matter; keywords are kept in lexicographic order.
Bag of words model
Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes.
For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the
message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns.
In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image Hubel, Wiesel
Indexing
How can documents be efficiently searched?
Inverted file
text, with character positions:
That(1) house(6) has(12) a(16) garden.(18) The(25) garden(29) has(36) many(40) flowers.(45) The(54) flowers(58) are(66) beautiful(70)
Inverted file:
vocabulary   occurrences
beautiful    70
flowers      45, 58
garden       18, 29
house        6
It identifies the position of a term within the text.
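The positional inverted file can be built with a short Python sketch. Assumptions of this sketch: terms are separated by single spaces, punctuation is stripped, and offsets are 0-based, so they differ by one from the slide's counting:

```python
def build_inverted_file(text):
    """Build a positional inverted file: term -> list of char offsets.

    Offsets are 0-based starting positions of each word. Everything is
    lowercased and trailing punctuation is stripped; no stopword removal
    or stemming is done here, for simplicity.
    """
    index = {}
    pos = 0
    for raw in text.split(" "):
        word = raw.strip(".,").lower()
        index.setdefault(word, []).append(pos)
        pos += len(raw) + 1  # +1 for the separating space
    return index

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
inv = build_inverted_file(text)
print(inv["garden"])   # [17, 29]
print(inv["flowers"])  # [45, 58]
```

A real indexer would also remove stopwords and stem terms before inserting them, and would usually store document identifiers alongside the positions.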
Block indexing
text, divided into blocks (Block 1 ... Block 4):
That house has a garden. The garden has many flowers. The flowers are beautiful
Inverted file:
vocabulary   occurrences
beautiful    4
flowers      3
garden       2
house        1
Here the occurrences record the block in which a term appears, rather than its exact character position.
Inverted files with frequencies
keyword    df_i   occurrences (D_k, tf_ik)
computer   3      (D7, 4), ...
database   2      (D1, 3), ...
science    4      (D2, 4), ...
system     1      (D5, 2), ...
df_i = number of documents in which t_i is found
tf_ik = frequency of t_i in doc D_k
The index file (vocabulary) points into the lists of occurrences.
Ranking
The documents to be retrieved are those that include the words of the query.
How to retrieve these documents? How to present them to the user?
In modern search engines, ranking is based on:
- Content
- Link analysis
Content-based retrieval
Matching score model:
Document D = a set of weighted keywords
Query Q = a set of non-weighted keywords
R(D, Q) = a measure of the relevance of a document, given the query
Boolean model
Document = logical conjunction of keywords: D = t1 AND t2 AND ... AND tn. D can be expressed as a Boolean vector (ti = 1 if ti is in D, else 0), e.g. (1, 1, x, ..., x).
Query = Boolean expression of keywords, a disjunction of conjunctive components, e.g. Q = (t1 AND t2) OR (t3 AND NOT t4); each conjunctive component can also be expressed as a Boolean vector, e.g. (x, x, 1, 0, x, ...).
R(D, Q) = 1 if at least one conjunctive component of Q matches D.
Problems:
- R is either 1 or 0 (an unordered set of documents); queries return many documents or few documents
- End-users cannot manipulate Boolean operators correctly (e.g. documents about kangaroos and koalas)
Vector space model
Vector space = the space of all the keywords encountered; each document (and query) is a vector of term weights in this space.
(Figure: documents d1 and d2 as vectors in the space of terms t1, t2, t3.)
Example (if weights are frequency counts)
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3
Is D1 or D2 the more relevant document?
Some formulas for Sim(D, Q)
With D = (a1, ..., an) and Q = (b1, ..., bn):
Dot product:  Sim(D,Q) = Σi (ai * bi)
Cosine:       Sim(D,Q) = Σi (ai * bi) / ( sqrt(Σi ai²) * sqrt(Σi bi²) )
Dice:         Sim(D,Q) = 2 * Σi (ai * bi) / ( Σi ai² + Σi bi² )
Jaccard:      Sim(D,Q) = Σi (ai * bi) / ( Σi ai² + Σi bi² - Σi (ai * bi) )
(Figure: D and Q as vectors in the t1-t2 plane.)
Cos-sim computation example
(Figure: D1, D2, and Q as vectors in the space of t1, t2, t3.)
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q  = 0T1 + 0T2 + 2T3
CosSim(D1, Q) = 10 / sqrt((4+9+25) * (0+0+4)) = 0.81
CosSim(D2, Q) = 2 / sqrt((9+49+1) * (0+0+4)) = 0.13
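The computation above is easy to check in Python; this sketch applies the cosine formula from the previous slide to the same vectors:

```python
import math

def cos_sim(d, q):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

D1 = [2, 3, 5]   # 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]   # 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]   # 0T1 + 0T2 + 2T3
print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```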
Matrix representation
Document space: documents D1 ... Dm and the query Q as columns, terms t1 ... tn as rows.
      D1    D2    D3   ...  Dm   | Q
t1    a11   a21   a31  ...  am1  | b1
t2    a12   a22   a32  ...  am2  | b2
t3    a13   a23   a33  ...  am3  | b3
...   ...   ...   ...  ...  ...  | ...
tn    a1n   a2n   a3n  ...  amn  | bn
Implementation (space)
The matrix is very sparse: a few hundred terms per document, and a few terms per query, while the term space is large (~100k terms).
Stored as: D1 -> {(t1, a1), (t2, a2), ...} and, inverted, t1 -> {(D1, a1), ...}
Implementation (time)
The implementation of VSM with dot product:
- Naïve implementation: O(m*n)
- Implementation using an inverted file. Given a query = {(t1,b1), (t2,b2)}:
  1. find the sets of related documents through the inverted file for t1 and t2
  2. calculate the score of the documents for each weighted query term, e.g. (t1,b1) -> {(D1, a1*b1), ...}
  3. combine the sets and sum the weights
Complexity: O(|Q|*n)
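The three steps above can be sketched as follows. The tiny index reuses the weighted-postings style of the earlier slides, with made-up weights:

```python
def retrieve(query, inverted_file):
    """Score documents by dot product, touching only the posting lists
    of the query terms.

    inverted_file: term -> list of (doc_id, weight) pairs.
    query: list of (term, weight) pairs.
    """
    scores = {}
    for t, b in query:                              # step 1: look up t
        for doc_id, a in inverted_file.get(t, []):  # step 2: score docs
            scores[doc_id] = scores.get(doc_id, 0.0) + a * b  # step 3
    # return documents sorted by decreasing score
    return sorted(scores.items(), key=lambda x: -x[1])

inv = {"comput":  [("D1", 0.2), ("D2", 0.1)],
       "network": [("D2", 0.5)]}
print(retrieve([("comput", 1.0), ("network", 1.0)], inv))
# D2 scores 0.6, D1 scores 0.2
```

Only documents containing at least one query term are ever touched, which is where the savings over the naïve O(m*n) scan come from.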
Cosine similarity is the most common
Cosine: Sim(D,Q) = Σi (ai * bi) / ( sqrt(Σj aj²) * sqrt(Σj bj²) )
Use sqrt(Σj aj²) and sqrt(Σj bj²), computed once after indexing, to normalize the weights.
Content-based ranking
Query evaluation result is a list of documents, sorted by their similarity to the query.
E.g.
doc1 0.67
doc2 0.65
doc3 0.54
…
System evaluation
Efficiency: time, space.
Effectiveness: how capable is the system of retrieving relevant documents? Is one system better than another?
Metrics often used (together):
Precision = retrieved relevant docs / retrieved docs
Recall = retrieved relevant docs / relevant docs
(Venn diagram: the retrieved set and the relevant set, with their intersection of retrieved relevant docs.)
General form of precision/recall
(Figure: precision plotted against recall, both from 0.0 to 1.0.)
- Precision changes w.r.t. recall (it is not a fixed point)
- Systems cannot be compared at a single precision/recall point
- Average precision (over 11 points of recall: 0.0, 0.1, ..., 1.0)
An illustration of P/R calculation
Ranked list (Y = relevant): Doc1 Y, Doc2, Doc3 Y, Doc4 Y, Doc5, ...
Assume 5 relevant docs in total.
(Recall, precision) after each rank: (0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6).
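The (recall, precision) points of this illustration can be reproduced with a short Python sketch:

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """(recall, precision) after each rank of a result list.

    ranked_relevance: booleans, True if the doc at that rank is relevant.
    total_relevant: total number of relevant docs in the collection.
    """
    points, hits = [], 0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
        points.append((hits / total_relevant, hits / i))
    return points

# Doc1 Y, Doc2, Doc3 Y, Doc4 Y, Doc5 -- with 5 relevant docs in total
for r, p in precision_recall_points([True, False, True, True, False], 5):
    print(f"recall={r:.2f}  precision={p:.2f}")
# recall=0.20 precision=1.00
# recall=0.20 precision=0.50
# recall=0.40 precision=0.67
# recall=0.60 precision=0.75
# recall=0.60 precision=0.60
```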
Some other measures
Noise = retrieved irrelevant docs / retrieved docs
Silence = non-retrieved relevant docs / relevant docs
Noise = 1 - Precision; Silence = 1 - Recall
Fallout = retrieved irrelevant docs / irrelevant docs
Single-value measures:
- F-measure = 2 * P * R / (P + R)
- Average precision = average over 11 points of recall (R = 0, 0.1, 0.2, ..., 1.0)
- Precision at n documents (often used for Web IR)
- Expected search length (number of irrelevant documents to read before obtaining n relevant docs)
Benchmark: Test corpus
Compare different IR systems on the same test corpus. A test corpus contains:
- A set of documents
- A set of queries
- Relevance judgments for every document-query pair (the desired answers for each query)
The results of a system are compared with the desired answers.
The TREC competitions
Once per year:
- A set of documents and queries is distributed to the participants; the standard answers are unknown (April)
- Participants work (very hard) to build and fine-tune their systems, and submit their answers (1000 per query) by the deadline (July)
- NIST people manually evaluate the answers and provide the correct answers (and a ranking of the IR systems) (July-August)
- TREC conference (November)
TREC evaluation methodology
- Known document collection (>100K docs) and query set (50 queries)
- Each participant submits 1000 documents per query
- The first 100 documents from each participant are merged into a global pool
- Human relevance judgment of the global pool; the other documents are assumed to be irrelevant
- Evaluation of each system (with its 1000 answers)
Relevance judgments are partial, but stable for system ranking.
Tracks (tasks)
- Ad hoc track: given document collection, different topics
- Routing (filtering): stable interests (user profile), incoming document flow
- CLIR: ad hoc, but with queries in a different language
- Web: a large set of Web pages
- Question answering: When did Nixon visit China?
- Interactive: put users into action with the system
- Spoken document retrieval
- Image and video retrieval
- Information tracking: new topic / follow-up
Impact of TREC
- Provides large collections for further experiments
- Compares different systems/techniques on realistic data
- Develops new methodology for system evaluation
- Similar experiments are organized in other areas (NLP, machine translation, summarization, ...)
Some techniques to improve IR effectiveness
Interaction with the user (relevance feedback):
- Keywords only cover part of the contents
- The user can help by marking documents as relevant/irrelevant
Relevance feedback is used to improve the query expression:
Q_new = α*Q_old + β*Rel_d - γ*NRel_d
where Rel_d = centroid of the relevant documents, NRel_d = centroid of the non-relevant documents, and α, β, γ are tuning weights.
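The feedback formula can be sketched in Python. The α, β, γ values below are illustrative defaults (the slide does not prescribe any), and dropping negative weights is a common convention rather than part of the formula:

```python
def rocchio(q_old, rel_docs, nrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Feedback: Q_new = alpha*Q_old + beta*Rel_d - gamma*NRel_d.

    Vectors are dicts term -> weight; Rel_d / NRel_d are the centroids
    of the relevant / non-relevant documents.
    """
    def centroid(docs):
        c = {}
        for d in docs:
            for t, w in d.items():
                c[t] = c.get(t, 0.0) + w / len(docs)
        return c

    rel, nrel = centroid(rel_docs), centroid(nrel_docs)
    q_new = {}
    for t in set(q_old) | set(rel) | set(nrel):
        w = (alpha * q_old.get(t, 0.0) + beta * rel.get(t, 0.0)
             - gamma * nrel.get(t, 0.0))
        if w > 0:   # negative weights are usually dropped
            q_new[t] = w
    return q_new

q = rocchio({"retrieval": 1.0},
            rel_docs=[{"retrieval": 0.5, "index": 0.4}],
            nrel_docs=[{"football": 0.8}])
# "index" enters the query via the relevant centroid; "football" stays out
```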
Effect of RF
(Figure: in the 1st retrieval, query Q captures a mixed region of relevant (R) and non-relevant (NR) documents; after feedback, Q_new moves toward the relevant region, so the 2nd retrieval captures more relevant documents.)
Modified relevance feedback
Users usually do not cooperate (e.g. AltaVista in its early years).
Pseudo-relevance feedback (blind RF): use the top-ranked documents as if they were relevant, and select m terms from the n top-ranked documents.
One can usually obtain about 10% improvement.
Automatic Query expansion
A query contains only part of the important words, so add new (related) terms into the query.
Manually constructed knowledge base/thesaurus (e.g. WordNet):
Q = information retrieval
Q' = (information + data + knowledge + ...) (retrieval + search + seeking + ...)
Corpus analysis:
- Two terms that often co-occur are related (mutual information)
- Two terms that co-occur with the same words are related (e.g. T-shirt and coat both co-occur with wear, ...)
Global vs. local context analysis
Global analysis: use the whole document collection to calculate term relationships.
Local analysis: use the query to retrieve a subset of documents, then calculate term relationships within it; this combines pseudo-relevance feedback with term co-occurrences, and is more effective than global analysis.
Beyond keywords
On-the-fly exercise: compose a query using Google for which none of the matches on page 1 is relevant. 5-10 minutes!
Maybe never; however...
Not now, but in the near future. Did you hear about embedded systems & the Internet of Things?
Here is where keywords don’t work
Problems
- Negation: "don't"
- Mapping the user's terminology to technical terminology ("stomach upset" -> "causes adverse effect")
- Generic vs. specific terms ("is-a" relations): Tylenol is a pain killer
Some current research topics:
Go beyond keywords: keywords are not perfect representatives of concepts.
- Ambiguity: table = data structure or furniture?
- Lack of precision: "operating", "system" are less precise than "operating_system"
Suggested solutions:
- Sense disambiguation and hypernym expansion (difficult due to the lack of contextual information)
- Using compound terms (no complete dictionary of compound terms, variation in form)
- Using noun phrases (syntactic patterns + statistics)
- Mapping terminologies (technical vs. non-technical terms)
What if the search space is the Web?
- No stable document collection (spider, crawler)
- Invalid documents, duplication, etc.
- Huge number of documents (partial collection)
- Multimedia documents
- Great variation in document quality
- Multilingual problem
- ...
Web servers (figure)
Market share (figure)
The web graph (figure)
Web search
(Screenshot: a Google results page for the query "miele": sponsored links, then about 7,310,000 ranked results in 0.12 seconds, led by www.miele.com, www.miele.co.uk, www.miele.de, and www.miele.at.)
Architecture: spiders/crawlers, driven by a spidering-control module, fetch pages from the WWW into a page repository; an indexing module builds text, structure, and link indexes; the IR module evaluates users' queries against these indexes and returns ranked results.
Is “needle in the haystack” the problem?
What is the most difficult query: "Venice" or "Encyclopedia of mole dwarfs"?
THE PROBLEM IS TOO MANY RESULTS, NOT TOO FEW!!
Link Analysis
In web IR, the position of a page within the web graph helps determine its relevance.
PageRank----Idea
Every page has some number of forward links (out-edges) and backlinks (in-edges). Generally, highly linked pages are more "important" than pages with few links.
PageRank----Idea
2. Backlinks coming from important pages convey more importance to a page. For example, if a web page has a link off the Yahoo home page, it may be just one link but it is a very important one.
A page has high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page has many backlinks and when a page has a few highly ranked backlinks.
PageRank----Definition
u: a web page
F_u: the set of pages u points to
B_u: the set of pages that point to u
N_u = |F_u|: the number of links from u
c: a factor used for normalization

R(u) = c * Σ_{v in B_u} R(v) / N_v

The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges.
Iteration example on a small four-page graph, starting from arbitrary weights:
- initial ranks: (0, 0, 0, 1)
- after one iteration: (0.5, 0, 0.5, 0)
- after two iterations: (0, 0.25, 0.75, 0)
- after several iterations: (0.22, 0.11, 0.22, 0.44)
Why does it stop here? The ranks have converged.
PageRank----definition
A problem with the above definition: rank sink. If two web pages point to each other but to no other page, then during the iteration this loop will accumulate rank but never distribute any.
PageRank----definition
Definition modified:

R(u) = c * Σ_{v in B_u} R(v) / N_v + c*E(u)

E(u) is some vector over the web pages (for example uniform, or favoring certain pages) that corresponds to a source of rank. E(u) is a user-designed parameter.
PageRank----Random Surfer Model
The definition corresponds to the probability distribution of a random walk on the web graph. E(u) can be thought of as a random surfer who periodically gets bored and jumps to a different page, and so is never kept in a loop forever.
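The iterative computation can be sketched in Python with a uniform E(u). In this sketch the damping factor d plays the role of the factor c, (1-d)/N is the uniform random-jump term, and the three-page graph is made up for illustration:

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank with a uniform source of rank E(u).

    links: dict page -> list of pages it points to; every page must
    appear as a key. d is the damping factor; (1 - d) / N is the
    uniform random-jump term.
    """
    pages = list(links)
    N = len(pages)
    rank = {p: 1.0 / N for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / N for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)  # R(p)/N_p to each target
        rank = new
    return rank

# A tiny made-up graph: A and B point to each other and both point to C;
# C points back to A, so A ends up with the highest rank.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
pr = pagerank(links)
# ranks sum to ~1; pr["A"] is the largest
```

Because every page here has at least one out-link, the total rank is conserved across iterations; dangling pages would need extra handling.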
PageRank----Conclusion
PageRank is a global ranking based on the web's graph structure. It uses backlink information to bring order to the web, and it can be understood through the random surfer model.
Final remarks on IR
IR is related to many areas: NLP, AI, databases, machine learning, user modeling, ...; libraries, the Web, multimedia search, ...
Relatively weak theories, but a very strong tradition of experiments.
Many remaining (and exciting) problems. A difficult area: intuitive methods do not necessarily improve effectiveness in practice.
Why is IR difficult
Vocabulary mismatch:
- Synonymy: e.g. car vs. automobile
- Polysemy: e.g. table
Queries are ambiguous: they are partial specifications of the user's need.
Content representation may be inadequate and incomplete.
The user is the ultimate judge, but we don't know how the judge judges: the notion of relevance is imprecise, context- and user-dependent.
But how rewarding it is to gain a 10% improvement!
Beyond IR
- Information Extraction
- Question Answering
- Dialogue Management
From bag of words to more NLP!!
What is Information Extraction
(In this talk) Information Extraction (IE) = identifying the instances of specified types of names/entities, relations, and events in semi-structured or unstructured text, and creating a database from them. For relations and events, this includes finding the participants and modifiers (date, time, location, etc.). In other words, we build a database with the information on a given relation or event:
- people's jobs
- people's whereabouts
- merger and acquisition activity
- disease outbreaks
- experiment chains in scientific papers
- ...
Event Extraction: ‘Traditional’ IE
Sentence: "Barry Diller quit ..." Trigger: quit (a "Personnel/End-Position" event). Arguments:
- Role = Person: Barry Diller
- Role = Organization: Vivendi Universal Entertainment
- Role = Position: Chief
- Role = Time-within: Wednesday (2003-03-04)
Information Extraction (IE) Pipeline
Name/Nominal Extraction
“Barry Diller”, “chief”
Coreference Resolution
“Barry Diller” = “chief”
Time Identification and Normalization
Wednesday ( 2003-03-04 )
Relation Extraction Event Extraction
"Vivendi Universal Entertainment" is located in "France"; "Barry Diller" is the Person of the end-position event triggered by "quit"
Most Recent applications: Knowledge Base Population
Knowledge Base (KB): attributes (a.k.a. "slots") derived from Wikipedia infoboxes are used to create the reference KB.
Source collection: a large corpus of newswire and web documents (1 million docs) is provided for systems to discover information to expand and populate the KB.
Entity Linking: Create Wiki Entries
Query = "James Parsons" -> NIL (no matching entry in the KB, so a new entry must be created)