Concept-document

Documents
Thanks to Bill Arms, Marti Hearst
Last time
• Size of information
– Continues to grow
• IR is an old field, going back to the ’40s
• IR is an iterative process
• The search engine is the most popular information retrieval model
• New ones are still being built
Focus on documents
• A document is what we will:
– Crawl (harvest)
– Index
– Retrieve with query
– Evaluate
– Rank
• IR is an iterative process
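The index, retrieve, and rank steps listed above can be sketched in a few lines. This is only a minimal illustration, not how any particular search engine works: the toy collection, the function names build_index and retrieve_and_rank, and the scoring rule (count of matching query terms) are all assumptions made for the example. Crawling and evaluation are left out.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
docs = {
    "d1": "information retrieval is an old field",
    "d2": "search engines index web documents",
    "d3": "documents are indexed and then retrieved",
}

def build_index(collection):
    """Index: map each lowercase whitespace token to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def retrieve_and_rank(index, query):
    """Retrieve and rank: score each document by the number of distinct query terms it contains."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_index(docs)
results = retrieve_and_rank(index, "index documents")
print(results)  # [('d2', 2), ('d3', 1)]
```

Note that "indexed" in d3 does not match the query term "index", because this sketch does no stemming or other normalization beyond lowercasing.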
IR is an Iterative Process
[Diagram, built up over several slides. Components shown: Repositories, Workspace, Goals; User’s Information Need, text input, Parse, Query; Collections, Pre-process, Index; Rank or Match; Evaluation; Query Reformulation.]
Definitions
Collections consist of Documents
• Document
– The basic unit which we will automatically index
• usually a body of text which is a sequence of terms
– has to be digital
• Tokens or terms
– Basic units of a document, usually consisting of text
• semantic words or phrases, numbers, dates, etc.
• Collections or repositories
– particular collections of documents
– sometimes called a database
• Query
– request for documents on a topic
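One way to picture how these definitions relate is as simple data structures. The sketch below is an assumption made for illustration: the class names Document and Collection, the whitespace tokenizer, and the sample texts are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """The basic unit we index: here, a body of text with an identifier."""
    doc_id: str
    text: str

    def tokens(self):
        # Tokens (terms) are the basic units of the document.
        # Splitting on whitespace is a deliberate simplification.
        return self.text.split()

@dataclass
class Collection:
    """A particular collection (repository) of documents."""
    name: str
    documents: list = field(default_factory=list)

    def vocabulary(self):
        # All distinct lowercase terms across the collection.
        return {t.lower() for d in self.documents for t in d.tokens()}

collection = Collection(
    "lecture-notes",
    [Document("d1", "IR is an old field"), Document("d2", "IR is iterative")],
)

# A query is a request for documents on a topic, expressed here as plain text.
query = "old field"
query_terms = set(query.lower().split())
print(sorted(collection.vocabulary()))
print(query_terms <= collection.vocabulary())  # True: every query term occurs in the collection
```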
Collection vs documents vs terms
Collection
Document
Terms or tokens
What is a Document?
• A document is a digital object with an operational
definition
– Indexable (usually digital)
– Can be queried and retrieved.
• Many types of documents
– Text or part of text
– Web page
– Image
– Audio
– Video
– Data
– Email
– Etc.
Text Documents
A digital text document consists of a sequence of words and other
symbols, e.g., punctuation.
The individual words and other symbols are known as tokens or
terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
Example?
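One possible example, as a rough sketch: the same tokenizer applied to free text and to fielded text. The tag names title and body, the sample strings, and the helper names tokenize and parse_fields are hypothetical, and the regular expression only handles simple, non-nested tags; real markup would call for a proper parser.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

def parse_fields(text):
    """Extract simple <tag>content</tag> sections into a dict (no nesting)."""
    return dict(re.findall(r"<(\w+)>(.*?)</\1>", text, flags=re.S))

# Free (unstructured) text: one continuous sequence of tokens.
free_text = "Hello, world!"
print(tokenize(free_text))  # ['Hello', ',', 'world', '!']

# Fielded (structured) text: sections distinguished by markup.
fielded_text = "<title>Text Documents</title><body>Tokens, terms.</body>"
fields = parse_fields(fielded_text)
field_tokens = {name: tokenize(content) for name, content in fields.items()}
print(field_tokens["title"])  # ['Text', 'Documents']
print(field_tokens["body"])   # ['Tokens', ',', 'terms', '.']
```

Keeping tokens per field lets a query be restricted to one section, for example the title only.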
Why the focus on text?
• Language is the most powerful query model
• Language can be treated as text
– Text has many interesting properties
• Others?