Transcript mg4j-exercise
MG4J: Managing Gigabytes for Java Exercise Ida Mele
Document
• Indexing in MG4J is centered around documents
• Package: it.unimi.di.big.mg4j.document
• A document object, i.e. an instance of the class Document, represents a single document that can be indexed
• Different documents have different numbers and types of fields. For example:
  • E-mail: from, to, date, subject, body
  • HTML page: title, url, body
Ida Mele MG4J - exercise 1
Document
• Summary of methods: [figure not included in this transcript]
DocumentCollection
• Package: it.unimi.di.big.mg4j.document
• A DocumentCollection is a randomly addressable list of documents
FileSetDocumentCollection
• Package: it.unimi.di.big.mg4j.document
• The main method of FileSetDocumentCollection builds and serializes a collection of documents specified by their filenames
Document Factory
• Package: it.unimi.di.big.mg4j.document
• The factory turns a pure stream of bytes (a file) into a document made of several fields (e.g. title and text)
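To make the idea of a factory concrete, here is a minimal sketch in Python rather than MG4J's own Java API. It assumes an HTML input and two fields, title and text; the names `TitleTextParser` and `html_factory` are hypothetical and do not correspond to any MG4J class.

```python
from html.parser import HTMLParser


class TitleTextParser(HTMLParser):
    """Collects the <title> content and all remaining text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        target = self.title_parts if self.in_title else self.text_parts
        target.append(data)


def html_factory(raw_bytes, encoding="UTF-8"):
    """Hypothetical factory: raw bytes in, a dict of named fields out."""
    parser = TitleTextParser()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    return {
        # Collapse runs of whitespace so that the fields are clean strings.
        "title": " ".join("".join(parser.title_parts).split()),
        "text": " ".join(" ".join(parser.text_parts).split()),
    }


doc = html_factory(
    b"<html><head><title>Home page</title></head>"
    b"<body><p>Hello <b>world</b></p></body></html>"
)
print(doc)
```

The encoding is passed explicitly because the factory receives bytes, not characters, which is also why the exercise later sets an encoding property when building the collection.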
Standard MG4J Document Factories
• CompositeDocumentFactory
• HtmlDocumentFactory
• IdentityDocumentFactory
• MailDocumentFactory
• PdfDocumentFactory
• ReplicatedDocumentFactory
• PropertyBasedDocumentFactory
• TRECHeaderDocumentFactory
• ZipDocumentCollection.ZipFactory
Query
• Package: it.unimi.di.big.mg4j.query
• To query the index we can use the main method of the class Query
• We can submit queries by using:
  • command line
  • web browser
• QueryEngine: the query engine receives the query and returns the ranked list of results
• HttpQueryServer: a simple web server for query processing
Indexing and querying: exercise
• TECHNICAL REQUIREMENTS:
  • UNIX operating system
  • Java (>= 6)
• The document collection and the libraries are available at: http://www.dis.uniroma1.it/~mele/WebIR.html
Set the classpath
• Download and extract htmlDIS.tar.gz
• Download and extract lib.zip
• Download the file set-classpath.sh
• Edit the first line of the file set-classpath.sh: replace your_directory with the path of the folder containing all the .jar files (the lib folder)
• Set the CLASSPATH:
source set-classpath.sh
Building the collection of documents (1)
• Help:
java it.unimi.di.big.mg4j.document.FileSetDocumentCollection --help
• Create the collection:
find htmlDIS -iname \*.html | java it.unimi.di.big.mg4j.document.FileSetDocumentCollection -f HtmlDocumentFactory -p encoding=UTF-8 dis.collection
find returns the list of files, one per line. This list is provided as input to the main method of FileSetDocumentCollection
Building the collection of documents (2)
• We also need to specify a factory (the -f option) and the encoding as a property
• The name of the collection is dis.collection
• The collection does not contain the files, but only their names
• Deleting or modifying files of the htmlDIS directory may cause inconsistencies in the collection
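The following Python sketch illustrates why a collection that stores only file names becomes inconsistent when the underlying files change. `FileNameCollection` is a hypothetical stand-in, not the MG4J class, and the file used is a temporary one created only for the demonstration.

```python
import os
import tempfile


class FileNameCollection:
    """Hypothetical collection: keeps file names only, reads content on demand."""

    def __init__(self, filenames):
        self.filenames = list(filenames)

    def __len__(self):
        return len(self.filenames)

    def document(self, index):
        # Random access by position: open the file whose name is stored at `index`.
        with open(self.filenames[index], "rb") as f:
            return f.read()


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "a.html")
    with open(path, "wb") as f:
        f.write(b"<title>A</title>")

    collection = FileNameCollection([path])
    first_read = collection.document(0)

    # Deleting the file leaves a dangling name inside the collection.
    os.remove(path)
    try:
        collection.document(0)
        still_readable = True
    except FileNotFoundError:
        still_readable = False

print(len(collection), still_readable)
```

The collection still reports one document after the deletion, but that document can no longer be read, which is the kind of inconsistency the slide warns about.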
Building the index
• Help:
java it.unimi.di.big.mg4j.tool.IndexBuilder --help
• Create the index:
java it.unimi.di.big.mg4j.tool.IndexBuilder --downcase -S dis.collection dis
  • --downcase: this option forces all the terms to be downcased
  • -S: specifies that we are producing an index for the specified collection. If the option is omitted, IndexBuilder expects to index a document sequence read from standard input
  • dis: basename of the index
• If you have memory problems, you can use -Xmx to allocate more memory to Java:
java -Xmx512M it.unimi.di.big.mg4j.tool.IndexBuilder --downcase -S dis.collection dis
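As a rough illustration of what downcasing changes, the Python sketch below builds a toy inverted index with and without case folding. It assumes whitespace tokenization and in-memory posting lists; `build_index` is a hypothetical helper and says nothing about how IndexBuilder works internally.

```python
from collections import defaultdict


def build_index(docs, downcase=True):
    """Maps each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            key = term.lower() if downcase else term
            postings[key].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = ["Computer Science", "computer engineering"]
folded = build_index(docs, downcase=True)
exact = build_index(docs, downcase=False)

print(folded["computer"])   # both documents share one term
print(sorted(exact))        # "Computer" and "computer" are distinct terms
```

With downcasing, a query for computer matches both documents; without it, the two spellings end up as separate dictionary entries.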
Index files (1)
• dis-{text,title}.terms: contain the terms of the dictionary, one term per line
more dis-text.terms
• dis-{text,title}.stats: contain statistics
more dis-text.stats
• dis-{text,title}.properties: contain global information
more dis-text.properties
Index files (2)
• dis-{text,title}.frequencies: for each term, there is the number of documents with the term ( -code)
• dis-{text,title}.globcounts: for each term, there is the number of occurrences of the term ( -code)
• dis-{text,title}.offsets: for each term, there is the offset ( -code)
Index files (3)
• dis-{title,text}.sizes: contain the list of the document sizes. The size of a document is the number of words it contains ( -code)
• dis-{text,title}.batch: temporary files with sub-indices ( -code). Use the option --keep-batches to keep the temporary files
• dis-{text,title}.index: contain the index ( -code)
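The quantities stored in the terms, frequencies, globcounts and sizes files can be computed by hand on a toy collection. This Python sketch assumes whitespace tokenization and keeps everything in memory; it does not reproduce the compressed on-disk encoding of the real files.

```python
from collections import Counter

docs = ["rome university", "university of rome rome", "computer science"]
tokenized = [d.split() for d in docs]

# sizes: number of words in each document
sizes = [len(words) for words in tokenized]

frequencies = Counter()   # number of documents containing the term
globcounts = Counter()    # total number of occurrences of the term
for words in tokenized:
    globcounts.update(words)
    frequencies.update(set(words))   # count each document at most once

# terms: the dictionary, one term per entry (sorted here for illustration)
terms = sorted(globcounts)

for term in terms:
    print(term, frequencies[term], globcounts[term])
```

Note the difference for rome: it appears in two documents (frequency 2) but three times overall (global count 3).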
Web server
• Help:
java it.unimi.di.big.mg4j.query.Query --help
• Querying the index:
java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem -c dis.collection dis-text dis-title
  • Command line:
{text, title} > computer
  • Web browser:
http://localhost:4242/Query
Query (1)
• Search one word: the result is the set of documents that contain the specified word
  • Example: computer
• AND: more than one term separated by whitespace, by AND, or by &. The result is the set of documents that contain all the specified words
  • Example: computer science
  • Example: computer AND science
  • Example: computer & science
Query (2)
• OR: more than one term separated by OR or |. The result is the set of documents that contain any of the given words
  • Example: conference | workshop
• NOT: the operator NOT or ! is used for negation
  • Example: conference & ! workshop
• Parentheses: parentheses are used to enforce priority in complex queries
  • Example: university & (rome | california)
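The Boolean operators above map directly onto set operations over posting lists. The Python sketch below uses a made-up toy index (document identifiers 0 to 3) to evaluate queries of the same shape as the examples; it is an illustration of the semantics, not of MG4J's query engine.

```python
# Toy index: term -> set of identifiers of the documents containing it.
index = {
    "conference": {0, 1, 2},
    "workshop": {1, 3},
    "university": {0, 2, 3},
    "rome": {0},
    "california": {3},
}
all_docs = {0, 1, 2, 3}


def docs_with(term):
    """Documents containing the term (empty set for unknown terms)."""
    return index.get(term, set())


# conference & university
and_result = docs_with("conference") & docs_with("university")
# conference | workshop
or_result = docs_with("conference") | docs_with("workshop")
# conference & ! workshop
not_result = docs_with("conference") & (all_docs - docs_with("workshop"))
# university & (rome | california)
paren_result = docs_with("university") & (docs_with("rome") | docs_with("california"))

print(and_result, or_result, not_result, paren_result)
```

Negation is evaluated against the set of all documents, which is why NOT is normally combined with another operand, as in the slide's example.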
Query (3)
• Proximity restriction: the words must appear within a limited portion of the document
  • Example: (university rome)~6
• Phrase: using " " we can look for documents that contain the exact phrase
  • Example: "university of rome la sapienza"
• Ordered AND: more than one term separated by <. The result is the set of documents that contain the terms in the specified order
  • Example: computer < science < department
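Proximity, phrase and ordered AND all depend on word positions rather than on mere presence. The Python sketch below checks the three conditions on a single tokenized document. The helpers `within`, `ordered` and `phrase` are hypothetical, and the window is counted here as a number of consecutive words, which may differ in detail from MG4J's own convention.

```python
from itertools import product


def positions(words, term):
    """Positions at which the term occurs in the document."""
    return [i for i, w in enumerate(words) if w == term]


def within(words, terms, window):
    """True if one occurrence of each term fits in `window` consecutive words."""
    choices = [positions(words, t) for t in terms]
    return any(max(c) - min(c) + 1 <= window for c in product(*choices))


def ordered(words, terms):
    """True if the terms occur in the given order (not necessarily adjacent)."""
    start = 0
    for t in terms:
        later = [p for p in positions(words, t) if p >= start]
        if not later:
            return False
        start = later[0] + 1   # the next term must come after this occurrence
    return True


def phrase(words, terms):
    """True if the terms occur consecutively and in order."""
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


doc = "the university of rome la sapienza is a university in rome".split()

print(within(doc, ["university", "rome"], 6))
print(ordered(doc, ["sapienza", "of"]))
print(phrase(doc, "university of rome la sapienza".split()))
```

A phrase is the strictest condition, ordered AND only constrains the order, and proximity only constrains the distance.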
Query (4)
• Wildcard (*): wildcard queries can be submitted by appending * at the end of a term
  • Example: infor*
• Index specifiers: by prefixing a query with the name of an index followed by : you can restrict the search to that index
  • Example: title:computer
  • Example: text:computer science AND title:FOCS
Sophisticated queries (1)
• MG4J provides sophisticated query tuning
• To use these features, we must use the command line interface
• $ --- to get some help on the available options
• Some examples:
  • $mode --- to choose the kind of results. Example:
> $mode short
  • $selector --- to choose the way the snippets or intervals are shown. Example:
> $selector 3 40
Sophisticated queries (2)
• Other examples:
  • $mplex --- when multiplexing is on, each query is multiplexed to all indices. When a scorer is used, it is a good idea to use multiplexing. Example:
> $mplex on
  • $score --- to choose the scorer. Example:
> $score VignaScorer
  • $weight --- to change the weight of the indices. This is useful when multiplexing is on. Example:
> $weight text:1 title:3
Scorer (1)
• Scorers are important for ranking the documents returned by a query. Default: BM25Scorer and VignaScorer
• ConstantScorer. Each document has a constant score (default is 0)
> $score ConstantScorer
• CountScorer. The score is the product of the number of occurrences of the term in the document and the weight assigned to the index
> $score CountScorer
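The CountScorer description can be turned into a few lines of Python. The product of occurrences and index weight follows the slide; adding up the per-index scores in `combined_score` is an assumption made only for this sketch, and both function names are hypothetical.

```python
def count_score(occurrences, index_weight=1.0):
    """Count-style score: occurrences of the term times the index weight."""
    return occurrences * index_weight


def combined_score(occurrences_by_index, weights):
    """Assumption for this sketch: per-index scores are simply added up."""
    return sum(
        count_score(count, weights.get(name, 1.0))
        for name, count in occurrences_by_index.items()
    )


weights = {"text": 1, "title": 3}        # as in: $weight text:1 title:3

doc_a = {"text": 4, "title": 0}          # term occurs 4 times in the body only
doc_b = {"text": 1, "title": 2}          # fewer occurrences, but in the title

score_a = combined_score(doc_a, weights)
score_b = combined_score(doc_b, weights)
print(score_a, score_b)
```

With a heavier title weight, the document mentioning the term in its title outranks the one that only repeats it in the body.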
Scorer (2)
• TfIdfScorer. It implements TF/IDF
  • TF is the term frequency of the term t in the document d: c/l, where c is the number of occurrences of t in d and l is the length of d
  • IDF is the inverse document frequency of the term t in the collection: log(N/f), where N is the number of documents in the collection and f is the number of documents where t appears
> $score TfIdfScorer
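The two formulas can be checked numerically with a short Python sketch. It uses the symbols c, l, N and f exactly as defined above; multiplying TF by IDF and using the natural logarithm are assumptions of the sketch, since the slide does not state the base of the logarithm or how the scorer combines the two factors.

```python
import math


def tf(c, l):
    """Term frequency: c occurrences of t in a document of length l."""
    return c / l


def idf(N, f):
    """Inverse document frequency: N documents, f of which contain t."""
    return math.log(N / f)   # natural logarithm (assumption)


def tf_idf(c, l, N, f):
    """Assumed combination: the product of the two factors."""
    return tf(c, l) * idf(N, f)


# A term occurring 3 times in a 100-word document,
# and appearing in 10 documents out of 1000.
print(tf(3, 100), idf(1000, 10), tf_idf(3, 100, 1000, 10))
```

A term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to the score however often it occurs.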
Scorer (3)
• DocumentRankScorer. The scores of documents are stored in a text file
> $score DocumentRankScorer nameFile
Virtual fields (1)
• A virtual field produces pieces of text that refer to other documents (possibly belonging to the collection)
• Referrer: the document that is referring to another document
• Referee: the document to which a piece of text of the Referrer refers
• Intuitively, the Referrer gives us information about the Referee
• The Referrer produces in a virtual field a number of fragments of text, each referring to a Referee
• The content of a virtual field is a list of pairs, each made of a piece of text (called virtual fragment) and a string that represents the Referee (called the document spec)
Virtual fields (2)
• In the case of an HTML document:
  • the document spec is a URL (as specified in the href attribute)
  • the virtual fragment is the content of the anchor element and some surrounding text (anchor context)
• The HtmlDocumentFactory produces the pairs (document spec, virtual fragment)
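The pairs produced for an HTML page can be illustrated with Python's standard HTML parser. This sketch keeps only the anchor text as the virtual fragment and leaves out the surrounding anchor context; `AnchorExtractor` and the example URLs are hypothetical.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (document spec, virtual fragment) pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.href = None      # href of the anchor currently open, if any
        self.buffer = []      # text seen inside the open anchor
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.buffer = []

    def handle_data(self, data):
        if self.href is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            fragment = " ".join("".join(self.buffer).split())
            self.pairs.append((self.href, fragment))
            self.href = None


extractor = AnchorExtractor()
extractor.feed(
    '<p>See the <a href="http://example.org/conf.html">conference <b>page</b></a> '
    'and <a href="people.html">staff</a>.</p>'
)
extractor.close()
print(extractor.pairs)
```

Each pair describes the referee (the URL) using words written by the referrer, which is what makes anchor text useful for retrieval.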
Virtual fields (3)
• Create the list of URLs of the documents in the collection:
java it.unimi.di.big.mg4j.tool.ScanMetadata -S dis.collection -u dis.urls
• Create the document resolver. It maps the document specs produced by a document factory into actual references to documents in the collection
• Given a document spec, the resolver decides whether the spec really refers to a document in the collection and, if so, finds out which document the spec refers to:
java it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver -o dis.urls dis-anchor.resolver
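Conceptually, a resolver is a lookup from document specs to document positions in the collection. The Python sketch below uses a plain dictionary built from a URL list; `UrlResolver`, the example URLs, and the use of -1 for unresolved specs are assumptions of the sketch and not the behaviour of URLMPHVirtualDocumentResolver.

```python
class UrlResolver:
    """Hypothetical resolver: maps a URL to the index of the matching document."""

    def __init__(self, urls):
        # The position of each URL in the list is the document index.
        self.index_of = {url: i for i, url in enumerate(urls)}

    def resolve(self, spec):
        """Returns the document index, or -1 if the spec is outside the collection."""
        return self.index_of.get(spec, -1)


urls = [
    "http://example.org/index.html",
    "http://example.org/people.html",
    "http://example.org/conf.html",
]
resolver = UrlResolver(urls)

print(resolver.resolve("http://example.org/conf.html"))
print(resolver.resolve("http://elsewhere.example/page.html"))
```

Fragments whose spec cannot be resolved point outside the collection and therefore cannot be attached to any indexed document.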
Virtual fields (4)
• Building the index:
java it.unimi.di.big.mg4j.tool.IndexBuilder -a -v anchor:dis-anchor.resolver --downcase -S dis.collection dis
• Querying the index:
java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem -c dis.collection dis-text dis-title dis-anchor
{text, title, anchor} > anchor:conference
{text, title, anchor} > title:combinatorial algorithms AND anchor:conference
{text, title, anchor} > text:RoboCup AND anchor:info
Virtual gap (1)
• All the virtual fragments that refer to a given document of the collection are treated as a single text, called virtual text
• Virtual fragments coming from different anchors are concatenated, and they are in a text file
• This may produce false positive results
• For example, the query anchor:(computer AND science) returns a list of documents that contain both words in some of their anchors, but not necessarily in the same anchor
Virtual gap (2)
• To avoid such kinds of false positives, we can use virtual gaps
• The virtual gap is a positive integer, representing the virtual space left between different virtual fragments
• For example, if the virtual gap is 64 (the default), anchors are concatenated by leaving 64 "empty words" between subsequent fragments
• We can submit the query:
> anchor:(computer AND science)~64
and we will be sure that only documents containing both terms in the same anchor are retrieved
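The effect of the gap can be reproduced with word positions. This Python sketch concatenates fragments while skipping a configurable number of positions between them, then applies a window check. The helpers `virtual_text` and `both_within` are hypothetical, and the window is counted as a number of consecutive positions, which may differ in detail from MG4J's own convention.

```python
def virtual_text(fragments, gap):
    """Returns (word, position) pairs, leaving `gap` empty positions between fragments."""
    placed, pos = [], 0
    for fragment in fragments:
        for word in fragment.split():
            placed.append((word, pos))
            pos += 1
        pos += gap   # the "empty words" separating this fragment from the next
    return placed


def both_within(placed, a, b, window):
    """True if some occurrence of a and of b fit in `window` consecutive positions."""
    pos_a = [p for w, p in placed if w == a]
    pos_b = [p for w, p in placed if w == b]
    return any(abs(x - y) + 1 <= window for x in pos_a for y in pos_b)


# Two different anchors pointing to the same document.
fragments = ["computer lab", "science museum"]

no_gap = both_within(virtual_text(fragments, 0), "computer", "science", 64)
with_gap = both_within(virtual_text(fragments, 64), "computer", "science", 64)
same_anchor = both_within(
    virtual_text(["computer science department"], 64), "computer", "science", 64
)
print(no_gap, with_gap, same_anchor)
```

Without a gap the two words look close even though they come from different anchors; with a gap of 64 they are pushed too far apart for a window of 64, while words from the same anchor still match.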
Virtual gap (3)
• If the anchor is longer than 64 words, we can still have false positives
• In the indexing phase, it is possible to specify a different virtual gap
• For example, we can use:
java it.unimi.di.big.mg4j.tool.IndexBuilder -a -g anchor:100 -v anchor:dis-anchor.resolver --downcase -S dis.collection dis
It uses a virtual gap of 100 words
Term map (1)
• A simple representation of a dictionary is the term list (the file .terms): a text file containing the whole dictionary, one term per line, in index order (the first line contains the term with index 0, the second line the term with index 1, etc.)
• A more efficient representation is based on a monotone minimal perfect hash function: a very compact data structure that is able to answer the question "What is the index of the term XXX?"
• You can build such a function from a sorted term list using:
java it.unimi.dsi.sux4j.mph.MinimalPerfectHashFunction titles.mph dis-title.terms
Term map (2)
• Monotone minimal perfect hash functions have a serious limitation: they answer the question "What is the index of the term XXX?" correctly only if the term appears in the dictionary
• To solve this problem, we can use a signed function
• For terms not in the dictionary, the function will answer with a special value (-1) that means "the word is not in the dictionary"
java it.unimi.dsi.util.ShiftAddXorSignedStringMap titles.mph titles.map dis-title.terms
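The idea of signing can be sketched in Python as follows. `UnsignedMap` only imitates the behaviour of a minimal perfect hash function (unknown terms still receive some index); it is backed by a dictionary and is not a real implementation. `SignedMap` stores a short signature per term and compares it at lookup time. Both classes, and the choice of a truncated SHA-1 as signature, are assumptions of the sketch and not the scheme used by ShiftAddXorSignedStringMap.

```python
import hashlib


def signature(term):
    """Short fingerprint of a term (4 bytes of SHA-1, chosen for illustration)."""
    return hashlib.sha1(term.encode("utf-8")).digest()[:4]


class UnsignedMap:
    """Stand-in for a minimal perfect hash: every input gets an in-range index."""

    def __init__(self, terms):
        self.size = len(terms)
        self.index_of = {t: i for i, t in enumerate(terms)}

    def get(self, term):
        if term in self.index_of:
            return self.index_of[term]
        # Unknown term: the answer is an arbitrary, meaningless index.
        return int.from_bytes(signature(term), "big") % self.size


class SignedMap:
    """Wraps the unsigned function and rejects terms whose signature does not match."""

    def __init__(self, unsigned, terms):
        self.unsigned = unsigned
        self.signatures = [signature(t) for t in terms]

    def get(self, term):
        i = self.unsigned.get(term)
        return i if self.signatures[i] == signature(term) else -1


terms = ["algorithm", "computer", "science"]
unsigned = UnsignedMap(terms)
signed = SignedMap(unsigned, terms)

print(unsigned.get("zebra"), signed.get("zebra"), signed.get("computer"))
```

Because signatures are short, a term outside the dictionary can in principle collide with a stored signature, so the -1 answer is reliable only up to a small error probability.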
Term map (3)
• Wildcard searches require the use of a prefix map
• A prefix map is able to answer the question "What are the indices of terms starting with the characters YYY?"
• If terms are lexicographically sorted, the answer is a pair of integers, representing the first and the last index of terms satisfying the property
• We can build a prefix map by using:
java it.unimi.dsi.util.ImmutableExternalPrefixMap -b4Ki -o dis-title.terms dis-title.dict
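On a sorted term list, the pair of indices can be found with a binary search followed by a short scan. The Python sketch below works on an in-memory list; `prefix_range` is a hypothetical helper and does not reflect how ImmutableExternalPrefixMap is implemented.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Returns (first, last) indices of the terms starting with prefix, or None."""
    # All terms sharing a prefix are contiguous in a sorted list, and the
    # first of them sits where the prefix itself would be inserted.
    first = bisect_left(sorted_terms, prefix)
    last = first
    while last < len(sorted_terms) and sorted_terms[last].startswith(prefix):
        last += 1
    return (first, last - 1) if last > first else None


terms = ["computer", "info", "inform", "information", "science"]

print(prefix_range(terms, "infor"))   # the query infor*
print(prefix_range(terms, "x"))
```

A wildcard query such as infor* can then be expanded into the terms whose indices fall in the returned interval.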
Homework
1. Read the MG4J (big) manual: http://www.dis.uniroma1.it/~mele/teaching/WebIR/manual-mg4j.pdf
2. Repeat the exercise
3. Create your own document collection, build the inverted index (with or without virtual fields), then submit some queries and try the different scorers