Transcript Indexing
CIS 455/555: Internet and Web Systems
Indexing
February 1, 2016
© 2016 A. Haeberlen, Z. Ives
University of Pennsylvania
Announcements
New HW1 due dates
MS1 due on February 10th, MS2 due on February 19th
Try to have a feature-complete prototype by Friday, so you
have time to debug and test your solution, go to OH, etc.
Debugging tips
When in doubt about protocol details, please look in the
HTTP/1.1 spec (RFC2616; linked from HTTP Made Really Easy)
Reminder: You have three jokers; the late penalty
without jokers is 20% per day
Audio; waiting list
Reading:
D. Comer: "The Ubiquitous B-Tree"
http://dl.acm.org/citation.cfm?id=356776
Plan for today
Inverted indices (NEXT)
B+ trees
Finding data by content
We’ve seen two approaches to search:
Flood the network with requests (example: Gnutella), and do
all the work at the data stores
Have a directory based on names (example: LDAP)
Which of these is the 'best'?
An alternative, two-step process:
Build a content index over what’s out there
An index is a key → value map
Typically limited in what kinds of queries can be supported
Most common instance: an index of document keywords
Example: Incidence matrix.
Why is this not a good idea?
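One way to approach the question above is to build an incidence matrix for a tiny corpus and count the cells. This is a minimal sketch; the three documents are made up for illustration and are not from the lecture.

```python
# Hypothetical three-document corpus (not from the lecture).
docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "a quick dog barks",
}

# Vocabulary: every distinct word in the corpus, sorted for a stable row order.
vocab = sorted({word for text in docs.values() for word in text.split()})
doc_ids = sorted(docs)

# Incidence matrix: one row per word, one column per document, 1 if present.
matrix = [[1 if word in docs[d].split() else 0 for d in doc_ids] for word in vocab]

cells = len(vocab) * len(doc_ids)
ones = sum(sum(row) for row in matrix)
print(f"{cells} cells, {ones} ones, {cells - ones} zeros")
# prints "24 cells, 11 ones, 13 zeros"
```

Even here most cells are zero. The matrix has (number of words) x (number of documents) cells, while the number of ones only grows with the total text length, so at web scale almost all of the storage would hold zeros.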
A common model for search
Index the words in every document
“Forward index”: document (ID) → list of words
“Inverted index”: word → document (ID)
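The two maps can be sketched with plain dictionaries. The documents and their IDs below are hypothetical, and a real indexer would also normalize and tokenize the text more carefully.

```python
from collections import defaultdict

# Hypothetical documents keyed by document ID.
docs = {1: "to be or not to be", 2: "let it be", 3: "to err is human"}

# Forward index: document ID -> list of words in that document.
forward = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: word -> sorted list of IDs of the documents containing it.
inverted = defaultdict(set)
for doc_id, words in forward.items():
    for word in words:
        inverted[word].add(doc_id)
inverted = {word: sorted(ids) for word, ids in inverted.items()}

print(forward[2])      # ['let', 'it', 'be']
print(inverted["be"])  # [1, 2]
print(inverted["to"])  # [1, 3]
```

The inverted index is what answers a keyword query directly; the forward index is the natural output of parsing each document once.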
Inverted indices
A conceptually very simple map-multiset data
structure: <keyword, {list of occurrences}>
In its simplest form, each occurrence includes a
document pointer (e.g., URI), perhaps a count
and/or position
What might a count be useful for? A position?
Requires two components, an indexer and a
retrieval system
We’ll consider the cost of building the index, plus
searching the index using a single keyword
Storage efficiency is also a concern
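A rough sketch of the two components, assuming each occurrence stores a URI plus the word positions (the count is then the number of positions). The function names build_index and lookup and the example.org URIs are invented for this illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Indexer: map keyword -> {URI: [positions]}."""
    index = defaultdict(dict)
    for uri, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(uri, []).append(pos)
    return index

def lookup(index, keyword):
    """Retrieval: list of (URI, count, positions), highest count first."""
    postings = index.get(keyword.lower(), {})
    hits = [(uri, len(positions), positions) for uri, positions in postings.items()]
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# Hypothetical documents keyed by URI.
docs = {
    "http://example.org/a": "web systems index the web",
    "http://example.org/b": "an index of the web",
}
index = build_index(docs)
for hit in lookup(index, "web"):
    print(hit)
# ('http://example.org/a', 2, [0, 4])
# ('http://example.org/b', 1, [4])
```

In this sketch the count serves as a crude ranking signal, and the positions would allow phrase or proximity checks; both also make the index larger, which is where the storage concern comes in.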
How do we lay out an inverted index?
Which operations do we need to support?
insert
delete
find
next
Which data structures could we use?
Unordered list (e.g., a log)
Ordered list
Tree
Hash table
...
Unordered and ordered lists
Assume that we have entries such as:
<keyword, #items, {occurrences}>
What does ordering buy us?
Assume that we adopt a model in which we use:
<keyword, item>
<keyword, item>
Do we get any additional benefits?
How about:
<keyword, {items}>
where we fix the size of the keyword and the number
of items?
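To make the ordered-list option concrete, here is a small sketch of the <keyword, item> model kept in a sorted Python list. The items are hypothetical document IDs, and insert and find are names chosen for this example.

```python
import bisect

# Sorted list of <keyword, item> pairs; item is a hypothetical document ID.
entries = []

def insert(keyword, item):
    # Keeping the list sorted costs O(n) per insert because elements shift.
    bisect.insort(entries, (keyword, item))

def find(keyword):
    # Binary search for the first pair with this keyword: O(log n).
    i = bisect.bisect_left(entries, (keyword,))
    items = []
    while i < len(entries) and entries[i][0] == keyword:
        items.append(entries[i][1])
        i += 1  # "next": neighbors in the list are neighbors in key order
    return items

for kw, doc in [("dog", 3), ("art", 1), ("dog", 1), ("bit", 2)]:
    insert(kw, doc)

print(entries)      # [('art', 1), ('bit', 2), ('dog', 1), ('dog', 3)]
print(find("dog"))  # [1, 3]
print(find("cat"))  # []
```

Ordering buys a logarithmic find and a cheap next, at the price of a linear-time insert. An unordered log has the opposite trade-off: appends are cheap, but find must scan everything.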
Tree-based indices
Trees have several benefits over lists:
Potentially logarithmic search time (if it is balanced!), as with a
well-designed sorted list
Ability to handle variable-length records
We’ve already seen how trees might make a natural
way of distributing data, as well
How does a binary search tree fare?
Cost of building?
Cost of finding an item in it?
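A quick experiment can make the binary search tree question tangible. The sketch below uses a plain, unbalanced BST (no rotations) and compares the height reached when 1000 keys arrive in sorted order with the height for the same keys in shuffled order. All names are illustrative.

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Plain (unbalanced) BST insert, iterative to avoid deep recursion."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right

def height(root):
    """Number of levels, counted one level at a time."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c is not None]
    return levels

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

keys = list(range(1000))
sorted_height = height(build(keys))    # keys arrive in sorted order
random.seed(0)
random.shuffle(keys)
shuffled_height = height(build(keys))  # same keys, shuffled order

print(sorted_height, shuffled_height)
```

With sorted input the tree degenerates into a chain of 1000 levels, so both building and finding cost as much as scanning a list. The shuffled tree is far shallower, but nothing guarantees that shape, which is the motivation for a structure that stays balanced by construction.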
Recap: Inverted indices
Useful for search
Different data structures can be used
Pros / cons
Plan for today
Inverted indices
B+ trees (NEXT)
The B+ tree
A flexible, height-balanced, high-fanout tree
Insert/delete at log_F N cost (F = fanout, N = # leaf pages)
Minimum 50% occupancy (except for root)
Need to keep tree height-balanced
Each node contains d <= m <= 2d entries
Inner nodes contain up to 2d+1 pointers
d is called the order of the tree
Can search efficiently based on equality (or also
range, though we don’t need that here)
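As a rough worked example of the log_F N cost, the sketch below counts how many index levels are needed to reach N leaf pages for a given fanout. The order d = 100 is an assumed value chosen only for illustration, and levels_needed is a hypothetical helper name.

```python
def levels_needed(num_leaf_pages, fanout):
    """Smallest h such that fanout**h >= num_leaf_pages (integer arithmetic)."""
    h, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout
        h += 1
    return h

d = 100                 # hypothetical order of the tree
min_fanout = d + 1      # an inner node with d entries has d+1 pointers
max_fanout = 2 * d + 1  # an inner node with 2d entries has 2d+1 pointers

for n in (10**3, 10**6, 10**8):
    print(n, levels_needed(n, max_fanout), levels_needed(n, min_fanout))
# 1000 2 2
# 1000000 3 3
# 100000000 4 4
```

Under these assumptions, even a hundred million leaf pages are reachable through four levels of index nodes, whether the nodes are completely full or only half full.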
[Figure: index entries (direct search) in the upper levels; data entries ("sequence set") at the leaves, chained as a linked list (compare to B-tree!)]
Example B+ Tree
Data (inverted list pointers) is at the leaves;
intermediate nodes have copies of search keys
Search begins at root, and key comparisons direct it
to a leaf
Search for be↓, bobcat↓ ...
Root: [art | best | but | dog]
Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Based on the search for bobcat*, we know it is not in the tree!
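The two searches on this slide can be traced with a short sketch. It hard-codes the example tree as one root and five leaves; the list layout and the function name search are choices made for this illustration, and the pointers to the inverted lists are left out.

```python
import bisect

# The example tree, flattened to two levels: one root and five leaves.
root_keys = ["art", "best", "but", "dog"]
leaves = [
    ["a", "am", "an", "ant"],
    ["art", "be"],
    ["best", "bit", "bob"],
    ["but", "can", "cry"],
    ["dog", "dry", "elf", "fox"],
]

def search(key):
    """Return (leaf_number, found) for key."""
    # Keys below root_keys[0] go to leaf 0; a key >= root_keys[i] goes right of it.
    child = bisect.bisect_right(root_keys, key)
    return child, key in leaves[child]

print(search("be"))      # (1, True)
print(search("bobcat"))  # (2, False)
```

The comparisons at the root send bobcat to the single leaf that could hold it (best, bit, bob). Since it is absent there, it cannot be anywhere else in the tree.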
Inserting data into a B+ Tree
Find correct leaf L
Put data entry onto L
If L has enough space, we are done!
Else, must split leaf node L (into L and a new node L2)
Redistribute entries evenly, copy up middle key
Insert index entry pointing to L2 into parent of L
This can happen recursively
To split index node, redistribute entries evenly, but push up
middle key. (Contrast with leaf splits.)
[Figure: the example B+ tree. Root: art | best | but | dog; leaves: a↓ am↓ an↓ ant↓ | art↓ be↓ | best↓ bit↓ bob↓ | but↓ can↓ cry↓ | dog↓ dry↓ elf↓ fox↓]
Splits “grow” tree; root split increases height
Tree growth: gets wider or one level taller at the top
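The insertion steps can be sketched in a few dozen lines. This is a simplified illustration and not a production implementation: Node, insert, insert_root and the constant D are names chosen here, leaves hold bare keys without inverted-list pointers, and the leaf-to-leaf links and deletion are omitted.

```python
import bisect

D = 2  # order d: every node holds at most 2*d keys

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None marks a leaf

def insert(node, key):
    """Insert key below node; return (separator, new_right_node) if node split."""
    if node.children is None:                  # leaf
        bisect.insort(node.keys, key)
        if len(node.keys) <= 2 * D:
            return None                        # enough space: done
        mid = len(node.keys) // 2
        right = Node(node.keys[mid:])          # L2 receives the upper half
        node.keys = node.keys[:mid]
        return right.keys[0], right            # COPY up: key also stays in the leaf
    i = bisect.bisect_right(node.keys, key)    # pick the child to descend into
    split = insert(node.children[i], key)
    if split is None:
        return None
    sep, right_child = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right_child)
    if len(node.keys) <= 2 * D:
        return None
    mid = len(node.keys) // 2
    up = node.keys[mid]                        # PUSH up: key leaves this node
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up, right

def insert_root(root, key):
    split = insert(root, key)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])          # root split: tree grows one level

# The example tree from the slides, then the insertion of "and".
leaf_keys = [["a", "am", "an", "ant"], ["art", "be"], ["best", "bit", "bob"],
             ["but", "can", "cry"], ["dog", "dry", "elf", "fox"]]
root = Node(["art", "best", "but", "dog"], [Node(ks) for ks in leaf_keys])
root = insert_root(root, "and")

print(root.keys)                                         # ['best']
print([c.keys for c in root.children])                   # [['an', 'art'], ['but', 'dog']]
print([leaf.keys for leaf in root.children[0].children])
# [['a', 'am'], ['an', 'and', 'ant'], ['art', 'be']]
```

The only difference between the two split branches is what happens to the middle key: a leaf split keeps it in the new right leaf and copies it to the parent, while an index split removes it from the node and hands it to the parent.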
Inserting “and↓” Example: Copy up
Root: [art | best | but | dog]
Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Want to insert and↓ into the first leaf; no room, so split & copy up:
[a↓ am↓] [an↓ and↓ ant↓]
Entry to be inserted in parent node: an
(Note that key “an” is copied up and
continues to appear in the leaf.)
But where? Parent node
is already "full"!
Inserting “and↓” Example: Push up 1/2
Need to split node
& push up
Root: [art | best | but | dog] (full), entry waiting to be inserted: an
Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Inserting “and↓” Example: Push up 2/2
Entry to be inserted in parent node: best
(Note that best is pushed up and only
appears once in the index. Contrast
this with a leaf split.)
New root: [best]
Index nodes: [an | art] [but | dog]
Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Summary: Copying vs. splitting
Every keyword (search key) appears in at
most one intermediate node
Hence, in splitting an intermediate node, we push up
Every inverted list entry must appear in a leaf
We may also need it in an intermediate node to define a
partition point in the tree
We must copy up the key of this entry
Note that B+ trees easily accommodate
multiple occurrences of a keyword
Some details
How would you choose the order of the tree?
How would you find all the words starting
with the letters 'com'?
How would you delete something?
Do you always have to split/merge?
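For the prefix question, one possible approach is a single search for the first key that is not smaller than the prefix, followed by a walk along the leaf-level sequence set. The sketch below uses a sorted Python list as a stand-in for the chained leaves; the keywords and the name with_prefix are made up for illustration.

```python
import bisect

# Hypothetical keywords; the sorted list stands in for the chained leaf pages.
keys = sorted(["cold", "com", "comet", "common", "compute", "con", "cup", "dog"])

def with_prefix(prefix):
    # Step 1: locate the first key >= prefix (in a B+ tree: one root-to-leaf search).
    i = bisect.bisect_left(keys, prefix)
    # Step 2: follow the sequence set until the keys stop matching the prefix.
    matches = []
    while i < len(keys) and keys[i].startswith(prefix):
        matches.append(keys[i])
        i += 1
    return matches

print(with_prefix("com"))  # ['com', 'comet', 'common', 'compute']
print(with_prefix("cz"))   # []
```

This works because all keys sharing a prefix are contiguous in sorted order, so the scan can stop at the first key that does not match.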
Virtues of the B+ Tree
B+ tree and other indices are quite efficient:
Height-balanced; log_F N cost to search
High fanout (F) means depth rarely more than 3 or 4
Almost always better than maintaining a sorted file
Typically, 67% occupancy on average
Berkeley DB library (C, C++, Java; Oracle) is a
toolkit for B+ trees that you will be using
later in the semester:
Interface: open B+ Tree; get and put items based on key
Handles concurrency, caching, etc.
Example: B+ tree
[Figure: B+ tree over integer keys. Node contents in extraction order: 65 | 9 | 25 45 | 70 | 130 | 187 | 80 101 122 | 138 150 159 180 | 65 67 68 69 70 72 75 79 | 1 4 6 | 9 14 16 | 25 31 38 41 | 45 61 63 64]
Insert 15, 11, 12, 32, 74
How do we distribute a B+ Tree?
We need to host the root at one machine and distribute the rest
What are the implications for scalability?
Consider building the index as well as searching
Eliminating the root
Sometimes we don’t want a tree-structured
system because the higher levels can be a
central point of congestion or failure
Two strategies:
Modified tree structure (e.g., the BATON p2p tree; see
Jagadish et al., VLDB 2005)
Non-hierarchical structure (distributed hash table,
discussed in a couple of weeks)
Recap: B+ trees
A very common data structure for indices
Used, e.g., in many file systems and many DBMS
Very efficient
Height-balanced; log_F N cost to search
High fanout (F) means depth rarely more than 3 or 4
Almost always better than maintaining a sorted file
Typically, 67% occupancy on average