Indexing

CIS 455/555: Internet and Web Systems
Indexing
February 1, 2016
© 2016 A. Haeberlen, Z. Ives
University of Pennsylvania
Announcements

New HW1 due dates


MS1 due on February 10th, MS2 due on February 19th
Try to have a feature-complete prototype by Friday, so you
have time to debug and test your solution, go to OH, etc.





Debugging tips
When in doubt about protocol details, please look in the
HTTP/1.1 spec (RFC2616; linked from HTTP Made Really Easy)
Reminder: You have three jokers; the late penalty
without jokers is 20% per day
Audio; waiting list
Reading:

D. Comer: "The Ubiquitous B-Tree"
http://dl.acm.org/citation.cfm?id=356776
Plan for today


Inverted indices (NEXT)
B+ trees
Finding data by content

We’ve seen two approaches to search:


Flood the network with requests (example: Gnutella), and do
all the work at the data stores
Have a directory based on names (example: LDAP)


Which of these is the 'best'?
An alternative, two-step process:

Build a content index over what’s out there




An index is a key → value map
Typically limited in what kinds of queries can be supported
Most common instance: an index of document keywords
Example: Incidence matrix.

Why is this not a good idea?
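One way to approach the question above is to build a tiny incidence matrix and count how many cells are actually set. The sketch below uses a made-up three-document corpus (the documents and IDs are assumptions for illustration only); on a real collection the vocabulary is far larger and the fraction of non-zero cells far smaller.

```python
# Toy corpus (hypothetical): document ID -> text.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick dog",
}

# Vocabulary = every distinct word, in a fixed column order.
vocab = sorted({w for text in docs.values() for w in text.split()})

# Incidence matrix: one row per document, one 0/1 cell per vocabulary word.
matrix = {
    doc_id: [1 if w in text.split() else 0 for w in vocab]
    for doc_id, text in docs.items()
}

cells = len(docs) * len(vocab)                    # storage grows as docs x words
ones = sum(sum(row) for row in matrix.values())   # cells that carry information
print(f"{ones} of {cells} cells are set")         # prints "9 of 18 cells are set"
```

Every new document adds a full row and every new word adds a full column, even though most cells stay zero.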
A common model for search



Index the words in every document
“Forward index”: document (ID) → list of words
“Inverted index”: word → document (ID)
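The two maps can be sketched in a few lines. This is only an illustration with a hypothetical two-document corpus; the names `forward` and `inverted` are chosen here and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical corpus: document ID -> text.
docs = {1: "web systems index the web", 2: "systems build an index"}

# Forward index: document ID -> list of words in that document.
forward = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: word -> set of IDs of the documents that contain it.
inverted = defaultdict(set)
for doc_id, words in forward.items():
    for word in words:
        inverted[word].add(doc_id)

print(forward[2])                  # ['systems', 'build', 'an', 'index']
print(sorted(inverted["index"]))   # [1, 2]
print(sorted(inverted["web"]))     # [1]
```

The inverted index is obtained by "turning around" the forward index, which is why search by keyword becomes a single lookup.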
Inverted indices


A conceptually very simple map-multiset data
structure: <keyword, {list of occurrences}>
In its simplest form, each occurrence includes a
document pointer (e.g., URI), perhaps a count
and/or position



What might a count be useful for? A position?
Requires two components, an indexer and a
retrieval system
We’ll consider the cost of building the index, plus
searching the index using a single keyword

Storage efficiency is also a concern
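A minimal sketch of the two components, assuming in-memory dictionaries and whitespace tokenization (both are simplifications made here, not something the slides prescribe). Each occurrence keeps a document pointer, and the stored positions give the count for free; the count is used below to order results, which is one possible use among several.

```python
from collections import defaultdict

def build_index(docs):
    """Indexer: map keyword -> {document pointer: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search(index, keyword):
    """Retrieval: (document, count, positions) tuples, highest count first."""
    postings = index.get(keyword.lower(), {})
    hits = [(doc_id, len(p), p) for doc_id, p in postings.items()]
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# "u1" and "u2" are hypothetical stand-ins for document URIs.
docs = {"u1": "to be or not to be", "u2": "be quick"}
index = build_index(docs)
print(search(index, "be"))   # [('u1', 2, [1, 5]), ('u2', 1, [0])]
```

Positions would additionally allow phrase or proximity checks, at the price of a larger index.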
How do we lay out an inverted index?

Which operations do we need to support?





insert
delete
find
next
Which data structures could we use?





Unordered list (e.g., a log)
Ordered list
Tree
Hash table
...
Unordered and ordered lists





Assume that we have entries such as:
<keyword, #items, {occurrences}>
What does ordering buy us?
Assume that we adopt a model in which we use:
<keyword, item>
<keyword, item>
Do we get any additional benefits?
How about:
<keyword, {items}>
where we fix the size of the keyword and the number
of items?
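One benefit of ordering can be sketched with Python's standard `bisect` module: `find` and `next` become binary searches instead of full scans. The entries below are hypothetical, and the sketch keeps the list in memory; insert and delete would still have to shift entries to preserve the order.

```python
from bisect import bisect_left, bisect_right

# Ordered list of <keyword, {occurrences}> entries (made-up data).
entries = [("ant", ["d1"]), ("bee", ["d2", "d3"]), ("cat", ["d1"]), ("dog", ["d4"])]
keys = [keyword for keyword, _ in entries]   # kept sorted

def find(keyword):
    """Binary search for keyword; returns its occurrences or None."""
    i = bisect_left(keys, keyword)
    if i < len(keys) and keys[i] == keyword:
        return entries[i][1]
    return None

def next_key(keyword):
    """Smallest stored keyword strictly greater than keyword, or None."""
    i = bisect_right(keys, keyword)
    return keys[i] if i < len(keys) else None

print(find("bee"))       # ['d2', 'd3']
print(find("cow"))       # None
print(next_key("bee"))   # cat
```

With fixed-size entries, the same binary search could jump directly to a byte offset in a file, since entry i would start at i times the entry size.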
Tree-based indices

Trees have several benefits over lists:

Potentially logarithmic search time, as with a well-designed sorted list (if the tree is balanced!)
Ability to handle variable-length records
We’ve already seen how trees might make a natural
way of distributing data, as well
How does a binary search tree fare?


Cost of building?
Cost of finding an item in it?
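To explore these questions, the sketch below builds a plain (unbalanced) binary search tree twice from the same 200 keys, once in sorted order and once shuffled, and compares the resulting heights. The key count and the random seed are arbitrary choices made for illustration.

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Iterative insert into an unbalanced BST; returns the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child

def height(root):
    """Number of nodes on the longest root-to-leaf path."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is not None:
            best = max(best, depth)
            stack.append((node.left, depth + 1))
            stack.append((node.right, depth + 1))
    return best

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

keys = list(range(200))
sorted_height = height(build(keys))       # sorted input degenerates into a chain
random.Random(0).shuffle(keys)            # fixed seed, arbitrary choice
shuffled_height = height(build(keys))
print(sorted_height, shuffled_height)
```

With sorted input every key goes to the right of the previous one, so the tree is as tall as the list is long and a lookup costs as much as a linear scan.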
Recap: Inverted indices

Useful for search

Different data structures can be used

Pros / cons
Plan for today


Inverted indices
B+ trees (NEXT)
The B+ tree


A flexible, height-balanced, high-fanout tree
Insert/delete at log_F N cost (F = fanout, N = # leaf pages)
Need to keep tree height-balanced
Minimum 50% occupancy (except for root)
Each node contains d <= m <= 2d entries
Inner nodes contain up to 2d+1 pointers
d is called the order of the tree
Can search efficiently based on equality (or also
range, though we don’t need that here)
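To get a feel for the log_F N cost, the sketch below computes how many index levels are needed to reach N leaf pages with fanout F. The function name and the sample values of F and N are assumptions picked for illustration; real fanout depends on page and key size.

```python
def levels_needed(num_leaf_pages, fanout):
    """Smallest h such that fanout**h >= num_leaf_pages (integer arithmetic)."""
    h, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout   # one more index level multiplies the reach by F
        h += 1
    return h

print(levels_needed(1_000_000, 100))   # 3  (high fanout)
print(levels_needed(1_000_000, 2))     # 20 (binary tree, for comparison)
```

Integer multiplication is used instead of `math.log` to avoid floating-point rounding at exact powers of F.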
[Figure: B+ tree layout. Index entries (direct search) sit above the data entries (the "sequence set"); the leaves form a linked list (compare to B-tree!).]
Example B+ Tree



Data (inverted list pointers) is at the leaves;
intermediate nodes have copies of search keys
Search begins at root, and key comparisons direct it
to a leaf
Search for be↓, bobcat↓ ...
Root: art, best, but, dog
Leaves (left to right): a↓ am↓ an↓ ant↓ | art↓ be↓ | best↓ bit↓ bob↓ | but↓ can↓ cry↓ | dog↓ dry↓ elf↓ fox↓
→ Based on the search for bobcat↓, we know it is not in the tree!
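The two searches can be traced with a small sketch. The tree is transcribed from the example figure (leaf boundaries are inferred from the later insertion example), and it is hard-coded as a root plus a list of leaves, which is a simplification: a real B+ tree would have node objects, several index levels, and pages on disk.

```python
from bisect import bisect_right

# Two-level example tree: separator keys in the root, data entries in leaves.
root_keys = ["art", "best", "but", "dog"]
leaves = [
    ["a", "am", "an", "ant"],       # keys < "art"
    ["art", "be"],                  # "art" <= keys < "best"
    ["best", "bit", "bob"],         # "best" <= keys < "but"
    ["but", "can", "cry"],          # "but" <= keys < "dog"
    ["dog", "dry", "elf", "fox"],   # keys >= "dog"
]

def search(key):
    """Return (leaf index, found?) after comparing key against the root."""
    i = bisect_right(root_keys, key)   # number of separators <= key
    return i, key in leaves[i]

print(search("be"))       # (1, True)
print(search("bobcat"))   # (2, False)
```

Because the root comparisons lead to exactly one leaf, not finding bobcat in that leaf is enough to conclude it is absent from the whole tree.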
Inserting data into a B+ Tree


Find correct leaf L
Put data entry onto L
If L has enough space, we are done!
Else, must split leaf node L (into L and a new node L2)
Redistribute entries evenly, copy up middle key
Insert index entry pointing to L2 into parent of L
This can happen recursively
To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
Splits “grow” tree; root split increases height
Tree growth: gets wider or one level taller at the top
[Figure: the example B+ tree. Root: art, best, but, dog. Leaves (left to right): a↓ am↓ an↓ ant↓ | art↓ be↓ | best↓ bit↓ bob↓ | but↓ can↓ cry↓ | dog↓ dry↓ elf↓ fox↓]
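The insertion steps can be sketched as a short recursive routine. This is a minimal in-memory illustration, not a full implementation: the order d = 2, the `Node` class, and the function names are assumptions made here, and the sketch omits the leaf-level linked list, deletion, and disk pages. It is run on the example tree (root: art, best, but, dog) with the key "and".

```python
from bisect import bisect_right, insort

ORDER = 2  # d (assumed): a node may hold at most 2*d keys

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children   # None marks a leaf

def _insert(node, key):
    """Insert key below node; return (separator, new right node) on a split."""
    if node.children is None:                  # leaf L: put the entry onto L
        insort(node.keys, key)
        if len(node.keys) <= 2 * ORDER:
            return None                        # enough space, done
        mid = len(node.keys) // 2
        right = Node(node.keys[mid:])          # new leaf L2
        node.keys = node.keys[:mid]
        return right.keys[0], right            # COPY up: key stays in L2
    i = bisect_right(node.keys, key)           # child to descend into
    split = _insert(node.children[i], key)
    if split is None:
        return None
    sep, new_child = split
    node.keys.insert(i, sep)                   # index entry pointing to L2
    node.children.insert(i + 1, new_child)
    if len(node.keys) <= 2 * ORDER:
        return None
    mid = len(node.keys) // 2
    up = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up, right                           # PUSH up: key leaves this node

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])          # root split: one level taller

leaves = [Node(ks) for ks in (["a", "am", "an", "ant"], ["art", "be"],
                              ["best", "bit", "bob"], ["but", "can", "cry"],
                              ["dog", "dry", "elf", "fox"])]
root = Node(["art", "best", "but", "dog"], leaves)
root = insert(root, "and")
print(root.keys)                         # ['best']
print([c.keys for c in root.children])   # [['an', 'art'], ['but', 'dog']]
```

The only difference between the two split cases is the returned separator: a leaf split keeps the middle key in the new leaf, while an index split removes it from both halves.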
Inserting “and↓” Example: Copy up
Before the insert:
Root: art, best, but, dog
Leaves (left to right): a↓ am↓ an↓ ant↓ | art↓ be↓ | best↓ bit↓ bob↓ | but↓ can↓ cry↓ | dog↓ dry↓ elf↓ fox↓
Want to insert and↓ into the leaf a↓ am↓ an↓ ant↓; no room, so split & copy up:
The leaf splits into a↓ am↓ | an↓ and↓ ant↓
Entry to be inserted in parent node: an
(Note that key “an” is copied up and continues to appear in the leaf.)
But where? Parent node is already "full"!
Inserting “and↓” Example: Push up 1/2
Need to split node & push up
Root, with the new entry an added: an, art, best, but, dog
Leaves (left to right): a↓ am↓ | an↓ and↓ ant↓ | art↓ be↓ | best↓ bit↓ bob↓ | but↓ can↓ cry↓ | dog↓ dry↓ elf↓ fox↓
Inserting “and↓” Example: Push up 2/2
Entry to be inserted in parent node: best
(Note that best is pushed up and only appears once in the index. Contrast this with a leaf split.)
Root: best
Index nodes (left to right): an, art | but, dog
Leaves (left to right): a↓ am↓ | an↓ and↓ ant↓ | art↓ be↓ | best↓ bit↓ bob↓ | but↓ can↓ cry↓ | dog↓ dry↓ elf↓ fox↓
Summary: Copying vs. splitting

Every keyword (search key) appears in at most one intermediate node
Hence, in splitting an intermediate node, we push up
Every inverted list entry must appear in a leaf
We may also need it in an intermediate node to define a partition point in the tree
We must copy up the key of this entry
Note that B+ trees easily accommodate multiple occurrences of a keyword
Some details


How would you choose the order of the tree?
How would you find all the words starting
with the letters 'com'?

How would you delete something?

Do you always have to split/merge?
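For the prefix question, one possible approach is a range scan: locate the first key that is not smaller than the prefix, then walk forward through the sorted keys until the prefix no longer matches. The sketch below stands in for the leaf level of a B+ tree with a plain sorted Python list; the word list and function name are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical keywords, kept in sorted order like a B+ tree's sequence set.
words = sorted(["cat", "come", "comet", "common", "con", "dog", "com"])

def with_prefix(sorted_keys, prefix):
    """All keys that start with prefix, found by one search plus a scan."""
    i = bisect_left(sorted_keys, prefix)   # first key >= prefix
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches

print(with_prefix(words, "com"))   # ['com', 'come', 'comet', 'common']
```

In a B+ tree the scan step would follow the links between neighboring leaves instead of incrementing a list index.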
Virtues of the B+ Tree

B+ tree and other indices are quite efficient:





Height-balanced; log_F N cost to search
High fanout (F) means depth rarely more than 3 or 4
Almost always better than maintaining a sorted file
Typically, 67% occupancy on average
Berkeley DB library (C, C++, Java; Oracle) is a
toolkit for B+ trees that you will be using
later in the semester:


Interface: open B+ Tree; get and put items based on key
Handles concurrency, caching, etc.
Example: B+ tree
65
9
25 45
70
130
187
80 101 122
138 150 159 180
65 67 68 69 70 72 75 79
1 4 6

9 14 16
25 31 38 41
45 61 63 64
Insert 15, 11, 12, 32, 74
How do we distribute a B+ Tree?


We need to host the root
at one machine and
distribute the rest
What are the implications
for scalability?

Consider building the index
as well as searching
Eliminating the root


Sometimes we don’t want a tree-structured
system because the higher levels can be a
central point of congestion or failure
Two strategies:


Modified tree structure (e.g., the BATON p2p tree; see
Jagadish et al., VLDB 2005)
Non-hierarchical structure (distributed hash table,
discussed in a couple of weeks)
Recap: B+ trees

A very common data structure for indices


Used, e.g., in many file systems and many DBMS
Very efficient




Height-balanced; log_F N cost to search
High fanout (F) means depth rarely more than 3 or 4
Almost always better than maintaining a sorted file
Typically, 67% occupancy on average
