Transcript Slide 1
Keyword Search over XML

Slide 2
Inexact Querying
• Until now, our queries have been complex patterns, represented by trees or graphs
• Such query languages are not appropriate for the naive user:
  – if XML “replaces” HTML as the web standard, users can’t be expected to write graph queries
Allow Keyword Search over XML!

Slide 3
Keyword Search
• A keyword search is a list of search terms
• There can be different ways to define legal search terms. Examples:
  – label:keyword, e.g., author:Smith
  – keyword, e.g., :Smith
  – label, e.g., author:
  – value (without distinguishing between keywords and labels)

Slide 4
Challenges (1)
• Determining which part of the XML document corresponds to an answer
  – When searching HTML, the result units are usually documents
  – When searching XML, a finer granularity should be returned, e.g., a subtree

Slide 5
What should be returned for the query “:ACID, :Kempster”?

Slide 6
Challenges (2)
• Avoiding the return of elements that are not meaningfully related
  – XML documents often contain many unrelated fragments of information. Can these information units be recognized?

Slide 7
What should be returned for the query “:XML, author:”?

Slide 8
What should be returned for the query “:XML, :Kempster”?

Slide 9
Challenges (3)
• Ranking mechanisms
  – How should document fragments/XML elements be ranked?
• Ideas?

Slide 10
In what order should the answers be returned for “:ACID, author:”?

Slide 11
Defining a Search Semantics
• When defining a search over XML, all previous challenges must be considered.
• We must decide:
  – what portions of a document are a search result?
  – should any results be filtered out since they are not meaningful?
  – how should ranking be performed?
• Typically, research focuses on one of these problems and provides simple solutions for the other problems.
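The search-term forms listed under Keyword Search (author:Smith, :Smith, author:) can be read as an optional label and an optional keyword separated by a colon. Below is a minimal Python sketch of a parser for that reading. The names SearchTerm, parse_term and parse_query are hypothetical and do not come from the slides, and plain values without a colon are rejected only to keep the sketch small.

```python
from typing import List, NamedTuple, Optional


class SearchTerm(NamedTuple):
    label: Optional[str]    # element label to match, or None if absent
    keyword: Optional[str]  # text keyword to match, or None if absent


def parse_term(term: str) -> SearchTerm:
    """Parse one term such as 'author:Smith', ':Smith' or 'author:'."""
    if ":" not in term:
        # Assumption of this sketch: plain values are not handled here.
        raise ValueError(f"expected ':' in search term {term!r}")
    label, keyword = term.split(":", 1)
    label, keyword = label.strip(), keyword.strip()
    if not label and not keyword:
        raise ValueError("a search term needs a label, a keyword, or both")
    return SearchTerm(label or None, keyword or None)


def parse_query(query: str) -> List[SearchTerm]:
    """Split a comma-separated query into its search terms."""
    return [parse_term(part) for part in query.split(",") if part.strip()]


print(parse_query(":ACID, :Kempster"))
print(parse_query(":XML, author:"))
```

Keeping label and keyword as separate optional fields lets a later matching step treat “label only” and “keyword only” terms uniformly.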
Slide 12
Topics Discussed
• XRank: Paper presents a variation of PageRank for ranking XML elements
  – focus on ranking
• Interconnection Semantics: Methods to determine whether a set of nodes is meaningfully related
  – focus on filtering out meaningless results

Slide 13
XRank: Ranked Keyword Search over XML Documents
Guo, Shao, Botev, Shanmugasundaram
SIGMOD 2003

Slide 14
Queries and their Semantics
• Queries are keywords k1,…,kn, as in a search engine
• Query results are portions of XML documents that contain all words. Formally:
  – Let v be a node in the document. To determine whether v should be returned: First, “remove” any descendants of v that contain all the keywords k1,…,kn. If v still contains all of k1,…,kn, then v should be a result of the search.
  – Intuition: Only return v if no more specific element can be returned.
Note: Containment is via child edges, not IDREF edges

Slide 15
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the language … </abstract>
      <section name="Implementing XQL Operations">
        <subsection name="Path Expressions"> At first sight, the XQL language looks … </subsection>
      </section>
      …
      <cite ref="2"> Querying XML in Xyleme </cite>
      <cite xmlns:xlink="http://www8.org/paper/xmlql"> … </cite>
    </paper>
    <paper id="2"> …
</workshop>
What should be returned for the query XQL language?
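The result definition under Queries and their Semantics can be tried out on a small document. The following Python sketch is one possible reading of it, not the XRank implementation: it assumes that an element “contains” a keyword if the keyword occurs in the text of its subtree, it ignores attribute values, and the function name most_specific_results and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def tokens(text):
    """Lower-cased words of a text fragment (None counts as empty)."""
    return set(re.findall(r"\w+", (text or "").lower()))


def most_specific_results(root, keywords):
    """Elements that still contain all keywords after every descendant
    that contains all keywords has been 'removed'."""
    wanted = {k.lower() for k in keywords}
    results = []

    def visit(node):
        # found: keywords left in node's subtree after the removal step.
        # has_all: node's full subtree contains all keywords.
        found = wanted & tokens(node.text)
        has_all = False
        for child in node:
            child_found, child_has_all = visit(child)
            found |= wanted & tokens(child.tail)  # tail text belongs to node
            if child_has_all:
                has_all = True        # child is removed, nothing propagates
            else:
                found |= child_found
        if found == wanted:
            results.append(node)
            has_all = True
        return found, has_all

    visit(root)
    return results


doc = ET.fromstring("""
<paper>
  <title>XQL and Proximal Nodes</title>
  <abstract>We consider the language</abstract>
  <section>
    <subsection>At first sight, the XQL language looks</subsection>
  </section>
</paper>""")

hits = most_specific_results(doc, ["XQL", "language"])
print([h.tag for h in hits])  # ['subsection', 'paper']
```

In this sample the subsection is returned because it contains both keywords, the section is not (nothing is left once the subsection is removed), and the paper is returned because its title and abstract still supply both keywords.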
Slide 16
(The workshop document and the question from the preceding slide are shown again.)

Slide 17
Ranking Results: Intuition
• Granularity of ranking
  – In HTML, there is a rank for each document
  – In XML, we want a rank for each element. Different elements in the same document may have different ranks
• Propose to extend ideas used for ranking HTML:
  – PageRank: Documents with more incoming links are more important (recursive definition)
  – Proximity: If the document contains the search terms close together, then the document is more important
• Overall Rank: combination of PageRank and proximity

Slide 18
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the language … </abstract>
      <section name="Implementing XML Operations">
        <subsection name="Path Expressions"> At first sight, the XQL language looks … </subsection>
      </section>
      …
      <cite ref="2"> Querying XML in Xyleme </cite>
      <cite xmlns:xlink="http://www8.org/paper/xmlql"> … </cite>
    </paper>
    <paper id="2"> …
</workshop>
Should both papers be ranked the same?
Slide 19
Topics
• We discuss:
  – Ranking
  – The Index Structure
  – Query Processing

Slide 20
Ranking Results
• Take into consideration
  – hyperlinks
  – proximity
• Here we discuss only ranking by the link structure. Ranking by proximity can easily be defined (ideas?)
• What kinds of “links” are there in a graph of XML documents?

Slide 21
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the language … </abstract>
      <section name="Implementing XML Operations">
        <subsection name="Path Expressions"> At first sight, the XQL language looks … </subsection>
      </section>
      …
      <cite ref="2"> Querying XML in Xyleme </cite>
      <cite xmlns:xlink="http://www8.org/paper/xmlql"> … </cite>
    </paper>
    <paper id="2"> …
</workshop>
Child/Parent “links”

Slide 22
(Same workshop document as on the preceding slide.)
IDREF “links”

Slide 23
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the language
      … </abstract>
      <section name="Implementing XML Operations">
        <subsection name="Path Expressions"> At first sight, the XQL language looks … </subsection>
      </section>
      …
      <cite ref="2"> Querying XML in Xyleme </cite>
      <cite xmlns:xlink="http://www8.org/paper/xmlql"> … </cite>
    </paper>
    <paper id="2"> …
</workshop>
XLink “links” (out of the document)

Slide 24
Remember: PageRank
[Figure: hyperlink edges around a node v, each labeled d/3; the formula is annotated with “number of documents” and “number of outgoing links”]
• d: Probability of following a hyperlink
• 1-d: Probability of a random jump

Slide 25
A Graph of XML documents
• Nodes: N
  – each element in a document is a node
• Edges: E = CE ∪ CE⁻¹ ∪ HE
  – CE are “containment links”, i.e., there is an edge (u,v) in CE if u is a parent of v in the XML document
  – HE are “hyperlinks”, i.e., there is an edge (u,v) in HE if there is an IDREF link or XLink link from u to v
• Want to define ElemRank, the parallel to PageRank, but for XML elements

Slide 26
Attempt 1 at ElemRank
[Figure: node v with hyperlink edges and containment edges]
There are now 4 ways to get to an element. Consider all in the formula.

Slide 27
Attempt 1 at ElemRank: Problem
[Figure: node v with hyperlink edges and containment edges]
Consider a paper with few sections and many references. The more references there are, the less important each section is. Why?
Slide 28
Attempt 2 at ElemRank
[Figure: node v with hyperlink edges and containment edges]
Consider Hyperlinks and Structural links separately

Slide 29
Attempt 2 at ElemRank: Problem
[Figure: node v with hyperlink edges and containment edges]
In fact, it is better to treat parent-child links differently from child-parent links

Slide 30
Actual ElemRank
[Figure: node v with hyperlink edges and containment edges]
Consider Hyperlinks, Parent links and Child links separately

Slide 31
Interpretation in terms of Random Walks
• The element rank of e is the probability that e will be reached if we start at a random element and at each point we choose one of the following options:
  – with probability 1-d1-d2-d3 jump to a random element in a random page
  – with probability d1 follow a random hyperlink from the current element
  – with probability d2 follow a random edge to a child element
  – with probability d3 follow the parent edge

Slide 32
ElemRank Example
[Figure: example graph on nodes 1, 2, 3, 4 with hyperlink edges and containment edges]
$$e(v) = \frac{1 - d_1 - d_2 - d_3}{N_e} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} e(u)$$
Here N_e is the number of elements, N_h(u) the number of hyperlinks leaving u, and N_c(u) the number of children of u.
• Suppose that d1 = d2 = d3 = 0.3
• In what order will the nodes be ranked?
• What will be the formula for each node?

Slide 33
Think About it
• Very nice definition of ElemRank
• Does it make sense? Would ElemRank give good results in the following scenarios:
  – IDREFs connect articles with articles that they cite
  – IDREFs connect managers with their departments
  – IDREFs connect cleaning staff with the departments in which they work
  – IDREFs connect countries with bordering countries (as in the CIA factbook)

Slide 34
Topics
• We discuss:
  – Ranking
  – The Index Structure
  – Query Processing

Slide 35
Indexing
• We now discuss the index structure
• Recall that we will be ranking according to ElemRank
• Recall that we want to return “most specific elements”
• How should the data be stored in an index?
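Since the index is meant to store ElemRank values, it may help to see how the definition from the ranking part could be evaluated. The Python sketch below iterates the formula until it settles. It is an illustration under assumptions, not the paper's procedure: the random-jump term is divided by the total number of elements, the function name elem_rank and the four-element example graph are hypothetical, and the figure's actual edges are not reproduced.

```python
def elem_rank(n, parent, hyperlinks, d1=0.3, d2=0.3, d3=0.3, iters=200):
    """Iterate the ElemRank formula for elements 0..n-1.

    parent[v] is the parent of v (None for a root);
    hyperlinks is a list of (u, v) pairs meaning u links to v.
    """
    children = [[] for _ in range(n)]
    for v, p in enumerate(parent):
        if p is not None:
            children[p].append(v)
    out_links = [[] for _ in range(n)]
    for u, v in hyperlinks:
        out_links[u].append(v)

    e = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - d1 - d2 - d3) / n] * n       # random jump
        for u in range(n):
            for v in out_links[u]:                 # hyperlink edges (HE)
                nxt[v] += d1 * e[u] / len(out_links[u])
            for v in children[u]:                  # parent-to-child edges (CE)
                nxt[v] += d2 * e[u] / len(children[u])
            if parent[u] is not None:              # child-to-parent edge (CE^-1)
                nxt[parent[u]] += d3 * e[u]
        e = nxt
    return e


# Hypothetical graph: 0 is the root with children 1 and 2,
# 3 is a child of 1, and 3 has a hyperlink to 2.
parent = [None, 0, 0, 1]
links = [(3, 2)]
ranks = elem_rank(4, parent, links)
print(sorted(range(4), key=lambda v: -ranks[v]))
```

Elements without hyperlinks, children or a parent drop the corresponding share of probability in this sketch, so the values need not sum to 1; only their relative order is used for ranking.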
Slide 36
Naive Method
[Figure: the workshop document drawn as a tree in which every element has a numeric ID, e.g., <workshop> 0, <proceedings> 4, the first <paper> 5, its <title> 7 and its <author> 8]
Treat elements as documents: Normal inverted lists
  Ricardo: 0; 4; 5; 8
  XQL: 0; 4; 5; 7; …
Problem: Space Overhead
How much space is needed in storage?

Slide 37
Naive Method
[Figure: the same numbered tree and inverted lists]
Treat elements as documents: Normal inverted lists
Problem: Spurious Results
We can’t simply return the intersection of the lists: if a node satisfies a query, so do all its ancestors

Slide 38
Dewey Encoding of ID
• Use path information to identify elements – DeweyID
• An ancestor’s ID is a prefix of its descendant’s ID
• Actually (not shown) all the node ids are prefixed by the document number
<workshop> 0
  <date> 0.0: 28 July …
  <title> 0.1: XML and …
  <editors> 0.2: David Carmel …
  <proceedings> 0.3
    <paper> 0.3.0
      <title> 0.3.0.0: …
      <author> 0.3.0.1: …
    <paper> 0.3.1: …

Slide 39
Dewey Inverted List (DIL)
• Store, for each keyword, a list containing:
  – the id of the node containing the keyword
  – the rank of the node containing the keyword
  – the positions of the keyword in the node
• Rank and positions are needed to compute ranking
• To simplify, in the following slides, we only store lists of node ids

Slide 40
Topics
• We discuss:
  – Ranking
  – The Index Structure
  – Query Processing

Slide 41
Query Processing
• Challenges:
  – How do we find nodes that contain all keywords?
  – How do we find only the most specific nodes that contain all keywords?
  – Can this be done in a single scan of the inverted keyword lists?
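Before turning to these questions, the index itself can be made concrete. The Python sketch below assigns Dewey IDs and builds the simplified Dewey inverted lists (node ids only, without ranks and positions). It is a sketch under assumptions: the function names build_dil and fmt and the sample document are hypothetical, attribute values are ignored, and the root element receives the ID “document number.0”.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(root, doc_id):
    """Map each keyword to the sorted Dewey IDs (tuples of ints) of the
    elements that directly contain it."""
    index = defaultdict(set)

    def visit(node, dewey):
        # Text directly inside node: its own text plus the tails of its children.
        text = node.text or ""
        for child in node:
            text += " " + (child.tail or "")
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(dewey)
        for i, child in enumerate(node):
            visit(child, dewey + (i,))   # a child's ID extends its parent's ID

    visit(root, (doc_id, 0))
    return {word: sorted(ids) for word, ids in index.items()}


def fmt(dewey):
    """Render a Dewey ID tuple in dotted form, e.g. (47, 0, 1) -> '47.0.1'."""
    return ".".join(str(part) for part in dewey)


doc = ET.fromstring(
    "<proceedings>"
    "<paper><title>XQL and Proximal Nodes</title>"
    "<abstract>We consider the language</abstract>"
    "<section><subsection>the XQL language looks</subsection></section>"
    "</paper>"
    "<paper>Querying with XQL</paper>"
    "</proceedings>"
)
dil = build_dil(doc, 47)
print([fmt(d) for d in dil["xql"]])       # ['47.0.0.0', '47.0.0.2.0', '47.0.1']
print([fmt(d) for d in dil["language"]])  # ['47.0.0.1', '47.0.0.2.0']
```

IDs are kept as integer tuples rather than dotted strings so that sorting yields document order, with every ancestor ahead of its descendants; sorting the strings would place 47.0.10 before 47.0.2.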
Slide 42
Example: Document
[Figure: the 47th document in the corpus, a tree with root proceedings, two paper elements, and title, abstract, section and subsection elements under the first paper; the text nodes read “… XQL …”, “… language …” and “… XQL language …”]

Slide 43
Example: Document with IDs
47th Document in Corpus
proceedings 47.0
  paper 47.0.0
    title 47.0.0.0
    abstract 47.0.0.1
    section 47.0.0.2
      subsection 47.0.0.2.0
  paper 47.0.1

Slide 44
Example: Inverted Lists
[Figure: the document tree with IDs from the preceding slide]
Lists contain ids for nodes that directly contain the keyword. Lists are sorted.
Entries shown for the XQL and language lists: 47.0.0.0, 47.0.0.1, 47.0.0.2, 47.0.0.2.0, 47.0.1, 47.0.0.2.0

Slide 45
Example: Inverted Lists
[Figure: the same tree and inverted lists]
We want to find nodes that should be returned. Which? How will they be ranked?

Slide 46
Algorithm: Data Structures
[Figure: the inverted lists for XQL and language; the Dewey stack, whose entries hold a DeweyID component, Contains[1], Contains[2] and ContainsAll; and the result heap]

Slide 47
Algorithm: Pseudo Code
• Find smallest next entry in inverted lists
• Find longest common prefix of entry and Dewey stack
• Pop all non-matching values from Dewey stack. When popping:
  – propagate down containment information, if ContainsAll is false
  – if ContainsAll turns from false to true, add result to output
• Add non-matching values from entry into Dewey stack. Mark containment for entry’s keyword

Slide 48
Example: Algorithm
[Figure: initial state of the inverted lists, Dewey stack and result heap]

Slide 49
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap; the current XQL entry is 47.0.0.0]
Smallest entry is for keyword 1, XQL.
lcp with Dewey stack = none. Pop (nothing). Add (all).

Slide 50
Example: Algorithm
[Figure: state after pushing the components of 47.0.0.0 onto the Dewey stack]

Slide 51
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap]
Next smallest entry is for keyword 2, language. lcp with Dewey stack = 47.0.0. Pop non-matching entries.

Slide 52
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap]
Next smallest entry is for keyword 2, language. lcp with Dewey stack = 47.0.0. Add additional entries.

Slide 53
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap]
Next smallest entry is for keyword 2, language. lcp with Dewey stack = 47.0.0.

Slide 54
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap]
Next smallest entry is for keyword 1, XQL. lcp with Dewey stack = 47.0.0. Pop non-matching entries.

Slide 55
Example: Algorithm
[Figure: inverted lists, Dewey stack and result heap]
Next smallest entry is for keyword 1, XQL. lcp with Dewey stack = 47.0.0.
Continue on Blackboard!
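The stack-based pass from the pseudo code can also be followed to the end in code. The Python sketch below is one reading of those steps, not the XRank implementation: the function name single_scan and the two input lists are hypothetical, ranks and the result heap are left out, and results are simply collected in the order in which they are popped.

```python
import heapq


def single_scan(lists):
    """lists[i] is the sorted list of Dewey IDs (tuples of ints) for keyword i.
    Returns the most specific elements that contain all keywords."""
    k = len(lists)
    # Merge the lists in Dewey order, tagging each entry with its keyword index.
    merged = heapq.merge(*[[(d, i) for d in lst] for i, lst in enumerate(lists)])

    stack = []    # entries: [id component, contains flags, contains_all flag]
    results = []

    def pop():
        comp, contains, contains_all = stack.pop()
        dewey = tuple(entry[0] for entry in stack) + (comp,)
        if all(contains):
            results.append(dewey)       # contains turned complete: a result
            contains_all = True
        elif not contains_all and stack:
            # No result below this node: pass its keywords to the parent.
            parent = stack[-1]
            parent[1] = [a or b for a, b in zip(parent[1], contains)]
        if contains_all and stack:
            stack[-1][2] = True         # parent has a result in its subtree

    for dewey, kw in merged:
        # Longest common prefix of the entry and the Dewey stack.
        lcp = 0
        while lcp < len(stack) and lcp < len(dewey) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:         # pop non-matching values
            pop()
        for comp in dewey[lcp:]:        # add non-matching values from the entry
            stack.append([comp, [False] * k, False])
        stack[-1][1][kw] = True         # mark containment for the entry's keyword
    while stack:
        pop()
    return results


# Hypothetical lists for two keywords in document 47.
xql = [(47, 0, 0, 0), (47, 0, 0, 2, 0), (47, 0, 1)]
language = [(47, 0, 0, 1), (47, 0, 0, 2, 0)]
print(single_scan([xql, language]))  # [(47, 0, 0, 2, 0), (47, 0, 0)]
```

Because every list is sorted and ancestors precede descendants, each entry is pushed and popped once, so the lists are read in a single scan. In this example 47.0.0.2.0 is a result on its own, and 47.0.0 is a result because two of its other children supply the keywords separately.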