Potential Query Log Sets

Alexander Yeh MITRE Corp.

October 2008

© 2008 The MITRE Corporation. All rights reserved. Approved for Public Release; Distribution Unlimited. Case #08-1697.

Possible Issues with a "Query Log" Corpus

- Resembles queries of real interest to somebody
- Has some 'geo' aspect
- Multi-lingual
  - MITRE in-house has limitations on languages
- Permission to use and distribute (even after the evaluation)

More Recent Suggestions (While at Workshop)

- Local search queries from various Wikipedias
  - Multi-lingual
  - Privacy? Probably not as bad as other search logs (more like encyclopedia lookup)
  - Permission?
  - Long enough to be interesting from a "geo" standpoint?

More Recent Suggestions (Continued)

- Treat GikiP topics as queries
  - E.g., GP4: "Which Swiss cantons border Germany?"
  - Multi-lingual, have permission, no privacy problem
  - Combine with GikiP 2009 for publicity purposes
  - But few in number (15 in 2008 pilot)
  - Realistic enough?
- Use logs generated by an evaluation (like iCLEF)
  - Multi-lingual, permissions & privacy dealt with
  - But realistic enough?
  - Has "geo" aspect?

More Recent Suggestions (Concluded)

- Timway search logs from Hong Kong
  - Chinese, English; usually 1 language in a query
  - Used in some studies, but usual permission & privacy issues
  - Also, finding annotator(s) may be an issue:
    - Chinese probably in Cantonese (versus "official" Mandarin dialect); not too bad in written form
    - Probably traditional characters (not mainland China's simplified characters)

Potential Query Log Data Sets - 1

- Tumba! (Diana Santos, Nuno Cardoso and others)
  - Available, large amount, a lot not released before
  - In Portuguese: need to hire and train somebody who can annotate Portuguese

Potential Query Log Data Sets - 2

- Workshop on Web Search Click Data 2009 (WSCD 2009): http://research.microsoft.com/users/nickcr/wscd09/
  - MSN search query log
  - Large amount, relatively new (and so not seen as much)
  - Pursuing getting permission (asking Nick Craswell)
    - Cancelled query parsing task in CLEF 2008
  - Current status: cannot release data outside of Microsoft

Potential Query Log Data Sets - 3

- Query parsing task in CLEF 2007
  - Query log of 800K English queries (unlabeled), 100 queries of labeled training data and 500 queries of test data
  - Presumably this log is still available for use in a new query parsing task
  - Use same set, but generate new training and test
  - One disadvantage: the CLEF community is already familiar with this data set

Can Easily Obtain the Following Query Log Data Sets, But …

- Can easily obtain a number of data-sets, but
  - They are old, and so may have been already seen by the CLEF community
  - Problems getting permissions to use these
    - Anticipate problems, or
    - Been asked not to use

Query Log Data Sets that are Easy to Obtain

- KDD Cup 2005: Ying Li, a co-chair, asked us not to use
- AlltheWeb_2001.gz, AlltheWeb_2002.gz, AltaVista_2002.zip: Jim Jansen: the data sharing agreement has expired
- Excite_1997_small.zip, Excite_1997_large.zip, Excite_1999.zip, Excite_2001.gz: from Jim Jansen. Need Excite's permission?

Query Log Data Sets that are Easy to Obtain (Concluded)

- AOL query log: from http://gregsadetsky.com/aol-data/
  - Was made available to the public for a while
  - Created a controversy about privacy
    - But all these data sets will have similar privacy issues

A Way to 'Use' these Data-Sets (John Burger):

- Use the existing logs as 'inspiration' for a made-up log corpus
  - May have been done by others, like NIST
  - Will not need permission
  - Will not have been seen before
  - Can ensure no privacy disclosures
  - But will take time to produce the made-up data

Privacy Concerns

- Though most well known with the AOL query logs, all these data sets may contain private data
- One way to 'remove': use the existing logs as 'inspiration' for a made-up log corpus (mentioned above)
- A fast, incomplete way to remove private data: remove the query timestamps and links indicating which queries came from the same site and randomize the order of the queries
  - A lot of the 'disclosures' come from grouping the queries to a common source
  - But the removed information is now not available to a query parser
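The fast, incomplete approach could be sketched roughly as follows. This is only an illustration under assumptions: the record layout and the field names (user_id, timestamp, query) are hypothetical and not taken from any of the logs listed in these slides, and the sample queries are invented.

```python
import random
from typing import Iterable, List, Optional


def scrub_log(records: Iterable[dict], seed: Optional[int] = None) -> List[str]:
    """Keep only the query text, then shuffle the queries.

    Dropping every other field removes the timestamps and the source
    identifiers that link queries together; shuffling removes the
    ordering that could otherwise hint at which queries belong together.
    """
    # "query" is an assumed field name for the raw query string.
    queries = [record["query"] for record in records]
    rng = random.Random(seed)  # seed is only for reproducible demos
    rng.shuffle(queries)
    return queries


# Invented sample records (not real log data).
sample = [
    {"user_id": "u1", "timestamp": "2008-01-01 10:00:00", "query": "hotels near lake geneva"},
    {"user_id": "u1", "timestamp": "2008-01-01 10:02:00", "query": "train zurich to basel"},
    {"user_id": "u2", "timestamp": "2008-01-01 11:30:00", "query": "weather hong kong"},
]

scrubbed = scrub_log(sample, seed=0)
print(scrubbed)
```

Note that the query text itself is left untouched, so a query that names a person or an address would still disclose it. That is why this approach is incomplete.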

Privacy Concerns (Concluded)

- A slower, more complete way to remove private data: review the data (perhaps as it is annotated) and flag any queries with private data
  - Either substitute the flagged data with fictional information or remove the queries with flags from the data sets
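A minimal sketch of the flag-then-substitute-or-remove step, assuming a human reviewer has already marked each query. The ReviewedQuery type, its fields, and the sample queries are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ReviewedQuery:
    text: str                          # original query string
    flagged: bool = False              # set by the reviewer if private data is present
    replacement: Optional[str] = None  # fictional substitute, if the reviewer wrote one


def release_set(reviewed: Iterable[ReviewedQuery]) -> List[str]:
    """Build the releasable query list from reviewed queries."""
    released = []
    for q in reviewed:
        if not q.flagged:
            released.append(q.text)         # nothing private: keep as is
        elif q.replacement is not None:
            released.append(q.replacement)  # substitute fictional information
        # flagged with no substitute: the query is removed
    return released


# Invented sample (not real log data).
reviewed = [
    ReviewedQuery("swiss cantons bordering germany"),
    ReviewedQuery("jane roe 12 elm street", flagged=True,
                  replacement="john doe 1 main street"),
    ReviewedQuery("my account number 12345", flagged=True),
]

print(release_set(reviewed))
```

Substitution keeps the query's structure available to a query parser, while removal is simpler but shrinks the data set.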