Transcript Infoseek Ultra - Steve Kirsch Home Page (short version)
Searching the Internet
Steve Kirsch, Chairman, Infoseek
Scoring Framework
• Classification problem
  – Separate relevant from non-relevant documents
  – Bayes’ decision rule: Relevant if P(x(d)|R)P(R) > P(x(d)|~R)P(~R), where x(d) is the observed representation of d
  – Independence assumption leads to S(d) = Σ_t log [ p(t)(1−q(t)) / ((1−p(t))q(t)) ], summed over query terms t present in d, where p(t) = P(t|R) and q(t) = P(t|~R)
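The independence-based score above can be sketched in a few lines of Python. The function name `rsj_score` and the term statistics are illustrative assumptions, not Infoseek code:

```python
import math

def rsj_score(doc_terms, p, q):
    """Sum log-odds weights log[p(t)(1-q(t)) / ((1-p(t))q(t))] over the
    query terms present in the document (presence-only simplification)."""
    return sum(
        math.log((p[t] * (1 - q[t])) / ((1 - p[t]) * q[t]))
        for t in doc_terms
        if t in p and t in q
    )

# Made-up statistics: p(t) = P(t|R), q(t) = P(t|~R)
p = {"barney": 0.8, "dinosaur": 0.6}
q = {"barney": 0.1, "dinosaur": 0.2}
print(rsj_score({"barney", "dinosaur"}, p, q))  # log(36) + log(6) ≈ 5.38
```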
Agenda • How Infoseek got started • How Infoseek works • How do I … ? • What people ask about; Why • Relevance ranking of web pages • Searching an infinite number of web pages – How people search – Video of IR experts – Top 7 habits of experts – Distributed search fusion algorithm • Why write a search engine in Java?
How Infoseek got started (1994) • DIALOG was too hard and too expensive …so...
Internet + Natural Language Query Engine + Low price “If you build it, they will come”
What things look like today
Popular questions • Do we sell placement?
• How do I get to the top of the results?
• How do I find good Barney pages?
Placement • Based on popularity and statistics
Why people can’t find you
• Infinite number of web pages
• >150M static pages
• Chances of being found on the net are a lot less than being found in the phone book because:
  – there are a million times more web pages
  – these pages are not organized
How do I get to the top of the search result listings?
Pray
How to find Barney pages suitable for your kids
+Barney +dinosaur -bash -kill -maim -destroy -hate
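As a toy illustration of the +required / -excluded query syntax (a simplified sketch; `matches` is a hypothetical helper, not Infoseek’s query parser):

```python
def matches(doc_text, query):
    """Enforce +required and -excluded terms; other terms are ignored here."""
    words = set(doc_text.lower().split())
    for term in query.split():
        if term.startswith("+") and term[1:].lower() not in words:
            return False
        if term.startswith("-") and term[1:].lower() in words:
            return False
    return True

query = "+Barney +dinosaur -bash -kill -maim -destroy -hate"
print(matches("Barney the friendly dinosaur sings songs", query))  # True
print(matches("Barney bash destroy the purple dinosaur", query))   # False
```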
What people ask about (and why)
Unofficial SIGIR survey question
How many people here search the web for “adult sites”?
Top 15 queries on the WWW*
• sex
• Playboy
• Penthouse
• chat
• Hustler
• nude
• porn
• erotica
• games
• pornography
• porno
• adult
• ESPN
• pussy
• Sony NightShot HandyCam
• Pamela Anderson
* I am not making this up! This list is real!
What does that mean?
• “Uhh… I was just testing!”
Unofficial SIGIR trivia question • Q: What famous IR researcher asked “Is this because of the CDA?” • A: Bruce Croft in 1995
Why it happens (possible explanations)
• Research on the CDA
• Curious what others are looking at
• “I read Playboy for the articles”
• Many new technologies are driven by sex:
  – VCR
  – Hotel movies on demand
• People are naturally horny
What it means • Human race in no danger of extinction • Corporate libraries doing a great job in technical areas • Traditional sex education inadequate • Some of you are not telling the truth • Audience surveys are not always accurate • Bill Gates should admit to Congress that Pamela Anderson is more important than he is • If you didn’t raise your hand, you may need professional help!
Finding “relevant” sites • Try these engines: – Sinfoseek – Infoseak – Nymfoseek – Infopeek – ...
Relevance ranking Web sites
Facts about Queries
• Most queries are short
  – Average length approx. 2.2 words
  – 10% use query syntax (usually incorrectly)
  – 1% use advanced search
  – Noun phrases only
• Precision more important than recall
  – Users expect precision in top results
Relevance ranking objectives Must use several techniques to determine “relevance”: • Page has query term(s) • Popular usage of the term, e.g., penthouse, java, adult, “evil empire”, ...
• Page quality • Page/site popularity • Spam reduction/elimination • Porn reduction
Relevance ranking factors • Query terms : tf*idf • Usage : Hyperlink text, thesaurus • Quality : site quality, dates, depth, … • Popularity : External link count, proxy stats • Spam : word/phrase unusual statistics (tf limiting) • Porn : site exclusion list, naughty phrase list
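The query-term factor, tf*idf, in its classic damped form (a common textbook variant; the exact weighting Infoseek used is not given in the talk):

```python
import math

def tfidf(tf, df, num_docs):
    """Damped term frequency times inverse document frequency."""
    return (1 + math.log(tf)) * math.log(num_docs / df)

# Hypothetical counts: term occurs 3x in the page, in 50K of 100M pages
print(tfidf(3, 50_000, 100_000_000))
```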
Relative weighting of these factors is tricky and subjective
Should “evil empire” return Microsoft as the top hit?
Living in a world of an infinite number of documents
The problem (user view)
• Too hard to find things even though only 100M documents indexed
• The problem is often precision and relevance, NOT recall
  – a title search for “intel” gives over 200 hits just like this: Index of /CPAN local/authors/id/GSAR/x86/intel/ix86/intel/ix86/intel/intel/ix86/intel/ix86/
• Query ambiguity, e.g., “baby Bells”
The problem (vendor view) • Speed • Size • Cost • Freshness • Load on the Internet/bandwidth (both sides) • Quality (Spam/porn) • Will people be able to find what they are looking for as the net grows?
Today’s approach sucks
• Suck all the world’s content into one centralized search engine (Infoseek)
Is there a better way?
• We might start by asking the question: “How do people find information today?”
Today’s retrieval techniques
Let’s observe two “information retrieval professionals” at work...
Top 7 information retrieval techniques used by the “pros”
My favorite IR story • On crutches • Right leg in a brace • Nurse uses this highly sophisticated IR technique to determine where the problem was
Centralized searching techniques are rarely used in real life...
• Ask God (and pray for an answer) • Ask DIALOG …and pray...
• WWW search (new!)
What people DO use is decentralized searching
• A question goes out to Source 1, Source 2, ..., Source N
• Back come answers and more sources
How well does centralized searching work? • I need a volunteer...
How well does human distributed searching work?
• Let’s find out… • Name two films directed by James Cameron
Human distributed searching attributes • Faster than a computer!!!
• Complete
• Accurate
• Can be used to validate an answer
• Will always find an answer (eventually)
• No specialized hardware
  – All humans have the same CPU speed/RAM
So can’t we design a computer distributed search network that is as fast, accurate, and complete as our human distributed search network?
Our goal • Don’t necessarily mimic the process, but adapt the process to the medium
One approach • User types query • System queries a “meta” database of collection descriptors to determine best collections for that query • System routes query to the best N of those collections in parallel • System merges results and presents to user • For deeper search, use bigger N
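The steps above can be sketched as follows; the names and data shapes are illustrative assumptions, not Infoseek’s API, and the parallel fan-out is shown sequentially:

```python
def route_and_merge(query_terms, meta_index, collections, n=2):
    """Pick the best n collections for the query via the meta index,
    send the query to each, and merge the results by score.

    meta_index: {collection: {term: count of docs containing term}}
    collections: {collection: search function returning [(doc, score)]}
    """
    ranked = sorted(
        meta_index,
        key=lambda c: sum(meta_index[c].get(t, 0) for t in query_terms),
        reverse=True,
    )
    merged = []
    for c in ranked[:n]:            # for deeper search, use bigger n
        merged.extend(collections[c](query_terms))
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

meta_index = {"A": {"java": 100}, "B": {"java": 5}, "C": {"gardening": 80}}
collections = {
    "A": lambda terms: [("a1", 0.9), ("a2", 0.5)],
    "B": lambda terms: [("b1", 0.7)],
    "C": lambda terms: [],
}
print(route_and_merge(["java"], meta_index, collections, n=2))
# → [('a1', 0.9), ('b1', 0.7), ('a2', 0.5)]
```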
Distributed search query • Subject area (optional): – Infoseek • Query: – Steve Kirsch • Collection selection: Use BOTH • Send only the Query to each collection
Distributed searching demo
[Diagram: the user’s Java applet sends the query over the Internet to the meta index, which routes it to collections (DB #34, DB #54, DB #564, ...); the applet merges the results and presents them]
Distributed vs. meta searching
• [Table comparing meta search vs. distributed search on auto collection selection and fusion]
• Distributed = “as if one collection”
The meta index • The meta index contains characterizations of each collection – words and phrases • The meta index can be updated incrementally, in real-time, whenever a document is added/removed from a collection
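A minimal sketch of such incremental maintenance, assuming the characterization is just per-collection term document-counts (the real meta index also holds phrases; the class name is illustrative):

```python
from collections import Counter

class MetaIndex:
    """Per-collection counts of how many documents contain each term."""

    def __init__(self):
        self.counts = {}  # collection -> Counter({term: doc count})

    def add_document(self, collection, terms):
        # set() so each document counts a term at most once
        self.counts.setdefault(collection, Counter()).update(set(terms))

    def remove_document(self, collection, terms):
        self.counts[collection].subtract(set(terms))

mi = MetaIndex()
mi.add_document("siteA", ["gardening", "roses", "gardening"])
mi.add_document("siteA", ["roses", "soil"])
mi.remove_document("siteA", ["roses", "soil"])
print(mi.counts["siteA"]["gardening"], mi.counts["siteA"]["roses"])  # 1 1
```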
Results fusion issue • Search results scores from each collection are not comparable – even if each site is using the same search engine
Traditional fusion technique • Pass 1: Gather statistics from all engines • Pass 2: Ask each engine to compute scores
Requirement
All search engines must use the same scoring algorithm
Infoseek patented fusion technique • Each engine independently computes best N docs • Each engine returns term count statistics for each term in these N docs • Final scoring computed at the Java applet
Data returned from each engine

Old way (score per document):
  Doc 89: 75%
  Doc 12: 63%

New way (doc cnt = 1233, plus per-term counts per document):
          Term1  Term2  Term3
  Doc 89    3      2     123
  Doc 12    4      0     163
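The single-pass idea can be sketched as follows: each engine independently returns its doc count, per-term document frequencies, and per-term counts for its best documents, and the client computes comparable final scores from the pooled statistics. The scoring formula here is plain tf*idf for illustration; the actual patented formula is not given in the talk:

```python
import math

def fuse(engine_results, query_terms):
    """Client-side fusion: pool doc counts and document frequencies across
    engines, then score every returned hit with the same global statistics.

    engine_results: [{"doc_count": int,
                      "df": {term: docs containing term},
                      "hits": {doc_id: {term: tf}}}]
    """
    total_docs = sum(e["doc_count"] for e in engine_results)
    global_df = {t: sum(e["df"].get(t, 0) for e in engine_results)
                 for t in query_terms}
    scored = {}
    for e in engine_results:
        for doc, tfs in e["hits"].items():
            scored[doc] = sum(
                tfs.get(t, 0) * math.log(total_docs / global_df[t])
                for t in query_terms if global_df[t] > 0
            )
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

engines = [
    {"doc_count": 1000, "df": {"kirsch": 10}, "hits": {"doc89": {"kirsch": 3}}},
    {"doc_count": 9000, "df": {"kirsch": 90}, "hits": {"doc12": {"kirsch": 4}}},
]
print(fuse(engines, ["kirsch"]))  # doc12 first: same global idf for both engines
```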
Advantages of this technique • Single-pass technique (no statistic gathering and re-distribution) • Search engines are not “given” a formula to use, nor IDF statistics • Does not require use of the same search engine on all collections
New technique advantages
• The same document in two different indexes WILL have the same final score
• “Value-added” of each search engine is preserved, since the search engine operates as it normally would
Issues
• Count stats may be “by area” (e.g., title)
• “Evil empire” can be handled by returning term counts for synonyms of the term
• Stemming/tokenization differences between engines can impact the term counts returned
• Spam, porn, popularity, and usage are already handled by each individual engine, so these stats need not necessarily be returned
Benefits
• Scalability
• Size
• Speed
• Accuracy
• Reliability
• Freshness
• Cost
• Porn/spam control
• Greater consumer choice… an open systems approach to searching
• Preserves value added of each search engine
• No spidering overhead
Scalability • Use the world’s largest computer (OPC) • Meta index is ideally centralized – very compact – can be scaled at low cost • Get faster as it gets bigger – in terms of throughput capacity – both meta and document searches
Size • There is no limit to the amount of information that can be indexed • Disk and CPU cost is incurred by each web site
Speed
• Sub-second response times are typical (web and meta)
• Speed can be set to the fastest to respond
• Can merge with centralized results
• Background processing by the client of the next results page means no wait time for the user
• Bottom line: Faster speed than any centralized search engine can achieve
Accuracy
• Using site/page ranking works well
• Not always better than centralized; may sometimes be a lot worse; often the same
• Look at % of documents containing the terms, not total number of documents
• Examples:
  – “Intel”, “Southwest”, ...
  – “Gardening” will find the best pages on sites dedicated to gardening, not a random page
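A two-line sketch of the percentage heuristic above (hypothetical numbers; `collection_score` is an illustrative name):

```python
def collection_score(docs_with_term, collection_size):
    """Rank collections by the fraction of their documents containing the
    term, so a dedicated gardening site beats a huge general portal that
    mentions gardening only on a few stray pages."""
    return docs_with_term / collection_size

print(collection_score(900, 1_000))         # dedicated site: 0.9
print(collection_score(5_000, 10_000_000))  # general portal: 0.0005
```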
Reliability
• Search client can perform transparent failover to a mirror site
• Only single point of failure is at the desktop
• Centralized mirroring of partial contents of most important sites (distributed indexes)
• Bottom line: If California falls into the Pacific Ocean, search service is not disrupted
Freshness • Web sites are kept up-to-date through local search engines • Net changes in web site content can be immediately transferred to the Meta servers • Bottom line: – Within seconds after a page is added to a web site, it can be found in the distributed index
Cost
• Under $1M to implement
  – over 1,000 queries/sec on over 1 TB
  – most of the cost is in the mirror servers
  – meta search servers are very cheap since:
    • DISK: only summary content, so less content
    • CPU: only one entry per term list per collection
Porn/spam control
• Spam and porn are easier to detect at the site level than at the page level
• Meta indexing process can weed out undesirable sites (e.g., a high proportion of nude content)
• Spamming cannot affect searches on other web sites
• Other techniques are available to make spamming unlikely
Implementing a distributed search network
• Java search clients on the desktop
• Distributed search servers on each web site
  – HAPPENING NOW: Sun, HP, CNN, Ford, Stanford, Boeing, ...
• Indexing other sites separately at Infoseek • Meta servers around the world • Distributed searching applies equally well in an intranet
Search Engine written in Java
“Are you crazy or just stupid?”
We are not alone!
• Infoseek
• Innotech
• Lucene
• Apronet
Why?
• Our current search engine – Speed and relevance; minimal RAM usage • Java search engine – Extensibility by users – Portability – Don’t have to make tradeoffs if not architected for extreme speed
Java Search Engine – Clean architecture (from scratch) – Small footprint – Huge limits: 4 Billion docs – Speed: can code in C if needed – Capability: all the std features – Extensible: lots of hooks; public source – Customizability: lots of switches – Suitable for large and small databases – Interfaces: Java RMI, TCP socket, INQUERY, STARTS, SQL/ODBC/JDBC – Correctness of results (full, lazy evaluation) – Extreme compression
Java Search Engine Status
• It works now! Real-time indexing/searching
• Extensive API documentation available (73 pages): see software.infoseek.com
• Currently plan to provide it free (including source) for education/research purposes • Hiring more staff
Applications • Demanding environments – Can customize and enhance to meet exacting requirements – Create a registry of third party plug-ins • Relational Database full text search plug-in (Oracle 8.1) • Desktop: email, web pages • IR Research: share and easily build upon other people’s work
Why Java • Faster, easier to code/debug than C, C++ – Memory management – Safe: no pointers • Extensible – subclassing • Dynamic extensibility – Plug-in directory, upload server code, ...
• Popular – lots of programmers and platforms
Why Java • Standard libraries that are available on all platforms – Built in inter-process communication (RMI) – Standards for interfaces (Java Beans) • More portable – no compiler issues – no byte order issues
Java disadvantages • Speed – 2 to 3x slower than C w/native compilers • Memory management not totally under your control – you can tell it when to GC • Not quite “write once, run anywhere” – “Write once, Debug everywhere” • RAM usage efficiency issues (Unicode)
Final survey question • If free for academic use and source code is provided, would you use it?
Summary • Web presents a unique set of requirements and challenges – Short queries – Ambiguous queries – Noisy databases – Lack of standards
Summary
• 80’s: Centralized computing → Decentralized computing
• 90’s: Personal computing → Network computing
• 00’s: Centralized searching → Decentralized searching using the network
Summary
• Distributed searching is attractive, but centralized searching is not dead
• Centralized, distributed, and meta-search techniques can co-exist, even in the same search
• Chicken-and-egg problem to get started; start now by using centralized mirrors
• There is probably a Java search engine in your future
IR process for a copy of the talk • Find my home page, type: Steve Kirsch into Infoseek • Follow the SIGIR link
IDF modification • Probably don’t need to mention this
Unexploited business opportunity • Selling our list of porn sites
Refinements • Use more than just the standard statistics: – User behavior • which links selected • relevance judgements – Hyperlinks • Combine meta-index search results with a centralized search
The 4 best ways to get traffic • Submit your site to search engines and directories (www.submit-it.com) • Include your site’s URL in your advertising in traditional media • Purchase banner ads on search engines and directories (this is equivalent to advertising in the telephone yellow pages) • Purchase ads on other Web sites