Transcript Infoseek Ultra - Steve Kirsch Home Page (short version)
Searching the Internet
Steve Kirsch, Chairman, Infoseek
Scoring Framework
• Classification problem
  – Separate relevant from non-relevant documents
  – Bayes’ decision rule: Relevant if P(x(d)|R)P(R) > P(x(d)|~R)P(~R), where x(d) is the observed representation of d
  – Independence assumption leads to S(d) = Σ_t log [ p(t)(1−q(t)) / ((1−p(t))q(t)) ], summed over query terms t present in d, where p(t) = P(t|R) and q(t) = P(t|~R)
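The independence-based score above can be sketched in a few lines of Python. The function name `rsj_score` and the term statistics are illustrative assumptions, not Infoseek code:

```python
import math

def rsj_score(doc_terms, p, q):
    """Sum log-odds weights log[p(t)(1-q(t)) / ((1-p(t))q(t))] over the
    query terms present in the document (presence-only simplification)."""
    return sum(
        math.log((p[t] * (1 - q[t])) / ((1 - p[t]) * q[t]))
        for t in doc_terms
        if t in p and t in q
    )

# Made-up statistics: p(t) = P(t|R), q(t) = P(t|~R)
p = {"barney": 0.8, "dinosaur": 0.6}
q = {"barney": 0.1, "dinosaur": 0.2}
print(rsj_score({"barney", "dinosaur"}, p, q))  # log(36) + log(6) ≈ 5.38
```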
Agenda • How Infoseek got started • How Infoseek works • How do I … ? • What people ask about; Why • Relevance ranking of web pages • Searching an infinite number of web pages – How people search – Video of IR experts – Top 7 habits of experts – Distributed search fusion algorithm • Why write a search engine in Java?
How Infoseek got started (1994) • DIALOG was too hard and too expensive …so...
Internet + Natural Language Query Engine + Low price “If you build it, they will come”
What things look like today
Popular questions • Do we sell placement?
• How do I get to the top of the results?
• How do I find good Barney pages?
Placement • Based on popularity and statistics
Why people can’t find you
• Infinite number of web pages
• >150M static pages
• Chances of being found on the net are a lot less than being found in the phone book because:
  – there are a million times more web pages
  – these pages are not organized
How do I get to the top of the search result listings?
Pray
How to find Barney pages suitable for your kids
+Barney +dinosaur -bash -kill -maim -destroy -hate
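As a toy illustration of the +required / -excluded query syntax (a simplified sketch; `matches` is a hypothetical helper, not Infoseek’s query parser):

```python
def matches(doc_text, query):
    """Enforce +required and -excluded terms; other terms are ignored here."""
    words = set(doc_text.lower().split())
    for term in query.split():
        if term.startswith("+") and term[1:].lower() not in words:
            return False
        if term.startswith("-") and term[1:].lower() in words:
            return False
    return True

query = "+Barney +dinosaur -bash -kill -maim -destroy -hate"
print(matches("Barney the friendly dinosaur sings songs", query))  # True
print(matches("Barney bash destroy the purple dinosaur", query))   # False
```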
What people ask about (and why)
Unofficial SIGIR survey question
How many people here search the web for “adult sites”?
Top 15 queries on the WWW*
• sex
• Playboy
• Penthouse
• chat
• Hustler
• nude
• porn
• erotica
• games
• pornography
• porno
• adult
• ESPN
• pussy
• Sony NightShot HandyCam
• Pamela Anderson
* I am not making this up! This list is real!
What does that mean?
• “Uhh… I was just testing!”
Unofficial SIGIR trivia question • Q: What famous IR researcher asked “Is this because of the CDA?” • A: Bruce Croft in 1995
Why it happens (possible explanations)
• Research on the CDA
• Curious what others are looking at
• “I read Playboy for the articles”
• Many new technologies are driven by sex:
  – VCR
  – Hotel movies on demand
• People are naturally horny
What it means • Human race in no danger of extinction • Corporate libraries doing a great job in technical areas • Traditional sex education inadequate • Some of you are not telling the truth • Audience surveys are not always accurate • Bill Gates should admit to Congress that Pamela Anderson is more important than he is • If you didn’t raise your hand, you may need professional help!
Finding “relevant” sites • Try these engines: – Sinfoseek – Infoseak – Nymfoseek – Infopeek – ...
Relevance ranking Web sites
Facts about Queries
• Most queries are short
  – Average length approx. 2.2 words
  – 10% use query syntax (usually incorrectly)
  – 1% use advanced search
  – Noun phrases only
• Precision more important than recall
  – Users expect precision in top results
Relevance ranking objectives Must use several techniques to determine “relevance”: • Page has query term(s) • Popular usage of the term, e.g., penthouse, java, adult, “evil empire”, ...
• Page quality • Page/site popularity • Spam reduction/elimination • Porn reduction
Relevance ranking factors • Query terms : tf*idf • Usage : Hyperlink text, thesaurus • Quality : site quality, dates, depth, … • Popularity : External link count, proxy stats • Spam : word/phrase unusual statistics (tf limiting) • Porn : site exclusion list, naughty phrase list
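The query-term factor, tf*idf, in its classic damped form (a common textbook variant; the exact weighting Infoseek used is not given in the talk):

```python
import math

def tfidf(tf, df, num_docs):
    """Damped term frequency times inverse document frequency."""
    return (1 + math.log(tf)) * math.log(num_docs / df)

# Hypothetical counts: term occurs 3x in the page, in 50K of 100M pages
print(tfidf(3, 50_000, 100_000_000))
```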
Relative weighting of these factors is tricky and subjective
Should “evil empire” return Microsoft as the top hit?
Living in a world of an infinite number of documents
The problem (user view)
• Too hard to find things even though only 100M documents indexed
• The problem is often precision and relevance, NOT recall
  – a title search for “intel” gives over 200 hits just like this: Index of /CPAN local/authors/id/GSAR/x86/intel/ix86/intel/ix86/intel/intel/ix86/intel/ix86/
• Query ambiguity, e.g., “baby Bells”
The problem (vendor view) • Speed • Size • Cost • Freshness • Load on the Internet/bandwidth (both sides) • Quality (Spam/porn) • Will people be able to find what they are looking for as the net grows?
Today’s approach sucks
• Suck all the world’s content into one centralized search engine (Infoseek)
Is there a better way?
• We might start by asking the question: “How do people find information today?”
Today’s retrieval techniques
Let’s observe two “information retrieval professionals” at work...
Top 7 information retrieval techniques used by the “pros”
My favorite IR story • On crutches • Right leg in a brace • Nurse uses this highly sophisticated IR technique to determine where the problem was
Centralized searching techniques are rarely used in real life...
• Ask God (and pray for an answer) • Ask DIALOG …and pray...
• WWW search (new!)
What people DO use is decentralized searching
• A question goes out to Source 1, Source 2, ..., Source N
• Back come answers and more sources
How well does centralized searching work? • I need a volunteer...
How well does human distributed searching work?
• Let’s find out… • Name two films directed by James Cameron
Human distributed searching attributes • Faster than a computer!!!
• Complete
• Accurate
• Can be used to validate an answer
• Will always find an answer (eventually)
• No specialized hardware
  – All humans have the same CPU speed/RAM
So can’t we design a computer distributed search network that is as fast, accurate, and complete as our human distributed search network?
Our goal • Don’t necessarily mimic the process, but adapt the process to the medium
One approach • User types query • System queries a “meta” database of collection descriptors to determine best collections for that query • System routes query to the best N of those collections in parallel • System merges results and presents to user • For deeper search, use bigger N
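The steps above can be sketched as follows; the names and data shapes are illustrative assumptions, not Infoseek’s API, and the parallel fan-out is shown sequentially:

```python
def route_and_merge(query_terms, meta_index, collections, n=2):
    """Pick the best n collections for the query via the meta index,
    send the query to each, and merge the results by score.

    meta_index: {collection: {term: count of docs containing term}}
    collections: {collection: search function returning [(doc, score)]}
    """
    ranked = sorted(
        meta_index,
        key=lambda c: sum(meta_index[c].get(t, 0) for t in query_terms),
        reverse=True,
    )
    merged = []
    for c in ranked[:n]:            # for deeper search, use bigger n
        merged.extend(collections[c](query_terms))
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

meta_index = {"A": {"java": 100}, "B": {"java": 5}, "C": {"gardening": 80}}
collections = {
    "A": lambda terms: [("a1", 0.9), ("a2", 0.5)],
    "B": lambda terms: [("b1", 0.7)],
    "C": lambda terms: [],
}
print(route_and_merge(["java"], meta_index, collections, n=2))
# → [('a1', 0.9), ('b1', 0.7), ('a2', 0.5)]
```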
Distributed search query • Subject area (optional): – Infoseek • Query: – Steve Kirsch • Collection selection: Use BOTH • Send only the Query to each collection
Distributed searching demo
[Diagram: the user’s Java applet sends the query over the Internet to the meta index, which routes it to collections (DB #34, DB #54, DB #564, ...); the applet merges the results and presents them]
Distributed vs. meta searching
• [Table comparing meta search vs. distributed search on auto collection selection and fusion]
• Distributed = “as if one collection”
The meta index • The meta index contains characterizations of each collection – words and phrases • The meta index can be updated incrementally, in real-time, whenever a document is added/removed from a collection
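A minimal sketch of such incremental maintenance, assuming the characterization is just per-collection term document-counts (the real meta index also holds phrases; the class name is illustrative):

```python
from collections import Counter

class MetaIndex:
    """Per-collection counts of how many documents contain each term."""

    def __init__(self):
        self.counts = {}  # collection -> Counter({term: doc count})

    def add_document(self, collection, terms):
        # set() so each document counts a term at most once
        self.counts.setdefault(collection, Counter()).update(set(terms))

    def remove_document(self, collection, terms):
        self.counts[collection].subtract(set(terms))

mi = MetaIndex()
mi.add_document("siteA", ["gardening", "roses", "gardening"])
mi.add_document("siteA", ["roses", "soil"])
mi.remove_document("siteA", ["roses", "soil"])
print(mi.counts["siteA"]["gardening"], mi.counts["siteA"]["roses"])  # 1 1
```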
Results fusion issue • Search results scores from each collection are not comparable – even if each site is using the same search engine
Traditional fusion technique • Pass 1: Gather statistics from all engines • Pass 2: Ask each engine to compute scores
Requirement
All search engines must use the same scoring algorithm
Infoseek patented fusion technique • Each engine independently computes best N docs • Each engine returns term count statistics for each term in these N docs • Final scoring computed at the Java applet
Data returned from each engine

Old way (score per document):
  Doc 89: 75%
  Doc 12: 63%

New way (doc cnt = 1233, plus per-term counts per document):
          Term1  Term2  Term3
  Doc 89    3      2     123
  Doc 12    4      0     163
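The single-pass idea can be sketched as follows: each engine independently returns its doc count, per-term document frequencies, and per-term counts for its best documents, and the client computes comparable final scores from the pooled statistics. The scoring formula here is plain tf*idf for illustration; the actual patented formula is not given in the talk:

```python
import math

def fuse(engine_results, query_terms):
    """Client-side fusion: pool doc counts and document frequencies across
    engines, then score every returned hit with the same global statistics.

    engine_results: [{"doc_count": int,
                      "df": {term: docs containing term},
                      "hits": {doc_id: {term: tf}}}]
    """
    total_docs = sum(e["doc_count"] for e in engine_results)
    global_df = {t: sum(e["df"].get(t, 0) for e in engine_results)
                 for t in query_terms}
    scored = {}
    for e in engine_results:
        for doc, tfs in e["hits"].items():
            scored[doc] = sum(
                tfs.get(t, 0) * math.log(total_docs / global_df[t])
                for t in query_terms if global_df[t] > 0
            )
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

engines = [
    {"doc_count": 1000, "df": {"kirsch": 10}, "hits": {"doc89": {"kirsch": 3}}},
    {"doc_count": 9000, "df": {"kirsch": 90}, "hits": {"doc12": {"kirsch": 4}}},
]
print(fuse(engines, ["kirsch"]))  # doc12 first: same global idf for both engines
```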
Advantages of this technique • Single-pass technique (no statistic gathering and re-distribution) • Search engines are not “given” a formula to use, nor IDF statistics • Does not require use of the same search engine on all collections
New technique advantages
• The same document in two different indexes WILL have the same final score
• “Value-added” of each search engine is preserved, since the search engine operates as it normally would
Issues
• Count stats may be “by area” (e.g., title)
• “Evil empire” can be handled by returning term counts for synonyms of the term
• Stemming/tokenization differences between engines can impact the term counts returned
• Spam, porn, popularity, and usage are already handled by each individual engine, so these stats need not necessarily be returned
Benefits
• Scalability
• Size
• Speed
• Accuracy
• Reliability
• Freshness
• Cost
• Porn/spam control
• Greater consumer choice… an open systems approach to searching
• Preserves value added of each search engine
• No spidering overhead
Scalability • Use the world’s largest computer (OPC) • Meta index is ideally centralized – very compact – can be scaled at low cost • Get faster as it gets bigger – in terms of throughput capacity – both meta and document searches
Size • There is no limit to the amount of information that can be indexed • Disk and CPU cost is incurred by each web site
Speed
• Sub-second response times are typical (web and meta)
• Speed can be set to the fastest to respond
• Can merge with centralized results
• Background processing by the client of the next results page means no wait time for the user
• Bottom line: Faster speed than any centralized search engine can achieve
Accuracy
• Using site/page ranking works well
• Not always better than centralized; may sometimes be a lot worse; often the same
• Look at % of documents containing the terms, not total number of documents
• Examples:
  – “Intel”, “Southwest”, ...
  – “Gardening” will find the best pages on sites dedicated to gardening, not a random page
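A two-line sketch of the percentage heuristic above (hypothetical numbers; `collection_score` is an illustrative name):

```python
def collection_score(docs_with_term, collection_size):
    """Rank collections by the fraction of their documents containing the
    term, so a dedicated gardening site beats a huge general portal that
    mentions gardening only on a few stray pages."""
    return docs_with_term / collection_size

print(collection_score(900, 1_000))         # dedicated site: 0.9
print(collection_score(5_000, 10_000_000))  # general portal: 0.0005
```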
Reliability
• Search client can perform transparent failover to a mirror site
• Only single point of failure is at the desktop
• Centralized mirroring of partial contents of most important sites (distributed indexes)
• Bottom line: If California falls into the Pacific Ocean, search service is not disrupted
Freshness • Web sites are kept up-to-date through local search engines • Net changes in web site content can be immediately transferred to the Meta servers • Bottom line: – Within seconds after a page is added to a web site, it can be found in the distributed index
Cost
• Under $1M to implement
  – over 1,000 queries/sec on over 1 TB
  – most of the cost is in the mirror servers
  – meta search servers are very cheap since:
    • DISK: only summary content, so less content
    • CPU: only one entry per term list per collection
Porn/spam control
• Spam and porn are easier to detect at the site level than at the page level
• Meta indexing process can weed out undesirable sites (e.g., a high proportion of nude content)
• Spamming cannot affect searches on other web sites
• Other techniques are available to make spamming unlikely
Implementing a distributed search network
• Java search clients on the desktop
• Distributed search servers on each web site
  – HAPPENING NOW: Sun, HP, CNN, Ford, Stanford, Boeing, ...
• Indexing other sites separately at Infoseek • Meta servers around the world • Distributed searching applies equally well in an intranet
Search Engine written in Java
“Are you crazy or just stupid?”
We are not alone!
• Infoseek
• Innotech
• Lucene
• Apronet
Why?
• Our current search engine – Speed and relevance; minimal RAM usage • Java search engine – Extensibility by users – Portability – Don’t have to make tradeoffs if not architected for extreme speed
Java Search Engine – Clean architecture (from scratch) – Small footprint – Huge limits: 4 Billion docs – Speed: can code in C if needed – Capability: all the std features – Extensible: lots of hooks; public source – Customizability: lots of switches – Suitable for large and small databases – Interfaces: Java RMI, TCP socket, INQUERY, STARTS, SQL/ODBC/JDBC – Correctness of results (full, lazy evaluation) – Extreme compression
Java Search Engine Status
• It works now! Real-time indexing/searching
• Extensive API documentation available (73 pages): see software.infoseek.com
• Currently plan to provide it free (including source) for education/research purposes • Hiring more staff
Applications • Demanding environments – Can customize and enhance to meet exacting requirements – Create a registry of third party plug-ins • Relational Database full text search plug-in (Oracle 8.1) • Desktop: email, web pages • IR Research: share and easily build upon other people’s work
Why Java • Faster, easier to code/debug than C, C++ – Memory management – Safe: no pointers • Extensible – subclassing • Dynamic extensibility – Plug-in directory, upload server code, ...
• Popular – lots of programmers and platforms
Why Java • Standard libraries that are available on all platforms – Built in inter-process communication (RMI) – Standards for interfaces (Java Beans) • More portable – no compiler issues – no byte order issues
Java disadvantages • Speed – 2 to 3x slower than C w/native compilers • Memory management not totally under your control – you can tell it when to GC • Not quite “write once, run anywhere” – “Write once, Debug everywhere” • RAM usage efficiency issues (Unicode)
Final survey question • If free for academic use and source code is provided, would you use it?
Summary • Web presents a unique set of requirements and challenges – Short queries – Ambiguous queries – Noisy databases – Lack of standards
Summary
• 80’s: Centralized computing → Decentralized computing
• 90’s: Personal computing → Network computing
• 00’s: Centralized searching → Decentralized searching using the network
Summary
• Distributed searching is attractive, but centralized searching is not dead
• Centralized, distributed, and meta-search techniques can co-exist, even in the same search
• Chicken-and-egg problem to get started; start now by using centralized mirrors
• There is probably a Java search engine in your future
IR process for a copy of the talk • Find my home page, type: Steve Kirsch into Infoseek • Follow the SIGIR link
IDF modification • Probably don’t need to mention this
Unexploited business opportunity • Selling our list of porn sites
Refinements • Use more than just the standard statistics: – User behavior • which links selected • relevance judgements – Hyperlinks • Combine meta-index search results with a centralized search
The 4 best ways to get traffic • Submit your site to search engines and directories (www.submit-it.com) • Include your site’s URL in your advertising in traditional media • Purchase banner ads on search engines and directories (this is equivalent to advertising in the telephone yellow pages) • Purchase ads on other Web sites