
Search and the ‘Net in 2007

Trends, Challenges and Cutting-Edge Developments in Internet Search

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges

For Rochester Regional Library Council Member Libraries’ Staff
Sponsored by the Rochester Regional Library Council
Supported by Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library
2007

For Today . . .

The Landscape of Search in 2007
A Look at the Major Services
  Google, Yahoo!, Ask, Windows Live (MSN)
Test Drive Time
New Services
Wikipedia: Looking Under the Hood
Tagging and Search
Explore on Your Own
Current Trends, Future Directions

Web Search in 2007

Who’s crawling the Web?

Yahoo
  Owns AlltheWeb, Altavista, Inktomi, Overture
Google
Live Search (MSN)
Ask
  Owns Teoma
Gigablast

NOTE: Ownership is different from database affiliation

Google Database Affiliates

AOL Google Excite Netscape

The Indexable Web (Gulli and Signorini, 2005)

Defined as that part of the Web available to be crawled by search engines
Estimated at more than 11.5 billion pages
Based on a sample data set of about a half-million URLs

Gulli, A. and Signorini, A. “The Indexable Web is More than 11.5 Billion Pages” in Proceedings of the 14th International World Wide Web Conference p. 900-1. Chiba, Japan, 2005. Available at http://www.cs.uiowa.edu/~asignori/web-size/

Percent of sample indexed by each engine

76% - Google
69% - Yahoo!
62% - Live (MSN)
58% - Ask
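
Illustration (not from the original slides): a rough worked example that turns the sample percentages into page counts by applying them to the 11.5 billion estimate. This is only a sketch. It assumes the sample percentages carry over to the whole indexable Web, and the function and variable names are made up for this example.

```python
# Lower-bound size of the indexable Web quoted from Gulli and Signorini (2005).
INDEXABLE_WEB_PAGES = 11.5e9

# Share of the sample found in each engine, as quoted on the slide.
COVERAGE = {"Google": 0.76, "Yahoo!": 0.69, "Live (MSN)": 0.62, "Ask": 0.58}


def estimated_pages(coverage, total=INDEXABLE_WEB_PAGES):
    """Return the implied index size per engine, in billions of pages.

    Assumption: each engine's share of the sample equals its share of
    the whole indexable Web.
    """
    return {engine: round(share * total / 1e9, 1)
            for engine, share in coverage.items()}


if __name__ == "__main__":
    for engine, billions in estimated_pages(COVERAGE).items():
        print(f"{engine}: roughly {billions} billion pages")
```

Under that assumption, Google's 76% corresponds to roughly 8.7 billion pages and Ask's 58% to roughly 6.7 billion.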

Estimated Size of 4 Major Services (Gulli and Signorini, 2005)

[Bar chart: Google, Yahoo!, Ask and Live (MSN) on a scale of 0% to 100%]

Share of Searches – April 2007

Source: Nielsen/NetRatings 2007

[Bar chart: share of searches on a scale of 0 to 60 for Google, Yahoo, Live, AOL, Ask and All Others]

Web Composition by TLD

(Koehler, 2004)

Web Document Persistence (Koehler, 2004)

Collection of randomly selected URLs
  After a 4-year period two-thirds could not be accessed
  The remaining one-third continue to be stable for a total of 6 years
Legal, scholarly and educational sites have “limited lifecycles not dissimilar to web sites in general”

Koehler, W. “A longitudinal study of Web pages..” Information Research v. 9 no. 2 Jan. 2004. Available at http://informationr.net/ir/9-2/paper174.html

The Major Services

Google, Yahoo!, Ask and Live

What is Google these days???

Print, radio advertising company?
  Deals with Viacom and others
E-mail utility (Gmail), eBay clone (Base), TV network (YouTube)?
Bank, video store?
Microsoft killer (Docs & Spreadsheets)?
Force in US and world politics?
Losing its laser-like focus on search???

New & Notable at Google in Search

Usage Rights limit (Adv. Search)
Suggest (labs.google.com)
Transit (labs)
Froogle Mobile (labs)
News Alerts (www.google.com/alerts)
iGoogle (re-design)

iGoogle

Formerly the Personalized Homepage (PHP)
PHP available since 2004
Part of G’s mission
  Search your own stuff (desktop)
  Traditional web search (unmediated)
  Mediated web search (your preferences, search history, G’s recommendations, etc.)
IP geolocation as filter for results
Facilitates use of Google Gadgets

Google gadgets

Currently over 250,000 gadgets available for your iGoogle page
GadgetMaker allows you to create 7 different types, “with no programming”
  Photo sharing
  Daily Me notepad
  Personal list
  GoogleGram
  Countdown gadget
  YouTube video favorites
  “Free Form” (meld text and images)

New & Notable at Google Beyond Search

Notebook (labs)
  Web clipping and note-taking service
Accessible Search (labs)
  Favors ad-free search results that render well for machine readers for the blind and visually impaired
Docs & Spreadsheets (docs.google.com)
  Free web-based word processing and spreadsheet programs (Web 2.0)
  Create, update, store, share content in real time

Google Desktop 3.0’s “Search Across Computers”

Allows users to search across all their computers
Requires user to install and configure the feature
G uploads files from your computers, indexes them, transfers them to your other computers and deletes them from its servers
All computers involved must be online at the time
If one computer is not online, no data transfer can occur and the files remain on G’s servers for up to 30 days, when they are deleted
If service is deactivated, “some personal account information” may stay on G servers for up to 60 days
“Gadget” apps available: news, weather, animations, etc.

Google Video Search/Store

video.google.com
Index of closed captioning and text descriptions from selected TV and other video content after Dec. 2004
Results can include preview or full view, description, source, date, duration and hyperlink
Advanced Search limits
  Language
  Length
  38 “genres”
  Free or For Sale

Google’s Aug. 22 Ranking Patent

Query Themes & Editorial Opinion
Relevance ranking processing patent granted to Google
Is Google interested in direct human intervention in results ranking (???)
Query Themes – topics commonly occurring in search queries are identified, e.g.
  Free software download sites
  Travel accommodation sites

Editorial opinion parameters

For any given query theme, human editors
  Survey user search query logs
  Examine search results lists
  Identify sets of sites that are “favored” and “non-favored”
Favored – non-spamming sites verified to offer content relevant to the query theme or that originate from a reputable subject directory
Non-favored – sites exhibiting spam or other deceptive characteristics
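
Illustration (not from the original slides): one way editor-chosen “favored” and “non-favored” site lists could feed into ranking. This is a sketch under stated assumptions, not the patented method. The multipliers, the function name `rerank` and the example hosts are all hypothetical.

```python
from urllib.parse import urlparse


def rerank(results, favored, non_favored, boost=1.5, penalty=0.5):
    """Re-order (url, score) pairs for one query theme.

    Assumption: editorial opinion is applied as a simple multiplier on
    the engine's own relevance score. The slides do not say how the
    adjustment is actually computed.
    """
    adjusted = []
    for url, score in results:
        host = urlparse(url).hostname or ""
        if host in favored:
            score *= boost      # editors verified this site for the theme
        elif host in non_favored:
            score *= penalty    # editors flagged spam or deception
        adjusted.append((url, score))
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    hits = [("http://spam.example/free", 0.9),
            ("http://other.example/free", 0.8),
            ("http://trusted.example/free", 0.7)]
    for url, score in rerank(hits, {"trusted.example"}, {"spam.example"}):
        print(f"{score:.2f}  {url}")
```

With these made-up numbers the favored site moves from last place to first and the flagged site drops to last, which is the kind of human intervention the slide is asking about.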

Google Print’s Library Project

Confusion and uncertainty over copyright issues cloud Publisher and Library Projects
U. of California system joined Library Project in August
“Find this book in a library” link to OCLC WorldCat holdings, searchable by zip code, state, country

World Digital Library

Joint venture of Google and Library of Congress
Initially an expansion of LC’s American Memory
Standards and cooperative structures to be worked out
UNESCO endorsement sought
One to keep your eye on
http://www.digitalpreservation.gov/about/index.html

Google’s Legal Challenges

Google is a magnet for lawsuits
Any lawsuit that reaches the pre-trial “discovery” phase can threaten the secrecy of G.’s proprietary software
Recent cases have centered on trademark/advertising and other copyright issues

Child Online Protection Act of 1998

Justice Dept.: Parental controls and filters are insufficient to protect children against online pornography; stricter governmental controls are needed
Aug. 2005 – G, Y, Microsoft and AOL were issued subpoenas for all data relating to search terms and the sites users visited between June 1 and July 31, 2005
Y, MSN and AOL “have provided some of the information requested and taken steps to guard users’ privacy”
G refused
To date no request for IP addresses or other data linking search behavior to individual users
DoJ request upheld by US Dist. Court 3/18/06, but reduced from 1 million search results to 50 K, with 5,000 random search queries
Google will comply

Implications

For Users – Invasion of privacy/search behavior, online identity, 1st Amendment
For Search Engine Industry –
  R&D focused on offering search results customized to an individual
  Requires tracking individual’s search behavior
  Can privacy be guaranteed?

Yahoo!

Three ways in

www.yahoo.com: Portal home page (all services)
search.yahoo.com: Crawler only
dir.yahoo.com: Subject directory only

“People mediated” search via tagging and personalization

New & Notable at Yahoo!

Y!Q (toolbar)
Mindset (Disambiguation)
  http://mindset.research.yahoo.com/
Music Engine & other verticals
Recommendations: Movies, Music, etc.

New & Notable at Ask

The Butler is gone! Teoma is in his place!
Smart Search
Web Answers
Zoom
Superior Mapping Tools

Windows Live Search

MSN’s live.com
Launched Sept. 2006
Successor to MSN Search
Maintains its own database
Tabs for Web, News, Images, Questions and Answers
Also Local, Video, (RSS) Feeds
Academic: academic.live.com
  Current strengths: Computer Science, Physics, Electrical Engineering

Live.com

Advanced Search available only after a search has been done
Full Boolean and nesting
Limits: Site/Domain, Country/Region, Filetype (6), Languages (37), SafeSearch
Default S.E. bundled in IE 7

Live vs. others 4/23/07 (vs. 12/1/06)

Web Search: Imre Kertesz
  G – 417,000 (+15%)
  Y – 184,000 (+7%)
  L – 37,266 (+9%)
  A – 11,800 (+33%)
“Academic” vs. “Scholar” verticals
  For “string theory”
    Scholar – 85,000 (-6%)
    Academic – 10,470 (+30%)
  For Siena frescoes
    Scholar – 3,420 (+7%)
    Academic – 32 (+113%)
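
Illustration (not from the original slides): the slide gives only the 4/23/07 counts and the percent change since 12/1/06. A small sketch of the arithmetic for backing out the implied earlier counts follows. The function name is made up, and the results are approximations because the published percentages are rounded.

```python
def implied_prior_count(current, percent_change):
    """Back out an earlier count from a current count and a percent change.

    current = prior * (1 + percent_change / 100), so
    prior   = current / (1 + percent_change / 100).
    """
    return round(current / (1 + percent_change / 100))


if __name__ == "__main__":
    # (label, count on 4/23/07, percent change since 12/1/06) from the slide
    observations = [
        ("G, Imre Kertesz", 417_000, 15),
        ("Scholar, string theory", 85_000, -6),
        ("Academic, Siena frescoes", 32, 113),
    ]
    for label, current, change in observations:
        print(f"{label}: about {implied_prior_count(current, change):,} on 12/1/06")
```

For example, Academic’s +113% on Siena frescoes sounds dramatic but corresponds to a rise from about 15 results to 32.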

Live’s Collections

User-saved search results featuring locations or points or objects of interest
  Best sushi in NY
  Stops along old Rt. 66
Listings may be annotated with personal notes
Collections can be shared
Will become a user-created directory (Web 2.0)

New Services

Exalead – www.exalead.com

Launched October 2004, based in France
Maintains its own database
Smaller than most US services (8 billion)
Offers “Narrowing Options”
Advanced features:
  Phonetic spelling with “soundslike”
  Approximate spelling with “spellslike”
Limits: Site (URL), Filetype (8), Adult content, Language (57!!!)
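
Illustration (not from the original slides): a sketch of what phonetic and approximate matching can look like. Exalead’s actual algorithms are not described here, so two well-known stand-ins are used instead: American Soundex for “sounds like” and the similarity ratio from Python’s standard `difflib` module for “spells like”. The function names and the sample vocabulary are hypothetical.

```python
import difflib

# Standard American Soundex digit groups.
_SOUNDEX_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                   ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _SOUNDEX_GROUPS for ch in letters}


def soundex(word):
    """Return the four-character Soundex key for a word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


def sounds_like(query, vocabulary):
    """Words whose Soundex key equals the query's key."""
    target = soundex(query)
    return [word for word in vocabulary if soundex(word) == target]


def spells_like(query, vocabulary, cutoff=0.8):
    """Words whose difflib similarity to the query is at least `cutoff`."""
    return difflib.get_close_matches(query, vocabulary, n=5, cutoff=cutoff)


if __name__ == "__main__":
    names = ["Kertesz", "Kerouac", "Keats", "Curtis"]
    print(sounds_like("Kertes", names))
    print(spells_like("Kertez", names))
```

Both calls are meant to recover “Kertesz” from a misspelled query, one by sound and one by spelling.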

Factbites – www.factbites.com

Returns relevant, whole sentences from sites retrieved, not just snippets
Incorporates clustering
Based in Australia

eTools – eTools.ch

Meta Engine
Searches Google, Yahoo, Live, Ask, AV, Lycos and 4 European engines
Parses each query to conform to each engine’s search syntax
Limit by country (8) and language (5)
Change weighting of results from each source engine
Results clustered by “topic” and source
Save Results as pdf or rss
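
Illustration (not from the original slides): a sketch of how a meta engine might merge ranked lists while letting the user change the weight given to each source engine. The scoring rule (weight divided by rank, summed across engines) is an assumption chosen for simplicity. How eTools actually combines results is not stated on the slide.

```python
from collections import defaultdict


def merge_results(results_by_engine, weights):
    """Merge ranked URL lists into one list, best first.

    results_by_engine: {engine name: [url, url, ...]} in rank order.
    weights: {engine name: weight}; engines not listed default to 1.0.
    Each appearance of a URL contributes weight / rank to its score.
    """
    scores = defaultdict(float)
    for engine, urls in results_by_engine.items():
        weight = weights.get(engine, 1.0)
        for rank, url in enumerate(urls, start=1):
            scores[url] += weight / rank
    # Sort by score (descending), then by URL so ties are deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    hits = {
        "Google": ["a.example", "b.example", "c.example"],
        "Yahoo": ["b.example", "a.example", "d.example"],
    }
    print(merge_results(hits, {"Google": 1.0, "Yahoo": 2.0}))
```

Doubling one engine’s weight lets its top result overtake the other engine’s top result, which is the effect the “change weighting” option describes.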

Wikipedia: Looking Under the Hood

Wikipedia’s Many Faces

Free, international, open content online encyclopedias created and edited by thousands of volunteers in wiki format
A world-wide online community dedicated to sharing information freely
3rd most popular news and information source on the Web (Nielsen, June 2006)

Background and History

2000 – Jimmy Wales, successful options trader, began Nupedia, an online encyclopedia “with content by experts”
2001 – Wikipedia started as a “holding pen” for Nupedia content awaiting review
Now maintained by Wikimedia Foundation
Financed mainly by donations, with a current annual budget of $1.5 million
Growth rate of 13-18% per month

Wikipedias Worldwide

Over 250, each in a different language
Content is unique to each Wikipedia
Article counts (7/5/07)
  English – 1,864,000+
  German – 606,000+
  French – 521,000+
  Polish – 398,000+
  Japanese – 387,000+
  Dutch – 314,000+
Languages include Nahuatl, Sardu, Tetun, Wollof

Structure and Features

Articles and relevant links
Article Talk page – for discussions related to that article only
User page – Each registered user can introduce themselves and chat with others
Media namespace – Uploaded image and media files described
Category namespace – Categories and tags that can be assigned to articles
“Watch List” – Users notified of changes to specific articles

Authorship

Most content contributed by a core of about 1,000 “regular, registered users”
IP addresses of anonymous editors are recorded in the page history
Registered users’ IPs are concealed
Almost ½ of articles have fewer than 5 distinct authors
About ¼ of articles have only one author
Articles average 2.7 authors each

Editorial Control (?)

Volunteer “administrators”
  Monitor changes in a section or topic area
  Arbitrate conflicts, i.e. “edit wars”, and decide when to “protect” an article from further revision
Peer Review Status – granted by a larger number of reviewers as a sign of higher quality
Featured Content Status (“The Best of W.”) – peer reviewed sites selected for this honor by further review and labeled with a Star (18-20 per month)
Featured Portals – Large subject metasites of high quality

Nature’s Wikipedia Study (2005)

50 entries on scientific topics from W. and latest Britannica sent to relevant experts
Articles’ sources not revealed (blind study)
Only 8 serious errors identified: 4 from each source
“Factual errors, omissions or misleading statements”: B – 123; W – 162
Reviewers noted many W. articles were “poorly structured and confusing”
Nature surveyed over 1,000 of its own authors
  70% had heard of Wikipedia
  17% consult it at least weekly
  10% author or help to update its content

Giles, Jim “Internet Encyclopedias go head to head” Nature v. 438 Dec. 15, 2005 p. 900-1; suppl. info. at www.nature.com/news/2005/051212/full/438900a.html

Tagging

Search and Social Networks

The Democratization of Metadata

aka Social Bookmarking
Folksonomies – Taxonomies by and for (???) people
Del.icio.us – Personal bookmark service for URLs with tagging capability
Flickr.com – Online photo storage system that provides users with a set of tags and allows new ones to be created
Facetious – del.icio.us tag database reworked into a faceted classification, grouping tags under “by place”, “by technology”, “by attribute”, etc.
  http://demo.siderean.com/facetious/facetious.jsp

Tag Creation and the “Tag Cloud”

Follows a “power law” scenario
  The most used tags attract even more usage
  A large number of tags are used by only a few people
  A huge number of tags will be used by only 1 or 2 people

Guy, Marieke and Tonkin, Emma “Folksonomies: Tidying up Tags?” D-Lib Magazine 12 (1) Jan. 2006. Available at http://www.dlib.org/dlib/january06/guy/01guy.html
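
Illustration (not from the original slides): a sketch of how a tag cloud can be built from raw tag assignments, and how the “long tail” of tags used only once can be measured. The logarithmic font scaling, the pixel sizes and the sample tags are assumptions for this example; tagging services each have their own rules.

```python
import math
from collections import Counter


def tag_cloud(tag_assignments, min_px=10, max_px=32):
    """Return (tag, count, font size) tuples, most used tag first.

    Font size grows with the logarithm of the count, so a few very
    popular tags do not dwarf everything else.
    """
    counts = Counter(tag_assignments)
    if not counts:
        return []
    top = max(counts.values())
    cloud = []
    for tag, n in counts.most_common():
        # If every tag was used once, log(top) is 0: give all the minimum size.
        weight = math.log(n) / math.log(top) if top > 1 else 0.0
        cloud.append((tag, n, round(min_px + (max_px - min_px) * weight)))
    return cloud


def singleton_share(tag_assignments):
    """Fraction of distinct tags that were used exactly once."""
    counts = Counter(tag_assignments)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n == 1) / len(counts)


if __name__ == "__main__":
    sample = (["web"] * 8 + ["search"] * 4 + ["library"] * 2
              + ["rfid", "teoma", "sushi"])
    for tag, n, px in tag_cloud(sample):
        print(f"{tag:<8} used {n}x  {px}px")
    print(f"Share of tags used only once: {singleton_share(sample):.0%}")
```

In this made-up sample one tag dominates while half of the distinct tags appear only once, the same shape the power-law bullets describe.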

Tagging – the Advantages

Not a controlled vocabulary (user-created)
Easy, quick, free and enjoyable
Instantly incorporates buzz words, current terminology, jargon, new concepts
Can be automated by mapping a resource to tags that appear often in its text
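
Illustration (not from the original slides): a sketch of the automated approach in the last point, suggesting tags from the words that appear most often in a resource’s text. The stop-word list, the length threshold and the function name are assumptions for this example; a production service would need a fuller stop list and some handling of word variants.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
STOPWORDS = frozenset({"a", "an", "and", "by", "for", "in", "of", "on",
                       "or", "the", "to", "with"})


def suggest_tags(text, max_tags=3, min_length=3):
    """Suggest tags: the most frequent words that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words
                     if len(word) >= min_length and word not in STOPWORDS)
    return [word for word, _ in counts.most_common(max_tags)]


if __name__ == "__main__":
    page = ("Search engines crawl the Web. A search engine ranks Web pages "
            "by relevance to the search.")
    print(suggest_tags(page, max_tags=2))
```

Note that “engine” and “engines” are counted separately here, a small instance of the synonym and word-form problems that tagging runs into.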

Tagging – the Challenges

Not a controlled vocabulary
Lack of
  Linkage among synonyms, homonyms
  Hierarchical concept organization
  Disambiguation
Often only single word tags allowed
No standardization across services
Tags can be over-personalized

Improving Tagging Practices

Services must allow compound tags, e.g. “black cat”
Educate users to the advantages of a few basic classification conventions (Controlled Vocabulary 101)
Use of tag bundles (informal hierarchies)
Allow for cross-cultural tagging practices within and across languages

Trends and the Future of Search

Search Engine Trends

Search is expanding (or being absorbed) into other enterprises
  G, Y develop Web utilities and services unrelated to search
  Ask bought by Barry Diller (owns Ingenta, hotels around the world)
  Implications for R & D?
Privacy concerns explode
  The dark side of personalization
  Convenience and utility in search retrieval vs. confidentiality
  US Dept. of Justice requests: First step on a slippery slope?

Search in the Future (???)

Local-for-mobile
  Search engines will become more pervasive as personalization and alerts merge with mobile service
  “Your favorite band is playing in town tonight. Want to book a ticket?”
  “What is the price and availability of hotel and dining facilities in this area tonight?”
Search engines may be incorporating data from blogs and tagging sites into ranking processes
  Yahoo! purchased del.icio.us
  Ask purchased Bloglines
  Google purchased Blogger in 2003
Increased use of radio frequency identification (RFID) chips will enable mobile devices to recognize products and compare prices among the stores in your area

What will the next generation of search engines look like?

They will begin to “understand” what you are searching for, not just respond to the characters you type. HOW??
  User feedback (personalization)
  Use of content-bearing metadata supplied by author and others (social tagging services, etc.)
  Automatic resolution of semantic differences among terms and metadata through natural language processing

Thank You!

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552