Search and the ‘Net @ 2006
Trends, Challenges and Cutting-Edge
Developments in Internet Search
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
For
Rochester Regional Library Council
Member Libraries’ Staff
Sponsored by the
Rochester Regional Library Council
Supported by Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the
New York State Library 2005
For Today ….
The current “state of web search”
What’s new among established services
Services launched recently
Cutting Edge in Search
– Natural Language Processing
Text mining
The Latest from the Living Web
Weblogs
RSS feeds
Podcasts
Current trends and future possibilities
Linklist for today’s session:
people.hws.edu/hunter/search06links.htm
Web Search @ 2006
Who’s crawling the Web?
Yahoo
– Owns AlltheWeb, Altavista, Inktomi, Overture
Google
MSN
AskJeeves owns Teoma
Gigablast
NOTE: Ownership is different from
database affiliation
Google
Database Affiliates
Google
AOL
Excite
Netscape
Most popular services
Google
48%
Yahoo
29% (up 20% from last year)
MSN
8% (up 30% from last year)
All others
15% (AOL, Ask Jeeves, Netscape, Gigablast)
Study by Harris Interactive (must purchase)
– www.harrisinteractive.com
Database Size
Google: ca. 10 billion web pages
(???)
Yahoo – 20 billion “web objects”
MSN – 6 billion (est.)
Teoma – 3 billion (est.)
Gigablast – 1.5 billion (est.)
Search Engine Overlap
Results compared from 12,500 random queries
from the largest engines
85% were unique to one engine
11% were shared by any two
3% were shared by any three
1% were shared by all
Study by Dogpile, U Pittsburgh and Penn State
– CompareSearchEngines.dogpile.com/OverlapAnalysis
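The overlap figures above come down to counting how many engines returned each distinct result. A minimal sketch of that counting follows; the engine names are real, but the result lists are invented for illustration and are not data from the Dogpile study.

```python
from collections import Counter

def overlap_distribution(results_by_engine):
    """Return {n: share of distinct results returned by exactly n engines}."""
    engine_counts = Counter()
    for urls in results_by_engine.values():
        # set() so a URL listed twice by one engine is counted once
        for url in set(urls):
            engine_counts[url] += 1
    total = len(engine_counts)
    by_n = Counter(engine_counts.values())
    return {n: by_n[n] / total for n in sorted(by_n)}

# Hypothetical first-page results for a single query (illustrative only)
sample = {
    "Google": ["a.com", "b.com", "c.com"],
    "Yahoo": ["a.com", "d.com", "e.com"],
    "MSN": ["a.com", "b.com", "f.com"],
}

for n, share in overlap_distribution(sample).items():
    print(f"returned by {n} engine(s): {share:.0%}")
```

Averaging this distribution over many queries would give figures of the same kind as those reported on the slide.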
Recent Developments
Among Established Services
2005: A lot to Yahoo! about
No longer just a subject directory
New features and an estimated 20%
increase in users
Vertical Search Engines
– Music, health, finance, shopping and over 20 more
Personalization – My Yahoo and Yahoo
360
– Creates an online identity with photos, restaurant
reviews, personal histories and personal blog
2005: A lot to Yahoo! about
RSS feeds
– Offered as part of My Yahoo
– User-friendly Reader/Aggregator provided;
limited to 250,000 Yahoo-selected feeds
– Yahoo content as RSS: News, Ask Yahoo, Buzz
Index (popular searches), News Groups
Video search (beta) video.search.yahoo.com
– Advanced search features: KW, format, file size,
length, content filter
Creative Commons search: search.yahoo.com/cc
– Content that is free to share or modify
2005: A lot to Yahoo! About
Contextual Searching - Y!Q
Selected web pages or highlighted
sections analyzed for word frequency
and “concept extraction” and used as
basis for a search
Results give basis for query in “context
selection box”
Refinements include removing
unwanted terms/phrases and “more
like this” link
Requires download of free toolbar
toolbar.yahoo.com
2005: A lot to Yahoo! About
Open Content Alliance (10/3/05)
Large scale E-text initiative
Members include Yahoo, Internet
Archive, National Archives (UK), RLG,
LC, 8 US and 6 Canadian Universities
Over 25,000 Digitized copies of public
domain AND copyrighted works
Works under copyright only available if
permission granted by owner
Yahoo plans to include the content in
its database or subject directory
2005: A lot to Yahoo! About
Yahoo/OCLC toolbar
Searchers may restrict their results to the
Open World Cat database, currently at 57
million records
Displays library holdings in the searcher’s
vicinity
Download (free) at www.oclc.org/toolbar
AOL
search.aol.com
Results from Google
Personalization (with free account)
Results clustering a la Vivisimo
“Smartbox” query refinement
– Offers suggestions BEFORE search button is
clicked
“Snapshots” Human-created answers
Local Search, Maps, Vertical Engines
Gigablast
www.gigablast.com
“Related pages” – Relevant search
results which may not contain original
search terms
Database now at 1.5 billion (up 50%)
One to keep your eye on
Ixquick Metaengine
www.ixquick.com
Repetitive results removed
Results marked as irrelevant by user
used to delete other similar pages in
real time
International price comparison
covering over 5,000 merchants
International phone directory,
residential and business
Google
Personalization
Re-orders search results based on user’s past
searches and click tracks
Ranking will change, depending on user
profiles
Requires setting up a (free) account
Personalized home page (G. as portal?)
Complex profiles are problematic
eg. “Movies, computer hardware, the Internet,
general news, astronomy”
SEARCH: cars
Which categories take precedence over others?????
Google
Personalization
Search records personally associated
with a user are deleted if service is
dropped
Search log data for all Google searches
kept (via cookies)
Google’s privacy policy:
www.google.com/privacy.html
Bookmark entire web pages
Google
Google Earth earth.google.com
Geographic search application
Originally Keyhole 3D, now a free
Google download
Images taken by satellites and aircraft
“sometime in the last 3 years”
“Fly to” accepts an address or coordinates, returns a view from 3,000
ft. above, with zoom capabilities
Google
Local for Mobile google.com/glm
Free download
Unique ID associated with your phone
Simplified version of the web-based Local
Search
Emphasis on maps and directions
Point-to-point directions limited to a certain
area
Business listings offer address and phone
number only
Does not support all mobile phones
Google
Video Search
video.google.com
Index of closed captioning and text
descriptions from selected TV and
other video content after Dec. 2004
Results include thumbnail, description,
source, date, duration and hyperlink
Currently hyperlink links to more
description, not to the video itself
Q&A Service: ready reference service
providing answers to fact-based
queries
Google Print’s 2 divisions
Publisher Program and Library Project
Publisher Program
Publishers authorize G. to scan and
make searchable the full text of their
books
Users see only the full page containing
their search terms
Link to purchase copy
Google Print’s 2 divisions
Publisher Program and Library Project
Library Project
Scan and make searchable 15 million
books, in and out of copyright, from
Harvard, Stanford, Oxford, U. Michigan
and NYPL
For works in copyright, users see only a
few sentences around search terms
Users may browse full text of public
domain works
NOTE: Not possible to print ANY material
from either Google Print project
Library Project in 2005
June – Assoc. of American Publishers
questions legality of Library Project
August 15 – G. “temporarily halts”
scanning in-copyright works; continues
scanning public domain works
September 20 – Authors Guild files a
formal complaint against G. in NY
Federal District Court alleging “massive
copyright infringement”
Services Launched
Recently
Icerocket
www.icerocket.com
Results Enhancements
– Thumbnails of home page
– Archived version (Internet Archive)
– Quick View
Full Boolean
Includes Web, Blogs, Multimedia and
News, with unique advanced features
“Blog Trends” tool
MAY be using Google
May become www.blogscour.com
Brainboost
www.brainboost.com
A natural language “answer engine”
Results include “Related Questions” as
well as responses to your query
Queryster
queryster.com
Interface that provides quick scanning
of results from up to 10 engines
– Yahoo, Google, MSN, Ask Jeeves, WiseNut, Teoma, AltaVista,
Amazon, eBay, A9
Executes your search as you click on
the engine
Batch search – executes multiple
queries in each engine
Fresh Google – uses daterange search,
(not reliable)
RedLightGreen
www.redlightgreen.com
120 million titles from the Research
Libraries Group union catalog
Search options
– Boolean
– Phrase
– Author
– Title
– Keyword (Title, LCSH)
– Subject (LC)
– Limits by language and date
Results refined by Related Subjects,
Authors and Language
Reviews of books linked to record
5 Citation outputs available
The Cutting Edge in Search:
Natural Language Processing
Beyond Searching the Full Text:
Natural Language Processing
(aka Text mining, Data mining)
How can we manage unstructured
information?
Current web search engines match query
terms from the full text of downloaded
documents (“bag of words”)
Term frequency, position, page linkage and
popularity and other factors used to create
the final selection and ranking of results.
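A minimal sketch of "bag of words" matching, for illustration only. It ranks documents purely by how often the query terms occur, ignoring word order. Real engines add position, link and popularity signals as noted above. The documents and the function names (tokenize, rank) are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query, documents):
    """Return document ids ordered by total frequency of the query terms."""
    query_terms = tokenize(query)
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))  # the "bag": word order is discarded
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical mini-collection
docs = {
    "d1": "Penicillin is an antibiotic. Penicillin treats infections.",
    "d2": "Aspirin treats headaches.",
    "d3": "The river bank flooded.",
}

print(rank("penicillin treats", docs))
```

Because order is ignored, the queries "penicillin treats" and "treats penicillin" give the same ranking, which is the limitation NLP tries to address.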
Enter Natural Language
Processing (NLP)
With NLP software unstructured text
and data can be processed to reveal
degrees of meaning by
– Extracting terms identified as significant
– Summarizing content
– Discovering relationships among terms and
groups of terms
– HOW???
NLP Extraction
Take all articles from a group of
pharmaceutical journals published in
one year (the “corpus”)
Extraction – Run a relevant controlled
vocabulary (list of all known drugs)
against the corpus
NLP Extraction
Drugs found, number of occurrences
and location in the corpus plus a list of
possible drugs not in the controlled
vocabulary
86>penicillin click for locations
124>tetracycline click for locations
213>aspirin click for locations
Are these also drugs? XXX, XXX, XXX
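The extraction step can be sketched as follows. This is a simplified illustration, not any particular NLP product: the corpus and vocabulary are invented, and the suffix test used to suggest "possible drugs" is a stand-in heuristic assumed for this example.

```python
import re
from collections import defaultdict

def extract_terms(corpus, vocabulary):
    """Map each vocabulary term found to a list of (doc_id, character offset)."""
    hits = defaultdict(list)
    for doc_id, text in corpus.items():
        for term in vocabulary:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits[term].append((doc_id, match.start()))
    return dict(hits)

def candidate_terms(corpus, vocabulary, suffixes=("cillin", "cycline")):
    """Suggest words that look like drug names but are not in the vocabulary.

    Assumption: a shared suffix is used as a crude signal; real systems use
    far richer evidence.
    """
    known = {term.lower() for term in vocabulary}
    found = set()
    for text in corpus.values():
        for word in re.findall(r"[A-Za-z]+", text):
            lowered = word.lower()
            if lowered not in known and lowered.endswith(suffixes):
                found.add(lowered)
    return sorted(found)

# Hypothetical corpus and controlled vocabulary
corpus = {
    "art1": "Penicillin remains common. Amoxicillin is related to penicillin.",
    "art2": "Aspirin and tetracycline were compared.",
}
vocabulary = ["penicillin", "tetracycline", "aspirin"]

for term, locations in extract_terms(corpus, vocabulary).items():
    print(f"{len(locations)}>{term} {locations}")
print("Are these also drugs?", candidate_terms(corpus, vocabulary))
```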
NLP Summarization
Retain phrases surrounding the
extracted term(s) with links to
locations in the corpus (KWIC Index)
rare uses of penicillin
Often penicillin is contraindicated when
responds well to penicillin
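A keyword-in-context (KWIC) index like the one above can be sketched in a few lines. The sample text and the window width of three words are assumptions chosen for illustration.

```python
import re

def kwic(text, term, width=3):
    """Return (left context, keyword, right context) for each occurrence of term."""
    words = re.findall(r"\w+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == term.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines

# Hypothetical text standing in for the corpus
text = ("There are rare uses of penicillin in dentistry. "
        "The infection responds well to penicillin.")

for left, keyword, right in kwic(text, "penicillin"):
    print(f"{left:>20} [{keyword}] {right}")
```

In a full system each line would also carry a link back to its location in the corpus.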
NLP Summarization
Tag all words in the corpus with their
grammatical function and search for
noun – verb – noun and other
syntactic patterns
(drug A) treats (disease B)
(drug C) causes (disease B)
(drug D) is contraindicated in (disease B)
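A rough sketch of searching for drug – verb – disease patterns. A real system would tag every word with a part-of-speech tagger; here small hand-made word lists (hypothetical, for illustration only) stand in for the tagger so the example runs on its own.

```python
import re

# Hypothetical mini-lexicons standing in for a real part-of-speech tagger
DRUGS = {"penicillin", "aspirin", "tetracycline"}
DISEASES = {"pneumonia", "headache", "shock"}
RELATIONS = ["is contraindicated in", "treats", "causes"]

def find_relations(text):
    """Return (drug, relation, disease) triples found in the text."""
    drug = "|".join(sorted(DRUGS))
    disease = "|".join(sorted(DISEASES))
    relation = "|".join(re.escape(r) for r in RELATIONS)
    pattern = re.compile(
        rf"\b({drug})\s+({relation})\s+({disease})\b", re.IGNORECASE
    )
    return [tuple(g.lower() for g in m.groups()) for m in pattern.finditer(text)]

sample_text = ("Penicillin treats pneumonia. Aspirin treats headache. "
               "Penicillin causes shock.")

for triple in find_relations(sample_text):
    print(triple)
```

This only catches relations stated within one short phrase; statements spread across sentences need the reference tracking described on the next slide.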
NLP Term Relationship
Queries answered by tracking
references across sentences
Can penicillin cause shock?
“Penicillin treatment is not without risks. In
certain cases it can trigger anaphylactic
shock.”
NLP can do even more …
Word disambiguation
bank (river)
bank (finances)
bank (verb)
Retrieval of alternative word forms
Retrieval of variants in capitalization
and spelling
Topic detection and tracking
Following different themes in a changing RSS
feed
Machine translation
NLP and Real Life
Early recognition of emerging market
trends and/or competitors
Monitoring content from bio-medical
and other journal literature that grows
faster than the ability of researchers
to read it
Improve relevancy in searches of
content from libraries, publishers and
the Web
The Latest from the Living Web
Weblogs
RSS Feeds - Podcasts
Blogs: What are they?
Online diaries or journals, usually by
one person, though many invite
“comments”
First developed in 1997
Within the same blog, tone can range
from personal musings to discussion
of recent issues in technology and
research
High link-to-word ratio
Often link to other weblogs of similar
content
Blogs: What are they?
Can contain rumor, inside information,
speculation, blatant errors as well as
– Breaking news: political and
technical/research
– Commentary on new software or websites
– Consumer reaction to products or services
Blog authoring tools are basic content
management software, useful in ways
other than online diaries
– Typify the spirit of information sharing that has
fueled the Internet since its beginnings
Today’s Blogosphere
The blogosphere is now over 30 times
as big as it was 3 years ago, with no
signs of letup in growth
As of October 2005, Technorati is
tracking 19.6 million weblogs
The total number of weblogs tracked
continues to double about every 5
months
About one new weblog is created each
second
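The growth figures above can be turned into simple arithmetic. The sketch below uses only the numbers on the slide; the five-month projection assumes the current doubling rate continues, which is an assumption rather than a measurement.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # one new weblog per second -> this many per day

def growth_factor(months, doubling_months):
    """How many times larger a quantity gets if it doubles every doubling_months."""
    return 2 ** (months / doubling_months)

def doubling_time(months, factor):
    """Average doubling time (months) implied by growing factor-fold over months."""
    return months * math.log(2) / math.log(factor)

tracked_oct_2005 = 19.6e6  # figure from the slide

# Assumption: the five-month doubling rate simply continues
print(f"Five months on: {tracked_oct_2005 * growth_factor(5, 5):,.0f} weblogs")
print(f"New weblogs per day at one per second: {SECONDS_PER_DAY:,}")
print(f"Average doubling time implied by 30x in 36 months: "
      f"{doubling_time(36, 30):.1f} months")
```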
Today’s Blogosphere
2% - 8% of new weblogs per day are
fake or spam weblogs
Between 700,000 and 1.3 Million posts
are made each day
http://www.problogger.net/archives/2005/10/17/state-of-the-blogosphere-october2005/
Blogs and Search: Google
blogsearch.google.com and
search.blogger.com
First major engine to offer a blog-specific search (Sept. ’05)
Defines blogs as “sites which use RSS
and other structured feeds and update
content on a regular basis”
Advanced Search features
– Blog title
– Author of post
– Date range
– Language limit
– Safe Search option
Blogs and Search: Clusty
clusty.com
Formerly Vivisimo
Metasearch engine with a blog search
capability
Source engines for blog search
– Blogdigger
– Feedster
– Blogpulse
– Technorati
– Daypop
Blogs and Search: Clusty
clusty.com
Results clustered in topical folders
Source engine given for each result
Date and time of each posting given
Accepts natural language queries
Full Boolean capabilities
Phrase search (“ “)
Limits include
– Domain
– Host
– Number of results
– Source Engine
– Length of search (timeout)
RSS: What is it?
A broadcast version of current
content from a website, blog, news
page or other source (aka “RSS
Feed”)
A live, constantly updated table of
contents with links to the full text,
eg. a feed from NYTimes.com
How do I access RSS feeds?
Sites with RSS feeds display a small icon
(usually orange) labeled RSS or XML or
Atom
As RSS is in XML, may require
downloading reader software (older
versions of browsers cannot read XML).
Sources for reader software include
– www.download.com (search rss reader)
Aggregators allow for reading and
organizing feeds of your choosing
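Because a feed is plain XML, a reader needs very little to turn it into a "table of contents". The sketch below parses a small RSS 2.0 document using Python's standard library. The feed itself is a made-up example (example.com), embedded as a string so nothing is downloaded.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, for illustration only
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed for illustration</description>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/story1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://www.example.com/story2</link>
    </item>
  </channel>
</rss>"""

def read_feed(xml_text):
    """Return the channel title and a list of (headline, link) pairs."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]
    return channel.findtext("title"), items

feed_title, entries = read_feed(SAMPLE_FEED)
print(feed_title)
for headline, link in entries:
    print(f"  {headline} -> {link}")
```

An aggregator does the same thing for many feeds on a schedule and remembers which items have already been shown.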
RSS: Crossing into the Mainstream
Study of 4,000 respondents by Yahoo! and Ipsos
Insight, August 2005
Who is using RSS?
12% were aware of RSS
4% had knowingly used it
27% unknowingly use RSS via
personalized start pages, eg. My Yahoo
Why do they use RSS?
Ease of use
Choice of content
Instant updating capability (only 7% !!!)
RSS: Crossing into the Mainstream
What feeds are they using?
(in order of popularity)
World news
National news
Entertainment
Science and technology
Weather
Local news
http://publisher.yahoo.com/rss/RSS_WhitePaper1004.pdf
MY Yahoo! Ticker
yahoo.com
RSS reader and aggregator
Click on Downloads
Click Deskbar for MS Windows
Choose among 250,000 Yahoo-selected
RSS feeds
News and Stocks Server Options allow
filtering by a list of topics
RSS at Google
www.google.com/reader
Requires setting up a (free) account
Subscribe to any feed of your choosing
Keyword search available for feeds in
Google’s database
RSS feeds available for Google News
Folders (labels) available for grouping
feeds of similar content
Sort feed items by date and relevance
Podcasts 101
iPod + broadcast = podcast
Downloadable audio or video files which
can be played on many devices
PC, home systems
Mobile (iPods, cars, MP3)
“Broadcast” by means of RSS
Not limited to Apple’s iPod or MP3 format
Often embedded in weblogs with RSS
feeds
As with any Living Web (RSS) content
podcasts can go offline; may or may not
be archived
Podcasts 101
Development
Ease of publication (cheap storage, MP3 format)
Ease of subscription (RSS 2.0)
Ease of use (iPod, other mobile audio devices)
Create files (audio parameters apply!)
Publish files
– From iTunes.com
– From any web site capable of supplying
content via RSS (Most blogs do)
Podcasts 101
Subscribe to files via the URL for the
RSS Podcast feed (Red RSS or XML
button)
Podcatcher – Freeware that receives
and organizes podcasts (an RSS
aggregator for podcasts).
Available at podcatcher.rubyforge.org
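What a podcatcher looks for in a feed is the enclosure element that RSS 2.0 attaches to an item. A minimal sketch, assuming a made-up feed (example.com) embedded as a string, might look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical podcast feed, for illustration only
SAMPLE_PODCAST_FEED = """<rss version="2.0">
  <channel>
    <title>Example Lectures</title>
    <item>
      <title>Lecture 1</title>
      <enclosure url="http://www.example.com/audio/lecture1.mp3"
                 length="1234567" type="audio/mpeg"/>
    </item>
    <item>
      <title>Course announcement (text only)</title>
    </item>
  </channel>
</rss>"""

def list_episodes(xml_text):
    """Return (title, media URL, media type) for each item that has an enclosure."""
    root = ET.fromstring(xml_text)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None:  # items without media are ordinary posts
            episodes.append(
                (item.findtext("title"), enclosure.get("url"), enclosure.get("type"))
            )
    return episodes

for title, url, media_type in list_episodes(SAMPLE_PODCAST_FEED):
    print(f"{title}: {url} ({media_type})")
```

A real podcatcher would then download each new media URL and pass the file to a player or portable device.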
Podcasts in Higher Education
Drexel (Chemistry)
– Lectures podcasted; class time used for
problem solving
Duke (Computer Science)
– Students required to listen to podcasts on
related topics not covered in class
U. of Hawaii (Computer Science)
– Intro class of 600; lectures podcasted;
“Listen to them when you have the time”
Video podcasting
“Vodcasts”
June, 2005 - Apple iTunes begins to
support video podcasting
Can provide supplemental multimedia
content as part of a course, or public
relations initiative
With Web cams, DV cameras and
vodcasting, we may be headed toward
the democratization of video content
60gb Video iPod now available
Arstechnica.com/news.ars/post/20050915-5313.html
Podcasting and Search
Many podcasts are embedded in blogs
Google blog search tool
blogsearch.google.com
(subject) podcast
Main Google search still text-based;
a search for rock filetype:mp3 returned 124 results on 11/28/05
Blog-based search engine with media
search: www.blogdigger.com/media/
Podcasting and Search
Podcast Directories and Catalogs
www.podcast.net
Directory of over 15,000 podcast feeds
Searchable by Title & Description, KW, Host (Author),
Location and Episode
www.odeo.com
Searchable catalog of mp3 podcasts
Updates every 3 hours
Offers text snippet of the latest podcast from each
feed
www.podcastdirectory.com
www.podcastshuffle.com
Trends and Future
Possibilities
Search Today …
“Mass Media” as “My Media”
Podcasting, iTunes
Blogs
RSS
“Search is no longer about a text-based web
index. It’s about a person’s interface to the
world” -- SEO executive
Enhancing search through context and
user personal profiles
– My Yahoo!
– Google Personalized Search
Search Today …
Federated search (single-point
access, enterprise applications)
The Desktop “without walls”
Unstructured and structured data
Internal, personal sources and WWW
XML makes this possible
– “Middleware” layer with modules that
acquire, manage, retrieve and rank text, data
and multimedia from a variety of sources and
formats
Search tomorrow ???
Search will become more
Sophisticated
Individualized
Portable
Specialized (vertical, subject-specific services)
Voice recognition, GPS and mobile,
local search will grow
“Where can I find the best bargains on this in the
area?”
“Where is the nearest pizza parlor and how do I
get there from here?”
Thank You and
Good Luck!!
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552
[email protected]