Online information seeking
behaviors and search strategies
The Internet and the Web
• The Internet has been around since the 1960s.
– First major use – e-mail
– Content was hard to find unless someone told you
where to look.
– Archie (1990) and Gopher (1991) – tools to find files
online and retrieve them from open ftp sites
• The Web has been available since approximately
1994.
– Mosaic – web browser – 1993
– Lycos, Yahoo – 1994
– Google – September 1998
• See http://www.seoconsultants.com/searchengines/history/#SEH1990 for search engine history
Changing emphasis
• “Microsoft, Yahoo and Google say they
are innovating because people's
expectations for a search engine are far
higher than they were even five years
ago. People no longer search for a Web
site; now they expect to find a specific
piece of information, like the cheapest
airfare to Chicago.”
Updated: 11/02/2009 02:45:48 AM PST
http://www.mercurynews.com/breaking-news/ci_13679086?nclick_check=1
How do you search for something
• For information about a general topic?
• For a specific fact or data item?
• Let’s see
– Form groups of 2 (or 3 if needed)
– Search for
• Reliable information about the H1N1 flu vaccine
• The number of people who died in Philadelphia
from the 1918 flu epidemic
– Work together. Notice what you do and be
prepared to describe your strategy.
How did you do?
• Let’s review the strategies
– Were they different for the two tasks?
– Who found the best results?
• How did you get there?
– Who had the most problems?
• What caused your difficulties?
• Can we find some ideas for future
success?
Looking at other sources
• Some papers on information finding (and
re-finding), and on how search engines
work.
• Why this topic?
– The Web is the ultimate digital library! If we
understand how to get the best results from
the Web, we can use some of those strategies
to make our digital libraries more effective.
– As we understand the issues faced in web
search, we may have a better understanding of
why targeted digital libraries have a place.
The Web Search
• Three Distinct Phases:
– Crawling
– Indexing
– Searching
• Each has specific challenges to address
Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer.
June & August 2006
Crawlers
• Basic process
– Open an HTML page that has at least one anchor
tag
• (<a href="….."> link description </a>)
– Send HTTP request to the site and receive the
page.
– Parse the page, looking for other anchor tags
– Place anchors on a queue for further processing
– Submit the actual page for indexing and storing
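A minimal sketch of the basic crawl loop above, assuming a tiny in-memory stand-in for the Web (the FAKE_WEB dictionary, the example.org URLs, and the fetch function are all hypothetical) so that it runs without network access. A real crawler would send an HTTP request where fetch reads the dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collects the href value of every <a> tag found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the Web: URL -> HTML text.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B again</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
}


def fetch(url):
    """Pretend HTTP request: return the page text, or None if unreachable."""
    return FAKE_WEB.get(url)


def crawl(seed):
    """Fetch pages breadth-first, queueing every anchor not seen before."""
    queue = deque([seed])
    seen = {seed}
    fetched = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = AnchorParser()
        parser.feed(html)                      # parse, looking for anchor tags
        for href in parser.links:
            absolute = urljoin(url, href)      # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)         # place anchor on the queue
        fetched.append(url)                    # here the page would go to the indexer
    return fetched


pages = crawl("http://example.org/")
```

The seen set is what keeps the crawler from looping forever when pages link back to each other.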
Indexers
• Scanning
– “For each indexable term … the indexer writes a
posting consisting of a document number and a
term number to a temporary file.”
• Parse this sentence: What is an indexable term?
Posting? Document number? Term number? What does a
posting look like?
• Invert the file
– Sort by term, secondarily by document number
– Record start location and list length for each term
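One way to picture scanning and inversion, as a sketch only. The three tiny documents, the tokenizing rule, and every function name here are assumptions for illustration; a Python list stands in for the temporary file.

```python
import re

# Hypothetical mini-collection: document number -> text.
DOCS = {1: "flu vaccine safety", 2: "flu epidemic 1918", 3: "vaccine schedule"}


def scan(docs):
    """Write one (term number, document number) posting per indexable term."""
    term_numbers = {}   # term -> term number, assigned on first sight
    postings = []       # stands in for the temporary file
    for doc_no, text in docs.items():
        for term in sorted(set(re.findall(r"[a-z0-9]+", text.lower()))):
            term_no = term_numbers.setdefault(term, len(term_numbers))
            postings.append((term_no, doc_no))
    return term_numbers, postings


def invert(postings):
    """Sort by term, then document; record start and length for each term."""
    ordered = sorted(postings)
    dictionary = {}     # term number -> (start location, list length)
    for position, (term_no, _doc_no) in enumerate(ordered):
        start, length = dictionary.get(term_no, (position, 0))
        dictionary[term_no] = (start, length + 1)
    return ordered, dictionary


term_numbers, temp_file = scan(DOCS)
inverted, dictionary = invert(temp_file)


def docs_for(term):
    """Return the document numbers in the postings list for one term."""
    term_no = term_numbers.get(term)
    if term_no is None:
        return []
    start, length = dictionary[term_no]
    return [doc_no for _term_no, doc_no in inverted[start:start + length]]
```

Here a posting is just a pair of numbers, which is one possible answer to the "what does a posting look like?" question above.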
Searching (Query Processing)
• Look up query term in term dictionary
• Get the postings list
• Find documents that match all search terms
– Find documents for each term and merge lists where
common documents occur
• Rank documents and report
– As many as required or until end of the list
• Still possible to find a result on one search and not find that
same item on a subsequent search of the same terms
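The lookup-and-merge step can be sketched as follows, assuming each postings list is already sorted by document number. The INDEX contents are made up, and ranking is left out: document order stands in for it here.

```python
# Hypothetical term dictionary: term -> sorted list of document numbers.
INDEX = {
    "flu": [1, 2, 5, 8],
    "vaccine": [1, 3, 5, 9],
    "h1n1": [5, 8, 9],
}


def intersect(a, b):
    """Merge two sorted postings lists, keeping only the common documents."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def search(query, index, limit=10):
    """Return up to `limit` documents that contain every query term."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)            # start with the shortest list
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result[:limit]


hits = search("H1N1 flu vaccine", INDEX)
```

Starting from the shortest list keeps the intermediate results small, which matters when some terms appear in millions of pages.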
Expanding from the basics
• Each of the phases of web searching is
simple in concept, but complicated by the
sheer magnitude of the task.
• The same ideas applied on a smaller scale -- in a company intranet, for example -- can be
done efficiently.
• The Web presents special challenges.
Crawling
• A single machine running a simple crawling
algorithm would not do well in finding all
Web pages.
• Large data centers
– Redundancy and fault tolerance
– Parallel operation
– (SIGCSE talk by Marissa Mayer of Google)
Crawling reality
• Speed - amazing numbers:
– @ 1 sec per HTTP request, max 86,400 per day = 634 years for 20 billion pages
• Politeness – avoid overwhelming web servers
• Excluded content
– Robots.txt
• Duplicate content
– Identifying duplicates can be tricky - why?
• Continuous crawling
– Keeping current
– Priority queue for crawling schedule - why?
• Spam
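A quick check of the speed arithmetic, as a sketch. The 634-year figure follows from 86,400 requests per day, which is one request per second on a single machine. The function name and the machines parameter are assumptions added to show why parallel operation matters.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def crawl_years(pages, seconds_per_request, machines=1):
    """Years needed to fetch `pages` pages, one request at a time per machine."""
    pages_per_day = machines * SECONDS_PER_DAY / seconds_per_request
    return pages / pages_per_day / 365


WEB_PAGES = 20_000_000_000

years_single = crawl_years(WEB_PAGES, 1.0)                      # about 634 years
years_parallel = crawl_years(WEB_PAGES, 1.0, machines=10_000)   # well under a year
```

Doubling the request rate only halves the time, so the scale of the Web pushes crawling toward many machines running in parallel.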
Indexing large collections
• The Web is the ultimate “large collection”
• “Estimating 500 terms in each of 20 billion
pages” --> 10 trillion entries!
• Divide and conquer, as the crawler did
– Each indexer builds a partial file in memory
– Stops when memory is full
– Write to disk, clear memory, and start over
• Merge the partial files to make the full
index
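The build-partial-files-then-merge idea can be sketched like this. Lists stand in for disk files, and the memory limit of three postings is an arbitrary assumption to keep the example small.

```python
import heapq


def build_runs(postings, memory_limit):
    """Collect postings until 'memory' is full, then sort and set the run aside."""
    runs, buffer = [], []
    for posting in postings:
        buffer.append(posting)
        if len(buffer) == memory_limit:
            runs.append(sorted(buffer))   # "write to disk"
            buffer = []                   # "clear memory, and start over"
    if buffer:
        runs.append(sorted(buffer))       # last, partly filled run
    return runs


def merge_runs(runs):
    """Merge the sorted partial files into one fully sorted index."""
    return list(heapq.merge(*runs))


# Hypothetical (term number, document number) postings in arrival order.
postings = [(2, 1), (0, 1), (1, 1), (0, 2), (3, 2), (2, 3), (1, 3)]
runs = build_runs(postings, memory_limit=3)
full_index = merge_runs(runs)
```

heapq.merge reads the runs lazily, so the merge itself never needs all postings in memory at once.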
Data structures for indexing
• Trees, tries, hash tables
– Various ways to organize the terms for easy lookup
• Numbers of terms
– More than all words in all languages
– Acronyms, proper names, etc.
– Must deal with common phrases also
• Separate index entries (postings) for common word
combinations
• Compression
– Saves space; adds compression and decompression work, but fewer disk reads can still speed up processing
• Anchor text -- fie on those who use “click here”!!
• Link popularity score
– Give a score to a page based on popularity, also on
query-independent factors.
– Think about the implications of this.
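As an illustration of the compression point, here is one common textbook approach, not necessarily what any search engine uses: store the gaps between document numbers and write each gap in as few bytes as possible. It assumes the document numbers in a postings list are in increasing order; the byte layout is a choice made for this sketch.

```python
def encode(doc_numbers):
    """Gap-encode an increasing list of document numbers into bytes."""
    out = bytearray()
    previous = 0
    for doc in doc_numbers:
        gap = doc - previous
        previous = doc
        while gap >= 128:
            out.append(gap & 0x7F)   # low 7 bits, more bytes follow
            gap >>= 7
        out.append(gap | 0x80)       # high bit marks the last byte of a gap
    return bytes(out)


def decode(data):
    """Rebuild the document numbers from the gap-encoded bytes."""
    docs, current, value, shift = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:              # last byte of this gap
            value |= (byte & 0x7F) << shift
            current += value
            docs.append(current)
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return docs


postings = [3, 7, 300, 301, 100000]
encoded = encode(postings)
```

Small gaps fit in one byte each, so long postings lists for common terms shrink the most.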
Query Processing
• Most queries are short, do not provide much context
• Result quality -- use some of the techniques from
information retrieval
– Once a preliminary list of responses is obtained, treat that
as the collection and use IR techniques to improve the
quality of the response.
• Some limitations. No way to judge how complete the initial list is.
– Techniques are part of the trade secrets of the companies
• Speeding things up:
– Skipping
– Early termination
– Document numbering
– Caching
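Two of these speed-ups, early termination and caching, can be sketched together. This assumes (hypothetically) that each postings list is stored best-first by a query-independent score, so the first matches found are also the best ones. The INDEX contents and function names are made up.

```python
from functools import lru_cache
from itertools import islice

# Hypothetical postings, ordered best-first by a query-independent score.
INDEX = {
    "flu": (5, 1, 8, 2),
    "vaccine": (5, 9, 1, 3),
}


def matches(terms):
    """Yield documents containing all terms, in the first list's order."""
    first, *rest = (INDEX.get(term, ()) for term in terms)
    other_sets = [set(postings) for postings in rest]
    for doc in first:
        if all(doc in docs for docs in other_sets):
            yield doc


@lru_cache(maxsize=1024)             # caching: repeated queries skip the work
def top_k(query, k=2):
    """Return the first k matching documents and stop scanning after that."""
    terms = tuple(query.lower().split())
    if not terms:
        return ()
    return tuple(islice(matches(terms), k))   # early termination


first_answer = top_k("flu vaccine")
second_answer = top_k("flu vaccine")   # served from the cache
```

Early termination is only safe because of the best-first ordering assumption; with postings in arbitrary order, stopping early could miss better results.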
So what to do with the
information found?
• Use it. Sure. Then what? Will I ever need
that information again? Do I use it a lot?
Should I retain it somehow so it will always
be available? Or should I just figure on
searching again?
• What is your approach? Does it vary with
the kind of data? What are your decision
criteria?
Resource
• Jones, William and Jaime Teevan. Personal
Information Management. University of
Washington Press. 2007
• Nice set of papers related to finding,
keeping, and organizing information
• The book has a theme running through it –
a specific event that requires a number of
people to obtain and use information. The
papers address various aspects of meeting
that need.
The unifying theme
• A set of characters with a need to
accomplish a specific activity
– Characters have distinct characteristics in
terms of the way they organize and interact
with information
– The task is to organize a surprise birthday party
for one of the characters, with appropriate
roles and interactions among the other
characters
The characters
• Alex, male, 27
– Software developer at startup company
– “spontaneous, dynamic, chaotic”
– “Job is too unpredictable and fast changing for … much point to filing information.”
– Lots of unsorted piles, 2000 messages in inbox
• Brooke, female, 23 (Alex’s sister)
– Securities analyst
– Very well organized, especially work info.
– Takes action immediately on all new information received -- email or other
• Connie, 58, mother of Alex and Brooke
– Prides herself on being organized, mostly paper based.
– Has been ill, papers have piled up and organization deteriorated.
• Derek, male, 23
– Engaged to Brooke
– Relies on tablet PC
– Would like to banish paper, but it still comes in
• Edna, female, 74
– Retired, owns a lot of real estate (was real estate broker)
– Almost all paper-based information (PC does not work)
– Prefers to call or write actual letters to communicate.
– No children, close friend of Connie, honorary grandmother of Alex and Brooke
• Felicia, 20, Derek’s sister
– In college, interested in music, photography
– Uses laptop for communication, organizing of digital pictures
– Has a lot of older print pictures, photo CDs
The theme activity
• Planning a surprise 75th birthday party
for Edna.
• Out of town guests will need hotel rooms
• Edna’s favorite restaurant will be the site.
• Maybe they will do a photo album.
Paper: How People Find Personal Information, in the book Personal Information
Management
Finding information
• Opening scenario - Alex needs to find the
phone number of his grandmother’s favorite
restaurant and make a reservation.
– He is the organized one.
– This does not fit a category that he uses.
• If he knows the name of the restaurant - easy
• Otherwise, knows he has seen it somewhere -- how to
find it again
– REFINDING - a particular category of information finding.
Related to “keeping” - a subject to come later
Finding - a multi-stepped process
• Importance of browsing
• Common triggers and stop conditions
• Users prefer to find information by orienteering -- using small steps guided by their knowledge of
the local context -- rather than by search -- a sudden
jump to the destination.
• Scenario - Alex knows the restaurant name is in
an email from his sister. Could do a search in the
email client. Instead, goes to a folder, sorts to
find all the mail from Brooke, then browses.
Is that what you would do? How do you look for information
that you believe is in an email message?
Why orienteering?
• Quality of search tools?
– Studies show that offering improved search
tools does not noticeably affect the way people seek information.
• Side benefits?
– Orienteering (or navigating) provides a broader look at the
information space. Not only do you find what you are looking for,
you also see what else is around it that might also be of interest in
the current task.
• Distinction between recognition and recall
– Navigating allows use of recognition within context, which may be
easier or more comfortable than recall of the right search terms
to use.
Note always that individuals differ.
Files and piles
• Relating approaches to finding in physical
spaces
– Filers -- more comfortable with organized systems,
visible structures
– Pilers -- more comfortable with loose structures,
less formal organization.
• Characteristics carry over to approaches to
finding digital information.
• Direct connection between ways of organizing
information and ability to refind.
Refinding -- different from initial
discovery
• Finding something seen before is different
from the initial discovery activity
– Know more about it -- metadata that may aid in
locating it
• Author, title, date created, URL, color, style of text, etc.
• Knowing that Brooke had emailed the name of the
restaurant triggers a memory of the subject of the email,
for example
– Particularly important meta information:
• People associated with it, the path taken to find it originally,
temporal aspects.
– Some research finds time so important that some
argue chronological ordering should be the default ranking
Factors related to Re-finding
• Initial encounter with the information
provided some experience that will
influence re-finding
– Elapsed time since prior encounter will
influence value of that experience
– Expected future value will influence how well it
is remembered
– Similarity of initial reason to have the
information and the reason for the new access
influence the connection between the prior
and current experience.
Re-finding related to keeping,
organizing
• Studies about how people re-find
information on the Web show preference
for strategies that do not involve any
advance planning or keeping.
• Yet, people do spend time preparing for
future access.
• Shown: Pilers prefer to orienteer with small
steps while filers are more likely to use
search tools to jump directly to, or close to,
the target
Judging value
• Information is easier to re-find if it was
recognized as important the first time it
was seen.
– What do you do to recognize the potential
future use of information
• In email?
• In web sites?
• In other information sources?
– Post-valued recall -- recognizing the value of
previously encountered information
• Some people e-mail information to themselves
– Have you ever done this? What does it accomplish?
• Lack of knowledge of future importance makes it
harder to store and organize information effectively
Information fragmentation
• On how many different devices do you store
electronic information?
– Phone, PDA, desktop computer, laptop …
– How do you recall what is where?
• Do you have any kind of overall index?
• Do you ever lose something entirely because you cannot
recall where it is stored?
• Do you use online sites such as Google Docs to make files
accessible from a variety of places?
– What are the pros and cons of that approach?
– How do you handle multiple e-mail accounts? (Do you?)
– How do you know that this version is most recent?
Naming conventions help or hinder.
• How do you name the versions of a file?
– Cathy Marshall study at Microsoft
Paper: How People Keep and Organize Personal Information in the book, Personal
Information Management
Information Keeping
• People keep things -- including
information -- for a variety of reasons
– Expected future need
– Reminder of an experience, usually
pleasant, but perhaps something
significant that should not be forgotten
(VT April 16 collection -- see
http://www.vt.edu/remember/)
– Increasing amounts of information available,
but it is hard to know what to keep
Define: Information keeping
• Decision-making and actions relating to the
information item currently under consideration
that impact the likelihood that the item will be
found again later. Decisions can range from:
(1) “ignore, this has no relevance to me”; (2)
“ignore, I can get back to this later”… (3) “keep
this in a special place or way so that I can be
sure to use this information later.”
• This is the keep or don’t keep decision, not
related to how to keep anything.
Quoted from How People Keep and Organize Personal Information in
Personal Information Management
Define: Information organizing
• Decision-making and actions relating to the selection and
implementation of a scheme of organization and representation
for a collection of information items. Decisions can include: (1)
How should items in this collection be named? (2) What sets of
properties make sense for and help to distinguish the items in
this collection? (3) How should items within this collection be
grouped? Into piles or folders?
• Note the movement from an item to a collection as we talk
about keeping and organizing
– “Keeping” response is triggered frequently by ordinary
events.
– “Organizing” response is less often triggered
• What triggers the impulse to organize?
Quoted from How People Keep and Organize Personal Information in
Personal Information Management
Define: Information Maintaining
• All decisions and actions relating to the
composition and preservation of a personal
information collection. Decisions involve what
kind of new items go into a collection, how
information in the collection is stored (Where, in
what formats? In what kind of storage? Backed
up how?) and when do older items leave the
collection (e.g. When are they deleted or
archived?)
• A mixed blessing -- the Apple migration (with
firewire) when a new machine replaces an old
one.
– Easily obtain an exact copy of the old disk system.
• Is this good, bad, some of both?
Keeping decisions: Multifaceted
and Error-prone
• Some sorting attributes for paper items
– Title, author
– Disposition (discard, keep, postpone)
– Order scheme (group, separate, arrange)
– Time (duration, currency)
– Value (importance, interest, confidentiality)
– Cognitive State (don’t know, want to remember)
• Heavily influenced by anticipated future use
Other approaches to keeping
• Collection building, independent of expected
future use
• Packrat
• Legacy
• What do you do with something you do not
intend to use again? Do you get rid of it or just
put it aside? How much effort is required to
make that decision?
– Alex and the business card scenario
Organizing
• Little research on how the same person organizes
different forms of information
• Some results
– People do not take time to assess their organization
– People complain about needing to organize separate
types of information and the resulting fragmentation
– People are not consistent about the approaches they
take, using different schemes on different days
– Some people go to great lengths to consolidate types
of information -- sending documents by email or
storing email in file folders, for example.
Structures
• Making sense of organization includes both
internal representations and external
representations
– Internal representation requires a cognitive
connection, an understanding of where each
information item fits into a larger scheme and how
it will be retrieved later
– External representation is a translation of the
internal understanding of the structure needs of
organization into a realization that can be seen and
used.
My definitions, not from the author
Features that would be useful for
organization
• A manual ordering of folders
– People force this by strange folder names (AAA…)
• An ability to set reminders, due dates, and other
task-like properties on folders
– Subfolders often correspond to tasks, but cannot be
treated like tasks
• An ability to add notes
– Some people add a notes document to a folder
• An ability to use and reuse structures
– If an organization of a folder or directory is useful for a
variety of activities, it would be nice to be able to
reconstitute its structure, ready for new particulars. For me, an ABET visit, for example.
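The "reuse structures" wish can be approximated today with a few lines of scripting. This is a sketch only: it copies the folder skeleton and none of the files, and the folder names in the demo are made up.

```python
import os
import tempfile


def clone_structure(template_dir, new_dir):
    """Recreate the folder tree of template_dir under new_dir, without files."""
    created = []
    for current, _subdirs, _files in os.walk(template_dir):
        relative = os.path.relpath(current, template_dir)
        target = os.path.normpath(os.path.join(new_dir, relative))
        os.makedirs(target, exist_ok=True)
        created.append(target)
    return created


# Demo in a throwaway directory, with hypothetical folder names.
with tempfile.TemporaryDirectory() as base:
    old = os.path.join(base, "visit_old")
    os.makedirs(os.path.join(old, "reports", "drafts"))
    os.makedirs(os.path.join(old, "evidence"))
    with open(os.path.join(old, "reports", "notes.txt"), "w") as handle:
        handle.write("last cycle's particulars")

    new = os.path.join(base, "visit_new")
    clone_structure(old, new)

    structure_copied = (
        os.path.isdir(os.path.join(new, "reports", "drafts"))
        and os.path.isdir(os.path.join(new, "evidence"))
    )
    files_copied = os.path.exists(os.path.join(new, "reports", "notes.txt"))
```

The new tree has the same folders, ready for new particulars, while the old files stay where they were.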
Final reflections
• How does understanding how people treat
the information they find influence your
organization and presentation of your
digital library? Are there services that you
might add (if you can) in order to meet the
needs of the user?
• Consider the ACM DL option of binders. Is
that something useful? Would it be a
benefit to your users? How would you use
that option or something similar?