Transcript Slide 1

Reading Cyber Tracks:
Analyzing Log Files and Search Logs
Darlene Fichter
Data Library Coordinator, U of S Library
January 29, 2004
Overview
Log files
– How can log files help?
– Getting up close and personal with log files
– 7 things my server logs told me
– Error logs
Search logs
– Content synopsis
– Site search performance
– Intranet usability
– “Best bets”
Why super heroes read log files
Macro picture
Rich source of information about
User behaviors
Link choices
Typical paths through a web site
Point out trouble spots
Help inform redesigns
Detractors
The least useful type of data for understanding
users
Doesn’t measure outcomes
We don’t know the intent of the visitor
Hits are meaningless
True
Not true, if you’re estimating service capacity
and performance
Imprecise and incomplete
Server logs can tell you
Who is using your site?
Who never uses your site?
Where do they enter?
What route do they follow?
What do they use?
How long do they stay?
Big picture
Average number of visits per day on weekdays: 6254
Average number of hits per day on weekdays: 110437
Average number of visits per weekend: 8157
Average number of hits per weekend: 113500
Most active day of the week: Mon
Least active day of the week: Sat
Most active date: December 04 2003
Number of hits on most active date: 173527
Least active date: December 25 2003
Number of hits on least active date: 38078
Most active hour of the day: 14:00-14:59
Least active hour of the day: 01:00-01:59
Page duration
How long do most people spend on a page
– Inordinately long time could mean
Very confusing
Very worthwhile
Went for coffee?
– Skip averages and look for the mode or median
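A minimal sketch of why the median or mode is the safer summary. The durations are made-up sample values (not from the U of S logs); one visitor is assumed to have left the page open for an hour.

```python
from statistics import mean, median, multimode

# Hypothetical page-view durations in seconds; the last value is the
# "went for coffee" visitor who left the page open for an hour.
durations = [12, 15, 15, 18, 22, 25, 30, 3600]


def summarize(seconds):
    """Return mean, median and mode(s) so they can be compared side by side."""
    return {
        "mean": mean(seconds),
        "median": median(seconds),
        "modes": multimode(seconds),
    }


summary = summarize(durations)
print(summary)
```

The single long visit drags the mean far above what a typical visitor does, while the median and mode stay close to the bulk of the values.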
Exit pages
The point where someone leaves your site,
may offer some interesting clues
Related links – fine
Find an article page listing databases
– A caveat to keep in mind
• Use of the back button may not show up when pages
are loaded from the browser cache
Forms
What is the completion rate for forms?
How many people abandon the ILL loan
process part way through?
Forms
Does your system for marking “required” fields work,
or are people presented with error upon
error on submission?
Are employees entering bogus responses
in form fields to circumvent bad design?
What can you measure?
Depends on what is recorded in the log file
Web server access log files
– ASCII file that records each request
– Two common web server log file types
• Common
• Combined (more data)
Example: Apache combined log format
Who and when?
IP address or hostname
Identity or login (seldom used)
Username recorded
Date, time
What did they ask for? Did it work?
Method
Path
Protocol (HTTP) and version
Status code
Status codes
In general
– 200s are successful requests by a client
– 300s report server redirects
– 400s are used for client errors
– 500s are used for server errors
404
Page immediately before this request
Bytes transferred
Referring site
User Agent: Browser and OS
Browser – Mozilla
OS – Windows NT
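The fields walked through above can be pulled out of a log line with a short script. This is a sketch, not a replacement for a log analysis package: the regular expression assumes the standard Apache combined layout, and the sample line (including its IP address, byte count and user agent) is invented for illustration.

```python
import re

# Apache "combined" layout:
# host ident user [time] "method path protocol" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line is not in combined format."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the bytes field means no body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


def status_class(status):
    """Map a status code to the broad families: 200s, 300s, 400s, 500s."""
    families = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    return families.get(status // 100, "other")


# Hypothetical sample line, made up for this sketch.
sample = (
    '192.0.2.10 - - [27/Jan/2004:03:08:11 -0600] "GET /data?top HTTP/1.1" '
    '200 5120 "https://example.org/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)
record = parse_line(sample)
print(record["host"], record["path"], status_class(record["status"]))
```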
Log analysis software
Produce summary tables, charts and graphs
Popular ones are:
– WebTrends (commercial, Windows, Unix)
– Analog (free, Unix, Windows)
– Wusage (free, Unix, Windows)
– Many more
• Yahoo Log Analysis Tools > Titles
Sample: Top domains chart
Sample: Summary top files requested
Meaningful filenames rather than id=1232 help make this report understandable
What your logs can tell you, if you listen…
Specific areas where logs are useful
Specific examples
How visible are your links and menus?
Are you tuning your site? Is the new button or
label working?
Is anyone clicking on the special
announcement information?
Run a special report and see what links are
used the most on your home page
Redesign of E-Journal page
Subject browse was #2.
Redesign of U of S home page
Help was removed.
Homepage Clickthrus: http://www.usask.ca/analog/homepage/
Redesign of U of S home page
1. Departments
2. Search
3. PAWS
4. Students
5. Admissions
Redesign of Health Sciences Library page
Home page clickthrus used to set priority order.
Before and after
Does the new top menu work?
ClickTracks [www.clicktracks.com] displays all the links on the
page and the % of visitors that click on each one.
Digging for evidence
Are people able to get from here to there?
Specific example
– Evaluating a site wide menu
Trying to make the case that generic terms rather
than “brand names” were more effective
Team response was polite nods
Log files to the rescue
Looked up how many people actually
selected this area from the home page based
on the brand name label rather than the generic
term.
Possible because the links had different
syntax
Tip: Add tracking code to the end of a link
http://library.usask.ca/data?top
Log file: - [27/Jan/2004:03:08:11 -0600] "GET /data?top
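One way to tally such tagged links once the request paths are out of the log, as a sketch. The "top" tag and the /data page come from the tip above; the "side" tag and the list of requests are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_link_tags(paths, page="/data"):
    """Count requests for `page`, grouped by the bare tag after the '?'."""
    counts = Counter()
    for path in paths:
        parts = urlsplit(path)
        if parts.path == page:
            # Requests with no tag are kept in their own bucket.
            counts[parts.query or "(untagged)"] += 1
    return counts


# Hypothetical request paths; only "/data?top" appears in the original tip.
requests = ["/data?top", "/data?side", "/data?side", "/data", "/about?top"]
tag_counts = count_link_tags(requests)
print(tag_counts.most_common())
```

Each link to the same page gets its own tag, so the counts show which label or position visitors actually chose.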
We discovered
A quick glance at the log file revealed that in the
prior two days
– 200 accesses resulted from the brand name label
– 1000 accesses resulted from the generic term in a less
prominent location
Where do you post announcements?
Need to get everyone’s attention
– Branch closure
– Pay fines now in order to convocate
Not everyone enters your site at the home
page
– Find the entry pages
Top entry pages
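Log analysis packages produce this report for you. As a rough sketch of the idea behind it: group hits by visitor, and treat the first hit after a quiet period as an entry page. The 30-minute gap, the visitor ids and the sample hits are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed cutoff between two visits


def entry_pages(hits):
    """hits: iterable of (visitor_id, datetime, path). Returns a Counter of entry paths."""
    last_seen = {}
    entries = Counter()
    for visitor, when, path in sorted(hits, key=lambda hit: hit[1]):
        previous = last_seen.get(visitor)
        # First hit ever, or first hit after a long pause, starts a new visit.
        if previous is None or when - previous > SESSION_GAP:
            entries[path] += 1
        last_seen[visitor] = when
    return entries


# Hypothetical hits for two visitors, "a" and "b".
start = datetime(2004, 1, 27, 9, 0)
hits = [
    ("a", start, "/"),
    ("a", start + timedelta(minutes=5), "/data"),
    ("b", start + timedelta(minutes=1), "/ejournals"),
    ("b", start + timedelta(minutes=10), "/"),
    ("a", start + timedelta(hours=2), "/ejournals"),
]
top_entries = entry_pages(hits)
print(top_entries.most_common())
```

In this sample the home page is an entry page only once; the other two visits start deeper in the site, which is where an announcement would also need to appear.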
What’s hot and what’s not?
What areas or pages are popular?
How is it changing over time?
Popular may ≠ good
– Custom 404 pages are often #1 on a site with link rot
– High use may mean people are lost, if your site doesn’t
have a followed link colour
Link rot?
http://www.bio.cornell.edu/stats/01/07/default_01_b.htm
Top directories
Popularity questions
What’s popular but shouldn’t be?
– Overdependence on site search may signal site
navigation weaknesses
What should be popular but isn’t?
– If you expect high usage and it’s not happening, recheck
links, labels and position. Is the link to the underutilized area
prominent? Is it plain language or jargon?
Does anyone care?
Are we posting new announcements
and no one reads them (ever)?
Are the only hits from search engines
spidering the site?
What should we add more of?
Is a feature used?
After a debate, quick links and audience menus were added to the site.
Quick links – very popular #3 and #5
Audience menus
Over time the “student” option on the
audience menus has increased
Getting down to the details
When can you move to CSS layouts?
When can you downgrade support for Netscape 4.78?
What web browsers do you need to support?
[Chart: Microsoft Explorer Browsers. Percent of total hits by version (series: Explorer 1.x, 2.x, 3.x, 4.x, 5.x, 6.x, 7.x), Mon 12/01/2003 - Wed 12/31/2003 (1 month scale)]
Cross platform testing
Table
Retrace someone’s footsteps
What page referred them to the library site?
No referrer? Bookmark, typed in URL (or a robot)
What path did they follow?
Sometimes even what link they clicked
What data they may have typed in a search
box?
Where did they leave?
Log analysis tools – “top paths”
A sad tale
http://www.bio.cornell.edu/stats/01/07/default_01_b.htm
Paths
An even sadder tale
Or a programmer doing debugging?
Follow the top paths
Pay attention where they stopped and
restarted
No direct links from one area to another, may
indicate they used their back button
Error logs
Usually well used by development teams
Only touch on a few points
Error log captures
Date
Error level
Client IP address or hostname
Error message or path to requested file
[Wed Jan 28 00:15:26 2004] [error] [client 24.69.255.237] File does not exist: /data/www/northwest/images/spacer.gif, referer: http://library.usask.ca/northwest/contents.html
[Wed Jan 28 00:16:30 2004] [error] [client 66.77.73.89] File does not exist: /data/www/education/chldawrd.html
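A sketch of splitting entries like these into date, error level, client and message. The pattern is an assumption based only on the two sample entries shown; other error log entries (startup notices, for example) have no client field and are simply skipped here.

```python
import re
from datetime import datetime

# Pattern inferred from the sample entries: [date] [level] [client addr] message
ERROR_LINE = re.compile(
    r'\[(?P<when>[^\]]+)\] \[(?P<level>\w+)\] '
    r'\[client (?P<client>[^\]]+)\] (?P<message>.*)'
)


def parse_error(line):
    """Return a dict of fields, or None if the entry does not fit the pattern."""
    match = ERROR_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["when"] = datetime.strptime(record["when"], "%a %b %d %H:%M:%S %Y")
    # The referring page, when present, trails the message.
    message, _, referer = record["message"].partition(", referer: ")
    record["message"] = message
    record["referer"] = referer or None
    return record


lines = [
    "[Wed Jan 28 00:15:26 2004] [error] [client 24.69.255.237] File does not exist: "
    "/data/www/northwest/images/spacer.gif, referer: "
    "http://library.usask.ca/northwest/contents.html",
    "[Wed Jan 28 00:16:30 2004] [error] [client 66.77.73.89] File does not exist: "
    "/data/www/education/chldawrd.html",
]
errors = [parse_error(line) for line in lines]
for error in errors:
    print(error["client"], error["message"], error["referer"])
```

The referer on the first entry points at the page carrying the broken link, which is the page to fix.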
Also log
Some types of authentication failures
Authentication problems may indicate a need
to add:
Directions – usernames are case sensitive
Implement a password reminder feature
Redesign or launch of new service
Watch your log files in “real time” or every few
seconds
For example on a UNIX server, use this
command:
tail -f /usr/local/apache/logs/error_log
(tail -f, followed by the path to the error_log file)
Site search and search logs
Rich source of data
Often underutilized
Tap into people’s expectations
Site search engine audit can help you to tune
your search engine, web pages, and results.
Useful for finding out about:
Content synopsis
Site search performance
Clues about web site usability
Creating “best bets”
Search engines generate two data files
Robot logging
– What URLs
– What files
– How many terms are indexed
Search query log
– Similar to web server log
• Strength – generate term frequencies
• Weakness – most don’t show IF the user clicked any results
Content synopsis
Discovery tool if many independent content
developers and/or servers
Bird’s eye view of breadth and depth
The robot report can tell you:
How many HTML pages?
How many PDFs?
How many unique words?
How many bad links/URLs?
“Smarter” robots may report
How many secure areas?
Refresh rate – how many documents have changed
Sample of Swish-e directory report
Checking dir "/data/resources/usability/readersurvey_files"...
Checking dir "/data/resources/usability/templates"...
Checking dir "/data/resources/usability/ugasurvey"...
Checking dir "/data/software.purchased"...
Checking dir "/data/ssh"...
Checking dir "/data/staffsessionmaterial"...
Checking dir "/data/staffsessionmaterial/groupware.presentation"...
Checking dir "/data/surveys"...
Checking dir "/data/sysinfo"...
Checking dir "/data/sysinfo/CVS"...
Checking dir "/data/sysinfo/apps"...
Checking dir "/data/sysinfo/ntnetwork"...
Excerpt of Swish-e summary
Removing very common words... no words removed.
Writing main index... 23414 unique words indexed.
Writing file index... 752 files indexed.
Running time: 21 seconds.
Indexing done!
Site search performance: coverage
Log analysis is just one part of the
assessment
Strength is the ability to see 1000’s of real
queries
Identify and repeat the top 50, 100 or 200
queries
BBCi – BBC search engines
Receive hourly reports so they can track
trends [1]
Tune results
– Space shuttle Columbia disaster
– Columbia the country
1 http://www.currybet.net/articles/audiences/
Sample queries by frequency
Verity Ultraseek search term query report
– Site-wide and collection specific queries
Queries by Frequency
Results from past 1 month
14722 Total queries
174 "staff"
169 url:lights.ca, url:hr url:marketing url:sysinfo url:contact, || domain
146 url:lights.ca, url:docs url:hr url:sysinfo url:contact, || stats
139 "domain setup"
105 url:lights.ca, url:docs url:hr url:sysinfo url:contact || orientation
92 url:lights.ca, url:docs url:hr url:sysinfo url:contact, || "hot+spare"
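If your search engine only gives you the raw query log, a frequency report like this one can be approximated in a few lines. This is a sketch: the normalization rules (lowercase, drop double quotes, collapse spaces) and the sample queries are assumptions, and it does not try to interpret engine-specific syntax such as the url: filters above.

```python
from collections import Counter


def normalize(query):
    """Lowercase, drop double quotes and collapse runs of whitespace."""
    return " ".join(query.lower().replace('"', " ").split())


def query_frequencies(queries):
    """Count normalized queries, ignoring ones that end up empty."""
    normalized = (normalize(query) for query in queries)
    return Counter(query for query in normalized if query)


def coverage(freqs, top_n):
    """Share of all searches accounted for by the top_n most frequent queries."""
    total = sum(freqs.values())
    top = sum(count for _, count in freqs.most_common(top_n))
    return top / total if total else 0.0


# Hypothetical raw queries, loosely modelled on the terms in the report.
raw = ["staff", '"Staff"', "domain setup", "Domain  Setup",
       "orientation", "staff ", "stats", ""]
freqs = query_frequencies(raw)
print(freqs.most_common(3))
print(round(coverage(freqs, 2), 2))
```

The coverage figure shows how much of the search traffic a short list of top queries accounts for, which helps decide how many queries are worth reviewing by hand.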
Mining for gold
What did people expect to find?
– Unlike web server logs, we can capture users’
own words for what they are seeking
– Safely assume they had a “hope” that we’d have
the information
– How well do we measure up?
Do we have content for the top 25 queries?
If not, why not? Should we add it?
“External” terms for U of S Library
Wild and wacky requests
Is the site search labeled clearly?
– Do people know where they are?
– Do they know what they’re searching?
– If they’re looking for MP3 downloads, Hawaiian
vacations, and foot fetishes, then a stronger visual
identity and “sense of place” are needed to
distinguish this from Internet search engines
Search terms
Book related queries
Lots of book titles and authors
– This may be okay – depends on your site content
– May mean they are lost and looking for the
catalogue
Check your labeling
Guide them – Looking for books? Search the
catalogue
Missing content
Are they looking for events?
Employee directory information
Subject pages (stored in a database and not
in the site search)
– May be as simple as adding it to the robot
– May require a simple (or sophisticated)
metasearch
Search and silos
Challenging to have people recognize what
they are searching
Federated searching presents possibilities of
breaking down silos
Specific example
Dozens of queries for Italian magazines
A professor of Italian literature gave an assignment
and sent his students to find “Italian magazines” on
the library’s web site
Nowhere do the words “Italian magazines” appear on
a web page
It is imperative that our sites reflect the language of our
users and, in academia, the language of faculty.
Site search performance: interface
Default settings for search page
Most searchers accept your defaults
How well do these mesh
– “anding” or “not anding” all keywords
Assessing performance
Look at your top 50 and 100 search queries
and see if the defaults are helping or
impeding good result sets
How often is there a likely looking item in the top 3
results? Is it relevant?
Is there a clear mismatch between the searcher’s
terminology and the site terminology?
Is the best page ranking low due to its construction?
– Graphics, no title tag, etc.
Results may surprise you!
Library was a stop word. Made a best bet.
U of S internal search engine report
“No matches found”
Dig deep
– Is it typographic or spelling errors?
Does your search offer suggestions?
– Is it a language gap?
– Is it variant terms for the same thing?
Situational analysis
Where is the leaping off point for the search?
– Run a referrer report for your search page
– When do many users abandon browsing and head for
search?
Turn to your web server logs
– Referring URL and then the search terms entered
– What was the page like just before they searched?
– Did it have the answer to the question?
– Was the choice on a menu but the vocabulary in the label
different?
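A sketch of a referrer report for the search page, working from log records that have already been parsed into path and referer fields. The /search path, the record layout and the sample records are assumptions made for illustration.

```python
from collections import Counter


def referrer_report(records, search_path="/search"):
    """Count the referring pages for requests to the search page.

    records: iterable of dicts with 'path' and 'referer' keys.
    """
    report = Counter()
    for record in records:
        # Ignore the query string when matching the search page.
        if record["path"].split("?", 1)[0] != search_path:
            continue
        referer = record.get("referer") or "-"
        report["(no referrer)" if referer == "-" else referer] += 1
    return report


# Hypothetical parsed records.
records = [
    {"path": "/search?q=staff", "referer": "https://example.org/"},
    {"path": "/search", "referer": "-"},
    {"path": "/search?q=italian+magazines", "referer": "https://example.org/"},
    {"path": "/data", "referer": "https://example.org/"},
]
report = referrer_report(records)
print(report.most_common())
```

The pages at the top of this report are the ones to look at first: they are where visitors gave up browsing and turned to search.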
From browse to search
Usability testing shows that people move
back and forth
P. Gremett found when analyzing Amazon,
the majority of users browsed until the
browsing areas became too busy, ambiguous
or lacked relevant content
P. Gremett. Utilizing a User’s Context to Improve Search Results. CHI 2003.
Typical response: blame the search engine
Reality check
– Garbage in, garbage out
– Use the terms our visitors are using
– Make sure pages are designed to rank effectively
IF the web team can change the content
Example: Recent site audit
More specific queries ranked lower than a
single general term
Employment
Employment income
Employment income occupation
It is counter-intuitive that the last phrase
would rank higher on the first result set
From research to action
“Best bets”
– Richard Wiggins, Louis Rosenfeld and Martin
Belam and others have written about the
performance increase you can achieve with “Best
Bets”
– Wiggins found that 50% of searches could be
matched to 1000 unique queries
What’s a “best bet”?
Rather than relying on the search engine to
rank, human editors designate one or more
sites as best bet for the top 20, 200 or 500
terms
The “best bet” appears at the top of the result
set
– Example: a search for library brings up the Library
home page, not C programming libraries or the library
budget.
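The mechanics can be as simple as a hand-edited lookup table consulted before the engine's own ranking. This is a sketch: the table, the example.org URLs and the function name are invented for illustration, and real search products offer their own ways of configuring this.

```python
# Hypothetical editor-maintained table: normalized query -> best bet URLs.
BEST_BETS = {
    "library": ["https://example.org/library/"],
}


def search_with_best_bets(query, engine_results, best_bets=BEST_BETS):
    """Put the editors' picks first, then the engine's results minus duplicates."""
    key = " ".join(query.lower().split())
    picks = best_bets.get(key, [])
    rest = [url for url in engine_results if url not in picks]
    return picks + rest


# Hypothetical engine ranking for the query "library".
engine_results = [
    "https://example.org/c-programming-libraries",
    "https://example.org/library/",
    "https://example.org/library-budget",
]
results = search_with_best_bets(" Library ", engine_results)
print(results)
```

Queries with no entry in the table fall through to the engine's ranking unchanged, so editors only need to maintain entries for the top terms.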
BBC – “Columbia”
BBC – “library”
BBCi Web – “Mars”
In summary
Web logs and search logs
– Rich sources of information
– Give us clues and help improve our sites
– Fast and easy and readily at hand
Help you create new value added services
Log data
Best used as part of redesign process that
includes web site audits, usability testing and
log analysis
Questions
Darlene Fichter
– [email protected]
– library.usask.ca/~fichter/