Transcript Slide 1
Reading Cyber Tracks:
Analyzing Log Files and Search Logs
Darlene Fichter
Data Library Coordinator, U of S Library
January 29, 2004
Overview
Log Files
How can log files help?
Getting up close and personal with log files
7 things my server logs told me
Error logs
Search Logs
Content synopsis
Site search performance
Intranet usability
“Best bets”
Why super heroes read log files
Macro picture
Rich source of information about
User behaviors
Link choices
Typical paths through a web site
Point out trouble spots
Help inform redesigns
Detractors
The least useful type of data for understanding users
Doesn’t measure outcomes
We don’t know the intent of the visitor
Hits are meaningless
True
Not true, if you’re estimating service capacity and performance
Imprecise and incomplete
Server logs can tell you
Who is using your site?
Who never uses your site?
Where do they enter?
What route do they follow?
What do they use?
How long do they stay?
Big picture
Average Number of Visits per Day on Weekdays: 6254
Average Number of Hits per Day on Weekdays: 110437
Average Number of Visits per Weekend: 8157
Average Number of Hits per Weekend: 113500
Most Active Day of the Week: Mon
Least Active Day of the Week: Sat
Most Active Date: December 04 2003
Number of Hits on Most Active Date: 173527
Least Active Date: December 25 2003
Number of Hits on Least Active Date: 38078
Most Active Hour of the Day: 14:00-14:59
Least Active Hour of the Day: 01:00-01:59
Page duration
How long do most people spend on a page?
– An inordinately long time could mean
Very confusing
Very worthwhile
Went for coffee?
– Skip averages and look for the mode or median
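The point about skipping averages can be sketched in a few lines of Python. The durations below are invented for illustration; the idea is that one coffee-break outlier drags the mean far above what a typical visitor experienced, while the median stays representative.

```python
from statistics import mean, median

# Hypothetical page-view durations in seconds for one page.
# The 1800-second value is a "went for coffee" outlier.
durations = [12, 15, 18, 20, 22, 25, 30, 1800]

print(f"mean:   {mean(durations):.0f}s")    # skewed high by the outlier
print(f"median: {median(durations):.0f}s")  # closer to typical behaviour
```

The mode works similarly when one duration dominates; for continuous timings the median is usually the safer summary.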
Exit pages
The point where someone leaves your site may offer some interesting clues
Related links – fine
A find-an-article page listing databases – fine (they likely left for a database)
– A caveat to keep in mind
• Use of the back button may not show up when pages are loaded from the browser cache
Forms
What is the completion rate for forms?
How many people abandon the ILL request process part way through?
Forms
Does your system for marking required fields work? Are people presented with error upon error on submission?
Are employees entering bogus responses in form fields to circumvent bad design?
What can you measure?
Depends on what is recorded in the log file
Web server access log files
– ASCII file that records each request
– Two common web server log file types
Common
Combined – records more data
Example: Apache combined log format
Who and when?
– IP address or hostname
– Identity or login (seldom used)
– Username recorded
– Date, time
What did they ask for? Did it work?
– Method
– Path
– Protocol (HTTP) and version
– Status code
Status codes
In general
– 200 codes are successful requests by a client
– 300’s report server redirects
– 400’s are used for client errors
– 500’s are used for server errors
404
– Bytes transferred
– Referring site (the page immediately before this request)
– User agent: browser and OS (e.g., browser – Mozilla; OS – Windows NT)
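The combined-format fields above can be pulled apart with a short script. This is a sketch, not production log parsing; the regular expression follows the standard combined layout, and the sample line is constructed for illustration from the fields discussed on these slides (the `/data?top` request and a client IP from the error-log example).

```python
import re

# One named group per combined-log-format field.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative line assembled from examples elsewhere in this talk.
line = ('24.69.255.237 - - [27/Jan/2004:03:08:11 -0600] '
        '"GET /data?top HTTP/1.1" 200 5120 '
        '"http://library.usask.ca/" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

m = COMBINED.match(line)
if m:
    print(m.group("host"), m.group("method"), m.group("path"), m.group("status"))
```

Once each line is a dictionary of fields, every report in this talk (top files, referrers, user agents, status codes) is a counting exercise.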
Log analysis software
Produce summary tables, charts and graphs
Popular ones are:
– WebTrends (commercial, Windows, Unix)
– Analog (free, Unix, Windows)
– Wusage (free, Unix, Windows)
– Many more: see the Yahoo directory under Log Analysis Tools
Sample: Top domains chart
Sample: Summary top files requested
Meaningful filenames rather than id=1232 help make this report understandable
What your logs can tell you, if you listen…
Specific areas where logs are useful
Specific examples
How visible are your links and menus?
Are you tuning your site? Is the new button or
label working?
Is anyone clicking on the special
announcement information?
Run a special report and see what links are
used the most on your home page
Redesign of E-Journal page
Subject browse was #2.
Redesign of U of S home page
Help was removed.
Homepage Clickthrus: http://www.usask.ca/analog/homepage/
Redesign of U of S home page
1. Departments
2. Search
3. PAWS
4. Students
5. Admissions
Redesign of Health Sciences Library page
Home page clickthrus were used to set priority order.
Before and after
Does the new top menu work?
ClickTracks [www.clicktracks.com] displays all the links on the page and the % of visitors who click on each.
Digging for evidence
Are people able to get from here to there?
Specific example
Evaluating a site wide menu
Trying to make the case that generic terms rather
than “brand names” were more effective
Team response was polite nods
Log files to the rescue
Looked up how many people actually selected this area from the home page based on the brand name label rather than the generic term.
Possible because the links had different
syntax
Tip: Add tracking code to the end of a link
http://library.usask.ca/data?top
Log file: - [27/Jan/2004:03:08:11 -0600] "GET /data?top
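Once each variant of a link carries its own tracking tag, counting clicks per tag is trivial. A minimal sketch, assuming the paths have already been extracted from the access log; the `sidebar` tag and the sample paths are invented for illustration (only `?top` appears in the talk).

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical request paths pulled from the log; the query string
# after "?" is the tracking tag appended to each link.
paths = ["/data?top", "/data?top", "/data?sidebar", "/data?top", "/data?sidebar"]

tags = Counter(urlparse(p).query for p in paths)
print(tags.most_common())  # [('top', 3), ('sidebar', 2)]
```

Because both links point at the same page, only the tag distinguishes which label the visitor clicked.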
We discovered
A quick glance at the log file revealed that in the prior two days:
200 accesses resulted from the brand name label
1000 accesses came from the generic term in a less prominent location
Where do you post announcements?
Need to get everyone’s attention
Branch closure
Pay fines now in order to convocate
Not everyone enters your site at the home page
Find the entry pages
Top entry pages
What’s hot and what’s not?
What areas or pages are popular?
How is it changing over time?
Popular may ≠ good
– Custom 404 pages are often #1 on a site with link rot
– High use may mean people are lost if your site doesn’t have a followed-link colour
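Spotting link rot from the log is a matter of tallying 404s by path. A rough sketch, assuming the log lines have already been parsed into (status, path) pairs; the sample requests are made up, with the `spacer.gif` path borrowed from the error-log example later in this talk.

```python
from collections import Counter

# Hypothetical parsed requests: (status code, requested path).
requests = [
    (200, "/index.html"),
    (404, "/old/ejournals.html"),
    (404, "/old/ejournals.html"),
    (200, "/data/"),
    (404, "/northwest/images/spacer.gif"),
]

# Count only the failed lookups; the most frequent broken
# paths are the links most worth fixing first.
missing = Counter(path for status, path in requests if status == 404)
for path, hits in missing.most_common(3):
    print(hits, path)
```

If a custom 404 page tops your "most requested files" report, this breakdown shows which dead links are sending people there.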
Link rot?
http://www.bio.cornell.edu/stats/01/07/default_01_b.htm
Top directories
Popularity questions
What’s popular but shouldn’t be?
– Overdependence on site search may signal site
navigation weaknesses
What should be popular but isn’t?
– If you expect high usage and it’s not happening, recheck links, labels and position. Is the link to the underutilized area prominent? Is it plain language or jargon?
Does anyone care?
Are we posting new announcements that no one ever reads?
Are the only hits from search engines
spidering the site?
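Whether the only hits are spiders can be checked by screening the User-Agent field. A rough sketch: real robot detection is fuzzier than this, and the marker substrings below are just common examples, not a definitive list.

```python
# Common substrings found in crawler User-Agent strings (illustrative).
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_robot(agent: str) -> bool:
    """Crude heuristic: flag agents containing a known robot marker."""
    agent = agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)

# Hypothetical user agents for one page's hits.
agents = [
    "Googlebot/2.1 (+http://www.googlebot.com/bot.html)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp)",
]

human_hits = [a for a in agents if not is_robot(a)]
print(len(human_hits))  # 1
```

If the human count for an announcements page is near zero while the raw hit count looks healthy, only the spiders are reading it.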
What should we add more of?
Is a feature used?
After a debate, quick links and audience menus were added to the site.
Quick links – very popular #3 and #5
Audience menus
Over time, use of the “student” option on the audience menus has increased
Getting down to the details
When can you move to CSS layouts? When can you downgrade support for Netscape 4.78?
What web browsers do you need to support?
Chart: Microsoft Explorer browser versions (Explorer 1.x–7.x) as a percent of total hits, Mon 12/01/2003 – Wed 12/31/2003 (1 month scale).
Cross platform testing
Retrace someone’s footsteps
What page referred them to the library site?
No referrer? Bookmark, typed in URL (or a robot)
What path did they follow?
Sometimes even what link they clicked
What did they type in a search box?
Where did they leave?
Log analysis tools – “top paths”
A sad tale
http://www.bio.cornell.edu/stats/01/07/default_01_b.htm
Paths
An even sadder tale
Or a programmer doing debugging?
Follow the top paths
Pay attention to where they stopped and restarted
No direct links from one area to another may indicate they used the back button
Error logs
Usually well used by development teams
We’ll only touch on a few points
Error log captures
Date
Error level
Client IP address or hostname
Error message or path to requested file
[Wed Jan 28 00:15:26 2004] [error] [client 24.69.255.237] File
does not exist: /data/www/northwest/images/spacer.gif, referer:
http://library.usask.ca/northwest/contents.html
[Wed Jan 28 00:16:30 2004] [error] [client 66.77.73.89] File
does not exist: /data/www/education/chldawrd.html
Also log
Some types of authentication failures
Authentication problems may indicate a need to:
Add directions (e.g., usernames are case sensitive)
Implement a password reminder feature
Redesign or launch of new service
Watch your log files in “real time” (updated every few seconds)
For example, on a UNIX server, use this command:
tail -f /usr/local/apache/logs/error_log
(tail -f follows the file as it grows; the argument is the path to the error_log file)
Site search and search logs
Rich source of data
Often underutilized
Tap into people’s expectations
Site search engine audit can help you to tune
your search engine, web pages, and results.
Useful for finding out about:
Content synopsis
Site search performance
Clues about web site usability
Creating “best bets”
Search engines generate two data files
Robot logging
– What URLs
– What files
– How many terms are indexed
Search query log
– Similar to web server log
Strength – generate term frequencies
Weakness – most don’t show whether the user clicked any results
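The strength mentioned above, generating term frequencies, is a small counting job once the query log is in hand. A minimal sketch; the queries below echo terms from the sample report later in this talk ("staff", "domain setup", "orientation"), with the rest invented.

```python
from collections import Counter

# Hypothetical raw queries pulled from a search query log.
queries = ["staff", "domain setup", "staff", "orientation", "staff directory"]

# Split multi-word queries so the report counts individual terms.
terms = Counter()
for q in queries:
    terms.update(q.lower().split())

for term, n in terms.most_common(5):
    print(n, term)
```

Counting whole queries instead of split terms is equally easy (`Counter(queries)`) and answers a slightly different question: what exact strings people type.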
Content synopsis
A discovery tool if you have many independent content developers and/or servers
Bird’s eye view of breadth and depth
The robot report can tell you:
How many HTML pages?
How many PDFs?
How many unique words?
How many bad links/URLs?
“Smarter” robots may report
How many secure areas?
Refresh rate – how many documents have changed
Sample of Swish-e directory report
Checking dir "/data/resources/usability/readersurvey_files"...
Checking dir "/data/resources/usability/templates"...
Checking dir "/data/resources/usability/ugasurvey"...
Checking dir "/data/software.purchased"...
Checking dir "/data/ssh"...
Checking dir "/data/staffsessionmaterial"...
Checking dir "/data/staffsessionmaterial/groupware.presentation"...
Checking dir "/data/surveys"...
Checking dir "/data/sysinfo"...
Checking dir "/data/sysinfo/CVS"...
Checking dir "/data/sysinfo/apps"...
Checking dir "/data/sysinfo/ntnetwork"...
Excerpt of Swish-e summary
Removing very common words... no words removed.
Writing main index... 23414 unique words indexed.
Writing file index... 752 files indexed.
Running time: 21 seconds.
Indexing done!
Site search performance: coverage
Log analysis is just one part of the
assessment
Strength is the ability to see thousands of real queries
Identify and repeat the top 50, 100 or 200
queries
BBCi – BBC search engines
Receive hourly reports so they can track trends¹
Tune results
– Space shuttle Columbia disaster
– Columbia the country
1 http://www.currybet.net/articles/audiences/
Sample queries by frequency
Verity Ultraseek search term query report
– Site-wide and collection specific queries
Queries by Frequency
Results from past 1 month
14722 Total queries
174 "staff"
169 url:lights.ca, url:hr url:marketing url:sysinfo url:contact, || domain
146 url:lights.ca, url:docs url:hr url:sysinfo url:contact, || stats
139 "domain setup"
105 url:lights.ca, url:docs url:hr url:sysinfo url:contact || orientation
92 url:lights.ca, url:docs url:hr url:sysinfo url:contact, || "hot+spare"
Mining for gold
What did people expect to find?
– Unlike web server logs, we can capture users’ own words for what they are seeking
– Safely assume they had a “hope” that we’d have
the information
– How well do we measure up?
Do we have content for the top 25 queries?
If not, why not? Should we add it?
“External” terms for U of S Library
Wild and wacky requests
Is the site search labeled clearly?
– Do people know where they are?
– Do they know what they’re searching?
– If they’re looking for MP3 downloads, Hawaiian vacations, and foot fetishes, then visual identity and a “sense of place” are needed to distinguish this from Internet engines
Search terms
Book related queries
Lots of book titles and authors
– This may be okay – depends on your site content
– May mean they are lost and looking for the
catalogue
Check your labeling
Guide them – Looking for books? Search the
catalogue
Missing content
Are they looking for events?
Employee directory information
Subject pages (stored in a database and not
in the site search)
– May be as simple as adding it to the robot
– May require a simple (or sophisticated)
metasearch
Search and silos
Challenging to have people recognize what
they are searching
Federated searching presents possibilities of
breaking down silos
Specific example
Dozens of queries for Italian magazines
A professor of Italian literature gave an assignment
and sent his students to find “Italian magazines” on
the library’s web site
Nowhere do the words “Italian magazines” appear on a web page
It is imperative that our sites reflect the language of our users and, in academia, the language of faculty.
Site search performance: interface
Default settings for search page
Most searchers accept your defaults
How well do these mesh with searcher expectations?
– “ANDing” or not “ANDing” all keywords
Assessing performance
Look at your top 50 and 100 search queries
and see if the defaults are helping or
impeding good result sets
How often is there a likely looking item in the top 3
results? Is it relevant?
Is there a clear mismatch between the searcher’s
terminology and the site terminology?
Is the best page ranking low due to its construction?
– Graphics, no title tag, etc.
Results may surprise you!
“Library” was a stop word. Made a best bet.
U of S internal search engine report
“No matches found”
Dig deep
– Are they typographic or spelling errors?
Does your search offer suggestions?
– Is it a language gap?
– Is it variant terms for the same thing?
Situational analysis
Where is the leaping off point for the search?
– Run a referrer report for your search page
– When do many users abandon browsing and head for search?
Turn to your web server logs
– Referring URL and then the search terms entered
– What was the page like just before they searched?
– Did it have the answer to the question?
– Was the choice on a menu but the vocabulary in the label
different?
From browse to search
Usability testing shows that people move
back and forth
P. Gremett found, when analyzing Amazon, that the majority of users browsed until the browsing areas became too busy or ambiguous, or lacked relevant content
P. Gremett. Utilizing a User’s Context to Improve Search Results. CHI 2003.
Typical response: blame the search engine
Reality check
– Garbage in, garbage out
– Use the terms our visitors are using
– Make sure pages are designed to rank effectively
IF the web team can change the content
Example: Recent site audit
More specific queries ranked lower than a
single general term
Employment
Employment income
Employment income occupation
It is counter-intuitive: you would expect the last, most specific phrase to rank the page highest in the result set
From research to action
“Best bets”
– Richard Wiggins, Louis Rosenfeld and Martin
Belam and others have written about the
performance increase you can achieve with “Best
Bets”
– Wiggins found that 50% of searches could be
matched to 1000 unique queries
What’s a “best bet”?
Rather than relying on the search engine to
rank, human editors designate one or more
sites as best bet for the top 20, 200 or 500
terms
The “best bet” appears at the top of the result
set
– Example: a search for library brings up the Library
home page, not C programming libraries or the library
budget.
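The mechanism can be sketched as a human-edited lookup consulted before the engine's own ranking. A minimal sketch with invented entries; real implementations (as at BBCi or described by Wiggins) also handle synonyms and phrase variants.

```python
# Editor-maintained mapping from query term to the designated page.
# Entries here are illustrative, echoing the "library" example above.
BEST_BETS = {
    "library": "http://library.usask.ca/",
    "columbia": "/news/columbia-shuttle.html",
}

def search(query, engine_results):
    """Prepend the editor-chosen best bet, if any, to the engine's results."""
    bet = BEST_BETS.get(query.strip().lower())
    if bet:
        # Avoid listing the best bet twice if the engine also found it.
        return [bet] + [r for r in engine_results if r != bet]
    return engine_results

print(search("Library", ["/c-programming/libs.html", "http://library.usask.ca/"]))
# ['http://library.usask.ca/', '/c-programming/libs.html']
```

The table only needs entries for the top few hundred queries; everything else falls through to normal ranking, which is why the editorial effort stays manageable.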
BBC – “Columbia”
BBC – “library”
BBCi Web – “Mars”
In summary
Web logs and search logs
– Rich sources of information
– Give us clues and help improve our sites
– Fast, easy, and readily at hand
Help you create new value-added services
Log data
Best used as part of redesign process that
includes web site audits, usability testing and
log analysis
Questions
Darlene Fichter
– [email protected]
– library.usask.ca/~fichter/