Reading Cyber Tracks: Analyzing Log Files and Search Logs
Darlene Fichter, Data Library Coordinator, U of S Library
January 29, 2004

Overview
Log Files
– How can log files help?
– Getting up close and personal with log files
– 7 things my server logs told me
– Error logs
Search Logs
– Content synopsis
– Site search performance
– Intranet usability
– "Best bets"

Why super heroes read log files

Macro picture
Rich source of information about:
– User behaviors
– Link choices
– Typical paths through a web site
Logs point out trouble spots and help inform redesigns.

Detractors
"The least useful type of data for understanding users"
– Doesn't measure outcomes
– We don't know the intent of the visitor
"Hits are meaningless"
– True
– Not true, if you're estimating service capacity and performance
Imprecise and incomplete

Server logs can tell you
– Who is using your site? Who never uses your site?
– Where do they enter? What route do they follow?
– What do they use? How long do they stay?

Big picture
Average number of visits per day on weekdays: 6,254
Average number of hits per day on weekdays: 110,437
Average number of visits per weekend: 8,157
Average number of hits per weekend: 113,500
Most active day of the week: Mon
Least active day of the week: Sat
Most active date: December 04, 2003 (173,527 hits)
Least active date: December 25, 2003 (38,078 hits)
Most active hour of the day: 14:00-14:59
Least active hour of the day: 01:00-01:59

Page duration
How long do most people spend on a page?
– An inordinately long time could mean: very confusing, very worthwhile, or they went for coffee
– Skip averages and look for the mode or median

Exit pages
The point where someone leaves your site may offer some interesting clues.
– Exiting from a "related links" page – fine
– Exiting from the "find an article" page listing databases – worth a closer look
A caveat to keep in mind:
• Use of the back button may not show up when pages are loaded from the browser cache

Forms
– What is the completion rate for forms?
– How many people abandon the ILL loan process part way through?
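The "skip averages" advice on the page-duration slide can be shown with a few lines of Python. The visit below is hypothetical: one visitor's timestamped requests, with durations taken as the gap until the next request (the last page's exit time is unknowable from the log).

```python
# Why median beats mean for page duration: one "went for coffee" outlier
# skews the average badly. The sample visit is hypothetical.
from statistics import mean, median

# (seconds since visit start, page requested) for a single visitor session
hits = [(0, "/"), (15, "/ejournals"), (40, "/databases"),
        (55, "/ill"), (955, "/contact")]

# Time on each page = gap until the next request
durations = [t2 - t1 for (t1, _), (t2, _) in zip(hits, hits[1:])]

print(durations)         # [15, 25, 15, 900]  <- one coffee-break outlier
print(mean(durations))   # 238.75, dominated by the outlier
print(median(durations)) # 20.0, closer to typical behaviour
```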
Forms (cont'd)
– Does your system for marking required fields work? Are people presented with error upon error on submission?
– Are employees entering bogus responses in form fields to circumvent bad design?

What can you measure?
Depends on what is recorded in the log file.
Web server access log files:
– ASCII files that record each request
– Two common web server log file types: Common and Combined (Combined records more data)

Example: Apache combined log format
Who and when?
– IP address or hostname
– Identity or login (seldom used)
– Username, if recorded
– Date and time
What did they ask for? Did it work?
– Method
– Path
– Protocol (HTTP) and version
– Status code

Status codes
In general:
– 200s are successful requests by a client
– 300s report server redirects
– 400s are used for client errors (e.g., 404)
– 500s are used for server errors

Also recorded:
– Referring site: the page immediately before this request
– Bytes transferred
– User agent: browser and OS (e.g., browser – Mozilla; OS – Windows NT)

Log analysis software
Produce summary tables, charts and graphs. Popular ones are:
– WebTrends (commercial; Windows, Unix)
– Analog (free; Unix, Windows)
– Wusage (free; Unix, Windows)
– Many more: see Yahoo > Log Analysis Tools > Titles

Sample: top domains chart
Sample: summary of top files requested
– Meaningful filenames rather than id=1232 help make this report understandable

What your logs can tell you, if you listen…
– Specific areas where logs are useful
– Specific examples

How visible are your links and menus?
Are you tuning your site? Is the new button or label working? Is anyone clicking on the special announcement? Run a special report and see which links are used the most on your home page.

Redesign of E-Journal page
– Subject browse was #2.
Redesign of U of S home page
– Help was removed.
Homepage clickthrus: http://www.usask.ca/analog/homepage/

Redesign of U of S home page (top clickthrus):
1. Departments
2. Search
3. PAWS
4. Students
5. Admissions

Redesign of Health Sciences Library page
– Home page clickthrus used to set priority order.
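The combined-format fields listed above (host, identity, user, timestamp, request line, status, bytes, referrer, user agent) can be pulled apart with a short regular expression. A minimal sketch; the sample line is hypothetical, modeled on the `/data?top` example elsewhere in this talk.

```python
# Minimal parser for the Apache "combined" log format.
import re

COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical log line in combined format
line = ('192.0.2.7 - - [27/Jan/2004:03:08:11 -0600] "GET /data?top HTTP/1.1" '
        '200 5120 "http://library.usask.ca/" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

m = COMBINED.match(line)
print(m.group("host"), m.group("path"), m.group("status"))
# 192.0.2.7 /data?top 200
```

With fields named, counting status codes, top paths, or referrers is a matter of feeding each group into a counter.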
Before and after
Does the new top menu work?
ClickTracks [www.clicktracks.com] – displays all the links on the page and the percentage of visitors who click on each one.

Digging for evidence
Are people able to get from here to there?

Specific example: evaluating a site-wide menu
– Trying to make the case that generic terms rather than "brand names" were more effective
– Team response was polite nods
– Log files to the rescue: looked up how many people actually selected this area from the home page via the brand name label rather than the generic term
– Possible because the links had different syntax

Tip: add a tracking code to the end of a link
http://library.usask.ca/data?top
Log file: - [27/Jan/2004:03:08:11 -0600] "GET /data?top

We discovered
A quick glance at the log file revealed that in the prior two days:
– 200 accesses resulted from the brand name label
– 1,000 accesses resulted from the generic term in a less prominent location

Where do you post announcements?
Need to get everyone's attention:
– Branch closure
– Pay fines now in order to convocate
Not everyone enters your site at the home page – find the entry pages.
Top entry pages

What's hot and what's not?
What areas or pages are popular? How is that changing over time?
Popular may ≠ good:
– Custom 404 pages are often #1 on a site with link rot
– High use may mean people are lost, if your site doesn't have a followed-link colour
Link rot? http://www.bio.cornell.edu/stats/01/07/default_01_b.htm
Top directories

Popularity questions
What's popular but shouldn't be?
– Overdependence on site search may signal site navigation weaknesses
What should be popular but isn't?
– If you expect high usage and it's not happening, recheck links, labels and position. Is the link to the underutilized area prominent? Is it plain language or jargon?
Does anyone care?
– Are we posting new announcements that no one reads (ever)? Are the only hits from search engines spidering the site?
What should we add more of? Is a feature used?
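The tracking-code tip works because two links can point at the same target with different query strings, so clicks from each label are distinguishable in the access log. A sketch of the counting step; `/data?top` is from the slide, while `/data?menu` and the sample lines are hypothetical stand-ins for the generic label.

```python
# Counting clickthrus per label from tracking codes in the query string.
# The label URLs and log lines are hypothetical illustrations.
from collections import Counter

log_lines = [
    '... "GET /data?top HTTP/1.0" 200 ...',    # brand-name label
    '... "GET /data?menu HTTP/1.0" 200 ...',   # generic label
    '... "GET /data?menu HTTP/1.0" 200 ...',
]

clicks = Counter()
for line in log_lines:
    if '"GET /data?top ' in line:
        clicks["brand label"] += 1
    elif '"GET /data?menu ' in line:
        clicks["generic label"] += 1

print(clicks)  # Counter({'generic label': 2, 'brand label': 1})
```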
After a debate, quick links and audience menus were added to the site.
Quick links – very popular (#3 and #5)
Audience menus – over time, the "student" option on the audience menus has increased

Getting down to the details
– When can you move to CSS layouts?
– When can you downgrade support for Netscape 4.78?
– What web browsers do you need to support?

[Chart: Microsoft Explorer versions 1.x–7.x as a percent of total hits, Mon 12/01/2003 – Wed 12/31/2003]

Cross-platform testing (table)

Retrace someone's footsteps
– What page referred them to the library site? No referrer? A bookmark, a typed-in URL (or a robot)
– What path did they follow? Sometimes even what link they clicked
– What data did they type in a search box?
– Where did they leave?
Log analysis tools: "top paths"

A sad tale: http://www.bio.cornell.edu/stats/01/07/default_01_b.htm
Paths
An even sadder tale – or a programmer doing debugging?
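A "top paths" report like the ones these tools produce can be sketched in a few lines: group requests by visitor, join each visitor's page sequence, and count how often each sequence occurs. The visitor IDs and pages below are hypothetical.

```python
# Sketch of a "top paths" report: count the page sequences visitors follow.
# Sample data is hypothetical.
from collections import Counter, defaultdict

# (visitor, page) pairs in time order
hits = [
    ("a", "/"), ("a", "/ejournals"), ("a", "/databases"),
    ("b", "/"), ("b", "/ejournals"), ("b", "/databases"),
    ("c", "/"), ("c", "/catalogue"),
]

paths = defaultdict(list)
for visitor, page in hits:
    paths[visitor].append(page)

top_paths = Counter(" > ".join(p) for p in paths.values())
for path, count in top_paths.most_common(2):
    print(count, path)
# 2 / > /ejournals > /databases
# 1 / > /catalogue
```

Real tools also have to decide what counts as a "visitor" (IP plus user agent, session cookie, etc.), which this sketch glosses over.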
Follow the top paths
– Pay attention to where they stopped and restarted
– No direct links from one area to another may indicate they used their back button

Error logs
Usually well used by development teams; only touching on a few points here.
The error log captures:
– Date
– Error level
– Client IP address or hostname
– Error message or path to the requested file

[Wed Jan 28 00:15:26 2004] [error] [client 24.69.255.237] File does not exist: /data/www/northwest/images/spacer.gif, referer: http://library.usask.ca/northwest/contents.html
[Wed Jan 28 00:16:30 2004] [error] [client 66.77.73.89] File does not exist: /data/www/education/chldawrd.html

Error logs also record some types of authentication failures. Authentication problems may indicate a need to:
– Add directions – usernames are case sensitive
– Implement a password reminder feature

Redesign or launch of a new service
Watch your log files in "real time", refreshed every few seconds. For example, on a Unix server:
tail -f /usr/local/apache/logs/error_log
(tail -f followed by the path to the error_log file)

Site search and search logs
– Rich source of data, often underutilized
– Tap into people's expectations
A site search engine audit can help you tune your search engine, web pages, and results. Useful for finding out about:
– Content synopsis
– Site search performance
– Clues about web site usability
– Creating "best bets"

Search engines generate two data files:
Robot log:
– What URLs, what files, how many terms are indexed
Search query log:
– Similar to a web server log
– Strength: generates term frequencies
– Weakness: most don't show IF the user clicked any results

Content synopsis
– A discovery tool if there are many independent content developers and/or servers
– A bird's-eye view of breadth and depth
The robot report can tell you:
– How many HTML pages? How many PDFs?
– How many unique words?
– How many bad links/URLs?
"Smarter" robots may report:
– How many secure areas?
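The two "File does not exist" lines above have a regular enough shape that the missing path, and the referring page when one is logged, can be extracted mechanically; that turns an error log into a link-rot worklist. A minimal sketch using exactly those two sample lines.

```python
# Extract missing-file paths (and referers, when present) from Apache
# error-log lines like the samples above.
import re

errors = [
    "[Wed Jan 28 00:15:26 2004] [error] [client 24.69.255.237] "
    "File does not exist: /data/www/northwest/images/spacer.gif, "
    "referer: http://library.usask.ca/northwest/contents.html",
    "[Wed Jan 28 00:16:30 2004] [error] [client 66.77.73.89] "
    "File does not exist: /data/www/education/chldawrd.html",
]

pattern = re.compile(
    r"File does not exist: (?P<path>[^,]+)(?:, referer: (?P<referer>\S+))?")

for line in errors:
    m = pattern.search(line)
    print(m.group("path"), "<-", m.group("referer") or "(no referer)")
```

A missing image with a referer, as in the first line, points straight at the page whose HTML needs fixing.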
– Refresh rate: how many documents have changed

Sample of Swish-e directory report:
Checking dir "/data/resources/usability/readersurvey_files"...
Checking dir "/data/resources/usability/templates"...
Checking dir "/data/resources/usability/ugasurvey"...
Checking dir "/data/software.purchased"...
Checking dir "/data/ssh"...
Checking dir "/data/staffsessionmaterial"...
Checking dir "/data/staffsessionmaterial/groupware.presentation"...
Checking dir "/data/surveys"...
Checking dir "/data/sysinfo"...
Checking dir "/data/sysinfo/CVS"...
Checking dir "/data/sysinfo/apps"...
Checking dir "/data/sysinfo/ntnetwork"...

Excerpt of a Swish-e summary:
Removing very common words... no words removed.
Writing main index... 23414 unique words indexed.
Writing file index... 752 files indexed.
Running time: 21 seconds. Indexing done!

Site search performance: coverage
– Log analysis is just one part of the assessment
– Its strength is the ability to see 1000s of real queries
– Identify and repeat the top 50, 100 or 200 queries

BBCi – BBC search engines
– Receive hourly reports so they can track trends [1]
– Tune results:
  – Space shuttle Columbia disaster
  – Columbia the country
[1] http://www.currybet.net/articles/audiences/

Sample queries by frequency
Verify against the Ultraseek search term query report – site-wide and collection-specific queries.

Queries by frequency (results from past 1 month; 14,722 total queries):
174  "staff"
169  url:lights.ca, url:hr url:marketing url:sysinfo url:contact, || domain
146  url:lights.ca, url:docs url:hr url:sysinfo url:contact, || stats
139  "domain setup"
105  url:lights.ca, url:docs url:hr url:sysinfo url:contact || orientation
92   url:lights.ca, url:docs url:hr url:sysinfo url:contact, || "hot+spare"

Mining for gold
What did people expect to find?
– Unlike web server logs, we can capture users' own words for what they are seeking
– Safely assume they had a "hope" that we'd have the information
– How well do we measure up?
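A queries-by-frequency report like the one above is the search log's headline feature, and it reduces to counting normalized query strings. A minimal sketch; the query log below is a hypothetical handful of entries, not the real Ultraseek data.

```python
# Reproducing a "queries by frequency" report from a raw query log.
# The query log below is hypothetical.
from collections import Counter

query_log = ["staff", "domain setup", "Staff", "orientation",
             "staff ", "domain setup"]

# Normalize (trim, lowercase) so trivially different entries count together
freq = Counter(q.strip().lower() for q in query_log)

for query, count in freq.most_common(3):
    print(count, query)
# 3 staff
# 2 domain setup
# 1 orientation
```

More aggressive normalization (stemming, spelling correction) is a judgment call: it merges related queries but hides the exact words users typed, which is part of what makes these logs valuable.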
Do we have content for the top 25 queries? If not, why not? Should we add it?
"External" terms for the U of S Library

Wild and wacky requests
Is the site search labeled clearly?
– Do people know where they are? Do they know what they're searching?
– If they're looking for MP3 downloads, Hawaiian vacations, and foot fetishes, then visual identity and a "sense of place" are needed to distinguish this from Internet search engines

Search terms: book-related queries
Lots of book titles and authors:
– This may be okay – it depends on your site content
– May mean they are lost and looking for the catalogue
Check your labeling and guide them:
– "Looking for books? Search the catalogue"

Missing content
Are they looking for events? Employee directory information? Subject pages (stored in a database and not in the site search)?
– May be as simple as adding it to the robot
– May require a simple (or sophisticated) metasearch

Search and silos
– It is challenging to have people recognize what they are searching
– Federated searching presents possibilities for breaking down silos
Specific example: dozens of queries for Italian magazines
– A professor of Italian literature gave an assignment and sent his students to find "Italian magazines" on the library's web site
– Nowhere do the words "Italian magazines" appear on a web page
– It is imperative that our sites reflect the language of our users and, in academia, the language of faculty.

Site search performance: interface
Default settings for the search page:
– Most searchers accept your defaults
– How well do these mesh with their queries? "ANDing" or not "ANDing" all keywords

Assessing performance
Look at your top 50 to 100 search queries and see whether the defaults are helping or impeding good result sets.
– How often is there a likely-looking item in the top 3 results? Is it relevant?
– Is there a clear mismatch between the searcher's terminology and the site's terminology?
– Is the best page ranking low due to its construction (graphics, no title tag, etc.)?
Results may surprise you! "Library" was a stop word.
We made it a best bet.
U of S internal search engine report

"No matches found"
Dig deep:
– Is it typographic or spelling errors? Does your search offer suggestions?
– Is it a language gap?
– Is it variant terms for the same thing?

Situational analysis
Where is the leaping-off point for the search?
– Run a referrer report for your search page
– When do many users abandon browsing and head for search?
Turn to your web server logs:
– The referring URL, and then the search terms entered
– What was the page like just before they searched? Did it have the answer to the question?
– Was the choice on a menu, but the vocabulary in the label different?

From browse to search
Usability testing shows that people move back and forth. Analyzing Amazon, P. Gremett found that the majority of users browsed until the browsing areas became too busy, ambiguous or lacked relevant content.
P. Gremett. Utilizing a User's Context to Improve Search Results. CHI 2003.

Typical response: blame the search engine
Reality check: garbage in, garbage out.
– Use the terms our visitors are using
– Make sure pages are designed to rank effectively – IF the web team can change the content

Example: recent site audit
More specific queries ranked lower than a single general term:
– Employment
– Employment income
– Employment income occupation
It is counter-intuitive: you would expect the last, most specific phrase to produce the best-ranked result set.

From research to action: "best bets"
– Richard Wiggins, Louis Rosenfeld, Martin Belam and others have written about the performance increase you can achieve with "best bets"
– Wiggins found that 50% of searches could be matched to 1,000 unique queries

What's a "best bet"?
Rather than relying on the search engine to rank, human editors designate one or more pages as the best bet for the top 20, 200 or 500 terms. The "best bet" appears at the top of the result set.
– Example: a search for "library" brings up the Library home page, not C programming libraries or the library budget.
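Mechanically, a best-bets layer can be as simple as an editor-maintained lookup table consulted before the engine's ranked results. A minimal sketch of that idea; the mapping entries, `search` function and sample results are hypothetical, not any particular engine's API.

```python
# A minimal "best bets" layer: an editor-curated query-to-page mapping,
# checked before the engine's own ranking. Entries are hypothetical.
BEST_BETS = {
    "library": "http://library.usask.ca/",
}

def search(query, engine_results):
    """Prepend the designated best bet, if any, to the engine's results."""
    bet = BEST_BETS.get(query.strip().lower())
    results = list(engine_results)
    if bet is not None and bet not in results:
        results.insert(0, bet)
    return results

print(search("Library", ["/docs/c-libraries.html", "/budget/library.html"]))
# ['http://library.usask.ca/', '/docs/c-libraries.html', '/budget/library.html']
```

Because the table only needs entries for the top few hundred queries, and (per Wiggins) those cover a large share of all searches, a small editorial effort buys a large perceived-quality gain.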
Examples: BBC – "Columbia"; BBC – "library"; BBCi Web – "Mars"

In summary
Web logs and search logs:
– Rich sources of information
– Give us clues and help improve our sites
– Fast, easy and readily at hand
– Help you create new value-added services
Log data is best used as part of a redesign process that includes web site audits, usability testing and log analysis.

Questions?
Darlene Fichter
– [email protected]
– library.usask.ca/~fichter/