Web Server Log - Data Mining Community's Top Resource
Web Server Log
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
An extract from KDnuggets web log
© 2006 KDnuggets
Web Server Log – An Example
[Diagram: KDnuggets.com server, page contents of http://www.kdnuggets.com/jobs/, and the web server log]
Web server log:
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET … HTTP/1.1" 200
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /gps.html HTTP/1.1" 200
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200
…
Web (Server) Log – In Depth
A sample web log line
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
The same line, one field per line:
152.152.98.11
-
-
[16/Nov/2005:16:32:50 -0500]
"GET /jobs/ HTTP/1.1"
200
15140
"http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
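The field breakdown above can be reproduced in code. The following is a minimal sketch in Python, assuming every line has all nine fields in the order shown; the pattern name `LOG_LINE`, the function `parse_line`, and the group names are chosen here for illustration and do not come from any particular log analysis tool.

```python
import re

# One line in the format shown above: IP, two identity fields (usually "-"),
# bracketed timestamp, quoted request, status, size, quoted referrer and agent.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) (?P<name>\S+) (?P<login>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>.*?)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>.*?)" "(?P<agent>.*?)"$'
)

def parse_line(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None

sample = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 '
    '"http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"'
)
fields = parse_line(sample)
for key, value in fields.items():
    print(f"{key:9} {value}")
```

All values come back as strings; converting the status, size, and timestamp is left to the caller, since malformed lines are common in real logs.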
Web log field: IP
152.152.98.11
IP address - can be converted to host name,
such as xyz.example.com
Web log fields: Name, Login
The name of the remote user (usually
omitted and replaced by a dash “-”)
Login of the remote user (also usually
omitted and replaced by a dash “-”)
Web log field: Date/Time/TZ
[16/Nov/2005:16:32:50 -0500]
Date: DD/Mon/YYYY
Time: HH:MM:SS
Time Zone: (+|-)HHMM, offset relative to GMT; -0500 is US EST
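The date, time, and time zone pattern above maps directly onto a `strptime` format string. This is a small sketch using Python's standard `datetime` module; the helper name `parse_log_time` is made up for this example, and `%b` assumes English month abbreviations (the default C locale).

```python
from datetime import datetime, timezone

# Format of the field once the square brackets are removed.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def parse_log_time(field):
    """Parse '[16/Nov/2005:16:32:50 -0500]' into a timezone-aware datetime."""
    return datetime.strptime(field.strip("[]"), LOG_TIME_FORMAT)

local = parse_log_time("[16/Nov/2005:16:32:50 -0500]")
utc = local.astimezone(timezone.utc)
print(local.isoformat())  # server-local time with its offset
print(utc.isoformat())    # the same instant in GMT/UTC
```

Converting to UTC before comparing timestamps avoids surprises when logs from servers in different time zones are merged.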
Web log field: Request
"GET /jobs/ HTTP/1.1"
Method: GET, HEAD, POST, OPTIONS, …
URL: relative to domain
HTTP protocol: e.g. HTTP/1.0 or HTTP/1.1
Note: the request is recorded as sent, so it may contain errors, hacks, and any strange thing you can imagine
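Because the request is recorded as sent, code that splits it should tolerate malformed input. A minimal sketch in Python, assuming a well-formed request has exactly three space-separated parts; the function name `split_request` and the `KNOWN_METHODS` set are illustrative, and the set lists only the methods named above.

```python
KNOWN_METHODS = {"GET", "HEAD", "POST", "OPTIONS"}  # only the methods listed above

def split_request(request):
    """Split a request string into (method, url, protocol).

    Returns None when the string does not have exactly three parts,
    which is typical of malformed or hostile requests.
    """
    parts = request.split()
    if len(parts) != 3:
        return None
    method, url, protocol = parts
    return method, url, protocol

def is_known_method(request):
    """True if the request parses and uses one of the methods listed above."""
    parsed = split_request(request)
    return parsed is not None and parsed[0] in KNOWN_METHODS

print(split_request("GET /jobs/ HTTP/1.1"))
print(split_request("-"))
print(is_known_method("GET /jobs/ HTTP/1.1"))
```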
Web log field: Status code
200
Status (response) code. The most important ones are:
200 – OK (most frequent, hopefully)
206 – partial content
301 – permanently redirected (e.g. access to /courses is redirected to /courses/)
302 – temporarily redirected
304 – not modified
404 – not found
…
Web log field: Object size
15140
Size of the object returned to the client, in bytes
Can also be “-” if the status code is 304 (not modified)
Web log field: Referrer
http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N
URL the visitor came from (here it was a Google query for “salary for data mining”, 2nd page of results, starting from result 10)
The referrer can also be a static page, internal (same domain) or external (different domain), or “-” in case of a direct request (e.g. type-in, bookmark)
Referrer analysis is very valuable
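The referrer cases described above (direct, internal, external, and search queries) can be separated with the standard `urllib.parse` module. The sketch below is illustrative only: the function name `describe_referrer`, the default `own_host`, and the choice of `q` and `p` as search parameters are assumptions based on the two sample referrers in this log; other search engines use other parameter names.

```python
from urllib.parse import urlsplit, parse_qs

def describe_referrer(referrer, own_host="www.kdnuggets.com"):
    """Classify a referrer as direct, internal, or external, and extract
    a search query from a 'q' or 'p' parameter when one is present."""
    if referrer in ("", "-"):
        return {"kind": "direct", "host": None, "query": None, "start": None}
    parts = urlsplit(referrer)
    params = parse_qs(parts.query)  # decodes '+' as a space
    query = (params.get("q") or params.get("p") or [None])[0]
    start = params.get("start", [None])[0]
    kind = "internal" if parts.hostname == own_host else "external"
    return {"kind": kind, "host": parts.hostname, "query": query, "start": start}

google = "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N"
print(describe_referrer(google))
print(describe_referrer("http://www.kdnuggets.com/"))
print(describe_referrer("-"))
```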
Web log field: User agent
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
User agent (browser): http://en.wikipedia.org/wiki/User_agent
Almost all browser user agent strings start with Mozilla, for historical reasons
In many cases there is additional information:
Browser type, version: MSIE 6.0 (Internet Explorer 6.0)
OS: Windows NT 5.1 (XP SP2) with .NET Framework 1.1 installed
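As a rough illustration of pulling the browser and OS tokens out of a user agent string, here is a Python sketch that looks only for the two tokens discussed above (`MSIE` and `Windows NT`). The function name `describe_user_agent` is hypothetical, and real user agent parsing needs far more rules than this.

```python
import re

def describe_user_agent(agent):
    """Extract the MSIE and Windows NT version numbers, if present."""
    msie = re.search(r"MSIE (\d+(?:\.\d+)*)", agent)
    winnt = re.search(r"Windows NT (\d+(?:\.\d+)*)", agent)
    return {
        "starts_with_mozilla": agent.startswith("Mozilla/"),
        "msie_version": msie.group(1) if msie else None,
        "windows_nt_version": winnt.group(1) if winnt else None,
    }

agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
print(describe_user_agent(agent))
```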
Web Usage Mining
Basic: totals
Simple: request-level breakdowns
Advanced: visit-level analysis; target pages; conversion analysis
Web Log Analysis Programs
Free: Analog, AWStats, Webalizer, Google Analytics
Commercial: WebTrends, WebSideStory, …
www.kdnuggets.com/software/web-mining.html
Web Usage Mining - Basic
Totals for each component
Hits – total number of requests
Files – number of GETs
Pages – number of HTML pages
Sites – unique IP addresses
Response codes
Kbytes – total Kbytes transferred
User Agents
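The totals listed above can be computed in a single pass over parsed log entries. The sketch below is one possible reading, not how any specific analyzer works: it assumes Files means requests answered with status 200, and that a Page is a Files request whose path ends in `/`, `.html`, or `.htm`. The names `basic_totals`, `is_page`, and `PAGE_SUFFIXES` are invented for this example; the four entries are taken from the log extract at the start of this deck.

```python
PAGE_SUFFIXES = ("/", ".html", ".htm")  # assumption: what counts as an HTML page

def is_page(url):
    """True if the URL path (query string removed) looks like an HTML page."""
    path = url.split("?", 1)[0]
    return path.endswith(PAGE_SUFFIXES)

def basic_totals(entries):
    """entries: iterable of (ip, url, status, size); size is a string, possibly '-'."""
    hits = files = pages = total_bytes = 0
    sites = set()
    for ip, url, status, size in entries:
        hits += 1                      # every request is a hit
        sites.add(ip)                  # unique IP addresses
        if status == 200:              # assumption: Files = status 200
            files += 1
            if is_page(url):
                pages += 1
        if size != "-":
            total_bytes += int(size)
    return {"hits": hits, "files": files, "pages": pages,
            "sites": len(sites), "kbytes": total_bytes / 1024}

entries = [
    ("152.152.98.11", "/jobs/", 200, "15140"),
    ("252.113.176.247", "/", 200, "12453"),
    ("252.113.176.247", "/kdr.css", 200, "145"),
    ("252.113.176.247", "/images/KDnuggets_logo.gif", 200, "784"),
]
totals = basic_totals(entries)
print(totals)
```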
Example: KDnuggets.com Nov 2005 totals
Monthly Statistics (from webalizer)
Total                 Value
Hits                  1,121,643
Files                 930,468
Pages                 312,889
Kbytes                10,578,535
Unique Sites (IP)     35,942
Unique URLs           6,769
Unique Referrers      7,213
Unique User Agents    2,724
More details
Q: What is the meaning of the difference between Hits and Files?
Example: KDnuggets.com Nov 2005 totals, 2
Monthly stats for Files by Status Code
Answer: the difference between Hits and Files is the number of requests with status code not 200.
Code                            Hits
Code 200 - OK                   930,468
Code 206 - Partial Content      9,303
Code 301 - Moved Permanently    4,217
Code 302 - Found                457
Code 304 - Not Modified         170,874
Code 404 - Not Found            6,297
Other                           27
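The answer can be checked with the numbers in the status code table: the per-code counts should add up to total Hits, and everything except code 200 should equal Hits minus Files. A short Python check, with the dictionary name `status_counts` chosen for this example:

```python
# Hits per status code, copied from the Nov 2005 table above.
status_counts = {
    "200": 930_468,
    "206": 9_303,
    "301": 4_217,
    "302": 457,
    "304": 170_874,
    "404": 6_297,
    "other": 27,
}

hits = sum(status_counts.values())   # all requests
files = status_counts["200"]         # requests answered with 200 OK
non_200 = hits - files

print(f"Hits:         {hits:,}")
print(f"Files:        {files:,}")
print(f"Hits - Files: {non_200:,}")
```

The computed Hits value matches the 1,121,643 reported in the monthly totals, which supports reading Files as the status 200 count.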
Difference between Files and Pages
Q: What is the meaning of the difference between Files and Pages?
A: the difference between Files and Pages is the number of non-HTML files (e.g. images, JavaScript, etc.)
In the November 2005 KDnuggets log, HTML pages were about 1/3 of the Files total (312,889 of 930,468), or about 28% of all Hits
However, this data does not separate bot requests (which are heavily weighted towards HTML pages)
Notes: web log formats
We used a web log in the standard Apache format (Apache calls it the "combined" log format)
Some old logs use a different format without the last 2 fields (referrer and user agent), which Apache calls the "common" log format, but these are now rare.
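A parser can accept both formats by making the last two fields optional. The following Python sketch is one way to do that; the names `COMMON_PART`, `EITHER_FORMAT`, and `parse_either_format` are illustrative, and the shorter test line is simply a sample line from this log with the last two fields dropped, not a real entry from an older log.

```python
import re

# First seven fields, shared by both formats.
COMMON_PART = r'(\S+) (\S+) (\S+) \[([^\]]+)\] "(.*?)" (\d{3}) (\d+|-)'
# Referrer and user agent are optional, so older logs still match.
EITHER_FORMAT = re.compile(COMMON_PART + r'(?: "(.*?)" "(.*?)")?$')

def parse_either_format(line):
    """Return a dict of fields; referrer and agent are None for the older format."""
    match = EITHER_FORMAT.match(line.strip())
    if match is None:
        return None
    keys = ("ip", "name", "login", "time", "request",
            "status", "size", "referrer", "agent")
    return dict(zip(keys, match.groups()))

with_last_two = ('252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 '
                 '"http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"')
without_last_two = '252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145'

print(parse_either_format(with_last_two)["referrer"])
print(parse_either_format(without_last_two)["referrer"])
```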