An OLAM Framework for Web Usage Mining and Business Intelligence Reporting
Xiaohua (Tony) Hu
Drexel University
Philadelphia, PA, 19104
Outline
• Introduction
• Data Capture
• Data Webhouse Construction
• Mining, OLAP and Business Reporting
• Pattern Evaluation and Development
• Q&A
Benefits of Web Usage Mining
• Targeting customers based on usage
behavior or profile (personalization)
• Adjusting web content and structure
dynamically based on page access pattern of
users (adaptive web site)
• Enhancing the service quality and delivery
to the end user (cross-selling, up-selling)
• Improving web server system performance
based on the web traffic analysis
• Identifying hot areas/killer areas of the web
site
Web Usage Mining Steps
1. Data capture (clickstream, sales,
customers, products, promotions, shipping,
etc.)
2. Data processing
-- ETL from OLTP to DW
3. Pattern discovery, OLAP cubes, and
reports
4. Pattern evaluation and deployment
Data Capture
• The web server logs recording the visitors’ clickstream behavior (page template, cookie, transfer log, time stamp, IP address, agent, referrer, etc.)
• Product information (product hierarchy, manufacturer, price, color, size, etc.)
• Content information of the web site (image, GIF, video clip, etc.)
• The customer purchase data (quantity of the products, payment amount and method, shipping address, etc.)
• Customer demographic information (age, gender, income, education level, lifestyle, etc.)
Issues in Clickstream Capture
• Distinguish sessions
• Use cookies to track customers
• Tag templates
• Log business events
• Record query strings
• Detect crawlers
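One common way to distinguish sessions is to split each visitor's clicks wherever the gap between consecutive requests exceeds an inactivity timeout. The sketch below illustrates that idea only; the 30-minute threshold, the `sessionize` function, and the `(cookie_id, timestamp, page)` tuple layout are assumptions for illustration, not details given in the slides.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity threshold; the slides do not specify a value.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Group (cookie_id, timestamp, page) clicks into per-visitor sessions.

    A new session starts whenever the gap between two consecutive clicks
    from the same cookie exceeds `timeout`.
    """
    by_cookie = defaultdict(list)
    for cookie_id, ts, page in clicks:
        by_cookie[cookie_id].append((ts, page))

    sessions = []
    for cookie_id, hits in by_cookie.items():
        hits.sort(key=lambda h: h[0])          # order each visitor's clicks by time
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > timeout:     # long gap: close the current session
                sessions.append((cookie_id, current))
                current = []
            current.append(hit)
        sessions.append((cookie_id, current))
    return sessions

# Hypothetical clicks for two cookies.
clicks = [
    ("c1", datetime(2003, 1, 1, 10, 0), "main"),
    ("c1", datetime(2003, 1, 1, 10, 5), "pna"),
    ("c1", datetime(2003, 1, 1, 11, 0), "main"),   # 55-minute gap: new session
    ("c2", datetime(2003, 1, 1, 10, 1), "login"),
]
sessions = sessionize(clicks)
for cookie_id, hits in sessions:
    print(cookie_id, [page for _, page in hits])
```

Keying on the cookie rather than the IP address follows the "use cookies to track customers" point above, since several visitors can share one IP address.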
What Kind of Clickstream Information
Needs to Be Recorded?
– Request (Click) Data:
• Template, product, assortment
• Time stamps for each click, compile & execution times
• Query string information, referring page information
• The request sequence number within a session
– Cookie Data:
• The cookie of the visitor (this ID is temporary if the user has cookies turned off)
– Session Data:
• Session length
• Browser (user agent) and IP address information for the client
• User’s cookie ID
• User ID of the user if he/she logged in
• Whether or not the session timed out
• The total number of requests in the session
• Whether the session belongs to a user who “opts out”
• The total number of sessions that have come from users with this cookie ID
Web Log Data
• Designed for debugging purposes, not for
analysis
Crawler Session
• Crawlers are programs that visit your site
(search engines, shopping bots)
• It is very important to filter out crawler
sessions (on some of our clients’ sites,
crawler sessions account for up to 30%)
Techniques to Identify Crawler
Sessions
• Build a model to identify crawler sessions.
Typical signals: crawlers commonly turn off images
and have empty referrers; friendly bots visit the
robots.txt file; the page hit rate is too fast; the
access pattern is a depth-first or breadth-first
search of the site; bots never purchase
• Create invisible links in the web page, which only
crawlers are likely to follow
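As a rough illustration of how the signals listed above could be combined, the sketch below counts how many of them a session triggers. It is a simple rule-based stand-in, not the model the slides refer to; the field names, the hit-rate limit, and the score threshold are all assumptions, and the depth-first/breadth-first pattern check is left out.

```python
# Assumed limits; the slides give no concrete thresholds.
MAX_HITS_PER_SECOND = 2.0
CRAWLER_THRESHOLD = 3

def crawler_score(session):
    """Count how many crawler heuristics a session (a dict) triggers."""
    if session["purchased"]:
        return 0                      # "bots never purchase": treat buyers as human
    duration = max(session["duration_seconds"], 1)
    signals = [
        session["image_requests"] == 0,                        # images turned off
        not session["referrer"],                               # empty referrer
        session["requested_robots_txt"],                       # friendly bot
        session["requests"] / duration > MAX_HITS_PER_SECOND,  # hit rate too fast
        session["followed_hidden_link"],                       # invisible link followed
    ]
    return sum(signals)

def is_crawler(session, threshold=CRAWLER_THRESHOLD):
    return crawler_score(session) >= threshold

# Hypothetical sessions for illustration.
bot = {"purchased": False, "duration_seconds": 30, "requests": 120,
       "image_requests": 0, "referrer": "", "requested_robots_txt": True,
       "followed_hidden_link": True}
human = {"purchased": False, "duration_seconds": 300, "requests": 9,
         "image_requests": 14, "referrer": "http://example.com/",
         "requested_robots_txt": False, "followed_hidden_link": False}

print(crawler_score(bot), is_crawler(bot))
print(crawler_score(human), is_crawler(human))
```

In practice the same signals could serve as input features for a trained classifier instead of a fixed threshold.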
OLTP vs DSS
OLTP                                     DSS
Daily operation                          Analysis
Many small transactions                  Few large transactions
Need quick response                      Very time consuming
Read & write (insert, delete, update)    Mostly read only
Session and product centric              Customer centric
What is OLAM?
• OLAP (On-Line Analytical Processing):
pre-calculates summary information to enable
drilling, pivoting, slicing/dicing, and filtering, so that
the business can be analyzed from multiple angles or
views (dimensions)
• OLAM (On-Line Analytical Mining): an
integration of data mining, data warehousing,
and OLAP technologies
Data Webhouse Construction
• Requirement Analysis of the Data
Webhouse
• Data Webhouse Schema Design
Dimensions, Fact Tables,
Aggregation/Summary tables
Requirement Analysis of the Data
Webhouse
1. Web site activity (hourly, daily, weekly, monthly, quarterly, etc.)
2. Product sales (by region, by brand, by domain, by browser type, by time, etc.)
3. Customers (by type, by age, by gender, by region, buyer vs. visitor, heavy buyer vs. light buyer, etc.)
4. Vendors (by type, by region, by price range, etc.)
5. Referrers (by domain, by sale amount, by number of visits, etc.)
6. Navigational behavior patterns (top entry page, top exit page, killer page, hot page, etc.)
7. Click conversion ratio
8. Shipments (by regular mail, by express mail, etc.)
9. Payments (by cash, by credit card, by e-money, etc.)
Data Webhouse Schema Design
• Define the Source Data
• Choose the Grain of the Fact Tables
• Choose the Dimensions Appropriate for the
Grain
• Choose the Facts Appropriate for That
Grain
Appropriate Dimensions
• Session Dimension
• Page Dimension
• Time Dimension
• User Dimension
• Product Dimension
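To make the grain/dimension/fact vocabulary concrete, here is a minimal star-schema sketch using Python's built-in `sqlite3` module. The table names (`dim_*`, `fact_click`), the columns, the sample rows, and the choice of one fact row per page request as the grain are all assumptions for illustration; they are not the webhouse schema described in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_session (session_key INTEGER PRIMARY KEY, cookie_id TEXT, referrer TEXT);
CREATE TABLE dim_page    (page_key INTEGER PRIMARY KEY, template TEXT, page_type TEXT);
CREATE TABLE dim_time    (time_key INTEGER PRIMARY KEY, day TEXT, hour INTEGER);
CREATE TABLE dim_user    (user_key INTEGER PRIMARY KEY, gender TEXT, region TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
-- Assumed grain: one fact row per page request.
CREATE TABLE fact_click (
    session_key INTEGER REFERENCES dim_session(session_key),
    page_key    INTEGER REFERENCES dim_page(page_key),
    time_key    INTEGER REFERENCES dim_time(time_key),
    user_key    INTEGER REFERENCES dim_user(user_key),
    product_key INTEGER REFERENCES dim_product(product_key),  -- NULL on non-product pages
    seconds_on_page INTEGER                                    -- the measured fact
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO dim_session VALUES (1, 'c1', '')")
conn.executemany("INSERT INTO dim_page VALUES (?, ?, ?)",
                 [(1, "OF_Main.jsp", "home"), (2, "pna/pa_main.jsp", "catalog")])
conn.execute("INSERT INTO dim_time VALUES (1, '2003-01-01', 10)")
conn.execute("INSERT INTO dim_user VALUES (1, 'F', 'PA')")
conn.execute("INSERT INTO dim_product VALUES (1, 'seat')")
conn.executemany("INSERT INTO fact_click VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, None, 12), (1, 2, 1, 1, 1, 40), (1, 2, 1, 1, 1, 25)])

# Page views and total viewing time by page type.
rows = conn.execute("""
    SELECT p.page_type, COUNT(*), SUM(f.seconds_on_page)
    FROM fact_click f JOIN dim_page p USING (page_key)
    GROUP BY p.page_type ORDER BY p.page_type
""").fetchall()
print(rows)
```

Every reporting question then becomes a join from the fact table to one or more dimensions followed by a `GROUP BY` on the attributes of interest.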
Session Attributes
• Session Length
• Referrer
• Agent
• Host Name
• IP Address
• Cookie_id
• First Request Time
• Last Request Time
• Average Time Per Page
• Purchase Flag
• Time Out Flag
• Many more …
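Several of these session attributes can be derived from the ordered requests of a session. The sketch below shows one possible derivation; the function name, the `(timestamp, page)` layout, and the `PURCHASE_PAGE` marker are hypothetical, and the time-out flag is omitted because it depends on how the site ends sessions.

```python
from datetime import datetime

# Hypothetical name of the page whose request marks a completed purchase.
PURCHASE_PAGE = "checkout_confirm"

def session_attributes(requests):
    """Derive attributes from a non-empty list of (timestamp, page), sorted by time."""
    first, last = requests[0][0], requests[-1][0]
    length = (last - first).total_seconds()
    n = len(requests)
    return {
        "first_request_time": first,
        "last_request_time": last,
        "session_length_seconds": length,
        # Time spent on the last page is unknown, so average over the n-1 observed gaps.
        "avg_seconds_per_page": length / (n - 1) if n > 1 else 0.0,
        "purchase_flag": any(page == PURCHASE_PAGE for _, page in requests),
    }

# Hypothetical session with three requests.
attrs = session_attributes([
    (datetime(2003, 1, 1, 10, 0), "main"),
    (datetime(2003, 1, 1, 10, 2), "pna"),
    (datetime(2003, 1, 1, 10, 6), "checkout_confirm"),
])
print(attrs["session_length_seconds"], attrs["avg_seconds_per_page"], attrs["purchase_flag"])
```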
Customer Attributes
• Address: City, State/Province, Country
• Gender, Age, Profession, Education, Marital
Status
• Contact Info: Email, Phone
• Repeat Visit Flag
• Frequent Buyer Flag
• Heavy Spender Flag
• Reader/Browser Flag
• Many more …
Page Attributes
• Page Template
• Page Location
• Page Type
• Page Category
• Page Description
• Registration Page Flag
• Shipping Page Flag
• Checkout Page Flag
• Many more …
Promotion Attributes
• Promotion Name
• Price Reduction Percentage
• Adv Type
• Coupon Type
• Begin Date
• End Date
• Promotion Region
• Many more …
Date Attributes
• Day, Week, Month, Quarter, Year
• Day number in Month, Day Number in
Quarter, Day Number in Year
• Week number in Month, Week Number in
Quarter, Week Number in Year
• Weekday Flag
• Weekend Flag
• Season
• Many more …
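Most of these date attributes can be computed once per calendar day when the date dimension is loaded. The sketch below derives a subset of them; the function and key names are hypothetical, the week number uses the ISO calendar, and the season mapping assumes Northern Hemisphere meteorological seasons. None of these choices is specified in the slides.

```python
from datetime import date

def season(month):
    # Assumed mapping: Dec-Feb winter, Mar-May spring, Jun-Aug summer, Sep-Nov fall.
    return ("winter", "spring", "summer", "fall")[(month % 12) // 3]

def date_attributes(d):
    """Build one (partial) date-dimension row for the calendar day `d`."""
    quarter = (d.month - 1) // 3 + 1
    quarter_start = date(d.year, 3 * (quarter - 1) + 1, 1)
    return {
        "month": d.month,
        "quarter": quarter,
        "year": d.year,
        "day_number_in_month": d.day,
        "day_number_in_quarter": (d - quarter_start).days + 1,
        "day_number_in_year": d.timetuple().tm_yday,
        "week_number_in_year": d.isocalendar()[1],   # ISO week number
        "weekday_flag": d.weekday() < 5,             # Monday=0 ... Friday=4
        "weekend_flag": d.weekday() >= 5,
        "season": season(d.month),
    }

row = date_attributes(date(2003, 5, 17))
print(row)
```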
Time Attributes
• Second, Minute, Hour
• Early Morning Flag
• Late Afternoon Flag
• Lunch Time Flag
• Dinner Time Flag
• Late Evening Flag
• Many more …
OLAP
• View data from multiple views and angles
• Immediate response to business queries
• Ability to drill down and roll up the
multidimensional data in the cube
• Analyze business measures such as profit,
revenue, and quantity from different angles,
perspectives, and factors
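Roll-up, drill-down, and slicing all reduce to re-aggregating the same measure over a different set of dimensions. The sketch below shows this on a tiny in-memory fact list; the data, the dimension names, and the `aggregate` helper are hypothetical, and a real OLAP server would pre-compute such summaries rather than scan the facts on every query.

```python
from collections import defaultdict

DIMENSIONS = ("region", "brand", "month")

# Hypothetical fact rows: (region, brand, month, revenue).
facts = [
    ("East", "A", "2003-01", 100.0),
    ("East", "B", "2003-01", 50.0),
    ("East", "A", "2003-02", 70.0),
    ("West", "A", "2003-01", 30.0),
]

def aggregate(rows, group_by, where=None):
    """Sum revenue grouped by the named dimensions, after an optional slice."""
    where = where or {}
    idx = {name: i for i, name in enumerate(DIMENSIONS)}
    totals = defaultdict(float)
    for row in rows:
        if all(row[idx[d]] == v for d, v in where.items()):   # slice/filter
            key = tuple(row[idx[d]] for d in group_by)
            totals[key] += row[-1]
    return dict(totals)

by_region = aggregate(facts, ["region"])                          # roll-up to region
east_by_brand = aggregate(facts, ["brand"], {"region": "East"})   # slice East, drill down to brand
print(by_region)
print(east_by_brand)
```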
Some Fact Tables
MINE_ORDERS_CLICKS_GIFTS
This table contains a row for each order line,
clickstream request, and gift registry entry. It is the
union of the MINE_ORDER_LINES,
MINE_CLICK_LINES, and MINE_GIFT_LINES tables
and is used as the fact table when mining on a
combination of order and clickstream data. Since
different columns apply to different types of line items
they are marked with the applicable type(s) (order, click,
gift, or all).
MINE_ORDERS_ACXIOM
MINE_ORDER_HEADERS joins with
MINE_CUSTOMERS, MINE_ACXIOM,
MINE_PROMOTIONS
MINE_LINE_ITEMS
MINE_ORDER_LINES joins with MINE_CUSTOMERS,
MINE_ORDER_HEADERS, MINE_PRODUCTS,
MINE_ASSORTMENTS, MINE_PROMOTIONS
Some Dimension and Summary Tables in
Webhouse
MINE_CLICK_LINES
a row for each Web page viewed
MINE_ACXIOM
a row for each customer for which the system was able to find
Acxiom data
MINE_SESSIONS
a row for each Web session
MINE_ASSORTMENTS
a row for each assortment folder, assortment, and sub
assortment defined in the system.
MINE_CUSTOMERS
a row for each customer
MINE_GIFT_HEADERS
a gift row for each customer
MINE_GIFT_LINES
a row for each gift registry item of each customer
MINE_ORDER_LINES
contains a row for each order line of each order
MINE_ORDER_HEADERS
a row for each order of each customer
MINE_PROMOTIONS
a row for each promotion folder and promotion defined in the
system
Search Argument Findings
Value           Records    Percent    Normalized
NULL            292,952     99.46%
fat boy              70      0.02%       4.38%
chrome               64      0.02%       4.00%
motorclothes         60      0.02%       3.75%
fuel tank            53      0.02%       3.31%
sportster            43      0.01%       2.69%
maintenance          37      0.01%       2.31%
sidecar              30      0.01%       1.88%
sissy bar            28      0.01%       1.75%
seat                 27      0.01%       1.69%
touring              26      0.01%       1.63%
fuel tanks           25      0.01%       1.56%
exhaust              24      0.01%       1.50%
accessories          23      0.01%       1.44%
road king            23      0.01%       1.44%
rear fender          22      0.01%       1.38%
backrest             22      0.01%       1.38%
style                21      0.01%       1.31%
fatboy               20      0.01%       1.25%
deuce                20      0.01%       1.25%
Other Values        961      0.33%
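The slide does not define the two percentage columns, but the figures are consistent with Percent being each value's share of all records and Normalized being its share of the records with a non-NULL search argument. The sketch below recomputes both from the record counts under that assumption; the variable and function names are illustrative.

```python
# Record counts as listed in the table; None stands for the NULL search argument.
records = [
    (None, 292952), ("fat boy", 70), ("chrome", 64), ("motorclothes", 60),
    ("fuel tank", 53), ("sportster", 43), ("maintenance", 37), ("sidecar", 30),
    ("sissy bar", 28), ("seat", 27), ("touring", 26), ("fuel tanks", 25),
    ("exhaust", 24), ("accessories", 23), ("road king", 23), ("rear fender", 22),
    ("backrest", 22), ("style", 21), ("fatboy", 20), ("deuce", 20),
    ("Other Values", 961),
]

total = sum(n for _, n in records)                          # all records
non_null = sum(n for v, n in records if v is not None)      # records with a search argument

def percent(n):
    """Share of all records, in percent (assumed meaning of 'Percent')."""
    return round(100 * n / total, 2)

def normalized(n):
    """Share of non-NULL records, in percent (assumed meaning of 'Normalized')."""
    return round(100 * n / non_null, 2)

print(total, non_null)
print(percent(292952), percent(961))
print(normalized(70), normalized(20))
```

Under this reading, only 1,599 of 294,551 records carry a search argument at all, so the Normalized column is the more useful one for ranking search terms.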
Top 20 Paths Leading to Non-Purchase Sessions
counts  path
14622   main
 3731   main->main
  790   main->main->main
  329   main->main->login
  303   main->main->main->main
  274   login
  216   main->main->pna->pna
  212   pna
  192   main->main->pna->pna->pna
  185   main->main->eDealer
  180   mc
  175   main->main->pna
  169   main->main->pna->pna->pna->pna->pna
  166   main->main->pna->pna->pna->pna->pna->pna
  160   main->main->pna->pna->pna->pna->pna->pna->pna
  147   main->main->pna->pna->pna->pna
  131   main->main->mc->mc->mc->mc
  118   main->main->pna->pna->pna->pna->pna->pna->pna->pna
  111   main->main->mc->mc->mc
  106   main->main->pna->pna->pna->pna->pna->pna->pna->pna->pna
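A path table like this one can be produced by joining each session's page sequence into a string and counting identical strings. The sketch below is a minimal version of that idea; the `(pages, purchased)` session layout and the sample data are hypothetical.

```python
from collections import Counter

def top_paths(sessions, n=20, purchased=False):
    """Return the n most frequent click paths among sessions with the given outcome.

    sessions: iterable of (pages, purchased_flag), where pages is the ordered
    list of page areas visited in the session.
    """
    counter = Counter("->".join(pages)
                      for pages, bought in sessions
                      if bought == purchased)
    return counter.most_common(n)

# Hypothetical sessions.
sessions = [
    (["main"], False),
    (["main", "main"], False),
    (["main"], False),
    (["main", "pna", "cart"], True),
    (["login"], False),
]
top = top_paths(sessions, n=2)
print(top)
```

Passing `purchased=True` gives the complementary table of paths that end in a purchase.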
Top 20 Paths That Start at OF_Main.jsp and Exit at OF_Main.jsp
Counts  Paths
  154   OF_Main.jsp->splash.jsp->OF_Main.jsp
  122   OF_Main.jsp->OF_Main.jsp
   52   OF_Main.jsp->splash.jsp->OF_Main.jsp->OF_Main.jsp
   28   OF_Main.jsp->OF_Main.jsp->OF_Main.jsp
   25   OF_Main.jsp->splash.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp
   23   OF_Main.jsp->OF_Main.jsp->splash.jsp->OF_Main.jsp
   16   OF_Main.jsp->splash.jsp->pna/pa_main.jsp->OF_Main.jsp
   15   OF_Main.jsp->splash.jsp->login/ln_login.jsp->OF_Main.jsp
   13   OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp
   13   OF_Main.jsp->splash.jsp->mc/MC_main.jsp->OF_Main.jsp
   11   OF_Main.jsp->splash.jsp->dealer_positioning.jsp->OF_Main.jsp
   11   OF_Main.jsp->splash.jsp->pna/pa_main.jsp->pna/pa_family.jsp->OF_Main.jsp
   10   OF_Main.jsp->splash.jsp->login/ln_login.jsp->login/ln_loginopp.jsp->login/ln_message.jsp->OF_Main.jsp
    9   OF_Main.jsp->splash.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp
    7   OF_Main.jsp->splash.jsp->cart/sc_listing.jsp->OF_Main.jsp
    7   OF_Main.jsp->splash.jsp->login/ln_login.jsp->login/ln_login_step.jsp->OF_Main.jsp
    6   OF_Main.jsp->browser_message.jsp->OF_Main.jsp
    5   OF_Main.jsp->dealer_positioning.jsp->OF_Main.jsp
    5   OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp
    5   OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp->OF_Main.jsp
Single/Multiple visitors/buyers
Type                    Counts
Single Visit              1823
Multiple Visit              37
Single Visit Buyer         269
Multiple Visit Buyer        58
Unknown                   2846
Web Usage Mining Methods
• Construct cubes from the data webhouse
Roll up and drill down the OLAP cubes to find the top domains,
top products, top hot spots, web activity, most frequently accessed
time periods, etc.
• Perform data mining on the data webhouse
Find association patterns for cross-selling and up-selling; discover links between pages,
sequential patterns, and web access trends; improve system design through
web caching, web page prefetching, and web page swapping
Mining the web data
• Association Rules
• Classification/Prediction
• Clustering
Data Mining - Association
• Path/link analysis: explore, understand, and
predict browsing patterns
• Shopping cart analysis: cross-sell and up-sell
to increase wallet share
Gloss Example
No.  Relations  Lift  Support(%)  Confidence(%)  Rule
 1   2          1.56  1.89        18.58          Bloom ==> Dirty_Girl
 2   2          1.56  1.89        15.91          Dirty_Girl ==> Bloom
 3   2          1.13  1.50        11.52          Philosophy ==> Bloom
 4   2          1.13  1.50        14.75          Bloom ==> Philosophy
 5   2          1.66  1.41        11.87          Dirty_Girl ==> Blue_Q
 6   2          1.66  1.41        19.75          Blue_Q ==> Dirty_Girl
 7   2          3.12  1.32        18.41          Tony_And_Tina ==> Girl
 8   2          1.41  1.32        10.14          Philosophy ==> Tony_And_Tina
 9   2          1.41  1.32        18.41          Tony_And_Tina ==> Philosophy
10   2          2.96  1.32        18.88          Demeter_Fragrances ==> Smell_This
11   2          3.12  1.32        22.45          Girl ==> Tony_And_Tina
12   2          2.96  1.32        20.75          Smell_This ==> Demeter_Fragrances
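For reference, the three measures in the table have standard definitions for a rule A ==> B: support is the fraction of baskets containing both A and B, confidence is support divided by the fraction containing A, and lift is confidence divided by the fraction containing B. The sketch below computes them for all two-item rules; the baskets are invented for illustration (only the brand names come from the table), and the function name is hypothetical.

```python
from itertools import permutations

def pair_rules(baskets, min_support=0.0):
    """Return {(a, b): (support, confidence, lift)} for every rule a ==> b."""
    n = len(baskets)
    sets = [set(b) for b in baskets]
    items = sorted(set().union(*sets))
    count = {i: sum(i in s for s in sets) for i in items}
    rules = {}
    for a, b in permutations(items, 2):
        both = sum(a in s and b in s for s in sets)
        support = both / n
        if both == 0 or support < min_support:
            continue
        confidence = both / count[a]          # P(B | A)
        lift = confidence / (count[b] / n)    # P(B | A) / P(B)
        rules[(a, b)] = (support, confidence, lift)
    return rules

# Invented baskets; not data from the slides.
baskets = [
    ["Bloom", "Dirty_Girl"],
    ["Bloom"],
    ["Dirty_Girl", "Blue_Q"],
    ["Philosophy"],
]
rules = pair_rules(baskets)
for (a, b), (s, c, l) in rules.items():
    print(f"{a} ==> {b}: support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

Support and lift are symmetric in A and B while confidence is not, which matches the table: each rule and its reverse share the same lift and support but differ in confidence.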
Data Mining - Classification
• Understand customers via rules, trees, etc.
• Prediction models for target-oriented
marketing/campaigns
Data Mining - Clustering
• Discover groups/segments with similar
behavior or profiles
Questions?