An XML Log Standard and Tool for Digital Library Logging Analysis
Marcos André Gonçalves, Ming Luo, Rao Shen, Mir Farooq Ali, and Edward A. Fox
Virginia Tech
Outline

Motivation
Related Work
  Problems with existing DL logs
The Digital Library Standardized Log Format
  DL log standard design
  DL log format structure
DL log tool and its implementation
Conclusions and future work
Motivation

Log analysis
  Source of information about:
    How patrons really use DL services
    How systems behave while supporting user information-seeking activities
    Examples: patterns
  Used to:
    Evaluate
    Enhance services
    Help and design user interfaces
    Better allocation of resources
Common practice in the web setting
  Supported by web servers, proxy caching
Motivation (cont.)

DLs differ from the web
  DL collections are explicitly organized, described, managed, and preserved
  Users with more specific tasks and needs
  Digital objects and collections more structured
DL logging should offer much richer information and opportunities
  Tradeoff: user privacy
Current DL logs
  Differences in formats and recorded information
  Problems:
    Lack of interoperability
    No reuse of analysis tools
    No comparability of log analysis results
Related Work

Web Servers (Common Log Format)

Focused on browsing, stateless
bbn-cache-3.cisco.com - - [22/Oct/1998:00:20:21 -0400] "GET /~harley/courses.html HTTP/1.0" 200 1734
bbn-cache-3.cisco.com - - [22/Oct/1998:00:20:22 -0400] "GET /~harley/clip_art/word_icon.gif HTTP/1.0" 200 1050
www4.e-softinc.com - - [22/Oct/1998:00:20:27 -0400] "HEAD / HTTP/1.0" 200 0
user-38ldbam.dialup.mindspring.com - - [22/Oct/1998:00:20:48 -0400] "GET /~lhuang/junior/capehatteras.html HTTP/1.0" 200 328
user-38ldbam.dialup.mindspring.com - - [22/Oct/1998:00:20:48 -0400] "GET /~lhuang/junior/PB2panforringed.mirror.gif HTTP/1.0" 200 20222
eger-dl01.agria.hu - - [22/Oct/1998:00:20:51 -0400] "GET /~tjohnson/pinouts/ HTTP/1.0" 200 26994
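
To make the Common Log Format fields concrete, here is a minimal Python sketch that splits one of the lines above into host, timestamp, request, status, and size. The function name parse_clf_line and the regular expression are illustrative assumptions, not part of any web server or of the DL log tool.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of CLF fields, or None if the line does not match.

    Hypothetical helper written for this example only.
    """
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 22/Oct/1998:00:20:21 -0400
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # CLF writes "-" when no body was sent
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = ('bbn-cache-3.cisco.com - - [22/Oct/1998:00:20:21 -0400] '
          '"GET /~harley/courses.html HTTP/1.0" 200 1734')
entry = parse_clf_line(sample)
print(entry["host"], entry["request"], entry["status"], entry["size"])
```

Each line carries only a single stateless request; nothing in it identifies a session, a query, or a collection, which is the limitation the deck contrasts with DL logging.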
Related Work (cont.)

DL - Greenstone
ADMINISTRATION 37
/fast-cgi-bin/niupepalibrary
(a) its-www1.massey.ac.nz
(b) [Thu Dec 07 23:47:00 NZDT 2000]
(c) (a=p, b=0, bcp=, beu=, c=niupepa, cc=, ccp=0, ccs=0,
cl=, cm=, cq2=, d=, e=, er=, f=0, fc=1, gc=0, gg=text,
gt=0, h=, h2=, hl=1, hp=, il=l, j=, j2=, k=1, ky=,
l=en, m=50, n=, n2=, o=20, p=home, pw=, q=, q2=, r=1,
s=0, sp=frameset, t=1, ua=, uan=, ug=, uma=listusers,
umc=, umnpw1=, umnpw2=, umpw=, umug=, umun=, umus=,
un=, us=invalid, v=0, w=w, x=0, z=130.123.128.4950647871)
(d) "Mozilla/4.08 [en] (Win95; I ;Nav)"
Related Work (cont.)

Search Engine - OpenText
Mon Sep 28 17:48:42 1998
----- Starting Search ----
Mon Sep 28 17:48:42 1998
{Transaction Begin}
Mon Sep 28 17:48:42 1998
{RankMode Relevance1}
Mon Sep 28 17:48:42 1998
"Bacillus thuringiensis "
Mon Sep 28 17:48:42 1998
P0 = "Bacillus thuringiensis "
Mon Sep 28 17:48:42 1998
R = (*D including (*P0))
Mon Sep 28 17:48:42 1998
R = (((*R rankedby *P0)))
Mon Sep 28 17:48:42 1998
S = (subset.1.10 (*R))
Mon Sep 28 17:48:42 1998
SL0 = (region "OTSummary" within.1 (*S))
Mon Sep 28 17:48:42 1998
(*SL0 within.1 ( subset.1.1 *S ))
Mon Sep 28 17:48:42 1998
(*SL0 within.1 ( subset.2.1 *S ))
Mon Sep 28 17:48:42 1998
{Transaction End}
Related Work (cont.)

Problems with existing DL logs
  Incompatibility
  Incompleteness
  Complexity of analysis
  Lack of organization
  Ambiguity
  Inflexibility
  Verboseness
The Digital Library Standardized Log Format

Comprehensive
Reflective of the actual DL system behavior
Easily readable
Precise
Flexible enough to accommodate varying systems
Succinct enough to be implemented
Concern: user privacy
The Digital Library Standardized Log Format - Design (cont.)

Capture high-level user and system behaviors
  Interactions between the users and the system or among the system components
Hierarchical organization
Encapsulated in transactions
  Log format designed to record a number of different kinds of transactions
  Examples:
    1. Login to the system
    2. Submission of search query
    3. Browsing a result list
    4. Recording of a user failure
The Digital Library Standardized Log Format - Design (cont.)

Design
  Reflective of DL behavior
  Based on the 5S formal theory
    Unifying, mathematical theory to formally describe the semantics of DL components
    Guidance for how to organize the log structure
The Digital Library Standardized Log Format - Design (cont.)

5S: definition and use in log design

Streams
  Definition: Represent static and dynamic multimedia content
  Use in log design: Temporal events, types of digital objects
Structures
  Definition: Labeled directed graphs; provide organization within the DL
  Use in log design: Structured documents and metadata; structured searches, collection, metadata catalog; hypertext, classification scheme
Spaces
  Definition: Sets, properties and operations on those sets
  Use in log design: Retrieval mode, Presentation information,
Scenarios
  Definition: Sequences of events that modify states of a computation in order to accomplish some functional requirement
  Use in log design: Organization of the user and system actions into transactions, statements, events and actions; DL services as sets of scenarios
Societies
  Definition: Sets of communities and relationships among them
  Use in log design: User information
The Digital Library Standardized Log Format (cont.)

Specification

Collection of extensive, flat set of attributes
[Figure labels: update, catalog, event, session, help, query, collection, transaction, timestamp, response, result cutoff, search, registering, error, browse, sorting rule, machine information, action]
The Digital Library Standardized Log Format - Specification

Organization in structured logical way

XML - XML Schema
  Standard syntax
  Guarantee quality, correctness
  Rich set of basic types help standardization
  Abundance of XML parsers helps construction of analysis tools
The Digital Library Standardized Log Format - Structure

Top Level Hierarchy
[Figure labels: Log, ..., Log Entry, Transaction, ..., Statement, SessionId, TimeStamp, MachineInfo]
The Digital Library Standardized Log Format - Structure (cont.)

Decomposition of statement into different types
Statement types: ErrorInfo, SessionInfo, HelpInfo, RegisterInfo, Event, AdmInfo
The Digital Library Standardized Log Format - Structure (cont.)

Decomposition of event
Statement types: ErrorInfo, SessionInfo, Event, HelpInfo, AdmInfo, RegisterInfo
[Figure labels under Event: Action, Search, Browse, StatusInfo, Update, StoreSysInfo]
The Digital Library Standardized Log Format - Structure (cont.)

Search Attributes
[Figure labels under Search: TimeFrame, Collection, PresentationInfo, Catalog, SearchBy, QueryString, Format, SortBy, NumberOfResults, CutOff]
DL Log Tool and Implementation

Java classes
  XMLLogData: store data
  XMLLogManager: methods to read and write log information according to the format
Middleware for plug-in DL tool to target system
  Synchronized reads and writes: avoid conflicts and inconsistencies
  Events based on target system architecture and implementation
Implemented in the MARIAN DL system
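
The point about synchronized reads and writes can be illustrated with a small sketch. The original tool is a pair of Java classes whose code is not shown in the slides; the Python class below, SimpleXMLLogManager, and its method names are hypothetical stand-ins that only assume a single lock guarding an in-memory XML log.

```python
import threading
import xml.etree.ElementTree as ET


class SimpleXMLLogManager:
    """Hypothetical stand-in for a log manager; not the MARIAN Java class."""

    def __init__(self):
        self._lock = threading.Lock()   # serializes readers and writers
        self._root = ET.Element("Log")

    def write_log_entry(self, session_id, timestamp):
        """Append one Transaction element while holding the lock."""
        with self._lock:
            transaction = ET.SubElement(self._root, "Transaction")
            # ID is the 1-based position of the new entry in the log
            transaction.set("ID", str(len(self._root)))
            ET.SubElement(transaction, "SessionId").text = session_id
            ET.SubElement(transaction, "TimeStamp").text = timestamp
            return transaction.get("ID")

    def get_log_data(self):
        """Return a serialized snapshot of the log taken under the lock."""
        with self._lock:
            return ET.tostring(self._root, encoding="unicode")


manager = SimpleXMLLogManager()


def worker(n):
    # Each worker plays the role of one component emitting events
    for _ in range(50):
        manager.write_log_entry(f"session-{n}", "2002-05-31T20:10:55.000-05:00")


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

log = ET.fromstring(manager.get_log_data())
ids = [t.get("ID") for t in log.findall("Transaction")]
print(len(ids), len(set(ids)))
```

Because the ID is assigned inside the same critical section that appends the element, concurrent writers cannot produce duplicate or interleaved entries.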
DL Log Tool and Implementation (cont.): the MARIAN DL system

[Figure: MARIAN DL system architecture. Labels: User Interaction Layer; Search Layer; Database Layer; Data Analysis, Collection Builders & Loading Tools; Distributed client communication; Webgate; Structured logging; Semantic network; Management API; Customization and personalization; Query history; Searcher community; Fusion modules; Multilingual support; Generalized inverted index interfaces; Tailored DL infrastructure generation; Database management API; Semantic networks persistent storage; DL information networks characterization, indexing and loading]
DL Log Tool and Implementation (cont.)

[Figure: log middleware data flow. Labels: DL patron; user event (c1); system event (c2); log middleware; DL analyst; analysis request; result; MARIAN User Layer; analysis tool; writeLogEntry(parameters); XMLLogManager; storelogData(parameters); getLogData(parameters); logData; XMLLogData]
DL Log Tool and Implementation (cont.)

Example 1: Login to the system
<Transaction ID="3452">
  <SessionId> 987654usr3 </SessionId>
  <SessionInfo>
    <SessionStart> Start </SessionStart>
    <LoginInfo>
      <UserId> mhabib </UserId>
    </LoginInfo>
  </SessionInfo>
  <TimeStamp> 2002-05-31T20:10:55.000-05:00 </TimeStamp>
  <MachineInfo>
    <IPAddress> 128.173.244.56 </IPAddress>
    <Port> 8000 </Port>
  </MachineInfo>
</Transaction>
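
As an illustration of how a login entry like Example 1 might be produced programmatically, the following Python sketch builds the same element names with the standard xml.etree.ElementTree module. The function build_login_transaction and its parameters are assumptions made for this example; the actual tool is written in Java and its API is not shown in the slides.

```python
import xml.etree.ElementTree as ET


def build_login_transaction(transaction_id, session_id, user_id,
                            timestamp, ip_address, port):
    """Build a login Transaction element (hypothetical helper).

    Element names follow Example 1; everything else is illustrative.
    """
    transaction = ET.Element("Transaction", {"ID": str(transaction_id)})
    ET.SubElement(transaction, "SessionId").text = session_id

    session_info = ET.SubElement(transaction, "SessionInfo")
    ET.SubElement(session_info, "SessionStart").text = "Start"
    login_info = ET.SubElement(session_info, "LoginInfo")
    ET.SubElement(login_info, "UserId").text = user_id

    ET.SubElement(transaction, "TimeStamp").text = timestamp

    machine_info = ET.SubElement(transaction, "MachineInfo")
    ET.SubElement(machine_info, "IPAddress").text = ip_address
    ET.SubElement(machine_info, "Port").text = str(port)
    return transaction


login = build_login_transaction(
    3452, "987654usr3", "mhabib",
    "2002-05-31T20:10:55.000-05:00", "128.173.244.56", 8000)

# Serialize, then parse again to confirm the entry is well-formed XML
xml_text = ET.tostring(login, encoding="unicode")
reparsed = ET.fromstring(xml_text)
print(xml_text)
```

Generating entries through an XML library rather than string concatenation means every opened element is closed automatically, so the log stays parseable by standard tools.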
DL Log Tool and Implementation

Example 2: query all Dirline records about “low back pain”
...
<Event>
  <Action>
    <Search>
      <Collection>Dirline</Collection>
      <ObjectType>CommunityRecord</ObjectType>
      <SearchBy>SearchByAnyParts</SearchBy>
      <SearchType>NonPersistant</SearchType>
      <QueryString>low back pain</QueryString>
      <TimeFrame>
        <StartTime>2002-05-31T20:11:07.000-05:00</StartTime>
        <EndTime>2002-05-31T20:11:09.000-05:00</EndTime>
      </TimeFrame>
      <PresentationInfo>
        <Format>List</Format>
        <SortBy>ByRank</SortBy>
        <NumberOfResults>217</NumberOfResults>
        <Cutoff>20</Cutoff>
      </PresentationInfo>
...
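
One benefit of the structured format is that analysis needs only a generic XML parser. The Python sketch below reads a search event shaped like Example 2 and derives the query, the result count, and the search duration. The slide elides the end of the entry, so the closing tags in SEARCH_EVENT are an assumption, only a subset of the elements is reproduced, and summarize_search is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Subset of Example 2; closing tags added by assumption, since the
# slide elides the rest of the entry.
SEARCH_EVENT = """
<Event>
  <Action>
    <Search>
      <Collection>Dirline</Collection>
      <QueryString>low back pain</QueryString>
      <TimeFrame>
        <StartTime>2002-05-31T20:11:07.000-05:00</StartTime>
        <EndTime>2002-05-31T20:11:09.000-05:00</EndTime>
      </TimeFrame>
      <PresentationInfo>
        <NumberOfResults>217</NumberOfResults>
        <Cutoff>20</Cutoff>
      </PresentationInfo>
    </Search>
  </Action>
</Event>
"""


def summarize_search(event_xml):
    """Extract a few analysis fields from one search event (illustrative)."""
    search = ET.fromstring(event_xml).find("Action/Search")
    start = datetime.fromisoformat(search.findtext("TimeFrame/StartTime").strip())
    end = datetime.fromisoformat(search.findtext("TimeFrame/EndTime").strip())
    return {
        "collection": search.findtext("Collection").strip(),
        "query": search.findtext("QueryString").strip(),
        "seconds": (end - start).total_seconds(),
        "results": int(search.findtext("PresentationInfo/NumberOfResults")),
        "cutoff": int(search.findtext("PresentationInfo/Cutoff")),
    }


summary = summarize_search(SEARCH_EVENT)
print(summary)
```

The same few lines would work on logs from any DL system that emits this format, which is the reuse of analysis tools the deck argues for.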
DL Log Tool and Implementation

Example 3: Browse an item of the ranked list returned as an answer for the previous search
<Transaction ID="3456">
  <SessionId> 987654usr3 </SessionId>
  ...
  <Statement>
    <Event>
      <Action>
        <Browse>
          <DocID> 5114 </DocID>
          <DocName>University of Washington School of Medicine Multidisciplinary Pain Center (UWPC)</DocName>
          ...
In conclusion

Analysis of current DL log formats
  Need for standardization, common practices, interoperable tools
Designed an XML-based log format standard for DL logging analysis
  Captures a rich, detailed set of system and user behaviors
Implemented format in a log component tool
  Connected to the MARIAN DL system
Future Work

Build suite of components for evaluation
Use log format and tools to evaluate several projects
  Networked Digital Library of Theses and Dissertations (NDLTD)
  CITIDEL
  Broadening the scope of use to other NSDL projects
Extend and use log tool with other DL systems and architectures
Consider user privacy issues
Explore info for personalization
Future Work (cont.)

Crosswalks to other standards (e.g. CLF)
  “Not yet other standard”
More challenges
  Distributed logs
  Large settings
  Investigate compression issues to deal with XML verboseness
Promote discussions:
  Listserv: [email protected]