Transcript Document
Do It Yourself Primo Statistics
The Art of the (Relatively) Painless
Extraction
ANNE L. HIGHSMITH
DIRECTOR, CONSORTIA SYSTEMS
TEXAS A&M UNIVERSITY
[email protected]
HTTP://LIBRARY.TAMU.EDU/DIRECTORY/HISMITH
Our Environment
Our Primo Environment
Texas A&M University is a hosted, Direct customer, in
production since June 2012.
As a hosted customer, we have a staging system as
well as production. All program development for
these extracts has been done on the production
system.
We are currently on release 4.4.1
Our Reporting Environment
Report server with an Oracle database
Oracle is separately licensed, so we can do
development on it
Contains SFX/MetaLib extracts and statistics and a
full copy of the Voyager database, rebuilt nightly
from backup
Viewing the Views
How to see what’s available
Log in as primo user
Execute: s+ RPT00
Execute: SELECT VIEW_NAME FROM ALL_VIEWS
WHERE OWNER LIKE ‘%RPT00’
CLICK_EVENTS
SEARCH_STATISTICS
SEARCH_STRINGS
To see view definition, execute: SELECT TEXT FROM
ALL_VIEWS WHERE VIEW_NAME =
‘CLICK_EVENTS’
SELECT ID,
SUMMARY_TIMESTAMP EVENT_DATE,
CLICK_TYPE EVENT_TYPE,
CASE WHEN CLICK_VALUE='N/A' THEN '' ELSE
CLICK_VALUE END CLICK_VALUE,
CLICK_COUNT,
SOURCE_VIEW,
SOURCE_INSTITUTION,
SOURCE_ON_CAMPUS,
SOURCE_USER_GROUP
from P41_PRM00.S_CLICK_SUMMARIES
WHERE CLICK_TYPE NOT IN ('File System', 'DB Listener',
'Load', 'Indexes', 'Table Space', 'Search Problem', 'IO
Wait', 'Memory')
View Definitions
All stats views seem to be based on
S_SEARCH_SUMMARIES & S_CLICK_SUMMARIES
tables
Notice
that CLICK_EVENTS excludes some system-type
stats
SEARCH_STATISTICS is a subset of
S_SEARCH_SUMMARIES, where
SUMMARY_TYPE='SEARCH_COUNT‘
SEARCH_STRINGS is a subset of
S_SEARCH_SUMMARIES, where SUMMARY_TYPE =
'TOP_SEARCHES_SUMMARY'
Data Anomalies
SQL vs. BIRT Reports
Replicate BIRT report for Click Events
Event type
Display
details tab
DS
GetIt!Link2
BIRT
SQL
5
49,629
444,539
1
89,791
20,682
SQL Selection Criteria Issues
Some tables contain “junk”
Out of 10M rows in the CLICK_EVENTS
view, 36% had no institution name
Myriad variations in INSTITUTION_NAME
Basic Selection Criteria
SELECT event_type,
click_value,
click_count,
institution,
\"VIEW\" AS view_name,
on_campus,
user_group
FROM
p41_rpt00.click_events
WHERE to_char(event_date,'YYYYMM') = '$previous_month'
AND
institution is not null
AND
lower(institution) not like 'primo%'
Scope Names
Hoped that SCOPE_NAME would be equivalent to the
Search Scope Name as it appears on the Search Scope
List in the Primo Back Office.
Current default SCOPE_NAME appears as:
scope:("MSL"),scope:(libguides),scope:(archon),scope:(AMDB_
VOYAGER),scope:(TAMU-SFX
),scope:(EVANS),scope:(tamu_dspace_qdc),primo_central_mu
ltiple_fe
Collected all known SCOPE_NAME values in a Perl
module, TAMU_Primo.pm
Scope Types
SEARCH_STATISTICS and SEARCH_STRINGS views
contain an element called SCOPE_TYPE
SCOPE_TYPE in SEARCH_STRINGS should be
limited to LOCAL/REMOTE
SCOPE_TYPE IN SEARCH_STATISTICS should be
limited to LOCAL/REMOTE/DS
Scope Types (Continued)
SEARCH_STATISTICS – 16% of SCOPE_TYPE values
are something other than LOCAL/REMOTE
SEARCH_STRINGS – 12% of SCOPE_TYPE values
are something other than LOCAL/REMOTE/DS
If the retrieved value didn’t match the list of
defined values, I set it to null.
Data I Can’t Make Sense of
SEARCH_STRINGS has only 149,127 rows in the
view
Are these unique strings?
If yes, why does the same string appear in different
rows?
What do the numbers, such as AVERAGE_RESULTS
and SEARCH_COUNT, really mean?
Example
“Fluid mechanics” appears as a search string in the
default scope 5 times in the period 1/18/20143/5/2014.
AVERAGE_RESULTS by date
18-Jan-14 210677
31-Mar-14 150528
27-Feb-14 58544
5-Mar-14
58576
5-Mar-14
74119
Perl Extract Programs
Generalities
The extract and processing programs for the TAMU
report server are written in Perl; the front end is
written in PHP
The Primo stats extract programs I have written live
on the production Primo server; they sftp output to
the report server
The perl programs use a local symlink from
/exlibris/product/perl-5.8.9/bin/perl to
/exlibris/primo/scripts/perl
Generalities (Continued)
The Primo group consists of 5 Perl programs and 1
module
click_extract.pl,
click_compile.pl, facets.pl,
search_statistics.pl, search_strings.pl, TAMU_Primo.pm
click_extract.pl extracts data from the CLICK_EVENTS
view and stores it in output files, which are mined by
click_compile.pl & facets.pl to create useful output.
search_statistics.pl & search_strings.pl extract data
from their corresponding views to an output file
Generalities (Continued)
Programs are designed to be run on a monthly
basis, to be put into a cronjob and cumulate the
previous month’s data. But they can also be run
from the command line with parameters that let
you select other months earlier in the calendar.
The programs that create output files also have
a step to sftp the output to a different server.
But you have to do the sftp setup between
servers yourself.
A Few Specifics
Facets.pl creates 2 sets of output files – one set
which cumulates all facet requests and a second one
that provides detail about certain facet types
If it’s a domain, language, library, resource type, or
top-level facet, it cumulates the individual values
under each of those types. So you would know how
many times the facet for English language was
applied or the facet for Thesis resource type.
Normalization
Contained in TAMU_Primo.pm
Defines
variations in the institution value, code versus
spelled out name, and normalizes them all to the codes
Defines a list of valid view names
Normalizes the user groups.
Defines a long list of valid scope_names
Search_statistics.pl
collects undefined scope_names and
emails the list to a designated email account so that the
list can be updated