Transcript Document
Do It Yourself Primo Statistics The Art of the (Relatively) Painless Extraction ANNE L. HIGHSMITH DIRECTOR, CONSORTIA SYSTEMS TEXAS A&M UNIVERSITY [email protected] HTTP://LIBRARY.TAMU.EDU/DIRECTORY/HISMITH Our Environment Our Primo Environment Texas A&M University is a hosted, Direct customer, in production since June 2012. As a hosted customer, we have a staging system as well as production. All program development for these extracts has been done on the production system. We are currently on release 4.4.1 Our Reporting Environment Report server with an Oracle database Oracle is separately licensed, so we can do development on it Contains SFX/MetaLib extracts and statistics and a full copy of the Voyager database, rebuilt nightly from backup Viewing the Views How to see what’s available Log in as primo user Execute: s+ RPT00 Execute: SELECT VIEW_NAME FROM ALL_VIEWS WHERE OWNER LIKE ‘%RPT00’ CLICK_EVENTS SEARCH_STATISTICS SEARCH_STRINGS To see view definition, execute: SELECT TEXT FROM ALL_VIEWS WHERE VIEW_NAME = ‘CLICK_EVENTS’ SELECT ID, SUMMARY_TIMESTAMP EVENT_DATE, CLICK_TYPE EVENT_TYPE, CASE WHEN CLICK_VALUE='N/A' THEN '' ELSE CLICK_VALUE END CLICK_VALUE, CLICK_COUNT, SOURCE_VIEW, SOURCE_INSTITUTION, SOURCE_ON_CAMPUS, SOURCE_USER_GROUP from P41_PRM00.S_CLICK_SUMMARIES WHERE CLICK_TYPE NOT IN ('File System', 'DB Listener', 'Load', 'Indexes', 'Table Space', 'Search Problem', 'IO Wait', 'Memory') View Definitions All stats views seem to be based on S_SEARCH_SUMMARIES & S_CLICK_SUMMARIES tables Notice that CLICK_EVENTS excludes some system-type stats SEARCH_STATISTICS is a subset of S_SEARCH_SUMMARIES, where SUMMARY_TYPE='SEARCH_COUNT‘ SEARCH_STRINGS is a subset of S_SEARCH_SUMMARIES, where SUMMARY_TYPE = 'TOP_SEARCHES_SUMMARY' Data Anomalies SQL vs. BIRT Reports Replicate BIRT report for Click Events Event type Display details tab DS GetIt!Link2 BIRT SQL 5 49,629 444,539 1 89,791 20,682 SQL Selection Criteria Issues Some tables contain “junk” Out of 10M rows in the CLICK_EVENTS view, 36% had no institution name Myriad variations in INSTITUTION_NAME Basic Selection Criteria SELECT event_type, click_value, click_count, institution, \"VIEW\" AS view_name, on_campus, user_group FROM p41_rpt00.click_events WHERE to_char(event_date,'YYYYMM') = '$previous_month' AND institution is not null AND lower(institution) not like 'primo%' Scope Names Hoped that SCOPE_NAME would be equivalent to the Search Scope Name as it appears on the Search Scope List in the Primo Back Office. Current default SCOPE_NAME appears as: scope:("MSL"),scope:(libguides),scope:(archon),scope:(AMDB_ VOYAGER),scope:(TAMU-SFX ),scope:(EVANS),scope:(tamu_dspace_qdc),primo_central_mu ltiple_fe Collected all known SCOPE_NAME values in a Perl module, TAMU_Primo.pm Scope Types SEARCH_STATISTICS and SEARCH_STRINGS views contain an element called SCOPE_TYPE SCOPE_TYPE in SEARCH_STRINGS should be limited to LOCAL/REMOTE SCOPE_TYPE IN SEARCH_STATISTICS should be limited to LOCAL/REMOTE/DS Scope Types (Continued) SEARCH_STATISTICS – 16% of SCOPE_TYPE values are something other than LOCAL/REMOTE SEARCH_STRINGS – 12% of SCOPE_TYPE values are something other than LOCAL/REMOTE/DS If the retrieved value didn’t match the list of defined values, I set it to null. Data I Can’t Make Sense of SEARCH_STRINGS has only 149,127 rows in the view Are these unique strings? If yes, why does the same string appear in different rows? What do the numbers, such as AVERAGE_RESULTS and SEARCH_COUNT, really mean? Example “Fluid mechanics” appears as a search string in the default scope 5 times in the period 1/18/20143/5/2014. AVERAGE_RESULTS by date 18-Jan-14 210677 31-Mar-14 150528 27-Feb-14 58544 5-Mar-14 58576 5-Mar-14 74119 Perl Extract Programs Generalities The extract and processing programs for the TAMU report server are written in Perl; the front end is written in PHP The Primo stats extract programs I have written live on the production Primo server; they sftp output to the report server The perl programs use a local symlink from /exlibris/product/perl-5.8.9/bin/perl to /exlibris/primo/scripts/perl Generalities (Continued) The Primo group consists of 5 Perl programs and 1 module click_extract.pl, click_compile.pl, facets.pl, search_statistics.pl, search_strings.pl, TAMU_Primo.pm click_extract.pl extracts data from the CLICK_EVENTS view and stores it in output files, which are mined by click_compile.pl & facets.pl to create useful output. search_statistics.pl & search_strings.pl extract data from their corresponding views to an output file Generalities (Continued) Programs are designed to be run on a monthly basis, to be put into a cronjob and cumulate the previous month’s data. But they can also be run from the command line with parameters that let you select other months earlier in the calendar. The programs that create output files also have a step to sftp the output to a different server. But you have to do the sftp setup between servers yourself. A Few Specifics Facets.pl creates 2 sets of output files – one set which cumulates all facet requests and a second one that provides detail about certain facet types If it’s a domain, language, library, resource type, or top-level facet, it cumulates the individual values under each of those types. So you would know how many times the facet for English language was applied or the facet for Thesis resource type. Normalization Contained in TAMU_Primo.pm Defines variations in the institution value, code versus spelled out name, and normalizes them all to the codes Defines a list of valid view names Normalizes the user groups. Defines a long list of valid scope_names Search_statistics.pl collects undefined scope_names and emails the list to a designated email account so that the list can be updated