Advanced Customisation:
Scripting EPrints
EPrints Training Course
Southampton, May 3-4th 2007
Taking Control: the EPrints API
EPrints configuration files offer many
opportunities for customisation and control
branding, workflow, controlled vocabs, authority lists,
deposit types, metadata...
EPrints API offers many more opportunities
the more perl-intensive configuration files
e.g. eprint_render.pl
and beyond..
plugins
command-line tools
Roadmap
Core API
1. manipulating your data
2. accessing data collections
3. searching your data
Scripting techniques
1. essentials – putting it all together
2. writing export plugins
3. writing screen plugins
4. writing command-line tools
5. writing CGI scripts
Part 1: Core API
About This Part of the Talk
Light on syntax
object->function(arg1, arg2)
Incomplete
Designed to
give you a feel for the EPrints data model
introduce you to the most significant (and
useful!) objects
how they relate to one another
their most common methods
act as a jumping off point for exploring
Finding Documentation
EPrints modules have embedded
documentation
Extract it using perldoc
perldoc perl_lib/EPrints/Search.pm
Core API:
Manipulating Your Data
Data Model: 3 Core Objects
EPrint
single deposit in the repository
Document
single document attached to an EPrint
User
single registered user
Data Model: Core Relationships
1 User owns (deposits) many EPrints
1 EPrint has many documents attached
to it
1 Document may contain many files, but
these are not part of the API
e.g. PDF = 1 file
e.g. HTML + images = many files
[Diagram: User 1 to * EPrint; EPrint 1 to * Document]
Data Model: DataObj
All data objects inherit from DataObj
Provides common interface to data
[Diagram: User, EPrint and Document all inherit from DataObj; User 1 to * EPrint; EPrint 1 to * Document]
Accessing Data: DataObj interface
get_id()
get_url()
EPrint – abstract page
User – user summary page
Document – document download
get_type()
EPrint – article, book, thesis...
User – user, editor, admin
Document – pdf, html, word...
Manipulating Data: DataObj Interface
get_value(fieldname)
get the value of the named data field
eprint->get_value( "title" )
set_value(fieldname, value)
set the value of the named field
doc->set_value( "format", "pdf" )
is_set(fieldname)
true if the named field has a value
user->is_set( "email" )
Manipulating Data: DataObj Interface (2)
commit()
write any changes made to the object through
to the database
e.g. after using set_value
remove()
erase the object from the database
also removes any sub-objects and files
e.g. eprint->remove
removes EPrint and associated Documents from DB
removes Document files from filesystem
Getting Hold of Existing Data Objects
new(session, id)
returns data object for an existing record
EPrints::DataObj::EPrint->new(session, 1)
EPrints::DataObj::User->new(session, 1)
EPrints::DataObj::Document->new(session, 1)
User object has extra options
user_with_email(session, email)
user_with_username(session, username)
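The DataObj calls above can be put together in a few lines of Perl. This is a minimal sketch, not a tested script: it assumes an EPrints 3 installation with perl_lib on the Perl include path, a repository with the placeholder id "myrepo", and an existing EPrint with id 1. The install path in the comment and the new title text are also assumptions.

```perl
#!/usr/bin/perl
# Sketch only: needs an EPrints 3 installation with perl_lib on @INC,
# e.g. run as: perl -I/opt/eprints3/perl_lib this_script.pl
# (the path is an assumption about where EPrints is installed)
use strict;
use EPrints;

# "myrepo" is a placeholder repository id; mode 1 = command-line script
my $session = EPrints::Session->new( 1, "myrepo" )
    or die "Could not connect to repository\n";

# Get hold of an existing record (id 1 is assumed to exist)
my $eprint = EPrints::DataObj::EPrint->new( $session, 1 );

if( defined $eprint )
{
    print "EPrint ", $eprint->get_id, " at ", $eprint->get_url, "\n";

    # Only read the field if it has a value
    if( $eprint->is_set( "title" ) )
    {
        print "Title: ", $eprint->get_value( "title" ), "\n";
    }

    # Change a value, then write the change through to the database
    $eprint->set_value( "title", "A corrected title" );
    $eprint->commit;
}

# Disconnect and clean up
$session->terminate;
```

Nothing is saved until commit is called, so a script that only reads values can leave it out.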
Creating New Data Objects
Slightly different for each data object
EPrint
create(session, dataset, data)
User
create(session, user_type)
Document
create(session, eprint)
Specific Methods
Each data object also has specific
methods for manipulating their data
EPrint Methods
get_user()
get a User object representing the user to whom the
EPrint belongs
get_all_documents()
get a list of all the Document objects associated with
the EPrint
generate_static()
generate the static abstract page for the eprint
useful when you’ve modified the eprint values!
in a multi-language archive this will generate a page in
each language
User Methods
get_eprints(dataset)
get a list of EPrints owned by the user
mail(subject, message)
send an email to the user
Document Methods
get_eprint()
get the EPrint object the document is
associated with
local_path()
get the full path of the directory where the
document is stored in the filesystem
files()
get a list of (filename, file size) pairs
Document Methods: Main File
get_main()
set_main(main_file)
get/set the main file for the document
this is the file that gets linked to
in the majority of cases, a Document will have 1 file
e.g. PDF
but there may be some cases where a
Document has many files
e.g. HTML document = .html files, images,
stylesheets
set main to top level index.html
Document Methods: Adding Files
add_file(file, filename)
upload(filehandle, filename)
both add a file to the document
add_file uses full path to file
upload uses file handle
in both cases the file will be named
filename
Document Methods: Adding Files (2)
upload_url(url)
grab file(s) from given URL
in the case of HTML, only relative links will be
followed
add_archive(file, format)
add files from a .zip or .tar.gz file
remove_file(filename)
remove the named file
Other Data Objects
Subject
a node in the subjects tree
SavedSearch
a saved search associated with a User
History
an event that took place on another data object
e.g. change to eprint metadata
Access
a Web access to an object
e.g. document download
Request
a request for a (restricted) document
Explore these using perldoc
Core API:
Accessing Data Collections
Accessing Data Collections
We’ve looked at individual data objects
but a repository holds many eprints and
documents and has many registered users
2 key ways to manipulate data objects
collectively:
1. built-in datasets
large fixed sets of data objects
2. by searching the repository
set of data objects matching specific criteria
Datasets
All data objects in the repository are part
of a collection called a dataset
3 core datasets:
eprint
all eprints
user
all registered users
document
all documents
Datasets (2)
Also 4 subsets within eprint dataset
which collect eprints in same state
archive
all eprints in live archive
inbox
all eprints which users are still working on
buffer
all eprints submitted for editorial review
deletion
all eprints retired from live archive
The DataSet Object
Gives access to all the data objects in a
particular dataset
Also
tells us which data fields apply to that dataset
recall get_value and set_value methods
a repository’s metadata is configurable so this
gives us a way to find out:
which fields are available in a particular repository
the properties of individual fields
Accessing DataSets
count(session)
get the number of items in the dataset
get_item_ids(session)
get the IDs of the objects in the dataset
map(function, args)
apply function to each object in the dataset
function is called with args:
(session, dataset, dataobj, args)
Fields in a DataSet
has_field(fieldname)
true if the dataset has a field of that name
get_field(fieldname)
get a MetaField object describing the named
field
get_fields()
get list of MetaField objects describing all fields
in the dataset
Datasets and MetaFields
A MetaField
is a single field in a dataset
tells us properties of the field
get_property(name)
set_property(name, value)
e.g. name, type, input_rows, maxlength,
multiple...
but not the field value
the value is specific to the individual data
object
e.g. eprint->get_value("title")
Core API:
Searching the Repository
Searching the Repository
The Search object allows us to search
datasets for data objects matching specific
criteria
Provides access to the results
Starting a New Search
new(options)
create a new search expression
must specify which dataset to search in
search = new Search(
session => session,
dataset => dataset,
custom_order => "title" )
many other options can be specified
explore with perldoc
Adding Search Fields
add_field(metafield, value)
add a new search field with the given value
(search text) to the search expression
add as many fields as you like to the search
criteria
Adding Search Fields: Example
Example: full text search
search->add_field(
dataset->get_field("title"),
"routing",
"IN",
"ALL" )
Adding Search Fields: Example (2)
Example: full text search which matches
word in title or abstract
search->add_field(
[ dataset->get_field("title"),
dataset->get_field("abstract")
],
"routing",
"IN",
"ALL" )
Adding Search Fields
Example: date search
search->add_field(
dataset->get_field("date"),
"2000-2004",
"EQ",
"ALL" )
Processing Search Results
Carry out a search using:
list = search->perform_search()
Returns a List object which gives access
to search results
The List Object
Any ordered collection of data objects
usually the results of a search
Processing Lists
count()
get the number of results
get_ids(offset, count)
get_records(offset, count)
get an array of data objects, or just their ids
optionally specify a range using count and
offset
map(function, args)
apply the function to each data object in the
list
Manipulating Lists
newlist = list->reorder( neworder )
newlist = list->union( list2 )
newlist = list->intersect( list2 )
newlist = list->remainder( list2 )
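The intended meaning of union, intersect and remainder can be illustrated on plain lists of ids. The sketch below is pure Perl and does not use the EPrints List object at all. The helper names and the sample ids are hypothetical, and the ordering shown (first-seen order) is a choice made for this illustration, not a statement about how EPrints orders its results.

```perl
use strict;
use warnings;

# Stand-ins for the ids held by two List objects (hypothetical data)
my @list1 = ( 1, 2, 3, 4 );
my @list2 = ( 3, 4, 5 );

# union: ids found in either list, without duplicates
sub ids_union
{
    my( $x, $y ) = @_;
    my %seen;
    return grep { !$seen{$_}++ } ( @$x, @$y );
}

# intersect: ids found in both lists
sub ids_intersect
{
    my( $x, $y ) = @_;
    my %in_y = map { ( $_ => 1 ) } @$y;
    return grep { $in_y{$_} } @$x;
}

# remainder: ids in the first list that are not in the second
sub ids_remainder
{
    my( $x, $y ) = @_;
    my %in_y = map { ( $_ => 1 ) } @$y;
    return grep { !$in_y{$_} } @$x;
}

print "union:     ", join( " ", ids_union( \@list1, \@list2 ) ), "\n";      # 1 2 3 4 5
print "intersect: ", join( " ", ids_intersect( \@list1, \@list2 ) ), "\n";  # 3 4
print "remainder: ", join( " ", ids_remainder( \@list1, \@list2 ) ), "\n";  # 1 2
```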
Part 2: Scripting Techniques
Roadmap
Core API
manipulating your data
accessing data collections
searching your data
Scripting techniques
1. essentials – putting it all together
2. writing export plugins
3. writing screen plugins
4. writing command-line tools
5. writing CGI scripts
Scripting Techniques:
Essentials
Putting it all together
Two essential objects
1. Session
connects to the repository
many useful methods
2. Repository
provides access to
datasets
session->get_repository->get_dataset("archive")
configuration settings
Explore using perldoc
Scripting for the Web
API provides lots of methods to help you
build Web pages and display (render)
data
these methods return (X)HTML
but not strings!
XML DOM objects
DocumentFragment, Element, TextNode...
Build pages from these nodes
node1->appendChild(node2)
why? it’s easier to manipulate a tree than to
manipulate a large string
XML DOM vs. Strings
XML DOM:
p = make_element("p")
text = make_text("Hello World")
p->appendChild(text)
Can manipulate tree to add extra text, elements etc.
Strings:
p = "<p>"
p += "Hello World"
p += "</p>"
Difficult to make changes to the string – would need to find the right position first
Both produce:
<p>Hello World</p>
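The advantage of a tree over a string can be shown with a tiny stand-in node class. This is a sketch in pure Perl: MiniNode is a hypothetical class written for this illustration, not part of EPrints or of any XML library, and it does no escaping of text. It only shows that a node can be extended after it has been built, and flattened to a string at the very end.

```perl
use strict;
use warnings;

# Minimal stand-in for an XML element node (hypothetical, not the EPrints API)
package MiniNode;

sub new
{
    my( $class, $name ) = @_;
    return bless { name => $name, children => [] }, $class;
}

# Append a child: either another MiniNode or a plain text string
sub appendChild
{
    my( $self, $child ) = @_;
    push @{ $self->{children} }, $child;
    return $child;
}

# Flatten the tree to a string only when output is needed
sub toString
{
    my( $self ) = @_;
    my $inner = join( "", map { ref( $_ ) ? $_->toString : $_ } @{ $self->{children} } );
    return sprintf( "<%s>%s</%s>", $self->{name}, $inner, $self->{name} );
}

package main;

my $p = MiniNode->new( "p" );
$p->appendChild( "Hello World" );

# A later change: add an extra element without searching inside a string
my $em = MiniNode->new( "em" );
$em->appendChild( "!" );
$p->appendChild( $em );

print $p->toString, "\n";   # <p>Hello World<em>!</em></p>
```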
Render Methods: Session
Session provides many useful Web page
building blocks
make_doc_fragment()
create an empty XHTML document fragment
fill it with things!
make_text(text)
create an XML TextNode
make_element(name, attrs)
create an XHTML Element
make_element("p", align => "right")
<p align="right" />
Render Methods: Session (2)
render_link(uri, target)
create an XHTML link
1. link = session->render_link("http://www.eprints.org")
2. text = session->make_text("EPrints")
3. link->appendChild(text)
<a href="http://www.eprints.org">EPrints</a>
Render Methods: Session (3)
html_phrase(phraseid, inserts)
render an XHTML phrase in the current language
looks up phraseid from the phrases files
inserts can be used to add extra information to the
phrase
there must be a corresponding <epc:pin> in the
phrase
<epp:phrase>Number of results:
<epc:pin name="count"/></epp:phrase>
Render Methods: Session (4)
Many methods for building input forms,
including:
render_form(method, dest)
render_option_list(params)
render_hidden_field(name, value)
render_upload_field(name)
render_action_buttons(buttons)
...
Rendering Methods: Data Objects
render_citation(style)
render_citation_link(style)
create an XHTML citation for the object
if style is set then use the named citation style
render_value(fieldname)
get an XHTML fragment containing the
rendered version of the value of the named
field
in the current language
Rendering Methods: MetaFields
render_name(session)
render_help(session)
get an XHTML fragment containing the
name/description of the field in the current
language
Rendering Methods: Searches
render_description()
get some XHTML describing the search
parameters
render_search_form(help)
render an input form for the search
if help is true then this also renders the help for
each search field in current language
Getting User Input (CGI parameters)
Session object also provides useful
methods for getting user input
e.g. from an input form
have_parameters
true if parameters (POST or GET) are
available
param(name)
get the value of a named parameter
Scripting Techniques:
Writing Export Plugins
Plugin Framework
EPrints provides a framework for plugins
registration of plugin capabilities
standard interface which plugins need to implement
Several types of plugin interface provided
import and export
get data in and out of the repository
interface screens
add new tools and reports to UI
input components
add new ways for users to enter data
Plugin Framework (2)
Not just a plugin framework for 3rd party extensions!
Used extensively by EPrints itself
majority of (dynamic) Web pages you see are screen
plugins
search, deposit workflow, editorial review, item control
page, user profile, saved searches, administration tools...
all import/export options implemented as plugins
all input components in deposit workflow are plugins
subject browser input, file upload...
Plugin Framework (3)
EPrints is really a generic plugin framework
with a set of plugins that implement the functions of a
repository
Gives plugin developers many examples to
work from
find a plugin that does something similar to what you
want to achieve and explore how it works
Plugins
Plugin Framework
Backend (data model)
Writing Export Plugins
Typically a standalone Perl module in
perl_lib/EPrints/Plugin/Export/
Writing export plugins
1. register plugin
2. define how to convert data objects to an
output/interchange format
Export Plugin: Registration
Register
name
the name of the plugin
visible
who can use it
accept
what the plugin can convert
lists of data objects or single data objects (or both)
type of record (eprint, user...)
suffix and mimetype
file extension and MIME type of format plugin
converts to
Registration Example: BibTeX
$self->{name} = "BibTeX";
$self->{accept} = [ 'list/eprint',
'dataobj/eprint' ];
$self->{visible} = "all";
$self->{suffix} = ".bib";
$self->{mimetype} = "text/plain";
Converts lists or single EPrint objects
Available to all users
Produces plain text file with .bib extension
Registration Example: FOAF
$self->{name} = "FOAF Export";
$self->{accept} = [ 'dataobj/user' ];
$self->{visible} = "all";
$self->{suffix} = ".rdf";
$self->{mimetype} = "text/xml";
Converts single User objects
Available to all users
Produces XML file with .rdf extension
Registration Example: XML
$self->{name} = "EP3 XML";
$self->{accept} = [ 'list/*', 'dataobj/*' ];
$self->{visible} = "all";
$self->{suffix} = ".xml";
$self->{mimetype} = "text/xml";
Converts any data object
Available to all users
Produces XML file with .xml extension
Export Plugin: Conversion
For a straight conversion plugin, this usually
includes:
1. mapping data objects to output/interchange
format
2. serialising the output/interchange format
e.g. EndNote conversion section:
$data->{K} = $dataobj->get_value( "keywords" );
$data->{T} = $dataobj->get_value( "title" );
$data->{U} = $dataobj->get_url;
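The two conversion steps (map, then serialise) can be sketched without a repository. The example below is pure Perl and uses a plain hash as a stand-in for the data object. The sample record and the serialise_endnote helper are hypothetical, and the output layout (one "%TAG value" line per field, blank line after the record) is an assumption about the tagged EndNote format rather than a copy of the real plugin.

```perl
use strict;
use warnings;

# Stand-in for the values a data object would return (hypothetical record)
my %record = (
    title    => "Routing in Ad Hoc Networks",
    keywords => "routing, networks",
    url      => "http://repository.example.org/42/",
);

# Step 1: map repository fields onto EndNote tag letters
my %data = (
    T => $record{title},
    K => $record{keywords},
    U => $record{url},
);

# Step 2: serialise as one "%TAG value" line per field, in a fixed tag order
sub serialise_endnote
{
    my( $data ) = @_;
    my $out = "";
    foreach my $tag ( qw( T K U ) )
    {
        # Skip tags that have no value
        next unless defined $data->{$tag} && length( $data->{$tag} );
        $out .= sprintf( "%%%s %s\n", $tag, $data->{$tag} );
    }
    return $out . "\n";   # a blank line closes the record
}

print serialise_endnote( \%data );
```

Keeping the mapping and the serialisation separate means the same mapping could feed a different output layout later.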
Export Plugin: Conversion (2)
But export plugins aren’t limited to
straight conversions!
Explore:
Google Maps export plugin
plot location data on map
http://files.eprints.org/224/
Timeline export plugin
plot date data on timeline
http://files.eprints.org/225/
Export Plugin: Template
1. Register
subclass EPrints::Plugin::Export
inherits all the mechanics so you don’t have to
worry about them
could subclass existing plugin e.g. XML, Feed
define name, accept, visible etc.
in constructor new() of plugin module
2. Conversion
define output_dataobj function
will be called by plugin subsystem for every data
object that needs to be converted
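The template above might look roughly like this as a Perl module. It is a sketch under assumptions: the plugin name TitleList, its display name, and the .txt suffix are invented for illustration, and the module has not been tested against a particular EPrints release. It would be saved as perl_lib/EPrints/Plugin/Export/TitleList.pm.

```perl
package EPrints::Plugin::Export::TitleList;   # hypothetical plugin name

# 1. Register: subclass the export base class
use EPrints::Plugin::Export;
our @ISA = ( "EPrints::Plugin::Export" );

use strict;

sub new
{
    my( $class, %opts ) = @_;

    my $self = $class->SUPER::new( %opts );

    # Registration values (names as listed on the registration slides)
    $self->{name}     = "Title List";
    $self->{accept}   = [ 'list/eprint', 'dataobj/eprint' ];
    $self->{visible}  = "all";
    $self->{suffix}   = ".txt";
    $self->{mimetype} = "text/plain";

    return $self;
}

# 2. Conversion: called for every data object that needs converting
sub output_dataobj
{
    my( $plugin, $dataobj ) = @_;

    # One line of output per EPrint: just its title
    return $dataobj->get_value( "title" ) . "\n";
}

1;
```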
Writing Import Plugins
Typically a standalone Perl module in
perl_lib/EPrints/Plugin/Import/
Reading input can be harder than writing
output
need to detect and handle errors in input
many existing libraries available for parsing a wide
variety of file formats
Writing import plugins
1. register
2. define how to convert input/interchange format into
data objects
reverse of export
Scripting Techniques:
Writing Screen Plugins
Plugins: Writing Screen Plugins
One or more Perl modules in
perl_lib/EPrints/Plugin/Screen/
may be bundled with phrases, config files,
stylesheets etc.
Writing screen plugins
1. register
where it appears in UI
who can use it
2. define functionality
Screen Plugin: Registration
Register
actions
the actions the plugin can carry out (if any)
appears
whereabouts in the interface the plugin and/or
actions will appear
named list
position in list
will be displayed as link, button or tab
Registration Example: Manage Deposits
$self->{appears} = [
{ place => "key_tools", position => 100, }
];
key_tools list
Registration Example: EPrint Details
$self->{appears} = [
{ place => "eprint_view_tabs", position => 100, },
];
eprint_view_tabs list
(each tab is a single screen
plugin)
Registration Example: New Item
$self->{appears} = [
{ place => "item_tools", position => 100,
action => "create", },
];
item_tools list (create
action will be invoked
when button pressed)
Screen Plugin: Define Functionality
3 types of screen plugin
1. Render only
define how to produce output display
examples: Admin::Status, EPrint::Details
2. Action only (no output display)
define how to carry out action(s)
examples: Admin::IndexerControl, EPrint::Move,
EPrint::NewVersion
3. Combined (interactive)
define how to produce output/carry out action(s)
examples: EPrint::RejectWithEmail, EPrint::Edit,
User::Edit
Screen Plugins: Displaying Messages
Action plugins produce no output display
but can still display messages to user
add_message(type, message)
register a message that will be displayed to
the user on the next screen they see
type can be
error
warning
message (informational)
Screen Plugin Template: Action Only
1. Register
subclass EPrints::Plugin::Screen
define actions supported
define where actions appear
define who can use actions
allow_ACTION function(s)
2. Define functionality
define action_ACTION function(s)
carry out the action
use add_message to show result/error
redirect to a different screen when done
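An action-only plugin following this template could be sketched as below. Treat it as an illustration rather than a working plugin: the plugin name Hello, the action name hello, the message text and the target screen id "Items" are all assumptions, and a real plugin would likely also need phrases defined for the button label. The processor calls follow the pattern used by the bundled screen plugins, which are the best reference for the exact details.

```perl
package EPrints::Plugin::Screen::Hello;   # hypothetical plugin name

# 1. Register: subclass the screen base class
use EPrints::Plugin::Screen;
our @ISA = ( "EPrints::Plugin::Screen" );

use strict;

sub new
{
    my( $class, %params ) = @_;

    my $self = $class->SUPER::new( %params );

    # The actions this plugin supports, and where they appear
    $self->{actions} = [ "hello" ];
    $self->{appears} = [
        { place => "item_tools", position => 500, action => "hello" },
    ];

    return $self;
}

# Who can use the action: this sketch allows everyone
sub allow_hello
{
    my( $self ) = @_;
    return 1;
}

# 2. Define functionality: carry out the action
sub action_hello
{
    my( $self ) = @_;

    my $session = $self->{session};

    # Shown to the user on the next screen they see
    $self->{processor}->add_message(
        "message",
        $session->make_text( "Hello from the plugin" ) );

    # Redirect to a different screen when done (screen id is an assumption)
    $self->{processor}->{screenid} = "Items";
}

1;
```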
Screen Plugin Template: Combined
render function usually displays
links/buttons which invoke the plugin’s
actions
e.g. EPrint::Remove
registers remove and cancel actions
render function displays Are you sure?
screen
OK/Cancel buttons invoke remove/cancel
actions
Scripting Techniques:
Writing Command Line Scripts
Command Line Scripts
Usually stored in bin directory
Add batch/offline processes to your
repository
e.g. duplicate detection – compare each
record to every other record
e.g. file integrity - check stored MD5 sums
against actual MD5 sums
Connecting to the Repository
Command line scripts (and CGI scripts)
must explicitly connect to the repository by
creating a new Session object
new(mode, repositoryid)
set mode to 1 for command line scripts
set mode to 0 for CGI scripts
And disconnect from the repository when
complete
terminate()
performs necessary cleanup
Using Render Functions
XHTML is good for building Web pages
but not so good for command line output!
often no string equivalent
use tree_to_utf8()
extracts a string from the result of any
rendering method
tree_to_utf8(
eprint->render_citation)
Search and Modify Template
Common pattern for command line tools
1. Connect to repository
2. Get desired dataset
3. Search dataset
4. Apply function to matching results
modify result
commit changes
5. Disconnect from repository
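The five steps can be sketched as a short command-line script. This assumes an EPrints 3 installation with perl_lib on the include path and a repository with the placeholder id "myrepo". The search term, the "note" field and the note text are assumptions chosen for illustration; substitute a field that exists in your own configuration.

```perl
#!/usr/bin/perl
# Sketch only: needs an EPrints 3 installation with perl_lib on @INC
use strict;
use EPrints;

# 1. Connect to repository ("myrepo" is a placeholder id; mode 1 = command line)
my $session = EPrints::Session->new( 1, "myrepo" )
    or die "Could not connect to repository\n";

# 2. Get desired dataset
my $dataset = $session->get_repository->get_dataset( "archive" );

# 3. Search dataset
my $search = EPrints::Search->new(
    session      => $session,
    dataset      => $dataset,
    custom_order => "title" );
$search->add_field( $dataset->get_field( "title" ), "routing", "IN", "ALL" );
my $list = $search->perform_search;
print $list->count, " matching records\n";

# 4. Apply function to matching results
$list->map( sub {
    my( $session, $dataset, $eprint ) = @_;

    # Render methods return XML nodes, so flatten to text for the terminal
    print EPrints::Utils::tree_to_utf8( $eprint->render_citation ), "\n";

    # Modify result, then commit changes ("note" is an assumed field name)
    $eprint->set_value( "note", "Checked by batch script" );
    $eprint->commit;
} );

# 5. Disconnect from repository
$session->terminate;
```

It is worth running a script like this against a test repository first, since commit writes straight to the database.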
Example: lift_embargos
Removes access restrictions on
documents with expired embargos
1. Connect to repository
2. Get document dataset
3. Search dataset
embargo date field earlier than today’s date
4. Apply function to matching results
remove access restriction
clear embargo date
commit changes
5. Disconnect from repository
Scripting Techniques:
Writing CGI Scripts
CGI Scripts
Usually stored in cgi directory
Largely superseded by screen plugins
but can still be used to add e.g. custom
reports to your repository
Similar template to command-line scripts
but build Web page output using API
render_ methods
Building Pages
In Screen plugins, mechanics of sending
Web pages to the user’s browser are
handled by the plugin subsystem
need to do this yourself with CGI scripts
methods provided by the Session object
build_page(title, body)
wraps your XHTML document in the archive
template
send_page()
flatten page and send it to the user
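A CGI script that builds and sends a page might look like the sketch below. It assumes it is run by the web server inside an EPrints 3 installation, so the Session can work out which repository the request belongs to. The parameter name "name" and the page text are hypothetical.

```perl
# Sketch only: intended to run as a CGI script inside an EPrints 3 installation
use strict;
use EPrints;

# Mode 0 = CGI script; give up quietly if no session is available
my $session = EPrints::Session->new( 0 ) or exit( 0 );

# Read user input ("name" is a hypothetical parameter)
my $name;
if( $session->have_parameters )
{
    $name = $session->param( "name" );
}
$name = "World" unless defined $name;

# Build the page body from XML nodes rather than strings
my $page = $session->make_doc_fragment;
my $p = $session->make_element( "p" );
$p->appendChild( $session->make_text( "Hello " . $name ) );
$page->appendChild( $p );

# Wrap the body in the archive template, then send it to the browser
$session->build_page( $session->make_text( "Hello" ), $page );
$session->send_page;

$session->terminate;
```

Because the user input goes in through make_text, it ends up as a text node in the tree instead of being pasted into markup by hand.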
Summary
Use the core API to manipulate data in the repository
individual data objects
EPrint, Document, User
sets of data objects
DataSet, List, Search
Wrap this in a plugin or script
Session, Repository
Web output using render_ methods
templates
Next: hands-on exercises designed to get you
started with these techniques