Scripting EPrints

Advanced Customisation:
Scripting EPrints
EPrints Training Course
Southampton, May 3-4th 2007
Taking Control: the EPrints API
 EPrints configuration files offer many
opportunities for customisation and control
 branding, workflow, controlled vocabs, authority lists,
deposit types, metadata...
 EPrints API offers many more opportunities
 the more Perl-intensive configuration files
 e.g. eprint_render.pl
 and beyond..
 plugins
 command-line tools
Roadmap
 Core API
1. manipulating your data
2. accessing data collections
3. searching your data
 Scripting techniques
1. essentials – putting it all together
2. writing export plugins
3. writing screen plugins
4. writing command-line tools
5. writing CGI scripts
Part 1: Core API
About This Part of the Talk
 Light on syntax
 object->function(arg1, arg2)
 Incomplete
 Designed to
 give you a feel for the EPrints data model
 introduce you to the most significant (and
useful!) objects
 how they relate to one another
 their most common methods
 act as a jumping off point for exploring
Finding Documentation
 EPrints modules have embedded
documentation
 Extract it using perldoc
 perldoc perl_lib/EPrints/Search.pm
Core API:
Manipulating Your Data
Data Model: 3 Core Objects
 EPrint
 single deposit in the repository
 Document
 single document attached to an EPrint
 User
 single registered user
[Diagram: the three core objects: User, EPrint, Document]
Data Model: Core Relationships
 1 User owns (deposits) many EPrints
 1 EPrint has many documents attached
to it
 1 Document may contain many files, but
these are not part of the API
 e.g. PDF = 1 file
 e.g. HTML + images = many files
[Diagram: User 1 to * EPrint; EPrint 1 to * Document]
Data Model: DataObj
 All data objects inherit from DataObj
 Provides common interface to data
[Diagram: User, EPrint and Document all inherit from DataObj; User 1 to * EPrint; EPrint 1 to * Document]
Accessing Data: DataObj interface
 get_id()
 get_url()
 EPrint – abstract page
 User – user summary page
 Document – document download
 get_type()
 EPrint – article, book, thesis...
 User – user, editor, admin
 Document – pdf, html, word...
Manipulating Data: DataObj Interface
 get_value(fieldname)
 get the value of the named data field
 eprint->get_value( "title" )
 set_value(fieldname, value)
 set the value of the named field
 doc->set_value( "format", "pdf" )
 is_set(fieldname)
 true if the named field has a value
 user->is_set( "email" )
Manipulating Data: DataObj Interface (2)
 commit()
 write any changes made to the object through
to the database
 e.g. after using set_value
 remove()
 erase the object from the database
 also removes any sub-objects and files
 e.g. eprint->remove
 removes EPrint and associated Documents from DB
 removes Document files from filesystem
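The get/set/commit cycle above can be sketched as a small helper. This is a sketch only: set_default_value is a hypothetical name, and it relies on nothing beyond the is_set, set_value and commit methods listed on these slides, so it should work with any data object that offers them.

```perl
use strict;
use warnings;

# Hypothetical helper: give a field a default value if it has none.
# Returns 1 if the object was changed and committed, 0 otherwise.
sub set_default_value
{
    my( $dataobj, $fieldname, $default ) = @_;

    # Leave existing values alone.
    return 0 if $dataobj->is_set( $fieldname );

    $dataobj->set_value( $fieldname, $default );

    # Nothing reaches the database until commit() is called.
    $dataobj->commit;

    return 1;
}
```

On a real repository this might be called as set_default_value( $eprint, "title", "Untitled" ), where $eprint is an existing EPrint object.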
Getting Hold of Existing Data Objects
 new(session, id)
 returns data object for an existing record
 EPrints::DataObj::EPrint->new(session, 1)
 EPrints::DataObj::User->new(session, 1)
 EPrints::DataObj::Document->new(session, 1)
 User object has extra options
 user_with_email(session, email)
 user_with_username(session, username)
Creating New Data Objects
 Slightly different for each data object
 EPrint
 create(session, dataset, data)
 User
 create(session, user_type)
 Document
 create(session, eprint)
Specific Methods
 Each data object also has specific
methods for manipulating their data
EPrint Methods
 get_user()
 get a User object representing the user to whom the
EPrint belongs
 get_all_documents()
 get a list of all the Document objects associated with
the EPrint
 generate_static()
 generate the static abstract page for the eprint
 useful when you’ve modified the eprint values!
 in a multi-language archive this will generate a page in
each language
User Methods
 get_eprints(dataset)
 get a list of EPrints owned by the user
 mail(subject, message)
 send an email to the user
Document Methods
 get_eprint()
 get the EPrint object the document is
associated with
 local_path()
 get the full path of the directory where the
document is stored in the filesystem
 files()
 get a list of (filename, file size) pairs
Document Methods: Main File
 get_main()
 set_main(main_file)
 get/set the main file for the document
 this is the file that gets linked to
 in majority of cases, Document will have 1 file
 e.g. PDF
 but there may be some cases where a
Document has many files
 e.g. HTML document = .html files, images,
stylesheets
 set main to top level index.html
Document Methods: Adding Files
 add_file(file, filename)
 upload(filehandle, filename)
 both add a file to the document
 add_file uses full path to file
 upload uses file handle
 in both cases the file will be stored in the document
as filename
Document Methods: Adding Files (2)
 upload_url(url)
 grab file(s) from given URL
 in the case of HTML, only relative links will be
followed
 add_archive(file, format)
 add files from a .zip or .tar.gz file
 remove_file(filename)
 remove the named file
Other Data Objects
 Subject
 a node in the subjects tree
 SavedSearch
 a saved search associated with a User
 History
 an event that took place on another data object
 e.g. change to eprint metadata
 Access
 a Web access to an object
 e.g. document download
 Request
 a request for a (restricted) document
 Explore these using perldoc
Core API:
Accessing Data Collections
Accessing Data Collections
 We’ve looked at individual data objects
 but a repository holds many eprints and
documents and has many registered users
 2 key ways to manipulate data objects
collectively:
1. built-in datasets
 large fixed sets of data objects
2. by searching the repository
 set of data objects matching specific criteria
Datasets
 All data objects in the repository are part
of a collection called a dataset
 3 core datasets:
 eprint
 all eprints
 user
 all registered users
 document
 all documents
Datasets (2)
 Also 4 subsets within eprint dataset
which collect eprints in same state
 archive
 all eprints in live archive
 inbox
 all eprints which users are still working on
 buffer
 all eprints submitted for editorial review
 deletion
 all eprints retired from live archive
The DataSet Object
 Gives access to all the data objects in a
particular dataset
 Also
 tells us which data fields apply to that dataset
 recall get_value and set_value methods
 a repository’s metadata is configurable so this
gives us a way to find out:
 which fields are available in a particular repository
 the properties of individual fields
Accessing DataSets
 count(session)
 get the number of items in the dataset
 get_item_ids(session)
 get the IDs of the objects in the dataset
 map(function, args)
 apply function to each object in the dataset
 function is called with args:
 (session, dataset, dataobj, args)
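A callback for map might look like the sketch below. tally_by_type is a hypothetical name; it uses only get_type from the DataObj interface, and treats args as a hash reference that collects the counts. The commented call at the end is an assumed shape: check perldoc perl_lib/EPrints/DataSet.pm for whether map also takes the session as its first argument.

```perl
use strict;
use warnings;

# Callback in the shape described above: (session, dataset, dataobj, args).
# Here args is a hash reference used to count objects by get_type().
sub tally_by_type
{
    my( $session, $dataset, $dataobj, $tally ) = @_;

    my $type = $dataobj->get_type;
    $type = "(none)" unless defined $type;
    $tally->{$type}++;

    return;
}

# Assumed usage on a real installation:
#   my %tally;
#   $dataset->map( $session, \&tally_by_type, \%tally );
```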
Fields in a DataSet
 has_field(fieldname)
 true if the dataset has a field of that name
 get_field(fieldname)
 get a MetaField object describing the named
field
 get_fields()
 get list of MetaField objects describing all fields
in the dataset
Datasets and MetaFields
 A MetaField
 is a single field in a dataset
 tells us properties of the field
 get_property(name)
 set_property(name, value)
 e.g. name, type, input_rows, maxlength,
multiple...
 but not the field value
 the value is specific to the individual data
object
 e.g. eprint->get_value("title")
Core API:
Searching the Repository
Searching the Repository
 The Search object allows us to search
datasets for data objects matching specific
criteria
 Provides access to the results
Starting a New Search
 new(options)
 create a new search expression
 must specify which dataset to search in
 search = new Search(
session => session,
dataset => dataset,
custom_order => "title" )
 many other options can be specified
 explore with perldoc
Adding Search Fields
 add_field(metafield, value)
 add a new search field with the given value
(search text) to the search expression
 add as many fields as you like to the search
criteria
Adding Search Fields: Example
 Example: full text search
 search->add_field(
dataset->get_field("title"),
"routing",
"IN",
"ALL" )
Adding Search Fields: Example (2)
 Example: full text search which matches
word in title or abstract
 search->add_field(
[ dataset->get_field("title"),
dataset->get_field("abstract")
],
"routing",
"IN",
"ALL" )
Adding Search Fields
 Example: date search
 search->add_field(
dataset->get_field("date"),
"2000-2004",
"EQ",
"ALL" )
Processing Search Results
 Carry out a search using:
 list = search->perform_search()
 Returns a List object which gives access
to search results
The List Object
 Any ordered collection of data objects
 usually the results of a search
Processing Lists
 count()
 get the number of results
 get_ids(offset, count)
 get_records(offset, count)
 get an array of data objects, or just their ids
 optionally specify a range using count and
offset
 map(function, args)
 apply the function to each data object in the
list
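For large result sets it can help to fetch records a page at a time rather than all at once. The sketch below assumes only the count and get_records(offset, count) methods named above, and that get_records returns a plain list of data objects; process_in_pages is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a List in pages of $page_size records,
# calling $fn once for each data object. Returns the number of pages read.
sub process_in_pages
{
    my( $list, $page_size, $fn ) = @_;

    die "page size must be at least 1\n" if $page_size < 1;

    my $total = $list->count;
    my $pages = 0;

    for( my $offset = 0; $offset < $total; $offset += $page_size )
    {
        # Fetch at most $page_size records starting at $offset.
        my @records = $list->get_records( $offset, $page_size );
        $fn->( $_ ) for @records;
        $pages++;
    }

    return $pages;
}
```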
Manipulating Lists
 newlist = list->reorder( neworder )
 newlist = list->union( list2 )
 newlist = list->intersect( list2 )
 newlist = list->remainder( list2 )
Part 2: Scripting Techniques
Roadmap
 Core API
 manipulating your data
 accessing data collections
 searching your data
 Scripting techniques
1. essentials – putting it all together
2. writing export plugins
3. writing screen plugins
4. writing command-line tools
5. writing CGI scripts
Scripting Techniques:
Essentials
Putting it all together
 Two essential objects
1. Session
 connects to the repository
 many useful methods
2. Repository
 provides access to
 datasets
 session->get_repository->get_dataset("archive")
 configuration settings
 Explore using perldoc
Scripting for the Web
 API provides lots of methods to help you
build Web pages and display (render)
data
 these methods return (X)HTML
 but not strings!
 XML DOM objects
 DocumentFragment, Element, TextNode...
 Build pages from these nodes
 node1->appendChild(node2)
 why? it’s easier to manipulate a tree than to
manipulate a large string
XML DOM vs. Strings
 XML DOM
 p = make_element("p")
 text = make_text("Hello World")
 p->appendChild(text)
 gives a tree: p element containing the text node "Hello World"
 Can manipulate tree to add extra text, elements etc.
 Strings
 p = "<p>"
 p += "Hello World"
 p += "</p>"
 gives the string <p>Hello World</p>
 Difficult to make changes to the string – would need to
find the right position first
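The point about trees can be tried in plain Perl without EPrints. The sketch below is a stand-in, not the EPrints or XML DOM API: an element is an array reference [ name, attributes, children... ], a text node is a plain string, and to_xhtml and escape_xhtml are hypothetical helpers. Changing the page means changing the tree; the string is only produced at the end.

```perl
use strict;
use warnings;

# Escape the characters that are special in XHTML text and attributes.
sub escape_xhtml
{
    my( $text ) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Flatten a node to a string. An element is [ name, \%attrs, @children ];
# a text node is a plain string.
sub to_xhtml
{
    my( $node ) = @_;
    return escape_xhtml( $node ) unless ref $node;

    my( $name, $attrs, @children ) = @{$node};
    my $open = $name;
    $open .= qq{ $_="} . escape_xhtml( $attrs->{$_} ) . qq{"} for sort keys %{$attrs};

    return "<$open />" unless @children;
    return "<$open>" . join( "", map { to_xhtml( $_ ) } @children ) . "</$name>";
}

my $p = [ "p", {}, "Hello World" ];
push @{$p}, " ", [ "em", {}, "again" ];   # extend the tree with more nodes
$p->[1]{align} = "right";                 # add an attribute afterwards
print to_xhtml( $p ), "\n";
# prints: <p align="right">Hello World <em>again</em></p>
```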
Render Methods: Session
 Session provides many useful Web page
building blocks
 make_doc_fragment()
 create an empty XHTML document
 fill it with things!
 make_text(text)
 create an XML TextNode
 make_element(name, attrs)
 create an XHTML Element
make_element("p", align => "right")
<p align="right" />
Render Methods: Session (2)
 render_link(uri, target)
 create an XHTML link
1. link = session->render_link("http://www.eprints.org")
2. text = session->make_text("EPrints")
3. link->appendChild(text)
<a href="http://www.eprints.org">EPrints</a>
Render Methods: Session (3)
 html_phrase(phraseid, inserts)
 render an XHTML phrase in the current language
 looks up phraseid from the phrases files
 inserts can be used to add extra information to the
phrase
 must be a corresponding <epc:pin> in the
phrase
 <epp:phrase>Number of results:
<epc:pin name="count"/></epp:phrase>
Render Methods: Session (4)
 Many methods for building input forms,
including:
 render_form(method, dest)
 render_option_list(params)
 render_hidden_field(name, value)
 render_upload_field(name)
 render_action_buttons(buttons)
 ...
Rendering Methods: Data Objects
 render_citation(style)
 render_citation_link(style)
 create an XHTML citation for the object
 if style is set then use the named citation style
 render_value(fieldname)
 get an XHTML fragment containing the
rendered version of the value of the named
field
 in the current language
Rendering Methods: MetaFields
 render_name(session)
 render_help(session)
 get an XHTML fragment containing the
name/description of the field in the current
language
Rendering Methods: Searches
 render_description()
 get some XHTML describing the search
parameters
 render_search_form(help)
 render an input form for the search
 if help is true then this also renders the help for
each search field in current language
Getting User Input (CGI parameters)
 Session object also provides useful
methods for getting user input
 e.g. from an input form
 have_parameters
 true if parameters (POST or GET) are
available
 param(name)
 get the value of a named parameter
Scripting Techniques:
Writing Export Plugins
Plugin Framework
 EPrints provides a framework for plugins
 registration of plugin capabilities
 standard interface which plugins need to implement
 Several types of plugin interface provided
 import and export
 get data in and out of the repository
 interface screens
 add new tools and reports to UI
 input components
 add new ways for users to enter data
Plugin Framework (2)
 Not just a plugin framework for 3rd party extensions!
 Used extensively by EPrints itself
 majority of (dynamic) Web pages you see are screen
plugins
 search, deposit workflow, editorial review, item control
page, user profile, saved searches, administration tools...
 all import/export options implemented as plugins
 all input components in deposit workflow are plugins
 subject browser input, file upload...
Plugin Framework (3)
 EPrints is really a generic plugin framework
 with a set of plugins that implement the functions of a
repository
 Gives plugin developers many examples to
work from
 find a plugin that does something similar to what you
want to achieve and explore how it works
[Diagram: layers, top to bottom: Plugins; Plugin Framework; Backend (data model)]
Writing Export Plugins
 Typically a standalone Perl module in
 perl_lib/EPrints/Plugin/Export/
 Writing export plugins
1. register plugin
2. define how to convert data objects to an
output/interchange format
Export Plugin: Registration
 Register
 name
 the name of the plugin
 visible
 who can use it
 accept
 what the plugin can convert
 lists of data objects or single data objects (or both)
 type of record (eprint, user...)
 suffix and mimetype
 file extension and MIME type of format plugin
converts to
Registration Example: BibTeX
$self->{name} = "BibTeX";
$self->{accept} = [ 'list/eprint',
'dataobj/eprint' ];
$self->{visible} = "all";
$self->{suffix} = ".bib";
$self->{mimetype} = "text/plain";
 Converts lists or single EPrint objects
 Available to all users
 Produces plain text file with .bib extension
Registration Example: FOAF
$self->{name} = "FOAF Export";
$self->{accept} = [ 'dataobj/user' ];
$self->{visible} = "all";
$self->{suffix} = ".rdf";
$self->{mimetype} = "text/xml";
 Converts single User objects
 Available to all users
 Produces XML file with .rdf extension
Registration Example: XML
$self->{name} = "EP3 XML";
$self->{accept} = [ 'list/*', 'dataobj/*' ];
$self->{visible} = "all";
$self->{suffix} = ".xml";
$self->{mimetype} = "text/xml";
 Converts any data object
 Available to all users
 Produces XML file with .xml extension
Export Plugin: Conversion
 For a straight conversion plugin, this usually
includes:
1. mapping data objects to output/interchange
format
2. serialising the output/interchange format
 e.g. EndNote conversion section:
$data->{K} = $dataobj->get_value( "keywords" );
$data->{T} = $dataobj->get_value( "title" );
$data->{U} = $dataobj->get_url;
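The two steps can be kept as two small functions. The mapping below repeats the three lines from the slide; the serialisation is an assumption, writing each set field as a "%TAG value" line in the EndNote tagged style, with a blank line ending the record. map_to_endnote and serialise_endnote are hypothetical names, and the real plugin may handle more fields and multiple values.

```perl
use strict;
use warnings;

# Step 1: map a data object onto EndNote-style tags.
sub map_to_endnote
{
    my( $dataobj ) = @_;

    my $data = {};
    $data->{K} = $dataobj->get_value( "keywords" );
    $data->{T} = $dataobj->get_value( "title" );
    $data->{U} = $dataobj->get_url;

    return $data;
}

# Step 2: serialise the tags, one "%TAG value" line per field that is set.
# A blank line marks the end of the record.
sub serialise_endnote
{
    my( $data ) = @_;

    my $out = "";
    foreach my $tag ( sort keys %{$data} )
    {
        my $value = $data->{$tag};
        next unless defined $value && $value ne "";
        $out .= "%$tag $value\n";
    }

    return $out . "\n";
}
```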
Export Plugin: Conversion (2)
 But export plugins aren’t limited to
straight conversions!
 Explore:
 Google Maps export plugin
 plot location data on map
 http://files.eprints.org/224/
 Timeline export plugin
 plot date data on timeline
 http://files.eprints.org/225/
Export Plugin: Template
1. Register
 subclass EPrints::Plugin::Export
 inherits all the mechanics so you don’t have to
worry about them
 could subclass existing plugin e.g. XML, Feed
 define name, accept, visible etc.
 in constructor new() of plugin module
2. Conversion
 define output_dataobj function
 will be called by plugin subsystem for every data
object that needs to be converted
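Put together, the template might look like the sketch below. The plugin name TitleList and its one-line-per-record output are hypothetical; the registration keys are the ones shown in the examples above. It assumes the parent class EPrints::Plugin::Export has already been loaded by EPrints, so it will not run outside an EPrints installation without a stand-in for that class.

```perl
package EPrints::Plugin::Export::TitleList;   # hypothetical plugin name

use strict;
use warnings;

# Assumption: on a real installation EPrints has loaded the parent class.
our @ISA = ( "EPrints::Plugin::Export" );

# 1. Register: name, accept, visible, suffix and mimetype.
sub new
{
    my( $class, %opts ) = @_;

    my $self = $class->SUPER::new( %opts );

    $self->{name}     = "Title List";
    $self->{accept}   = [ 'list/eprint', 'dataobj/eprint' ];
    $self->{visible}  = "all";
    $self->{suffix}   = ".txt";
    $self->{mimetype} = "text/plain";

    return $self;
}

# 2. Conversion: called for every data object that needs converting.
sub output_dataobj
{
    my( $plugin, $dataobj ) = @_;

    my $title = $dataobj->get_value( "title" );
    $title = "(untitled)" unless defined $title;

    return $title . " <" . $dataobj->get_url . ">\n";
}

1;
```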
Writing Import Plugins
 Typically a standalone Perl module in
 perl_lib/EPrints/Plugin/Import/
 Reading input can be harder than writing
output
 need to detect and handle errors in input
 many existing libraries available for parsing a wide
variety of file formats
 Writing import plugins
1. register
2. define how to convert input/interchange format into
data objects
 reverse of export
Scripting Techniques:
Writing Screen Plugins
Plugins: Writing Screen Plugins
 One or more Perl modules in
 perl_lib/EPrints/Plugin/Screen/
 may be bundled with phrases, config files,
stylesheets etc.
 Writing screen plugins
1. register
 where it appears in UI
 who can use it
2. define functionality
Screen Plugin: Registration
 Register
 actions
 the actions the plugin can carry out (if any)
 appears
 where in the interface the plugin and/or
actions will appear
 named list
 position in list
 will be displayed as link, button or tab
Registration Example: Manage Deposits
$self->{appears} = [
{ place => "key_tools", position => 100, }
];
key_tools list
Registration Example: EPrint Details
$self->{appears} = [
{ place => "eprint_view_tabs", position => 100, },
];
eprint_view_tabs list
(each tab is a single screen
plugin)
Registration Example: New Item
$self->{appears} = [
{ place => "item_tools", position => 100,
action => "create", },
];
item_tools list (create
action will be invoked
when button pressed)
Screen Plugin: Define Functionality
 3 types of screen plugin
1. Render only
 define how to produce output display
 examples: Admin::Status, EPrint::Details
2. Action only (no output display)
 define how to carry out action(s)
 examples: Admin::IndexerControl, EPrint::Move,
EPrint::NewVersion
3. Combined (interactive)
 define how to produce output/carry out action(s)
 examples: EPrint::RejectWithEmail, EPrint::Edit,
User::Edit
Screen Plugins: Displaying Messages
 Action plugins produce no output display
but can still display messages to user
 add_message(type, message)
 register a message that will be displayed to
the user on the next screen they see
 type can be
 error
 warning
 message (informational)
Screen Plugin Template: Action Only
1. Register
 subclass EPrints::Plugin::Screen
 define actions supported
 define where actions appear
 define who can use actions
 allow_ACTION function(s)
2. Define functionality
 define action_ACTION function(s)
 carry out the action
 use add_message to show result/error
 redirect to a different screen when done
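An action-only plugin following this template might be sketched as below. Several details are assumptions rather than anything stated on the slides: the plugin name Hello and its hello action are hypothetical, add_message is called on $self->{processor} and the redirect is done by setting $self->{processor}->{screenid}, as plugins shipped with EPrints 3 appear to do, and "Items" is assumed to be the id of the screen to return to. Check an existing action plugin such as EPrint::Move before relying on any of this.

```perl
package EPrints::Plugin::Screen::Hello;   # hypothetical plugin name

use strict;
use warnings;

# Assumption: on a real installation EPrints has loaded the parent class.
our @ISA = ( "EPrints::Plugin::Screen" );

# 1. Register: the actions supported and where they appear.
sub new
{
    my( $class, %params ) = @_;

    my $self = $class->SUPER::new( %params );

    $self->{actions} = [ "hello" ];
    $self->{appears} = [
        { place => "item_tools", position => 200, action => "hello" },
    ];

    return $self;
}

# Who can use the action: in this sketch, anyone.
sub allow_hello
{
    my( $self ) = @_;
    return 1;
}

# 2. Define functionality: carry out the action, report, redirect.
sub action_hello
{
    my( $self ) = @_;

    my $message = $self->{session}->make_text( "Hello from a screen plugin." );

    # Assumed location of add_message(type, message).
    $self->{processor}->add_message( "message", $message );

    # Assumed way to redirect to a different screen when done.
    $self->{processor}->{screenid} = "Items";

    return;
}

1;
```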
Screen Plugin Template: Combined
 render function usually displays
links/buttons which invoke the plugin’s
actions
 e.g. EPrint::Remove
 registers remove and cancel actions
 render function displays Are you sure?
screen
 OK/Cancel buttons invoke remove/cancel
actions
Scripting Techniques:
Writing Command Line Scripts
Command Line Scripts
 Usually stored in bin directory
 Add batch/offline processes to your
repository
 e.g. duplicate detection – compare each
record to every other record
 e.g. file integrity - check stored MD5 sums
against actual MD5 sums
Connecting to the Repository
 Command line scripts (and CGI scripts)
must explicitly connect to the repository by
creating a new Session object
 new(mode, repositoryid)
 set mode to 1 for command line scripts
 set mode to 0 for CGI scripts
 And disconnect from the repository when
complete
 terminate()
 performs necessary cleanup
Using Render Functions
 XHTML is good for building Web pages
 but not so good for command line output!
 often no string equivalent
 use tree_to_utf8()
 extracts a string from the result of any
rendering method
 tree_to_utf8(
eprint->render_citation)
Search and Modify Template
 Common pattern for command line tools
1. Connect to repository
2. Get desired dataset
3. Search dataset
4. Apply function to matching results
 modify result
 commit changes
5. Disconnect from repository
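Steps 2 to 4 of this pattern can be sketched as one function that takes an already connected session. search_and_modify is a hypothetical name; it uses only calls shown earlier in these slides (get_repository, get_dataset, get_field, add_field, perform_search, map, commit) and assumes the Search class is EPrints::Search, as the perldoc path suggests. Steps 1 and 5 are shown as comments because they need a real installation and repository id.

```perl
use strict;
use warnings;

# Hypothetical helper covering steps 2-4. $modify is a code reference that
# changes one data object; the helper commits each object afterwards and
# returns the number of objects processed.
sub search_and_modify
{
    my( $session, $datasetid, $fieldname, $value, $modify ) = @_;

    # 2. Get desired dataset
    my $dataset = $session->get_repository->get_dataset( $datasetid );

    # 3. Search dataset
    my $search = EPrints::Search->new(
        session => $session,
        dataset => $dataset );
    $search->add_field( $dataset->get_field( $fieldname ), $value, "EQ", "ALL" );
    my $list = $search->perform_search;

    # 4. Apply function to matching results
    my $info = { modified => 0 };
    $list->map( sub {
        my( undef, undef, $dataobj, $counts ) = @_;
        $modify->( $dataobj );    # modify result
        $dataobj->commit;         # commit changes
        $counts->{modified}++;
    }, $info );

    return $info->{modified};
}

# Steps 1 and 5 wrap the call (needs "use EPrints;" and a real repository id):
#   my $session = EPrints::Session->new( 1, $repositoryid );
#   my $n = search_and_modify( $session, ... );
#   $session->terminate;
```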
Example: lift_embargos
 Removes access restrictions on
documents with expired embargos
1. Connect to repository
2. Get document dataset
3. Search dataset
 embargo date field earlier than today’s date
4. Apply function to matching results
 remove access restriction
 clear embargo date
 commit changes
5. Disconnect from repository
Scripting Techniques:
Writing CGI Scripts
CGI Scripts
 Usually stored in cgi directory
 Largely superseded by screen plugins
but can still be used to add e.g. custom
reports to your repository
 Similar template to command-line scripts
but build Web page output using API
render_ methods
Building Pages
 In Screen plugins, mechanics of sending
Web pages to the user’s browser are
handled by the plugin subsystem
 need to do this yourself with CGI scripts
 methods provided by the Session object
 build_page(title, body)
 wraps your XHTML document in the archive
template
 send_page()
 flatten page and send it to the user
Summary
 Use the core API to manipulate data in the repository
 individual data objects
 EPrint, Document, User
 sets of data objects
 DataSet, List, Search
 Wrap this in a plugin or script
 Session, Repository
 Web output using render_ methods
 templates
Next: hands-on exercises designed to get you
started with these techniques