Chapter One - Center for the Study of Digital Libraries


Building collections
with Greenstone
How to Build a Digital Library
Ian H. Witten and David Bainbridge
Digital Library Collections

There is a distinction between
BUILDING collections
 DELIVERING information to users



Similar to ‘compile-time’ versus ‘runtime’
distinction in computer programming
Information structures should usually be
prepared in advance
Building a Collection

The Collector
A subsystem that takes you step by step through
building a simple collection
 Conceals details behind the scenes


First locate information on your computer or
the Web

Plain text, HTML, Word, PDF, email file, etc.
Plug-ins

Plug-ins are software modules that handle
Format conversion
 Metadata extraction


Plug-ins promote extensibility
Greenstone Archive Format

XML-based file format
 File format for:

Documents
 Metadata

Collection Configuration File

Defines the structure of a collection
 Governs how the collection is built
 Specifies how the collection will appear to users

Greenstone Extended Capabilities

Extending the Capabilities of Greenstone

Plug-ins
 Handle different document and metadata formats

Classifiers
 Handle different kinds of browsing structures

Format statements and Macros
 Govern the user interface content and appearance
Why Greenstone?
Benefits of Greenstone





General system for constructing and presenting
digital collections
Handles millions of documents, text, images,
audio, video
User interfaces identical in Web-based and CD-ROM versions
Installs on Windows and Linux
Access locally or remotely using web browser
Organization of Collections

Each collection can be organized differently:
Format of source documents
 Metadata
 Directory structure
 Document structure
 Searching and browsing services
 Presentation
 Auxiliary services

Variation of Source Format

Source documents can be supplied in:










Plain text
HTML
PostScript
PDF
Word
E-mail
Other file types
Images
Video
Audio
Variation of Metadata


Different types of metadata
Metadata can be supplied differently
‘fields’ in MS Word
 <meta> tags in HTML
 Information coded into filename and directories
 Spreadsheet or other data file
 Explicit metadata format like MARC

Variation of Directory Structure

Collections can vary in the directory structure in
which the information is located
Variation of Document Structure

Document structure
Flat
 Divided sequentially into pages
 Hierarchical organization


Title or other metadata available at each level
Variation of Services

Searching
Metadata
 Indexes
 Hierarchical levels


Browsing
Metadata
 Browser type

Variation of Presentation

Results can be presented to users in various
ways:
Format that target documents are shown in
 Search results page
 Metadata browsers
 Interface language

Variation of Auxiliary Services

A collection may require additional services
User logging
 Etc.

Collection Configuration File
Allows Variation

A digital library collection is made by
Gathering raw material
 Designing the collection
 Putting design information about the structure and
presentation of the collection in the Collection
Configuration File

Front Page of Collection

Statement of collection’s purpose

Statement of collection’s coverage

Explanation of how collection is organized
Searching Involves Indexes

Searching is provided by indexes built from
different parts of the documents
Entire documents
 Paragraphs
 Titles
 Sections
 Section headings
 Figure captions

Indexes

Indexes can be created automatically using
Documents
 Supporting files


Indexes can be rebuilt automatically
New document in the same format becomes
available
 Process can awake, check for new material, and
rebuild the indexes

Plug-ins for Indexing


Source documents are converted into standard XML
form for indexing using plug-ins
Standard plug-ins process






Plain text
HTML
Word
PDF
Usenet and email messages
New plug-ins can be written for other document types
Browsing Involves Lists

Browsing involves lists that can be examined by
the user
Authors
 Titles
 Dates
 Hierarchical classification structures

Classifier Modules

Modules called classifiers are used to create
browsers and build browsing structures from
metadata
Scrollable lists
 Alphabetic selectors
 Dates
 Hierarchies


Programmers can write new classifiers to create
novel browsing capabilities
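
To make the idea of a classifier concrete, here is a minimal sketch that builds an alphabetic browsing structure from Title metadata. It is not Greenstone's classifier code (Greenstone classifiers are Perl modules); the function name, the dictionary-based records, and the sample titles are assumptions made for illustration.

```python
from collections import defaultdict

def alphabetic_selector(documents, field="Title"):
    """Group document ids under the first letter of a metadata field."""
    buckets = defaultdict(list)
    for doc_id, metadata in documents.items():
        value = metadata.get(field, "").strip()
        if not value:
            continue  # documents without the field are left out of this browser
        first = value[0].upper()
        # Anything that does not start with a letter shares one bucket.
        key = first if first.isalpha() else "0-9"
        buckets[key].append((value.lower(), doc_id))
    # Order the buckets by letter and the entries in each bucket by title.
    return {key: [doc_id for _, doc_id in sorted(entries)]
            for key, entries in sorted(buckets.items())}

# Made-up sample records (hypothetical ids and titles).
docs = {
    "d1": {"Title": "Freshwater Resources in Arid Lands"},
    "d2": {"Title": "Arid Zone Forestry"},
    "d3": {"Title": "Farming Snails"},
}
browser = alphabetic_selector(docs)
```

A date or hierarchy classifier would follow the same pattern with a different grouping key.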
Search Terms

Search Terms in Greenstone:
Alphabetic characters
 Digits


Separated by white space

Punctuation acts as white space
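
The rule above can be sketched in a few lines. This is an illustration of the stated behaviour (terms are runs of letters and digits, everything else separates them), not Greenstone's own tokenizer; the function name is an assumption.

```python
import re

def search_terms(query):
    """Split a query into terms made only of letters and digits."""
    # [^\W_]+ matches runs of letters and digits but not underscore, so
    # punctuation behaves exactly like white space between terms.
    return re.findall(r"[^\W_]+", query)

terms = search_terms("Freshwater resources, arid-lands (2001)")
```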
Two Types of Queries

Query for ALL of the words


Boolean AND
Query for SOME of the words

Ranked
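
The difference between the two query types can be sketched as follows. The scoring used for the ranked query (number of distinct query terms a document contains) is an assumption chosen for simplicity; the slides do not describe Greenstone's actual ranking, and the function names and sample texts are hypothetical.

```python
import re

def tokenize(text):
    """Lower-cased terms: runs of letters and digits."""
    return re.findall(r"[^\W_]+", text.lower())

def query_all(query, documents):
    """Boolean AND: ids of documents that contain every query term."""
    wanted = set(tokenize(query))
    return sorted(doc_id for doc_id, text in documents.items()
                  if wanted <= set(tokenize(text)))

def query_some(query, documents):
    """Ranked: ids of documents with at least one query term, best first."""
    wanted = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        hits = len(wanted & set(tokenize(text)))
        if hits:
            scored.append((-hits, doc_id))  # negative so more hits sort first
    return [doc_id for _, doc_id in sorted(scored)]

# Made-up sample documents.
docs = {
    "a": "Freshwater resources in arid lands",
    "b": "Arid zone forestry",
    "c": "Snail farming",
}
```

With these documents, an "all" query for `arid lands` matches only document `a`, while a "some" query also returns `b`, ranked below `a`.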
Indexes to Search

In most collections, you can choose different
indexes to search

Examples:
Author and title indexes
 Chapter and paragraph indexes


Usually the full matching document is returned
regardless of index searched
Preferences Page


Allows advanced control over search operation:
Case-folding and stemming
 Advanced query mode where users specify Boolean
operators
 Large-query interface
 Display search history

Preferences Page

Preferences Page
Specify subcollections to be included in searches
 Specify presentation language
 Customize interface

Textual vs. standard interface
 Suppress navigation bar
 Suppress alert system

Using the Collector
The Greenstone Collector


Easiest way to build a simple collection
The Collector allows you to:
Create a new collection
 Modify or add to an existing collection
 Delete a collection

Starting the Collector


Click the Collector link from the default
Greenstone home page
Log in


When Greenstone is installed, an account called
admin is set up with a password chosen during
installation
The Collector works through a standard web
interface
Creating a New Collection



Collector’s main purpose is to build a new
collection
Structure of a collection is determined when the
collection is set up
Simplest to copy the structure of an existing
collection and then edit
Collection Building Steps
1. Collection Information
2. Source Data
3. Configuration
4. Building
5. Viewing
Collection Building Steps
☐ Collection Information
☐ Source Data
☐ Configuration
☐ Building
☐ Viewing
1. Collection Information

Give the collection a name and provide
associated information

Title

Short phrase used to identify the collection within the
digital library
Contact e-mail address
 Brief description


Sets out the principles that govern what is included in the
collection
Collection Building Steps
☑ Collection Information
☐ Source Data
☐ Configuration
☐ Building
☐ Viewing
2. Source Data

Specify the location of the sources

Clone existing collection


Specify on a pull-down menu the existing collection
Create a completely new collection
2. Source Data

In the provided boxes, indicate where Source
Documents are located

Specification of sources
file://
 http://
 ftp://

file://

File name on the Greenstone server system


That file will be included in collection
Directory name on the Greenstone server

Everything in the folder and its subfolders will be
included
http://

Web page
The web page will be downloaded
 All pages it links to (and all pages they link to) that
reside on the same site, below the URL, will also be
downloaded


URL that leads to a list of files

Everything in the folder and its subfolders will be
included in collection
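
The "same site, below the URL" rule for web pages can be sketched as a scope test on links. This is one reading of the rule (same host, and a path under the directory of the starting URL), not the Collector's actual mirroring code; the function name and the example.org URLs are assumptions.

```python
from urllib.parse import urlsplit

def in_scope(start_url, link):
    """True if link is on the same site as start_url and below its path."""
    start, target = urlsplit(start_url), urlsplit(link)
    if target.netloc.lower() != start.netloc.lower():
        return False  # a different site is never followed
    # Treat the directory part of the starting URL as the root of the download.
    base = start.path.rsplit("/", 1)[0] + "/"
    return target.path.startswith(base)
```

For a start page of `http://example.org/docs/index.html`, a link to `http://example.org/docs/ch1/page.html` is in scope, while `http://example.org/about.html` is not.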
ftp://


File to be downloaded using FTP
Directory name on the FTP server

Downloads everything in the folder and its
subfolders
Collection Building Steps
☑ Collection Information
☑ Source Data
☐ Configuration
☐ Building
☐ Viewing
3. Configuration



This step can be bypassed
Allows adjustment of configuration options
The construction and presentation of all
collections are controlled by specifications in a
special collection configuration file
Collection Building Steps
☑ Collection Information
☑ Source Data
☑ Configuration
☐ Building
☐ Viewing
4. Building


The computer does the work of the building
process
Indexes are built:





For browsing
For searching
Following specifications in the collection
configuration file
Status line shows progress
Warnings shown if files can’t be found
Collection Building Steps
☑ Collection Information
☑ Source Data
☑ Configuration
☑ Building
☐ Viewing
5. Viewing


View the collection that has just been created
E-mail can be sent to the collection’s contact
address

Must enable by editing main.cfg configuration file
Working with Existing Collections




Add more material and rebuild the collection
Edit the configuration file to modify the
collection’s structure
Delete the collection
Put the collection on CD-ROM
Adding Material to a Collection

Do not re-specify files that are already in the
collection



Files would be included twice
If the building process fails, the old version
remains unchanged
Structure of collection can be changed

Edit the configuration file

May add plug-ins or an option to a plug-in
Plug-ins & Document Formats



Plug-ins are specified in the collection
configuration file
File name determines document format
Widely used document formats:
TEXTPlug
HTMLPlug
WORDPlug
PDFPlug
PSPlug
EMAILPlug
ZIPPlug
Text Files

TEXTPlug Plug-In
*.txt
 *.text



Plain text file
Title metadata based on the first line of the file
HTML Files

HTMLPlug Plug-In
*.htm
 *.html
 *.shtml
 *.shm
 *.asp
 *.php
 *.cgi

HTML Files

HTMLPlug Plug-In
Imports HTML files
 Title metadata extracted from the HTML <title> tag
 Other HTML <meta> tag data can be extracted
 Parses and processes any links in the file
 Links to other files in the collection are trapped and
replaced by references to the document
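
As an illustration of the kind of extraction described above, the following sketch pulls Title metadata from the `<title>` tag and other metadata from `<meta>` tags. It uses Python's standard HTML parser and is not the HTMLPlug implementation; the class and function names are assumptions.

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several pieces, so append rather than assign.
        if self._in_title:
            self.metadata["Title"] = self.metadata.get("Title", "") + data

def extract_metadata(html):
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    return parser.metadata

page = ('<html><head><title>Nugget Point Lighthouse</title>'
        '<meta name="Subject" content="Lighthouse"></head>'
        '<body><p>text</p></body></html>')
found = extract_metadata(page)
```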

HTML Files

file_is_url
Optional switch within the HTML plug-in
 Causes URL metadata to be inserted into each
document, based on the file-name convention that is
adopted by the mirroring package. The collection
uses this metadata to allow readers to refer to the
original source material rather than a local copy

Microsoft Word Files

WORDPlug Plug-In



*.doc
Imports Microsoft Word documents
Greenstone uses independent programs to
convert Word files to HTML
Many variants on the Word format
 Older Word formats use a simple text string
extraction

PDF Files

PDFPlug Plug-In




*.pdf
Imports PDF Files
Adobe’s Portable Document Format
Greenstone uses independent programs to
convert PDF files to HTML
PostScript Files

PSPlug Plug-In




*.ps
Imports PostScript Files
Works best when a standard conversion
program is already installed on the computer
Uses simple text extraction algorithm if no
conversion program is present
Email Files

EMAILPlug
*.email

Imports files containing email
Each source is checked for e-mail contents
Extracts metadata:
Subject
 To
 From
 Date

Deals with common formats
 Netscape, Eudora, Unix mail readers
Compressed & Archived Files

ZIPPlug Plug-In
*.zip
 *.tar
 *.gz
 *.z
 *.tgz
 *.bz


Relies on standard utility programs being present
Building Collections
Manually
Building a Collection

Building a Collection:

The process of taking a set of documents and
metadata information and creating all the indexes
and data structures that support the searching,
browsing, and viewing operations that the collection
offers
Building a Collection

Four Phases in Building a Collection

Make
 Make a skeleton framework structure to contain the collection

Import
 Import the documents and metadata, convert to a Greenstone
standard form

Build
 Build the required indexes and data structures

Install
 Make the collection operational
Building Collections Manually
☐ Getting Started
☐ Making a framework for the collection
☐ Importing the documents
☐ Building the indexes
☐ Installing the collection
Getting Started


Locate the command prompt
Go to the directory where Greenstone was
installed


cd "C:\Program Files\gsdl"
Tell system where to find Greenstone files

setup.bat
Sets the variable GSDLHOME to the Greenstone home
directory
 To return later


cd "%GSDLHOME%"
Building Collections Manually
☑ Getting Started
☐ Making a framework for the collection
☐ Importing the documents
☐ Building the indexes
☐ Installing the collection
Make a framework for the collection


Use the Perl program mkcol.pl to ‘make a
collection’
Get description of usage and arguments
perl -S mkcol.pl
 mkcol.pl

The leading "perl -S" may be left off if the system recognizes that .pl files
are associated with Perl
Make a framework for the collection

perl -S mkcol.pl -creator emailAddress collectionName
Make a framework for the collection



Examine the file structure
cd "%GSDLHOME%\collect\collectionName"
List directory contents
dir
Seven subdirectories are created:
archives
building
etc (contains collect.cfg file)
images
import
index
perllib
Make a framework for the collection

collect.cfg File
emailAddress placed in the creator and maintainer
lines
 collectionName placed in collection-meta lines
 Plug-ins are inserted
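
A quick way to confirm that the framework was created as described is to check for the seven subdirectories and the collect.cfg file. The sketch below is a hypothetical helper, not part of Greenstone; it only tests for the names listed on these slides.

```python
from pathlib import Path

EXPECTED_SUBDIRS = ["archives", "building", "etc", "images",
                    "import", "index", "perllib"]

def missing_parts(collection_dir):
    """Return the expected subdirectories and files that are absent."""
    root = Path(collection_dir)
    missing = [name for name in EXPECTED_SUBDIRS
               if not (root / name).is_dir()]
    # The configuration file lives in the etc subdirectory.
    if not (root / "etc" / "collect.cfg").is_file():
        missing.append("etc/collect.cfg")
    return missing
```

An empty list means the skeleton looks complete; anything else names what is missing.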

Building Collections Manually
☑ Getting Started
☑ Making a framework for the collection
☐ Importing the documents
☐ Building the indexes
☐ Installing the collection
Importing the documents



The collection’s import directory should contain
the source material
Drag the directory containing the source
material into the import directory
You may drag several source directories and
hierarchies
Importing the documents

The import process:
Brings documents into the Greenstone system
 Standardizes document format
(the way that metadata is specified)
 Standardizes the file structure
(that contains the documents)

Importing the documents

To get a list of options for the import program:


perl -S import.pl
The basic import command is:

perl -S import.pl collectionName
Importing the documents

You may be in any directory when the import
command is issued


The software works by knowing the collection’s
name and the Greenstone home directory
Warnings may appear
When files are found without corresponding plug-ins
 These files will be ignored

Building Collections Manually
☑ Getting Started
☑ Making a framework for the collection
☑ Importing the documents
☐ Building the indexes
☐ Installing the collection
Building the indexes

Use the program buildcol.pl
Building the indexes

Modify collect.cfg file to customize the collection’s
appearance

collectionname


Web browsers receive this name as the title of the
collection’s front page
collectionextra
Description of the collection
 Appears under “About this collection” on the collection’s
home page
 Enter as a single line in the editor

Building the indexes

Modify collect.cfg file to customize the collection’s
appearance

iconcollection
Give the collection an icon image
 Put the location of the image between quotes
 If absent, the collection’s name will be used
 Use _httpprefix_ as a shorthand way of beginning any
URL that points within the Greenstone file area


Example:
_httpprefix_/collect/collectionName/images/icon.gif
Building the indexes

To get a list of options for the build program:


perl -S buildcol.pl
The basic build command is:

perl -S buildcol.pl collectionName
Building the indexes



The building process takes about a minute on
small collections and can take much longer for
very large collections
You may ignore most warning messages
Serious problems will cause the program to
terminate
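
The import and build commands can be chained from a small script. This sketch only assembles and runs the two basic commands shown on these slides; it assumes setup.bat has already been run in the same environment and that a serious failure shows up as a non-zero exit status. The function names are hypothetical.

```python
import subprocess

def greenstone_command(script, collection_name):
    """Build the argument list for one of the collection-building scripts."""
    if script not in ("import.pl", "buildcol.pl"):
        raise ValueError("unexpected script: " + script)
    return ["perl", "-S", script, collection_name]

def import_and_build(collection_name, run=subprocess.run):
    """Run import.pl, then buildcol.pl, stopping at the first failure."""
    for script in ("import.pl", "buildcol.pl"):
        # check=True raises CalledProcessError on a non-zero exit status
        # (assumed here to be how a terminated script reports failure).
        run(greenstone_command(script, collection_name), check=True)
```

Passing a different `run` callable makes it possible to inspect the commands without executing them.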
Building Collections Manually
☑ Getting Started
☑ Making a framework for the collection
☑ Importing the documents
☑ Building the indexes
☐ Installing the collection
Installing the collection



Building is done in the building directory
Collection must be moved to the index directory
before users can see it
Drag contents of the building directory to the
index directory


If index already contains files, remove them first
Forgetting to move the contents of building to
index is a common mistake
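
Because this step is easy to forget, it can help to script it. The following is a minimal sketch of the move described above (empty the index directory, then move everything from building into it); the function name is an assumption, and it should be tried on a copy of a collection first.

```python
import shutil
from pathlib import Path

def install_collection(collection_dir):
    """Replace the contents of index with the contents of building."""
    root = Path(collection_dir)
    building, index = root / "building", root / "index"
    new_items = list(building.iterdir())
    if not new_items:
        raise RuntimeError("building is empty; run the build step first")
    # Remove whatever an earlier build left in index.
    for old in list(index.iterdir()):
        if old.is_dir():
            shutil.rmtree(old)
        else:
            old.unlink()
    # Move the freshly built files and directories into place.
    for item in new_items:
        shutil.move(str(item), str(index / item.name))
```

The check for an empty building directory guards against wiping a working index when no build has been done.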
Installing the collection

To view the newly built collection:

Restart Greenstone


If using the Local Library version
Reload Greenstone Home Page

If using the Web version
Importing and
Building
General Information

Two Main Parts to Collection Building:
Importing (import.pl)
 Building (buildcol.pl)

Files and Directories
Collection Specific Directories
GSDLHOME
collect – all the digital library collections
collectionName – directory of collection
import – original source material
archives – result of import process
building – temporary, contents manually moved to index
index – bulk of info served to users
(import, archives and building can be deleted)
etc – contains collect.cfg file
images – icons used for the collection
perllib – Perl programs specific to collection
Other Greenstone Directories
GSDLHOME
lib – common software for both the collection server and receptionist
bin – programs used for building process
script – Perl programs used
(mkcol.pl, import.pl, buildcol.pl)
perllib – Perl modules
plugins – Perl plugins
classify – Perl classifiers
cgi-bin – Greenstone runtime system
(absent in Local Library version)
src – source code in C++
colservr – the collection server
recpt – the receptionist
Other Greenstone Directories
GSDLHOME
packages – source code for external software packages used by Greenstone
(indexing and compression program, database manager program, etc.)
(each package is stored in a directory of its own with a readme file)
bin – executables
mappings – Unicode translation tables
etc – configuration files for the entire system, initialization and error logs, user
authorization database
images – user interface images and icons
macros – small code fragments that drive the user interface
tmp – temporary files
docs – documentation for the system
Object Identifiers






Document’s permanent name in the system
Remain the same when collection rebuilt
Assigned by the import process
Stored as an attribute in the document archive
file
Character strings starting with the letters HASH
(HASH0109d3850a6de440c4d1ca2)
Used to name directory where archive file is
stored
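
One way to obtain identifiers that stay the same across rebuilds is to derive them from the document content. The slides do not say how Greenstone computes its identifiers, so the hash function (SHA-256), the identifier length, and the function name below are all assumptions; only the HASH prefix comes from the example above.

```python
import hashlib

def make_oid(content, length=24):
    """Derive a stable identifier from document bytes (illustrative only)."""
    digest = hashlib.sha256(content).hexdigest()
    # The same bytes always give the same digest, hence the same identifier.
    return "HASH" + digest[:length]

first = make_oid(b"Freshwater Resources in Arid Lands")
again = make_oid(b"Freshwater Resources in Arid Lands")
other = make_oid(b"Nugget Point Lighthouse")
```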
Plug-Ins


Plug-ins do most of the work of the import process
Operate in the order in which they are listed in the collect.cfg file






Input file is passed to each plug-in until one is found that can process it
If there is no plug-in that can process a file, a warning is printed
Plug-ins determine the traversal of the subdirectory structure in
the import directory
RecPlug - processes directories, recurses through directory
structures and passes the name through the plug-in list
GAPlug – processes Greenstone Archive Format documents (in
the archives directory structure)
ArcPlug – used during building, processes list of document OIDs
produced during import (list is stored in archives.inf file)
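
The "first plug-in that can process the file wins" behaviour can be sketched as a simple loop. The class below is a hypothetical stand-in that claims files by extension only; real plug-ins are Perl modules that also convert the file and extract metadata.

```python
import os

class ExtensionPlugin:
    """Hypothetical stand-in for a plug-in that claims files by extension."""

    def __init__(self, name, extensions):
        self.name = name
        self.extensions = tuple(extensions)

    def process(self, filename):
        """Return True if this plug-in handled the file."""
        ext = os.path.splitext(filename)[1].lower()
        return ext in self.extensions

def import_file(filename, plugins, warnings):
    """Offer the file to each plug-in in list order; the first taker wins."""
    for plugin in plugins:
        if plugin.process(filename):
            return plugin.name
    warnings.append("no plug-in can process " + filename)
    return None

# The order of this list plays the role of the order in collect.cfg.
plugins = [ExtensionPlugin("TEXTPlug", [".txt", ".text"]),
           ExtensionPlugin("HTMLPlug", [".htm", ".html"]),
           ExtensionPlugin("PDFPlug", [".pdf"])]
```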
The Import Process




Brings documents and metadata into the system in a
standardized XML form
Original material placed in import directory
Import process transforms it to files in the archives directory
The original material can be deleted
 Collection can be rebuilt from archive files

New material added to collection by placing it in import directory
and re-executing the import process
 The new material finds its way into archives along with existing files

To keep the source form of collections
 Do not delete the archives
 “Source” form can be augmented and rebuilt later
The Build Process


Creates the indexes and data structures that
make the collection operational
Indexes for the whole collection are built all at
once
Build process does not work incrementally
 Adding new material to archives requires that entire
collection be rebuilt (by issuing buildcol.pl)
 Most collections can be rebuilt overnight

Options for Import and Build
Additional Options for Import
Additional Options for Build
Options for Import and Build


To see options for any Greenstone script, type its name
at the command prompt
Options for Import and Build help with debugging (see
Table 6.5 on page 310):







verbosity
archivedir
maxdocs
collectdir
out
keepold
debug
Greenstone Archive
Documents
Greenstone Archive Format
<!DOCTYPE GreenstoneArchive [
<!ELEMENT Section (Description,Content,Section*)>
<!ELEMENT Description (Metadata*)>
<!ELEMENT Content (#PCDATA)>
<!ELEMENT Metadata (#PCDATA)>
<!ATTLIST Metadata name CDATA #REQUIRED>
]>
Document Metadata




Metadata – descriptive information about
author, title, date and keywords
Stored with metadata name
Stored at the beginning of the section
Example:

<Metadata name="Title">Freshwater Resources in
Arid Lands</Metadata>
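
The structure given by the archive DTD (a Section holding a Description of Metadata items, then Content, then any nested Sections) can be produced with a few lines of code. This is a sketch of the element layout only, using Python's standard XML library; it is not how Greenstone's import process writes its archive files, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

def make_section(metadata, text, subsections=()):
    """Build a Section element: Description, then Content, then subsections."""
    section = ET.Element("Section")
    description = ET.SubElement(section, "Description")
    for name, value in metadata:
        item = ET.SubElement(description, "Metadata", {"name": name})
        item.text = value
    ET.SubElement(section, "Content").text = text
    section.extend(subsections)  # nested Section elements, if any
    return section

chapter = make_section([("Title", "Chapter 1")], "(chapter text)")
doc = make_section([("Title", "Freshwater Resources in Arid Lands")],
                   "(document text)", [chapter])
xml_text = ET.tostring(doc, encoding="unicode")
```

The chapter title and placeholder text are made up; only the element and attribute names come from the DTD.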
Document Metadata



Dublin Core – a metadata standard
New metadata types can be invented
Metadata can be assigned by an automatic
process rather than manually entered
The Dublin Core
Collection
Configuration File
Collection Configuration File
Default Configuration File
Getting the Most Out
of Your Documents
Basic Plug-In Options
Document Processing Plug-ins
Assigning Metadata from a File


XML Document Type Definition (DTD)
Example XML Metadata File
Document Type Definition (DTD)
<!DOCTYPE GreenstoneDirectoryMetadata [
<!ELEMENT DirectoryMetadata (FileSet*)>
<!ELEMENT FileSet (FileName+,Description)>
<!ELEMENT FileName (#PCDATA)>
<!ELEMENT Description (Metadata*)>
<!ELEMENT Metadata (#PCDATA)>
<!ATTLIST Metadata name CDATA #REQUIRED>
<!ATTLIST Metadata mode (accumulate|override) "override">
]>
Example XML Metadata File
<?xml version="1.0" ?>
<!DOCTYPE GreenstoneDirectoryMetadata SYSTEM
"http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/GreenstoneDirectoryMetadata.dtd">
<DirectoryMetadata>
<FileSet>
<FileName>nugget.*</FileName>
<Description>
<Metadata name="Title">Nugget Point Lighthouse</Metadata>
<Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
</Description>
</FileSet>
<FileSet>
<FileName>nugget-point-1.jpg</FileName>
<Description>
<Metadata name="Title">Nugget Point Lighthouse</Metadata>
<Metadata name="Subject">Lighthouse</Metadata>
</Description>
</FileSet>
</DirectoryMetadata>
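
To show how such a file could be applied, the sketch below works out the metadata for one filename. It assumes that each FileName is a regular expression that must match the whole filename, that FileSets are applied in document order, and that "override" replaces earlier values while "accumulate" adds to them. These are readings of the DTD and example above, not a description of Greenstone's own code; the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>"""

def metadata_for(filename, xml_text):
    """Collect metadata values for filename from every matching FileSet."""
    result = {}
    for fileset in ET.fromstring(xml_text).iter("FileSet"):
        patterns = [e.text or "" for e in fileset.findall("FileName")]
        if not any(re.fullmatch(p, filename) for p in patterns):
            continue
        for item in fileset.iter("Metadata"):
            name, value = item.get("name"), item.text or ""
            if item.get("mode", "override") == "accumulate":
                result.setdefault(name, []).append(value)
            else:
                result[name] = [value]  # override replaces earlier values
    return result

assigned = metadata_for("nugget-point-1.jpg", SAMPLE)
```

Here `nugget-point-1.jpg` matches both FileSets, so it picks up Title, Place, and Subject, while `nugget-point-2.jpg` would match only the first.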
Tagging Document Files
<!--
<Section>
<Description>
<Metadata name="Title"> Realizing human rights for poor
people: Strategies for achieving the international
development targets </Metadata>
</Description>
-->
(text of section goes here)
<!--
</Section>
-->
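
When many sections need tagging, the comment-wrapped markers can be generated rather than typed. This sketch emits the same pattern as the example above; the function name is hypothetical, and it assumes the title contains no "--", which would end the HTML comment early.

```python
from xml.sax.saxutils import escape

def tag_section(title, body):
    """Wrap body text in comment-hidden Section tags carrying a Title."""
    return "\n".join([
        "<!--",
        "<Section>",
        "<Description>",
        # escape() protects &, < and > inside the metadata value.
        '<Metadata name="Title">' + escape(title) + "</Metadata>",
        "</Description>",
        "-->",
        body,
        "<!--",
        "</Section>",
        "-->",
    ])

tagged = tag_section("Rights & targets", "<p>(text of section goes here)</p>")
```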
Classifiers
Format Statements
Format Statements
Examples of Format Strings