Indexing your web server(s)
Helen Varley Sargan, University of Cambridge Computing Service
Institutional Webmasters Workshop, 7-9 September 1999


Why create an index?

• Helps users (and webmasters) to find things
• …but isn't a substitute for good navigation
• Gives cohesion to a group of unrelated servers
• Observation of logs gives information on what people are looking for - and what they are having trouble finding
• You are already being part-indexed by many search engines, unless you have taken specific action against it

Current situation

Based on a UKOLN survey of search engines used in 160 UK HEIs, carried out in July/Aug 1999. Report to be published in Ariadne issue 21. See <http://www.ariadne.ac.uk/>.

Name         Total
ht://Dig        25
Excite          19
Microsoft       12
Harvest          8
Ultraseek        7
SWISH            5
Webinator        4
Netscape         3
wwwwais          3
FreeFind         2
Other           13
None            59

Current situation questions

• Is the version of Muscat used by Surrey the free version that was available for a time (but is no longer)?
• Are the users of Excite quite happy with the security, given that development seems to have ceased?
• Are users of local search engines that don't use robots.txt happy with what other search engines can index on their sites? (You have got a robots.txt file, haven't you?)

Types of tool

• External services are robots
• Tools you install yourself fall into two main categories (some will work both ways):
  – direct indexes of the local and/or networked file structure
  – robot- or spider-based, following instructions from the robots.txt file on each indexed web server
• The programs are either in a form you have to compile yourself or precompiled for your OS, or they are written in Perl or Java, so will need either a Perl or Java runtime to function.

Controlling robot access 1

• All of our web servers are being part-indexed by external robots
• Control of external robots and of a local robot-mediated indexer is by the same route:
  – a robots.txt file to give access information
  – meta tags for robots in each HTML file, giving indexing and link-following entry or exclusion
  – meta tags in each HTML file giving description and keywords
• The first two controls are observed by all the major search engines. Some search engines do not observe description and keyword meta tags. (A minimal illustration of all three follows below.)
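As a minimal sketch of these controls (the directory names, description and keywords below are invented for illustration), a robots.txt file and the corresponding meta tags might look like this:

    # robots.txt - served from the document root of each web server
    # the rules below apply to all robots; the directory names are illustrative
    User-agent: *
    Disallow: /drafts/
    Disallow: /cgi-bin/

    <!-- in the <head> of an individual HTML file -->
    <meta name="robots" content="noindex,nofollow">
    <meta name="description" content="One-sentence summary of the page">
    <meta name="keywords" content="indexing, web servers, robots">

The robots.txt file and the robots meta tag are the widely observed controls; the description and keywords tags only influence engines that choose to read them.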


Controlling robot access 2

• Some patchy support for Dublin Core metadata
• Access to branches of the server can be limited by the server software - by combining access control with metadata you can give limited information to some users and more to others.
• If you don't want people to read files, either password-protect that section of the server or remove them. Limiting robot access to a directory can make nosey users flock to look what's inside. (Sketches of both points follow below.)
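As a hedged illustration of the first point, Dublin Core metadata is usually embedded as HTML meta tags along the following lines (the content values here are invented, and exact element usage should be checked against the Dublin Core recommendation you follow):

    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
    <meta name="DC.title" content="Indexing your web server(s)">
    <meta name="DC.creator" content="Helen Varley Sargan">
    <meta name="DC.description" content="Notes on choosing and running a local indexer">

And, assuming an Apache-style server, password-protecting a branch of the tree is typically a few lines of per-directory configuration (the realm name and file path are placeholders):

    # .htaccess in the directory to be protected
    AuthType Basic
    AuthName "Restricted area"
    AuthUserFile /usr/local/etc/httpd/passwd
    Require valid-user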


Security

• There has been a security problem with indexing software (the free version of Excite in 1998)
• Remember the security of the OS the indexing software is running under - keep all machines up to date with security patches, whether they are causing trouble or not.
• Seek help with security if you are not an expert in the OS, particularly with Unix or Windows NT

What tool to use? 1

• Find out first if any money, hardware and/or staff are available for the project
• Make a shopping list of your requirements and conditions:
  – hosting the index (where)?
  – platform (available and desirable)?
  – how many servers (and/or pages) will I index?
  – is the indexed data very dynamic?
  – what types of files do I want indexed?
  – what kind of search (keyword, phrase, natural language, constrained)?
• Are you concerned how you are indexed by others?

What tool to use? 2

• Equipped with the answers to the previous questions, you will be able to select a suitable category of tool
• If you are concerned how others index your site, install a local robot- or spider-based indexer and look at indexer control measures
• Free externally hosted services for very small needs
• Free tools (mainly Unix-based) for the technically literate, or built in to some server software
• Commercial tools cover a range of platforms and pocket-depths but vary enormously in features

Free externally hosted services

• Will be limited in the number of pages indexed and possibly the number of times the index is accessed, and the index may be deleted if not used for a certain number of days (5-7)
• Very useful for small sites and/or those with little technical experience or resources
• Access is prey to Internet traffic (most services are in the US) and server availability, and for UK users incoming transatlantic traffic will be charged for
• You may have to have advertising on your search page as a condition of use

Free tools - built in

• Microsoft, Netscape, WebStar, WebTen and WebSite Pro all come with built-in indexers (others may too)
• With any or all of these there may be problems indexing some other servers, since they all use vendor-specific APIs (they may receive responses from other servers that they can't interpret). Problems are more likely the more numerous and varied the server types being indexed.

Free tools - installed

• Most active current development is on SWISH (both SWISH-E and SWISH++), Webglimpse, ht://Dig and Alkaline
• Alkaline is a new product; all the others have been through long periods of inactivity, and all are dependent on volunteer effort
• All of these are now robot-based but may have other means of looking at directories as well
• Alkaline is available on Windows NT, but all the others are Unix. Some need to be compiled.

Commercial tools

• Most have specialisms - sort out your requirements very carefully before you select a shortlist
• The real-money price may vary from US$250 to £10,000+ (possibly with an additional yearly maintenance charge), depending on the product
• The cost of most will be on a sliding scale depending on the size of the index being used
• Bear in mind that Java-based tools will require the user to be running a Java-enabled browser

Case Study 1 - Essex

Platform: Windows NT
Number of servers searched: 16
Number of entries: approx 11,500
File types indexed: Office files, html and txt. Filters available for other formats.
Index updating: Configured with the Windows task scheduler (see the scheduling sketch after this case study). Incremental updates possible.
Constrained searches possible: Yes
Configuration: Follows robots.txt but can take a 'back door' route as well. Obeys the robots meta tag.
Logs and reports: Creates reports on crawling progress. Log analysis not included but can be written as add-ons (ASP scripts).
Pros: Free of charge with Windows NT.
Cons: Needs a high level of Windows NT expertise to set up and run effectively. May run into problems indexing servers running diverse server software. Not compatible with Microsoft Index Server (a single-server product). Creates several catalog files, which may create network problems when indexing many servers.

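On the scheduling point above: on Windows NT, a reindexing run is typically scheduled with the AT command and the Schedule service. A hedged sketch only - the script path is invented and the exact command line depends on the indexer in use:

    REM nightly-index.bat - run a hypothetical reindexing script at 02:00 every day
    AT 02:00 /every:M,T,W,Th,F,S,Su "C:\Indexer\reindex.cmd"

Whether the AT command or a graphical scheduler is used, the point is simply that updates run unattended.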

Case Study 2 - Oxford

Platform: Unix
Number of servers searched: 131
Number of entries: approx 43,500 (crawled to a maximum of 9 levels down on any server)
File types indexed: Office files, html and txt. Filters available for other formats.
Index updating: Configured to reindex after a set time period. Incremental updates possible.
Constrained searches possible: Yes, but they need to be configured on the ht://Dig server.
Configuration: Follows robots.txt but can take a 'back door' route as well (a configuration sketch follows this case study).
Logs and reports: None generated in an obvious manner, but probably available somehow.
Pros: Free of charge. Wide range of configuration options available.
Cons: Needs a high level of Unix expertise to set up and run effectively. Index files are very large.

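For a flavour of what this kind of ht://Dig setup involves, a fragment of its configuration file might look roughly like the sketch below. The paths and hostname are invented, and attribute names should be checked against the documentation for the ht://Dig version actually installed:

    # htdig.conf - illustrative fragment only
    database_dir:   /opt/htdig/db
    start_url:      http://www.example.ac.uk/
    limit_urls_to:  .example.ac.uk/
    # stop the crawl 9 levels down, as in the case above
    max_hop_count:  9
    exclude_urls:   /cgi-bin/ .cgi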

Case Study 3 - Cambridge

Platform: Unix
Number of servers searched: 232
Number of entries: approx 188,000
File types indexed: Many formats, including PDF, html and txt.
Index updating: Intelligent incremental reindexing dependent on the frequency of file updates - can be given a permitted schedule. Manual incremental updates easily done.
Constrained searches possible: Yes, easily configured by users; known constrained searches can also be added to the configuration.
Configuration: Follows robots.txt and meta tags. Configurable weighting given to terms in title and meta tags. Thesaurus add-on available to give user-controlled alternatives.
Logs and reports: Logs and reports available for every aspect of use - search terms, number of terms, servers searched, etc.
Pros: Very easy to install and maintain. Gives extremely good results in a problematic environment. Technical support excellent.
Cons: Relatively expensive.


Recommendations

• Choosing an appropriate search engine is wholly dependent on your particular needs and circumstances
• Sort out all your robot-based indexing controls when you install your local indexer
• Do review your indexing software regularly - even if it's trouble free, it still needs maintaining