Transcript: Indexing your web server(s)
Helen Varley Sargan, University of Cambridge Computing Service
Institutional Webmasters Workshop, 7-9 September 1999
Why create an index?
• Helps users (and webmasters) to find things
• …but isn't a substitute for good navigation
• Gives cohesion to a group of unrelated servers
• Observation of logs gives information on what people are looking for - and what they are having trouble finding
• You are already being part-indexed by many search engines, unless you have taken specific action against it
Current situation
Based on a UKOLN survey of search engines used in 160 UK HEIs, carried out in July/Aug 1999. The report is to be published in Ariadne issue 21; see <http://www.ariadne.ac.uk/>.

Name        Total
ht://Dig    25
Excite      19
Microsoft   12
Harvest     8
Ultraseek   7
SWISH       5
Webinator   4
Netscape    3
wwwwais     3
FreeFind    2
Other       13
None        59
Current situation questions
• Is the version of Muscat used by Surrey the free version that was available for a time (but is not any more)?
• Are the users of Excite quite happy with its security, and with the fact that development seems to have ceased?
• Are users of local search engines that don't use robots.txt happy with what other search engines can index on their sites (you have got a robots.txt file, haven't you?)
Types of tool
• External services are robots
• Tools you install yourself fall into two main categories (some will work both ways):
  – direct indexes of the local and/or networked file structure
  – robot- or spider-based, following instructions from the robots.txt file on each web server indexed
• The programs come either in a form you have to compile yourself or precompiled for your OS, or they are written in Perl or Java and so will need a Perl or Java runtime to function
Controlling robot access 1
• All of our web servers are being part-indexed by external robots
• Control of external robots and of a local robot-mediated indexer is by the same route:
  – a robots.txt file to give access information
  – robots meta tags in each HTML file, allowing or excluding indexing and link-following
  – meta tags in each HTML file giving a description and keywords
• The first two controls are observed by all the major search engines; some search engines do not observe description and keyword meta tags
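To make these controls concrete, here is a sketch (the paths, directory names and wording are illustrative, not taken from the talk) of a robots.txt file served from the root of a web server, together with the per-page meta tags mentioned above:

  # robots.txt - must live at the server root, e.g. http://www.example.ac.uk/robots.txt
  User-agent: *            # these rules apply to all robots
  Disallow: /private/      # do not crawl this branch
  Disallow: /cgi-bin/

  <!-- in the <head> of an individual HTML page -->
  <meta name="robots" content="noindex, nofollow">  <!-- exclude this page from indexing and link-following -->
  <meta name="description" content="A short summary shown in search results">
  <meta name="keywords" content="indexing, search engines, web servers">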
Controlling robot access 2
• Some patchy support for Dublin Core metadata
• Access to branches of the server can be limited by the server software - by combining access control with metadata you can give limited information to some users and more to others
• If you don't want people to read files, either password-protect that section of the server or remove them. Limiting robot access to a directory can make nosey users flock to look at what's inside.
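Where Dublin Core is used, it is normally embedded as 'DC.'-prefixed meta tags in the page head; a minimal sketch (the values here are invented for illustration, and, as noted above, indexer support for them is patchy):

  <meta name="DC.Title" content="Indexing your web server(s)">
  <meta name="DC.Creator" content="Helen Varley Sargan">
  <meta name="DC.Date" content="1999-09-08">
  <meta name="DC.Subject" content="web indexing; search engines">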
Security
• There has been a security problem with indexing software (the free version of Excite, in 1998)
• Remember the security of the OS the indexing software is running under - keep all machines up to date with security patches, whether they are causing trouble or not
• Seek help with security if you are not an expert in the OS, particularly with Unix or Windows NT
What tool to use? 1
• Find out first if any money, hardware and/or staff are available for the project
• Make a shopping list of your requirements and conditions:
  – where will the index be hosted?
  – platform (available and desirable)?
  – how many servers (and/or pages) will I index?
  – is the indexed data very dynamic?
  – what types of files do I want indexed?
  – what kind of search (keyword, phrase, natural language, constrained)?
• Are you concerned about how you are indexed by others?
What tool to use? 2
• Equipped with the answers to the previous questions, you will be able to select a suitable category of tool
• If you are concerned about how others index your site, install a local robot- or spider-based indexer and look at indexer control measures
• Free externally hosted services suit very small needs
• Free tools (mainly Unix-based) suit the technically literate, or come built in to some server software
• Commercial tools cover a range of platforms and pocket depths but vary enormously in features
Free externally hosted services

• Will be limited in the number of pages indexed, and possibly in the number of times the index is accessed, and may be deleted if not used for a certain number of days (5-7)
• Very useful for small sites and/or those with little technical experience or resources
• Access is prey to Internet traffic (most services are in the US) and server availability, and for UK users incoming transatlantic traffic will be charged for
• You may have to carry advertising on your search page as a condition of use
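These services are typically wired into a site by pasting a small search form into your pages, which submits queries to the provider's server; a sketch with a made-up service URL and parameter names (each provider's own form will differ):

  <form method="get" action="http://search.hosted-example.com/sitesearch">
    <input type="hidden" name="account" value="your-site-id">
    <input type="text" name="query" size="30">
    <input type="submit" value="Search this site">
  </form>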
Free tools - built in
• Microsoft, Netscape, WebStar, WebTen and WebSite Pro all come with built-in indexers (others may too)
• With any or all of these there may be problems indexing some other servers, since they all use vendor-specific APIs (they may receive responses from other servers that they can't interpret). Problems are more likely the more numerous and varied the server types being indexed.
Free tools - installed
• Most active current development is on SWISH (both -E and ++), Webglimpse, ht://Dig and Alkaline
• Alkaline is a new product; all the others have been through long periods of inactivity and all are dependent on volunteer effort
• All of these are now robot-based but may have other means of looking at directories as well
• Alkaline is available on Windows NT, but all the others are Unix. Some need to be compiled.
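As a flavour of what installation involves, a minimal SWISH-E style configuration for indexing a local document tree (the directory method mentioned above) might look like the sketch below; the paths are invented and the directive names should be checked against the version you install:

  # swish.conf - minimal sketch, paths are illustrative
  IndexDir  /var/www/htdocs          # document tree to index
  IndexFile /var/indexes/site.index  # where the index is written
  IndexOnly .html .htm .txt          # restrict indexing to these extensions

  # build the index, then run a test query from the command line
  swish-e -c swish.conf
  swish-e -f /var/indexes/site.index -w prospectus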
Commercial tools
• Most have specialisms - sort out your requirements very carefully before you select a shortlist
• The real-money price may vary from US$250 to £10,000+ (possibly with additional yearly maintenance), depending on the product
• The cost of most will be on a sliding scale depending on the size of index being used
• Bear in mind that Java-based tools will require the user to be running a Java-enabled browser
Case Study 1 - Essex
Platform: Windows NT
Number of servers searched: 16
Number of entries: approx 11,500
File types indexed: Office files, html and txt. Filters available for other formats.
Index updating: Configured with the Windows task scheduler. Incremental updates possible.
Constrained searches possible: Yes
Configuration: Follows robots.txt but can take a 'back door' route as well. Obeys the robots meta tag.
Logs and reports: Creates reports on crawling progress. Log analysis not included but can be written as add-ons (ASP scripts).
Pros: Free of charge with Windows NT.
Cons: Needs a high level of Windows NT expertise to set up and run effectively. May run into problems indexing servers running diverse server software. Not compatible with Microsoft Index Server (a single-server product). Creates several catalog files, which may create network problems when indexing many servers.
Case Study 2 - Oxford
Platform: Unix
Number of servers searched: 131
Number of entries: approx 43,500 (specifically, a maximum of 9 levels down on any server)
File types indexed: Office files, html and txt. Filters available for other formats.
Index updating: Configured to reindex after a set time period. Incremental updates possible.
Constrained searches possible: Yes, but they need to be configured on the ht://Dig server.
Configuration: Follows robots.txt but can take a 'back door' route as well.
Logs and reports: None generated in an obvious manner, but probably available somehow.
Pros: Free of charge. Wide number of configuration options available.
Cons: Needs a high level of Unix expertise to set up and run effectively. Index files are very large.
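For context, an ht://Dig installation of this kind is driven by an htdig.conf file, with indexes built by the rundig script; the sketch below uses invented values (it does not reproduce Oxford's configuration) and the attribute names should be checked against the installed version:

  # htdig.conf - minimal sketch, values are illustrative
  database_dir:   /var/htdig/db
  start_url:      http://www.example.ac.uk/
  limit_urls_to:  ${start_url}
  exclude_urls:   /cgi-bin/ .cgi
  max_hop_count:  9                      # e.g. limit crawling to 9 levels down
  maintainer:     webmaster@example.ac.uk

  # build or refresh the index; searches are then served by the htsearch CGI
  rundig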
Case Study 3 - Cambridge

Platform: Unix
Number of servers searched: 232
Number of entries: approx 188,000
File types indexed: Many formats, including PDF, html and txt.
Index updating: Intelligent incremental reindexing dependent on the frequency of file updates - can be given a permitted schedule. Manual incremental updates easily done.
Constrained searches possible: Yes, easily configured by users and can also be added to the configuration as a known constrained search.
Configuration: Follows robots.txt and meta tags. Configurable weighting given to terms in titles and meta tags. Thesaurus add-on available to give user-controlled alternatives.
Logs and reports: Logs and reports available for every aspect of use - search terms, number of terms, servers searched, etc.
Pros: Very easy to install and maintain. Gives extremely good results in a problematic environment. Technical support excellent.
Cons: Relatively expensive.
Recommendations
• Choosing an appropriate search engine is wholly dependent on your particular needs and circumstances
• Sort out all your robot-based indexing controls when you install your local indexer
• Do review your indexing software regularly - it still needs maintaining even if it's trouble free