history of information retrieval

LIS618 lecture 2 history

Thomas Krichel 2011-09-19

Based on a very fine paper by Michael Lesk, “The Seven Ages of Information Retrieval”. That paper was written in 1997, so it does not cover recent advances.

general problem

The general approach here is a computer science approach: how to build systems that allow people to obtain information.

This can be distinguished from the more positivistic approach of observing how people actually deal with obtaining information.

general overview

In the last decades, the information industry has been subjected to more technological change than any other industry.

Nevertheless, there are two other factors of note:
– the political and economic environment
– research and theoretical advances

the broad picture

Throughout the history of information retrieval there has been a constant struggle over whether to prepare the system by intellectual or by statistical methods.

Intellectual methods require substantial human input. Despite that, they have not been completely ruled out.

statistical methods

When we use statistical methods, we interrogate a text for tokens of potential semantic significance, e.g. words (details in the next lecture). We look at how often they occur in documents. It’s a brute-force job. We store a representation of this count in a computer.
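The counting step described above can be sketched in a few lines of Python. This is only an illustration: the `tokenize` and `count_terms` helpers and the two sample documents are made up for this example, and treating every lowercase run of letters or digits as a token is a crude assumption.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters/digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def count_terms(documents):
    """Map each document id to a Counter of how often each token occurs."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}

# Hypothetical sample documents.
documents = {
    "d1": "Information retrieval is about retrieval of documents.",
    "d2": "Statistical methods count words in documents.",
}
counts = count_terms(documents)
print(counts["d1"]["retrieval"])
```

The stored `counts` structure is the "representation of this count" that later retrieval steps would work on.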

information analysis method

In this approach we have humans analyze the contents of documents to make judgments about them. This analysis predates computers. Early computer-based systems had to use this because the computers were not capable of doing the analysis themselves.

In modern systems, such as web information retrieval, this is making a comeback.

1945 As We May Think

This is a very famous (but, IMHO, not a good) paper by Vannevar Bush.

Bush envisages the memex. It stores human knowledge in a desktop device.

It lets users manage commented associations between its stored items.

The essay pays scant attention to information retrieval (bar condemning existing approaches).

1950s

The USSR launches Sputnik.
The US is worried:
– need to improve the organization of science
– need to understand what the Russians are doing
  • learn Russian
  • work on machine translation
Suddenly it seems like a good idea to go back to something like a wartime science effort.

major problems

There were about 100 computers in the United States.
Computers operated in a batch-processing mode, rather than an interactive one.
The supply of machine-readable text was very small.

special hardware

There were some experiments with special hardware:
– edge-notched cards by Calvin Mooers
– the WRU Searching Selector of Allen Kent
These were abandoned as digital technology progressed.

Hans Peter Luhn (1896–1964)

In 1957 he was the first to propose extracting information from text by automated means.

– “It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements.”

KWIC

Luhn also worked on KWIC (key word in context) indexing.

The idea was that information retrieval should not only be based on words. Example (lines are cut off, as in a KWIC listing):
the idea was that Information Retrieval should
worked on KWIC key word in
Hans Peter Luhn was the firs
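A key-word-in-context listing can be sketched as follows. This is a minimal illustration, not Luhn's own procedure: the `kwic` function, its `width` parameter (number of context words on each side) and the sample sentence are assumptions made for this example.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring attached punctuation.
        if word.lower().strip(".,;:()") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines

# Hypothetical sample text.
text = "Luhn also worked on KWIC indexing. A KWIC index shows each key word in context."
for left, key, right in kwic(text, "KWIC"):
    # Right-align the left context so that the keywords line up in a column.
    print(f"{left:>25} | {key} | {right}")
```

Lining the keyword up in a fixed column is what lets a reader scan the listing quickly for the senses in which a word is used.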

1960s

This is when online IR really started.

The first systems were being built, with heavy subsidies.

Some of today’s available scientific IR systems can be traced to precursors built at that time.

fuller descriptions

As computers became more powerful, more and more descriptors could be taken from texts to make them findable in a retrieval system. Taking more descriptors, however, meant more work for the people involved in finding the descriptive terms.

full-text indexing

During the 60s people started to look at the situation where all words would be used as descriptors for the document.

Such a strategy is:
– less work for the human
– more work for the computer

IR performance criteria

In order to compare full-text and selective indexing, criteria had to be found and experiments had to be conducted.

Cyril Cleverdon (1914–1997) developed the criteria of precision and recall.

He and his disciples worked on a standard document set that could be used for testing, about 14k abstracts on aeronautics.
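As a reminder of what the two criteria measure, here is a small sketch using the standard definitions: precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that were retrieved. The `precision_recall` function and the document ids are made up for this example.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: four documents retrieved, three documents judged relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Two of the four retrieved documents are relevant, and two of the three relevant documents were found, so the two measures differ for the same run.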

and they found

They found that full-text indexing was superior to human-aided indexing.

This was very controversial at the time. All catalogers got very upset!

new retrieval techniques

Having an experimental dataset that was shared by researchers was key to developing experiments that were comparable.

One idea was “relevance feedback”. The idea was that the user would select relevant documents in a first step, and then the terms from these documents would be added to the query in a second step.
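The two steps of relevance feedback can be sketched like this. It is a simplified illustration only: the `expand_query` function, the choice to add just the `top_k` most frequent new terms, and the sample texts are all assumptions for this example, and no stop words are removed.

```python
import re
from collections import Counter

def expand_query(query_terms, selected_docs, top_k=2):
    """Add the most frequent terms of the user-selected documents to the query."""
    counts = Counter()
    for text in selected_docs:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    # Terms already in the query do not need to be added again.
    for term in query_terms:
        counts.pop(term, None)
    new_terms = [term for term, _ in counts.most_common(top_k)]
    return list(query_terms) + new_terms

# Step 1: the user marks these (hypothetical) documents as relevant.
selected = [
    "wing flutter at supersonic speed",
    "flutter of a thin wing",
    "supersonic flutter tests",
]
# Step 2: terms from those documents are added to the original query.
expanded = expand_query(["wing"], selected)
print(expanded)
```

The expanded query is then run again, in the hope that it retrieves more documents like the ones the user selected.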

NASA ReCON

The name stands for “remote console”.

It is thought to be the first multi-site bibliographic online information system for scientific literature.
First large-scale information system:
– online
– interactive
– distributed

NASA ReCON

It dealt with 200,000 citation strings. It searched for keywords manually extracted from text.

It allowed fielded searches. It allowed fields to be combined, and the results received to be combined as well.

evolution from testbeds to services

ReCON was developed by Lockheed in 1965.

It became a testbed at Lockheed using the ERIC database in 1969.

It became the DIALOG online system in 1972.

For a long time, that system was the largest general online retrieval system, mainly for bibliographic data.

specialized engines

COLEX was a system developed in 1965 for the Air Force by System Development Corporation.

It became SDC’s ORBIT system.

The National Library of Medicine used that system to build the MEDLARS system, which would give online access to medical abstracts as early as 1971.

increase of available text

The development of computer-aided typesetting, and later word processing, made a lot more text available online than before. Among the first were the larger abstracting databases, such as Chemical Abstracts.

Later newspapers followed.

other services

In 1973 the Lexis system of US court records was the first large-scale full-text database.

OCLC got tapes from LoC and printed catalog cards. Later they started shared cataloging, because the LoC only cataloged about 2/3 of what the member libraries had.

research

Research into IR prototype systems actually declined in the US.

NSF shifted research funds away to the development of actual systems.

There was some work on AI in IR, but it got pretty much nowhere.

research on models

Gerard Salton (1927–1995) and his group at Cornell refined the vector space model and introduced tf/idf term weighting in the early 70s. This will be covered later.

Later in the 70s the probabilistic model of information retrieval emerged in work by Keith van Rijsbergen (1943–) and friends. This will not be covered in the course.
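As a first taste of tf/idf weighting in the vector space model, here is a minimal sketch. It assumes one common variant of the weighting, term frequency times log(N / document frequency), which is not necessarily the exact formula used by Salton's group; the `tfidf_vectors` and `cosine` functions and the three tiny documents are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs maps a document id to a list of tokens; returns id -> {term: weight}."""
    n = len(docs)
    df = Counter()  # document frequency: in how many documents a term occurs
    for tokens in docs.values():
        df.update(set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # term frequency within this document
        vectors[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical, already tokenized documents.
docs = {
    "d1": ["wing", "flutter", "wing"],
    "d2": ["wing", "design"],
    "d3": ["medical", "abstracts"],
}
vectors = tfidf_vectors(docs)
print(cosine(vectors["d1"], vectors["d2"]))
print(cosine(vectors["d1"], vectors["d3"]))
```

Documents that share weighted terms get a positive similarity; documents with no terms in common get zero. A query can be treated as one more short document and compared in the same way.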

80s

This decade saw the introduction of inexpensive personal computers.
In libraries:
– Card catalogs were replaced by OPACs, allowing non-specialists access to online information.
– Online information started to use full text.

full-text retrieval

In the scientific abstract databases full text is not that important.

But in legal databases, it is. Lexis were the first to provide full-text access.

Later they were followed by the newspapers. Newspapers were among the first non-reference publishers to adopt computerized typesetting. These databases contained text only.

research effort

The vector model was still the starting point, but researchers tried using dictionaries to disambiguate terms in documents and queries.

Some work was done on bi-lingual retrieval.

Some work was done on part-of-speech assignment.

Research made no impact on real-world commercial systems.

CD ROM

In the late 80s the CD-ROM appeared to make a dent in online information retrieval. The CD-ROM fits the standard print publishing model. Sharing CD-ROMs, however, proved a nightmare in libraries.

90s

The 90s were the decade of the Internet.

Online information retrieval became common.

The CD-ROM lost out.

integration of graphics

The web became compelling as soon as a graphical user interface was introduced by the Mosaic browser.

This brought the subject of multimedia information retrieval, largely confined to research experiments before, out into the open.

start of search engines

Early search engines looked at the web with the eyes of earlier information retrieval engines.

They essentially applied the vector model to the text extracted from web pages. But web pages also contain pictures and, more importantly, links.

Stanford digital library project

Run at Stanford between 1995 and 2004 by a group of researchers to “provide an infrastructure that affords interoperability among heterogeneous, autonomous digital library services.”
In 1996 Sergey Brin and Larry Page started to work on a project to analyze the links between web pages.

web information retrieval

I argue that web information retrieval is different from conventional retrieval because, on the web, pages have different intrinsic values.

The PageRank algorithm is a simple algorithm to make such values appear.
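To give a feel for how such values can be made to appear from links alone, here is a minimal sketch of PageRank computed by repeated redistribution of rank along links. It is an illustration under stated assumptions, not the implementation used by any search engine: the `pagerank` function, the damping factor of 0.85, the fixed number of iterations, the handling of pages without outgoing links, and the four-page link graph are all choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps a page to the list of pages it links to; returns page -> rank."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page gets a small base share, independent of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page without links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical tiny web: three pages link to "c", nobody links to "d".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links ends up with the highest value and the page nobody links to with the lowest, without looking at the text of any page.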

http://openlib.org/home/krichel

Please shut down the computers when you are done.

Thank you for your attention!