Special Topics in Computer Science The Art of Information Retrieval Chapter 13: Searching the Web Alexander Gelbukh www.Gelbukh.com.


Special Topics in Computer Science

The Art of Information Retrieval Chapter 13: Searching the Web

Alexander Gelbukh

www.Gelbukh.com

Previous chapter: Conclusions

The interface is a key element of the system. If users cannot use it, it does not matter how good the system is.
Interface design choices are important at every stage of the process
    o Especially for formulating queries
    o Also for presenting results
    o 3D interfaces for presenting results
Also: overall system interface and action tracking
Quality is difficult to assess. New ideas are difficult to find
Very promising if you find them!

Previous chapter: Research topics

Many ideas throughout the chapter
    o some may be obsolete
New interface types! 3D interfaces
Ways of assessing the quality of interfaces

Web: challenges (differences)

Distributed data
Volatile data: 40% changes per month
Very large volume
    o Very large answers
    o 1998: 3,000,000 servers, 350,000,000 pages
    o 2003: Google alone: 3,307,998,701 pages (10 times more)
Unstructured and redundant data: 30% are duplicates
Quality of data: 0.5% errors, 30% in foreign names
Heterogeneous data (languages, alphabets: Chinese)
Heterogeneous and inexperienced users

Search engines

Difference: full text is not available
    o Now obsolete: Google stores it, some other engines too
Centralized (logically) architecture
    o there are distributed (physically) architectures
Crawlers (robots) collect data/index in a central place
A search engine indexes only a small part (2%? 30%?) of the Web
Recall is nearly irrelevant for simple queries
Google: a revolution (the AltaVista of our days)

Ranking

Commercial secret
Ranking can take into account hypertext
Google: PageRank algorithm
    o Roughly, # of incoming links (much more complicated)
Problems: tricks
    o Link exchange
    o Anti-trick measures: detect link exchangers
    o Penalize tricks: repeated keywords, etc.
Related pages
    o Co-cited or co-citing pages are related
    o Clustering the search results
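The slide describes PageRank only roughly, as a count of incoming links. To make the idea concrete, here is a minimal sketch of the simplified textbook formulation (power iteration with a damping factor). The toy link graph, the damping value 0.85, the iteration count, and the function name are assumptions made for this example; the ranking actually used by Google is a commercial secret and far more elaborate.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (illustrative sketch only).

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small base share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: "c" has the most incoming links, "d" has none.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
worst = min(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links ends up on top and the page nobody links to ends up last, which matches the "roughly, number of incoming links" intuition. The difference from plain counting is that a link from a highly ranked page is worth more than a link from an obscure one.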

Crawling

Depth-first? Breadth-first? Most popular first?
How to divide the work between crawlers?
Index is always obsolete
    o Not equally obsolete (like stars: each page is seen as it was at a different time)
    o Depends on crawling policy
    o 2% - 9% of invalid links. Snapshots.
PageRank first!
Robot instruction file on each server
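The crawling questions above (depth-first or breadth-first, and honoring the robot instruction file) can be illustrated with a small offline sketch. Everything here is an assumption for the example: the in-memory `graph` stands in for the Web, the `example.com` URLs and the `toy-crawler` agent name are hypothetical, and the robot instruction file is assumed to be a standard robots.txt, parsed with Python's standard `urllib.robotparser`.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl_order(graph, seed, policy="breadth", allowed=lambda url: True):
    """Return the order in which pages would be visited.

    graph: in-memory stand-in for the Web, URL -> list of linked URLs.
    policy: "breadth" uses a FIFO frontier, "depth" uses a LIFO frontier.
    allowed: predicate deciding whether the crawler may fetch a URL.
    """
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft() if policy == "breadth" else frontier.pop()
        if not allowed(url):
            continue  # excluded by the robot instructions: skip the page
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical robots.txt content, parsed without any network access.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

site = "http://example.com"
graph = {
    site + "/": [site + "/a", site + "/b"],
    site + "/a": [site + "/a1", site + "/private/x"],
    site + "/b": [site + "/b1"],
}


def may_fetch(url):
    return robots.can_fetch("toy-crawler", url)


bfs = crawl_order(graph, site + "/", "breadth", may_fetch)
dfs = crawl_order(graph, site + "/", "depth", may_fetch)
```

The only difference between the two policies is which end of the frontier the next URL is taken from. A "most popular first" or "PageRank first" policy would replace the deque with a priority queue keyed on a popularity score.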

Metasearchers

Search using many engines and unify the results
    o How to rank?! Merge rankings?
    o Inquirus: Download each page and analyze it; rank
Intersection of different major search engines is 1%
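The slide leaves open how a metasearcher should merge rankings. One simple possibility, shown here only as a sketch, is a Borda count: each engine gives a page more points the higher it ranks that page, and the points are summed. This is a swapped-in illustrative technique, not the method used by Inquirus or any particular metasearcher; the engine result lists and page names are hypothetical.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge several ranked result lists with a Borda count.

    rankings: list of lists, each ordered from best to worst.
    A list of length n gives n points to its first item, n-1 to the
    second, and so on; pages missing from a list get nothing from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, page in enumerate(ranking):
            scores[page] += n - position
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda page: (-scores[page], page))


# Hypothetical answers of three engines to the same query.
engine_a = ["p1", "p2", "p3"]
engine_b = ["p2", "p4", "p1"]
engine_c = ["p2", "p1"]

merged = borda_merge([engine_a, engine_b, engine_c])
```

Because the engines overlap so little, most pages appear in only one list, so a rank-based scheme like this mostly interleaves the lists. That is one motivation for the Inquirus approach of downloading and analyzing each page instead.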

Other topics

Indexing
Hierarchies
Interfaces
User problems (understanding Boolean search, ...)
The topics above have been covered in previous chapters
Hyperlink (structured) search
Fish search: explore the neighborhood of a hit on the fly
    o Relevant docs frequently have relevant neighbors
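The fish search idea (follow links from a hit, and keep going while relevant pages keep turning up) can be sketched as follows. This is a deliberately simplified version under stated assumptions: the in-memory `toy_graph` replaces on-the-fly page downloads, `is_relevant` is a hypothetical relevance test, and the single `depth` budget omits the other refinements of the published algorithm.

```python
from collections import deque


def fish_search(graph, is_relevant, start, depth=2, max_pages=100):
    """Explore the link neighborhood of `start`, collecting relevant pages.

    Links out of a relevant page get a fresh budget of `depth`.
    Links out of an irrelevant page inherit that page's budget minus one,
    and are not followed at all once the budget reaches zero.
    """
    queue = deque([(start, depth)])
    seen = {start}
    hits = []
    visited = 0
    while queue and visited < max_pages:
        url, budget = queue.popleft()
        visited += 1
        if is_relevant(url):
            hits.append(url)
            child_budget = depth       # relevant: neighbors are promising
        else:
            child_budget = budget - 1  # irrelevant: the trail goes cold
        if child_budget <= 0:
            continue
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, child_budget))
    return hits


# Hypothetical neighborhood: "far" is relevant but three links from the hit.
toy_graph = {
    "hit": ["n1", "x1"],
    "x1": ["x2"],
    "x2": ["far"],
}
relevant = {"hit", "n1", "far"}

near = fish_search(toy_graph, lambda u: u in relevant, "hit", depth=2)
deep = fish_search(toy_graph, lambda u: u in relevant, "hit", depth=3)
```

With a budget of 2 the search gives up after two irrelevant pages in a row and never reaches `far`; with a budget of 3 it does. The budget is the trade-off between extra downloads and extra relevant pages found.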

Research topics

NLP techniques to improve indexing and ranking
    o WSD. Anaphora? Semantic structures
Semantic Web
    o Ontologies
Text Mining to improve navigation. Web Mining (links)
Distributed architectures
Scalable index compression (? – just bigger disks)
Multimedia search

Conclusions

The Web has its own challenges as compared with general collections
Search engines have to cope with them
Gathering data (crawling) is a problem specific to the Web
Also, the Web provides new types of info (links), which can be used by search engines

Special Topics in Computer Science

The Art of Information Retrieval Chapter 14: Libraries and Bibliographical Systems

Alexander Gelbukh

www.Gelbukh.com

Differences with IR...

Historically first applications for searching
    o Predecessor of IR
Docs: bibliographic records
    o Free text
    o Structured fields (e.g., date)
Users: mostly librarians, or users of a library
    o thus: very limited budget
Usually use Boolean model (IR: vector space)
    o Seems to be mostly due to historical reasons (among others)
    o Recently tend to add natural language search
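To illustrate what the Boolean model over bibliographic records means in practice, here is a minimal sketch: free-text titles go into an inverted index, Boolean operators become set operations, and a structured field (the year) is filtered separately. The records, titles, and helper names are all hypothetical, made up for this example.

```python
from collections import defaultdict

# Hypothetical bibliographic records: free text plus a structured field.
records = [
    {"id": 1, "title": "digital library systems", "year": 1998},
    {"id": 2, "title": "library cataloguing rules", "year": 1985},
    {"id": 3, "title": "digital image processing", "year": 2001},
]


def build_index(recs):
    """Inverted index: word -> set of ids of records whose title has it."""
    index = defaultdict(set)
    for rec in recs:
        for word in rec["title"].lower().split():
            index[word].add(rec["id"])
    return index


index = build_index(records)


def postings(term):
    """Record ids for a term; an unknown term matches nothing."""
    return index.get(term.lower(), set())


def published_since(year):
    """Structured-field condition, evaluated on the records themselves."""
    return {rec["id"] for rec in records if rec["year"] >= year}


# Boolean operators map directly onto set operations.
both = postings("library") & postings("digital")       # library AND digital
either = postings("library") | postings("digital")     # library OR digital
only_lib = postings("library") - postings("digital")   # library AND NOT digital
recent = postings("digital") & published_since(2000)   # digital AND year >= 2000
```

A record either satisfies the query or it does not, so the answer is an unranked set. This is the main contrast with the vector space model, which orders documents by degree of similarity to the query.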

... Differences with IR

Creating the database is a subtask of such systems
    o Suit data to the system, not system to data as in IR
    o Carefully selected, structured, and annotated data
    o Annotation standards. Decimal classification, ...

Online Public Access Catalogs (OPAC)

Three generations:
    1. Known-item finding tools (by title, author, ...)
    2. Subject headings, keywords, ...
    3. Search strategy assistance, natural language queries, improved GUI, ...
Prove to be very hard to use by inexperienced users
Nowadays tend to become similar to digital library tools

Research topics

Ease of use
More power and flexibility?
Integration with Digital Libraries?

Conclusions

Highly interoperable and standardized
Look like legacy systems...

Special Topics in Computer Science

The Art of Information Retrieval Chapter 15: Digital Libraries

Alexander Gelbukh

www.Gelbukh.com

Digital libraries (DL)

Simplistic view: library in a machine-readable form
    o Digitization issues. Multilingual.
5S model:
    1. Streams (texts, multimedia, ...)
    2. Structures (databases, indices, ...)
    3. Spaces (interfaces in 1D, 2D, 3D, time, ...)
    4. Scenarios (procedures, transformations, services, ...)
    5. Societies (authors, annotators, ...)
This provides a way to define a DL

Architecture

Provide Web services
Manipulate Digital Objects (Items?)
Repositories of such objects. Access protocol. Standards.
Security. Payment. Copyright. Watermarking
Parallel search across heterogeneous distributed (multilingual) collections
Multimedia collections
Metadata, Standard formats
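Parallel search across distributed collections can be pictured as sending one query to every repository at once and combining the answers. The sketch below shows only that fan-out pattern, under loud assumptions: the two "archives" are in-memory dictionaries with hypothetical contents, matching is a plain substring test, and real digital libraries would sit behind a standard access protocol rather than a Python function call.

```python
from concurrent.futures import ThreadPoolExecutor


def search_collection(name, docs, query):
    """Search one collection; docs maps document id -> text."""
    needle = query.lower()
    return [(name, doc_id) for doc_id, text in docs.items()
            if needle in text.lower()]


def federated_search(collections, query):
    """Send the same query to all collections in parallel, then combine."""
    workers = max(1, len(collections))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(search_collection, name, docs, query)
                   for name, docs in collections.items()]
        results = []
        for future in futures:
            results.extend(future.result())
    # Sorted by (collection, id) only to make the output deterministic;
    # a real system would need a cross-collection ranking here.
    return sorted(results)


# Hypothetical repositories.
collections = {
    "archive_a": {1: "Digital library standards", 2: "Watermarking images"},
    "archive_b": {7: "Library interoperability protocols", 8: "Payment systems"},
}
hits = federated_search(collections, "library")
```

The hard parts named on the slide (heterogeneous formats, multilingual content, metadata standards) are exactly what this sketch assumes away by giving every collection the same dictionary shape.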

Systems

A lot of specific projects and systems are mentioned in the book.
Interoperability. Standards for automatic searching of remote libraries. Protocols

Research topics

Markup tools to produce high-quality documents
Scaling
Interoperability. Standards
Better integration with IR

Conclusions

Turning heaps of texts collected in conventional (or new) libraries into searchable and accessible information
DLs are technological solutions, which involve IR as one of their aspects
Unlike the Web, they handle carefully prepared docs. Very costly.
Like the Web, they are highly distributed and heterogeneous. Hence the importance of standardization and interoperability

Thank you!

The end Exam?
