Special Topics in Computer Science The Art of Information Retrieval Chapter 13: Searching the Web Alexander Gelbukh www.Gelbukh.com.
Special Topics in Computer Science
The Art of Information Retrieval Chapter 13: Searching the Web
Alexander Gelbukh
www.Gelbukh.com
Previous chapter: Conclusions
Interface is a key element of the system. If the users cannot use it, it does not matter how good it is.
Interface design choices are important at any stage of the process
  o Especially to formulate queries
  o Also to present results
  o 3D interfaces to present results
Also, overall system interface and action tracking
Difficult to assess quality. Difficult to find new ideas
Very promising if you find them!
Previous chapter: Research topics
Many ideas throughout the chapter
  o some may be obsolete
New interface types! 3D interfaces
Ways of assessing the quality of interfaces
Web: challenges (differences)
Distributed data
Volatile data: 40% changes per month
Very large volume
  o Very large answers
  o 1998: 3,000,000 servers, 350,000,000 pages
  o 2003: Google alone: 3,307,998,701 pages (10 times more)
Unstructured and redundant data. 30% are duplicates
Quality of data. 0.5% errors, 30% in foreign names
Heterogeneous data (languages, alphabets: Chinese)
Heterogeneous and inexperienced users
Search engines
Difference: full text is not available
  o Now obsolete: Google stores it, some other engines too
Centralized (logically) architecture
  o there are distributed (physically) architectures
Crawlers (robots) collect data/index in a central place
A search engine indexes only a small part (2%? 30%?) of the Web
Recall is almost irrelevant for simple queries
Google: a revolution (the AltaVista of our days)
Ranking
Commercial secret
Ranking can take into account hypertext
Google: PageRank algorithm
  o Roughly, # of incoming links (much more complicated)
Problems: tricks
  o Link exchange
  o Anti-trick measures: detect link exchangers
  o Penalize tricks: repeated keywords, etc.
Related pages
  o Co-cited or co-citing pages are related
  o Clustering the search results
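The idea that a page's rank grows with its incoming links can be illustrated with a minimal power-iteration sketch. This is the simplified textbook form of PageRank, not Google's actual ranking (which, as noted above, is a commercial secret). The damping factor 0.85, the iteration count, and the toy link graph are assumptions chosen only for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: C has the most incoming links, D has none.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph the page with the most incoming links ends up with the highest rank and the page nobody links to ends up with the lowest, which is the "roughly, # of incoming links" intuition. A link from a highly ranked page also counts for more than a link from a poorly ranked one, which is part of the "much more complicated".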
Crawling
Depth-first? Breadth-first? Most popular first?
How to divide the work between crawlers?
Index is always obsolete
  o Not equally obsolete (like stars in the sky: each is seen as of a different moment)
  o Depends on crawling policy
  o 2% - 9% of invalid links. Snapshots.
PageRank first!
Robot instruction file (robots.txt) on each server
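The crawling choices above (visit order, robot instruction file) can be sketched on a small in-memory link graph. The graph, the URLs under example.com, the crawler name "ExampleCrawler", and the function crawl_order are all hypothetical. "Popular first" is approximated here by the in-link count within the toy graph, and "depth" is a newest-first (stack) order that only approximates true depth-first traversal. The robots.txt rules are parsed with the standard library's urllib.robotparser from a string, so nothing is fetched from the network.

```python
import heapq
from urllib.robotparser import RobotFileParser


def crawl_order(graph, seed, strategy="breadth", allowed=lambda url: True):
    """Return the order in which pages would be visited.

    graph maps a URL to the URLs it links to (a stand-in for the Web).
    strategy is "breadth", "depth" or "popular".
    """
    in_links = {}
    for targets in graph.values():
        for target in targets:
            in_links[target] = in_links.get(target, 0) + 1

    counter = 0  # discovery order, also used as a tie-breaker
    frontier = [(0, counter, seed)]
    seen = {seed}
    order = []
    while frontier:
        _, _, url = heapq.heappop(frontier)
        if not allowed(url):
            continue  # respect the server's robot instructions
        order.append(url)
        for target in graph.get(url, []):
            if target in seen:
                continue
            seen.add(target)
            counter += 1
            if strategy == "breadth":
                priority = counter    # oldest discovered first
            elif strategy == "depth":
                priority = -counter   # newest discovered first
            else:
                priority = -in_links.get(target, 0)  # most linked-to first
            heapq.heappush(frontier, (priority, counter, target))
    return order


root = "http://example.com/"
graph = {
    root: [root + "a", root + "b"],
    root + "a": [root + "c"],
    root + "b": [root + "c", root + "private/x"],
    root + "c": [],
}

# Robot instruction file, given inline instead of being downloaded.
robots = RobotFileParser()
robots.parse("User-agent: *\nDisallow: /private/\n".splitlines())


def allowed(url):
    return robots.can_fetch("ExampleCrawler", url)


breadth = crawl_order(graph, root, "breadth", allowed)
depth = crawl_order(graph, root, "depth", allowed)
popular = crawl_order(graph, root, "popular", allowed)
```

All three strategies skip the page under /private/ because the robots.txt rules disallow it; they differ only in how early the heavily linked page c is reached.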
Metasearchers
Search using many engines and unify the results
  o How to rank?! Merge rankings?
  o Inquirus: Download each page and analyze it; rank
Intersection of different major search engines is 1%
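One possible answer to "Merge rankings?" is to combine positions rather than engine-specific scores, since scores from different engines are not comparable. The sketch below uses reciprocal rank fusion, a technique swapped in for illustration and not described in the slides. The constant k=60, the function name merge_rankings, and the result lists are assumptions.

```python
def merge_rankings(rankings, k=60):
    """Merge several ranked lists by summing reciprocal ranks.

    Each ranking is a list of URLs, best first. A URL scores
    1 / (k + position) for every list in which it appears.
    """
    scores = {}
    for ranking in rankings:
        for position, url in enumerate(ranking, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / (k + position)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical answers from three engines for the same query.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
engine_c = ["u5", "u2"]
merged = merge_rankings([engine_a, engine_b, engine_c])
```

Pages returned by several engines (u2, then u1) rise to the top, while pages returned by a single engine keep the order of their positions. Because the engines overlap so little, most merged results come from only one engine, which is why approaches such as Inquirus download and analyze the pages themselves.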
Other topics
Indexing, Hierarchies, Interfaces, User problems (understanding Boolean search, ...) have been covered in previous chapters
Hyperlink (structured) search
Fish search: explore neighborhood of a hit on the fly
  o Relevant docs frequently have relevant neighbors
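The fish search idea can be sketched as a link traversal that keeps going where pages are relevant and gives up after a few irrelevant pages in a row. This is a simplified illustration, not the published algorithm in full: relevance is reduced to a substring match, and the graph, page texts, depth budget, and function name fish_search are hypothetical.

```python
from collections import deque


def fish_search(graph, texts, start, query, depth=2):
    """Explore the link neighborhood of a hit.

    Children of a relevant page get the full depth budget again;
    children of an irrelevant page get a budget reduced by one.
    """
    def relevant(url):
        return query.lower() in texts.get(url, "").lower()

    found = []
    seen = {start}
    queue = deque([(start, depth)])
    while queue:
        url, budget = queue.popleft()
        if relevant(url):
            found.append(url)
            child_budget = depth
        else:
            child_budget = budget - 1
        if child_budget <= 0:
            continue  # too many irrelevant pages in a row: stop here
        for target in graph.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, child_budget))
    return found


# Hypothetical neighborhood of a search hit.
graph = {
    "hit": ["n1", "n2"],
    "n1": ["n3"],
    "n2": ["n4"],
    "n4": ["n5"],
    "n5": ["n6"],
}
texts = {
    "hit": "web crawling basics",
    "n1": "crawling policies",
    "n2": "cooking recipes",
    "n3": "distributed crawling",
    "n4": "gardening",
    "n5": "crawling robots",
    "n6": "miscellaneous",
}
shallow = fish_search(graph, texts, "hit", "crawling", depth=2)
deeper = fish_search(graph, texts, "hit", "crawling", depth=3)
```

With the smaller budget the search stops after two irrelevant pages in a row (n2, n4) and never reaches n5; a larger budget finds it at the cost of visiting more pages.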
Research topics
NLP techniques to improve indexing and ranking
  o Word sense disambiguation (WSD). Anaphora? Semantic structures
Semantic Web
  o Ontologies
Text Mining to improve navigation. Web Mining (links)
Distributed architectures
Scalable index compression (? – just bigger disks)
Multimedia search
Conclusions
The Web has its own challenges as compared with general collections
Search engines have to cope with them
Gathering data (crawling) is a problem specific to the Web
Also, the Web provides new types of info (links), which can be used by search engines
Special Topics in Computer Science
The Art of Information Retrieval Chapter 14: Libraries and Bibliographical Systems
Alexander Gelbukh
www.Gelbukh.com
Differences with IR...
Historically first applications for searching
  o Predecessor of IR
Docs: bibliographic records
  o Free text
  o Structured fields (e.g., date)
Users: mostly librarians, or users of a library
  o thus: very limited budget
Usually use Boolean model (IR: vector space)
  o Seems to be mostly due to historical reasons (among others)
  o Recently tend to add natural language search
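A Boolean query over bibliographic records, combining free text with a structured field, can be sketched with set operations on an inverted index. The records below are invented sample data, and the field names (id, title, year) and helper names are assumptions, not a real catalog format.

```python
import re

# Invented sample records: free-text title plus a structured year field.
records = [
    {"id": 1, "title": "Retrieval of information", "year": 1999},
    {"id": 2, "title": "Library cataloging rules", "year": 1985},
    {"id": 3, "title": "Information systems for the library", "year": 2001},
    {"id": 4, "title": "Cataloging in the digital library", "year": 2003},
]


def build_index(records):
    """Map each title word to the set of record ids containing it."""
    index = {}
    for record in records:
        for word in re.findall(r"[a-z]+", record["title"].lower()):
            index.setdefault(word, set()).add(record["id"])
    return index


index = build_index(records)
by_id = {record["id"]: record for record in records}


def term(word):
    """Ids of records whose title contains the word (empty set if none)."""
    return index.get(word.lower(), set())


# (information OR cataloging) AND library AND year >= 1990
text_hits = (term("information") | term("cataloging")) & term("library")
hits = sorted(i for i in text_hits if by_id[i]["year"] >= 1990)
```

The result is an unranked set: a record either satisfies the query or it does not, in contrast with the vector space model, which orders documents by degree of similarity.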
... Differences with IR
Creating the database is a subtask of such systems
  o Suit data to the system, not system to data as in IR
  o Carefully selected, structured, and annotated data
  o Annotation standards. Decimal classification, ...
Online Public Access Catalogs (OPAC)
Three generations:
  1. Known-item finding tools (by title, author, ...)
  2. Subject headings, keywords, ...
  3. Search strategy assistance, natural language queries, improved GUI, ...
Prove to be very hard to use for inexperienced users
Nowadays tend to become similar to digital library tools
Research topics
Ease of use
More power and flexibility ?
Integration with Digital Libraries ?
Conclusions
Highly interoperable and standardized
Look like legacy systems...
Special Topics in Computer Science
The Art of Information Retrieval Chapter 15: Digital Libraries
Alexander Gelbukh
www.Gelbukh.com
Digital libraries (DL)
Simplistic view: library in a machine-readable form
  o Digitization issues. Multilingual.
5S model:
  1. Streams (texts, multimedia, ...)
  2. Structures (databases, indices, ...)
  3. Spaces (interfaces in 1D, 2D, 3D, time, ...)
  4. Scenarios (procedures, transformations, services, ...)
  5. Societies (authors, annotators, ...)
This provides a way to define a DL
Architecture
Provide Web services
Manipulate Digital Objects (Items?)
Repositories of such objects. Access protocol. Standards.
Security. Payment. Copyright. Watermarking
Parallel search across heterogeneous distributed (multilingual) collections
Multimedia collections
Metadata, Standard formats
Systems
A lot of specific projects and systems are mentioned in the book.
Interoperability. Standards for automatically searching remote libraries. Protocols
Research topics
Markup tools to produce high-quality documents
Scaling
Interoperability. Standards
Better integration with IR
Conclusions
Turning heaps of texts collected in conventional (or new) libraries into searchable and accessible information
DLs are technological solutions, which involve IR as one of their aspects
Unlike the Web, they handle carefully prepared docs.
Very costly.
Like the Web, they are highly distributed and heterogeneous. Hence the importance of standardization and interoperability
Thank you!
The end Exam?