Chapter 7 Web Content Mining Xxxxxx


Introduction

• Web-content mining techniques are used to discover useful information from content on the web:
  – text
  – audio
  – video
  – still images
  – metadata
  – hyperlinks

Introduction

• Some web content is generated dynamically using queries to database management systems
• Other web content may be hidden from general users

Introduction

• Problems with web data:
  – Distributed data
  – Large volume
  – Unstructured data
  – Redundant data
  – Quality of data
  – High percentage of volatile data
  – Varied data

Introduction

• Two approaches to web-content mining:
  – Agent-based: software agents perform the content mining
  – Database-oriented: the Web data is viewed as belonging to a database

Web Crawler

• A web crawler is a computer program that navigates the hypertext structure of the web
  – Crawlers are used to ease the formation of indexes used by search engines
  – The page(s) that the crawler begins with are called the seed URLs
  – Every link from the first page is recorded and saved in a queue
• One type of crawler builds an index by visiting a number of pages and then replaces the current index
  – It is known as a periodic crawler because it is activated periodically
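The seed-and-queue behaviour described above can be sketched as a breadth-first traversal. This is only an illustration: the toy link graph `TOY_WEB`, the `crawl` function, and the `max_pages` limit are assumptions made for the example, and no real pages are fetched.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}


def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    queue = deque(seed_urls)   # links waiting to be visited
    seen = set(seed_urls)      # avoids queuing the same URL twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(["http://example.org/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

A periodic crawler would rerun a traversal like this on a schedule and swap the freshly built index in for the old one.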

Web Crawler

• Another type is the focused crawler
  – Generally recommended because of the large size of the Web
  – Visits only pages related to topics of interest
  – If a page is not pertinent, the entire set of possible pages below it is pruned
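A rough sketch of the pruning idea follows. The relevance test here (a simple keyword match) and the toy page set `PAGES` are assumptions for illustration; a real focused crawler would typically use a trained classifier to judge pertinence.

```python
from collections import deque

# Hypothetical pages: name -> (page text, outgoing links).
PAGES = {
    "root": ("data mining overview", ["mining", "sports"]),
    "mining": ("web content mining and text mining", ["crawler"]),
    "sports": ("football scores and results", ["league"]),
    "league": ("data mining for league tables", []),
    "crawler": ("focused crawler for mining the web", []),
}


def is_relevant(text, topic_terms):
    """Assumed relevance test: the page mentions at least one topic term."""
    words = text.lower().split()
    return any(term in words for term in topic_terms)


def focused_crawl(seed, pages, topic_terms):
    """Visit pages breadth-first, but never follow links of irrelevant pages."""
    queue = deque([seed])
    seen = {seed}
    kept = []
    while queue:
        name = queue.popleft()
        text, links = pages[name]
        if not is_relevant(text, topic_terms):
            continue  # prune: everything reachable only through this page is skipped
        kept.append(name)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept


kept_pages = focused_crawl("root", PAGES, {"mining"})
print(kept_pages)
```

Note that the "league" page is never visited even though it mentions mining, because its only parent ("sports") was judged not pertinent.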

Multiple Layered Database

• Every layer of the database is more generalized than the layer below it
• Unlike the lowest level, the upper levels are structured and can be mined with an SQL-like query language

Multiple Layered Database

• Provides an abstracted view of a fraction of the web
• A Virtual Web View (VWV) can be constructed

Search Engine

• Basic components of a search engine:
  – The spider gathers new or updated information from Internet websites
  – The index stores information about the websites
  – The search software searches through the huge index to generate an ordered list of useful search results
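The index and search-software components can be illustrated with a small inverted index. This is a minimal sketch, not how any particular search engine works: the documents in `DOCS` are invented, and ranking by the number of matching query terms is an assumed, deliberately simple scoring rule.

```python
from collections import defaultdict

# Hypothetical documents, as if already gathered by the spider.
DOCS = {
    "d1": "web content mining",
    "d2": "web crawler",
    "d3": "image retrieval",
}


def build_index(docs):
    """Inverted index: each word maps to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return matching documents, ordered by how many query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


index = build_index(DOCS)
results = search(index, "web mining")
print(results)
```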

Types of Queries

• Boolean queries:
  – Connect the words in the search using operators such as AND or OR
• Natural language queries:
  – The user frames the query as a question or a statement
• Thesaurus queries:
  – The user selects the term from a set of terms predetermined by the retrieval system
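As a small illustration of Boolean queries, AND and OR can be evaluated as set intersection and set union over the documents each term appears in. The postings in `POSTINGS` and the function names are assumptions made for this sketch.

```python
# Hypothetical postings: term -> ids of the documents that contain it.
POSTINGS = {
    "web": {1, 2, 3},
    "mining": {1, 3},
    "crawler": {2},
}


def boolean_and(postings, *terms):
    """Documents that contain every one of the given terms."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(postings, *terms):
    """Documents that contain at least one of the given terms."""
    sets = [postings.get(term, set()) for term in terms]
    return set.union(*sets) if sets else set()


and_result = boolean_and(POSTINGS, "web", "mining")
or_result = boolean_or(POSTINGS, "mining", "crawler")
print(and_result, or_result)
```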

Types of Queries

• Fuzzy queries:
  – Fuzzy queries reflect no specificity
• Term searches:
  – The most common type of query on the Web is when a user provides a few words or phrases for the search
• Probabilistic queries:
  – Probabilistic queries refer to the way in which the IR system retrieves documents according to relevancy

The Robot Exclusion

• Why would developers prefer to exclude robots from parts of their websites?
• The robot exclusion protocol:
  – indicates restricted parts of the website to robots that visit the site
  – gives spiders ("robots") limited access to a website

The Robot Exclusion

• Website administrators and content providers can limit robot activity through two mechanisms:
  – The Robots Exclusion Protocol is used by website administrators to specify which parts of the site should not be visited by a robot, by providing a file called robots.txt on their site
  – The Robots META tag is a special HTML META tag that can be used in any web page to indicate whether that page should be indexed or parsed for links
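A well-behaved crawler can check robots.txt rules before fetching a page. The sketch below uses `urllib.robotparser` from Python's standard library; the robots.txt content, the "ExampleBot" user agent, and the example.org URLs are invented for illustration, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every robot is barred from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

allowed_public = parser.can_fetch("ExampleBot", "http://example.org/index.html")
allowed_private = parser.can_fetch("ExampleBot", "http://example.org/private/data.html")
print(allowed_public, allowed_private)
```

The Robots META tag works at the level of a single page instead: values such as `noindex` and `nofollow` in the tag's content tell a robot not to index the page or not to follow its links.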

Personalization of Web Content

• Used to modify the contents of a web page to suit the needs of a user
  – Essentially, this involves building web pages exclusively for each user

Types of Web Page Personalization

• Collaborative filtering: achieves personalization by suggesting web pages that have earlier been given high ratings by similar users
• Manual techniques: perform personalization through rules that classify individuals based on profiles or demographics
• Content-based filtering: retrieves pages based on the similarity between them and user profiles
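Collaborative filtering can be sketched as: find the most similar user, then suggest the pages that user rated highly. The ratings in `RATINGS`, the use of cosine similarity, the single nearest neighbour, and the `min_rating` threshold are all assumptions chosen to keep the example short.

```python
import math

# Hypothetical page ratings on a 1-5 scale: user -> {page: rating}.
RATINGS = {
    "alice": {"p1": 5, "p2": 4, "p3": 1},
    "bob": {"p1": 5, "p2": 5, "p4": 4},
    "carol": {"p1": 1, "p3": 5, "p5": 5},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[p] * v[p] for p in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, min_rating=4):
    """Suggest pages rated highly by the most similar user and unseen by `user`."""
    target = ratings[user]
    others = [(cosine(target, r), name) for name, r in ratings.items() if name != user]
    _, nearest = max(others)
    return sorted(
        page
        for page, score in ratings[nearest].items()
        if score >= min_rating and page not in target
    )


suggestions = recommend("alice", RATINGS)
print(suggestions)
```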

Multimedia Information Retrieval

• Considered here from the perspective of images and videos
• One content-based retrieval system for images is the Query by Image Content (QBIC) system. Its features include:
  – A three-dimensional color feature vector, where the distance measure is simple Euclidean distance
  – k-dimensional color histograms, where the bins of the histogram can be chosen by a partition-based clustering algorithm
  – A three-dimensional texture vector consisting of features that measure scale, directionality, and contrast. Distance is computed as a weighted Euclidean distance measure, where the default weights are the inverse variances of the individual features
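The two distance measures mentioned above can be written out as follows. This is a sketch of the formulas only, not QBIC's actual code: the texture vectors in `TEXTURES` are made-up numbers, and the population variance over a small sample collection stands in for whatever statistics a real system would use.

```python
import math
from statistics import pvariance


def euclidean(a, b):
    """Plain Euclidean distance, as used for the color feature vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inverse_variance_weights(vectors):
    """One weight per feature: 1 / variance of that feature over the collection.

    Assumes every feature varies; a zero variance would raise ZeroDivisionError.
    """
    return [1.0 / pvariance(column) for column in zip(*vectors)]


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance, as used for the texture vector."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Hypothetical texture vectors: (scale, directionality, contrast).
TEXTURES = [(1.0, 0.2, 10.0), (3.0, 0.4, 30.0), (5.0, 0.6, 20.0)]

color_distance = euclidean((0, 0, 0), (3, 4, 0))
weights = inverse_variance_weights(TEXTURES)
texture_distance = weighted_euclidean(TEXTURES[0], TEXTURES[1], weights)
print(color_distance, weights, texture_distance)
```

Weighting by inverse variance keeps a feature with a large numeric range (contrast in this toy data) from dominating the distance.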

Multimedia Information Retrieval

• The query can be expressed directly in terms of the feature representation itself
  – For instance: find images that are 40% blue in color and contain a texture with a specific coarseness property
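The color half of such a query could look roughly like this. Everything here is an assumption for illustration: images are tiny lists of RGB tuples, a pixel counts as "blue" when blue is its strongest channel, and the query is read as "at least 40% blue". The texture condition is left out.

```python
def blue_fraction(pixels):
    """Fraction of pixels whose strongest channel is blue (an assumed definition)."""
    blue = sum(1 for r, g, b in pixels if b > r and b > g)
    return blue / len(pixels)


def find_blue_images(images, min_fraction=0.4):
    """Names of images in which at least `min_fraction` of the pixels are blue."""
    return sorted(
        name for name, pixels in images.items()
        if blue_fraction(pixels) >= min_fraction
    )


# Hypothetical images, each a flat list of (R, G, B) pixels.
IMAGES = {
    "sky": [(10, 20, 200)] * 6 + [(200, 200, 200)] * 4,
    "forest": [(20, 180, 30)] * 9 + [(10, 20, 200)] * 1,
    "sea": [(10, 20, 200)] * 4 + [(90, 90, 60)] * 6,
}

matches = find_blue_images(IMAGES)
print(matches)
```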

Multimedia Information Retrieval

• An example MIR system: www.hermitagemuseum.org/html_En/index.html
• A QBIC Layout Search demo that gives a step-by-step demonstration of the search described in the text can be found at: www.hermitagemuseum.org/fcgi-bin/db2www/qbicLayout.mac/qbic?selLang=English

Multimedia Information Retrieval

• As multimedia becomes a more extensively used data format, it is vital to deal with the issues of:
  – metadata standards
  – classification
  – query matching
  – presentation
  – evaluation
• Dealing with these issues is needed to guarantee the development and deployment of efficient and effective multimedia information retrieval systems