Search Engine using Web Mining
COMS E6125.001
Web Enhanced Information Mgmt
Prof. Gail Kaiser
Presented By:
Rupal Shah (UNI: rrs2146)
Web Mining
Data mining efforts associated with the Web are known as Web
Mining. Web Usage Mining, one branch of it, is the process of
applying data mining techniques to the discovery of usage
patterns from Web data.
Classification of Web Mining
• Content Mining: the discovery of useful
information from Web content, including text, images, audio,
and video. Web content mining research includes resource
discovery from the Web, document categorization and
clustering, and information extraction from Web pages.
• Usage Mining: the discovery of usage patterns from Web
data, such as the logs that record how visitors browse a
site.
• Structure Mining: the analysis of Web link structure, which
has been widely used to infer important information about
Web pages and to understand the structure of the Web as a
whole. Citations (linkages) among Web pages are usually
indicators of high relevance or good quality. The term
in-links refers to the hyperlinks pointing to a page, and the
term out-links refers to the hyperlinks found in a page.
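
As a small illustration of in-links and out-links, the sketch below counts both for a toy link graph. The page names and the `links` dictionary are hypothetical and not taken from any real site; a real system would build this graph from crawled pages.

```python
from collections import defaultdict

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "home.html": ["products.html", "about.html"],
    "products.html": ["home.html", "cart.html"],
    "about.html": ["home.html"],
    "cart.html": [],
}

def count_links(graph):
    """Return (out_links, in_links) count dictionaries for the graph."""
    # Out-links: hyperlinks found in a page.
    out_links = {page: len(targets) for page, targets in graph.items()}
    # In-links: hyperlinks pointing to a page.
    in_links = defaultdict(int)
    for page in graph:
        in_links[page] += 0  # make sure pages with no in-links still appear
    for targets in graph.values():
        for target in targets:
            in_links[target] += 1
    return out_links, dict(in_links)

out_links, in_links = count_links(links)
```

Here `home.html` receives the most in-links, which under the citation view above would mark it as the page the others treat as most relevant.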
Data Source
The usage data collected at different sources represents the
navigation patterns of different segments of the overall Web
traffic, ranging from single-user, single-site browsing behavior
to multi-user, multi-site access patterns.

• Server Level Collection
• Client Level Collection
• Proxy Level Collection
Server Level Collection

A Web server log is an important source for performing Web Usage
Mining because it explicitly records the browsing behavior of site
visitors.

The data recorded in server logs reflects the access of a Web site by
multiple users. These logs can be stored in various formats such as
Common log or Extended log formats.
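
To make the log format concrete, here is a minimal sketch that parses one line in the Common Log Format (host, identity, user, timestamp, request line, status, size). The regular expression, the function name `parse_clf_line`, and the sample line are illustrative assumptions; production logs often carry extra fields (as in the Extended format) that this pattern ignores.

```python
import re

# One Common Log Format entry:
# host ident authuser [timestamp] "request line" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Made-up sample line for illustration only.
sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
```

Each parsed entry gives the requesting host, the time, and the requested page, which are the raw inputs for usage mining.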

Tracking individual users is not an easy task because of the
stateless connection model of HTTP. Cookies are tokens generated
by the Web server for individual client browsers so that site
visitors can be tracked automatically.
Contd…

Cached page views are not recorded in a server log. In
addition, any important information passed through the POST
method will not be available in a server log.
Client Level Collection

Client-level collection can be implemented by using a remote
agent (such as JavaScript or Java applets) or by modifying the
source code of an existing browser (such as Mosaic or Mozilla)
to enhance its data collection capabilities.

The implementation of client-side data collection methods
requires user cooperation, either by enabling JavaScript and
Java applets or by voluntarily using the modified browser.
Proxy Level Collection

A Web proxy acts as an intermediate level of caching between
client browsers and Web servers. Proxy caching can be used to
reduce the loading time of a Web page experienced by users
as well as the network traffic load at the server and client sides.

Proxy traces may reveal the actual HTTP requests from
multiple clients to multiple Web servers. This may serve as a
data source for characterizing the browsing behavior of a group
of anonymous users sharing a common proxy server.
Pattern Discovery

Sequential pattern discovery finds inter-transaction patterns
in which the presence of a set of items is followed by another
item in the timestamp-ordered transaction set. In Web server
transaction logs, a visit by a client is recorded over a
period of time.

The discovery of sequential patterns in Web server access logs
allows Web-based organizations to predict user visit patterns
and helps in targeting advertising at groups of users based on
these patterns. By analyzing this information, the Web mining
system can determine temporal relationships.
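
A simplified sketch of sequential pattern discovery: it counts ordered page pairs (page A visited before page B in the same visit) and keeps those whose support meets a threshold. Restricting patterns to pairs, the `sessions` data, and the `min_support` value are assumptions made for illustration; full sequential pattern mining algorithms handle longer sequences and item sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical visits, each a timestamp-ordered list of pages.
sessions = [
    ["home", "products", "cart"],
    ["home", "about"],
    ["home", "products", "checkout"],
    ["products", "cart"],
]

def sequential_pairs(sessions, min_support):
    """Return {(a, b): support} for pairs where a precedes b in a visit."""
    counts = Counter()
    for visit in sessions:
        # combinations() keeps input order, so (a, b) means a came before b.
        seen = {(a, b) for a, b in combinations(visit, 2) if a != b}
        counts.update(seen)  # count each pair at most once per visit
    total = len(sessions)
    return {
        pair: n / total
        for pair, n in counts.items()
        if n / total >= min_support
    }

patterns = sequential_pairs(sessions, min_support=0.5)
```

With this toy data, only "home then products" and "products then cart" occur in at least half of the visits.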
Pattern Analysis

Pattern analysis filters out uninteresting rules or patterns
from the set found in the pattern discovery phase. The exact
analysis methodology is usually governed by the application for
which Web mining is done.

The most common form of pattern analysis consists of a
knowledge query mechanism such as SQL.

Content and structure information can be used to filter out
patterns containing pages of a certain usage type, content type,
or pages that match a certain hyperlink structure.
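
The sketch below shows how discovered patterns might be filtered with SQL, using Python's built-in `sqlite3` module. The `patterns` table, its columns, the rows, and the thresholds are all hypothetical; they only illustrate the idea of a knowledge query that filters by support and content type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patterns ("
    "antecedent TEXT, consequent TEXT, support REAL, content_type TEXT)"
)

# Hypothetical output of a pattern discovery phase.
rows = [
    ("home", "products", 0.50, "catalog"),
    ("products", "cart", 0.50, "transaction"),
    ("home", "about", 0.25, "info"),
]
conn.executemany("INSERT INTO patterns VALUES (?, ?, ?, ?)", rows)

# Keep patterns that are frequent enough and not of an excluded content type.
interesting = conn.execute(
    "SELECT antecedent, consequent, support FROM patterns "
    "WHERE support >= ? AND content_type <> ? "
    "ORDER BY antecedent",
    (0.4, "info"),
).fetchall()
conn.close()
```

The query drops the low-support "home then about" pattern and returns the other two.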
Application of Web Mining

• Counter-Terrorism
• E-Commerce
• Security Threat, and many more
Future Scope of Web Mining


A persistent difficulty in Web mining research has been
creating suitable test collections that can be reused by
researchers. A test collection is important because it allows
researchers to compare different algorithms using a standard
test-bed under the same conditions, without being affected by
such factors as Web page changes or network traffic variations.
Although textual documents are comparatively easy to index,
retrieve, and analyze, operations on multimedia files are much
more difficult to perform; and with multimedia content on the
Web growing rapidly, Web mining has become a challenging
problem. Various machine-learning techniques have been
employed to address this issue. Predictably, research in pattern
recognition and image analysis has been adapted for study of
multimedia documents on the Web.
Conclusion

As the Web and its usage continue to grow, so does the
opportunity to analyze Web data and extract all manner of
useful knowledge from it.

Web Mining is still in its initial stage and should continue to
develop as the Web evolves. One future research direction for
Web Mining is multimedia data mining. In addition to textual
documents such as HTML, MS Word, PDF, and plain text files, the
Web contains a large number of multimedia documents, such as
images, audio, and video.
Thank You