Slide Source for NRC meeting [James Wang] (.ppt)
Advanced Techniques for
Automatic Web Filtering
James Z. Wang
PNC Tech. Career Dev. Professor
Penn State University
Joint Work: Jia Li, Assist. Prof., Penn State Statistics
Gio Wiederhold, Prof., Stanford Computer Science
http://wang.ist.psu.edu
7/26/2016
J. Z. Wang, Penn State University
1
Outline
The problem
Related approaches
Filtering based on image content
Goals and methods
The WIPE system
Experimental results
Website classification by image content
Conclusions and future work
The Size and Content of the Web
02/99: ~16 million total web servers
Estimated total number of pages on the web: ~800 million
15 Terabytes of text (comparable to text of Library of Congress)
Year 2001: 3 to 5 billion pages
Lawrence, Giles, Nature, 1999.
Pornography-free Websites
E.g. Yahoo!Kids, disney.com
Useful in protecting those children too young
to know how to use the Web browser
It is difficult to control access to other sites
Text-based Filtering
E.g. NetNanny, Cyber Patrol, CyberSitter
Methods:
Store more than 10,000 IPs
Blocking based on keywords
Block all image access
Problems:
Internet is dynamic
Keywords are not enough (e.g. text incorporated
in images)
Images are needed for all net users
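The methods and problems listed above can be illustrated with a minimal sketch of a text-based filter. This is not the logic of any product named on the slide; the blocklist entries, keywords, and the function name `is_blocked` are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical blocklist entries and keywords, for illustration only.
BLOCKED_HOSTS = {"203.0.113.7", "blocked.example"}
BLOCKED_KEYWORDS = {"xxx", "adult"}


def is_blocked(url: str, page_text: str) -> bool:
    """Return True if the host is on the blocklist or the text contains a keyword."""
    host = urlparse(url).hostname or ""
    if host in BLOCKED_HOSTS:
        return True
    words = set(page_text.lower().split())
    return not BLOCKED_KEYWORDS.isdisjoint(words)


# A listed address is blocked regardless of the page text.
print(is_blocked("http://203.0.113.7/index.html", "welcome"))
# An unlisted site is caught only if its visible text contains a keyword.
print(is_blocked("http://new-site.example/", "adult content here"))
# Text rendered inside an image never reaches the keyword check.
print(is_blocked("http://new-site.example/", "<img src='banner.gif'>"))
```

The third call shows the weakness noted on the slide: a new site whose text is incorporated in images passes both checks.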
Classification of Web Community
Flake, Lawrence, Giles, ACM KDD, 2000
Graph clustering based on max flow – min cut
analysis of the Web connectedness
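As a rough illustration of the max flow / min cut idea, the sketch below runs a breadth-first augmenting-path max flow on a tiny link graph and returns the nodes left on the source side of the minimum cut. It is a simplification under stated assumptions (one seed page as source, one sink, unit capacity per link), not the procedure of the cited paper; the page names and the function `min_cut_community` are hypothetical.

```python
from collections import defaultdict, deque


def min_cut_community(edges, source, sink):
    """Return (max_flow, source-side node set) for unit-capacity undirected links."""
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        cap[u][v] += 1
        cap[v][u] += 1

    def bfs_parents():
        # Breadth-first search over edges that still have residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs_parents()
        if sink not in parent:
            # No augmenting path left: reachable nodes form the source side of the cut.
            return flow, set(parent)
        # Find the bottleneck capacity along the path, then push flow along it.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


# Two tightly linked groups of pages joined by a single link c-x.
links = [("a", "b"), ("a", "c"), ("b", "c"),
         ("x", "y"), ("x", "z"), ("y", "z"),
         ("c", "x")]
flow, community = min_cut_community(links, source="a", sink="z")
print(flow, sorted(community))
```

In this toy graph the only link between the two groups is the cut, so the community around page `a` is the first group.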
Goals and Methods
The problem comes from images, so we deal
with the images directly
Goals: use machine learning and image
retrieval to classify Web images and Websites
Requirements: high accuracy and high speed
Challenges: non-uniform image background,
textual noise in foreground, wide range of
image quality, wide range of camera
positions, wide range of composition…
The WIPE System
Inspired by UC Berkeley’s FNP System
Detailed analysis of images
Skin filter and human figure grouper
Speed: 6 mins CPU time per image
Accuracy: 52% sensitivity and 96% specificity
Stanford WIPE System
Wavelet-based feature extraction + image
classification + integrated region matching +
machine learning
Speed: < 1 second CPU time per image
Accuracy: 96% sensitivity and 91% specificity
System Flow
[Flow diagram] Original Web Image -> Feature Extraction (color, texture, shape) -> Type Classification -> (photograph) -> Photo Classification -> Result: REJECT or PASS
Diagram also includes: Training Features
Wavelet Principle
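This slide was a figure. As a hedged illustration of the general principle only, the sketch below applies one level of a Haar transform to a row of pixel values. The slides do not say which wavelet the system uses, so the Haar wavelet is an assumption made here for simplicity, and the sample rows are hypothetical.

```python
def haar_step(signal):
    """One level of the unnormalized Haar transform: pairwise averages and differences."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


# Hypothetical pixel rows: one smooth, one with sharp alternating edges.
smooth_row = [10, 10, 12, 12, 14, 14, 16, 16]
edgy_row = [0, 255, 0, 255, 255, 0, 255, 0]

print(haar_step(smooth_row))
print(haar_step(edgy_row))
```

The averages carry a coarse version of the row, while the detail coefficients are near zero where the row is smooth and large where it changes abruptly, which is what makes them usable as texture and edge features.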
Type Classification
Graphs: Manually-generated images with smooth tones.
Type Classification
Photographs: Images with continuous tones.
Photo Classification
Content-based image retrieval
+ statistical classification
Experimental Results
Tested on a set of over 10,000 photographic
images
Speed: Less than one second of response
time on a Pentium III PC
Accuracy:
Type of Images    Test + (Rejected)    Test - (Passed)
Objectionable     96%                  4%
Benign            9%                   91%
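The table maps onto the sensitivity and specificity figures quoted for the system. The sketch below shows the arithmetic; the raw counts are hypothetical (1,000 images per class, chosen only so that they match the reported percentages), since the slides give percentages rather than counts.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity = rejected objectionable / all objectionable;
    specificity = passed benign / all benign."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts consistent with the table's percentages.
sens, spec = sensitivity_specificity(tp=960, fn=40, fp=90, tn=910)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```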
Comment on Accuracy
The algorithm can be adjusted to trade off
specificity for higher sensitivity
In a real-world filtering application system,
both the sensitivity and the specificity are
expected to be higher
Icons and graphs can be classified with almost
100% accuracy, which raises specificity
Combining text and image classification raises
sensitivity and speed
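The trade-off mentioned at the top of this slide can be sketched with a single decision threshold on a classifier score. The scores below are made up for illustration, and the assumption that the classifier exposes one scalar score per image is mine, not something the slides state.

```python
def rates_at_threshold(objectionable_scores, benign_scores, threshold):
    """Reject an image when its score is >= threshold; return (sensitivity, specificity)."""
    rejected_objectionable = sum(s >= threshold for s in objectionable_scores)
    passed_benign = sum(s < threshold for s in benign_scores)
    return (rejected_objectionable / len(objectionable_scores),
            passed_benign / len(benign_scores))


# Hypothetical classifier scores (higher means "more likely objectionable").
objectionable = [0.95, 0.90, 0.80, 0.70, 0.55, 0.40]
benign = [0.60, 0.45, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.75, 0.50, 0.35):
    sens, spec = rates_at_threshold(objectionable, benign, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Lowering the threshold rejects more images, so sensitivity rises while specificity falls.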
False Classifications
Benign Images
Partially obscured human
Areas with similar features
Painting, fine-art
Partially undressed human
Animals (w/o clothes)
False Classifications
Objectionable Images
Partially dressed
Dressed but objectionable
Undressed area too small
Dark, low contrast
Frame and text noise
Website Classification
by Image Content
An objectionable site will have many such images
For a given objectionable Website, let p denote the
probability that an image on the Website is
objectionable
p is the fraction of objectionable images among all
images provided by the site
We assume a distribution of p over all Websites
(e.g., Gaussian, shifted Gaussian)
Classification levels could be provided as a service
to filtering software producers
Flow in Website classification
Website Classification
Based on statistical analysis (see paper), we
know we can expect higher than 97%
accuracy on Website classification if
We download 20-35 images for each site
We classify a Website as objectionable if 20-25%
of downloaded images are objectionable
Using text and IP addresses as criteria, the
accuracy can be further improved
skip IPs for museums, dog-shows, beach towns,
sport events
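The decision rule above can be explored with a simple binomial model. This is a sketch under strong assumptions, not the statistical analysis in the paper, and it is not expected to reproduce the 97% figure: images are treated as independent, the per-image rates are taken from the experimental results (96% of objectionable and 9% of benign images rejected), benign sites are assumed to contain no objectionable images, and p = 0.5 for the objectionable site is a hypothetical value.

```python
from math import comb


def prob_site_flagged(n_images, min_flagged, p, sensitivity=0.96, false_positive=0.09):
    """Probability that at least min_flagged of n_images are rejected,
    where p is the fraction of objectionable images on the site."""
    # Chance that a single downloaded image is rejected by the image classifier.
    q = p * sensitivity + (1 - p) * false_positive
    return sum(comb(n_images, k) * q**k * (1 - q)**(n_images - k)
               for k in range(min_flagged, n_images + 1))


# 30 downloaded images; flag the site if at least 6 (20%) are rejected.
detect = prob_site_flagged(n_images=30, min_flagged=6, p=0.5)   # hypothetical objectionable site
false_alarm = prob_site_flagged(n_images=30, min_flagged=6, p=0.0)  # assumed benign site
print(f"objectionable site flagged: {detect:.4f}")
print(f"benign site flagged: {false_alarm:.4f}")
```

Under these assumptions, raising `min_flagged` lowers the false-alarm rate on benign sites at the cost of missing objectionable sites with a small p, which is the balance the 20-25% rule is meant to strike.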
Conclusions and Future Work
Perfect filtering is never possible
Effective filtering based on image content is
feasible with the current technology
Systems that combine content-based filtering
with text-based criteria will have good
accuracy and acceptable speed
Objectionable websites are automatically
identifiable, a service for the community?
The technology can still be improved through
further research.
References
http://WWW-DB.Stanford.EDU/IMAGE (papers)
http://wang.ist.psu.edu
... /cgi-bin/zwang/wipe2_show.cgi (demo)
http://www-db.stanford.edu
... /pub/gio/inprogress.html#COPA (testimony)
[email protected] (James Wang)
[email protected] (Gio Wiederhold)
[email protected] (Michel Bilello)