Slide Source for NRC meeting [James Wang] (.ppt)

Advanced Techniques for
Automatic Web Filtering
James Z. Wang
PNC Tech. Career Dev. Professor
Penn State University
Joint Work: Jia Li, Assist. Prof., Penn State Statistics
Gio Wiederhold, Prof., Stanford Computer Science
http://wang.ist.psu.edu
7/26/2016
J. Z. Wang, Penn State University
1

Outline
The problem
Related approaches
Filtering based on image content
  Goals and methods
  The WIPE system
  Experimental results
  Website classification by image content
Conclusions and future work

The Size and Content of the Web
02/99: ~16 million total web servers
Estimated total number of pages on the web: ~800 million
15 Terabytes of text (comparable to text of Library of Congress)
Year 2001: 3 to 5 billion pages
Lawrence, Giles, Nature, 1999.

Pornography-free Websites
E.g. Yahoo!Kids, disney.com
Useful in protecting those children too young to know how to use the Web browser
It is difficult to control access to other sites

Text-based Filtering
E.g. NetNanny, Cyber Patrol, CyberSitter
Methods:
  Store more than 10,000 IPs
  Blocking based on keywords
  Block all image access
Problems:
  Internet is dynamic
  Keywords are not enough (e.g. text incorporated in images)
  Images are needed for all net users
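
The IP and keyword methods listed above can be sketched in a few lines. This is a minimal illustration only, not the logic of NetNanny, Cyber Patrol, or CyberSitter; the blocklist entries, the keyword set, and the function name `should_block` are all hypothetical.

```python
import re

# Hypothetical blocklist and keyword set, for illustration only.
BLOCKED_IPS = {"203.0.113.7", "198.51.100.23"}
BLOCKED_KEYWORDS = {"xxx", "adult"}


def should_block(ip: str, page_text: str) -> bool:
    """Return True if the page is blocked by IP or by a keyword match."""
    if ip in BLOCKED_IPS:
        return True
    # Split the page text into lowercase words and look for any keyword.
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return not words.isdisjoint(BLOCKED_KEYWORDS)


print(should_block("203.0.113.7", "welcome"))          # blocked by IP
print(should_block("192.0.2.1", "adult content"))      # blocked by keyword
print(should_block("192.0.2.1", "<img src='a.jpg'>"))  # not blocked
```

The last call shows the weakness named under Problems: when the objectionable text is rendered inside an image, the page text contains no matching keyword and the page passes.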

Classification of Web Community
Flake, Lawrence, Giles, ACM KDD, 2000
  Graph clustering based on max flow – min cut analysis of the Web connectedness
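
To make the max flow – min cut idea concrete, here is a toy sketch. It is not the full procedure of Flake, Lawrence, and Giles: it assumes unit capacity on every link, a hand-picked source and sink page, and a plain Edmonds-Karp max-flow. The function name `min_cut_community` and the example graph are hypothetical.

```python
from collections import defaultdict, deque


def min_cut_community(edges, source, sink):
    """Return (max_flow, nodes on the source side of a minimum cut).

    edges: iterable of (u, v) undirected links, each given capacity 1.
    """
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        cap[u][v] += 1
        cap[v][u] += 1

    def bfs_parents():
        """Breadth-first search over edges that still have capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs_parents()
        if sink not in parent:
            # No augmenting path left: the reachable nodes are the
            # source side of a minimum cut.
            return flow, set(parent)
        # Walk back from the sink to collect the augmenting path.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck


# Two tightly linked triangles joined by a single link c-x.
links = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"),
         ("c", "x")]
flow, community = min_cut_community(links, "a", "z")
print(flow, sorted(community))
```

In this example the only link between the two triangles is the bottleneck, so the cut separates pages a, b, c from x, y, z.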

Goals and Methods
The problem comes from images, so we deal with images
Goals: use machine learning and image retrieval to classify Web images and Websites
Requirements: high accuracy and high speed
Challenges: non-uniform image background, textual noise in foreground, wide range of image quality, wide range of camera positions, wide range of composition…

The WIPE System
Inspired by UC Berkeley's FNP System
  Detailed analysis of images
  Skin filter and human figure grouper
  Speed: 6 mins CPU time per image
  Accuracy: 52% sensitivity and 96% specificity
Stanford WIPE System
  Wavelet-based feature extraction + image classification + integrated region matching + machine learning
  Speed: < 1 second CPU time per image
  Accuracy: 96% sensitivity and 91% specificity

System Flow
Original Web Image
Feature Extraction (color, texture, shape)
Type Classification
photograph
Photo Classification
Result: REJECT or PASS
Training Features
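
The flow above can be read as a two-stage decision. The sketch below only shows the control flow; it is not the WIPE implementation. The three callables are hypothetical stand-ins for the real feature extractor and classifiers, and passing non-photographic images without further checks is an assumption, since the slide only labels the photograph branch.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def classify_image(
    image: object,
    extract_features: Callable[[object], Features],
    is_photograph: Callable[[Features], bool],
    is_objectionable: Callable[[Features], bool],
) -> str:
    """Features -> type classification -> photo classification."""
    features = extract_features(image)
    if not is_photograph(features):
        # Assumption: non-photographic images (graphs, icons) are passed.
        return "PASS"
    return "REJECT" if is_objectionable(features) else "PASS"


# Toy stand-ins (hypothetical): a one-number "feature" and fixed cutoffs.
result = classify_image(
    image=0.9,
    extract_features=lambda img: [img],
    is_photograph=lambda f: True,
    is_objectionable=lambda f: f[0] > 0.5,
)
print(result)
```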

Wavelet Principle
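
The figure for this slide is not reproduced in the transcript. As a minimal sketch of the principle, the code below applies one level of the Haar wavelet transform, the simplest wavelet, in its unnormalized averaging form. This is a swapped-in illustration: the slides do not say which wavelet WIPE uses, and the function names are hypothetical. Inputs are assumed to have even length in each dimension.

```python
def haar_1d(signal):
    """One Haar level: pairwise averages followed by pairwise differences."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages + details


def haar_2d(image):
    """One level of the 2-D transform: rows first, then columns."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    # Transpose back so the result is indexed [row][column] again.
    return [list(row) for row in zip(*cols)]


# A 4x4 image made of four constant 2x2 blocks.
image = [
    [8, 8, 2, 2],
    [8, 8, 2, 2],
    [4, 4, 6, 6],
    [4, 4, 6, 6],
]
for row in haar_2d(image):
    print(row)
```

The top-left quadrant of the result is a half-resolution approximation of the image, and the other three quadrants hold the detail coefficients. Here every 2x2 block is constant, so all detail coefficients are zero; edges and texture would show up as nonzero details, which is what makes such coefficients useful as features.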

Type Classification
Graphs: Manually generated images with smooth tones.
Photographs: Images with continuous tones.

Photo Classification
Content-based image retrieval + statistical classification

Experimental Results
Tested on a set of over 10,000 photographic images
Speed: Less than one second of response time on a Pentium III PC
Accuracy
  Type of Images   Test + (Rejected)   Test – (Passed)
  Objectionable    96%                 4%
  Benign           9%                  91%
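
The sensitivity and specificity quoted for WIPE follow directly from the table above. The sketch below recomputes them from the table's percentages, treated as counts per 100 images. It also shows how the share of rejected images that are truly objectionable depends on how common such images are; the 10% prevalence used there is a hypothetical value, not a figure from the slides.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def precision(sensitivity, specificity, prevalence):
    """Fraction of rejected images that are truly objectionable."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Table rows as counts per 100 images of each type.
sens, spec = sensitivity_specificity(tp=96, fn=4, fp=9, tn=91)
print(sens, spec)

# Hypothetical prevalence: 10% of the images seen are objectionable.
print(precision(sens, spec, prevalence=0.10))
```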

Comment on Accuracy
The algorithm can be adjusted to trade off specificity for higher sensitivity
In a real-world filtering application system, both the sensitivity and the specificity are expected to be higher
  Icons and graphs can be classified with almost 100% accuracy → higher specificity
  Combine text and image classification → higher sensitivity and higher speed
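
One common way such a trade-off arises is through a decision threshold on a classifier score. The slides do not say how WIPE is adjusted, so the sketch below is an assumption, and the scores are made up purely for illustration.

```python
def rates_at_threshold(scores_objectionable, scores_benign, threshold):
    """Reject an image when its score is at or above the threshold."""
    sens = sum(s >= threshold for s in scores_objectionable) / len(scores_objectionable)
    spec = sum(s < threshold for s in scores_benign) / len(scores_benign)
    return sens, spec


# Hypothetical classifier scores, not measured data.
objectionable = [0.9, 0.8, 0.7, 0.55, 0.4]
benign = [0.6, 0.45, 0.3, 0.2, 0.1]

for t in (0.5, 0.35):
    print(t, rates_at_threshold(objectionable, benign, t))
```

Lowering the threshold rejects more images overall, so more objectionable images are caught (higher sensitivity) while more benign images are rejected too (lower specificity).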

False Classifications
Benign Images
Partially obscured human
Areas with similar features
Painting, fine-art
Partially undressed human
Animals (w/o clothes)

False Classifications
Objectionable Images
Partially dressed
Dressed but objectionable
Undressed area too small
Dark, low contrast
Frame and text noise

Website Classification by Image Content
An objectionable site will have many objectionable images
  For a given objectionable Website, let p denote the probability that an image on the Website is objectionable
  p is the fraction of objectionable images among all images provided by the site
  We assume some distribution of p over all Websites (e.g., Gaussian, shifted Gaussian)
Classification levels could be provided as a service to filtering software producers

Flow in Website classification

Website Classification
Based on statistical analysis (see paper), we know we can expect higher than 97% accuracy on Website classification if
  We download 20-35 images for each site
  We classify a Website as objectionable if 20-25% of downloaded images are objectionable
Using text and IP addresses as criteria, the accuracy can be further improved
  skip IPs for museums, dog-shows, beach towns, sport events
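
A simplified calculation shows why a few dozen images per site can be enough. This is a sketch under stated assumptions, not the analysis from the paper: images are assumed to be classified independently, the per-image rates are the 96% sensitivity and 91% specificity from the experimental results, and the site fraction p is fixed rather than drawn from a distribution. The function names and the choice of 30 images with a cutoff of 6 (20%) are illustrative values inside the ranges given above.

```python
from math import comb


def flag_rate(p, sensitivity=0.96, specificity=0.91):
    """Probability that one downloaded image is rejected, for a site
    where a fraction p of the images is objectionable."""
    return p * sensitivity + (1 - p) * (1 - specificity)


def prob_site_flagged(p, n_images=30, min_flagged=6):
    """P(at least min_flagged of n_images are rejected), binomial model."""
    q = flag_rate(p)
    return sum(
        comb(n_images, k) * q**k * (1 - q) ** (n_images - k)
        for k in range(min_flagged, n_images + 1)
    )


# A benign site (p = 0) is flagged only through per-image false positives.
print(prob_site_flagged(0.0))
# A site where half of the images are objectionable.
print(prob_site_flagged(0.5))
```

Under this model, the site-level decision is much more reliable than any single image decision, because isolated per-image errors rarely add up to the site threshold.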

Conclusions and Future Work
Perfect filtering is never possible
Effective filtering based on image content is feasible with the current technology
Systems that combine content-based filtering with text-based criteria will have good accuracy and acceptable speed
Objectionable websites are automatically identifiable: a service for the community?
The technology can still be improved through further research.

References
http://WWW-DB.Stanford.EDU/IMAGE (papers)
http://wang.ist.psu.edu
  ... /cgi-bin/zwang/wipe2_show.cgi (demo)
http://www-db.stanford.edu
  ... /pub/gio/inprogress.html#COPA (testimony)
[email protected] (James Wang)
[email protected] (Gio Wiederhold)
[email protected] (Michel Bilello)
