Slide Source for Stanford Security Workshop, March 2001 (.ppt)


A Tool for Implementing COPA+
(Child Online Protection Act)
James Z. Wang & Gio Wiederhold,
Penn State University. Inf.Sc. / Stanford University, CSD
Joint Work: Jia Li, Penn State Statistics
wang.ist.psu.edu / www-db.stanford.edu/IMAGE
www-db.stanford.edu/pub/gio/inprogress.html#COPA
7/26/2016
J. Z. Wang & Gio Wiederhold
1
Outline
The Issues: legal and community pressures
Current approaches to protect kids
Filtering based on image content
Goals and methods
The WIPE system
Experimental results
Website classification by image content
Conclusions and future work
Status of legal attempts to restrict
dissemination of porn to minors:
CDA: Communications Decency Act of 1996. Restricts transmission of
porn. Overturned by the Philadelphia district court as overly restrictive
of the rights of adults; decision upheld by the Supreme Court in 1997.
COPA: Child Online Protection Act of 1998. Fines ISPs for delivering
porn to minors. Again overturned by the Philadelphia district court as
overly restrictive of the rights of adults in its implementation; decision
upheld on appeal, now before the Supreme Court. NRC study.
CIPA: Children's Internet Protection Act, passed late 2000. Requires
schools and libraries to install filtering software on all
Internet-connected computers to screen out pornographic images as a
condition of receiving federal funding. The law goes into effect April
20, but a suit is again being brought in the Philadelphia court.
Regulations giving the specifics of how to comply are to be issued by the
Federal Communications Commission (http://www.fcc.gov) in late
March 2001.
The suits were/are filed by the ACLU and the ALA (American Library
Association). Other participants in the arguments include the porn
industry, religious and parental organizations, the FBI, and filtering
technology providers.
The Size and Content of the Web
02/99: ~16 million total web servers
Estimated total number of pages on the web: ~800 million
15 terabytes of text (comparable to the text of the Library of Congress)
Year 2001: 3 to 5 billion pages
[Lawrence, Giles, Nature, 1999]
Frequency of access and search for porn: #2, after music [Google]
Pornography-free Websites
E.g. Yahoo!Kids, disney.com
Useful in protecting children too young to know how to use a Web browser
It is difficult to control access to other sites
Filtering Software
E.g.: NetNanny, Cyber Patrol, CyberSitter
Methods:
Store a list of more than 10,000 blocked IPs
Block based on keywords
Block all image access
Problems:
The Internet is dynamic, especially porn sites
Keywords are not satisfactory:
Text can be hidden by incorporating it in images
Excessive filtering (Anne Sexton, cum laude, breast cancer)
Images are needed by all net users
Poor reputation, poor sales, no funds to improve
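The excessive-filtering problem listed above can be reproduced in a few lines. This is a minimal sketch that assumes a naive substring match against a tiny, hypothetical keyword list (`BLOCKED_KEYWORDS`); the matching rules of the commercial products named on this slide are not described here.

```python
# Hypothetical blocklist for illustration; real products used far larger lists.
BLOCKED_KEYWORDS = ["sex", "cum", "breast"]


def is_blocked(text, keywords=BLOCKED_KEYWORDS):
    """Return True if any keyword occurs anywhere in the text (naive substring match)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# The three benign examples named on the slide.
benign_phrases = [
    "Poems by Anne Sexton",
    "Graduated cum laude",
    "Breast cancer screening guidelines",
]

for phrase in benign_phrases:
    print(phrase, "->", "BLOCKED" if is_blocked(phrase) else "allowed")
```

All three phrases are blocked because the match ignores word boundaries and context, which is the weakness of keyword-only filtering.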
Image-based filtering
The problem comes from images!
Requirements: high accuracy and high speed
Challenges: non-uniform image background,
textual noise in foreground, wide range of
image quality, wide range of camera positions,
wide range of composition…
Our approach: rapid feature extraction, machine
learning of patterns, fast matching
Applications: classify Web images and Websites
The Wavelet Image Pornography Elimination (WIPE) System
Inspired by UC Berkeley's FNP System:
Detailed analysis of images
Skin filter and human figure grouper
Speed: 6 mins CPU time per image
Accuracy: 52% sensitivity and 96% specificity
Stanford WIPE (medical image analysis spinoff):
Wavelet-based feature extraction + image
classification + integrated region matching +
machine learning
Speed: < 1 second CPU time per image
Accuracy: 96% sensitivity and 91% specificity
System Flow
[Flow diagram. Training images: Feature Extraction (color, texture, shape) → Features from Training.
Source Web Image: Feature Extraction (color, texture, shape) → Type Classification (photograph or graph) → Photo Classification, using the features from training → Result: REJECT or PASS.]
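The flow on this slide can be sketched as a two-stage decision. This is a minimal sketch, not the WIPE implementation: `classify_type` and `classify_photo` are hypothetical callables standing in for the trained classifiers, and passing every image labelled as a graph is an assumption made here for illustration.

```python
def filter_image(image, classify_type, classify_photo):
    """Two-stage decision returning 'PASS' or 'REJECT'.

    classify_type(image)  -> 'graph' or 'photograph'   (hypothetical interface)
    classify_photo(image) -> True if judged objectionable (hypothetical interface)
    """
    if classify_type(image) == "graph":
        # Assumption: non-photographic images are treated as benign.
        return "PASS"
    return "REJECT" if classify_photo(image) else "PASS"


# Stand-in classifiers for demonstration only; they read labels from a dict.
def demo_type(image):
    return image["kind"]


def demo_photo(image):
    return image["objectionable"]


samples = [
    {"kind": "graph", "objectionable": False},
    {"kind": "photograph", "objectionable": False},
    {"kind": "photograph", "objectionable": True},
]
for sample in samples:
    print(sample["kind"], "->", filter_image(sample, demo_type, demo_photo))
```

Keeping the two classifiers as parameters mirrors the diagram, where type classification and photo classification are separate boxes.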
Wavelet Principle
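The wavelet used by WIPE is not named on this slide, so the sketch below takes the simplest one, the Haar wavelet, purely as an illustration of the principle: a signal is split into coarse averages and detail coefficients, and the details are non-zero only where the signal changes. The function name `haar_step` and the sample rows are assumptions made for the example.

```python
def haar_step(signal):
    """One level of the (unnormalized) Haar transform on an even-length sequence.

    Returns (averages, details): pairwise means and pairwise half-differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


flat_row = [100, 100, 100, 100]  # constant tone: all detail coefficients are zero
edge_row = [100, 100, 100, 20]   # sharp edge inside the last pair

print(haar_step(flat_row))  # ([100.0, 100.0], [0.0, 0.0])
print(haar_step(edge_row))  # ([100.0, 60.0], [0.0, 40.0])
```

Applying the same step again to the averages gives the next, coarser level; on an image the step is applied along rows and then along columns.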
Type Classification
Graphs: manually generated images with constant tones and sharp edges.
Type Classification
Photographs: images with continuous tones.
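One way to turn the graph/photograph distinction into a test is to look at how neighbouring pixels differ: flat regions and sharp edges give differences that are either zero or large, while continuous tones give many small non-zero differences. The sketch below is a toy heuristic under that assumption, not the classifier used in WIPE; the tolerance of 16 grey levels and the 0.3 threshold are hypothetical values chosen for the example.

```python
def classify_type(rows, gradual_limit=16, gradual_fraction_threshold=0.3):
    """Label a grayscale image (list of rows of 0-255 ints) as 'graph' or 'photograph'."""
    gradual = 0  # neighbouring pixels that differ slightly: a sign of continuous tone
    total = 0
    for row in rows:
        for left, right in zip(row, row[1:]):
            if 0 < abs(left - right) <= gradual_limit:
                gradual += 1
            total += 1
    if total == 0:
        raise ValueError("image needs at least two pixels per row")
    if gradual / total >= gradual_fraction_threshold:
        return "photograph"
    return "graph"


graph_like = [[0, 0, 0, 255, 255, 255]] * 4  # flat regions separated by one sharp edge
photo_like = [[10, 14, 19, 25, 30, 36]] * 4  # smooth gradient

print(classify_type(graph_like))
print(classify_type(photo_like))
```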
Photo Classification
Content-based image retrieval + statistical classification
Experimental Results
Tested on a set of over 10,000 photographic images (i.e., after type classification)
Speed: less than one second of response time on a Pentium III PC
Accuracy:
Type of Images    Test + (Rejected)    Test - (Passed)
Objectionable     96%                  4%
Benign            9%                   91%
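The two rows of the accuracy table correspond to sensitivity (objectionable images rejected) and specificity (benign images passed). The sketch below computes both from confusion counts; the counts are hypothetical, scaled to 100 images per class so that they reproduce the reported percentages, and are not the actual test-set counts.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of objectionable images that the test rejects."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of benign images that the test passes."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts per 100 images of each class, matching the table.
objectionable_rejected, objectionable_passed = 96, 4
benign_rejected, benign_passed = 9, 91

print(f"sensitivity = {sensitivity(objectionable_rejected, objectionable_passed):.2f}")
print(f"specificity = {specificity(benign_passed, benign_rejected):.2f}")
```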
Comment on Accuracy
The algorithm can be adjusted to trade off
specificity for higher sensitivity
In a real-world filtering application system,
both the sensitivity and the specificity are
expected to be higher:
Icons and graphs can be classified with almost
100% accuracy → higher specificity
Combine text and image classification → higher
sensitivity and higher speed
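The specificity argument is a weighted average: if some share of the benign images on the Web are icons and graphs that are passed almost always, the overall specificity lies between the photo-only figure and 100%. The sketch below works this out; `graph_fraction` is a hypothetical input (the slides give no value for it), and treating graphs as classified perfectly is an idealization of "almost 100%".

```python
def overall_specificity(graph_fraction, photo_specificity=0.91, graph_specificity=1.0):
    """Specificity over a mix of benign graphs and benign photographs."""
    if not 0.0 <= graph_fraction <= 1.0:
        raise ValueError("graph_fraction must be between 0 and 1")
    return (graph_fraction * graph_specificity
            + (1.0 - graph_fraction) * photo_specificity)


# Hypothetical shares of graphs among benign Web images.
for fraction in (0.0, 0.25, 0.5):
    print(f"graphs = {fraction:.0%}: specificity = {overall_specificity(fraction):.4f}")
```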
False Classifications
Benign Images
Partially obscured human
Areas with similar features
Painting, fine-art
Partially undressed human
Animals (w/o clothes)
False Classifications
Objectionable Images
Partially dressed
Dressed but objectionable
Undressed area too small
Dark, low contrast
Frame and text noise
Website Classification
by Image Content
An objectionable site will have many objectionable images
For a given objectionable Website, let p denote the
probability that an image on the Website is
objectionable
p is the fraction of objectionable images among all
images provided by the site
We assume a distribution of p over all Websites
(e.g., Gaussian, shifted Gaussian)
Classification levels could be provided as a service
to filtering software producers
Flow in Website classification
Website Classification
Based on statistical analysis (see paper), we
can expect higher than 97%
accuracy on Website classification if:
We download 20-35 images for each site
We classify a Website as objectionable if 20-25%
of the downloaded images are objectionable
Using text and IP addresses as criteria, the
accuracy can be further improved:
Skip IPs for museums, dog shows, beach towns,
sports events
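The site-level rule can be explored with a binomial model. This is a rough sketch, not the analysis from the paper: it assumes images are sampled independently, uses the per-image figures reported for WIPE (96% sensitivity, 91% specificity), and picks hypothetical settings inside the ranges on the slide (30 images, flag the site when at least 6 of them, i.e. 20%, are rejected). The values of p tried at the end are also hypothetical.

```python
from math import comb


def image_reject_probability(p, sensitivity=0.96, specificity=0.91):
    """Chance that one sampled image is rejected, given a fraction p of objectionable images."""
    return p * sensitivity + (1.0 - p) * (1.0 - specificity)


def site_flag_probability(p, n_images=30, min_rejected=6,
                          sensitivity=0.96, specificity=0.91):
    """Chance that at least min_rejected of n_images are rejected (binomial tail)."""
    q = image_reject_probability(p, sensitivity, specificity)
    return sum(
        comb(n_images, k) * q**k * (1.0 - q) ** (n_images - k)
        for k in range(min_rejected, n_images + 1)
    )


# p = 0.0 models a benign site, p = 0.5 a site where half the images are objectionable.
for p in (0.0, 0.5):
    print(f"p = {p:.1f}: P(site flagged) = {site_flag_probability(p):.4f}")
```

Raising `min_rejected` lowers the chance of flagging a benign site at the cost of missing sites with a small p, which is the trade-off behind the 20-25% range.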
Internet Top-Level Domain Proposal
.... .kids
Sites that are kid-safe, rated by an independent
organization (several candidates)
Supported by, among others, the porn industry
Danger: fake .kids sites
.... .xxx
Legitimate sites for adults, easy to filter out for kids
Potential loss of business for the porn industry (work, schools)
No candidate organization; consortium of filter companies
Fear of government interference and loss of freedom
No mechanism to force objectionable sites into .xxx
Rejected by ICANN, accepted by New.net (Idealab)
Conclusions and Future Work
Perfect filtering is never possible
Effective filtering based on image content is
feasible with current technology
Systems that combine content-based filtering
with text-based criteria will have good accuracy
and acceptable speed
Objectionable websites are automatically
identifiable; a service for the community?
These results were produced rapidly; they can
be improved through further research.
References
http://WWW-DB.Stanford.EDU/IMAGE (papers)
http://wang.ist.psu.edu/cgi-bin/zwang/wipe2_show.cgi (demo)
http://www-db.stanford.edu/pub/gio/inprogress.html#COPA (testimony)
[email protected] (James Wang)
[email protected] (Gio Wiederhold)
[email protected] (Michel Bilello)