CSIS Stylometry System – Use Cases and Feasibility Study
Gregory Shalhoub, Robin Simon, Jayendra Tailor, Ramesh Iyer, Dr. Sandra Westcott
Student-Faculty Research Day
May 7, 2010
Seidenberg School of Computer Science and Information Systems
Stylometry
• Discipline that determines authorship of literary works through the use of statistical analysis and machine learning
• Is about pattern recognition
Stylometry
• Feature sets used for literary works
  – Lexical
    • Word- or character-based
      – How terms or characters are used within a community
  – Syntax
    • Patterns used to form sentences
  – Structural
    • Layout of the text
  – Content-specific
    • Words that are important within a specific domain
• Has been used to determine authorship since the mid-1400s
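The four feature families above can be illustrated with a minimal Python sketch. The function name `extract_features`, the short `FUNCTION_WORDS` list, and the specific measures chosen for each family are assumptions made for illustration; the slides do not say which individual features the tested tools compute.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")

def extract_features(text, domain_terms=()):
    """Return a few example features from each family named on the slide."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)

    features = {
        # Lexical: word- and character-based measures
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocabulary_richness": len(set(words)) / n_words,
        # Syntax: rough proxies for how sentences are formed
        "commas_per_sentence": text.count(",") / n_sentences,
        "avg_sentence_length": len(words) / n_sentences,
        # Structural: layout of the text
        "paragraph_count": len(paragraphs),
        # Content-specific: rate of caller-supplied domain words
        "domain_term_rate": sum(words.count(t) for t in domain_terms) / n_words,
    }
    # Function-word rates are another common syntax-level proxy.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words
    return features

sample = "Hello team,\n\nThe server is down. Please restart the server, then email me.\n\nThanks"
features = extract_features(sample, domain_terms=("server",))
```

Each feature is normalized by text length where that makes sense, so that short and long samples can be compared on the same scale.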
The Project
• Part I
  – Search to determine interesting and unique applications of stylometry for research
• Part II
  – Feasibility study on existing tools/applications for email authorship (250 words or less)
Existing / Potential Uses of Stylometry
• Music Lyrics
• Music Melody
• Paintings
• Literary Works
• Forensic Linguistics
• Plagiarism
• Social Networking
• Electronic Mail
• Instant Messaging
- Social networking, electronic mail, and instant messaging are still in early stages of study
Use Cases
- Twitter
  - Used to verify existing Twitter accounts and help mitigate impersonations
- Electronic mail
  - Implemented in a corporate setting, helping identify anonymous emails meant to do harm
- Chat
  - Assist in determining authorship of instant messages
Use Cases
- Terrorism
  - Help identify an author of terrorist content or identify terrorist content by using contextual analysis
  - Applied to blogs, forums, wikis, email, chat and other forms of digital content
Tools Tested
- JGAAP (Java Graphical Authorship Attribution Program)
  - Java-based tool
  - Developed by Dr. Juola at Duquesne University
  - Runs on Windows and Linux
  - Identification tool
    - 1-of-n decision: many known email authors, trying to determine the author of one unknown email
    - One unknown email author compared to 99 known email authors
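A 1-of-n decision can be sketched as a nearest-neighbor search: the unknown email is assigned to the author of the closest known sample. The sketch below uses Levenshtein (edit) distance, which the testing results later in this deck name for JGAAP, applied to the sequence of word lengths. The pairing of that event set with that distance, and the helper names `word_lengths` and `identify`, are assumptions for illustration and do not reproduce JGAAP's internals.

```python
import re

def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def word_lengths(text):
    """Hypothetical event set: the sequence of word lengths in the text."""
    return [len(w) for w in re.findall(r"[A-Za-z']+", text)]

def identify(unknown_text, known_samples):
    """known_samples is a list of (author, text); return the nearest sample's author."""
    unknown_events = word_lengths(unknown_text)
    best_author, _ = min(
        known_samples,
        key=lambda sample: levenshtein(unknown_events, word_lengths(sample[1])),
    )
    return best_author

known = [
    ("alice", "I am on my way"),
    ("bob", "Certainly, tomorrow afternoon works perfectly"),
]
guess = identify("He is in no bar", known)
```

Raw edit distance favors samples of similar length, so a practical comparison would likely normalize for text size.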
Tools Tested
- C# Tool
  - Written in the C# programming language
  - Developed by prior Pace CS graduate students
  - Identification tool
    - 1-of-n decision: many known email authors, trying to determine the author of one unknown email
    - One unknown email author compared to 99 known email authors
Tools Tested
- Signature Tool
  - Written in the C programming language
  - Created by Peter Millican from Hertford College, Oxford
  - Authentication tool
    - Either match / no match
    - Match testing – 9 known and 1 unknown sample (same author)
    - No match – 10 known and 1 unknown (two different authors)
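An authentication decision differs from identification in that it answers only "match" or "no match" for one claimed author. A minimal sketch, assuming a word-length frequency profile and a fixed distance threshold, is shown below. The function names, the Manhattan distance, and the threshold value of 0.5 are all hypothetical choices for illustration; they are not taken from the Signature tool.

```python
import re
from collections import Counter

def word_length_profile(texts, max_len=12):
    """Relative frequency of word lengths 1..max_len (longer words share the last bin)."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[A-Za-z']+", text):
            counts[min(len(word), max_len)] += 1
    total = sum(counts.values()) or 1
    return [counts[n] / total for n in range(1, max_len + 1)]

def authenticate(known_texts, unknown_text, threshold=0.5):
    """Return 'match' when the unknown sample is close to the known author's profile.

    threshold is a hypothetical value that would need tuning on real data.
    """
    known = word_length_profile(known_texts)
    unknown = word_length_profile([unknown_text])
    # Manhattan distance between the two profiles; it ranges from 0 to 2.
    distance = sum(abs(k - u) for k, u in zip(known, unknown))
    return "match" if distance <= threshold else "no match"

same = authenticate(["the cat sat on the mat"], "the dog sat on the log")
different = authenticate(["the cat sat on the mat"],
                         "extraordinarily complicated terminology")
```

Lowering the threshold rejects more genuine samples, while raising it accepts more samples from other authors, so the value trades one kind of error against the other.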
Testing methodology
- Each team member submitted emails from different authors
- Total of 100 emails collected from 10 different authors
- Removed from native program and saved as text files
- Average size of email: 195.7 words
- Three (3) identification and authentication tools tested
- 100 tests run on each software tool
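One way to obtain 100 tests from 100 emails is a leave-one-out run, where each email in turn is treated as the unknown and the remaining 99 serve as known samples. The slides do not state that this exact protocol was used, so the sketch below is an assumption. The `most_shared_words` identifier is a toy stand-in for a real tool, and all names are hypothetical.

```python
def leave_one_out_accuracy(samples, identify):
    """samples is a list of (author, text); identify(unknown_text, known) returns an author."""
    correct = 0
    for i, (true_author, text) in enumerate(samples):
        known = samples[:i] + samples[i + 1:]  # every other sample acts as known
        if identify(text, known) == true_author:
            correct += 1
    return correct / len(samples)

def average_word_count(texts):
    """Mean number of whitespace-separated words per text."""
    return sum(len(t.split()) for t in texts) / len(texts)

def most_shared_words(unknown_text, known_samples):
    """Toy identifier: pick the author whose sample shares the most words."""
    unknown = set(unknown_text.lower().split())
    author, _ = max(
        known_samples,
        key=lambda s: len(unknown & set(s[1].lower().split())),
    )
    return author

toy_samples = [
    ("a", "red apple pie"),
    ("a", "red apple tart"),
    ("b", "blue sky today"),
    ("b", "blue sky tonight"),
]
accuracy = leave_one_out_accuracy(toy_samples, most_shared_words)
mean_size = average_word_count([text for _, text in toy_samples])
```

Passing the identifier in as a function lets the same loop score different tools on identical data.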
Testing Results
JGAAP (Levenshtein Distance algorithm)
Event set            Canonizers On   Canonizers Off
Words                50%             30%
Word Length          50%             30%
Characters           60%             40%
Syllables per Word   40%             30%
Word Bigrams         70%             60%

C# Tool Match Test
Accuracy: 57%

Signature Tool Match Test
Events        Match Accuracy   FRR
Word Length   53.33%           46.67%
Letters       46.67%           53.33%

Categorizing the result based on the country of the author
Tool        India    USA      India    USA
JGAAP       50%      100%     NA       NA
Signature   61.11%   75.00%   81.48%   83.33%
C# Tool     42%      80.00%   NA       NA

Signature Tool No-Match Test
Events        No-Match Accuracy   FAR
Word Length   53.33%              46.67%
Letters       82.22%              17.78%
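In the Signature results, each accuracy figure and its error rate sum to 100% (for example 53.33% + 46.67%, and 82.22% + 17.78%). That is consistent with reading FRR as the false rejection rate on same-author trials and FAR as the false acceptance rate on different-author trials, which is the usual meaning of these abbreviations; the slides do not expand them. A minimal sketch of that calculation follows, using made-up trial outcomes rather than the study's data.

```python
def error_rates(trials):
    """trials is a list of (same_author, tool_said_match) boolean pairs.

    Returns (match_accuracy, frr, no_match_accuracy, far) as fractions.
    """
    genuine = [said for same, said in trials if same]
    impostor = [said for same, said in trials if not same]
    frr = genuine.count(False) / len(genuine)    # same author, wrongly rejected
    far = impostor.count(True) / len(impostor)   # different author, wrongly accepted
    return 1 - frr, frr, 1 - far, far

# Hypothetical outcomes for illustration only, not the study's data.
toy_trials = [
    (True, True), (True, False), (True, True), (True, True),
    (False, False), (False, True),
]
match_acc, frr, no_match_acc, far = error_rates(toy_trials)
```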
Conclusion
- Overall, the moderate accuracy of the test results suggests that none of the tools evaluated is capable of accurate stylometric email author identification
- Categorizing email samples by country of origin seems to yield better accuracy for all three tools tested
Recommendations
- Further testing and research using email from authors of different countries
- Continue to refine and add to the stylistic feature set created by prior Pace graduate students
  - Emoticons
  - Font color
  - Font size
  - Embedded images
  - Hyperlinks
  - Internet 'slang' (e.g. LOL, TTYL)
- Further research on individuals who disguise their identity
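Three of the proposed features (emoticons, hyperlinks, and Internet slang) can be counted from plain text alone. The sketch below shows one possible approach. The regular expressions and the short `SLANG` list are assumptions for illustration and would miss many real-world variants. Font color, font size, and embedded images would require the email's HTML body and are not covered here.

```python
import re

# Emoticon: eyes, optional nose, mouth; must start a token and not run into a word.
EMOTICON = re.compile(r"(?<!\S)[:;=][-^']?[)(DPpOo/\\|](?!\w)")
HYPERLINK = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
SLANG = {"lol", "ttyl", "brb", "btw", "imo"}  # hypothetical starter list

def email_style_features(text):
    """Count emoticons, hyperlinks, and slang terms in a plain-text email."""
    links = HYPERLINK.findall(text)
    # Remove links first so that "://" is not mistaken for an emoticon
    # and URL fragments are not counted as words.
    stripped = HYPERLINK.sub(" ", text)
    words = re.findall(r"[A-Za-z]+", stripped.lower())
    return {
        "emoticons": len(EMOTICON.findall(stripped)),
        "hyperlinks": len(links),
        "slang_terms": sum(1 for w in words if w in SLANG),
    }

counts = email_style_features(
    "LOL that was great :) see http://example.com/a and www.pace.edu ttyl ;-)"
)
```

For emails of 250 words or less, raw counts like these are likely to be sparse, so they would probably work best alongside the existing feature set instead of replacing it.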