Special Topics in Data Mining Applications Focus on: Text Mining INFS-795

INFS-795 Spring 2005 -- GMU

General Info

• Instructor: Carlotta Domeniconi
  – Office: S&T2, Rm 449
  – Email: [email protected]
  – Phone: (703) 993-1697
• Office hours: Tue 4-6pm, or by appointment
• http://www.ise.gmu.edu/~carlotta/
• Visit the class webpage often!

Course Format

• Lectures by the instructor;
• One midterm;
• Paper presentations by students;
• One project:
  – Project proposal;
  – Project presentation;
  – Project paper.

Important Dates

• March 10: Project proposal due;
• March 24: Midterm Exam;
• March 31: Students’ presentations start;
• May 12: Paper on the project due.

Visit the class webpage often !!!

The final grade is based on…

• Midterm: 25%
• Paper presentation: 15%
• Project (proposal, presentation, paper): 50%
• Participation in class and quizzes on papers presented: 10%

Course Overview

Classification:
– Bayes decision theory
– Density estimation; Discriminant analysis
– Decision trees; Nearest neighbors
– Curse of dimensionality
– Dimensionality reduction:
  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
– Support Vector Machines

Course Overview

Clustering:
– Basics
– Distance measures
– K-means
– Subspace clustering

Course Overview

Text categorization:
– Document representation;
– Latent semantic indexing;
– Unsupervised and supervised feature selection;
– Feature weighting;
– Similarity measures;
– Semantic distances;
– Kernel methods;
– Detecting spam email.

Course Overview

• Presentation/discussion of papers
  – list of papers provided;
• Project proposals;
• Project presentations;
• Paper on the project.

We will study and learn…

• Fundamental principles and techniques in data mining / machine learning;
• Problems that arise in
  – Document classification
• Existing approaches in data mining to address these problems;
• Their limitations;
• Can we do better?

Some useful books

• On Pattern Classification:
  – R. O. Duda, P. E. Hart, D. G. Stork, “Pattern Classification”, Second Edition, Wiley, 2001.

• On Document Classification:
  – S. Chakrabarti, “Mining the Web: Discovering Knowledge from Hypertext Data”, Elsevier Science, 2003.
  – Thorsten Joachims, “Learning to Classify Text using Support Vector Machines”, Kluwer, 2002.

• On Text Retrieval:
  – M. Berry and M. Browne, “Understanding Search Engines: Mathematical Modeling and Text Retrieval”, SIAM, 1999.

• On Statistical Learning:
  – T. Hastie, R. Tibshirani, and J. Friedman, “The Elements of Statistical Learning: Data Mining, Inference and Prediction”, Springer, 2001. (Last Print!)