Pattern Analysis & Machine Intelligence Research Group


Pattern Analysis & Machine Intelligence
Research Group
UNIVERSITY OF WATERLOO
LORNET Theme 4
Data Mining and Knowledge Extraction for LO
Theme Leader: Mohamed Kamel
PIs: O. Basir, F. Karray, H. Tizhoosh
Assoc. PIs: A. Wong, C. DiMarco
Knowledge Extraction and LO Mining
GOAL:

Develop data mining and knowledge extraction
techniques and tools for learning object
repositories.

These tools can provide context and facilitate
interactions, efficient organization, efficient
delivery, navigation and retrieval.
PAMI Research Group, University of Waterloo
Theme Overview
Knowledge Extraction
From Text
- Syntactic: Keyword, Keyphrase-based
- Semantic: Concept-based
From Images
- Image Features, Shape Features
From Text + Images
- Describing Images with Text
- Enriching Text with Images
LO Mining
- Classification (MCS, Data Partitioning, Imbalanced Classes)
- LO Similarity and Ranking
- Clustering (Parallel/Distributed Clustering, Cluster Aggregation)
- Association Rules / Social Networks
- Reinforcement Learning
- Specialized / Personalized Search
Tagging and Organizing
Matching and Ranking
Types of Data in LORNET
[Diagram: data sources in LORNET and the types of data they hold]
LCMS: Courses, Modules, Lessons, Resources. Subject Matter: Text, Images, Flash, Applets, Metadata, Interaction Logs
Discussion Board: Boards, Threads, Posts. Discussions: Text, Interaction Logs
LOR: Records, LOs, Metadata
TELOS Semantic Layer: Resources, LO Descriptors (Metadata, Semantic References)
LO Mining Scenarios
Environment | Knowledge Extraction | Tagging / Organizing | Matching / Ranking
TELOS | Ontology Construction | Grouping Components | Finding & Ranking Components
E-Learning Design Environment (LMS) | Extracting LO Summary; Extracting LO Concepts; Extracting Image Description | Grouping LOs | Finding Similar LOs; Ranking LOs
Learning Object Content MS (LCMS) | Summarizing Documents; Extracting Concepts from Documents | Grouping Documents; Tagging Documents | Finding Similar Topics; Finding Similar Profiles; Building Social Networks; Detecting Plagiarism
LO Repository | Extracting Metadata; Extracting Ontologies | Classifying LOs; Building LO Clusters | Detecting Duplicate LOs; Ranking LOs; Metadata Matching
LO Mining and Knowledge Extraction
Applications / Services
- LO Automatic Tagging
- LO Grouping / Ranking
- LO Similarity
- LO Summarization
- LO Recommendation
Data Mining Algorithms
- Text Mining: Parsing, Tokenization, Keyword/phrase Extraction
- Semantic Analysis: NLP, Ontologies, Knowledge Rep.
- Categorization: Classification, Clustering
- Learning from Interactions: Reinforcement Learning, Multi-Agent Systems
- Image Mining: Feature Extraction, Shape Analysis, Indexing and Retrieval
Data Mining Foundations
- Math & Statistics: Vectors, Matrices, Statistics
- Data Representation: Features, Feature Types, Normalization, Discretization
- Data Structures: Arrays, Lists, Trees, Graphs
- Data Access: Data Sources, Data Readers/Writers, Data Converters
Projects Overview
Information Extraction: Analyzing content to extract relevant information
- Input: Text Document
- Keyword Extraction
- Summarization
- Concept Extraction
- Social Network Analysis
Categorization: Organizing LOs according to their content
- Input: Text Document
- Classification: Traditional, MCS, Imbalanced
- Clustering: Traditional, Ensembles, Distributed
Personalization: Providing user-specific results
- Input: Interaction Logs
- Reinforcement Learning: Traditional, Opposition-based
Image Mining: Describing and finding relevant images
- Input: Image
- CBIR: Traditional, Fusion-based
Integration and Applications
- Software Components
- Theme and Industry Collaboration
- Publications
Legend: In Progress
Information Extraction: Summarization
LO Content Package Summarization
- Learning objects stored in IMS content packages are loaded and parsed. Textual content files are extracted for analysis.
- Statistical term weighting and sentence ranking are performed on each document and on the whole collection.
- The top relevant sentences are extracted for each document.
- Planned functionality: summarization of whole modules or lessons (as opposed to single documents).
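The term weighting and sentence ranking steps above can be illustrated with a minimal sketch. It assumes a TF-IDF style weight (term frequency within the document, inverse document frequency over the collection) and a naive sentence splitter; the slides do not specify the actual weighting scheme, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def summarize(documents, top_n=1):
    """Return, for each document, its top_n highest-scoring sentences."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Collection-level weight: smoothed inverse document frequency.
    df = Counter(t for toks in doc_tokens for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    summaries = []
    for text, toks in zip(documents, doc_tokens):
        tf = Counter(toks)  # document-level term frequency

        def score(sentence):
            words = tokenize(sentence)
            if not words:
                return 0.0
            # Average term weight, so long sentences are not favoured by length.
            return sum(tf[w] * idf[w] for w in words) / len(words)

        sentences = split_sentences(text)
        chosen = set(sorted(sentences, key=score, reverse=True)[:top_n])
        # Keep the chosen sentences in their original order.
        summaries.append([s for s in sentences if s in chosen])
    return summaries


docs = ["Clustering clustering clustering. The end.", "The end."]
result = summarize(docs, top_n=1)
```

Averaging the weights per sentence is one design choice among several; summing them instead would favour longer sentences.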
Benefits
- Provide a summarized overview of learning objects for quick browsing and access to learning material.
Scenarios
- Learning Management Systems can call the summarization component to produce summaries for content packages.
Data courtesy of the University of Saskatchewan
Information Extraction: Concept Extraction
[Diagram: concept extraction pipeline. Input: Text, passed through a Sentence Separator. Language-dependent Natural Language Processing: POS Tagger, Syntax Parser, Semantic Parser / Semantic Role Labeler, producing Concepts. Language-independent processing: Text Pre-processor, Concept-based Statistical Analyzer (ctf: conceptual term frequency; tf: term frequency), Conceptual Ontological Graph (COG) Representation. Output: Concept-based Model.]

F-measure of Hierarchical Clustering
Dataset | Single-Term | Concept-based | Improvement
Reuters | 0.723 | 0.925 | +27.94%
ACM | 0.697 | 0.918 | +31.70%
Brown | 0.581 | 0.906 | +55.93%

Entropy of Hierarchical Clustering
Dataset | Single-Term | Concept-based | Improvement
Reuters | 0.251 | 0.012 | -95.21%
ACM | 0.317 | 0.043 | -86.43%
Brown | 0.385 | 0.018 | -95.32%

Conceptual Ontological Graph (COG) Ranking

Precision of Search
Dataset | Single-Term | Concept-based | Improvement
Cran | 0.536 | 0.901 | +68.09%
Reuters | 0.591 | 0.897 | +51.77%

Recall of Search Result
Dataset | Single-Term | Concept-based | Improvement
Cran | 0.486 | 0.827 | +70.16%
Reuters | 0.452 | 0.841 | +86.06%
Information Extraction: Keyword Extraction
Semantic Keyword Extraction
Tasks
- Developing tools and techniques to extract semantic keywords toward facilitating metadata generation
- Developing algorithms to enrich metadata (tags) which can be applied in index-based multimedia retrieval
Progress
- Proposed a new information theoretic inclusion index to measure the asymmetric dependency between terms (and concepts), which can be used in term selection (keyword extraction) and taxonomy extraction (pseudo ontology)
Makrehchi, M. and Kamel, M., ICDM 07, WI 07
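The idea of an asymmetric dependency between terms can be illustrated with a deliberately simple stand-in. The sketch below is not the information theoretic inclusion index from the cited work, whose definition is not given in these slides; it only uses a plain co-occurrence ratio (the fraction of documents containing one term that also contain the other) to show why the measure differs by direction. The function name and toy data are hypothetical.

```python
def inclusion(docs, a, b):
    """Fraction of documents containing term `a` that also contain term `b`.

    A simple co-occurrence ratio used as a stand-in (assumption); it is
    asymmetric: inclusion(docs, a, b) != inclusion(docs, b, a) in general.
    """
    with_a = [d for d in docs if a in d]
    if not with_a:
        return 0.0
    return sum(1 for d in with_a if b in d) / len(with_a)


# Toy collection: each document is represented by its set of terms.
docs = [
    {"clustering", "kmeans"},
    {"clustering", "hierarchical"},
    {"clustering", "kmeans", "ensemble"},
    {"clustering"},
]

# "kmeans" never occurs without "clustering", but not the other way round.
kmeans_in_clustering = inclusion(docs, "kmeans", "clustering")
clustering_in_kmeans = inclusion(docs, "clustering", "kmeans")
```

In a taxonomy extraction setting, a high value in one direction and a low value in the other would suggest that the second term is the broader one.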
Information Extraction: Keyword Extraction

Rule-based Keyword Extraction
Learn rules to find keywords in English sentences
Rules represent sentence fragments
- Specific enough for reliable keyword extraction
- General enough to be applied to unseen sentences
Rule generalization
- Begin with an exact sentence fragment
- Merge with another by moving different words to the lowest common level in the part-of-speech hierarchy
- Keep merged rule if it does not reduce precision and recall of keyword extraction; keep original rules otherwise
Keyword extraction
- Find sequence of rules that best cover an unseen sentence
- Extract keywords according to rules
Rule base size shows quick initial growth, followed by slow and irregular growth and rule elimination
- Learns 20 rules from the first 50 training rules
- Learns 13 additional rules from the next 220 training rules
Both precision and recall values increase during training
- Precision (blue) increases 10%
- Recall (red) shows slight upward trend
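The merge step of rule generalization described above can be sketched as follows. The tiny part-of-speech hierarchy, the rule representation (tuples of tokens) and the function names are all hypothetical; the slides do not give the actual hierarchy or data structures. The acceptance test on precision and recall is left out, since it needs a labelled training set.

```python
# Hypothetical part-of-speech hierarchy: child -> parent (None marks the root).
POS_PARENT = {
    "cat": "NOUN", "dog": "NOUN",
    "runs": "VERB", "sleeps": "VERB",
    "the": "DET", "a": "DET",
    "NOUN": "WORD", "VERB": "WORD", "DET": "WORD",
    "WORD": None,
}


def ancestors(node):
    """Return the chain from a node up to the root, starting with the node."""
    chain = []
    while node is not None:
        chain.append(node)
        node = POS_PARENT.get(node)
    return chain


def lowest_common(a, b):
    """Lowest node in the hierarchy that covers both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return "WORD"  # fall back to the root for tokens outside the hierarchy


def merge_rules(rule_a, rule_b):
    """Merge two equal-length fragments; differing positions are generalized."""
    if len(rule_a) != len(rule_b):
        return None
    return tuple(
        x if x == y else lowest_common(x, y) for x, y in zip(rule_a, rule_b)
    )


merged = merge_rules(("the", "cat", "runs"), ("the", "dog", "runs"))
```

In the full procedure the merged rule would replace the two originals only if keyword extraction precision and recall do not drop.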
Categorization: Ensemble-based Clustering

Consensus Clustering
- Categorization of learning objects using proposed consensus clustering algorithms.
- The goal of consensus clustering is to find a clustering of the data objects that optimally summarizes an ensemble of multiple clusterings.
- Consensus clustering can offer several advantages over a single data clustering, such as the improvement of clustering accuracy, enhancing the scalability of clustering algorithms to large volumes of data objects, and enhancing the robustness by reducing the sensitivity to outlier data objects or noisy attributes.
Tasks
- Development of techniques for producing ensembles of multiple data clusterings where diverse information about the structure of the data is likely to occur.
- Development of consensus algorithms to aggregate the individual clusterings.
- Development of solutions for the cluster symbolic-label matching problem.
- Empirical analysis on real-world data and validation of the proposed method.
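One well-known way to aggregate an ensemble of clusterings is a co-association (evidence accumulation) matrix, sketched below. This is a generic illustration and not necessarily the consensus algorithm proposed by the group; the threshold and function names are assumptions. Because it only asks whether two objects share a cluster, it sidesteps the symbolic-label matching problem mentioned above: cluster labels may be permuted between ensemble members.

```python
def co_association(clusterings):
    """co[i][j] = fraction of clusterings placing objects i and j together."""
    n = len(clusterings[0])
    m = len(clusterings)
    return [
        [sum(c[i] == c[j] for c in clusterings) / m for j in range(n)]
        for i in range(n)
    ]


def consensus(clusterings, threshold=0.5):
    """Link objects that co-occur in more than `threshold` of the clusterings,
    then return the connected components as consensus cluster labels."""
    co = co_association(clusterings)
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first traversal of the thresholded graph
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three clusterings of six objects; label names differ between members.
ensemble = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 1, 1, 1, 2],
]
consensus_labels = consensus(ensemble)
```

The third clustering disagrees with the other two about object 4, and the majority view wins.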
Categorization using cluster ensemble
Dataset | # samples | # attributes | # classes | K-means’ Mean Error Rate (%) | Ensemble’s Mean Error Rate (%)
Synthetic1 | 1000 | 8 | 5 | 17.41 | 0
Yahoo! (text) | 2340 | 1458 | 6 | 38.23 | 16.24
Texture (image) | 5500 | 40 | 11 | 37.99 | 11.54
Optical Digit Recognition | 500 | 64 | 10 | 27.31 | 16.40
Categorization: Distributed Clustering
Hierarchical P2P Document Clustering
- Peer nodes are arranged into groups called “neighborhoods”.
- Multiple neighborhoods are formed at each level of the hierarchy.
- The size of each neighborhood is determined through a network partitioning factor.
- Each neighborhood has a designated supernode.
- Supernodes of level h form the neighborhoods for level h+1.
- Clustering is done within neighborhood boundaries, then is merged up the hierarchy through the supernodes.
Benefits
- Significant speedup over centralized clustering and flat peer-to-peer clustering.
- Multiple levels of clusters.
- Distributed summarization of clusters using CorePhrase keyphrase extraction.
Scenarios
- Distributed knowledge discovery in hierarchical organizations.
[Figure: HP2PC Architecture. Levels h=0, h=1, h=2, ..., h=H-1, h=H (Root), showing neighborhoods (Q) and supernodes (S).]
[Figure: HP2PC Example, 3-level network, 16 nodes, with root at h=3.]
h=2, β = 0: $P^{(2)} = \{p_1^{(2)}, p_2^{(2)}\}$, $Q^{(2)} = \{Q_1^{(2)}\}$
h=1, β = 0.33: $P^{(1)} = \{p_1^{(1)}, p_2^{(1)}, p_3^{(1)}, p_4^{(1)}\}$, $Q^{(1)} = \{Q_1^{(1)}, Q_2^{(1)}\}$
h=0, β = 0.2: $P^{(0)} = \{p_1^{(0)}, \ldots, p_{16}^{(0)}\}$, $Q^{(0)} = \{Q_1^{(0)}, \ldots, Q_4^{(0)}\}$
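The neighborhood and supernode structure described above can be sketched in a few lines. This is only an illustration of the hierarchy, not of the clustering or merging itself. As assumptions, the number of neighborhoods per level is passed in directly rather than derived from the partitioning factor β, and the first peer of each neighborhood is arbitrarily chosen as its supernode; the function name is hypothetical.

```python
def build_hierarchy(peers, neighborhoods_per_level):
    """Return one entry per level; each entry is a list of neighborhoods,
    and each neighborhood is a list of peers."""
    levels = []
    current = list(peers)
    for count in neighborhoods_per_level:
        size = -(-len(current) // count)  # ceiling division
        hoods = [current[i:i + size] for i in range(0, len(current), size)]
        levels.append(hoods)
        # Assumption: the first peer of each neighborhood is its supernode.
        # Supernodes of level h are the peers of level h+1.
        current = [hood[0] for hood in hoods]
    return levels


# Mirrors the 16-node example: 4, then 2, then 1 neighborhood(s).
peers = [f"p{i}" for i in range(1, 17)]
levels = build_hierarchy(peers, [4, 2, 1])
```

With these inputs, level 0 holds 16 peers in 4 neighborhoods, level 1 holds 4 supernodes in 2 neighborhoods, and level 2 holds 2 supernodes in a single neighborhood, matching the sizes of the P and Q sets in the example.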
Categorization: Multiple Classifier Systems


Tasks
- To investigate various aspects of cooperation in Multiple Classifier Systems (Classifier Ensembles)
- To develop evaluation measures in order to estimate various types of cooperation in the system
- To gain insight into the impact of changes in the cooperative components with respect to system performance using the proposed evaluation measures
- To apply these findings to optimize existing ensemble methods
- To apply these findings to develop novel ensemble methods with the goal of improving classification accuracy and reducing computational complexity
Progress
- Proposed a set of evaluation measures to select sub-optimal training partitions for training classifier ensembles.
- Proposed an ensemble training algorithm called Clustering, Declustering, and Selection (CDS).
- Proposed and optimized a cooperative training algorithm called Cooperative Clustering, Declustering, and Selection (CO-CDS).
- Investigated the applications of the proposed training methods (CDS and CO-CDS) on LO classification.
Categorization: Imbalanced Class Distribution

Objective
- Advance classification of multi-class imbalanced data
Tasks
- To develop the cost-sensitive boosting algorithm AdaC2.M1
- To improve the identification performance on the important classes
- To balance classification performance among several classes
Categorization: Imbalanced Class Distribution
Class Distribution
Class Ind. | Size | Dist.
C1 | 49 | 7.84%
C2 | 288 | 46.08%
C3 | 288 | 46.08%

Performance of Base Classification and AdaBoost
Class | Meas. | C4.5 Base | C4.5 AdaBoost | HPWR (Od=3) Base | HPWR (Od=3) AdaBoost
C1 | R | 0 | 5.11 | 10.70 | 44.06
C1 | P | N/A | 6.5 | 11.82 | 32.89
C1 | F | N/A | 5.84 | 10.83 | 35.84
C2 | R | 73.21 | 92.28 | 88.31 | 87.43
C2 | P | 69.53 | 88.75 | 86.79 | 91.99
C2 | F | 72.29 | 90.38 | 87.43 | 89.64
C3 | R | 67.94 | 91.36 | 87.63 | 88.42
C3 | P | 73.89 | 87.88 | 87.07 | 89.91
C3 | F | 71.91 | 89.42 | 86.99 | 89.03
G-measure | | 0 | 11.46 | 33.32 | 68.50

Balanced performance among classes - Evaluated by G-mean
Class | Meas. | C4.5 Base | C4.5 AdaBoost | C4.5 AdaC2.M1 | HPWR (Od=3) Base | HPWR (Od=3) AdaBoost | HPWR (Od=3) AdaC2.M1
C1 | R | 0 | 5.11 | 77.58 | 10.70 | 44.06 | 65.72
C1 | P | N/A | 6.50 | 14.12 | 11.82 | 32.89 | 30.83
C2 | R | 73.21 | 92.28 | 64.73 | 88.31 | 87.43 | 83.12
C2 | P | 69.53 | 88.75 | 97.24 | 86.79 | 91.99 | 91.38
C3 | R | 67.94 | 91.36 | 65.23 | 87.63 | 88.42 | 83.95
C3 | P | 73.89 | 87.88 | 93.22 | 87.07 | 89.91 | 90.81
G-mean | | 0 | 11.46 | 68.42 | 33.32 | 68.50 | 76.08
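For reference, the G-mean used to judge balance among classes is commonly defined as the geometric mean of the per-class recalls, so a single class with zero recall drives it to zero. The sketch below assumes that standard definition; the function names and toy labels are hypothetical. The figures in the tables are presumably averaged over several runs, so recomputing the G-mean from the reported mean recalls need not reproduce them.

```python
import math


def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted members / class size."""
    recalls = {}
    for cls in sorted(set(y_true)):
        members = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in members if y_pred[i] == cls)
        recalls[cls] = hits / len(members)
    return recalls


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    values = list(per_class_recall(y_true, y_pred).values())
    return math.prod(values) ** (1.0 / len(values))


# Toy imbalanced problem: 2 samples of C1, 4 each of C2 and C3.
y_true = ["C1", "C1"] + ["C2"] * 4 + ["C3"] * 4
ignores_minority = ["C2", "C2"] + ["C2"] * 4 + ["C3"] * 4
finds_half = ["C1", "C2"] + ["C2"] * 4 + ["C3"] * 4

g_ignore = g_mean(y_true, ignores_minority)
g_half = g_mean(y_true, finds_half)
```

A classifier that never predicts the small class can still reach 80% accuracy on this toy data, yet its G-mean is zero, which is why the measure suits imbalanced problems.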
Personalization

Opposition-based Reinforcement Learning for Personalizing Image Search
- Developing a reliable technique to assist users and to facilitate and enhance the learning process
- The personalized ORL tool helps the user find the desired images among the search results
- The tool gathers the images in the search results and selects a sample of them
- By presenting the sample and interacting with the user, it learns the user’s preferences
Image Mining: CBIR

Content based image retrieval
Build an IR system that can retrieve images based on: Textual Cues, Image content, NL Queries
[Diagram: Image Retrieval Tool Set operating on rich documents and images. Queries: Query Image QI, Query Text QT, Query Document. Results: Documents contain QI, Images contain QT, Images match QI, NL Description of Image, Automated image tagging.]
Illustrative Example
[Figure: retrieval results for IZM, FD, MTAR, and the proposed approach; the panels report accuracies of 70%, 60%, 95%, and 55%.]
Experimental Results (Cont’d)
[Figure: the performance of the proposed approach.]
Integration and Applications
Progress
- Finished core parts of the common data mining framework.
- Built components and services from theme researchers’ work around the data mining framework.
- Provided documentation for the data mining framework and software components.
- Launched web site to host components and documentation from Theme 4: http://pami.uwaterloo.ca/projects/lornet/software/
Integration and Applications

Progress
Core parts of the common data mining framework are available, including:
• Vector and matrix manipulation.
• Document parsing and tokenization.
• Statistical term and sentence analysis.
• Similarity calculation using multiple distance functions.
• IMS Content Package compliant parser.
Components and tools built around the common data mining framework:
• Metadata extraction from single documents; supports Dublin Core encoding.
• Document similarity calculation using cosine similarity.
• Single document and content package summarization.
• Building of standard text datasets from large document collections.
Integration with TELOS:
• Developed C# TELOS connector for integrating Theme 4 components.
• Worked on component manifest specification with Theme 6.
• Provided metadata extraction as part of a complete scenario for TELOS components integration.
• The following components were wrapped for use by TELOS through the C# connector: Automatic Metadata Extractor, Document Similarity, and Document Summarizer.
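Document similarity using cosine similarity, one of the components listed above, can be sketched as follows. This is a minimal illustration over raw term counts and is not the framework's own API (which is written for a C# connector); the tokenizer and function names are assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the two documents' term-count vectors."""
    va, vb = term_vector(doc_a), term_vector(doc_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


same = cosine_similarity("data mining tools", "Data mining tools")
partial = cosine_similarity("data mining", "data structures")
none = cosine_similarity("data mining", "image retrieval")
```

In practice the raw counts would usually be replaced by weighted terms, but the similarity formula stays the same.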
Industry Collaboration



- Pattern Discovery Software (PDS) provided data mining software tools for use by researchers.
- Vestech provided opportunities for researchers to work on speech technologies.
- Desire2Learn opened job opportunities for LORNET researchers.
Software Components
Overview of Components
General Tools
- C# Connector for TELOS
- Common Data Mining Framework
Standard Text Mining Tools
- Metadata Extractor
- Document Summarizer
- Content Package Summarizer
- Document Similarity
- LO Recommender
- Metadata Harvester
- Keyword Extractor
- Taxonomy Extractor
- Metadata Enrichment Tools
Concept-based and Semantic Text Mining Tools
- Metadata Extractor
- LO Search Engine
- Document Similarity
- Document Classifier
- Document Clusterer
- Semantic-based Ontology Representation
- Semantic Metadata Matching
- POS Rule-Learning System
- Triplet Representation System
Categorization Tools
- LO Classifier
- LO Multiple Classifier
- LO Clusterer
- LO Ensemble Clusterer
- LO Consensus Clusterer
- LO Distributed Clusterer
Image Mining Tools
- Content-based Image Search
- Personalized Image Search
- Consensus-based Fusion for Image Retrieval
Personalized Search Engine
Social Network Learner
Legend: Integrated, Ready, In Progress, Year 5

Scenarios for Use of Software Components
Environment | Data Types | Tasks
TELOS | Metadata; Ontology | Ontology construction and unification; Finding relations between components; Ranking components; Grouping components; Tagging components
Learning Object Repository | Metadata; Structured Text; Categorical | Automatic metadata extraction; LO automatic classification; LO organization through clustering; Multiple organization strategies through cluster ensembles
e-Learning Environment | Structured Text; Images; Object Relationships; Context; User-centric | Extracting concepts from LO; Summarizing Documents; Grouping LOs; Tagging LOs; Discovering Similar Topics; Discovering Similar Peers; Building Social Networks; Detecting Plagiarism; LO recommendation using similarity ranking; Personalization / Specialization through reinforcement learning
Publications
Project | Papers (accepted / published) | Papers (submitted / in prep) | Theses (completed / in progress)
4.1 Information Extraction from Text | 11 | 7 | 3/2
4.2 Semantic Knowledge Synthesis from Text | 10 | 4 | 4/1
4.3 Knowledge Discovery through Categorization | 12 | 10 | 4/1
4.4 Knowledge from Interaction | 8 | 3 | 1/2
4.5 Knowledge from Image Mining | 10 | 3 | 2/1
Total | 51 | 27 | 14/7 = 21
Theme 4 Team
Leader: M. Kamel
PIs: Dr. Basir, Dr. Tizhoosh, Dr. Karray
Assoc. PIs: Wong, DiMarco
Researchers
- H. Ayad
- R. Kashef
- A. Ghazel
- Dr. Makrehchi
- M. Shokri
- S. Hassan
- A. Farahat
- Dr. R. Khoury
Graduated
- R. Khoury, PhD 07
- L. Chen, PhD 07
- M. Makrehchi, PhD 07
- K. Hammouda, PhD 07
- R. Dara, PhD 07
- Y. Sun, PhD 07
- K. Shaban, PhD 06
- Y. Sun, PhD 06
- M. Hussin, PhD 05
- Jan Bakus, PhD 05
- A. Adegorite, MA.Sc 04
- A. Khandani, MA.Sc 05
- S. Podder, MA.Sc 04
Funding
- CRC/CFI/OIT
- NSERC
- PAMI Lab
- PDS, Vestech, Desire2Learn
Pattern Analysis and Machine Intelligence Lab
Electrical and Computer Engineering
University of Waterloo
Canada
www.pami.uwaterloo.ca
www.pami.uwaterloo.ca/projects/lornet/software/
www.pami.uwaterloo.ca/kamel.html