Interval-valued
Fuzzy-Rough Feature Selection
in Datasets with Missing Values
Dr. Richard Jensen
Aberystwyth University, UK
Prof Qiang Shen
Aberystwyth University, UK
FUZZ-IEEE 2009
Richard Jensen and Qiang Shen
Outline
• The importance of feature selection
• Rough set theory
• Fuzzy-rough feature selection (FRFS)
• Interval-valued FRFS
• Experimentation
• Conclusion
Feature selection
• Why dimensionality reduction/feature selection?
[Diagram: high-dimensional data is intractable for the processing system directly; dimensionality reduction converts it to low-dimensional data, which is then passed to the processing system]
• Growth of information - need to manage this effectively
• Curse of dimensionality - a problem for machine learning
• Data visualisation - graphing data
Feature selection
• Feature selection (FS) is a dimensionality reduction (DR)
technique that preserves data semantics (the meaning of the data)
[Diagram: Feature set → Generation → Subset → Evaluation → Subset suitability → Stopping criterion; "Continue" loops back to Generation, "Stop" leads to Validation]
• Subset generation: forwards, backwards, random…
• Evaluation function: determines ‘goodness’ of subsets
• Stopping criterion: decide when to stop subset search
Rough set theory
[Diagram: Set A shown with its lower approximation and upper approximation, both built from equivalence classes Rx]
Rx is the set of all points that are indiscernible
from point x with respect to feature subset B
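The definition of Rx can be illustrated with a minimal Python sketch that partitions objects into equivalence classes. The toy data, the dict-based object representation and the function name `equivalence_classes` are assumptions made for illustration only; they do not come from the slides.

```python
from collections import defaultdict


def equivalence_classes(objects, features):
    """Group object indices that share the same values on `features`.

    Two objects land in the same class exactly when they are
    indiscernible with respect to the chosen feature subset.
    """
    classes = defaultdict(list)
    for idx, obj in enumerate(objects):
        key = tuple(obj[f] for f in features)
        classes[key].append(idx)
    return list(classes.values())


# Hypothetical toy data: each object is a dict of discrete feature values.
data = [
    {"a": 1, "b": 0},
    {"a": 1, "b": 1},
    {"a": 0, "b": 1},
    {"a": 1, "b": 0},
]

classes_a = equivalence_classes(data, ["a"])        # subset B = {a}
classes_ab = equivalence_classes(data, ["a", "b"])  # subset B = {a, b}
```

Adding a feature to B can only split classes further, which is why larger subsets discern more objects.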
Rough set feature selection
• Attempts to remove unnecessary or
redundant features
• Evaluation: function based on rough set
concept of lower approximation
• Generation: greedy hill-climbing algorithm
employed
• Stopping criterion: when maximum evaluation
value is reached
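The three components above can be sketched in Python as follows. The slide only says that the evaluation function is based on the lower approximation; this sketch assumes the usual rough-set dependency degree (the fraction of objects whose equivalence class is consistent with the decision) as that function. The names `dependency` and `greedy_select` and the toy table are hypothetical.

```python
from collections import defaultdict


def dependency(rows, features, decision):
    """Assumed evaluation: share of objects whose equivalence class
    (on `features`) contains a single decision value."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in features)].append(row[decision])
    consistent = sum(len(g) for g in groups.values() if len(set(g)) == 1)
    return consistent / len(rows)


def greedy_select(rows, candidates, decision):
    """Forward hill-climbing: repeatedly add the best-scoring feature
    until the evaluation reaches the value of the full feature set."""
    selected = []
    target = dependency(rows, candidates, decision)  # maximum attainable
    best = dependency(rows, selected, decision)
    while best < target:
        scores = {
            f: dependency(rows, selected + [f], decision)
            for f in candidates
            if f not in selected
        }
        feature, best = max(scores.items(), key=lambda kv: kv[1])
        selected.append(feature)
    return selected


# Hypothetical decision table: the decision "d" is determined by "b".
table = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
    {"a": 1, "b": 0, "c": 0, "d": "no"},
    {"a": 1, "b": 1, "c": 0, "d": "yes"},
]
subset = greedy_select(table, ["a", "b", "c"], "d")
```

Greedy search is not guaranteed to find a minimal subset; it trades optimality for a search cost that is polynomial in the number of features.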
Fuzzy-rough sets
Fuzzy-rough set
Fuzzy similarity
Fuzzy-rough sets
• Fuzzy-rough feature selection
• Evaluation: function based on fuzzy-rough lower
approximation
• Generation: greedy hill-climbing
• Stopping criterion: when maximal ‘goodness’ is
reached (or to degree α)
• Problem #1: how to choose fuzzy similarity?
• Problem #2: how to handle missing values?
Interval-valued FRFS
• Answer #1: Model uncertainty in fuzzy
similarity by interval-valued similarity
IV fuzzy rough set
IV fuzzy similarity
Interval-valued FRFS
• When comparing two object values for a
given attribute – what to do if at least one is
missing?
• Answer #2: Model missing values via the
unit interval
Missing values
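The unit-interval treatment of missing values can be sketched in Python as below. Only the rule "a comparison involving a missing value yields [0, 1]" comes from the slide. Representing a missing value as `None`, returning a degenerate interval when both values are known, and combining attributes with a bound-wise minimum are assumptions of this sketch, as are the names `iv_similarity` and `iv_min`.

```python
def iv_similarity(x, y, value_range):
    """Similarity of two attribute values as an interval (lower, upper)."""
    if x is None or y is None:
        # At least one value is missing: the true similarity could be
        # anything, so it is modelled by the whole unit interval.
        return (0.0, 1.0)
    s = 1.0 - abs(x - y) / value_range
    # Assumption: known values give a degenerate (single-point) interval.
    return (s, s)


def iv_min(p, q):
    """Assumed combination over attributes: minimum applied bound-wise."""
    return (min(p[0], q[0]), min(p[1], q[1]))


known = iv_similarity(2.0, 7.0, 10.0)    # both values present
missing = iv_similarity(None, 1.0, 4.0)  # one value missing
combined = iv_min(known, missing)        # similarity over both attributes
```

In this sketch a missing value never raises the upper bound of the combined similarity; it only widens the interval downwards, reflecting the added uncertainty.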
Other measures
• Boundary region
• Discernibility function
Experimentation
• Datasets corrupted with noise
• 10-fold cross validation with JRip
Results: lower
Results: boundary
Results: discernibility
Conclusion
• New approaches to fuzzy-rough feature
selection based on IVFS
• Can handle missing values effectively
• Allows greater flexibility w.r.t. similarity relations
• Future work
• Further investigations
• Development and extension of other fuzzy-rough
methods to handle missing values – classifiers,
clusterers etc.
• WEKA implementations of all fuzzy-rough
feature selectors and classifiers can be
downloaded from:
RSAR approximations
• Approximating a concept X using knowledge in P
• Lower approximation: contains objects that definitely
belong to X
$\underline{P}X = \{x \in U : [x]_P \subseteq X\}$
• Upper approximation: contains objects that possibly
belong to X
$\overline{P}X = \{x \in U : [x]_P \cap X \neq \emptyset\}$
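The two approximations can be computed directly from the equivalence classes, as in this minimal Python sketch. The toy data and the function name `approximations` are hypothetical; the concept X is given as a set of object indices.

```python
from collections import defaultdict


def approximations(rows, features, concept):
    """Return (lower, upper) approximations of `concept` as index sets."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[f] for f in features)].add(i)

    lower, upper = set(), set()
    for cls in classes.values():
        if cls <= concept:   # class lies entirely inside X
            lower |= cls
        if cls & concept:    # class overlaps X
            upper |= cls
    return lower, upper


# Hypothetical data: objects 0 and 1 are indiscernible on feature "a".
objects = [{"a": 0}, {"a": 0}, {"a": 1}, {"a": 2}]
X = {0, 2}
lower, upper = approximations(objects, ["a"], X)
```

Object 0 belongs to X but is indiscernible from object 1, which does not, so both appear only in the upper approximation.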
FRFS
• Based on fuzzy similarity
$\mu_{R_a}(x, y) = 1 - \frac{|a(x) - a(y)|}{|a_{\max} - a_{\min}|}$
$\mu_{R_P}(x, y) = \mathcal{T}_{a \in P}\{\mu_{R_a}(x, y)\}$
• Lower/upper approximations
$\mu_{\underline{R_P}X}(x) = \inf_{y \in U} I(\mu_{R_P}(x, y), \mu_X(y))$
$\mu_{\overline{R_P}X}(x) = \sup_{y \in U} T(\mu_{R_P}(x, y), \mu_X(y))$
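A small Python sketch of these definitions follows. The slides leave the t-norm T and the implicator I generic; this sketch assumes the minimum t-norm and the Łukasiewicz implicator I(a, b) = min(1, 1 - a + b). The function names, the toy data and the crisp membership vector are likewise hypothetical.

```python
def similarity(x, y, features, ranges):
    """Fuzzy similarity over a feature subset: the per-attribute
    similarities 1 - |a(x) - a(y)| / range(a), combined with min."""
    return min(1.0 - abs(x[a] - y[a]) / ranges[a] for a in features)


def lower_upper(data, features, ranges, membership, i):
    """Membership of object i in the fuzzy-rough lower and upper
    approximations of the concept described by `membership`."""
    x = data[i]
    sims = [similarity(x, y, features, ranges) for y in data]
    # Lower: infimum over y of I(similarity, membership),
    # with the assumed Lukasiewicz implicator.
    lower = min(min(1.0, 1.0 - s + m) for s, m in zip(sims, membership))
    # Upper: supremum over y of T(similarity, membership), with T = min.
    upper = max(min(s, m) for s, m in zip(sims, membership))
    return lower, upper


# Hypothetical data: one real-valued attribute with range 1.0.
points = [{"a": 0.0}, {"a": 0.5}, {"a": 1.0}]
ranges = {"a": 1.0}
mu_X = [1.0, 1.0, 0.0]  # concept membership of each object

results = [lower_upper(points, ["a"], ranges, mu_X, i) for i in range(3)]
```

The middle object belongs to X but is fairly similar to an object outside X, so its lower-approximation membership drops below 1 while its upper-approximation membership stays at 1.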