Transcript Slide 1

Data mining
• Data mining, at its core, is the transformation
of large amounts of data into meaningful
patterns and rules.
Definition of Data Mining
• The nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data stored in structured databases.
- Fayyad et al. (1996)
• Keywords in this definition: Process, nontrivial,
valid, novel, potentially useful, understandable.
• Other names: knowledge extraction, pattern
analysis, knowledge discovery, information
harvesting, pattern searching, data dredging,…
Two types of Data mining
• Directed Data mining (supervised)
• Undirected (Unsupervised)
Directed Data mining
• In directed data mining, you are trying to
predict a particular data point.
• For example, the sales price of a house, given
information about other houses for sale in the
neighborhood.
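As a rough illustration of the house-price example, here is a minimal sketch of directed data mining using ordinary least squares on a single attribute. The data, the choice of living area as the predictor, and the names fit_line and predict_price are all assumptions made for this sketch; they are not from the slides.

```python
# Hypothetical sales in one neighborhood: (living area in square metres,
# sale price in thousands). The numbers are invented for illustration.
neighborhood_sales = [(50, 150), (70, 210), (90, 270), (110, 330)]


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(area, model):
    """Predict the target value (price) for a house of the given area."""
    slope, intercept = model
    return slope * area + intercept


model = fit_line(neighborhood_sales)
estimate = predict_price(80, model)  # price estimate for an 80 m^2 house
print(estimate)
```

The point of the sketch is only that there is a known target (the price) and the model is fitted to predict it from other houses.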
Undirected Data mining
• In undirected data mining, you are trying to
create groups of data, or find patterns in
existing data.
• For example, every U.S. census is, in effect,
data mining, as the government looks to
gather data about everyone in the country
and turn it into useful information.
• Modern data mining started in the mid-1990s,
when computing power and the cost of
computing and storage finally reached a level
where it was possible for companies to do it
in-house, without having to look to outside
computer powerhouses.
• The term data mining is all-encompassing,
referring to dozens of techniques and
procedures used to examine and transform
data.
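To make the undirected case described earlier more concrete, here is a minimal sketch that groups records with no target value to predict. It uses a small one-dimensional k-means; the ages, the function name k_means_1d, and the evenly spaced starting centres are assumptions chosen for illustration, not something the slides specify.

```python
def k_means_1d(values, k, iterations=10):
    """Tiny k-means for one-dimensional data (illustrative only)."""
    ordered = sorted(values)
    # Deterministic start: take k evenly spaced values as initial centres.
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centres = [ordered[round(i * step)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        # Assign every value to its nearest centre.
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


ages = [21, 23, 25, 61, 63, 65]  # hypothetical census-style ages
centres, groups = k_means_1d(ages, k=2)
print(centres, groups)
```

No label was supplied; the two groups emerge from the data alone, which is what distinguishes this from the directed case.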
Data Mining at the Intersection of
Many Disciplines
[Figure: DATA MINING at the center of overlapping disciplines:
Statistics, Artificial Intelligence, Pattern Recognition,
Mathematical Modeling, Machine Learning, Databases,
Management Science & Information Systems]
Data Mining
Characteristics/Objectives
• Source of data for DM is often a consolidated data
warehouse (not always!)
• DM environment is usually a client-server or a
Web-based information systems architecture
• Data is the most critical ingredient for DM which
may include soft/unstructured data
• The miner is often an end user
• Striking it rich requires creative thinking
• Data mining tools’ capabilities and ease of use are
essential (Web, Parallel processing, etc.)
Data in Data Mining
• Data: a collection of facts usually obtained as the result of
experiences, observations, or experiments
• Data may consist of numbers, words, images, …
• Data: lowest level of abstraction (from which information
and knowledge are derived)
Data
  Categorical: Nominal, Ordinal
  Numerical: Interval, Ratio
- DM with different data types?
- Other data types?
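One reason the data type matters is that it limits which summaries are meaningful. The sketch below follows a common convention that is not stated on the slide: mode for nominal data, median for ordinal data, mean for interval and ratio data. The function name summarize and the SIZE_ORDER scale are hypothetical.

```python
from statistics import mean, median_low, mode

# Hypothetical ordinal scale: the labels have an order but no fixed spacing.
SIZE_ORDER = ["small", "medium", "large"]


def summarize(values, scale, order=None):
    """Return a summary that is meaningful for the given measurement scale."""
    if scale == "nominal":
        # Unordered labels: only counting makes sense.
        return mode(values)
    if scale == "ordinal":
        # Ordered labels: rank them, then take the (lower) middle rank.
        ranks = [order.index(v) for v in values]
        return order[median_low(ranks)]
    if scale in ("interval", "ratio"):
        # Numeric values: arithmetic on them is meaningful.
        return mean(values)
    raise ValueError(f"unknown scale: {scale}")


colour = summarize(["red", "blue", "red"], "nominal")
size = summarize(["small", "large", "medium", "large"], "ordinal", SIZE_ORDER)
weight = summarize([10.0, 20.0, 30.0], "ratio")
print(colour, size, weight)
```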
A Taxonomy for Data Mining Tasks
Data Mining | Learning Method | Popular Algorithms
Prediction | Supervised | Classification and Regression Trees, ANN, SVM, Genetic Algorithms
  Classification | Supervised | Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
  Regression | Supervised | Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association | Unsupervised | Apriori, OneR, ZeroR, Eclat
  Link analysis | Unsupervised | Expectation Maximization, Apriori Algorithm, Graph-based Matching
  Sequence analysis | Unsupervised | Apriori Algorithm, FP-Growth technique
Clustering | Unsupervised | K-means, ANN/SOM
  Outlier analysis | Unsupervised | K-means, Expectation Maximization (EM)
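The taxonomy names Apriori as an association algorithm. As a hedged illustration of the idea, the sketch below does a simplified level-wise search for frequent itemsets: larger candidates are built only from itemsets that already met the minimum support. It is not a full Apriori implementation; the basket data, the function name frequent_itemsets, and the 0.5 support threshold are assumptions for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Simplified level-wise frequent-itemset search in the spirit of Apriori."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while current:
        # Support = fraction of transactions that contain the candidate.
        support = {c: sum(1 for t in transactions if c <= t) / n for c in current}
        survivors = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(survivors)
        # Build candidates one item larger from the surviving itemsets only.
        size += 1
        candidates = {a | b for a, b in combinations(list(survivors), 2)
                      if len(a | b) == size}
        # Pruning step: every smaller subset must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in survivors
                          for s in combinations(c, size - 1))]
    return frequent


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in result.items():
    print(sorted(itemset), support)
```

With these baskets, the pair butter and milk appears together only once, so it falls below the threshold and no three-item set is ever considered.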
• The ultimate goal of data mining is to create a
model.
• A model that can improve the way you read and
interpret your existing data and predict your
future data.
• Since there are so many techniques in data
mining, the key step in creating a good model
is deciding which type of technique to use.
That will come with practice and experience, and
some guidance. From there, the model needs to
be refined to make it even more useful.
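One way to picture "create a model, then refine it" is to start with a trivial baseline and check whether a slightly better rule improves accuracy on data held back from training. The sketch below is only an illustration under invented data: the records, the income attribute, and the names majority_baseline and best_threshold are all hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Baseline model: always predict the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)


def best_threshold(records):
    """Refined model: pick the income cut-off that fits the training data best."""
    def train_accuracy(t):
        preds = ["yes" if x >= t else "no" for x, _ in records]
        return accuracy(preds, [label for _, label in records])
    return max(sorted(x for x, _ in records), key=train_accuracy)


# Hypothetical (income in thousands, bought the product?) records.
train = [(20, "no"), (25, "no"), (30, "no"), (60, "yes"), (70, "yes")]
test = [(22, "no"), (65, "yes"), (80, "yes"), (28, "no")]
test_labels = [label for _, label in test]

baseline_label = majority_baseline([label for _, label in train])
baseline_acc = accuracy([baseline_label] * len(test), test_labels)

threshold = best_threshold(train)
rule_acc = accuracy(["yes" if x >= threshold else "no" for x, _ in test],
                    test_labels)
print(baseline_acc, threshold, rule_acc)
```

Comparing the two accuracies on the held-back records is a simple check that a refinement actually helps rather than just fitting the training data.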
Weka as a Data mining tool
• Data mining isn't solely the domain of big companies and expensive
software.
• In fact, there's a piece of software that does almost all the same
things as these expensive pieces of software: it is called WEKA.
• WEKA is the product of the University of Waikato (New Zealand)
and was first implemented in its modern form in 1997.
• It uses the GNU General Public License (GPL).
• The software is written in the Java™ language and contains a GUI
for interacting with data files and producing visual results (think
tables and curves).
• It also has a general API, so you can embed WEKA, like any other
library, in your own applications to do such things as automated
server-side data-mining tasks.