Association for Computing Machinery Penn State Student Chapter of the

Transcript Association for Computing Machinery Penn State Student Chapter of the

Penn State Student Chapter of the
Association for
Computing Machinery
We welcome all interested students to our
4th general meeting of the Spring 2005 semester!
When: Monday, April 11th, 2005 from 7-8 pm
Where: Cybertorium (213 IST)
Agenda:
• Brief overview of our ACM chapter
• New officer introductions
• Special topic presentation: No Pain, No Game
Presented by IST Professor Brian K. Smith
• Co-op/Intern presentation: Working at IBM
Presented by Rick Osowski
Free refreshments will be provided
IST 210
Data Warehousing, Data Mining,
and Advanced Applications
2
IST 210



Data Rich, but Information Poor
Data is stored, not explored :
by its volume and complexity
it represents a burden,
not a support
Data overload results in
uninformed decisions,
contradictory information,
higher overhead,
wrong decisions,
increased costs
Data is not designed and
is not structured for
successful management decision
making
3
IST 210
Improving Decision Making
Decisions
Information
Data
Warehouse
Data
4
IST 210
Data Warehouse Concepts
5
IST 210
What’s a Data Warehouse?
A data warehouse is a single, integrated source of decision
support information formed by collecting data from multiple
sources, internal to the organization as well as external, and
transforming and summarising this information to enable
improved decision making.
A data warehouse is designed for easy access by users to large
amounts of information, and data access is typically supported
by specialized analytical tools and applications.
6
IST 210
Data Warehouse Characteristics
 Key Characteristics of a Data Warehouse
Subject-oriented
Integrated
Time-variant
Non-volatile
7
IST 210
Subject Oriented
• Example for an insurance company :
Applications Area
Data Warehouse
Auto and Fire
Policy
Processing
Systems
Commercial
and Life
Insurance
Systems
Data
Data
Accounting
System
Billing
System
Policy
Customer
Claims
Processing
System
Losses
Premium
8
IST 210
Integrated
• Data is stored once in a single integrated location
(e.g. insurance company)
Auto Policy
Processing
System
Customer
data
stored
in several
databases
Data Warehouse
Database
Fire Policy
Processing
System
FACTS, LIFE
Commercial, Accounting
Applications
Subject = Customer
9
IST 210
Time - Variant
• Data is stored as a series of snapshots or views which record how it is
collected across time.
Data Warehouse Data
Data
{
Time
Key


Data is tagged with some element of time - creation date, as of
date, etc.
Data is available on-line for long periods of time for trend
analysis and forecasting. For example, five or more years
10
IST 210
Non-Volatile
• Existing data in the warehouse is not overwritten or
updated.
External
Sources
Production
Databases
Data
Warehouse
Environment
Production
Applications
• Update
• Insert
• Delete
Data
Warehouse
Database
• Load
• Read-Only
11
IST 210
Transaction System vs. Data Warehouse
12
IST 210
Transaction-Based Reporting System
Day-to-day operations
On-line, real time update
into disparate systems
System
Experts
Data
Manipulation
Users
Unix
VMS
MVS
Other
13
IST 210
Warehouse-Based Reporting System
Executive Reporting
and On-Line Analysis
Unix
Other
BENEFIT: Reduce
data processing
costs
Data Staging,
Transformation
and Cleansing
Data
Warehouse
Environment
BENEFIT: Integrated,
consistent data
available for analysis
Summarization
MVS
Interfaces
VMS
OLAP
BENEFIT: Improve Network
Reporting processes and
analytical capabilities
14
IST 210
Transaction - Warehouse Process
“Transaction Based Process”
Day-to-day
operations
On-line, real time
update.
Detailed Information to
operational systems.
Decision support for management
use.
Summarize &
Refine
Transform
Batch Load
“Warehouse Based Process”
15
IST 210
Transaction System vs. Data Warehouse
 Transaction System
 Supports day-to-day operational processes
 Contains raw, detailed data that has not been
refined or cleansed
 Volatile -- data changes from day-to-day, with
frequent updates
 Technical issues drive the data structure and
system design
 Disparate data structures, physical locations,
query types, etc.
 Users rely on technical analysts for reporting
needs
 Operational processes impacted by queries
run off of system
 Data Warehouse
 Supports management analysis and decisionmaking processes
 Contains summarized, refined, and cleansed
information
 Non-volatile -- provides a data “snapshot”;
adjustments are not permitted, or are limited
 Business analysis requirements drive the data
structure and system design
 Integrated, consistent information on a single
technology platform
 Users have direct, fast access via On-line
Analytical Processing tools
 Minimal impact on operational processes
16
IST 210
Data Warehouse Architecture
17
IST 210
Data Warehouse Architecture
Operational System
Data Warehouse
Ad-hoc
Reporting
Conversion
& Interface
ODS
OLAP
Cubes
Canned
Reports
Staging Area
Data Marts
18
Data Warehouse Architecture
IST 210
Conversion and Cleansing Activities
Conversion
& Cleansing






Map source data to target
Data scrubbing
Derive new data
Data Extraction
Transform / convert data
Create / modify metadata
19
Data Warehouse Architecture
IST 210
Data Warehouse Components
Detailed
Data
Summary
Data





Ranges from detailed to
summarized data
Contains metadata
Many views of the data
Subject-Oriented
Time-variant
Metadata
20
Requirements Gathering Process
IST 210
Business Measure Definition




Standard definition and related business rules and formulas

Priority of the information (For example, is the information
necessary to derive other business measures?)

Data load frequency (e.g., monthly, quarterly, etc.)
Source data element(s), including quality constraints
Data granularity levels (e.g., county detail for state)
Data retention (e.g., one month, one quarter, one year, multiple
years)
21
IST 210
Star Join Schema
Dimension Tables
Region_Dimension_Table
region _id
NE
NW
SE
SW
Product_Dimension_Table
prod_grp_id
prod_id
prod_grp_desc
prod_desc
10
20
30
100
140
220
Fewer devices
Circuit boards
Components
Power supply
Motherboard
Co-processor
region _doc
Northeast
Northwest
Southeast
Southwest
account _id
100000
110000
120000
130000
140000
account _doc
ABC Electronics
Midway Electric
Victor Components
Washburn, Inc.
Zerox
Account_Dimension_Table
month
prod_id
region_id
account_id
vend_id
net-sales
gross_sales
01-1996
02-1996
03-1996
100
140
220
SW
NE
SW
100000
110000
100000
100
200
300
30,000
23,000
32,000
50,000
42,000
49,000
Fact Table
Monthly_Sales_Summary_Table
month
mo_in_fiscal_yr
month_name
Vendor_Dimension_Table
vend_id
01-1996
02-1996
03-1996
4
5
6
January
February
March
100
200
300
vendor_desc
PowerAge, Inc.
Advanced Micro Devices
Farad Incorporated
Time_Dimension_Table
22
IST 210
Multi-Dimensional Analysis
Geography Dimension
Zip Code
Customer Dimension
County
Class of Trade
Region
State
Client Type
Account
Store
Net Sales by Brand by
Region by Client Type
Product
Family
Product
Line
Brand
Category
Group
Product Dimension
Business Measure:
Net Dimension
Sales
Product
Item
DW0117
23
Application Solution Classes
IST 210

Executive information system (EIS) :


Present information at the highest level of summarization using
corporate business measures. They are designed for extreme ease-ofuse and, in many cases, only a mouse is required. Graphics are usually
generously incorporated to provide at-a-glance indications of
performance
Decision Support Systems (DSS) :

They ideally present information in graphical and tabular form,
providing the user with the ability to drill down on selected
information. Note the increased detail and data manipulation
options presented
24
IST 210
Data Mining
1
25
IST 210
Data Mining


The process of extracting valid, previously unknown,
comprehensible, and actionable information from
large databases and using it to make crucial business
decisions, (Simoudis,1996).
Involves the analysis of data and the use of software
techniques for finding hidden and unexpected
patterns and relationships in sets of data.
26
IST 210
Data Mining




Reveals information that is hidden and unexpected,
as little value in finding patterns and relationships
that are already intuitive.
Patterns and relationships are identified by examining
the underlying rules and features in the data.
Data mining can provide huge paybacks for
companies who have made a significant investment
in data warehousing.
Relatively new technology, however already used in a
number of industries.
27
IST 210
Examples of Applications of Data Mining

Retail / Marketing





Identifying buying patterns of customers
Finding associations among customer demographic
characteristics
Predicting response to mailing campaigns
Market basket analysis
Banking




Detecting patterns of fraudulent credit card use
Identifying loyal customers
Predicting customers likely to change their credit card
affiliation
Determining credit card spending by customer groups
28
IST 210
Examples of Applications of Data Mining

Insurance



Claims analysis
Predicting which customers will buy new policies
Medicine


Characterizing patient behavior to predict surgery visits
Identifying successful medical therapies for different
illnesses
29
IST 210
Data Mining Operations and Associated Techniques
30
IST 210
Database Segmentation





Aim is to partition a database into an unknown number of
segments, or clusters, of similar records.
Uses unsupervised learning to discover homogeneous subpopulations in a database to improve the accuracy of the
profiles.
Less precise than other operations thus less sensitive to
redundant and irrelevant features.
Sensitivity can be reduced by ignoring a subset of the attributes
that describe each instance or by assigning a weighting factor to
each variable.
Applications of database segmentation include customer
profiling, direct marketing, and cross selling.
31
IST 210
Scatterplot
32
IST 210
Visualization
33
IST 210
Data Mining and Data Warehousing




Major challenge to exploit data mining is identifying
suitable data to mine.
Data mining requires single, separate, clean,
integrated, and self-consistent source of data.
A data warehouse is well equipped for providing data
for mining.
Data quality and consistency is a pre-requisite for
mining to ensure the accuracy of the predictive
models. Data warehouses are populated with clean,
consistent data.
34
IST 210
Data Mining and Data Warehousing



It is advantageous to mine data from multiple sources to
discover as many interrelationships as possible. Data
warehouses contain data from a number of sources.
Selecting the relevant subsets of records and fields for data
mining requires the query capabilities of the data warehouse.
The results of a data mining study are useful if there is some
way to further investigate the uncovered patterns. Data
warehouses provide the capability to go back to the data
source.
35
IST 210
Advanced Database Topics
36
IST 210
A Little History







Prior to the 1980s  hierarchical and network
databases.
Hardware  dumb terminals using private networks
Database  centralized and stored on the disk packs
End user terminals  simply input/output devices
Processing at the mainframe
Data  text data
Networks had to handle text data
No access from outside to the organization's private
network.
37
IST 210
New Needs




Microcomputer enabled workstation processing
power.
Satellite and network technology provided for very
high speed, high traffic, and low cost long distance
communications networks.
Internet in the late 1990s and the corresponding
phenomenal growth in electronic commerce (Ecommerce) necessitated public access to data in
people's homes.
The volume of data needed to be transmitted
increased greatly.
38
IST 210
New Needs



Business environment changed during the last two
decades
Information stored at different locations, on different
hardware and operating systems, with different
commercial DBMS products, and with different
underlying data models had to be combined
The centralized database was no longer feasible to
handle these new demands
39
IST 210
Distributed Database Scenario

There are many advantages to using a distributed
database rather than a centralized database. They
are:





Improved performance, because high traffic data are stored
locally.
More efficient data management, because the DBA workload
is shared.
Better network integrity, because the whole system does not
stop if one computer goes down.
Expansion of the database is facilitated when the
organization grows, since new data does not have to be
centralized. It can remain and be administered in the original
location.
Data for the whole organization can still be accessed from
any location.
40
Distributed Database
IST 210



Data administration is improved (??)
In a distributed database system even a
simple task like creating a backup copy of the
database can take a considerable amount of
time.
If the database is divided among several
locations the time and workload for this task
can be shared.
41
IST 210
Replication of Data





System failure in one location should not stop processing in
other locations
Replicate all or parts of the database in more than one location.
Database replication improves performance and provides a failsafe option, but it involves considerable complexity
Replication of frequently used data improves response time and
reduces network traffic
If the data changes at one location it must be changed at all
locations
42
IST 210
Distributed Systems in an Ideal World



C. J. Date established rules for the ideal distributed
DBMS system
Rules are a goal that distributed systems strive
toward, but have not yet reached
According to Date's rules:



Each site is responsible for its own portion of the distributed
database, including security, backup, and recovery.
Each site has equal capabilities and does not rely on any
other site.
The system should work regardless of the computer
hardware, operating system, or network installed at any site.
43
IST 210
Date's Rules of Distributed Databases:
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
Local site independence
Central site independence
Failure independence
Location transparency
Fragmentation transparency
Replication transparency
Distributed query processing
Distributed transaction processing
Hardware independence
Operating system independence
Network independence
Database independence
44
IST 210
Complexities of Distributed Databases


There also are many complications involved in the
management of distributed database systems.
The distributed database must be carefully designed
to insure the following:




Store data as close as possible to where it is used most
often.
Make the location of the data transparent to the end user.
Make the system easy to expand.
Optimize queries to improve response time in the distributed
environment.
45
IST 210
Database Design


The designer must analyze the organization's needs
and business processes to determine the best way to
distribute the database.
There are several possibilities for storing the data in
more than one location:





Centralized master database
Replication of the entire or part of the database in several
locations
Horizontal partitions
Vertical partitions
Mixture of the above
46
Fragmentation
IST 210

Horizontal fragmentation
of the database



means that rows of a
table(s) may be stored
in different locations
Similar to the
separation of the
customer table in the
retailing example
above.
Vertical fragmentation
means that columns of a
table ( i.e., attributes or
groups of attributes of an
entity) are stored in
different locations.
47
IST 210
Query Formulation



Distributed databases require a considerable amount
of network overhead
Poorly formulated query it may cause unnecessary
data retrieval from the database
Query optimization is ideally performed by the
distributed database management system
48
IST 210
OODB

In traditional relational databases E-R Modeling and
normalization focuses on identifying entities, their attributes, and
the relationships between entities


This works well for most organizational data, especially business
data
The advent of the microcomputer and processing power on the
desktop



Computer aided design, CAD, became the norm for engineering
work, so it became necessary to store drawings
Powerful multimedia PCs with sound cards and color monitors
enabled the manipulation of sound and video files
Many other applications were developed that required more
than just text and numeric processing
49
IST 210


Why??
These new applications were facilitated by the development of
Object-Oriented Programming
Still evolving development of object-oriented data modeling,
object-oriented databases, and object-oriented database
management systems




OODBMS and O/R DBMS are two types of database management
systems that are currently available
O/R DBMS uses the basic theory of relational database
management systems with object-oriented features added
OODBMS is more object-oriented and was developed separately
from the relational products
OODMBS suffers from a lack of standardization that is available
with relational database systems
50

Association for Computing Machinery Penn State Student Chapter of the

Transcript Association for Computing Machinery Penn State Student Chapter of the

Directory