Transcript Slide 1
Managing Data for DSS II

Data Warehouse
• Common characteristics:
– A database designed to meet analytical tasks, comprising data from multiple applications
– A small number of users with intense and long interactions
– Read-intensive usage
– Periodic updates to the contents
– Consists of current as well as historical data
– Relatively fewer but larger tables
– Query results are large result sets, involving full table scans and joins spanning several tables
– Aggregation, vector operations and summarization are common
– The data frequently resides in external heterogeneous sources

Introduction - Terminology
• Current Detail Data - data acquired directly from operational databases, often representing the entire enterprise.
• Old Detail Data - aged current detail data; historical data organized by subject. It helps in trend analysis.
• Data Marts - a large data store for informational needs whose scope is limited to a department, SBU, etc. In a phased implementation, data marts are a way to build a warehouse.
• Summarized Data - aggregated data along the lines required for executive reporting, trend analysis and decision support.
• Metadata - data about the data: description of contents, location, structure, end-user views, identification of authoritative data, history of updates, security authorizations.

Introduction - Architecture
[Architecture diagram: external and current operational data flow through an Extract, Cleanup & Load layer into a realized or virtual warehouse (MDDB) with a metadata repository; an information delivery system serves report, query & EIS tools, OLAP tools, and data mining tools.]

• The Data Warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making.
– Integrated: the Data Warehouse is a centralized, consolidated database that integrates data retrieved from the entire organization.
– Subject-Oriented: the Data Warehouse data is arranged and optimized to provide answers to questions coming from diverse functional areas within a company.
– Time-Variant: the warehouse data represent the flow of data through time; it can even contain projected data.
– Non-Volatile: once data enter the Data Warehouse, they are never removed; the Data Warehouse is always growing.

Major Tasks in Data Preparation
• Data cleaning - fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration - integration of multiple databases, data cubes, or files
• Data transformation - normalization and aggregation
• Data reduction - obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization - part of data reduction, of particular importance for numerical data

Extraction, Cleanup, Integration
• Data Cleaning - Missing Values
– Ignore the tuple
– Fill in the value manually
– Use a global constant to fill the value
– Use the attribute mean as the missing value, e.g., the average income of all customers (say 30,000 p.m.)
– Use the attribute mean of all samples belonging to the same class, e.g., fill a missing income with the average income of customers in the same class (credit_risk, emp_status)
– Use the most probable value: regression, Bayesian classifiers, decision tree induction

Extraction, Cleanup, Integration
• Data Cleaning - Noisy Data: a random error or variance in a measured value. Given price, how can we smooth out the data to remove the noise?
– Binning: smooth the sorted data by consulting the neighbours (see the sketch after this slide).
» Given 4 8 15 21 21 24 25 28 34
» Partition it into Bin 1: 4 8 15; Bin 2: 21 21 24; Bin 3: 25 28 34
» Replace the bin values by the bin mean or by the bin boundaries
– Clustering
– Regression: smooth the data by fitting it to functions
• Inconsistent Data - corrected manually or through a rule base
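To make the binning step concrete, here is a minimal sketch in Python that partitions the sample values from the slide into equal-depth bins and smooths them by bin means. The function name and default bin size are illustrative, not part of the slides.

def smooth_by_bin_means(values, bin_size=3):
    # Sort, split into equal-depth bins, and replace each value
    # by the mean of its bin.
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices))
# Bin 1 (4 8 15) -> 9.0, Bin 2 (21 21 24) -> 22.0, Bin 3 (25 28 34) -> 29.0

Smoothing by bin boundaries would instead replace each value by whichever of the bin's minimum or maximum it is closest to.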
Data Transformation
• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling

Data Transformation: Normalization
• min-max normalization:
v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
Suppose that the minimum and maximum values for the attribute income are £12,000 and £98,000, respectively, and we map income to the range [0.0, 1.0]. By min-max normalization, a value of £73,600 for income is transformed to (73600 - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0 = 0.716.
• z-score normalization:
v' = (v - mean_A) / stand_dev_A
• normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
(A short code sketch of these three normalizations appears at the end of this section.)

Star Schema
• The star schema is a data-modeling technique used to map multidimensional decision support data into a relational database.
• Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.
• Four components:
– Facts
– Dimensions
– Attributes
– Attribute hierarchies

A Simple Star Schema (figure)

Star Schema
• Facts
– Facts are numeric measurements (values) that represent a specific business aspect or activity.
– The fact table contains facts that are linked through their dimensions.
– Facts can be computed or derived at run time (metrics).
• Dimensions
– Dimensions are qualifying characteristics that provide additional perspectives to a given fact.
– Dimensions are stored in dimension tables.

Star Schema
• Attributes
– Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.
– Dimensions provide descriptive characteristics about the facts through their attributes.

Possible Attributes for Sales Dimensions (figure)
Three-Dimensional View of Sales (figure)
Slice-and-Dice View of Sales (figure)

Star Schema
• Attribute Hierarchies
– Attributes within dimensions can be ordered in a well-defined attribute hierarchy.
– The attribute hierarchy provides a top-down data organization that is used for two main purposes:
» Aggregation
» Drill-down/roll-up data analysis

A Location Attribute Hierarchy (figure)
Attribute Hierarchies in Multidimensional Analysis (figure)

Example of Star Schema
• Sales Fact Table (measures: units_sold, dollars_sold, avg_sales) with foreign keys time_key, item_key, branch_key, location_key
• time dimension: time_key, day, day_of_the_week, month, quarter, year
• item dimension: item_key, item_name, brand, type, supplier_type
• branch dimension: branch_key, branch_name, branch_type
• location dimension: location_key, street, city, province_or_state, country

Sample data (used by the roll-up sketch that follows):

Sales Fact Table
Time_key | Prod_key | Loca_key | Branch_key | Unit_sales | $_sales | $_cost
1        | 2        | 2        | 1          | 20         | 100000  | 5000
4        | 6        | 17       | 8          | 30         | 600     | 20
6        | 12       | 4        | 77         | 25         | 5000    | 200

Time Table
Time_key | Day | Day_week | Month | Season | Year
6        | Mon | 6        | Jan   | Q1     | 2008
4        | Tue | 4        | Jan   | Q1     | 2008
15       | Wed | 1        | Jan   | Q1     | 2008

Item Table
Item_key | Item_name | Brand  | Type      | Supplier
12       | HP Cam    | HP     | HP3225    | HP Delhi
6        | IBM PC    | IBM PC | IBM PC 2G | IBM New York
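To show how the fact table and dimension tables work together, here is a minimal sketch in Python that holds the sample tables above as in-memory dictionaries and rolls units sold up to the month level along the time dimension. In a real warehouse this would be a SQL join and GROUP BY against relational tables; the dictionary layout and variable names here are only illustrative.

# Time dimension keyed by its surrogate key, taken from the sample data above.
time_dim = {
    6: {"day": "Mon", "month": "Jan", "season": "Q1", "year": 2008},
    4: {"day": "Tue", "month": "Jan", "season": "Q1", "year": 2008},
    15: {"day": "Wed", "month": "Jan", "season": "Q1", "year": 2008},
}

# Fact rows: (time_key, prod_key, loca_key, branch_key, unit_sales, sales, cost)
sales_fact = [
    (1, 2, 2, 1, 20, 100000, 5000),
    (4, 6, 17, 8, 30, 600, 20),
    (6, 12, 4, 77, 25, 5000, 200),
]

# Roll up unit_sales along the time dimension (time_key -> month).
# Fact rows whose time_key has no matching dimension row (key 1 above) are skipped.
units_by_month = {}
for time_key, _prod, _loca, _branch, units, _sales, _cost in sales_fact:
    dim_row = time_dim.get(time_key)
    if dim_row is None:
        continue
    month = dim_row["month"]
    units_by_month[month] = units_by_month.get(month, 0) + units

print(units_by_month)  # {'Jan': 55}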
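The three normalization formulas given earlier (min-max, z-score, decimal scaling) can also be checked with a short sketch. This is a minimal illustration in plain Python; the function names are not from the slides, and the final line reproduces the income example (73600 maps to about 0.716).

import statistics

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, values):
    # v' = (v - mean_A) / stand_dev_A
    return (v - statistics.mean(values)) / statistics.stdev(values)

def decimal_scaling(v, values):
    # v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
    j = 0
    while max(abs(x) for x in values) / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))  # 0.716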