Statistical Analysis with Oracle

Download Report

Transcript Statistical Analysis with Oracle

Statistical Analysis with Oracle

Marc Daniau – Product group [email protected]

Introduction

The ANSI SQL offers analytical capabilities useful for advanced data analysis like linear regression and correlation. Beyond supporting the standard SQL functions Oracle provides additional vendor-specific functions that enable to perform statistical analysis such as descriptive statistics and hypothesis testing. In the following presentation we will put into practice the main statistical built-in functions from Oracle 10g release 2. We will also show an example of a user-defined function.

SLIDE 2 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing

SLIDE 3 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Summarizing a data set

The built-in SQL functions involved in the following sample are avg, median, stats_mode, variance, stddev, min, max and percentile_cont .

Note there is no built-in functions for skew and kurtosis; we had to use custom SQL expressions.

SLIDE 4 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

The BOBJ Universe

Here is a sample Universe exposing the stats_* functions.

New functions can be added to the list by updating the ‘oracle.prm’ file.

SLIDE 5 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Box Plot

The quartiles are computed using the Oracle SQL function percentile_cont() within group (order by ()) .

SLIDE 6 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

A real-life example: GE Delivery span

The delivery span focuses on the time between when a customer requested the product and when it was delivered.

The span is a measure of variation similar to the inter quartile range but instead of looking at the middle 50% of the observations, it looks at the middle 90%.

Span = 95th Percentile – 5th Percentile The Oracle SQL functions to compute the span are: percentile_cont(0.95) and percentile_cont(0.05) .

SLIDE 7 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

A real-life example: GE Delivery span

The goal is to squeeze the two sides of the delivery span, days early & days late, ever closer to the center: the exact day the customer desired. Reducing variation is what quality is about.

Days Late Days Early

SLIDE 8 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing

SLIDE 9 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Linear Regression Studying the relationship of two time series

The built-in SQL functions used are regr_* and stats_* .

In this DeskI document, it appears that X and Y increase together.

SLIDE 10 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Linear Regression Checking the normality of the residuals

The SQL function row_number() over() is used. A custom PL/SQL function inverse_phi() is employed for building the X axis; an alternative consists of using a lookup Z table.

SLIDE 11 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

User defined functions

The Oracle user can extend the list of built-in functions with his own PL/SQL functions. Following are examples of custom functions.

The custom function inverse_phi().

SLIDE 12 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

User defined functions

Custom functions can be exposed in the BOBJ SQL editor by adding them in the ‘oracle.prm’ file.

The custom function inverse_phi().

SLIDE 13 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Linear Regression Detecting Outliers (unusual values)

The variables X and Y in this example are not time series. Mortality (Y) tends to decrease as education (X) increases. The limits are computed with the regr_* SQL functions. Ouliers appear in red.

SLIDE 14 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Linear Regression Detecting Outliers (unusual values)

After changing the sigma factor from 2 to 2.5, New Orleans is no longer beyond limits.

SLIDE 15 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Non Linear Regression

Some simple non-linear relationships can be transformed into linear relationships. Every time you can transform a relationship into an equation of the form Y = a + b X , you can use the least squares method to fit the data.

This technique is used for fitting curves such as exponential, logarithmic, power, hyperbola, logistic (Pearl) and Gompertz.

SLIDE 16 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Non Linear Regression Fitting a S-shaped curve (Pearl)

The sample below plots the cumulative sales by month for a given product. The X and Y variables are computed using respectively row_number() over and sum(sum()) over() . The ln() function is used for transformation; regr_slope() and regr_intercept() for getting the regression coefficients.

SLIDE 17 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Non Linear Regression Fitting a S-shaped curve (Gompertz)

The error values are computed within the BOBJ report.

The number of periods to forecast SLIDE 18 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Weighted Least Squares (local regression) Smoothing a time series

The moving slope and intercept are calculated on the RDBMS side. The user chooses the strength of the smoother (moving window size).

SLIDE 19 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Multiple Linear Regression Forecasting

The regression coefficients are calculated on the fly by the rdbms using regr_* functions.

SLIDE 20 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing

SLIDE 21 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Checking for auto-correlation The Lag Plot

The lag plot helps uncover seasonality and patterns from sequential data (e.g. time series) that may have been missed by looking only at a line chart. The lag plot is useful to check auto-correlation (lack of independence in the series).

It consists of a scatter diagram plotting Yt on the vertical axis versus Yt-lag on the horizontal axis.

The Oracle SQL function lag(,) over() is illustrated next.

SLIDE 22 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Checking for auto-correlation The Lag Plot

The sample line chart below tells us there is no shift of average over time. The lag plot (lag 1) on the right shows a pattern that cannot be seen in the line chart. Note that outliers (unusual values) appear more clearly on the lag plot.

SLIDE 23 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Auto-correlation function (ACF)

While the lag plot displays the individual data points for a given lag, the ACF plot gives a summary picture of auto correlation over multiple lags. The ACF plot is useful in identifying seasonal or cyclical patterns in a time series.

It shows the correlation coefficient statistic on the vertical axis over the different lags on the horizontal axis.

The Oracle SQL functions regr_* are used in the following example.

SLIDE 24 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

ACF Plot

The example below is the ACF plot of the monthly Australian beer production. The thin bars represent the correlation coefficients. The dotted flat lines are the significance limits.

The peak on lag 12 indicates seasonality.

SLIDE 25 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing

SLIDE 26 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Comparison types

Organizations measure performance of individual entities such as products, geographies, suppliers or employees by making different types of comparisons : Over time Against pre-determined goals Against a comparator group SLIDE 27 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Evaluation against a group

Most of the examples presented in this section illustrate how to evaluate the performance of an individual entity against a comparator group. We will see various methods: Percent rank T-score Percent of leader Range score Percent of total Individual ratio versus group ratio SLIDE 28 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Percent Rank

The WebI document below ranks countries on economic performance. It involves the following Oracle SQL functions: rank() over() and percent_rank() over() .

SLIDE 29 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

T-score Comparing against the group average

The Xcelsius radar chart here requires the avg() over() and stddev_pop() over() SQL functions. The data set consists of 7 indicators and 7 countries. The blue line represents the group average normalized at value 50. The amber line corresponds to the country T-score (here USA).

SLIDE 30 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

The slicer allows the user to see the performance of the different countries in the G7 group.

Percent of Leader Comparing against the best in the group

The sample document below rates UK versus other nations on educational research. It requires the max() over() SQL functions. The underlying data set covers 35 nations and includes 7 indicators within a period from 1987 to 1998.

SLIDE 31 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Range score Positioning the individual in the group range

The report below evaluates companies performance based on 3 indicators. The range score normalizes the results before computing a composite score. It involves the SQL functions min() over() and max() over() .

We apply a weighted average to obtain a composite score. This is done within the BOBJ document.

SLIDE 32 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Percent of Total

Even though it is limited to additive indicators such as revenue, cost or number of customers, the percent of group total is commonly used. The Pareto chart is an example of graph that displays the percentage of each entity (represented as bars) relative to the total value of the group. It helps identify the largest contributors by presenting them first on the left side of the graph.

SLIDE 33 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Percent of Total - Pareto chart

The Oracle built-in function ratio_to_report() over() enable to compute the percent of total required by the Pareto chart.

SLIDE 34 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Cumulative Percent of Total

Here the percent of total for the two indicators is obtained with the ratio_to_report() over() SQL function. The cumulative sum is done within the BOBJ crosstab.

The report indicates that Callahan generates the largest revenue with the smallest number of customers.

We also observe that roughly half of the Sales (48.7%) is generated by a quarter of the customers (22.4%) … and by a quarter of the sales force (Callahan and King).

SLIDE 35 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Individual ratio versus group ratio

We just saw how we can apply a percent of total to additive indicators like Sales and Profit. Let’s assume we want to track the performance of Gross Profit rate calculated as Gross Profit divided by Gross Sales. The percent of total analysis will not work in that case since ratios are not additive. We could use the average of the Gross Profit rates in the group as the baseline but instead we will compare to the group Gross Profit rate obtained as: group total Gross Profit / group total Gross Sales .

SLIDE 36 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Individual ratio versus group ratio

We use the rollup() SQL function in order to generate on the fly a summary row that aggregates values for Sales, Profit and Profit rate. SLIDE 37 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

The blue cursors on both sides of the table indicate the group row which serves as the baseline. The categories above the baseline exhibit a better profit rate than the group.

Measuring process performance

Companies that implement a quality improvement methodology measure performance of their processes by analyzing the variation. A process is given upper and lower specification limits (USL & LSL) and a target. Variation over the target is measured to evaluate how capable the process is relative to the specifications. One process performance measurement method is presented next for illustration. SLIDE 38 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Process capability/performance Measuring process variations

The Oracle built-in SQL function stats_one_way_anova() returns the variations.

The custom PL/SQL function phi() is required to get the ppm/defect rates.

SLIDE 39 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing

SLIDE 40 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Frequency Histogram – 1/3

The bucketing of individual values is performed with the width_bucket() function. The count aggregation is done by the rdbms. Empty buckets are handled in the BOBJ report.

The data set is from the Oracle HR demo database.

Salary exibits a distribution stretched out towards the right.

SLIDE 41 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Frequency Histogram – 2/3

Running the query.

The 3 arguments of the width_bucket() function are prompted The end-user can enter a custom value: e.g. 0.277

SLIDE 42 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Listing the main power-transforms.

Frequency Histogram – 3/3

The square root transform brings some symmetry to the salary data. The power-transforms require the following SQL functions: power() and ln() .

SLIDE 43 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Automatic bucketing

The BOBJ document below requires no inputs from the user with regards to the histogram buckets. We use the Sturges rule to determine the number of bars automatically. The histogram count aggregation is performed by the rdbms.

The normal plot complements the histogram in assessing the normality of the data set.

SLIDE 44 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

User-defined intervals

In the sample below the end-user enters the histogram interval width. The count aggregation is done by the rdbms.

The Gaussian curve is estimated within the BOBJ document.

The blue bars represent the differences between the observed frequencies (histogram bars) and the estimated frequencies (fitted curve). The Gaussian curve is turned into a horizontal line (zero baseline).

SLIDE 45 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Weibull distribution

The Hazard plot below uses the following SQL Functions: row_number() over() and sum() over() . The regression line is computed within the BOBJ document.

SLIDE 46 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics Descriptive Statistics Linear Regression Correlation Performance measurement Distribution analysis Hypothesis Testing

SLIDE 47 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

t-test on a single group Comparing the mean to a target

In this sample report we use the Oracle SQL function stats_t_test_one() .

SLIDE 48 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

t-test for two paired groups Before-After comparison of mean

In this sample report we use the Oracle SQL function stats_t_test_paired() .

SLIDE 49 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

t-test for two independent groups Comparing means

In this sample report we use the Oracle SQL function stats_t_test_indep() .

SLIDE 50 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Analysis of variance Comparing means across multiple groups

The Oracle SQL function stats_one_way_anova() is used here. The following sample involves six groups.

SLIDE 51 COPYRIGHT © 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.