EIN 4905/ESI 6912 Decision Support Systems Excel

Download Report

Transcript EIN 4905/ESI 6912 Decision Support Systems Excel

Spreadsheet-Based Decision Support Systems

Chapter 7: Statistical Analysis Prof. Name Position University Name [email protected]

(123) 456-7890

     7.1 Introduction 7.2 Understanding Data 7.3 Relationships in Data 7.4 Distributions 7.5 Summary

Overview

2

Introduction

 Performing basic statistical analysis of data using Excel functions  Statistical features of the Data Analysis Toolpack  Trend curves for analyzing data patterns  Basic linear regression techniques in Excel  Several different distribution functions in Excel 3

Understanding Data

 Statistical Functions  Descriptive Statistics  Histograms 4

Statistical Functions

AVERAGE

– Finds the mean of a set of data.

=AVERAGE(range or range_name)

MEDIAN

– Finds the middle number in a list of sorted data.

=MEDIAN(range or range_name)

STDEV

– Finds the standard deviation of a set of data.

– This is equal to the square root of the

variance

, which measures the difference between the mean of the data set and the individual values.

=STDEV(range or range_name)

5

Figures 7.1 and 7.2

6

Figures 7.3 and 7.4

7

Analysis Toolpack

 An Excel Add-In which includes several statistical analysis techniques  To ensure that it is an active Add-in, choose

Tools > Add-ins

from the menu. Select

Analysis Toolpack

from the list. 8

Descriptive Statistics

 Provides a list of statistical information about your data set including – Mean – Median – Standard deviation – Variance  Go to

Tools > Data Analysis > Descriptive Statistics

9

Descriptive Statistics (cont)

 The

Input Range

the data set. refers to the location of  You can check whether your data is

Grouped By Columns

or

Rows

.  If there are labels in the first row of each column of data, then check the

Labels in First Row

box.  The

Output Range

refers to where you want the results of the analysis to be displayed in the current worksheet.  The

Summary Statistics

calculate the most commonly used statistics from our data.

box will 10

Figure 7.7

 Quarterly stock returns for three different companies are recorded. We want to know – Average stock return – Variability of stock returns – Which quarters had the highest and lowest stock returns 11

Figures 7.8 and 7.9

12

Figure 7.11

 The standard deviation can be used to understand how common

outliers

are in the data.

13

More Descriptive Statistics

Confidence Level for Mean

– The mean is calculated using the specified confidence level (for example, 95% or 99%), the standard deviation, and the size of the sample data.

– The confidence level and calculated mean are then added to the analysis report.

– You can compare the actual mean to this calculated mean based on the specified confidence level.

Kth Largest

– Gives the largest ranked data value for a specified value of k.

– For k = 1, the maximum data value would be returned.

Kth Smallest

– Gives the smallest ranked data value for a specified value of k.

– For k = 1, the minimum data value would be returned.

14

Descriptive Statistics Functions

 

PERCENTILE

– Returns a value for which a desired percentile

k

=PERCENTILE(data_set, k)

of the specified

data_set

falls below.

For example, for the MSFT data, the value for which 95% of the data falls below is –

=PERCENTILE(B4:B27,0.95) =

0.108

 

PERCENTRANK

– Returns the percentile of the

data_set

which falls below a given

value

. –

=PERCENTRANK(data_set, value)

For example, the percent of the MSFT data which falls below the value 0.108 is –

=PERCENTRANK(B4:B27, 0.108) =

0.95, or 95% 15

Histograms

 Histograms calculate the number of occurrences, or frequency, which values in a data set fall into various intervals.  Choose the

Histogram

option from the

Analysis Toolpack

list.

16

Histograms (cont’d)

 The

Input Range

set. is the range of the data  The location of the bin values. –

Bin Range

Bins

are the intervals into which values can fall; they can be defined by the user or can be evenly distributed among the data by Excel. is used to specify the  The

Output Range

output, or the frequency calculations for each bin. is the location of the  The chart options include a simple

Chart Output

(the actual histogram),

Cumulative Percentage Pareto

for each bin value, and a organization of the chart.

17

Figures 7.15 and 7.16

18

Figures 7.17 and 7.18

 To create your own bin values, make a list of upper bounds for each interval.

19

Figure 7.19

20

Histograms (cont’d)

 Histograms can also be formatted.

– Right-click on the histogram and change the

Chart Options

or other parameters.

21

Histograms (cont)

 There are four basic shapes to a histogram: –

Symmetric:

has only one peak; that is, there is a central high part and almost equal lower parts to the left and right of this peak. – –

Positively skewed:

has a peak on the left and many lower points (stretching) to the right.

Negatively skewed:

has a peak on the right and many lower points (stretching) to the left.

Multiple peaks:

imply that more than one source, or population, of data is being evaluated.

22

 Trend Curves  Regression

Relationships in Data

23

Data Relationships

 Relationships in data are usually identified by comparing two variables: the

dependent variable

and the

independent variable

. – The dependent variable is the variable we are most interested in. By understanding its current behavior we can better predict its future behavior. – The independent variable is the variable we use as the comparison in order to make this prediction. 24

Trend Curves

 Trend curves are used to graph and analyze these relationships between data.

 Trend curves graph the data with – the independent variable on the x-axis – the dependent variable on the y-axis  To add a trend curve to your chart, right-click on the data points in an

XY Scatter

chart and choose

Add Trendline

from the drop-down list of options.

25

Trend Curves (cont’d)

 There are five basic trend curves which Excel can model: – – – – –

Linear Exponential Power Moving Average Logarithmic

26

Trend Curves (cont’d)

 Click on the

Options

tab to set options for the trend curve.

 Set the name of the trendline.

 Specify a period forward or backwards for which you want to predict the behavior of your dependent variable.

 Check to

Display Equation Display R-Squared Value

. and 27

Linear Trend Curves

 Number of

Units Produced

each month and the corresponding

Monthly Plant Cost

are recorded.

 The company wants to be able to estimate their plant costs based on the planned production amounts.  The

dependent variable

is therefore the

Monthly Plant Cost

and the

independent variable

is the

Units Produced

. 28

Figures 7.26 and 7.29

 Graph the data and then add a Linear trendline.

29

Figure 7.30

  Use the displayed equation to predict future values.

First check the accuracy of the equation by calculating the error from the known data.

 Linear trends have the relationship:

y = a*x - b

30

Exponential Trend Curves

Sales

data for ten years is recorded.  We want to be able to predict sales for the next few years.

 The

independent variable

Sales

. is

Years

and our

dependent variable

is 31

Figures 7.34 and 7.35

 Exponential trends have the relationship:

y = a*e^(b*x)

or

y = a*EXP(b*x)

32

Power Trend Curves

 We are given yearly

Production

values and yearly

Unit Cost

for production.  We want to determine the relationship between

Unit Cost

in order to be able to predict future

Unit Costs

.

and

Production

33

Figures 7.39 and 7.40

 Power trends have the relationship:

y = a*x^b

34

Regression Analysis

 We can use some regression analysis parameters to ensure that the relationships we have chosen for our data are “good” fits.

 These parameters include – – – –

R-Squared value Standard error Slope Intercept

35

R-Squared Value

 The

R-Squared

value measures the amount of influence the independent variable has on the dependent variable.

 The closer the R-Squared value is to 1, the stronger the relationship is between the independent and dependent variables.  If the R-Squared value is closer to 0, then there may not be a relationship between these two variables. 36

Figure 7.42

 We fit a

Linear

trendline to the

Monthly Plant Cost per Units Produced

chart (see Figure 7.44).  The R-Squared value is 0.8137, which is fairly close to 1, implying a good fit.

37

Figure 7.45

 The

RSQ

Excel function can calculate the R-squared value from a set of data.

=RSQ(y_range, x_range)

 Note that this function only works with Linear trend curves. 38

Standard Error

 The standard error measures the accuracy of any predictions made.  It can be calculated in Excel using the

STEYX

function –

=STEYX(y_range, x_range)

 This function can also only be used for Linear trend curves. 39

Slope and Intercept

 Two Excel functions can be used with a linear regression line of a collection of data.

SLOPE

function –

=SLOPE(y_range, x_range)

INTERCEPT

function –

=INTERCEPT(y_range, x_range)

40

Distributions

 Many distributions have Excel functions associated with them. – These functions are basically equivalent to using distribution tables. – That is, given certain parameters of a set of data for a particular distribution, you would look at a distribution table to find the corresponding area from the distribution curve.  Some common distributions are – Normal – Exponential – Uniform – Binomial – Poisson – Beta – Weibull 41

Normal Distribution

 The parameters for this distribution are simply the value we are interested in finding the probability for, and the mean and standard deviation of the set of data.  The function we use with the Normal distribution is

NORMDIST

=NORMDIST(x, mean, std_dev, cumulative)

42

Normal Distribution (cont)

 The

cumulative

parameter will be seen in many Excel distribution functions.  This parameter can take the values

True

or

False

to determine if you want the value returned from the

cumulative distribution function

or the

probability density function

, respectively. – The

cumulative distribution function

(

cdf

) will find the probability that a value in the data set is less than or equal to x.

– The

probability density function (pdf

) will find the probability that a value is exactly equal to x. 43

Figure 7.48

 Annual drug sales at a local drugstore are distributed Normally with a mean of 40,000 and standard deviation of 10,000.  The probability that the actual sales for the year are 42,000 is 0.58, or 58%.

44

Figure 7.49

   What is the probability that annual sales will be between 35,000 and 49,000? To find this value, we will subtract the

cdf

values for these two bounds.

=NORMDIST(49000, 40000, 10000, True) – NORMDIST(35000, 40000, 10000, True)

This will return a 0.51 probability, or 51% chance.

45

Standard Normal Distribution

 If the mean of your data is 0 and the standard deviation is 1, then placing these values in the NORMDIST function with the

cumulative

parameter as

True

will find the resulting value from the Standard Normal distribution.  The

STANDARDIZE

function will convert the x value from a data set of a mean not equal to 0 and a standard deviation not equal to 1 into a value which does assume a mean of 0 and a standard deviation of 1.

=STANDARDIZE(x, mean, std_dev)

 The resulting standardized value is then used as the main parameter in the

NORMSDIST

function –

=NORMSDIST(standardized_x)

46

Figure 7.50

 Consider the same example used previously to find the probability that a drugstore’s annual sales are 42,000. 47

Uniform Distribution

 The Uniform distribution does not actually have a corresponding Excel function; however, a simple formula can be used to model the Uniform distribution.

1 / (b – a)

 Given that a value x is Uniformly distributed between

a

and

b

, we can use this formula to determine the probability that x will have an integer value in this interval. 48

Figure 7.51

 Consider any values for a and b, then use the formula to calculate the Uniform value.

49

Poisson Distribution

 The Poisson distribution has only the mean as its parameter.  The function we use for this distribution is

POISSON

=POISSON(x, mean, cumulative)

 The Poisson distribution value is the probability that the number events which occur is either between 0 and x (cdf) or equal to x (pdf).

50

Figure 7.52

  For example, consider a bakery which serves an average of 20 customers per hour. Find the probability that at most 35 customers will be served in the next two hours. 51

Exponential Distribution

 The Exponential distribution has only one parameter:

lambda

of the data set.

= 1 /

mean

 The function we use for this distribution is

EXPONDIST

=EXPONDIST(x, lambda, cumulative)

 The Exponential distribution is commonly used for modeling

interarrival

times.

52

Figure 7.53

   Let us use the same example with the bakery data.

Arrival rate

is said to be 20 customers per hour.

Interarrival mean

, or the Exponential mean, is 1 / arrival rate. Therefore, for this example, the interarrival mean is 1/20 hours per customer arrival.  To find the probability that a customer arrives in 10 minutes, we would set – x = 10/60 = 0.17 hours – lambda = 1/(1/20) = 20 hours –

=EXPONDIST(0.17, 20, True)

53

Binomial Distribution

 The Binomial distribution has the following parameters: the number of trials and the probability of a success.  We are trying to determine the probability that the number of successes is less than or equal to (using

cdf

) or equal to (

pdf

) some x value.  The function we use for this distribution is

BINOMDIST

=BINOMDIST(x, trials, prob_success, cumulative)

54

Figure 7.54

  Suppose a survey shows that 40 percent of people pay more attention to ads in the newspaper, and 60 percent pays more attention to ads on television.

What is the probability that out of 100 people surveyed, 50 of them respond more to ads on television? 55

Beta Distribution

 The Beta distribution has the following parameters:

alpha, beta, A,

and

B

. – –

Alpha

and

beta

are determined from the data set

A

and

B

are optional bounds on the x value for which you want the Beta distribution value  The function we use for this distribution is

BETADIST

=BETADIST(x, alpha, beta, A, B)

 If

A

and

B

are omitted, then a standard cumulative distribution is assumed and they are given the values 0 and 1, respectively.

56

Figure 7.55

 Determine the probability that a team can complete a project in 10 days.  Estimate the total time needed to be 1 to 2 weeks; these estimates will be the bound values, or the

A

and

B

parameters.  Use a mean and standard deviation of 12 and 3 days to compute the alpha and beta parameters.

57

Weibull Distribution

 The Weibull distribution has the parameters

alpha

and

beta

.  The function we use for this distribution is

WEIBULL

=WEIBULL(x, alpha, beta, cumulative)

 The Weibull distribution is most commonly used to determine reliability functions. 58

Figure 7.56

 On average, a lightbulb will last 1200 hours, with a standard deviation of 100 hours. We can use these values to calculate alpha and beta.

 We can now use the WEIBULL distribution to determine the probability that a lightbulb will be reliable for 55 days = 1320 hours.

59

Summary

      The Analysis Toolpack is an Excel Add-In that includes statistical analysis techniques such as

Descriptive Statistics

,

Histograms

,

Exponential Smoothing

,

Correlation

,

Covariance

,

Moving Average

, and others.

The

Descriptive Statistics

about a data set, including the mean, median, standard deviation, and variance. option provides a list of statistical information Histograms calculate the number of occurrences, or frequency, which values in a data set fall into various intervals. Relationships in data are usually identified by comparing the

dependent variable

and the

independent variable

. There are five basic trend curves that Excel can model:

Linear

,

Exponential

,

Power

,

Moving Average

, and

Logarithmic

.

Some of the more common distributions that can be recognized when performing a statistical analysis of data are the

Normal

,

Exponential

,

Uniform

,

Binomial

,

Poisson

,

Beta

, and

Weibull

distributions.

60

 (place links here)

Additional Links

61