Combining the Power of R and Excel: RExcel

A LISA Short Course
February 2012
Matthew Lanham
Ph.D. Student, Business Information Technology
M.S. Student, Statistics
As you come in, please get materials here:
https://filebox.vt.edu/users/lanham/LISA/
Motivation for this course:
Two Facts
1. Excel is the most prevalent software used for data storage and analysis. Excel has
many built-in statistical functions in addition to the “Analysis ToolPak.”
2. R is a free and open-source program, and one of the most powerful and
fastest-growing statistics programs.
Why not use them both together?
[Figure: "This is you with Excel" vs. "This is you with Excel + R"]
Outcome from this course:
I hope to provide you with some examples that you can
incorporate into your own work and that may prove beneficial.
Let's get started:
1) Double-click the "RExcel2010 with Rcommander" icon.
This will open Excel and Rcommander. Rcommander is
like the standard R GUI, but looks a bit different.
You will find R in the Excel Ribbon as well.
Part 1
Transferring data between R and Excel
• Data from Excel to R
• Data from R to Excel
RExcel Drop-down
Close R – Will close the open instance of R and Rcommander as well
Run Code – Will run R code
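Get R Value (Array or Dataframe) – Retrieves the value of an R object (array or data frame) from R into Excel
Put R Value (Array or Dataframe) – Sends a cell or range from Excel to R (as an array or data frame)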
Get R Output – Retrieves code output from R to Excel
Set R working dir – Define the folder location you want to work from on your PC.
Load R file – Used to load a data set or .R file
Copy code – copies code in Excel
Debug R – If checked, this will open a debugger if an error occurs
Error log – This will show you all the R errors
Options – Offers a few basic options
Set R server – Lets you select the server type, server name (for remote servers), and R process
name (for servers from a server pool).
RExcel Help – Takes you here: file:///C:/Program%20Files%20%28x86%29/RExcel/doc/RExcel.html
Rhelp – Takes you here: http://127.0.0.1:18357/doc/html/index.html
Rcommander – Opens Rcommander with menus in the Excel Ribbon or in Rcommander.
Demo worksheets – There are five demos for learning how to use the software
Mark calc cells – If activated, this will mark all cells containing calculated results with a special
marker in the upper left corner
About RExcel
Part 1
Functions, Arrays, and Dataframes
• Advantages
– Use Excel as a container for dependencies
– Use R code functions without lengthy “IF”
statements
– Allows automatic recalculations via Excel’s
computation engine (R will not do this by itself)
See RExcelExamples workbook, Part1 tab
Regression: Excel and R
Excel
1. Excel Functions
TREND(Y-range, X-range, X-value for prediction) function
LINEST(Y-range, X-range, Const, Stats) array function
2. Excel’s Analysis ToolPak
Data -> Data Analysis -> Regression -> Then fill in the dialog box (see example sheet)
R
1. Use Rcommander
"Statistics" -> "Fit models" -> "Linear regression.."
2. Use R code via RExcel
   myfit = lm(formula = Sales ~ Advertising, data = salesdata)
   summary(myfit)
Benefits of each: use whichever you prefer and whichever is more advantageous for your problem
• The Excel functions automatically update
• The Analysis ToolPak outputs the statistics in a nice readable table
• Rcommander has nice drop-down menus
• R provides plots that are not easily available via Excel alone
• R is more extensible and allows more advanced modeling
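The R route listed above can be sketched end to end as follows. This is only an illustration: the `salesdata` values are made up (they are not the course data set), and `predict()` is used here as a rough counterpart to Excel's TREND function.

```r
# Hypothetical data in $1000s (NOT the course data set)
salesdata <- data.frame(
  Advertising = c(40, 50, 60, 70, 80, 90, 100, 110),
  Sales       = c(150, 200, 270, 330, 380, 450, 520, 570)
)

# Fit the simple linear regression Sales = b0 + b1 * Advertising
myfit <- lm(Sales ~ Advertising, data = salesdata)
print(summary(myfit))  # coefficients, R-squared, F-test (comparable to the ToolPak table)

# Predict Sales at a new Advertising value, similar in spirit to TREND()
pred <- predict(myfit, newdata = data.frame(Advertising = 75))

# For simple regression, R-squared is the squared Pearson correlation
r2 <- summary(myfit)$r.squared
```

Unlike the Excel functions, this fit does not update by itself when the data change; the code has to be re-run (or triggered through RExcel).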
Part 2
Regression: Assumption Review
[Figure: "Sales vs. Advertising" plot; x-axis: Advertising (in $1000s), y-axis: Sales (in $1000s)]
Gauss-Markov Theorem
The theorem tells us that our OLS estimators (our intercept and slope) are unbiased and have minimum
variance among all linear unbiased estimators IF...
Two assumptions:
(1) Independence => $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
(2) Equal variance (a.k.a. homoscedasticity, same finite variance) => $\mathrm{Var}(\varepsilon_i) = \sigma^2$
To make inferences, perform statistical tests, and create confidence intervals, we
need to assume a third condition:
(3) Error is normally distributed => $\varepsilon_i \sim N(\text{mean} = 0, \text{variance} = \sigma^2)$
Part 3
Regression: Assumption investigation (a.k.a. Diagnostics)
• The linear relationship between Sales and Advertising looks fine.
• What do you think about our independence assumption?
• What do you think about the constant finite variance assumption?
• What about normality?
• Anything else stand out?
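The questions above are usually answered with residual plots plus a few numeric checks. A minimal sketch, assuming made-up data (not the course data set) in which the last observation is deliberately unusual:

```r
# Hypothetical data (NOT the course data set); the last point is deliberately unusual
salesdata <- data.frame(
  Advertising = c(40, 50, 60, 70, 80, 90, 100, 110, 115),
  Sales       = c(150, 205, 262, 335, 378, 455, 515, 572, 300)
)
myfit <- lm(Sales ~ Advertising, data = salesdata)

res  <- residuals(myfit)       # raw residuals: look for patterns and unequal spread
rstd <- rstandard(myfit)       # standardized residuals
cd   <- cooks.distance(myfit)  # influence of each observation on the fitted line

# One numeric normality check on the residuals (Shapiro-Wilk test)
sw <- shapiro.test(res)

# The standard diagnostic plots (residuals vs. fitted, normal Q-Q,
# scale-location, residuals vs. leverage) can be drawn with:
# par(mfrow = c(2, 2)); plot(myfit)

# Row number of the observation with the largest Cook's distance
most_influential <- unname(which.max(cd))
```

In this made-up example the unusual last row stands out as the most influential observation, which is the kind of thing "Anything else stand out?" is asking about.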
Regression: Fit without influential points
Here we see our new fitted line, in addition to how
well our model performed at estimating sales.
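One possible way to produce such a refit is sketched below. The data are made up (not the course data set), and the cutoff of Cook's distance greater than 4/n is a common rule of thumb assumed here for illustration; it is not necessarily how the course identified influential points.

```r
# Hypothetical data (NOT the course data set); the last point is deliberately unusual
salesdata <- data.frame(
  Advertising = c(40, 50, 60, 70, 80, 90, 100, 110, 115),
  Sales       = c(150, 205, 262, 335, 378, 455, 515, 572, 300)
)

# Fit with all observations
fit_all <- lm(Sales ~ Advertising, data = salesdata)

# Flag influential points: Cook's distance above 4/n (assumed rule of thumb)
cd   <- cooks.distance(fit_all)
keep <- cd <= 4 / nrow(salesdata)

# Refit without the flagged observations
fit_trim <- lm(Sales ~ Advertising, data = salesdata[keep, ])

# Compare how well each model explains Sales
r2_all  <- summary(fit_all)$r.squared
r2_trim <- summary(fit_trim)$r.squared
```

Dropping observations changes the question being answered, so it is worth reporting both fits and explaining why the points were removed.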
Regression: More diagnostics
• The linear relationship between Sales and Advertising is fine.
• What do you think about our independence assumption?
• What do you think about the constant finite variance assumption?
• What about normality?
• Anything else stand out?
Regression: Interpretation
Regression Statistics: What is this calculation?
Multiple R         0.956   The Pearson correlation of x and y for simple regression, or the square root of R Square.
R Square           0.914   A.k.a. Multiple R-squared; the fraction of total variation explained by the model.
Adjusted R Square  0.909   Similar to R Square, but adjusts for the number of covariates in the model.
Standard Error     32.707  The standard error of our residuals.
Observations       18      The total number of observations we used.
Regression Statistics: What does it mean?
Multiple R         0.956   There is a strong positive linear relationship between advertising and sales.
R Square           0.914   91.4% of the variation in sales is explained by the variation in advertising.
Adjusted R Square  0.909   90.9% of the variation in sales is explained by the variation in advertising, accounting for the number of covariates used.
Standard Error     32.707  This is our measure of spread or variability for the residuals in the model.
Observations       18      This is the total number of observations we used.
ANOVA
The F and Significance F tell us that our slope is statistically significant,
meaning it is highly unlikely that the true slope is 0.
             df    SS          MS          F       Significance F
Regression   1     181861.7    181861.7    170.0   0.000
Residual     16    17116.4     1069.8
Total        17    198978.0
ANOVA - This is just a table that summarizes the levels of variation.
             df                 SS                                        MS                  F         Significance F
Regression   k = # covariates   SSR = variation in the mean response      MSR = SSR/k         MSR/MSE   p-value of F-test
Residual     n - 1 - k          SSE = variation in residuals              MSE = SSE/(n-1-k)
Total        n - 1              SST = total variation in the response
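As a check on these definitions, the summary statistics can be recomputed from the sums of squares reported in the ANOVA output (SSR = 181861.7, SSE = 17116.4, SST = 198978.0, n = 18, k = 1). The variable names below are chosen for illustration only.

```r
# Values taken from the ANOVA output for the Sales vs. Advertising model
n   <- 18         # observations
k   <- 1          # number of covariates
SSR <- 181861.7   # regression sum of squares
SSE <- 17116.4    # residual sum of squares
SST <- 198978.0   # total sum of squares

MSR <- SSR / k               # mean square for regression
MSE <- SSE / (n - 1 - k)     # mean square error
F_stat  <- MSR / MSE         # F statistic
p_value <- pf(F_stat, df1 = k, df2 = n - 1 - k, lower.tail = FALSE)  # Significance F

r_squared     <- SSR / SST                    # R Square
adj_r_squared <- 1 - MSE / (SST / (n - 1))    # Adjusted R Square
std_error     <- sqrt(MSE)                    # Standard Error of the residuals
multiple_r    <- sqrt(r_squared)              # Multiple R

print(round(c(F = F_stat, R2 = r_squared, adjR2 = adj_r_squared,
              SE = std_error, R = multiple_r), 3))
```

The results agree with the reported table up to rounding of the inputs.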
              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     -109.64        37.46            -2.93    0.010     -189.06     -30.23
Advertising   6.26           0.48             13.04    0.000     5.24        7.28

              Coefficients   Standard Error              t Stat      P-value                  95% CI for parameter
Intercept     y-int          seY = s.e. of y-intercept   y-int/seY   significance for y-int   Lower limit, Upper limit
Advertising   slope          seX = s.e. of slope         slope/seX   significance for slope   Lower limit, Upper limit
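The t statistics, p-values, and 95% confidence limits can be reproduced from the reported coefficients and standard errors, using the residual degrees of freedom (16) from the ANOVA output. This is a sketch with illustrative variable names; because the inputs are rounded to two decimals, the last digit can differ slightly from the Excel output.

```r
# Reported estimates and standard errors for the Sales vs. Advertising model
est <- c(Intercept = -109.64, Advertising = 6.26)
se  <- c(Intercept = 37.46,   Advertising = 0.48)
df_resid <- 16   # n - 1 - k = 18 - 1 - 1

t_stat <- est / se                                                # t Stat
p_val  <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)  # two-sided P-value

t_crit <- qt(0.975, df = df_resid)   # critical value for a 95% interval
lower  <- est - t_crit * se          # Lower 95%
upper  <- est + t_crit * se          # Upper 95%

print(round(cbind(est, se, t_stat, p_val, lower, upper), 3))
```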
Using R commander
• Obtain data sets from R libraries or load in your own
• Nice drop-down for basic statistics and plots (code
prints to R commander window)
• Common distributions are available via drop-down
See RExcelExamples workbook, Part4 tab
Part 4
Using built-in R commander plug-ins
Part 5
Let's look at RcmdrPlugin.HH
These plots are useful, but somewhat dull. The code that generates these will show
up in the R commander window (very useful for newbies). Like plotting in Excel, you
can get what you need by default, but you’ll probably have to modify the graph a
bit.
XY conditioning plot (HH)
Side-by-side Boxplot
Additional References
• http://rcom.univie.ac.at/RExcelDemo/