If you do not have access to advanced statistical software, Excel is an excellent option for running multiple regressions. It can perform various statistical analyses, including regression analysis, and is easy to learn. Its accessibility makes it a practical choice for nearly everyone.
Even if Excel isn’t your primary statistical software package, this post is a good introduction to performing and interpreting regression analysis. In it, I provide step-by-step instructions for conducting multiple regression analysis in Excel.
I also show you how to specify the model, choose the appropriate options, assess the model, and check the assumptions. Links to additional resources that I have written explain regression analysis concepts in a way that Excel’s documentation does not. Lastly, I provide an example dataset so we can work through how to do multiple regression in Excel together.
What is Linear Regression?
Linear regression models a linear relationship between an independent variable and a dependent variable, and the fitted line is usually shown on a plot. It is typically used to display the strength of the relationship and the dispersion of the results, with the aim of explaining the behavior of the dependent variable.
Let’s say we wanted to evaluate the strength of the relationship between the quantity of ice cream consumed and weight gain. We would take the quantity of ice cream consumed as the independent variable and relate it to the dependent variable, obesity, to determine whether there is a relationship. Since a regression plot is a visual representation of this relationship, the lower the variability in the data, the stronger the relationship and the tighter the fit to the regression line.
- Linear regression is a model for the dependence between a dependent variable and an independent variable(s).
- Regression analysis is appropriate when the variables are independent, there is no heteroscedasticity, and the error terms of the variables are not correlated.
- Modeling linear regression in Excel is easier with the Data Analysis ToolPak.
Your data set must meet a few essential assumptions before you can run a regression analysis:
- The variables must be truly independent (a Chi-square test can be used to check this).
- The data must not have different error variances (this is called heteroskedasticity, also spelled heteroscedasticity).
- The error terms of each variable must be uncorrelated. If they are correlated, the variables are serially correlated.
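One common diagnostic for the last assumption is the Durbin-Watson statistic, which is not something this post or Excel's regression tool computes for you. The sketch below is a minimal illustration in Python, using made-up residuals. The statistic ranges from 0 to 4: values near 2 suggest little serial correlation, values well below 2 suggest positive serial correlation, and values well above 2 suggest negative serial correlation.

```python
def durbin_watson(residuals):
    """Return the Durbin-Watson statistic for residuals in observation order."""
    # Sum of squared changes between consecutive residuals.
    squared_steps = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    # Sum of squared residuals.
    squared_total = sum(e ** 2 for e in residuals)
    return squared_steps / squared_total


# Made-up residuals, purely for illustration.
alternating = [0.5, -0.5, 0.5, -0.5]  # sign flips at every step
drifting = [1.0, 1.0, -1.0, -1.0]     # neighbouring residuals tend to agree

print(durbin_watson(alternating))  # 3.0
print(durbin_watson(drifting))     # 1.0
```

In practice you would paste in the residuals that Excel writes out when you tick the “Residuals” option.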
How To Do Multiple Regression In Excel?
1. Open Excel.
2. Click the “Data” tab to check whether the “Data Analysis” ToolPak is active. If the option does not appear, you will need to enable the add-in:
- Select “Options” from the “File” menu (or use Alt+F to enter the menu).
- On the left side of the window, click “Add-Ins”
- The “Manage: Add-ins” option is at the bottom of the window. Click “Go”.
- Select “Analysis ToolPak” and click “OK” in the new window.
- The add-in has now been enabled.
3. Enter your data, or open your data file. The data must be in adjacent columns, with the labels in the first row of each column.
4. Select the “Data” tab, then click “Data Analysis” in the “Analysis” group (usually at or near the far right of the Data tab options) and choose “Regression” from the list of tools.
5. Select the dependent (Y) range by first placing the cursor in the “Input Y-Range” field, then highlighting the data column in the workbook.
6. Enter the independent (X) variables by first placing the cursor in the “Input X-Range” field, then highlighting multiple columns in the workbook (e.g. $C$1:$E$53).
- The independent variable data columns MUST be adjacent to each other for the input to work.
- Select the box next to “Labels” if you’re using labels (which should appear in the very first row in each column).
- Default confidence levels are set to 95%. Click the box next to “Confidence Level” and edit the adjacent value if you wish to adjust it.
- In the “New Worksheet Ply” field of “Output Options,” enter a name.
7. In the “Residuals” category, select the options you want. Using the “Residual Plots” and “Line Fit Plots” options, you can generate graphics of residual output.
8. The analysis will be created once you click “OK”.
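If you want to see what the Regression tool is estimating, here is a minimal sketch in plain Python of an ordinary least-squares fit with an intercept. It is not how Excel implements the calculation; the function names `solve_linear_system` and `fit_ols` are my own, and the data are made up from a known equation so the result is easy to check.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(x_rows, y):
    """Return [intercept, b1, b2, ...] that minimise the squared error."""
    design = [[1.0] + list(row) for row in x_rows]  # leading 1.0 = intercept
    k = len(design[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated from y = 1 + 2*x1 + 3*x2 with no noise.
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in x_rows]

coefficients = fit_ols(x_rows, y)
print([round(c, 6) for c in coefficients])  # close to [1.0, 2.0, 3.0]
```

Each tuple in `x_rows` plays the role of one spreadsheet row across the adjacent X columns, and `y` plays the role of the Y range.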
Multiple Regression Using The Data Analysis Add-In
This requires the Data Analysis Add-in: see Excel 2007: Access and Activating the Data Analysis Add-in
The data used are in carsdata.xls
We then create a new variable in cells C2:C6, cubed household size as a regressor.
Then in cell C1 give the heading CUBED HH SIZE.
(It turns out that for these data squared HH SIZE has a coefficient of exactly 0.0, so the cube is used instead.)
The spreadsheet cells A1:C6 should look like:
We have a regression with an intercept and the regressors HH SIZE and CUBED HH SIZE.
The population regression model is: y = β₁ + β₂x₂ + β₃x₃ + u
It is assumed that the error u is independent with constant variance (homoskedastic) – see EXCEL LIMITATIONS at the bottom.
We wish to estimate the regression line: y = b₁ + b₂x₂ + b₃x₃
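To make the setup concrete, here is a small Python sketch of building the cubed regressor and evaluating the estimated line. The household sizes and the coefficients are placeholders I made up for illustration; the real values come from carsdata.xls and from Excel's output.

```python
# Hypothetical household sizes; the real values live in carsdata.xls.
hh_size = [1, 2, 3, 4, 5]

# Equivalent of filling C2:C6 with the cube of each household size.
cubed_hh_size = [x ** 3 for x in hh_size]

# Regressors side by side, like contiguous columns in the worksheet.
regressors = list(zip(hh_size, cubed_hh_size))


def predict(b1, b2, b3, x2, x3):
    """Evaluate the estimated line y = b1 + b2*x2 + b3*x3."""
    return b1 + b2 * x2 + b3 * x3


# Placeholder coefficients, NOT estimates from the real data.
b1, b2, b3 = 0.5, 0.25, 0.01
fitted = [predict(b1, b2, b3, x2, x3) for x2, x3 in regressors]
print(cubed_hh_size)  # [1, 8, 27, 64, 125]
```

Here x2 is household size and x3 is its cube, so the model is still linear in the coefficients even though it is not linear in household size.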
Using the Data Analysis Add-in and Regression, we perform this analysis.
The only change from one-variable regression is to include more than one column in the Input X Range.
Note, however, that the regressors need to be in contiguous columns (here columns B and C).
If this is not the case in the original data, then columns need to be copied so that the regressors end up in contiguous columns.
Clicking OK, we obtain the regression output.
The regression output has three components:
- Regression statistics table
- ANOVA table
- Regression coefficients table.
1. Coefficient of Multiple Determination
The fact that our equation fits the data better than any other linear equation does not guarantee that it fits the data well. How well does our equation actually fit?
Researchers use the coefficient of multiple determination (R²) to answer this question. It measures the proportion of variation in the dependent variable that can be predicted from the set of independent variables. When the regression equation fits the data well, R² will be large (i.e., close to 1).
Sums of squares can be used to define the coefficient of multiple determination:
SSR = Σ ( ŷ - ȳ )²
SSTO = Σ ( y - ȳ )²
R² = SSR / SSTO
In these equations, SSR is the sum of squares due to regression, SSTO is the total sum of squares, ŷ is the predicted value of the dependent variable, ȳ is the mean of the dependent variable, and y is the raw score of the dependent variable.
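Here is a small worked example of these sums of squares in Python. The data are made up, and the two regressors are deliberately centered and orthogonal so that the least-squares coefficients can be computed with a one-line shortcut. That shortcut does not hold for general data.

```python
# Made-up data. x1 and x2 each sum to zero and are orthogonal to each other.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y = [1, 3, 2, 6]

n = len(y)
y_mean = sum(y) / n

# Shortcut valid ONLY for centered, orthogonal regressors:
# the intercept is the mean of y and each slope is sum(x*y) / sum(x*x).
b1 = sum(a * v for a, v in zip(x1, y)) / sum(a * a for a in x1)
b2 = sum(b * v for b, v in zip(x2, y)) / sum(b * b for b in x2)
y_hat = [y_mean + b1 * a + b2 * b for a, b in zip(x1, x2)]

ssr = sum((f - y_mean) ** 2 for f in y_hat)  # sum of squares due to regression
ssto = sum((v - y_mean) ** 2 for v in y)     # total sum of squares
r_squared = ssr / ssto

print(ssr, ssto, round(r_squared, 4))  # 13.0 14.0 0.9286
```

So in this toy example about 92.9% of the variation in y is accounted for by the two regressors.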
You will rarely need to calculate the coefficient of multiple determination by hand. Excel (like most other analysis packages) provides it as standard output.
The output shows that the regression equation fits the data quite well. The coefficient of multiple determination is 0.905. For our sample problem, this means that 90.5% of the variation in test scores can be explained by IQ and study hours.
An Alternative View of R²
The coefficient of multiple determination (R²) is the square of the correlation between actual and predicted values of the dependent variable. Thus,
R² = r²(y, ŷ)
where y is the dependent variable raw score, ŷ is the predicted value of the dependent variable, and r(y, ŷ) is the correlation between y and ŷ.
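The sketch below checks this identity numerically in Python. The values are made up: `y_hat` holds the least-squares fitted values (with an intercept) for `y` regressed on two hypothetical regressors, x1 = [-1, -1, 1, 1] and x2 = [-1, 1, -1, 1]. The identity relies on the predictions coming from a least-squares fit with an intercept, so it would not hold for arbitrary predictions.

```python
import math

# Made-up observations and their least-squares fitted values.
y = [1.0, 3.0, 2.0, 6.0]
y_hat = [0.5, 3.5, 2.5, 5.5]


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


r = pearson_r(y, y_hat)

# The sums-of-squares definition, for comparison.
y_mean = sum(y) / len(y)
ssr = sum((f - y_mean) ** 2 for f in y_hat)
ssto = sum((v - y_mean) ** 2 for v in y)

print(round(r ** 2, 4), round(ssr / ssto, 4))  # 0.9286 0.9286
```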
2. ANOVA Table
Statistical significance of the regression sum of squares is another way to evaluate the regression equation. In this case, we examine the Excel ANOVA table:
The ANOVA table tests the statistical significance of the independent variables as predictors of the dependent variable. The last two columns report an overall F test. The F statistic (33.4) is large and the p-value (0.00026) is small. This indicates that one or both of the independent variables have explanatory power beyond what would be expected by chance.
Like the coefficient of multiple determination, the overall F test in the ANOVA table suggests that the regression equation fits the data well.
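For readers who want to see where the F statistic comes from, here is a minimal Python sketch of the standard calculation: the mean square due to regression divided by the mean square error. The numbers are made up and unrelated to the 33.4 reported above, and `overall_f` is a name I chose for illustration. The sketch stops short of the p-value, which requires the F distribution.

```python
def overall_f(ssr, sse, n_obs, n_predictors):
    """Overall F statistic for a regression that includes an intercept."""
    msr = ssr / n_predictors                 # mean square due to regression
    mse = sse / (n_obs - n_predictors - 1)   # mean square error
    return msr / mse


# Made-up observations and their least-squares fitted values.
y = [1.0, 3.0, 2.0, 6.0]
y_hat = [0.5, 3.5, 2.5, 5.5]
y_mean = sum(y) / len(y)

ssr = sum((f - y_mean) ** 2 for f in y_hat)        # regression sum of squares
sse = sum((v - f) ** 2 for v, f in zip(y, y_hat))  # residual sum of squares

f_stat = overall_f(ssr, sse, n_obs=len(y), n_predictors=2)
print(f_stat)  # 6.5
```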
3. Coefficients Table:
Multiple regression involves several independent variables, so it is natural to ask whether a particular independent variable contributes significantly to the regression after the effects of the other variables are taken into account. The answer to that question can be found in the regression coefficients table:
The regression coefficients table shows the following information for each coefficient: its value, its standard error, a t-statistic, and the significance of the t-statistic. In this example, the t-statistics for IQ and study hours are both statistically significant at the 0.05 level. This means that IQ contributes significantly to the regression after the effects of study hours are taken into account, and study hours contribute significantly to the regression after the effects of IQ are taken into account.
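Each t-statistic in that table is simply the coefficient divided by its standard error. The Python sketch below shows the arithmetic on made-up data. It again uses centered, orthogonal regressors, which makes each standard error a one-line formula; that shortcut is an assumption of this example and does not apply to general data, where the full matrix calculation is needed.

```python
import math

# Made-up data. x1 and x2 each sum to zero and are orthogonal to each other.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y = [1, 3, 2, 6]
n, k = len(y), 2  # observations, predictors

# Least-squares coefficients (shortcut valid only for this kind of design).
intercept = sum(y) / n
b1 = sum(a * v for a, v in zip(x1, y)) / sum(a * a for a in x1)
b2 = sum(b * v for b, v in zip(x2, y)) / sum(b * b for b in x2)
y_hat = [intercept + b1 * a + b2 * b for a, b in zip(x1, x2)]

# Residual variance: sum of squared errors over its degrees of freedom.
sse = sum((v - f) ** 2 for v, f in zip(y, y_hat))
s2 = sse / (n - k - 1)

# With this design each standard error is sqrt(s2 / sum of squares of the column).
se_intercept = math.sqrt(s2 / n)
se_b1 = math.sqrt(s2 / sum(a * a for a in x1))
se_b2 = math.sqrt(s2 / sum(b * b for b in x2))

t_stats = {
    "intercept": intercept / se_intercept,
    "x1": b1 / se_b1,
    "x2": b2 / se_b2,
}
print(t_stats)  # {'intercept': 6.0, 'x1': 2.0, 'x2': 3.0}
```

With only one residual degree of freedom this toy example is about the arithmetic, not about drawing conclusions; Excel compares each t-statistic with the t distribution to produce the p-values in the table.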
Note: This analysis does not include any discussion of multicollinearity, something we will address in the next lesson. However, be aware that it is advisable to check the independent variables for multicollinearity before assessing the significance of the regression coefficients.