Fall Semester 2002

Lecture Notes Ten: Regression

Here's a picture of how we calculate the residuals.

I now include a Review of analysis of bivariate numerical data and Simple linear regression.

BIVARIATE NUMERICAL DATA

A scatterplot is useful for examining the relationship between two numerical variables. In Excel this kind of chart is called an XY (scatter) chart; other names include scatter diagram, scattergram, and XY plot. Such a graphical display is often the first step before fitting a curve to the data using a regression model.

Example. The data shown below were collected in a study of real estate property valuation.

```  SqFt    Price

521     26.0
661     31.0
694     37.4
743     34.8
787     39.2
825     38.0
883     39.6
920     31.2
965     37.2
1011     38.4
1047     43.6
1060     44.8
1079     40.6
1164     41.8
1298     45.2```
The 15 properties were sold in a particular calendar year in a particular neighborhood in a city stratified into a number of neighborhoods. Although the data displayed is from a single year, similar data is available for each neighborhood for a number of years.

Because we expect that selling price might depend on square feet of living space, selling price becomes the dependent variable and square feet the explanatory variable. Some call the dependent variable the response variable or the y variable. Similarly, other terms for the explanatory variable are predictor variable, independent variable, or the x variable.

Our initial purpose is to visually examine the relationship between the square feet of living space and the selling price of the parcels. Then we will calculate two summary measures, correlation and covariance, using both the analysis tools and functions. Finally, we will include a third variable, assessed value of the property, and use the analysis tool to compute pairwise correlations. In the second part of this set of notes we will fit straight lines and curves to this same data using regression models.

1.1 XY (Scatter) Charts

The following steps describe how to create and embellish a scatterplot using Excel's Chart Wizard.

1. Arrange the data in columns on a worksheet with the x values (for the horizontal axis) on the left and the y values (for the vertical axis on the right). If the x variable is not on the left, insert a column on the left, select the x data, and click and drag to move the x data to the column on the left.

2. Select the x and y values (A2:B16). Do not include the labels above the data.

3. Click on the Chart Wizard tool.

4. In step 1 (Chart Type) of the Chart Wizard on the Standard Types tab, select XY (Scatter) in the Chart Type list box and verify that the chart sub-type is "Scatter. Compares pairs of values." Click on the wide button Press and Hold to View Sample to preview the chart. Click Next.

5. In step 2 (Chart Source Data) on the Data Range tab, verify that cells A2:B16 were selected and that Excel is treating the data series as columns. (If you don't select the data range before starting the Chart Wizard, you can enter the data range in this step.) On the Series tab, verify that Excel is using cells A2:A16 for x values and cells B2:B16 as y values. (If the data ranges for the x and y values aren't correct, you can specify their locations here.) Click Next.

6. In step 3 (Chart Options) on the Titles tab, select the Chart Title edit box and type Real Estate Properties. Don't press Enter; use the mouse or Tab key to move among the edit boxes. Type Living Space, in Sq. Ft. for the value (x) axis title (the horizontal axis), and Selling Price, in Thousands of Dollars for the value (y) axis title (the vertical axis).

7. In step 3 (Chart Options) on the Gridlines tab, clear all checkboxes.

8. In step 3 (Chart Options) on the Legends tab, clear the checkbox for Show Legend. (With only one set of data on the chart, a legend is not needed). Click Next.

9. In step 4 (Chart Location), verify that you want to place the chart as an object in the current worksheet. Click Finish.

The chart is embedded on the worksheet. The property data show a general positive relationship; more living space is associated with a higher selling price, on the average. Follow steps 10 through 12 to obtain an embellished scatterplot.

10. Change the x-axis to display 400 to 1400 square feet. Select the value (x) axis. Right-click, choose Format Axis from the shortcut menu, and click the Scale tab. Type 400 in the Minimum edit box, 1400 in the Maximum edit box, and 200 in the Major Unit edit box. Click OK.

11. Change the y-axis to display 20 to 50 thousands of dollars. Select the value (y) axis. Right-click, choose Format Axis from the shortcut menu, and click the Scale tab; type 20, 50, and 10 in the Minimum, Maximum, and Major Unit edit boxes. Click the Number tab and set Decimal Places to zero. Then click OK.

12. Click just inside the outer border of the chart to select the chart area. Click and drag the sizing handles so the chart is approximately 6 standard column widths by 15 rows. Click the chart title and choose Arial Bold 12 from the formatting toolbar. For each horizontal and vertical axis and title, click the chart object and choose Arial Regular 10 from the formatting toolbar. Double-click the y-axis title and change the space after the comma to a carriage return. Select the Price data (B2:B16) and click the Increase Decimal button several times so that three significant figures are displayed to the right of the decimal point.

1.2 Analysis Tool: Correlation

The correlation coefficient is a useful sumary measure for bivariate data, in the same sense that the mean and standard deviation are useful summary measures for univariate data. The possible values for the correlation coefficient range from -1 (exact negative correlation, with all points falling on a downward-sloping straight line) through 0 (no linear relationship) to +1 (exact positive correlation, with all points falling on an upward-sloping straight line). The correlation coefficient measures only the amount of straight-line relationship; a strong curvilinear relationship (a U-shaped pattern, for example) might have a correlation coefficient close to zero. The long name for the correlation coefficient is "Pearson product moment correlation coefficient," which is often shortened to simply "correlation".

The following steps describe how to obtain the correlation coefficient using the analysis tool.

1. Enter the x and y data in a worksheet as shown in columns A and B of the model spreadsheet and enter Analysis Tool: Correlation in cell D1.

2. From the Tools menu, choose Data Analysis. From the Data Analysis dialog box, select Correlation in the Analysis Tools list box and click OK.

3. In the Input Section of the Correlation dialog box, specify the location of the data in the Input Range edit box, including the labels (A1:B16). Verify that the data is grouped in columns and be sure the Labels in First Row box is checked.

4. In the Output options section, click the Output Range button, select the Range edit box, and specify the upper-left cell where the correlation output will be located (D2).

5. Click OK. The output appears in cells D2:F4 as shown in the model spreadsheet. (The discussions of CORREL function and covariance outputs follow).

The output is a matrix of pairwise correlations. The diagonal values are 1, indicating that each variable has perfect positive correlation with itself. The value 0.814651 is the correlation of Price and SqFt. The upper-right section is blank, because its values would be the same as those in the lower-left section.

The following steps describe how to use Excel's CORREL function to determine the correlation.

1. Enter CORREL Function in cell D6.

2. Select cell D7. Click the Paste Function tool button (icon fx). In the Paste Function dialog box, select Statistical in the Function Category list box. In the Function Name list box, select CORREL. Then click OK.

3. To move the CORREL dialog box, click in any open area and drag. Select the Array1 edit box, and click and drag on the worksheet to select A2:A16. Select the Array2 edit box, and click and drag to select B2:B16. Do not include the text labels in row 1 in either selection. Then click OK.

The value of the correlation coefficient appears in cell D7. Alternatively, you could have entered the formula

=CORREL(A2:A16,B2:B16)
by typing or by a combination of typing and pointing. Unlike the static text output of the analysis tool, the worksheet function is dynamic. If the data values in A2:B16 are changed, the value of the correlation coefficient in cell D7 will change.

1.3 Analysis Tool: Covariance

The covariance is another measure for summarizing the extent of the linear relationship between two numerical variables. Unfortunately, the covariance is difficult to interpret because its measurement units are the product of the units for the two variables. For the selling price and living space data in the example above, the covariance is expressed in units of square feet times thousands of dollars. It is usually preferable to use the correlation coefficient because it is scale-free. However, the covariance is used in finance theory to describe the relationship of one stock price with another.

The covariance computed by the analysis tool is a population covariance: that is, Excel uses n in the denominator, where n is the number of data points.

The following steps describe how to obtain the covariance using the analysis tool.

1. Enter the x and y data in a worksheet as shown in columns A and B of the model spreadsheet below.

2. From the Tools menu, choose Data Analysis. From the Data Analysis dialog box, select Covariance in the Analysis Tools list box and click OK.

3. In the input section of the Covariance dialog box, specify the location of the data in the Input Range edit box, including the labels (A1:B16). Verify that the data is grouped in columns and be sure the Labels box is checked.

4. In the Output Options section, click the Output Range button, select the Range Edit box, and specify the upper-left cell where the correlation output will be located (D11).

5. Click OK. The output appears in cells D11:F13 as shown in the model spreadsheet below.

The output is a matrix of pairwise sample covariances. The diagonal values are sample variances (the square of the sample standard deviation) for each variable. The value 853.2427 is the population covariance of Price and SqFt. The upper-right section is blank, because its values would be the same as those in the lower-left section.

The following steps describe how to use Excel's COVAR function to double-check the population covariance.

1. Optional: Enter COVAR Function in cell D15.

2. Select cell D16. Click the Paste Function tool button (icon fx). In the Paste Function dialog box, select Statistical in the Function Category list box. In the Function Name list box, select COVAR. Then click OK.

3. To move the COVAR dialog box, click in any open area and drag. Select the Array1 edit box, and click and drag on the worksheet to select A2:A16. Select the Array2 edit box, and click and drag to select B2:B16. Do not include the text labels in row 1 in either selection. Then click OK.

The population covariance value appears in cell D16. Alternatively, you could have entered the formula

=COVAR(A2:A16,B2:B16)
by typing or by a combination of typing and pointing. If the data values in A2:B16 are changed, the population covariance value in cell D16 will change. The covariance computed by Excel's COVAR function uses n in the denominator. To find the sample covariance multiply by n/(n - 1)

In this example, n=15, so the sample covariance is 15/14 * 853.2427 = 914.1886

1.4 Correlations for Several Variables

The Correlation analysis tool is most useful for determining pairwise correlations for three or more variables, often as an aid to selecting variables for a multiple regression model. The following steps describe how to obtain correlations for several variables.

1. Enter the data in cells A1:C16 as shown in the model spreadsheet (second worksheet). If the data for SqFt and Price are already in columns A and B, select A1:B16, copy to the clipboard (using the shortcut menu), select a new sheet, and paste into cell A1; then select column B, choose Insert from the shortcut menu, and enter the Assessed data.
```  SqFt  Assessed  Price

521     7.8     26.0
661    23.8     31.0
694    28.0     37.4
743    26.2     34.8
787    22.4     39.2
825    28.2     38.0
883    25.8     39.6
920    20.8     31.2
965    14.6     37.2
1011    26.0     38.4
1047    30.0     43.6
1060    29.2     44.8
1079    24.2     40.6
1164    29.4     41.8
1298    23.6     45.2```

2. Optional: Enter Analysis Tool: Correlation in cell E1.

3. From the Tools menu, choose Data Analysis. From the Data Analysis dialog box, select Correlation in the Analysis Tools list box and press OK. The Correlation dialog box appears.

4. In the Input section, specify the location of the data in the Input Range edit box, including the labels (A1:C16). Verify that the data is grouped in columns and be sure the labels box is checked.

5. In the output Options section, click the Output Range button, click the adjacent edit box, and specify the upper-left cell where the correlation output will be located (E3).

6. Click OK. The output appears in cells E3:H6.

The output shows three pairwise correlations. The highest correlation, 0.814651, is between SqFt and Price. The correlation between Assessed and Price, 0.67537, is smaller, indicating less of a linear relationship between these two variables. The lowest correlation, 0.424219, is between SqFt and Assessed.

If we must use a single explanatory variable to predict selling price in a linear regression model, these correlations suggest that SqFt is a better candidate than Assessed, because 0.814651 is higher than 0.67537. If we can use two explanatory variables to predict selling price in a multiple regression model, both SqFt and Assessed should be useful, and there shouldn't be a problem with multicollinearity because the correlation between these two explanatory variables is only 0.424219.

SIMPLE LINEAR REGRESSION

Single linear regression can be used to determine a straight-line equation describing the average relationship between two variables. Three methods are described here: the Add Trendline command, the Regression analysis tool, and Excel functions. Before fitting a line, it is important to examine a scatterplot as described in the first part of these notes. If the points on the scatterplot fall approximately on a straight line, the methods described herein are appropriate. If the points fall on a curve or have another pattern one should consider simple nonlinear regression methods (which are not discussed here).

The data analyzed in this chapter consists of selling price and living space for 15 real estate properties described in the example above. Because we expect that selling price might depend on square feet of living space, selling price becomes the dependent variable and square feet the explanatory variable. Some call the dependent variable the response variable or the y variable. Similarly, other terms for the explanatory variable are predictor variable, independent variable, or the x variable.

The first step is to examine the relationship between selling price, in thousands of dollars, and living space, in square feet, by constructing a scatterplot. The general approach is to arrange the data so that the x variable for the horizontal axis is in a column on the left and the y variable for the vertical axis is in a column on the right. Then select the data excluding the labels, click the Chart Wizard tool, and follow the steps for an XY (scatter) chart. Details of these steps with subsequent rescaling and formatting have been described above. The results should match what is presented in the sample spreadsheet that can be downloaded following the link at the bottom of this page, where the chart title is Arial 10 bold and the axes and axis titles are Arial 8 (worksheet 1).

2.1 Inserting a Linear Trendline

The points in the scatterplot follow an approximate straight line, so a linear trendline is appropriate. The method of ordinary least squares determines the intercept and slope for the linear trendline such that the sum of the squared vertical distances between the actual y values and the line is as small as possible. Such a line is often called the line of average relationship. The following steps describe inserting a linear trendline on the scatterplot and formatting the results.

1. Select the data series by clicking on one of the data points. The points are highlighted, the name box shows "Series 1," and the formula bar shows that the SERIES is selected.

2. From the Chart menu, choose the Add Trendline command. Alternatively, right-click the data series and choose Add Trendline from the shortcut menu.

3. Click the Type tab of the Add Trendline dialog box.

4. On the Add Trendline Type tab, click the Linear icon. (The nonlinear trend/regression types are not discussed here).

5. Click the Options tab of the Add Trendline dialog box.

6. On the Add Trendline Options tab, select the Automatic: Linear (Series1) button for Trendline Name. Be sure the checkbox for Set Intercept is clear. Click to put checks in the Display Equation on Chart and Display R-squared Value on Chart checkboxes. Then click OK. The trendline, equation, and R2 are inserted on the scatterplot.

Trendline Interpretation

We can answer the question "What is the average relationship?" by examining the fitted equation y = 0.021x + 18.789, which may be written as

Predicted Price = 18.789 + 0.021 * SqFt
The y-intercept or constant term in the equation is 18.789, measured in the same units as the y variable. Naively, the constant term says that that a property with zero square feet of living space has a selling price of 18.789 thousands of dollars. However, there are no properties with fewer than 521 square feet in our data, so this constant can be considered a starting point that is relevant for properties with living space between 521 and 1,298 square feet.

The slope or regression coefficient, 0.021, indicates the average change in the y variable for a unit change in the x variable. The measurement units in this example are 0.021 thousands of dollars per square foot, or \$21 per square foot. If two properties differ by 100 square feet of living space, we expect the selling prices to differ by 0.021 * 100 = 2.1 thousands of dollars, or \$2,100.

One popular way to answer the question "How good is the relationship?" is to examine the value for R2, which measures the proportion of variation in the dependent variable, y, that is explained using the x variable and the regression line. Here the R2 value of 0.6637 indicates that approximately 66% of the variation in selling prices can be explained by a linear model using living space. Perhaps the remaining 34% of the variation can be explained using other property characteristics in a multiple regression model.

Trendline Embellishments

If the equation displayed on the chart is used to calculate predicted selling prices, the results may be imprecise because the intercept and slope have only three decimal places. To display more decimal places, double click the chart to activate it and click on the region containing the equation and R2 value to select them for editing. Then click the Increase Decimal tool repeatedly to display more decimal places. (In the model below we used five decimals.) These changes affect both the equation and R2 value, and these changes must be made before any other editing.

With the equation and R2 value selected, you can move the entire text box by clicking and dragging near the edge of the box, and you can use the regular text editing options for rearranging the text. In our model you can see the result of such editing: variable names were substituted for x and y, terms were rearranged, and the last three significant figures of R2 were deleted. Once you begin any such editing, you are unable to use the Increase Decimal or Decrease Decimal tools to change the displayed precision.

2.2 Regression Analysis Tool

The Add Trendline command provides only the fitted line, equation and R2. To obtain additional information for assessing the relationship between the two variables, follow these steps to use the Regression Analysis tool.

1. Arrange the data in columns with the x variable on the left and the y variable on the right, as before (on a third worksheet). Make space for the results of the regression analysis to the right of the data. Allow at least 16 columns. (Delete the scatterplot or move it far to the right).

2. From the Tools menu, choose the Data Analysis command. In the Data Analysis dialog box, scroll the list box, select Regression, and click OK. The Regression dialog box appears.

In the Regression dialog box, move from box to box using the mouse or the tab key. For a box requiring a range, select the box and then select the appropriate range on the worksheet by pointing. To see cells on the worksheet, move the Regression dialog box by clicking on its title bar and dragging, or click the collapse button on the right side of each range edit box. Click the Help button for additional information.

3. Input Y Range: Point to or enter the reference for the range containing values of the dependent variable. Include the label above the data.

4. Input X Range: Point to or enter the reference for the range containing values of the explanatory variable. Include the label above the data.

5. Labels: Select this box, because the labels at the top of the Input Y Range and Input X Range were include in those ranges.

6. Constant is Zero: Select this box only if you want to force the regression line to pass through origin (0, 0).

7. Confidence Level: Excel automatically includes 95% confidence intervals for the regression coefficients. For an additional confidence interval, select this box and enter the level in the Confidence Level box.

8. Output Location: Click the Output Range button, click to select the range edit box on its right, and point to or type a reference for the top-left corner cell of a range 16 columns wide where the summary output and charts should appear. Choose \$D\$1. Alternatively, click the New Worksheet Ply button if you want the output to appear on a separate sheet and optionally type a name for the new sheet, or click the New Workbook button if you want the output in a separate workbook.

9. Residuals: Select this box to obtain the fitted values (predicted y) and residuals.

10. Residual plots: Select this box to obtain charts of residuals versus each x variable.

11. Standardized Residuals: Select this box to obtain standardized residuals (each residual divided by the standard deviation of the residuals). This output makes it easy to identify outliers.

12. Line Fit Plots: Select this box to obtain an XY (scatter) chart of the y input data and fitted y values versus the x variable. This chart is similar to the scatterplot with an inserted trendline described a bit earlier in this section.

13. Normal Probability Plots: Don't check this box.

14. After selecting all options and pointing to or typing references, click OK. (If you receive the error message "Cannot add chart to a shared workbook," click the OK button. Then click New Workbook under Output in the Regression dialog box. If desired, use the Edit | Move or Copy Sheet command to copy the results back to the original workbook.) The summary output and charts appear. (If you receive an error message about not being able to empty the clipboard click OK twice).

15. Optional: To change column widths so that all summary output is visible, make a nonadjacent selection. First select the cell containing the Adjusted R Square label (D6). Hold down the Control key while clicking the following cells: Significance F (I11), Coefficients (E16), Standard Error (F16), and Upper 95% (J16). From the Format menu, choose Column | AutoFit Selection. The formatted summary appears in our model spreadsheet at the bottom of this page.

16. Optional: The residual output appears below the summary output. To relocate the residuals to facilitate comparisons, select columns C:E and choose Insert (columns) from the shortcut menu. Select the residual output (H24:J39), including the row of labels but excluding the Observation numbers, and choose Cut or Copy from the shortcut menu. Select cell C1 and choose Paste from the shortcut menu. Adjust the widths of columns C:E and decrease the decimals displayed in cells C2:E16 to obtain the results (as shown in the model spreadsheet below).

Regression Interpretation

The intercept and the slope of the fitted regression line are in the lower-left section labeled "Coefficients" of the summary output in the model spreadsheet. The Intercept coefficient 18.7894675 is the constant term in the linear regression equation, and the SqFt coefficient 0.02101025 is the slope. The regression equation is

Predicted Price = 18.7894675 + 0.02101025 * SqFt
Refer to the Trendline Interpretation section above for an explanation of the intercept and slope.

In the residual output shown in the model spreadsheet below, the predicted prices, sometimes termed the fitted values, are the result of estimating the selling price of each property using this regression equation. The residuals are the difference between the actual and fitted values. For example, the first property has 521 square feet. On the average, we would expect this property to have a selling price of \$29,736, but its actual selling price is \$26,000. The residual for this property is \$26,000-\$29,736 -- that is, -\$3,736. Its actual selling price is \$3,736 below what is expected. The residuals are also termed deviations or errors.

The four most common measures to answer the question "How good is the relationship?" are the standard error, R2, t statistics, and analysis of variance. The standard error, 3.23777441, shown in cell I17 in the model spreadsheet below, is expressed in the same units as the dependent variable, selling price. As the standard deviation of the residuals, it measures the scatter of the actual selling prices around the regression line. This summary of the residuals is \$3,238. The standard error is often called the standard error of the estimate.

R The t statistics, shown in cells G17:G18 in the model spreadsheet are part of the individual hypothesis tests of the regression coefficients. For example, these 15 properties could be treated as a sample from a larger population. Thu null hypothesis is that there is no relationship: the population regression coefficient for living space is zero, implying that differences in living space don't affect selling price. With a sample regression coefficient of 0.02101025 and a standard error of the coefficient (an estimate of the sampling error) of 0.004148397, the coefficient is 5.064667 standard errors from zero. The two-tail p-value, 0.000217, shown in cell H18, is the probability of obtaining these results, or something more extreme, assuming the null hypothesis is true. Therefore we reject the null hypothesis and conclude that there is a significant relationship between selling price and living space.

The analysis of variance table, shown in cells D10:I14 in the model spreadsheet, is a test of the overall fit of the regression equation. Because it summarizes a test of the null hypothesis thhat all regression coefficients are zero, it can only be discussed in a chapter with multiple regression (not here).

Regression Charts

For simple linear regression the analysis tool provides two charts: residual plot and line fit plot. These charts are embedded near the top of the worksheet to the right of the summary output. In the real estate properties example, the charts are originally located in cells M1:S12; after relocating the residuals, the charts are in cells P1:V12.

The line fit plot is shown in the model spreadsheet below. This chart is similar to the scatterplot with inserted trendline, except that the predicted values in this chart are markers without a line. The following steps describe how to format the line fit plot.

1. Select the data series for Predicted Price by clicking one of the square markers that are in a straight line. (Alternatively, select any chart object and use the up and down arrow keys to make the selection.) The points are highlighted and
"=SERIES("Predicted Price", ...)"
appears in the formula bar. Right-click, choose Format Data Series from the shortcut menu, and click the Patterns tab. Select Automatic for Line and select None for Marker. Then click OK.

2. Select the x-axis by clicking on the horizontal line at the bottom of the plot area. A square handle appears at each end of the x-axis. Right-click, choose Format Axis from the shortcut menu, and click the Scale tab. Clear the Auto checkbox for Minimum and type 400 in its edit box; clear the Auto checkbox for Maximum and type 1400 in its edit box; clear the Auto checkbox for Major Unit and type 200 in its edit box. Then click OK.

3. Select the y-axis. Right-click, choose Format Axis from the shortcut menu, and click the Scale tab. Clear the Auto checkbox for Minimum, and type 20 in its edit box; clear the Auto checkbox for Maximum and type 50 in its edit box; clear the Auto checkbox for Major Unit and type 10 in its edit box. Click the Number tab, select Number in the Category list box, and set Decimal Places to 0. Then click OK.

4. Optional: To obtain the appearance desired select and enter more descriptive text for the chart title, x-axis title, and y-axis title. Resize the chart so that it is approximately 7 columns wide and 14 rows high. Select the chart title and choose Arial 10 bold from the formatting toolbar. For the legend, axes and axis titles, select each object and choose Arial 8.

The residual plot (after resizing to approximately 6 columns by 14 rows) is shown in the model spreadsheet. This type of chart is useful for determining whether the functional form of the fitted line is appropriate. If the residual plot is a random pattern, the linear fitted line is satisfactory; if the residual plot shows a pattern, additional modeling may be needed. When there is only one x variable (simple regression), the residual plot provides a view that is similar to making the fitted line horizontal. When there are several x variables (multiple regression), the residual plot is an even more valuable tool for checking model adequacy, because there is usually no way to view the fitted equation in three or more dimensions.

2.3 Regression Functions

A third method for obtaining regression results is worksheet functions. Five functions described here are appropriate for simple regression (one x variable), and four of these have identical syntax for their arguments. For example, the syntax for the INTERCEPT function is

INTERCEPT(known_y's, known_x's)
The same syntax applies to the SLOPE, RSQ (R square), and STEYX (standard error of estimate). These four functions are entered in cells H2:H5 (as seen in the model spreadsheet) and the values returned by these functions are shown in cells F2:F5.

To prepare the model spreadsheet Regression Using Functions output, the function results in column H are copied to the clipboard (Edit | Copy), and the values are pasted into column F (Edit | Paste Special | Values). The formulas are displayed in column H by choosing Options from the Tools menu, clicking the View tab, and checking the Formulas checkbox in the Window Option section.

Cells H9 and H11 show two methods for obtaining a predicted selling price for a property with 1,000 square feet of living space. If the intercept and the slope of the regression equation have already been calculated, the formula

"=intercept + slope * x"

can be entered into a cell (H9) using appropriate cell references. Here the predicted selling price is 39.7997169881321. in thousands of dollars, or approximately \$39,800.

Another method for obtaining a predicted value based on simple linear regression is the FORECAST function, with syntax

FORECAST(x, known_y's, known_x's)
This method, shown in cell H11, calculates the intercept and slope using least squares and returns the predicted value of y for the specified value of x.

Yet another method for obtaining predicted y values is the TREND function, which has the following syntax:

TREND(known_y's, known_x's, new_x's, const)
This function, unlike the FORECAST function, can also be used for multiple regression (two or more x variables). Because the TREND function is an array function, it must be entered in a special way, as descried in the following steps.

1. Enter the data for the x and y variables (A2:B16) and values of the x variable (D13:D16) for which predicted y values will be calculated.

2. Select a range where the predicted y values are to appear.

3. From the Insert menu, choose the Function command. Alternatively, click the Paste Function button (icon fx). In the Paste Function dialog box, select Statistical in the Function Category list box and select TREND in the Function Name listt box. Then click OK.

4. In the TREND dialog box, type or point (click and drag) to ranges on the worksheet containing the known y values (B2:B16), known x values (A2:A16), and new x values (D13:D16). Do not include the labels in row 1 in these ranges. In the edit box labeled "Const," type the integer 1, which is interpreted as true, indicating that an intercept term is desired. Then click OK.

5. With the function cells (H13:H16) still selected, press the F2 key (for editing). The word "Edit" appears int he status bar at the bottom of the screen. Hold down the Control and Shift keys and press Enter. The formula bar shows curly brackets around the TREND function, indicating that the array function has been entered correctly.

A companion function, LINEST, provides regression coefficients, standard errors, and other summary measures. Like TREND, this function can be used for multiple regression (two or more x variables) and must be array-entered. Its syntax is

LINEST(known_y's, known_x's, const, stats)
The "const" and "stats" arguments are true-or-false values, where "const" specifies whether the fitted equation has an intercept term and "stats" indicates whether summary statistics are desired.

To obtain the results shown in the model spreadsheet, select D1:E5, type or use the Paste Function tool to enter LINEST, press F2, and finally hold down the Control and Shift keys while you press Enter. Cells D7:E11 show the numerical results that appear in cells D1:E5, and cells D13:E17 describe the contents of those cells. These same values appear with labels in the Regression analysis tool summary output shown in the model spreadsheet below.

Last updated: Nov 26, 2002 by Adrian German for `A113`