Fall Semester 2002

Lecture Notes Nine: Correlation

Correlation is a topic that studies the relationship between two variables. Interest centers on the direction and degree of the relationship. To understand correlation we start from the concept of covariance.

Covariance is a measure of the strength of the relationship between two variables. It provides a quantitative answer to the question:

"As the values observed for one variable rise (or fall), what tends to happen to the values observed on another variable?"
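Given $N$ paired observations $x_1, x_2, \ldots, x_N$ and $y_1, y_2, \ldots, y_N$, we represent all points $(x_i, y_i)$ graphically as a scatterplot. Then the lines $x = \bar{x}$ and $y = \bar{y}$, drawn through the means of the two variables, divide the scatterplot into four quadrants: A, B, C, D.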

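For a point $(x_i, y_i)$ in quadrant A or C we have

$$(x_i - \bar{x})(y_i - \bar{y}) > 0$$

For a point $(x_i, y_i)$ in quadrant B or D we have

$$(x_i - \bar{x})(y_i - \bar{y}) < 0$$
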
If x and y tend to move in the same direction then most products of deviations in x and deviations in y (as illustrated above) will be positive (quadrants A and C).

Alternatively, if x and y tend to move in opposite directions, thus populating mostly quadrants B and D, then the products of deviations in x and deviations in y will tend to be negative.

When there is no relationship between x and y, with observations appearing evenly in all four quadrants, positive and negative products of deviations will tend to cancel, leaving a sum close to zero.
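
The quadrant argument above can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function names `deviation_products` and `sign_counts` and the two small data sets are hypothetical, chosen only for illustration. It counts how many products of deviations are positive (quadrants A and C) and how many are negative (quadrants B and D).

```python
def deviation_products(xs, ys):
    """Return the products (x_i - mean_x) * (y_i - mean_y) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]


def sign_counts(products):
    """Count positive products (quadrants A, C) and negative ones (B, D).

    Points lying exactly on one of the mean lines give a product of zero
    and are counted in neither group.
    """
    positive = sum(1 for p in products if p > 0)
    negative = sum(1 for p in products if p < 0)
    return positive, negative


# Hypothetical data: 'rising' tends to move with xs, 'falling' against it.
xs = [1, 2, 3, 4, 5]
rising = [1, 3, 2, 5, 4]
falling = [5, 3, 4, 1, 2]

print("same direction:    ", sign_counts(deviation_products(xs, rising)))
print("opposite direction:", sign_counts(deviation_products(xs, falling)))
```

With the rising data every nonzero product is positive, and with the falling data every nonzero product is negative, which matches the description of quadrants A and C versus B and D.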

The quantity

$$\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$

is a measure of variability between x and y.

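If we look carefully at this sum

$$\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$

we soon realize that the result depends on the sample size (N): every additional observation contributes one more term.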
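
To see the dependence on sample size concretely, here is a small Python sketch (the helper name `cross_product_sum` and the data are assumptions made for illustration). Repeating every observation twice leaves the means and the shape of the scatterplot unchanged, yet the sum of products of deviations doubles.

```python
def cross_product_sum(xs, ys):
    """Sum of (x_i - mean_x) * (y_i - mean_y) over all paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]

once = cross_product_sum(xs, ys)           # N = 5
twice = cross_product_sum(xs * 2, ys * 2)  # same points repeated, N = 10

print("sum with N = 5: ", once)
print("sum with N = 10:", twice)
```

The relationship between the two variables is identical in both cases, so a measure that doubles merely because N doubled is not yet a satisfactory summary.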

The covariance of x and y is defined as the average variability between x and y:

$$\mathrm{cov}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$

Together with the individual variances of x and y, the concept of covariance between x and y forms the basis for correlation and regression analysis. Let's end this section by re-emphasizing that the idea of covariance (and correlation) is based on deviations from the means.
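
As a rough illustration of the definition, the sketch below computes the covariance as the average product of deviations, dividing by N as these notes do (some libraries divide by N - 1 instead). The function name `covariance` and the data are hypothetical. It also checks that the covariance of a variable with itself is its variance, using `statistics.pvariance` from the Python standard library as the reference.

```python
import statistics


def covariance(xs, ys):
    """Average product of deviations from the means (divides by N)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]

print("cov(x, y):              ", covariance(xs, ys))
# Repeating every observation no longer changes the result.
print("cov with data repeated: ", covariance(xs * 2, ys * 2))
# The covariance of x with itself is the (population) variance of x.
print("cov(x, x) vs pvariance: ", covariance(xs, xs), statistics.pvariance(xs))
```

Because the sum is divided by N, the result describes the typical product of deviations per observation rather than a total that grows with the amount of data.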

Correlation

The sign of the covariance gives the direction of the relationship between two variables; its size gives some idea of the strength of that relationship. The actual magnitude of the covariance, however, depends on the units in which x and y are measured. The problem is removed by dividing each deviation from the variable's mean by the standard deviation of the variable:

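$$z_{x_i} = \frac{x_i - \bar{x}}{s_x}$$

and

$$z_{y_i} = \frac{y_i - \bar{y}}{s_y}$$

where

$$s_x = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}$$

and

$$s_y = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N}}$$

The covariance based on these standardized variables is now written

$$r = \frac{1}{N} \sum_{i=1}^{N} z_{x_i} z_{y_i} = \frac{1}{N} \sum_{i=1}^{N} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$$

This standardized covariance is called the coefficient of correlation and is typically denoted by the symbol r. Since both standard deviations have the same square root of N in their denominators, their product $s_x s_y$ carries a factor of 1/N, which cancels with the multiplicative factor of N in the denominator of r. Thus the coefficient of correlation can also be written as:

$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$
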
So we now have the following definitions:
Correlation of x and y
A standardized measure of the average cross variability between x and y

Coefficient of correlation
A standardized measure of the strength of the linear relationship between two variables.
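
The following Python sketch illustrates the two ways of computing the coefficient of correlation described in this section: as the average product of standardized deviations, and as the sum of products of deviations divided by the square root of the product of the sums of squares. All function names and the data are hypothetical, and the standard deviations divide by N as in these notes. The sketch also rescales x (as if changing its units) to show that r, unlike the covariance, is unaffected.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Standard deviation with N in the denominator."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation_standardized(xs, ys):
    """Average product of standardized deviations.

    Assumes neither variable is constant; otherwise a standard
    deviation is zero and the division below fails.
    """
    mx, my = mean(xs), mean(ys)
    sx, sy = std_dev(xs), std_dev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / len(xs)


def correlation_direct(xs, ys):
    """Same quantity after the factors of N have cancelled."""
    mx, my = mean(xs), mean(ys)
    sum_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sum_xx = sum((x - mx) ** 2 for x in xs)
    sum_yy = sum((y - my) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]
xs_rescaled = [100 * x for x in xs]  # e.g. the same lengths in different units

print("r (standardized form):", correlation_standardized(xs, ys))
print("r (direct form):      ", correlation_direct(xs, ys))
print("r after rescaling x:  ", correlation_direct(xs_rescaled, ys))
```

For this data both forms give r = 0.8 up to floating-point rounding, and multiplying every x by 100 leaves r unchanged.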
Now you are ready for Homework Four.

Last updated: Nov 19, 2002 by Adrian German for `A113`