Week 3 Lesson 1: Correlation Between Two Variables

Slide 1

As you just read, there is a very important graphical tool we can use to see if two variables of interest are related to each other. This graphical tool is called a scatterplot (or an XY plot in Microsoft Excel). There is also a numeric measure called the correlation coefficient, which shows the strength of a linear relationship between two variables; that is, when the basic pattern of the relationship can be represented by a straight line.

In a scatterplot, values of two variables are shown on two perpendicular axes, one variable on each axis. Each point on the scatterplot is one observation in the sample, with its coordinates being the values of the two variables of interest. Now we will explore examples of scatterplots with different patterns in the data.

This scatterplot shows no apparent relationship between the two variables. Why? Because there is no distinct pattern in the way points are scattered in the graph.

On the other hand, this scatterplot shows an upward pattern as we move from left to right on the horizontal axis (or from bottom to top on the vertical axis). This general pattern can be represented with an upward sloping straight line, and it can be described as a linear relationship between the two variables. It means that points with larger values on the horizontal axis tend to have larger values on the vertical axis. It does not mean that whenever a point has a larger value on the horizontal axis, it also has a larger value on the vertical axis. Just look at points A and B if you are not convinced.

Looking at the scatterplot on the left, we also see an upward sloping straight line pattern, but the points are scattered much closer to the line. Here, we say that the relationship is stronger than the one in the previous scatterplot. On the scatterplot to the right, we see a downward sloping pattern with considerable scatter, showing a weak linear relationship.

These last two scatterplots show two perfect linear relationships, one sloping upward and one sloping downward. In each of these two cases, all points in the scatterplot fall exactly on a straight line, with no scatter or deviation from the line. Only in these perfect relationships can we determine the value of one variable exactly with no error when the value of the other variable is given. To summarize, when we say two variables are linearly related, we are referring to a general straight-line pattern in their scatterplot. The general pattern may not be true for every pair of points in the scatterplot (except in a perfect relationship).

Slide 2

The correlation coefficient, or “r”, can show whether and how strongly pairs of variables are related. In this example, height is our x variable and weight is our y variable for the points provided. X-bar is the average of the x-values, and Y-bar is the average of the y-values.

Since the average is the “balance point” of the data set, the deviations of the variable x from its average (x-xbar) always sum to zero, and so do the deviations of the variable y from its average (y-ybar).

Because these deviations always sum to zero, their plain average tells us nothing about how spread out the data are. To measure the variance, we instead average the “square” of these deviations.

However, to get back to the same unit of measurement, we will take the square root of the variance, and that gives us the standard deviation Sx of the variable x, and the standard deviation Sy of the variable y.

Covariance follows the same idea as variance. Instead of averaging the squared deviations of a single variable, we average the products of each deviation of the variable x (x-xbar) and the corresponding deviation of the variable y (y-ybar).

N represents the sample size. In the example provided, N is 5. Here is how each quantity is computed from the data provided:
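The quantities defined above can be traced step by step in a short sketch. The five height and weight values below are hypothetical stand-ins, not the data on the slide, and the sketch assumes the common sample convention of dividing by N - 1 when averaging; your textbook may divide by N instead. Either choice gives the same r, as long as it is used consistently.

```python
import math

# Hypothetical sample of N = 5 people (illustrative values only,
# not the data shown on the slide).
heights = [60, 62, 65, 68, 70]        # x, in inches
weights = [110, 120, 140, 155, 175]   # y, in pounds

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n

# Deviations from the average; each list sums to zero.
dev_x = [x - x_bar for x in heights]
dev_y = [y - y_bar for y in weights]

# Assumption: divide by n - 1 (sample convention).
var_x = sum(d * d for d in dev_x) / (n - 1)
var_y = sum(d * d for d in dev_y) / (n - 1)
s_x = math.sqrt(var_x)   # standard deviation of x
s_y = math.sqrt(var_y)   # standard deviation of y

# Covariance: average product of paired deviations.
cov_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / (n - 1)

# Correlation coefficient: covariance divided by both standard deviations.
r = cov_xy / (s_x * s_y)

print("x-bar:", x_bar, " y-bar:", y_bar)
print("sum of deviations:", sum(dev_x), sum(dev_y))
print("variance of x:", var_x, " variance of y:", var_y)
print("covariance:", cov_xy)
print("r:", round(r, 4))
```

For these made-up values the points lie close to an upward sloping line, so r comes out close to +1.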

The computation of r is discussed in detail in your textbook. You may find this information helpful, so please read that section now. You will see that the simplest description of how the correlation coefficient is calculated is as follows:

To find the correlation coefficient between X and Y, standardize both variables, multiply each standardized X value by the corresponding standardized Y value, then average these products.
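As a rough illustration of this recipe, the sketch below standardizes both variables (subtract the average, divide by the standard deviation) and then averages the paired products. The data are hypothetical, and the helper name `standardize` is made up for this example. It assumes the N - 1 convention for both the standard deviation and the final average; a textbook that divides by N throughout arrives at the same r.

```python
import math

# Hypothetical data (illustrative values only).
heights = [60, 62, 65, 68, 70]
weights = [110, 120, 140, 155, 175]

def standardize(values):
    """Convert values to z-scores: (value - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation (divide by n - 1).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

z_x = standardize(heights)
z_y = standardize(weights)

# Average the products of paired z-scores (dividing by n - 1 to match
# the standard deviation convention used above).
n = len(heights)
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

print("r:", round(r, 4))
```

Because standardizing removes the units of measurement, r itself has no units.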

Once you develop some feel for the correlation coefficient, using computer software or an advanced calculator is the preferred way to perform the computations.

Click the button provided to see how the correlation coefficient can be obtained using Microsoft Excel.

Slide 10

Observe that r has a positive value in each scatterplot with an upward sloping pattern, and a negative value in each scatterplot with a downward sloping pattern. Also, the stronger the relationship, the farther the value of r is from 0 and the closer it is to -1 or +1. In magnitude, the largest possible values of r are +1 and -1 (both with absolute value 1), and its lowest possible magnitude is 0.

Note that when we say the correlation coefficient between two variables is high, moderate, or low, we are referring to the magnitude of r (i.e. its absolute value). So, a value of r near zero is low, but a value near -1 is high even though -1 is lower than 0 mathematically. The sign of r shows the direction of the linear pattern (positive for upward, negative for downward).

Very roughly speaking, absolute values of r from 0.0 to about 0.4 are usually considered low, values from 0.4 to 0.7 are usually considered moderate, and values from 0.7 to 1.0 are usually considered high.
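These rough cutoffs can be written as a small helper. The function name `describe_strength` is hypothetical, and the handling of values that fall exactly on 0.4 or 0.7 is an arbitrary choice made here, since the guideline itself is only approximate.

```python
def describe_strength(r):
    """Label the strength of a linear relationship from r (rough guideline)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    magnitude = abs(r)   # strength depends on the magnitude, not the sign
    if magnitude < 0.4:
        return "low"
    if magnitude < 0.7:  # boundary placement is an assumption
        return "moderate"
    return "high"

for r in (0.1, -0.5, 0.85, -0.95):
    direction = "upward" if r > 0 else "downward"
    print(r, "->", describe_strength(r), "strength,", direction, "pattern")
```

Note that -0.95 is labeled high even though it is the smallest number in the list, because only its absolute value matters for strength.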

To summarize, the correlation coefficient measures the strength of the linear association between two variables. Its possible values are between -1 and 1. Values near 0 indicate a weak linear relation, and values near -1 or +1 indicate a strong linear relation.

Slide 11

There are several things to be aware of when interpreting an r-value and the data behind it. These include, but are not limited to:

Curved Relationships
Correlation versus causation
Outliers
Rescaling

Suppose the scatterplot of a set of data looks like this one.

Clearly, the pattern of relationship in this case is curved, and no straight line can represent it appropriately or adequately. The left half of the scatterplot is sloping down, and the right side is sloping up.

If we had to fit a straight line to this clearly curved pattern, it would be a roughly horizontal line, and the correlation coefficient would be near or equal to zero. This is a case where the scatterplot is informative about the existence of a curved relationship, but the correlation coefficient is not useful; it can erroneously be interpreted as indicating a weak relationship. When you want to explore the relationship between two variables, do not just look at the correlation coefficient; also look at the scatterplot. As the saying goes, a picture is worth a thousand words.
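A small sketch can make this concrete. It uses made-up data in which Y is exactly the square of X, so the relationship is perfect but curved. The helper `pearson_r` is written out here for illustration and is not taken from the lesson.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical U-shaped data: Y is completely determined by X.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_all = pearson_r(x, y)            # whole curve
r_left = pearson_r(x[:4], y[:4])   # left half, sloping down
r_right = pearson_r(x[3:], y[3:])  # right half, sloping up

print("whole curve:", r_all)
print("left half:  ", round(r_left, 3))
print("right half: ", round(r_right, 3))
```

The downward and upward halves cancel each other, so r for the whole curve is zero even though knowing X tells you Y exactly.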

Two variables X and Y may be related in various ways:

A change in X causes a change in Y.
A change in Y causes a change in X.
A different variable (or a set of variables) causes a change in both X and Y.

Because of the third possibility, we cannot conclude that the existence of a relationship indicates that a change in one variable causes a change in the other. In other words, an observed relationship can be due to causes other than the variables we are examining.
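The third possibility can be illustrated with a simulation. This is only a sketch on artificial data: a hidden variable Z is generated first, and X and Y are each built from Z plus their own independent noise, so neither X nor Y has any influence on the other. The noise level (0.5), the sample size, and the seed are arbitrary choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5000

# Z is the hidden third variable that drives both X and Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + 0.5 * rng.gauss(0, 1) for zi in z]
y = [zi + 0.5 * rng.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)

# Subtract Z's contribution from each variable. What remains is
# independent noise, so the leftover parts should be nearly uncorrelated.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = pearson_r(x_rest, y_rest)

print("r between X and Y:          ", round(r_xy, 2))
print("r after removing Z's effect:", round(r_rest, 2))
```

X and Y come out strongly correlated, yet the relationship disappears once the common cause is accounted for.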

As you know, outliers are data points that stand apart and do not fit the pattern in the rest of the data set. To illustrate how they can affect the correlation coefficient, we shall use a famous data set in statistics, called the Anscombe data set, shown here.

This data set is available through the link Data Sets on the main menu. Here, we have four data sets of two variables. If you use the CORREL function in Excel, you will find that the correlation coefficient between X and Y is virtually the same (about 0.82) in each of the four sets.

Now let us look at the scatterplots of the four data sets.

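You can see that the outliers in data sets c and d significantly affect the correlation coefficient. Without its outlier, r would have been essentially +1 in data set c. In data set d, all the remaining points would share the same X value, so there would be no linear relationship left to measure; r cannot even be computed when X does not vary.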

Note also that the correlation coefficient is reasonably high even in the curved relationship in Data Set b. This means that, if we did not know better, a straight line could be used to represent the relationship. This would obviously be a compromise, however, because a curve would fit the pattern perfectly, not just reasonably well.
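For readers who prefer to check this outside Excel, here is a sketch that recomputes r for the four sets. The values are the published Anscombe quartet; the labels a through d assume the slide shows the quartet in its usual order. The helper `pearson_r` and the function `without_point` are written for this illustration, and `pearson_r` returns `None` when a variable has no variation, since the formula would otherwise divide by zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson r, or None when either variable has no variation."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets a, b, c
anscombe = {
    "a": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                  7.24, 4.26, 10.84, 4.82, 5.68]),
    "b": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "c": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                  6.08, 5.39, 8.15, 6.42, 5.73]),
    "d": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
           5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pearson_r(xs, ys) for name, (xs, ys) in anscombe.items()}
for name, r in r_values.items():
    print("data set", name, "r =", round(r, 3))

def without_point(xs, ys, index):
    """Return copies of xs and ys with one observation dropped."""
    return xs[:index] + xs[index + 1:], ys[:index] + ys[index + 1:]

# Outliers: (13, 12.74) at position 2 in set c, (19, 12.50) at position 7 in set d.
r_c_trimmed = pearson_r(*without_point(*anscombe["c"], 2))
r_d_trimmed = pearson_r(*without_point(*anscombe["d"], 7))

print("set c without its outlier:", round(r_c_trimmed, 4))
print("set d without its outlier:", r_d_trimmed)
```

With the outlier dropped from set d, every remaining X equals 8, which is why the sketch reports `None` there rather than a number.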

Sometimes we have reason to ask "What would happen to the correlation coefficient if we shift the data up or down by adding or subtracting the same constant, or if we multiply or divide the data by the same positive constant?"

The answer is nothing!

Shifting all data values up or down by the same constant, or multiplying or dividing all data values by the same positive constant, does not change the correlation coefficient. Thus, adding 15 to all X values, subtracting 3 from all Y values, multiplying both X and Y values by 12, etc. would have no effect on the value of the correlation coefficient.
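The rescaling examples just mentioned can be checked directly. The sketch below uses hypothetical data and a helper `pearson_r` written for this illustration. It also includes one extra case not covered above: multiplying only one variable by a negative constant, which keeps the magnitude of r but reverses its sign.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (illustrative values only).
x = [60, 62, 65, 68, 70]
y = [110, 120, 140, 155, 175]

r_original = pearson_r(x, y)

# Add 15 to every X value and subtract 3 from every Y value.
r_shifted = pearson_r([xi + 15 for xi in x], [yi - 3 for yi in y])

# Multiply both X and Y values by 12.
r_scaled = pearson_r([12 * xi for xi in x], [12 * yi for yi in y])

# Extra case: multiply only X by -1. The direction of the pattern reverses.
r_flipped = pearson_r([-xi for xi in x], y)

print("original:", round(r_original, 4))
print("shifted: ", round(r_shifted, 4))
print("scaled:  ", round(r_scaled, 4))
print("X negated:", round(r_flipped, 4))
```

This is why r is the same whether heights are recorded in inches or centimeters, or weights in pounds or kilograms.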