Okay, welcome to our next lecture. In the next two lectures, we are going to talk about correlation and regression. Let’s start with correlation. With correlation, we change the independent variable. Up until now, the dependent variable has always been quantitative and the independent variable has always been categorical, but now we’re going to be looking at situations where both variables are quantitative.
Instead of talking about a difference between two groups, we’re going to change the language, and the change is subtle but important. We’re now going to talk about associations or relationships. We are going to say variables are significantly related or significantly associated with one another. I’ll illustrate the difference and what exactly that means.
If both variables are quantitative, then we can put them on a matrix, something you might remember from a high school geometry class, where we have a horizontal x-axis called the abscissa and a vertical y-axis called the ordinate. Each pairing of the independent and dependent variable can be represented by a point on this plot, and together the points create a scatterplot. Then we start to look at what these patterns of dots mean.
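If you'd like to see this outside of SPSS, here's a minimal Python sketch of building a scatterplot from paired quantitative data. The ACT and GPA numbers are made up purely for illustration; they are not from the lecture.

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations: each student contributes one (x, y) point.
act = [18, 21, 24, 26, 29, 31, 33]          # independent variable (abscissa / x-axis)
gpa = [2.4, 2.7, 2.9, 3.1, 3.3, 3.6, 3.8]   # dependent variable (ordinate / y-axis)

plt.scatter(act, gpa)                        # one dot per (ACT, GPA) pair
plt.xlabel("ACT score")                      # abscissa
plt.ylabel("GPA")                            # ordinate
plt.title("Scatterplot of ACT vs. GPA (hypothetical data)")
plt.show()
```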
One classic pattern we see all the time in higher education is the relationship between ACT and grade point average. If we overlay a quadrant, four equal squares, on the matrix, we will see not a random scattering of points but a pattern starting to develop. It’s a crude pattern, but it is a pattern. You will see the majority of dots are found in the lower-left or upper-right quadrants. This would be an example of a positive relationship or a positive association: as ACT increases, GPA increases as well. It’s not a random relationship. There appears to be a pattern; as one goes up, so does the other.
In correlation, the placement of the variables on the y-axis or the x-axis is arbitrary, so I can flip this one around. If ACT correlates to GPA, then GPA correlates to ACT. There’s no chronology in correlation: we can’t say one variable comes first and then imply that it causes the second. We can’t say that with correlations, so if A correlates to B, B correlates to A.
Up until now, the coefficients that SPSS has calculated haven’t really been used on their own. They don’t really have any meaning by themselves; we focused on their associated significance. When we dealt with a z-score, a t-value, or an F-value, we didn’t really look at that value itself; we looked at the associated significance. With correlation, there is a Pearson r value, and that r value has some meaning. r values range between −1 and +1, and they tell us two basic things: one, the direction, so if the r value is positive, we have a positive relationship, and if it is negative, we have a negative relationship; and two, the strength or magnitude of the relationship. The closer that number is to 1 or −1, the more tightly clustered the points are, or the more that scatterplot starts to resemble a line.
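Here's a small Python sketch, with made-up numbers, showing those properties: r is signed, bounded by ±1, and, as we said above, symmetric, so correlating x with y gives the same value as correlating y with x.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.1, 2.9, 4.2, 4.8, 6.1])    # rises with x -> positive r
y_down = -y_up                                # falls as x rises -> negative r

r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_swapped = np.corrcoef(y_up, x)[0, 1]        # order of the variables is arbitrary

print(f"r(x, y_up)   = {r_up:.3f}")           # close to +1: strong positive
print(f"r(x, y_down) = {r_down:.3f}")         # close to -1: strong negative
print(f"r(y_up, x)   = {r_swapped:.3f}")      # identical to r(x, y_up)
```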
Here are some examples of correlation coefficients. The three along the top are all pretty rare. We don’t see them very often, but we can manufacture artificial ones. The first one is the perfect negative correlation of −1, where the data points line up perfectly, almost in a stair-step fashion, in a downward trend, so the points are found in the upper-left and lower-right quadrants; as we move from left to right, y goes down. The opposite is a perfect positive correlation of +1, where it ladders up and forms a perfect line. The final one on the top row is a correlation of 0, where it just looks like one big blob; if we were to overlay the quadrants, we would see very close to an equal number of points in all four quadrants. Those three are pretty rare; we don’t see them often in the natural world. What we see commonly, and what we are left to interpret, are the ones at the bottom. Not perfect by any means, but still suggesting a relationship. The first one in the lower left is what a positive r value of 0.6 looks like, and next to it is a −0.6. Again, not a perfect line, but certainly a trend moving from the upper left to the lower right.
With correlation, there are some important assumptions we need to deal with, so let’s look at them one at a time.
The first one: correlation assumes a linear relationship. If we saw a scatterplot like this, where the dots rise, hit a peak, a point of diminishing returns, and then decline, so we see both a positive and a negative relationship, that would show up as an r value of zero, but clearly there’s a relationship there. It’s just not a linear one; this is a curvilinear relationship. This is why, with correlation, it’s important to always ask for a scatterplot, because if we were just to run the correlation and get a Pearson r of zero, we would say there isn’t a relationship. Well, there clearly is one; it’s just not linear, so the Pearson r would be an inappropriate measure of it. There’s another statistic (we’re not going to deal with it in this class) that fits a curved line to a relationship, and if you saw a pattern like this, you would probably need to contact a statistician and work on that. We see this sometimes in healthcare. We certainly see it in rehabilitation, where practice or therapy can actually go too far: you’ll see increases to a certain point, then a person can overwork a muscle, hit a point of diminishing returns, and actually see the output decline. You see it in athletic training and with high-end athletes. If you have a son or daughter, or if you were involved in swimming or cross country, you’re familiar with the concept of tapering. Tapering tries to avoid that decline by cutting back practice rather than increasing it, so you don’t see that downward curve.
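Here's a quick simulated illustration (not the lecture's data) of why the scatterplot matters: an inverted-U relationship gives a Pearson r of essentially zero even though x and y are clearly related.

```python
import numpy as np

x = np.linspace(-3, 3, 61)     # evenly spaced x values, symmetric around 0
y = -(x ** 2)                  # rises, peaks at x = 0, then declines (inverted U)

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")  # essentially 0: no *linear* trend to detect
```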
The second assumption is your big word of the day: homoscedasticity. Homoscedasticity just means that you have consistent variance, a consistent spread in the data on one variable, as you move from lower to higher values on your other variable. This is kind of an exaggerated example; you’ll never see anything quite this clean. You’ll see that for low ACT scores, you get primarily low GPAs, and for high ACT scores, you get relatively high GPAs, with about the same spread throughout.
That previous slide would be an example of not violating the assumption of homoscedasticity. This slide is a visualization of a violation of that assumption. In this example, again with ACT and GPA, with lower ACT scores you get lower GPAs, but with higher ACT scores you get both high and low GPAs. You see the data points exploding on GPA as you move from low to high ACT scores. You need to make sure your data has a relatively consistent variation on one variable as you move from low to high on the other.
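A simulated sketch of what that violation looks like, with all numbers made up: the noise in y is built to grow with x, so the spread of the GPA-like outcome fans out as the ACT-like score rises, which you can confirm by comparing standard deviations at the two ends.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(15, 35, size=500)             # hypothetical ACT-like scores
spread = 0.05 + 0.05 * (x - 15)               # noise grows as x grows
y = 2.0 + 0.05 * x + rng.normal(0, spread)    # GPA-like outcome, fanning out

low, high = x < 20, x > 30
print(f"SD of y for low x:  {y[low].std():.2f}")   # tight spread
print(f"SD of y for high x: {y[high].std():.2f}")  # much wider spread
```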
The third assumption is that you do not have a range restriction. A range restriction usually results in an artificial reduction in the r value, so you need to avoid it. Here are two classic examples. The first would be offering some type of study skills program to see if hours in the program relate to an increase in GPA. If you offered that at a community college, you would have a full spectrum of academic performance there: high-performing students and low-performing students. You would have a good shot at showing a relationship. If you went to a highly selective university and offered a similar program, let’s say Harvard or Yale or MIT, you probably would have no low-achieving students. It would be hard for a study skills program to improve GPA when you have such high-achieving students; they don’t really have anywhere to go. They’re already at the high end, so that’s a range restriction. If you sampled at a school like Harvard, you would probably show that the study skills program had no effect. That wouldn’t be a reflection on the value of the study skills program; it would just be a reflection of the fact that these students are all high-achieving. They really can’t get much better, so it would be hard to move that variable very much.
Another one we see in athletic training, physical therapy, and sports medicine is trying to improve a performance outcome. Let’s say we’re looking at improving someone’s free throw percentage, so we’re looking at the relationship of practice to how many free throws they can make out of 10. If we get a group of average individuals, probably the more they practice, the better they’ll get. But if we get college-level Division I or NBA players, they’re already at the high end; many of them already have free throw percentages of 80% or 90% or higher. Practice may not improve that enough to get a significant r value. Again, in PT, whenever we’re looking at interventions, if our students are collecting data at a Division I college, for example, they might not get a significant r value because they have a range restriction. Those individuals are all high-end athletes, so it’s hard to show a relationship when the low end is not well represented.
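Here's a simulated Python sketch of how a range restriction shrinks r: we generate practice hours and a performance score that are genuinely related, then correlate only the high-end performers. The slope, noise level, and cutoff are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
practice = rng.uniform(0, 20, size=1000)                    # hours per week (made up)
score = 3 + 0.3 * practice + rng.normal(0, 1.5, size=1000)  # performance truly tied to practice

r_full = np.corrcoef(practice, score)[0, 1]

# Range restriction: keep only high-end performers, like Division I athletes.
elite = score > 8
r_restricted = np.corrcoef(practice[elite], score[elite])[0, 1]

print(f"r over the full range:       {r_full:.2f}")        # fairly strong
print(f"r over the restricted range: {r_restricted:.2f}")  # noticeably smaller
```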
When we get a Pearson r value, there are three classic ways we can interpret it, so let’s look at those one at a time.
First, with the Pearson r value, we can comment on the strength of the association. This is a chart found in some stats textbooks that quantifies low, moderate, and high correlations. Your book doesn’t have this chart, and I’m glad it doesn’t, because I think it can be a little misleading. These numbers are arbitrary. What we need to do is consider context; context is very important. We can’t just say a number is low because a chart says so. We need to look at the context.
As an example of considering context, pay close attention when you watch the video on correlation, the “Against All Odds” program I link you to. They talk about a twin study where they got a relatively low correlation. It was a significant correlation, but it was relatively low, and it was surprising given the context. These are twins; for example, the two guys in the picture had a high correspondence in physical traits: they’re both bald, and they both need glasses. But they also had a significant correlation in psychological traits. You’ll learn they’re both firefighters, they both prefer the same brand of beer, they both hold the beer can the same way, and they’re both confirmed bachelors. These guys were raised apart, so we wouldn’t expect them to be so similar in these psychological characteristics. Given the context of twins reared apart, even the relatively low correlation score they report is surprising.
The second thing we can do with the Pearson r is square it. This is a preview of our next lesson, but when we square it, it’s called the coefficient of determination, and it is the percent of variance in one variable that is shared with, or can be explained by, the other variable. We’ll go into this in a little more detail in our next lesson.
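In code, it really is just one step; for instance, squaring the r of 0.6 from the earlier scatterplot examples gives the share of variance the two variables have in common.

```python
r = 0.6                          # a Pearson correlation, like the earlier example
r_squared = r ** 2               # coefficient of determination
print(f"r^2 = {r_squared:.2f}")  # 0.36: 36% of the variance in one variable
                                 # is shared with / explained by the other
```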
The third thing we can do with the r value is look at its associated significance, the p value for the Pearson r. Just like with our F values and t values, that tells us the probability that chance can explain the result. We can get chance relationships too, so we have to look at the associated p value and say whether that Pearson r is statistically significant or not.
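In Python, scipy's pearsonr returns both the r value and its associated significance in one call; the ten paired measurements here are made up for illustration.

```python
from scipy.stats import pearsonr

# Hypothetical paired measurements on 10 subjects.
x = [2, 4, 5, 5, 6, 7, 8, 9, 10, 12]
y = [1, 3, 2, 4, 5, 4, 6, 7, 8, 9]

r, p = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")
if p < 0.05:
    print("Statistically significant: unlikely to be a chance relationship.")
else:
    print("Not significant: chance is a plausible explanation.")
```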
One thing we need to be careful about with correlation is that you can get some crazy, nonsensical significant correlations out there. Sometimes these are called fishing expeditions, or searching for significance. If you correlate enough variables with each other, some will come out significant just by chance, so we have to be very careful that the correlations make sense. If a study throws a lot of variables out there and correlates them all with each other, something will eventually stick, so we have to be skeptical when we see those types of studies.
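A simulated sketch of why fishing expeditions are dangerous: correlate 20 columns of pure random noise with each other, and some pairs will come out "significant" at the 0.05 level by chance alone.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
data = rng.normal(size=(50, 20))   # 50 subjects, 20 variables of pure noise

hits, tests = 0, 0
for i in range(20):
    for j in range(i + 1, 20):     # every distinct pair of variables
        _, p = pearsonr(data[:, i], data[:, j])
        tests += 1
        if p < 0.05:
            hits += 1

# With 190 tests at alpha = 0.05, we expect roughly 9 or 10 false positives.
print(f"{hits} of {tests} correlations were 'significant' by chance alone")
```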
Finally, I covered this in the video introduction, but correlation is not causation. To say that two variables are correlated with one another doesn’t mean that one causes the other. We’ll talk about the challenges of establishing a causal relationship in a future lecture. I always think of the nonsensical example of the correlation between pant length and height. If I went to a grade school or a high school and correlated student height with student pant length, there would be a significant positive correlation. So if parents wanted their sons or daughters to grow tall, should they buy them long pants? Of course not; it makes no sense, but it is a significant correlation.
Okay, let’s look at a correlation example. This is a phenomenon I remember learning about when I was in college: people who say they don’t smoke, except when they drink. They only smoke when they’re at bars, when they have a beer in their hand, or when they’re in social settings with alcohol. We’re going to explore that relationship and see if it exists. In this study, 15 male college students who smoked and drank were asked to keep track of how much they spent on each product per week.
Now, the SPSS output is pretty simple here. It gives us a matrix where both variables appear as both the rows and the columns; you’ll see alcohol and tobacco down the side and again across the top, so we get some redundant information, some information we don’t really need. On the main diagonal, each variable is correlated with itself, so if we follow alcohol to alcohol or tobacco to tobacco, we get a correlation of 1, which makes sense: a variable correlated with itself results in a perfect correlation of 1. What we’re interested in are the off-diagonal cells, which show tobacco correlated with alcohol, or alcohol correlated with tobacco. We can pick either one; again, it’s arbitrary. We see a correlation of 0.632, so it’s positive and relatively strong, and the significance, 0.011, is less than 0.05, so we say there is a positive relationship. It’s fairly well represented by a line, and it is significant; it’s likely not a chance relationship. We should also look at the picture, so we’ll ask for a scatterplot.
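Here's a sketch of the same kind of output in Python using pandas, with hypothetical weekly spending figures (not the actual data from the 15 students in the study): .corr() produces the same redundant matrix, with 1s on the diagonal and the correlation of interest repeated in the two off-diagonal cells.

```python
import pandas as pd

# Hypothetical weekly spending, in dollars, for a handful of students.
df = pd.DataFrame({
    "alcohol": [20, 35, 15, 40, 25, 30, 10, 45],
    "tobacco": [10, 15, 8, 22, 12, 14, 5, 20],
})

print(df.corr(method="pearson"))
# The diagonal is 1.0 (each variable correlated with itself); the two
# off-diagonal cells are identical, since r(alcohol, tobacco) = r(tobacco, alcohol).
```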
Here’s what our scatterplot looks like: a slightly upward-tilting relationship. The points are fairly tightly clustered, but you will see one data point there all by itself. That is certainly an outlier. I would want to check that score, look at the raw data, and see if maybe something was entered wrong. It looks like that person wasn’t drinking alcohol at all, and remember, we are only including people in this study who both drink and smoke. If the person was sick that week, or for some reason stopped drinking alcohol, that might be a reason to exclude them from the study. We might even get a stronger correlation if we eliminated that point. This just underscores that we need to always ask for a scatterplot and look at the picture when dealing with correlation.
If you would like to go through this presentation again, simply click the Replay button, and you will be returned to the beginning.