Okay, in our previous week I told you that we were going to spend two lessons talking about tests of significance when both the independent and dependent variables are quantitative. In our last lecture, we talked about correlation, which begins this process of looking at relationships between two quantitative variables rather than significant differences between groups. Now we are going to move to the next lecture, which starts from what we have already covered with correlation and takes it a little further. In this lecture, we are going to talk about regression.
So if you remember, in correlation we introduced the scatterplot. With correlation, the placement of the two variables on the x- or y-axis was arbitrary; with regression it is not. We put the independent variable on the x-axis, the horizontal axis, and in regression we will call that the explanatory variable. We put the dependent variable on the y-axis, the vertical axis, and we will call that the response variable. We have a definite independent and dependent variable with regression.
One of the things we can do with regression is make predictions, so let's start with an exercise that explains how regression works. Suppose I wanted to predict an individual's shoe size, say an individual in this class. One thing I could do is ask for the mean shoe size for the class, that is, ask for x̄ for the class; let's say the mean shoe size was 8.7. There is no shoe size of 8.7, but I could guess 8.5 and I might be fairly accurate. But is there a way I could be more accurate?
One way to be more accurate would be to get the mean shoe size for men, let's say that's 9.5, and the mean shoe size for women, let's say that's 6. If the person you were making the prediction about were male, you could guess the male x̄, and if female, the female x̄. That would probably be a better prediction, but is there a way we could improve it even further?
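The stepwise idea here, guessing the overall mean unless you know the person's group and guessing the group mean when you do, can be sketched in a few lines of Python. The mean values are the hypothetical ones from the lecture:

```python
# Predicting shoe size with progressively more information.
# The means below are the hypothetical values from the lecture.
overall_mean = 8.7
group_means = {"male": 9.5, "female": 6.0}

def predict_by_group(sex=None):
    """Guess the group mean if we know the person's sex;
    otherwise fall back to the overall class mean."""
    return group_means.get(sex, overall_mean)

predict_by_group()        # 8.7  (no information: overall mean)
predict_by_group("male")  # 9.5  (a better, group-specific guess)
```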
What if we had another quantitative variable, say head size, and we looked at the correlation between the two on a scatterplot and drew a line through the center of that collection of dots? Then we could go along the bottom, along the independent variable of head size, find the person's head size, go straight up until we hit the green line, the line that goes through the center of the data points, and then move over and read off the predicted shoe size. Would that be a more accurate way? It probably would; it would be a more precise estimate. It wouldn't be perfect, but that line we are drawing through the correlation is a regression line.
That line is drawn through the center of the data, and it is calculated by taking every data point, measuring the distance from the line to that point (the residual), and tilting the line to make the total of those residuals as small as possible. If you just added them up, the residuals above the line would cancel out the residuals below it, because you would get positive and negative numbers, so to avoid that cancellation, you square each residual before summing. That is why the line is called the least squares line, or the regression line.
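As a rough sketch of what "least squares" means computationally, here is the standard formula for the slope and intercept in plain Python. The head-size and shoe-size numbers are made up purely for illustration:

```python
def least_squares(xs, ys):
    """Return slope b and intercept a that minimize the sum of
    squared residuals (vertical distances from points to the line)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    b = num / den
    a = y_bar - b * x_bar  # the line always passes through (x_bar, y_bar)
    return b, a

# Hypothetical head sizes (inches) and shoe sizes:
heads = [21, 22, 22, 23, 24, 24]
shoes = [8.0, 8.5, 9.0, 9.0, 9.5, 10.0]
b, a = least_squares(heads, shoes)
```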
Okay, once we get that line, we can write an equation for it. If you remember from your high school algebra class, the equation for a line is y = bx + a. We can enter values for x, the independent variable, and calculate a prediction. Remember, b is the slope of the line, and a is the y-intercept, the point where the regression line crosses the y-axis.
Now the slope of the line, again from your high school algebra class, is rise over run. The b is the slope, so it represents, for every one-unit increase in x, how many units y increases or decreases; for every distance the line travels horizontally, how much it slopes upward or downward. That's b in the equation.
Now in this particular example, the value for b is 0.34, and the y-intercept is 1.5. We can enter those into our equation, so we get y = 0.34x + 1.5. We can then enter head sizes and make predictions about a person's shoe size. We could solve for two values of x, let's say 21 and 24, and draw a line between those two points; that would be our regression line. If we solved for a third value, let's say 23, that point should fall on the line.
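Using the lecture's values b = 0.34 and a = 1.5, the predictions for those three head sizes work out like this:

```python
def predict_from_head_size(head_size):
    # y = b*x + a with b = 0.34 and a = 1.5 from the lecture
    return 0.34 * head_size + 1.5

predict_from_head_size(21)  # ≈ 8.64
predict_from_head_size(24)  # ≈ 9.66
predict_from_head_size(23)  # ≈ 9.32, which falls on the same line
```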
If you recall from the correlation lecture, we talked about squaring Pearson's r, and that is something we use in regression. The coefficient r² is essentially an estimate of the usefulness of the regression line. More precisely, it tells you the percent of variation in the dependent variable that can be explained by changes, or variation, in the independent variable. Remember, we call the independent variable the explanatory variable. Because it gives us the percent of variation explained, r² is also sometimes referred to as the coefficient of determination.
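A minimal sketch of where r² comes from: it compares the squared prediction errors of the regression line against the total variation around the mean. This is a generic illustration, not tied to any particular lecture data:

```python
def r_squared(xs, ys, b, a):
    """r^2 = 1 - SS_residual / SS_total: the fraction of variation
    in y explained by the regression line y = b*x + a."""
    y_bar = sum(ys) / len(ys)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                    # total variation
    ss_res = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))  # leftover after the line
    return 1 - ss_res / ss_tot
```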
Now, like previous tests of significance, there is a p-value associated with it as well. We can interpret the significance of a regression line using the typical default level of significance of 0.05. If we get an r² with an associated p-value of less than 0.05, we can say that the regression is likely not the product of chance; that it is likely the sign of a true relationship. We can't say for sure, but we can certainly estimate the probability of being wrong.
With a significant regression equation, there are several things we can look at. We can look at r², which I told you what it refers to. We can go back and look at r, the correlation, so we can look at the magnitude and the direction of the relationship and say whether it is positive or negative. And then we have the associated significance, so we can say whether this linear relationship is likely the product of chance or not.
Now in regression research, you will often hear the terms predict and explain. We can use regression to predict future behavior, but we can also use it retrospectively, looking back to try to explain what has already occurred. I will give you an example where both are really used interchangeably.
In my department, the Physical Therapy department, we see a lot of research using regression to try to predict or explain falls in the geriatric population. Right now, believe it or not, our best predictor of a fall is whether the person has fallen before, and that is a regrettable predictor, because often that first fall is very damaging and, as you are aware, can be the beginning of a cascade of bad things. So we want a better predictive model; certainly physical therapy, healthcare, and the insurance industry would like to see one. There are many other variables that could be entered: a person's strength, coordination, eyesight, cognitive ability, weight, height, age. We can actually use what is called a multiple regression model, where we add multiple factors together. We won't cover that in this class; we are just going to look at simple regression, considering one variable at a time, but more complex regression modeling can involve multiple factors. In geriatric research, we often look back to see who the fallers were and what factors contributed to their falls, and the hope is to be able to use that to move forward. The usefulness of the model lies in its ability to predict.
Let me give you a practical example of a regression problem. In healthcare, in rehab, often before a patient has surgery they ask about recovery time: how much time before they can return to work, or return to lifting. If this is shoulder surgery, "How long before I can lift a bag of groceries? How many days before I can lift my baby?" We know that one of the factors that can influence that time is a person's age. We are going to look at a regression equation where the dependent variable is recovery time, the number of days after shoulder surgery before the patient returns to normal activity, and the independent, or explanatory, variable is age.
I have a small sample here of 10 patients who have had this surgery in the past. For each patient we have their age, and then how many days it took before they could return to lifting.
Let me walk you through the SPSS printout. The first set of data on the printout just tells you the variables that were entered, and we entered age. Then in the model summary we look at R, the Pearson r correlation coefficient. We see that it is a positive relationship, so the older you are, the more days it takes to recover; that makes sense. Then R squared tells us that 31% of the variation in days of recovery can be explained by a person's age, roughly a third of the variation, so it's not perfect, but it is a decent predictor.
As we keep working down the SPSS printout, the second set of data is titled ANOVA. We won't go into the details, but the principles of ANOVA are used to judge the significance of the R squared for the regression. In this case it is not significant: the p-value is 0.094. But let's go ahead and proceed; let's assume it is significant so we can use this information. Below that we get the coefficients. These are the values we are going to use for our regression line; remember, y = bx + a, so we are going to get our b value and our a value here. We have a choice between unstandardized and standardized coefficients, and we want the unstandardized coefficients, meaning they are in the original unit of measure. The patient is going to want to know how many days, not a standardized unit, so we look at the unstandardized column. The constant is our y-intercept; think of the constant as an anchor point that anchors the regression line to the y-axis, and that is −5.054. The coefficient for age is the slope of the line, and it is 0.272.
If we go to the scatterplot for these data, we can take our equation for a line, enter the values we know for b and a, and then enter values for x to get predicted values for y. With this equation, we can actually give someone a reasonable estimate of how many days before they can expect to return to lifting.
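Plugging the unstandardized coefficients from the printout (slope 0.272, intercept −5.054) into the equation gives a simple prediction function; the age used below is just an illustrative value, not one of the lecture's patients:

```python
def predicted_recovery_days(age):
    # y = b*x + a with the SPSS unstandardized coefficients:
    # b = 0.272 (days per year of age), a = -5.054
    return 0.272 * age - 5.054

predicted_recovery_days(60)  # ≈ 11.3 days for a hypothetical 60-year-old
```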
If you would like to go through this presentation again, simply click the Replay button, and you will be returned to the beginning.