Welcome to Week 4 lecture 1. Today we’re going to talk about significance testing.
Now, we’ve just completed a week with a lot of Z-score problems, so you’re familiar with this image of Mu and x, and with finding the area from Mu to x or the area greater than x; you’re pretty comfortable with that by now. Now we’re going to change things slightly, but the same rules apply.
With a test of significance or a significance test, we’re typically dealing with a sample. So instead of x, we have an x bar. And it still operates on a normal distribution so we can make the same judgments and use the same probabilities with tests of significance.
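As a quick refresher on that Z-score machinery, here is a minimal sketch in Python using SciPy; the population values and the score are all made up for illustration, not from any study in this course.

```python
# Minimal sketch of last week's Z-score areas; all numbers are hypothetical.
from scipy.stats import norm

mu, sigma = 100, 15                  # hypothetical population mean and SD
x = 120                              # hypothetical individual score

z = (x - mu) / sigma                 # Z-score for x
area_mu_to_x = norm.cdf(z) - 0.5     # area from Mu to x
area_above_x = 1 - norm.cdf(z)       # area greater than x
print(z, area_mu_to_x, area_above_x)
```

With a sample mean x bar, the same lines apply once we standardize with the standard error instead of sigma, a distinction we’ll get to later in this lecture.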
When we start a test of significance, we start with a hypothesis. We’re going to make a statistical hypothesis, and it’s going to be written about a population parameter. We’re going to make a judgment that our sample is going to be different from the population parameter. Now in reality, this hypothesis may be true or may not be true.
Now we’re going to write the hypothesis two different ways. First we’re going to write it in null form, stating that there is no difference. We call this the null hypothesis, and the notation for it is H0: H with a subscript 0. The null hypothesis just states that there is no difference between the two parameters, and those two parameters typically describe the regular group, maybe patients, and a group of patients getting some type of special treatment, our independent variable.
And we will write an alternative hypothesis, which is the kind you’re probably used to seeing. It is notated with H1, and it states that there is a difference between the two parameters.
So let’s use our wheeled walker study again as an example. Let’s say we have a nurse interested in finding out whether wheeled walkers result in more falls.
Now, let’s say we have access to data suggesting that the average number of falls per 100 patients is 8.2. So our H0, our null hypothesis, is that wheeled walkers won’t make any difference, so Mu would equal 8.2. And the alternative hypothesis is that Mu does not equal 8.2. Now we call this a two-tailed test.
Now, this is a two-tailed test because we’re going to accept evidence on either side of 8.2, on the right tail or the left tail. If our actual data falls in this yellow area, then we will accept the alternative hypothesis and reject the null hypothesis. The hypotheses for two-tailed tests are also called nondirectional: we write them just saying there is a difference, without indicating the direction of the difference. So it’s a nondirectional hypothesis and a two-tailed test of significance. We’ll take an extreme score on either side of Mu.
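Here is a sketch of that two-tailed decision rule in Python, assuming the conventional .05 significance level we’ll formalize later in this lecture; the cutoff and the helper function are illustrations, not part of the study.

```python
# Two-tailed decision rule at an assumed .05 level: reject H0 if the
# standardized sample mean lands in either yellow tail.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)   # about 1.96; cutoff for each tail

def two_tailed_decision(z):
    # H0: Mu = 8.2 versus H1: Mu != 8.2 (nondirectional)
    return "reject H0" if abs(z) > z_crit else "retain H0"
```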
Now, we can take the same scenario, a nurse interested in finding out whether wheeled walkers result in more falls, and we can set it up a different way.
In this scenario, instead of hypothesizing that our group will just have a different score than 8.2, we’re going to hypothesize that the treatment group has a significantly greater number of falls than the population. That is a directional hypothesis. Statistically, we will set this up with a one-tailed test of significance. And since our area of interest is a significant increase, our tail will be on the right side.
Now, obviously one-tailed tests can also be set up the opposite way, with our area of significance on the left side of the normal distribution. So let’s say we have someone who hypothesizes that wheeled walkers actually will reduce falls. We could take the same idea and make a directional test, now with our area of significance on the left side of the normal distribution. So H0 would be that Mu is greater than or equal to 8.2, and the alternative hypothesis is that Mu is less than 8.2.
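The left-tailed version looks almost identical as a sketch; here all of the assumed .05 area sits in a single tail, so the cutoff is about -1.645 rather than ±1.96.

```python
# Left-tailed decision rule at an assumed .05 level.
from scipy.stats import norm

alpha = 0.05
z_crit_left = norm.ppf(alpha)      # about -1.645; all of alpha in the left tail

def left_tailed_decision(z):
    # H0: Mu >= 8.2 versus H1: Mu < 8.2 (directional)
    return "reject H0" if z < z_crit_left else "retain H0"
```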
Okay, a question that may come up is: how do we determine whether to use a one- or a two-tailed test? In general, we use the literature, past research, as a guide. If we’re going to do a one-tailed test, we’ve got to be pretty certain that our x bar is going to move in the direction we hypothesize. If you set up your area of significance on the left side, as illustrated in the picture to your left, and your x bar lands on the right side, for the most part those studies are thrown out; you kind of have to stop right there. So when in doubt, do a two-tailed test. And you have to make this decision, ethically, before you collect the data. Now, why not just do two-tailed tests all the time? There’s a big advantage to a one-tailed test that we’re going to cover shortly. So most statisticians will prefer one-tailed tests, but if you’re not certain that your intervention is going to move the dependent variable in the direction you hypothesize, it’s safer, more conservative, to use a two-tailed test.
So once we decide on our null and alternative hypotheses and whether we’re going to do a one- or a two-tailed test, we identify the critical value we’re going to use, that area shaded in yellow. Then we collect our data and run a statistical test. That test will take the data from our sample and essentially serve as the yardstick we use to determine whether or not to reject the null hypothesis.
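Putting those steps together, here is a minimal one-sample Z-test sketch; the sample mean, sample size, and standard deviation are all hypothetical numbers, not data from any real study.

```python
# One-sample, two-tailed Z-test sketch; every number is hypothetical.
import math
from scipy.stats import norm

mu0 = 8.2                  # falls per 100 patients under the null hypothesis
sigma = 2.5                # assumed (conservatively estimated) population SD
n = 100                    # hypothetical sample size
x_bar = 9.1                # hypothetical observed sample mean

se = sigma / math.sqrt(n)          # standard error of the mean
z = (x_bar - mu0) / se             # the test statistic: our yardstick

z_crit = norm.ppf(0.975)           # two-tailed cutoff at the .05 level
print("reject H0" if abs(z) > z_crit else "retain H0")
```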
There are three classic assumptions that go along with most tests of significance. The first is that the observations are independent. That simply means that the order in which you collect the data should have no influence on the performance of the subjects. So in some cases, that means the subjects shouldn’t watch you collect the data. If you’re collecting data on ten subjects, subjects two through ten shouldn’t watch you collect data on subject one. They might learn from that individual, and that might impact their performance. Those observations would be dependent: the scores of the people who followed that first person would be artificially different, higher or lower, because they observed the first person you collected data on. So the observations need to be independent. The second assumption is that we’re drawing from a normally distributed population, and from our previous chapters you know that’s a pretty fair assumption; normal distributions are prevalent. And the third assumption is that we know the standard deviation of the population. Now, we won’t know it precisely, but we will be able to make a conservative estimate of it.
So our test of significance will tell us whether our sample falls within the critical region or not. In the example illustrated here, a two-tailed test of significance, it will tell us whether our sample falls within the area that allows us to reject the null hypothesis.
Let’s return to our wheeled walker example. We have it set up here as a two-tailed test of significance. So our null hypothesis is that Mu equals 8.2, and our alternative hypothesis is that Mu does not equal 8.2 falls per 100 patients.
Now let’s say that instead of one sample, we collect multiple samples. Here’s an illustration of six samples. Let’s say the first time we collect data, we get a value slightly higher than 8.2, say 8.7. We collect again and get a value less than 8.2, say 7 falls. Third time, a lot higher than 8.2. Fourth time, a lot lower than 8.2, and so on. Let’s say we do this 100 times or 1,000 times, keep track of all of those mean scores, and make a distribution of all of those collected means.
If we collect all those mean scores and make a distribution out of those, guess what? It’s normally distributed, and the mean of that distribution will be the mean of the population. So the mean will be 8.2 and it will be normally distributed.
Now, this concept is a law of mathematics, a statistical law. We call it the central limit theorem. It’s really a linchpin concept for inferential statistics; really, everything we do is based on it. It states that as the size of the samples in a sampling distribution increases, the mean of that distribution gets closer and closer to the actual population mean, its standard deviation gets smaller and smaller, and the distribution is normal. You can think of it this way: as you collect more and more data, it gets more and more accurate. You keep getting a smaller and smaller standard deviation, centered on the true mean.
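You can watch the central limit theorem happen in a small simulation. Here’s a sketch that draws many samples from a deliberately non-normal (exponential) population with mean 8.2; all parameters are chosen just for illustration.

```python
# Central limit theorem sketch: sample means center on the true mean
# with a shrinking spread as the sample size grows.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 8.2                    # mean of the (non-normal) population

for n in (5, 30, 100):             # increasing sample size
    means = [rng.exponential(true_mean, n).mean() for _ in range(1000)]
    print(n, round(float(np.mean(means)), 2), round(float(np.std(means)), 2))
# The mean of the sample means stays near 8.2; their SD keeps shrinking.
```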
And the standard deviation of this theoretical sampling distribution is very important. It’s referred to as the standard error of the mean. If you remember, earlier I told you we don’t know the standard deviation of the population, but that we can make a conservative estimate of it. It will be this estimate, the standard error of the mean, that we will use.
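In practice that estimate comes from the sample itself: the sample standard deviation s divided by the square root of n. A sketch, with made-up fall counts:

```python
# Standard error of the mean estimated from a hypothetical sample.
import math
import statistics

falls = [7, 9, 8, 10, 8, 7, 9, 8]        # hypothetical per-100-patient counts
s = statistics.stdev(falls)              # sample standard deviation
se = s / math.sqrt(len(falls))           # standard error of the mean
print(round(se, 3))
```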
So consider all of this variation that we see in the sampling distribution: we won’t get exactly 8.2 every time we take a sample. We may get it sometimes, but we’re going to see some natural fluctuation. We refer to that as random error or sampling error. It’s the naturally occurring fluctuation we see when we take multiple samples.
Now, I know we call it sampling error or random error, but a common misconception is that it is the result of an error or a mistake. It is not. It is naturally occurring variation from Mu, the true population mean. Statisticians use the term error for that variation, but it is not the result of a mistake.
Okay, now let’s look at our picture again with our wheeled walker study. In the two-tailed example, we have our critical areas shown in yellow on both tails of the distribution, the right and the left. I haven’t spoken about the area under those yet, and that area is important. The size of that area is called the level of significance, and it is closely tied to what’s called the P-value; a common level of significance is .05. So in this case it’s a .05 level of significance: both tails added together equal .05, or 5% of the total area falls within those tails. We’re going to talk about this more in lecture two of this week, but I wanted to introduce the concept that we’re going to call that area the significance level. And just like in our Z-score exercises, it’s the area under the curve represented by that yellow region.
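To connect that yellow area back to the Z-table work, here’s a sketch; the observed statistic of 2.3 is a made-up number used only to show the calculation.

```python
# At the .05 level, each tail holds .025 of the area; the P-value of an
# observed statistic is the matching tail area (doubled for two tails).
from scipy.stats import norm

per_tail = 0.05 / 2                      # .025 in each tail
z_cut = norm.ppf(1 - per_tail)           # about 1.96

z_obs = 2.3                              # hypothetical observed statistic
p_two_tailed = 2 * (1 - norm.cdf(abs(z_obs)))
print(z_cut, round(p_two_tailed, 4))     # p is below .05, so we'd reject H0
```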
Now, if you think back to the beginning of this lecture, this whole process started with a conjecture, a hypothesis. It’s an educated guess, but we don’t know anything with absolute certainty. So we must acknowledge that in reality our null hypothesis may be true, or it may not be true. And we have a decision to make as to whether to accept or reject that null hypothesis, based on whether our sample falls within that critical yellow area or not. We must also acknowledge that our decision may be correct or incorrect.
So we can set up a two-by-two matrix made up of four quadrants that illustrates this. On the horizontal axis is reality, which we’ll never know for sure; remember, we’re just taking samples, and we’ll never have access to population data. So in reality our null hypothesis may be true (there is no difference), or the null hypothesis may be false (there is a difference). Then, as a researcher, I have a decision to make, with two options: I can reject the null hypothesis or I can accept it. That decision goes on the vertical axis. If we combine those together, we come up with four possibilities.
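Laid out as text, the four quadrants look like this:

                          Reality: null true      Reality: null false
    Decision: reject H0   type 1 error            correct (okay)
    Decision: accept H0   correct (okay)          type 2 error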
So again, let’s look at the idea of wheeled walkers causing more falls. The null hypothesis is that there is no difference. It’s possible, then, that I could accept the null hypothesis as my decision while the null hypothesis is in fact true. I’ve made a correct decision there, represented in the lower left with the okay. And it’s possible that the null hypothesis is false, that wheeled walkers do create more falls, and in that case I could reject the null; again, I would have made a correct decision, represented by the word okay in the upper right. So those are the two correct decisions. But on the flip side, I can make two types of errors as well. The first kind is a type 1 error: rejecting the null when in reality it is true, when it should have been accepted. A type 2 error is accepting the null when the null is false, accepting a null that should have been rejected.
So here’s a summary of what I just said. Again, the question in this study is whether wheeled walkers result in more patients falling. If I reject the null, it’s possible that decision was correct, that there really is a difference; but it’s also possible that rejecting the null was an error, a type 1 error. Or I could retain the null, and it’s possible that null is correct, or it’s possible it’s incorrect and I should have rejected it. That is a type 2 error.
Now an important point here, my level of significance is the probability of committing a type 1 error. So here’s our illustration again of our wheeled walker study. It is possible that random variation, remember random error or sampling error, would create a difference that far away from 8.2, but I would only expect that to happen by chance 5% of the time. So the level of significance is also the probability of committing a type 1 error.
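You can check that claim by simulation: generate samples where the null hypothesis really is true (Mu is exactly 8.2) and count how often the test wrongly rejects. A sketch, with hypothetical parameters:

```python
# Type 1 error rate simulation: with H0 true, rejections happen about
# alpha = 5% of the time, purely from sampling error.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu0, sigma, n, alpha = 8.2, 2.5, 100, 0.05
z_crit = norm.ppf(1 - alpha / 2)
se = sigma / np.sqrt(n)

trials, rejections = 10_000, 0
for _ in range(trials):
    x_bar = rng.normal(mu0, sigma, n).mean()   # the null is true here
    if abs((x_bar - mu0) / se) > z_crit:
        rejections += 1                        # a type 1 error
print(rejections / trials)                     # close to 0.05
```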
Now, common levels of significance are .10, .05, and .01. We’re going to use .05 quite a bit. A level of significance of .10 I don’t see very often. In healthcare you will see a .01 level of significance. If you do a literature review, you can see what a common level of significance is for the topic you’re studying, but in general a rule of thumb is how risky the study is. For example, in drug studies, I believe that to get FDA approval you need to show effectiveness at a .01 level. And if you think about it, with a type 1 error for a particular drug, say chemotherapy with some significant, risky side effects, you want to be pretty confident that the drug is effective. Where I have more experience with statistics is in physical therapy research.
For the most part, physical therapy interventions don’t involve a lot of risk. They’re pretty conservative treatments, oftentimes done instead of surgery or tried first in order to avoid riskier treatments. So much of physical therapy uses .05, and I think a lot of nursing studies will use .05. But again, if it’s an intervention that involves more risk, where a type 1 error could have more serious ramifications, you’ll probably settle on a significance level of .01.
Now I want to end this lecture with a few examples that may help the idea of type 1 and type 2 errors sink in. It’s kind of an abstract concept. These are some examples I learned in graduate school. Two of the topics are somewhat controversial, but if they help the difference between type 1 and type 2 errors stick in your mind, you can use them or discard them at your own discretion.
Now, the first one is the OJ Simpson trial. Criminal trials are one way you can look at type 1 and type 2 errors, because in a way they act like research: there are two possibilities, the person is guilty or innocent. They either did it or they didn’t. But a jury has a decision to make, to find them guilty or not guilty. And notice that juries do not use the term innocent. They find the defendant not guilty, which acknowledges that they don’t know for sure whether the defendant is truly innocent.
So let’s look at this. So if we look at the OJ murder trial, there’s two possibilities. Again, across the top on the horizontal axis, he either killed his ex-wife or he did not. He’s either guilty or he’s innocent. The jury had two possible decisions to make, guilty or not guilty.
Now let’s look at the guilty option first. It’s possible that they could have found him guilty when he actually killed his wife; they would have made a correct decision. It’s also possible that they could have found him guilty when he didn’t kill his wife. That is a type 1 error. Then there’s the other possibility, which is what the jury actually decided: not guilty. There they could have made a correct decision, or they could have found him not guilty and made a type 2 error.
This example, I think, illustrates that we have a preference about the types of errors we make: we are more worried about type 1 errors. If you think about it in the OJ example, our legal system is set up the same way. Of the two types of errors, we really want to avoid finding an innocent person guilty. That’s worse. Finding a guilty person not guilty is bad, but incarcerating or punishing someone for a crime they didn’t commit is considered far worse. And research works the same way. Let’s say you decided to reject the null hypothesis. That would indicate that current practice needs to change, and if that’s wrong, you’ve erroneously told people to change their decisions about whether to use wheeled walkers. A type 2 error, even though it’s still an error, would keep things the way they are; it maintains the status quo, so it’s a more conservative error. So in research, again, we are far more concerned with type 1 errors.
The other example of applying type 1 and type 2 errors, a non-research example, is belief in God. If you think about it, there are two possibilities: God exists or God doesn’t exist. By the way, I’m making no claim that you should or shouldn’t believe in God; we’re just using this as an example to help you learn the difference between type 1 and type 2 errors. I think we can all acknowledge those two possibilities. And then we have two decisions as humans living on earth: we can believe, or we can not believe.
So let’s do the first row first: I choose not to believe. If I don’t believe and that’s correct, God doesn’t exist, then I have nothing to worry about. But if I don’t believe and God exists, I have some explaining to do when I die. That’s a bad decision, far worse than the type 2 error, which is believing when God doesn’t exist. Certainly I’d waste a lot of time going to church and, if I’m a Christian, getting involved in Christian activities, and it wouldn’t lead to anything because God doesn’t exist. That’s an error, but it’s not as bad as the type 1. Or I could choose to believe and be correct: God does exist.
This is a classic illustration called Pascal’s Wager. You can Google it and see different versions of it. Again, I’m not arguing that you should or shouldn’t believe in God; I’m just saying that Pascal was a mathematician, and his wager is a handy way to illustrate type 1 and type 2 errors. So if it helps you understand them, use it.
Now the third example is a little sillier, but again, if it helps you recognize the difference between type 1 and type 2 errors, feel free to use it. Suppose you’re in the jungle looking at some bushes: is there a lion hiding in those bushes? There’s a reality: there is a lion in the bushes or there isn’t. And you have a decision: you can stay or you can run.
Let’s look at the bottom row first: choosing to run. I can be correct, meaning there is a lion in the bushes, I ran, and I made the correct decision. Or I could have chosen to run when there wasn’t a lion in the bushes; that is a type 2 error. Choosing to stay would be an example of rejecting the null, and there are two possibilities: I could stay when the lion wasn’t there, which is correct; or I could choose to stay when the lion was there, and I get eaten. That’s a type 1 error. So you can clearly see the preference: a type 1 error is far more dangerous than a type 2 error.
Again, use any of those three examples if they help you learn the difference. In my mind, I always remember that both are errors: one is incorrectly accepting and one is incorrectly rejecting, and incorrectly rejecting is a type 1. If you keep that square in your mind, that may help you keep track of the two as well.
End of Presentation