In this explainer, we will learn how to determine whether a sample is biased or unbiased.
In most statistical studies where the population is large, it is too costly and time-consuming to collect data from the entire population, which is the method of mass population (a census). To save time and money, researchers can instead collect data from a sample of this population, which is the method of samples. Using statistical inference, researchers can then estimate a population characteristic from sample statistics.
This raises an important question. What if the selected sample does not accurately represent the entirety of the population? For instance, let us consider an example of collecting data to study per capita income trends in a city. Using the method of samples, say we select 100 random individuals from a certain neighborhood in the city and collect data from them. Would this data produce a good estimate for the mean per capita income of the whole city? To answer this question, we need to define what it means for a sample to be representative of the population.
Definition: Representative Sample
A sample is representative of the population if the sample and the population share similar distributions of individuals’ characteristics relevant to the variable of study.
In our example, we have selected our sample entirely from one neighborhood. We know that individuals living in the same neighborhood are likely to have similar incomes and that general income levels can greatly differ among different neighborhoods. In other words, the distribution of income figures in this neighborhood may not be similar to that in the entire population. Hence, this sample is likely not representative of the population.
In our first example, we will consider why a given sampling method would not lead to a sample that is representative of the population.
Example 1: Understanding Representative Samples
An animal rescue center wants to find out if people in its town think more money should be spent on animal welfare. They plan to ask a random sample of their visitors and volunteers to fill in a questionnaire. Why would their sample not be representative of the town’s population?
- There might be a high proportion of children in the sample, which will skew the result.
- We do not know how they will ensure that the sample is random.
- People will not want to spend time filling in the questionnaire.
- The center’s visitors and volunteers are likely to be supportive of spending on animal welfare.
Answer
In this example, the population under study is the group of individuals living in this town. A sample is formed from the visitors to and volunteers at the animal rescue center. However, a visitor or a volunteer is more likely to care about animal welfare than an individual from the general population of this town. This means that, at least concerning animal welfare, this sample would not be representative of the town’s population. This reason is given in option D.
The remaining options raise partially valid concerns, but none of them is the best reason. Let us consider each in turn.
Families with children are more likely to visit an animal rescue center, so option A is a valid concern for the study. However, this concern does not relate directly to animal welfare, so it is not the best reason. Options B and C describe concerns present in any sampling method, and they are not particularly large concerns for this study.
Hence, the primary reason why this sample would not be representative of the town’s population is option D.
In the previous example, we considered why a given sampling method would not lead to a sample that is representative of the population. To form a sample that is representative of the entire population, researchers should ensure that each individual in the population has an equal chance of being selected for the sample. This does not guarantee a representative sample, since one can always end up with a box of bad apples by chance. Hence, it is also important to select a sufficiently large sample to reduce this effect. The required size of a sample depends both on the types of data and on the size of the population and will not be explicitly discussed in this explainer.
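The effect of sample size on chance variation can be illustrated with a short simulation. This is only a sketch under stated assumptions: the population (10,000 values spread uniformly between 0 and 100), the sample sizes 10 and 400, and the helper name `spread_of_sample_means` are all made up for illustration. Every sample is drawn so that each individual has an equal chance of being selected.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, trials, rng):
    """Standard deviation of the sample mean over repeated equal-chance samples."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 values spread uniformly between 0 and 100
population = [rng.uniform(0, 100) for _ in range(10_000)]

# Repeat the sampling 500 times at each sample size
small = spread_of_sample_means(population, 10, 500, rng)
large = spread_of_sample_means(population, 400, 500, rng)

print(f"spread of sample means with 10 individuals:  {small:.2f}")
print(f"spread of sample means with 400 individuals: {large:.2f}")
```

In this sketch, the means of the larger samples cluster much more tightly around the population mean, which is why a larger sample is less likely to be unrepresentative purely by chance.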
Let us return to the example of sampling for per capita income trends where we selected a random sample from a certain neighborhood. We can see that forming a random sample from a certain neighborhood does not give an equal chance for each individual in the city to be in the sample. In this sense, selecting a sample restricted to a smaller subset of the population is not a good method of obtaining a representative sample of the population.
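To make this concrete, we can simulate the situation. The following is a minimal sketch, assuming a hypothetical city with three equally sized neighborhoods whose names and typical incomes are invented for illustration. It compares a sample of 100 individuals drawn from one neighborhood with a sample of 100 individuals drawn from the whole city.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical city: three neighborhoods with different typical incomes
typical_income = {"north": 30_000, "center": 50_000, "south": 90_000}
city = [
    (name, rng.gauss(center, 5_000))
    for name, center in typical_income.items()
    for _ in range(2_000)
]

city_mean = statistics.mean([income for _, income in city])

# Method 1: 100 random individuals from a single neighborhood
north_only = [income for name, income in city if name == "north"]
restricted_mean = statistics.mean(rng.sample(north_only, 100))

# Method 2: 100 random individuals from the whole city (equal chance for all)
citywide_mean = statistics.mean(
    [income for _, income in rng.sample(city, 100)]
)

print(f"mean income in the city:        {city_mean:,.0f}")
print(f"estimate from one neighborhood: {restricted_mean:,.0f}")
print(f"estimate from a citywide draw:  {citywide_mean:,.0f}")
```

Under these assumptions, the single-neighborhood estimate stays close to that neighborhood's typical income, however carefully the 100 individuals are randomized within it, while the citywide draw lands much nearer the true city mean.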
Let us now consider an example where we will determine which of the given methods will lead to a representative sample.
Example 2: Understanding Representative Samples
Which of the following is a representative sample?
- A student wants to find out how much students at their school enjoy math classes, so they give a questionnaire to everyone at the math club.
- To find out how students travel to school, student representatives from each grade ask a random sample of 20 students from their grade.
- A hospital wants to investigate the reasons why people go to the emergency room, so questionnaires are handed out to a random sample of people waiting in the emergency room on a Monday morning.
- A market research company wants to find out how much waste people recycle, so they survey 100 people at the city recycling drop-off location.
Answer
Recall that a sample is representative of the population if the sample and the population share similar distributions of individuals’ characteristics relevant to the variable of study. To form a sample that is representative of the entire population, we must ensure that each individual in the population has an equal chance of being selected for the sample.
To determine whether a sampling method will lead to a representative sample, we need to first identify the population and the variable of study and ask whether each individual in the population has an equal chance of being selected for the sample. If each individual does not have an equal chance of being in the sample, we should then ask whether this discrepancy may influence the distribution of individuals’ characteristics relevant to the variable of study. Let us examine each option separately.
- The population of study is the group of students in this school, and the variable of study is a student’s enjoyment of math classes. In this option, the sample is the math club. Students who are not in the math club cannot be selected to be in the sample. In particular, we can see that students who belong to the math club will tend to enjoy math classes more than students who do not belong to the math club. Hence, the distribution of students who like mathematics in this sample is likely not similar to that in the entire population. Hence, this sample is not representative of the entire population.
- The population of study is the group of students from this school, and the variable of study is a student’s mode of transportation to school. In this option, a random sample of 20 students from each grade is selected. Assuming that the number of students in each grade is similar, this sampling method gives equal chance for each student in the school to be selected for the sample. Then, the distribution of students according to how they travel to school in this sample is likely to be similar to that in the entire population. Hence, this is a representative sample.
- The population of study is the group of people going to the emergency room, and the variable of study is the reason an individual goes to the emergency room. In this option, the sample is selected from people waiting in the emergency room on a Monday morning. According to this sampling method, people who go to the emergency room at any other time of the week cannot be counted. In particular, people who come to the emergency room on a weekday may be more likely than weekend patients to be there because of work-related injuries. Hence, the distribution of reasons may be different for this sample than it is for the entire population, so this is not a representative sample.
- The population of study is the group of people living in the city, and the variable of study is how much an individual recycles. In this option, the sample is selected from people at the city recycling drop-off location. According to this sampling method, people who do not go to the city recycling drop-off location cannot be counted. In particular, the large group of people who do not recycle or who recycle at other venues is excluded from this sample. Hence, the distribution of the amount of recycling may be different for this sample than it is for the entire population. Hence, this is not a representative sample.
The only representative sample is given in option B.
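The assumption made for option B, that each grade has a similar number of students, matters. As a small sketch with made-up grade sizes (the question does not give any), we can compute the chance that a given student is selected when 20 students are drawn at random from each grade.

```python
# Hypothetical grade sizes; the example does not state the real ones
grade_sizes = {"grade 7": 100, "grade 8": 100, "grade 9": 50}
per_grade_sample = 20  # from the example: 20 students per grade

# Chance that one particular student in a grade is selected
inclusion_probability = {
    grade: per_grade_sample / size for grade, size in grade_sizes.items()
}

print(inclusion_probability)
# {'grade 7': 0.2, 'grade 8': 0.2, 'grade 9': 0.4}
```

With these invented sizes, students in the two equally sized grades have the same chance of selection, whereas a student in the smaller grade is twice as likely to be chosen. The method gives every student an equal chance only when the grades are of similar size.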
In the previous example, we determined whether a given sampling method is likely to produce a sample that is representative of the entire population. In many cases, when we impose arbitrary restrictions on the individuals we select for our sample, we will likely end up with a sample that is not representative of the population.
Definition: Biased Sampling
Biased sampling is a method of forming a sample that favors certain values of the variable of study. A sample obtained from a biased sampling method is likely not representative of the entire population.
Let us return to our example of sampling for per capita income trends in a city where our sample was selected from a certain neighborhood of the city. Here, the variable of study is an individual’s income figure. We have already observed that this sampling method would likely lead to a sample that is not representative of the entire population. Collecting random individuals from one neighborhood favors a specific income level that is prominent in that neighborhood, which means that this sampling method favors a specific range of values for our variable. Hence, the sampling method is biased.
A sample that is obtained using a biased sampling method is called a biased sample. A biased sample is very unlikely to be representative of the population. On the other hand, if a sampling method is not biased, then the resulting sample is called an unbiased sample. An unbiased sample is likely to be representative of the population. However, as discussed previously, an unbiased sampling method does not necessarily give us a representative sample, especially if the sample size is not sufficiently large.
Let us consider an example where we determine whether or not a given sampling method will lead to a biased sample.
Example 3: Selecting Samples
Nader wants to find out the proportion of seventh grade students who have already been abroad. There are 250 seventh grade students in his school. He decides to number all of them from 1 to 250, generate a random list of 40 numbers between 1 and 250, and then interview the corresponding students. Is his sample biased?
Answer
Recall that biased sampling is a method of forming a sample that favors certain values of the variable of study. The variable of study is whether or not a student has been abroad.
Nader’s sampling method does not favor a student who has been abroad over a student who has not been abroad, or vice versa. This sampling method gives each individual in the population an equal chance to be in the sample, which will likely lead to a sample that is representative of the population. Hence, his sampling method is unbiased.
Since his sample results from an unbiased sampling method, his sample is unbiased.
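Nader's procedure is easy to carry out with a random number generator. The sketch below is one possible way to do it, assuming that the 40 numbers must all be different so that no student is interviewed twice. The function name `draw_student_numbers` and the seed value are hypothetical.

```python
import random


def draw_student_numbers(population_size=250, sample_size=40, seed=None):
    """Draw distinct student numbers, each student being equally likely."""
    rng = random.Random(seed)
    # range(1, 251) covers the student numbers 1 to 250 inclusive;
    # rng.sample selects without repetition
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


chosen = draw_student_numbers(seed=2024)
print(chosen)
```

Because the selection depends only on the student numbers and not on anything about the students themselves, it cannot favor students who have been abroad over those who have not.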
Let us consider another example where we determine whether or not a given method of sampling is biased.
Example 4: Selecting Samples
Mariam is doing a research project on whether or not students in her school eat healthy food. She decides to interview her friends who do gymnastics with her. Is her sample biased?
Answer
Recall that biased sampling is a method of forming a sample that favors certain values of the variable of study. Biased sampling will likely lead to a sample that is not representative of the population. A sample resulting from biased sampling is called a biased sample.
In this example, Mariam's sample is the group of her friends who do gymnastics with her. The variable of study is whether or not a student in her school eats healthy food. Her friends who do gymnastics with her are likely to care more about healthy lifestyles and food than a typical student does. This means that this method of forming a sample favors individuals who eat healthy food, so the sample is not representative of the population, which is the group of students in her school. Hence, this is a biased sampling method.
Since her sample resulted from a biased sampling method, her sample is biased.
In the previous example, the bias in the sampling method is evident because the experimenter restricted the sample to a group that is clearly related to the variable of study, which is healthy eating habits. But even if the sample were selected from a specific group that is not seemingly related to healthy eating habits, the sample could be biased because of other, more obscure factors.
In practice, biased sampling is generally an unintended consequence of overlooking a possible source of bias. Even when we take extra care to randomize our sample, it is always possible that we have missed a seemingly minor detail. A small omission of this kind can invalidate an entire collected data set.
One of the most common types of biased sampling methods is convenience sampling.
Definition: Convenience Sampling
Convenience sampling is a method of forming a sample from the individuals who are easiest to reach, such as volunteers who choose to respond.
Voluntary surveys attract individuals who hold certain types of opinions on the variable of study. Hence, forming a sample by volunteers is considered biased sampling. Unfortunately, data collected in this way is the most common type of data that we encounter. Product or movie reviews on websites, voluntary television or radio surveys, and polls posted on social media are all examples of convenience sampling. This method is very prominent because it is the easiest way to collect data. When dealing with data collected by convenience sampling, we must be aware that we are dealing with a biased sample, which is likely not representative of the population.
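We can sketch how a voluntary survey distorts the picture with a small simulation. Everything in it is an assumption made for illustration: the population's ratings on a scale of 1 to 5, and the response probabilities, which are chosen so that people with the strongest opinions (ratings 1 and 5) are far more likely to respond.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical population: most people hold a mild opinion (rating 3)
ratings = rng.choices([1, 2, 3, 4, 5], weights=[5, 20, 50, 20, 5], k=20_000)

# Assumed chance that a person with each rating volunteers a response
response_probability = {1: 0.5, 2: 0.05, 3: 0.02, 4: 0.05, 5: 0.5}

# Each person decides independently whether to respond
volunteers = [r for r in ratings if rng.random() < response_probability[r]]


def share_extreme(values):
    """Fraction of ratings that are the strongest opinions (1 or 5)."""
    return sum(1 for v in values if v in (1, 5)) / len(values)


print(f"strong opinions in the population: {share_extreme(ratings):.0%}")
print(f"strong opinions among volunteers:  {share_extreme(volunteers):.0%}")
```

Under these assumptions, strong opinions make up a small part of the population but the majority of the responses, so the distribution of opinions among the volunteers is not similar to that in the population.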
In our final example, we will consider various sampling scenarios and determine which of the resulting samples is unbiased.
Example 5: Selecting Samples
A school principal wants to find out what the students think about the teaching quality in the school. Which of these samples is unbiased?
- A questionnaire is available at the library for anyone who wants to take part in the survey.
- A list of male students to interview is randomly generated.
- A list of students to interview is randomly generated.
- All ninth grade students are interviewed.
- A list of female students to interview is randomly generated.
Answer
Recall that biased sampling is a method of forming a sample that favors certain values of the variable of study. The variable of study in this example is a student’s opinion on the teaching quality in the school. Let us consider the method in each option. We need to determine whether or not the method used for forming the sample favors students who have certain types of opinions about the teaching quality in the school.
- In this option, the sample is formed by volunteers among students who come to the library. There are two reasons why this sampling method is biased. Firstly, this is an example of convenience sampling. Remember that convenience sampling is a method of forming a sample by volunteers. Convenience sampling is often biased because it attracts individuals who already hold certain types of opinions on the variable of study. In this scenario, it is likely to attract students who either hate or love the teaching quality in the school, leaving out the large part of the population who do not hold strong opinions. Another source of bias in this sampling method is due to the fact that the questionnaires are left at a library. A library attracts students who like to study, and these students are more likely to think highly of the teaching quality in the school. In other words, this sampling method favors students who think highly of the teaching quality in the school, which means that this leads to a biased sample.
- In this option, the sample is randomly generated only among the male students. This leaves out the entire female population of the school. It is possible that male students hold a different type of opinion on the teaching quality compared to the female students. Hence, this sampling method favors the opinions that male students are more likely to hold, meaning that this leads to a biased sample.
- In this option, a random sample is chosen from the entire population so that each individual is equally likely to be in the sample. This sampling method does not favor a student holding a certain type of opinion on the teaching quality in the school, which means that this sampling method is unbiased. Hence, this sample is unbiased.
- In this option, only ninth grade students are selected for the sample. This would be a good sample if the variable of study was the teaching quality in the ninth grade classrooms, but not in the entire school. If the ninth grade teaching quality is worse compared to that of the other grades, this sampling method will favor students who hold low opinions of the teaching quality. This means that this sampling method is biased. Hence, this sample is biased.
- In this option, only female students are selected for the sample. Similar to option B, this sampling method is biased. Hence, this sample is biased.
The only unbiased sample is that in option C.
Let us finish by recapping a few important concepts.
Key Points
- A sample is representative of the population if the sample and the population share similar distributions of individuals’ characteristics relevant to the variable of study.
- To form a sample that is representative of the entire population, researchers should ensure that each individual in the population has an equal chance of being selected for the sample. This does not guarantee a representative sample, since one can always end up with a box of bad apples by chance. Hence, it is also important to select a sufficiently large sample to reduce this effect.
- Biased sampling is a method of forming a sample that favors certain values of the variable of study. A sample obtained from a biased sampling method is likely not representative of the entire population.
- Convenience sampling is a method of forming a sample by volunteers. Convenience sampling is a prominent example of biased sampling methods.