Video Transcript
In this video, we will learn how to express data in tables, analyze them using statistical methods, and present them in a graphical format. Science is all about asking questions. As a simple example, one question we can ask and attempt to answer at home or in the laboratory is how much sugar can dissolve in 100 milliliters of water. We could attempt to answer this question by adding sugar of known mass to a measured amount of water in a beaker. When no more sugar can be dissolved in the water, it will begin to gather at the bottom of the beaker. If we do this once, we might get an answer like 95 grams.
However, scientists like to repeat their trials multiple times to ensure that the number that they get is not a fluke. We could carry out eight more trials, giving us a set of useful numbers to help us answer this question. However, we need a way of simplifying these numbers. If we ask the question how much sugar can dissolve in 100 milliliters of water, it’s not a satisfactory answer to say sometimes 95 grams, sometimes 93 grams, sometimes 101 grams, and so on and so forth. We need to find a simpler way of expressing our results.
The title of this video, Handling Data, means summarizing or visualizing the results of an experiment. While our simple experiment here involves only nine trials, other experiments could involve hundreds or even thousands of trials, making the need for a summary extremely important.
In this video, we will look at averages or numbers that summarize a group of numbers. We’ll also look at various graphs and charts that provide a visual representation of sets of numbers. And we’ll also look at equations which can show the relationship between two sets of numbers.
Let’s start with averages. An average is one number that gives the central or typical value for a group of numbers. Instead of sharing all nine numbers here, we can pick an average to represent the group. One average we can use is called the mean. We calculate the mean by taking the sum of the terms and dividing it by the number of the terms, in this case taking the total amount of sugar dissolved and dividing it by the number of trials. The total sugar dissolved across all trials is 819 grams. And there are nine trials in the data set. So our calculation is 819 divided by nine. This gives us a mean of 91 grams. 91 grams is a central number that we’ve calculated that can represent the entire data set in one number.
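The mean calculation just described can be sketched in a few lines of Python. Only the total (819 grams) and the number of trials (nine) come from the video; the individual trial values below are hypothetical, chosen so that they add up to the stated total.

```python
# Hypothetical trial results in grams. Only the total (819 g) and the
# count (9 trials) are stated in the video; the individual values are assumed.
trials = [95, 93, 101, 82, 85, 87, 89, 92, 95]

total = sum(trials)    # total sugar dissolved across all trials
count = len(trials)    # number of trials
mean = total / count   # mean = sum of the terms / number of terms

print(total, count, mean)  # 819 9 91.0
```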
Another type of average we can use is the median. The median is the middle trial or the middle value in the whole set of numbers. We can find the median by arranging the numbers in order. 82 is the lowest value, and 101 is the highest. We can write in the next lowest and next highest value again and again until we are left with one value in the middle. This value, 92 grams, is the median. Half of the numbers in the set are lower than 92, and half of the numbers in the set are higher than 92.
In this data set, there was an odd number of entries, so our middle trial was a single value. It’s worth noting that for data sets with an even number of entries, we will converge on two middle values. How do we calculate the median when there are two middle values, for example, if the middle values are 92 and 93? In this case, the median is 92.5, the number directly in between 92 and 93.
To confirm our answer, or if the answer is too difficult to carry out in our heads, we can add the two numbers together and divide by two to find the number directly in between the two numbers. In this case, 92.5 is the median even though none of the numbers in the set equal 92.5. As a summary, the median gives us the middle value of the set.
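As a rough sketch, both cases of the median (an odd and an even number of entries) can be handled by one small Python function. The function name and the trial values are assumptions made for illustration; the values are picked so that the middle one is 92, as in the video, and the extra tenth trial of 94 grams is invented to produce the two middle values 92 and 93.

```python
def median(values):
    """Return the middle value, or the midpoint of the two middle values."""
    ordered = sorted(values)          # arrange the numbers in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: add the two middle values together and divide by two
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical trial results in grams (assumed, not listed in the video).
trials = [95, 93, 101, 82, 85, 87, 89, 92, 95]

print(median(trials))         # 92
print(median(trials + [94]))  # 92.5
```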
The third type of average, the mode, is the most frequently occurring value in the set. In this data set, the number 95 shows up twice, while all other numbers show up only once. Since it is the most frequently occurring number in the data set, 95 is the mode of this data set. It’s worth noting that if no number appears more frequently than any other number, there is no mode. Also, if multiple numbers are tied for the greatest number of appearances in the set, the data set has multiple modes.
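The three cases for the mode (one mode, no mode, and several modes) can be sketched as follows. The function name `modes` and the sample data are assumptions for illustration; the function follows the rule stated above, that there is no mode when no number appears more often than any other.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of modes; an empty list means there is no mode."""
    if not values:
        return []
    counts = Counter(values)                  # value -> number of appearances
    highest = max(counts.values())
    # If every distinct value appears equally often, none is more frequent.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

# Hypothetical trial results in grams (assumed, not listed in the video).
trials = [95, 93, 101, 82, 85, 87, 89, 92, 95]

print(modes(trials))           # [95]
print(modes([1, 2, 3]))        # []
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```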
Each of these three averages (mean, median, and mode) gives us a number that represents the entire group. They do so in different ways, so the three numbers are often, but not necessarily, different. The mean is the most commonly used of the three. In fact, sometimes people say “find the average” when they mean “find the mean,” although technically mean, median, and mode are all types of averages.
We’ve summarized this data with an average. But we could also visualize it with a chart or table. One such table is called a frequency table. A frequency table looks something like this. The column on the left has ranges of values for the number of grams dissolved in any one trial. The column on the right has counts of the number of trials whose results fall within each range of the number of grams dissolved. This arrangement of the data lets us quickly see that in many of the trials, between 91 and 95 grams of sugar were dissolved. It also lets us see that more trials fell below this range of values than above it.
A frequency table doesn’t necessarily have to use ranges of values. As shown here, the frequency table on the right uses a single gram measurement for each row. The upside of this table is that it more precisely shows each point in the data set. The downsides are that it doesn’t group the trials together in visually useful ways, like the table on the left, and also it takes up more space. We only had room to produce a bit of the table here on the screen.
If we wanted to turn this frequency table into a chart, we would make a histogram. Just as our frequency table has ranges of values in the left-hand column, our histogram has ranges of values, also known as bins, along the 𝑥-axis. And just as the right-hand column of the frequency table gives us the number of trials that fall within each range, the height of the bar above each bin gives us the number of trials whose results fall within that range of values.
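To make the link between a frequency table and a histogram concrete, here is a minimal sketch that counts trials per bin and prints a text bar for each count. The trial values and the bin edges (five-gram ranges starting at 81) are assumptions; the video only shows that one range runs from 91 to 95 grams.

```python
def frequency_table(values, first_low, width, n_bins):
    """Count how many whole-number values fall in each consecutive range."""
    table = []
    for i in range(n_bins):
        low = first_low + i * width
        high = low + width - 1        # ranges are inclusive, e.g. 91-95
        count = sum(1 for v in values if low <= v <= high)
        table.append((low, high, count))
    return table

# Hypothetical trial results in grams (assumed, not listed in the video).
trials = [95, 93, 101, 82, 85, 87, 89, 92, 95]

table = frequency_table(trials, first_low=81, width=5, n_bins=5)
for low, high, count in table:
    # The row of '#' marks plays the role of the histogram bar for this bin.
    print(f"{low:>3}-{high:<3} | {'#' * count} ({count})")
```

With these assumed values, the 91 to 95 bin holds the most trials, and more trials fall below that bin than above it, which matches the pattern described for the table in the video.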
A histogram can make the patterns in the data set even easier to visualize, especially when there are many bins to look at. This data set lends itself to being visualized by a frequency table or histogram. For other data sets, there might be other charts or graphs that are useful. For example, if we carried out a similar experiment, where we tested not just sugar but salt and Epsom salt’s ability to dissolve in water, our results might fit well into a bar chart.
While these two charts look pretty similar, the biggest difference lies in the 𝑥-axis. In a histogram, the 𝑥-axis bins form a continuous range of quantities. One bin ends where the next picks up. So any data point in the data set will fall into one of the bins of the histogram.
A histogram helps us visualize the answer to the question, where are the data points along this continuous range? For a bar chart, the 𝑥-axis is divided into columns, where each column represents a unique category. A bar chart helps us answer the question, how do these categories compare to one another?
By looking at the bar chart below, we can see that sugar is the most soluble substance, followed by Epsom salt, followed by salt. While histograms and bar charts look similar, they do have different purposes.
Another useful chart is a pie chart. In a pie chart, the full pie represents 100 percent of something. The things that make up the pie, in this case water and Epsom salt making up the Epsom salt solution, will have larger or smaller slices corresponding to more of a presence in the solution or less of a presence in the solution. When visually estimating the percentages of a pie chart, it can be useful to remember some key values. Half of the pie is 50 percent, a quarter of the pie is 25 percent, and an eighth of the pie is 12.5 percent. For example, in this pie chart, water takes up a little over half of the pie, or 59 percent. Epsom salt takes up a little under half of the pie, or 41 percent.
The purpose of a pie chart is to show percentages or numbers as parts of a whole. It is useful for showing the percent composition of a material or substance, as well as the demographics of a population or the breakdown of a budget. Here, our pie chart shows percentages, but it could also show numbers. For example, it could tell us that our solution is 100 grams of water and 70 grams of Epsom salt. Bar charts show how categories compare with one another, while pie charts show categories as parts of a whole.
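The two ways of labeling the pie chart, by mass and by percentage, are connected by a simple calculation, sketched here in Python. The masses are the ones mentioned above; the function name is an assumption.

```python
def percent_composition(parts):
    """Convert amounts into percentages of the whole."""
    total = sum(parts.values())
    return {name: 100 * amount / total for name, amount in parts.items()}

# Masses in grams, as given for the Epsom salt solution.
solution = {"water": 100, "Epsom salt": 70}

shares = percent_composition(solution)
for name, pct in shares.items():
    print(f"{name}: {pct:.0f} percent")
# water: 59 percent
# Epsom salt: 41 percent
```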
Sometimes instead of measuring one variable, we want to determine the relationship between two variables. For example, we might want to know how the solubility of sugar in water changes as the temperature of the water changes. To answer this question, we could measure the amount of sugar dissolved at different temperatures. We could then plot our data points on a coordinate graph. For example, if at 20 degrees Celsius one of our trials dissolved 91 grams of sugar, we could plot the point 20 comma 91 on the graph. Eventually, if we plotted all of our data points, it might look something like this. A coordinate graph of data points with two variables like this is called a scatterplot.
We can summarize a scatterplot with a line of best fit. Since all of the data points we have generally form a shape going up and to the right, we can draw a line through the shape that summarizes the data. In this case, since the line of best fit goes up and to the right, we can say that there is a positive correlation between these two variables. A positive correlation means there’s a direct relationship. As the temperature of the water increases, so does the amount of sugar that can be dissolved in it.
To take a look at the opposite relationship, we could investigate the relationship between volume and pressure of a fixed amount of gas in a container. The line that best fits this data set points down and to the right. A line of best fit that moves down and to the right indicates a negative correlation between the two variables. This means that there is an inverse relationship. In this case, as the volume increases, the pressure decreases. And as the volume decreases, the pressure increases.
Sometimes we’ll take a look at two variables that end up not being related at all. For example, the time of day should not have any effect on the solubility of sugar in water. In this case, we say there is no correlation. There is not an obvious line of best fit that summarizes the scatterplot. If the scatterplot looks like a cloud of points that does not point in any one specific direction, the two variables have no relationship.
If we investigate the relationship between two variables, we can make a scatterplot and sort it into one of these three categories. Either there will be a positive correlation, a negative correlation, or no correlation between the two variables. For our scatterplot, the line of best fit is a useful summary. It takes a set of disorganized data and arranges it cleanly in one straight line. When the relationship between two variables can be graphed as a straight line, we call that a linear relationship.
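The video sorts scatterplots into these three categories by eye. One common numerical stand-in, not mentioned in the video, is the Pearson correlation coefficient, which is positive for an upward trend, negative for a downward trend, and near zero for a shapeless cloud. The sketch below uses it under loud assumptions: all three data sets are hypothetical, and the cutoff of 0.3 for "no correlation" is an arbitrary choice for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def describe(r, threshold=0.3):
    """Sort a coefficient into one of three categories (threshold is assumed)."""
    if r > threshold:
        return "positive correlation"
    if r < -threshold:
        return "negative correlation"
    return "no correlation"

# Hypothetical measurements, invented for illustration only.
temperature = [20, 30, 40, 50, 60]       # degrees Celsius
sugar = [91, 97, 105, 112, 122]          # grams dissolved
volume = [1, 2, 3, 4, 5]                 # liters
pressure = [60, 30, 20, 15, 12]          # arbitrary pressure units
hour_of_day = [8, 10, 12, 14, 16]
sugar_by_hour = [91, 93, 90, 93, 91]     # grams dissolved

print(describe(pearson_r(temperature, sugar)))          # positive correlation
print(describe(pearson_r(volume, pressure)))            # negative correlation
print(describe(pearson_r(hour_of_day, sugar_by_hour)))  # no correlation
```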
We’re going to take a closer look at linear relationships and graphed lines to see what information we can glean from them. For a simple example of a linear relationship, let’s imagine that we’re steadily heating up a sample of water. The water begins at 25 degrees Celsius, as represented by the point zero comma 25. After 10 minutes, the water has heated up to 85 degrees Celsius, as represented by the point 10 comma 85. If we connect the two points with a line, we can get a line that represents the linear relationship between the time and the temperature in this situation.
The temperature probably didn’t increase at a perfectly steady rate the whole time. But the line is a reasonable simplification, letting us make useful predictions about the temperature at times when we might not have a measured data point. We could’ve also arrived at the linear relationship by taking more measurements and using a computer or calculator to give us a line of best fit, like we did with the scatterplots earlier.
However we obtain it, the line on the graph lets us predict the temperature at various times along the course of the experiment. But eyeballing the graph to estimate the temperature or time can lead to inaccuracy or imprecision. To ensure consistency, we can describe a linear relationship with a mathematical equation, which will allow us to calculate, instead of visually estimate, the numbers involved. The general form of the equation for a linear relationship is 𝑦 equals 𝑚𝑥 plus 𝑐, where 𝑥 and 𝑦 are the variables involved, in this case time and temperature. The constant 𝑚 represents the slope or gradient of the line.
Another phrase for this is the rate of increase. It shows how much the temperature goes up by each minute. 𝑐 is the 𝑦-intercept. It’s the value of 𝑦 when 𝑥 equals zero or the value where the line hits the 𝑦-axis. Since 𝑐 is the value of 𝑦 when 𝑥 equals zero, we can think of it as the starting value. In this case, it’s the starting temperature or the temperature when the time is zero minutes.
Writing down the equation of a linear relationship is useful because it gives us these pieces of information in a nice condensed form. So what are the slope and the 𝑦-intercept of the line that we’ve drawn here? The 𝑦-intercept is relatively easy to find. At time equals zero, the temperature is 25 degrees. So our 𝑦-intercept is 25. But what is our slope or rate of increase?
If we look at how the temperature changes over the course of the experiment, we can see that, after 10 minutes, there’s a 60-degree increase in the temperature. A formula that we can use to calculate the slope is that the slope equals the rise over the run or the vertical change over the horizontal change. Another way of saying this is Δ𝑦 over Δ𝑥. In our example here, the vertical change or Δ𝑦 is 60 degrees and the horizontal change or Δ𝑥 is 10 minutes.
If we carry out the calculation, we get a slope of six degrees per minute. This value makes sense. If it takes 10 minutes for the temperature to increase by 60 degrees, the temperature was on average increasing six degrees every minute. So we can plug six in for 𝑚 in our equation, giving us a final equation of 𝑦 equals six 𝑥 plus 25. This equation relates 𝑥, the time in minutes, to 𝑦, the temperature of the water, in the experiment we carried out.
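The rise-over-run calculation and the resulting equation can be checked with a short Python sketch. The two points are the ones from the heating example; the function names are assumptions, and the prediction at five minutes is just one illustration of using the equation instead of reading the graph.

```python
def line_through(p1, p2):
    """Return slope m and y-intercept c of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)   # slope = rise over run
    c = y1 - m * x1             # value of y when x equals zero
    return m, c

# (time in minutes, temperature in degrees Celsius) from the heating example.
m, c = line_through((0, 25), (10, 85))

def temperature_at(minutes):
    """Predict the temperature using y = m*x + c."""
    return m * minutes + c

print(m, c)               # 6.0 25.0
print(temperature_at(5))  # 55.0
```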
Much like the averages, charts, tables, and graphs that we looked at earlier in the video, an equation of a linear relationship that represents a line on a graph is another way of visualizing and simplifying a set of data.
Now that we’ve learned about handling data, let’s review some key points of the video. An average is a number that represents the middle or typical value of a set of numbers. The mean is the calculated middle value. It’s the total of all the values divided by the number of values. The median is the number in the middle position of the set. And the mode is the number that appears the most frequently.
Data can be visualized with a bar chart that shows the value of various categories or a histogram which shows the distribution of a variable across a range of numbers. We can also use a pie chart to show the percentages that make up a whole. A scatterplot lets us plot data points with measurements of two variables. Sometimes the relationship between the variables will be a linear relationship, in which case we can represent it with a linear equation. Linear equations take the general form 𝑦 equals 𝑚𝑥 plus 𝑐, where 𝑚 represents the slope and 𝑐 represents the 𝑦-intercept. We can use a linear equation when looking at a scatterplot if we apply a line of best fit that represents the data on the scatterplot.