Unit 1 One Variable Statistics — Unit Plan
| Title | Takeaways | Student Summary | Assessment |
|---|---|---|---|
Lesson 2 Data Representations | — | The table shows a list of the number of minutes people could intensely focus on a task before needing a break. Fifty people of different ages are represented.
There were quite a few people who lost focus at around 3, 7, 13, and 19 minutes, and nobody lost focus at 11, 12, or 15 minutes. Dot plots are useful when the data set is not too large, and they show all of the individual values in the data set. In this example, a dot plot can easily show all of the data. If the data set is very large (more than 100 values, for example), or if there are many different values that are not exactly the same, it may be hard to see all of the dots on a dot plot. A histogram is another representation that shows the shape and distribution of the same data. Most people lost focus between 5 and 10 minutes or between 15 and 20 minutes, while only 4 of the 50 people got distracted between 20 and 25 minutes. When creating histograms, each interval includes the number at the lower end of the interval but not the number at the upper end. For example, the tallest bar displays values that are greater than or equal to 5 minutes but less than 10 minutes. In a histogram, values that are in an interval are grouped together. Although the individual values get lost with the grouping, a histogram can still show the shape of the distribution. Here is a box plot that represents the same data. Box plots are created using a five-number summary. For a set of data, the five-number summary consists of these five statistics: the minimum value, the first quartile, the median, the third quartile, and the maximum value. These values split the data into four sections, each representing approximately one-fourth of the data. The median of this data is indicated at 8 minutes, and about 25% of the data fall in the short second quarter of the data between 6 and 8 minutes. Similarly, approximately one-fourth of the data are between 8 and 17 minutes. Like the histogram, the box plot does not show individual data values, but other features such as quartiles, range, and median are seen more easily.
Dot plots, histograms, and box plots provide three different ways to look at the shape and distribution while highlighting different aspects of the data. | Reasoning about Representations (1 problem) The dot plot, histogram, and box plot represent the distribution of the same data in 3 different ways.
|
Section A Check Section A Checkpoint | |||
Lesson 5 Calculating Measures of Center and Variability | — | The mean absolute deviation, or MAD, and the interquartile range, or IQR, are measures of variability. Measures of variability tell you how much the values in a data set tend to differ from one another. A greater measure of variability means that the data are more spread out, while a smaller measure of variability means that the data are more consistent and are closer to the measure of center. To calculate the MAD of a data set: find the mean of the values in the data set, find the distance between the mean and each value, then find the mean of those distances.
To calculate the IQR, subtract the value of the first quartile from the value of the third quartile. Recall that the first and third quartiles are included in the five-number summary. | Calculating MAD and IQR (1 problem)
mean: 12
|
Section B Check Section B Checkpoint | |||
Lesson 10 The Effect of Extremes | — | Is it better to use the mean or median to describe the center of a data set? The mean gives equal importance to each value when finding the center. The mean usually represents the typical values well when the data have a symmetric distribution. On the other hand, the mean can be greatly affected by changes to even a single value. The median tells you the middle value in the data set, so changes to a single value usually do not affect the median much. So, the median is more appropriate for data that are not very symmetrically distributed. We can look at the distribution of a data set and draw conclusions about the mean and the median. Here is a dot plot showing the amount of time, in seconds, a dart takes to hit a target. The data produce a symmetric distribution. When a distribution is symmetric, the median and mean are both found in the middle of the distribution. Since the median is the middle value (or the mean of the two middle values) of a data set, you can use the symmetry around the center of a symmetric distribution to find it easily. For the mean, you need to know that the sum of the distances away from the mean of the values greater than the mean is equal to the sum of the distances away from the mean of the values less than the mean. Using the symmetry of the distribution, you can see that there are four values 0.1 second above the mean, two values 0.2 seconds above the mean, one value 0.3 seconds above the mean, and one value 0.4 seconds above the mean. Likewise, you can see that there are the same number of values the same distances below the mean. Here is a dot plot using the same data, but with two of the values changed, resulting in a skewed distribution. When you have a skewed distribution, the distribution is not symmetric, so you are not able to use the symmetry to find the median and the mean. The median is still 1.4 seconds since it is still the middle value.
The mean, on the other hand, is now about 1.273 seconds. The mean is less than the median because the lower values (0.3 and 0.4) result in a smaller value for the mean. The median is usually more resistant to extreme values than is the mean. For this reason, the median is the preferred measure of center when a distribution is skewed or if there are extreme values. When using the median, you would also use the IQR as the preferred measure of variability. In a more symmetric distribution, the mean is the preferred measure of center, and the MAD is the preferred measure of variability. | Shape and Statistics (1 problem)
|
Lesson 11 Comparing and Contrasting Data Distributions | — | The mean absolute deviation, or MAD, is a measure of variability that is calculated by finding the mean distance from the mean of all the data points. Here are two dot plots, each with a mean of 15 centimeters, displaying the length of sea scallop shells in centimeters. Notice that both dot plots show a symmetric distribution, so the mean and the MAD are appropriate choices for describing center and variability. The data in the first dot plot appear to be more spread apart than the data in the second dot plot, so you can say that the first data set appears to have greater variability than does the second data set. This is confirmed by the MAD. The MAD of the first data set is 1.18 cm and the MAD of the second data set is approximately 0.94 cm. This means that the values in the first data set are, on average, about 1.18 cm away from the mean, and the values in the second data set are, on average, about 0.94 cm away from the mean. The greater the MAD of the data, the greater the variability of the data. The interquartile range, IQR, is a measure of variability that is calculated by subtracting the value for the first quartile, Q1, from the value for the third quartile, Q3. These two box plots represent the distributions of the lengths in centimeters of a different group of sea scallop shells, each with a median of 15 centimeters. Notice that neither of the box plots shows a symmetric distribution. The median and the IQR are appropriate choices for describing center and variability for these data sets. The middle half of the data displayed in the first box plot appear to be more spread apart, or show greater variability, than the middle half of the data displayed in the second box plot. The IQR of the first distribution is 14 cm, and the IQR is 10 cm for the second data set.
The IQR measures the difference between the median of the second half of the data, Q3, and the median of the first half, Q1, of the data, so it is not affected by the minimum or the maximum value in the data set. It is a measure of the spread of the middle 50% of the data. The MAD is calculated using every value in the data set, and the IQR is calculated using only the values for Q1 and Q3. | Which Menu? (1 problem) A restaurant owner believes that it is beneficial to have different menu items with a lot of variability so that people can have a choice of expensive and inexpensive food. Several chefs offer menus and suggested prices for the food they create. The owner creates dot plots for the prices of the menu items and finds some summary statistics. Which menu best matches what the restaurant is looking for? Explain your reasoning. Italian: mean $9.03, median $9, MAD $2.45, IQR $3.50. Diner: mean $3.36, median $2, MAD $2.12, IQR $4. Japanese: mean $10.35, median $10, MAD $5.55, IQR $9.50. Steakhouse: mean $11.51, median $10.50, MAD $3.69, IQR $4.50. Solution: Japanese. The variability, whether measured with IQR or MAD, is greater than that of any of the other menus available. |
Lesson 12 Standard Deviation | — | We can describe the variability of a distribution using the standard deviation. The standard deviation is a measure of variability that is calculated using a method that is similar to the one used to calculate the MAD, or mean absolute deviation. A deeper understanding of the importance of standard deviation as a measure of variability will come with a deeper study of statistics. For now, know that the standard deviation is mathematically important and will be used as the appropriate measure of variability when the mean is an appropriate measure of center. Like the MAD, the standard deviation is large when the data set is more spread out, and the standard deviation is small when the variability is small. The intuition you gained about MAD will also work for the standard deviation. | True or False: Reasoning with Standard Deviation (1 problem) The low temperature in degrees Celsius for some cities on the same days in March are recorded in the dot plots.
Decide if each statement is true or false. Explain your reasoning.
|
Lesson 14 Outliers | — | In statistics, an outlier is a data value that is unusual in that it differs quite a bit from the other values in the data set. Outliers occur in data sets for a variety of reasons including, but not limited to:
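Outliers can reveal cases worth studying in detail or errors in the data collection process. In general, they should be included in any analysis done with the data. A value is an outlier if it is more than 1.5 times the interquartile range greater than Q3 or more than 1.5 times the interquartile range less than Q1.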
In this box plot, the minimum and maximum values are both outliers, so the data set has at least two outliers.
It is important to identify the source of outliers because outliers can affect measures of center and variability in significant ways. The box plot displays the resting heart rate, in beats per minute (bpm), of 50 athletes taken five minutes after a workout. Some summary statistics include:
It appears that the maximum value of 112 bpm may be an outlier. Because the interquartile range is 14 bpm (Q3 - Q1) and 112 bpm is more than 1.5 × 14 = 21 bpm above Q3, we should label the maximum value as an outlier. Searching through the actual data set, it could be confirmed that this is the only outlier. After reviewing the data collection process, it is discovered that the heart rate measurement of 112 bpm was taken one minute after a workout instead of five minutes after. The outlier should be deleted from the data set because it was not obtained under the right conditions. Once the outlier is removed, the box plot and summary statistics are:
The mean decreased by 0.86 bpm and the median remained the same. The standard deviation decreased by 1.81 bpm, which is about 17% of its previous value. Based on the standard deviation, the data set with the outlier removed shows much less variability than the original data set containing the outlier. Because the mean and standard deviation use all of the numerical values, removing one very large data point can affect these statistics in important ways. The median remained the same after the removal of the outlier and the IQR increased slightly. These measures of center and variability are much more resistant to change than the mean and standard deviation are. The median and IQR measure the middle of the data based on the number of values rather than the actual numerical values themselves, so the loss of a single value will not often have a great effect on these statistics. The source of any possible errors should always be investigated. If the measurement of 112 beats per minute was found to be taken under the right conditions and merely came from an athlete whose heart rate did not slow as much as the other athletes' heart rates, it should not be deleted so that the data reflect the actual measurements. If the situation cannot be revisited to determine the source of the outlier, it should not be removed. To avoid tampering with the data and to report accurate results, data values should not be deleted unless they can be confirmed to be an error in the data collection or data entry process. | Expecting Outliers (1 problem) A group of 20 students is asked to report the number of pets they keep in their house. The results are: 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 21
|
Lesson 15 Comparing Data Sets | — | To compare data sets, it is helpful to look at the measures of center and measures of variability. The shape of the distribution can help choose the most useful measure of center and measure of variability. When distributions are symmetric or approximately symmetric, the mean is the preferred measure of center and should be paired with the standard deviation as the preferred measure of variability. When distributions are skewed or when outliers are present, the median is usually a better measure of center and should be paired with the interquartile range (IQR) as the preferred measure of variability. Once the appropriate measure of center and measure of variability are selected, these measures can be compared for data sets with similar shapes. For example, let’s compare the number of seconds it takes football players to complete a 40-yard dash at two different positions. First, we can look at a dot plot of the data to see that the tight-end times do not seem distributed symmetrically, so we should probably find the median and IQR for both sets of data to compare information. The median and IQR could be computed from the values, but can also be determined from a box plot.
This shows that the tight-end times have a greater median (about 4.9 seconds) compared to the median of wide-receiver times (about 4.5 seconds). The IQR is also greater for the tight-end times (about 0.5 seconds) compared to the IQR for the wide-receiver times (about 0.25 seconds). This means that the tight ends tend to be slower in the 40-yard dash when compared to the wide receivers. The tight ends also have greater variability in their times. Together, this can be taken to mean that, in general, a typical wide receiver is faster than a typical tight end is, and the wide receivers tend to have more similar times to one another than the tight ends do to one another. | Comparing Mascots (1 problem) A new pet food company wants to sell their product online and use social media to promote themselves. To determine whether to use a dog or a cat as their mascot, they research the number of clicks on links with an image of a dog or a cat. mean: 1,263.5 clicks median: 1,282 clicks standard deviation: 357.4 clicks IQR: 409 clicks mean: 1,105.4 clicks median: 1,125.5 clicks standard deviation: 239.3 clicks IQR: 312.5 clicks
|
Section D Check Section D Checkpoint | |||
Unit 1 Assessment End-of-Unit Assessment | |||
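
The measures described in Lessons 5, 11, and 14 can be sketched in a few lines of Python. This is a minimal illustration, not part of the unit materials. The function names are hypothetical, the quartiles are taken as the medians of the lower and upper halves of the sorted data (as described in Lesson 11), and the outlier check assumes the common convention of flagging values more than 1.5 times the IQR beyond Q1 or Q3. The sample data are the pet counts from the Lesson 14 assessment problem.

```python
from statistics import mean, median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the lower and upper halves of the
    sorted data; the middle value is left out of both halves when the
    number of values is odd. Needs at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


def mean_absolute_deviation(values):
    """Mean of the distances between each value and the mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


def interquartile_range(values):
    """Q3 minus Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1


def find_outliers(values, factor=1.5):
    """Values more than factor * IQR below Q1 or above Q3.

    The default factor of 1.5 is an assumed convention.
    """
    _, q1, _, q3, _ = five_number_summary(values)
    spread = factor * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]


# Pet counts reported by the 20 students in the Lesson 14 problem.
pets = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 21]

print("five-number summary:", five_number_summary(pets))
print("MAD:", mean_absolute_deviation(pets))
print("IQR:", interquartile_range(pets))
print("outliers:", find_outliers(pets))
```

For the pet data, Q1 is 0.5 and Q3 is 2.5, so the IQR is 2 and any value above 2.5 + 3 = 5.5 is flagged, which singles out the report of 21 pets. Software packages may compute quartiles with other conventions, so their results can differ slightly from this sketch.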