Unit 3 Two Variable Statistics — Unit Plan

Lesson 2
Relative Frequency Tables

Converting two-way tables to relative frequency tables can help reveal patterns in paired categorical variables. Relative frequency tables are created by dividing the value in each cell in a two-way table by the total number of responses in the entire table, or the total responses in a row or a column. Depending on what patterns are important, different types of relative frequency tables are used. To examine how individual combinations of the categorical variables relate to the whole group, divide each value in a two-way table by the total number of responses in the entire table to find the relative frequency.

For example, this two-way table displays the condition of a certain textbook and its price for 120 of the books at a college bookstore. 

| | $10 or less | more than $10 but less than $30 | $30 or more |
|---|---|---|---|
| new | 3 | 9 | 27 |
| used | 33 | 36 | 12 |

A two-way relative frequency table is created by dividing each number in the two-way table by 120 because there are 120 values (3 + 9 + 27 + 33 + 36 + 12 = 120) in this data set. The resulting two-way relative frequency table can be represented using fractions, decimals, or percents.

| | $10 or less | more than $10 but less than $30 | $30 or more |
|---|---|---|---|
| new | 0.025 | 0.075 | 0.225 |
| used | 0.275 | 0.300 | 0.100 |

This two-way relative frequency table allows you to see what proportion of the total is represented by each number in the two-way table. The number 33 in the original two-way table represents the number of used books that also sell for $10 or less, which is 27.5% of all the books in the data set. Using this two-way relative frequency table, we can see that there are very few (2.5%) new books that are also inexpensive and that 10% of the books in the bookstore are both expensive and in used condition.
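
The division by the grand total can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the variable names (`counts`, `total`, `relative`) are chosen here and are not part of the lesson.

```python
# Counts from the bookstore two-way table.
# Each list is ordered: [$10 or less, more than $10 but less than $30, $30 or more]
counts = {
    "new": [3, 9, 27],
    "used": [33, 36, 12],
}

# Total number of responses in the entire table.
total = sum(sum(row) for row in counts.values())

# Divide every cell by the grand total to get the relative frequencies.
relative = {
    condition: [n / total for n in row]
    for condition, row in counts.items()
}

for condition, row in relative.items():
    print(condition, [round(p, 3) for p in row])
```

Because every cell is divided by the same total, the six relative frequencies add up to 1.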

In other situations, it makes sense to examine row or column proportions in a relative frequency table. For example, to convert the original two-way table to a column relative frequency table using column proportions, divide each value by the sum of the column. 

| | $10 or less | more than $10 but less than $30 | $30 or more |
|---|---|---|---|
| new | 0.083 | 0.2 | 0.692 |
| used | 0.917 | 0.8 | 0.308 |

This shows that about 91.7% (\(\frac{33}{3+33} \approx 0.917\)) of the books that are sold for $10 or less are in used condition. Notice that each column of this column relative frequency table reveals the proportions of the books in each price category that are in each condition and that the relative frequencies in each column sum to 1. In particular, this shows that most of the inexpensive and moderately priced books are used, and most of the expensive books are new.
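
As a rough sketch of the column version, the same bookstore counts can be divided by column sums instead of the grand total. The names `column_totals` and `column_relative` are illustrative choices, not terms from the lesson.

```python
# Counts from the bookstore two-way table.
# Each list is ordered: [$10 or less, more than $10 but less than $30, $30 or more]
counts = {
    "new": [3, 9, 27],
    "used": [33, 36, 12],
}

# Sum each column (price category) across both conditions.
column_totals = [sum(column) for column in zip(*counts.values())]

# Divide each cell by the total of its own column.
column_relative = {
    condition: [n / t for n, t in zip(row, column_totals)]
    for condition, row in counts.items()
}

for condition, row in column_relative.items():
    print(condition, [round(p, 3) for p in row])
```

Each column of the result sums to 1, which is a quick way to check that the division used column totals rather than row totals.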

Writing Choices (1 problem)

Eighty students are asked if they prefer manual or electric pencil sharpeners and if they prefer mechanical or wood pencils.

| | mechanical pencils | wood pencils |
|---|---|---|
| manual sharpeners | 5 | 10 |
| electric sharpeners | 34 | 31 |

  1. Complete the relative frequency table with the correct proportions so that it could be used to answer the following question: “Among students who like manual pencil sharpeners, what proportion also prefer mechanical pencils?” 
     

    | | mechanical pencils | wood pencils |
    |---|---|---|
    | manual sharpeners | | |
    | electric sharpeners | | |

  2. Use the table to determine the percentage of people who prefer electric sharpeners and wood pencils.
Show Solution
  1. Row relative frequency table:

     | | mechanical pencils | wood pencils |
     |---|---|---|
     | manual sharpeners | 0.33 | 0.67 |
     | electric sharpeners | 0.52 | 0.48 |

  2. 48%
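
The question conditions on sharpener preference, which is the row variable here, so the solution divides by row totals. A minimal sketch of that calculation, with variable names chosen only for illustration:

```python
# Counts from the pencil survey.
# Each list is ordered: [mechanical pencils, wood pencils]
counts = {
    "manual sharpeners": [5, 10],
    "electric sharpeners": [34, 31],
}

# Divide each cell by the total of its own row.
row_relative = {
    sharpener: [n / sum(row) for n in row]
    for sharpener, row in counts.items()
}

for sharpener, row in row_relative.items():
    print(sharpener, [round(p, 2) for p in row])
```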

Lesson 3
Associations in Categorical Data

An association between two variables means that the two variables are statistically related to each other. For example, we might expect that ice cream sales would be higher on sunny days than on snowy days. If sales were higher on sunny days than on snowy days, then we would say that there is a possible association between ice cream sales and whether it is sunny or snowing. When dealing with categorical variables, row or column relative frequency tables are often used to look for associations in the data.

Here is a two-way table displaying ice cream cone sales and weather conditions for 41 days for a particular creamery. 

| | sunny day | snowy day | total |
|---|---|---|---|
| sold fewer than 50 cones | 8 | 7 | 15 |
| sold 50 cones or more | 22 | 4 | 26 |
| total | 30 | 11 | 41 |

Noticing a pattern in the raw data can be difficult, especially when the row or column totals are not the same for different categories, so the data should be converted into a row or column relative frequency table to better compare the categories. For the creamery, notice that the number of days with low sales is about the same for the two weather types, which contradicts our intuition. In this case, it makes sense to look at the percentage of days that sold well under each weather condition separately. That is, consider the column relative frequencies. 

| | sunny day | snowy day |
|---|---|---|
| sold fewer than 50 cones | 27% | 64% |
| sold 50 cones or more | 73% | 36% |
| total | 100% | 100% |

From the column relative frequency table, it is clear that most of the sunny days resulted in sales of at least 50 cones (73%), while most of the snowy days resulted in fewer than 50 cones sold (64%). Because these percentages are quite different, this suggests there is an association between the weather condition and the number of cone sales. A bakery might wonder if the weather conditions impact their muffin sales as well.

| | sunny day | snowy day |
|---|---|---|
| sold fewer than 50 muffins | 32% | 35% |
| sold 50 muffins or more | 68% | 65% |
| total | 100% | 100% |

For the bakery, there does not seem to be an association between weather conditions and muffin sales, since the percentages of days with low sales are very similar under the different weather conditions, and the percentages are also close for days when many muffins were sold.

Using row or column relative frequency tables helps organize data so that columns (or rows) can be easily compared between different categories for a variable. This comparison can be accomplished using a two-way table, but the differences in the number of data values in a given category must be accounted for.
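
One way to make the comparison concrete is to compute the column percentages for the creamery and then look at how far apart the sunny-day and snowy-day values are. The sketch below is illustrative only: the helper names `column_percentages` and `gap` are invented here, the bakery percentages are copied from the table above because its raw counts are not given, and the lesson does not state any cutoff for how large a gap must be to suggest an association.

```python
def column_percentages(table):
    """Convert counts (row label -> list of counts, one per column) to column percentages."""
    column_totals = [sum(column) for column in zip(*table.values())]
    return {
        label: [100 * n / t for n, t in zip(row, column_totals)]
        for label, row in table.items()
    }


def gap(percentages, label):
    """Absolute difference, in percentage points, between the sunny and snowy columns."""
    sunny, snowy = percentages[label]
    return abs(sunny - snowy)


# Creamery counts, each list ordered: [sunny day, snowy day]
creamery = {
    "sold fewer than 50 cones": [8, 7],
    "sold 50 cones or more": [22, 4],
}
creamery_pct = column_percentages(creamery)

# Bakery percentages as given in the text (no raw counts are provided).
bakery_pct = {
    "sold fewer than 50 muffins": [32, 35],
    "sold 50 muffins or more": [68, 65],
}

creamery_gap = gap(creamery_pct, "sold 50 cones or more")
bakery_gap = gap(bakery_pct, "sold 50 muffins or more")

print(f"creamery gap: {creamery_gap:.0f} percentage points")
print(f"bakery gap: {bakery_gap} percentage points")
```

The large gap for the creamery and the small gap for the bakery match the conclusions drawn from the two tables.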

Graduate Debt (1 problem)

The table summarizes data about the median debt for a sample of students graduating from universities in California and New York. 

| | median debt less than $9,000 | median debt at least $9,000 | total |
|---|---|---|---|
| California universities | 130 | 445 | 575 |
| New York universities | 72 | 271 | 343 |
| total | 202 | 716 | 918 |

Is there an association between the state and the amount of median debt for graduates? Explain your reasoning.

Show Solution

Sample response: There is not enough evidence to support a claim of association between state universities and median debt. Of California universities, 77% (\(\frac{445}{575} \approx 0.77\)) have students who graduate with a median debt of at least $9,000, which is very similar to the 79% (\(\frac{271}{343} \approx 0.79\)) of New York universities that also have a large debt.

Section A Check
Section A Checkpoint
Lesson 4
Linear Models

While working in math class, it can be easy to forget that reality is somewhat messy. Not all oranges weigh exactly the same amount, beans have different lengths, and even the same person running a race multiple times will probably have different finishing times. We can approximate these messy situations with more precise mathematical tools to better understand what is happening. We can also predict or estimate additional results as long as we continue to keep in mind that reality will vary a little bit from what our mathematical model predicts.

For example, the data in this scatter plot represents the price of a package of broccoli and its weight. The data can be modeled by a line given by the equation \(y = 0.46x + 0.92\). The data does not all fall on the line because there may be factors other than weight that go into the price, such as the quality of the broccoli, the region where the package is sold, and any discounts happening in the store.

Figure: A scatter plot with a line of best fit, \(y = 0.46x + 0.92\). The horizontal axis, labeled weight in pounds, runs from 0 to 3 by 0.5s. The vertical axis, labeled price in dollars, runs from 0 to 2.5 by 0.25s. 12 dots trend upward and to the right. The line of best fit crosses the \(y\)-axis at (0, 0.92), trends upward and to the right, and passes through three dots.

We can interpret the \(y\)-intercept of the line as the price for the package without any broccoli (which might include the cost of things like preparing the package and shipping costs for getting the vegetable to the store). In many situations, the data may not follow the same linear model farther away from the given data, especially as one variable gets close to zero. For this reason, the interpretation of the \(y\)-intercept should always be considered in context to determine if it is reasonable to make sense of the value in that way.

We can interpret the slope as the approximate increase in price of the package for the addition of 1 pound of broccoli to the package.

The equation also allows us to predict prices of packages of broccoli that have weights near the weights observed in the data set. For example, even though the data does not include the price of a package that contains 1.7 pounds of broccoli, we can predict the price to be about $1.70 based on the equation of the line, since \(0.46 \cdot 1.7 + 0.92 \approx 1.70\).

On the other hand, it does not make sense to predict the price of 1,000 pounds of broccoli with this data because there may be many more factors that influence the pricing of packages that far away from the data presented here.
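
The prediction and the slope interpretation can be checked with a short sketch. The function name `predicted_price` and the constants are illustrative. The code only evaluates the line given in the text and makes no claim about how that line was fitted.

```python
SLOPE = 0.46      # dollars per additional pound of broccoli
INTERCEPT = 0.92  # dollars, the model's price for a package with 0 pounds


def predicted_price(weight_lb):
    """Evaluate the linear model y = 0.46x + 0.92 at a given weight in pounds."""
    return SLOPE * weight_lb + INTERCEPT


# Prediction for a 1.7-pound package, a weight near the observed data.
price_1_7 = predicted_price(1.7)

# Adding 1 pound changes the predicted price by the slope.
increase_per_pound = predicted_price(2.7) - predicted_price(1.7)

print(f"predicted price for 1.7 lb: ${price_1_7:.2f}")
print(f"increase for 1 more pound: ${increase_per_pound:.2f}")
```

Nothing in the function prevents evaluating it at 1,000 pounds. Deciding whether an input is close enough to the observed data is a judgment the model itself cannot make.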

Roar of the Crowd (1 problem)

The scatter plot shows the maximum noise level when different numbers of people are in a stadium. The linear model is given by the equation \(y = 1.5x + 22.7\), where \(y\) represents maximum noise level and \(x\) represents the number of people, in thousands, in the stadium.

Figure: Scatter plot with the linear model \(y = 1.5x + 22.7\). The horizontal axis, labeled number of people (thousands), runs from 60 to 80 by 5s. The vertical axis, labeled maximum noise level (decibels), runs from 105 to 140 by 5s. 12 dots, with a straight line trending upward and to the right.

  1. The slope of the linear model is 1.5. What does this mean in terms of the maximum noise level and the number of people?
  2. A sports announcer states that there are 65,000 fans in the stadium. Estimate the maximum noise level. Is this estimate reasonable? Explain your reasoning.
  3. What is the \(y\)-intercept of the linear model given? What does it mean in the context of the problem? Is this reasonable? Explain your reasoning.
Show Solution
  1. Sample response: For every additional thousand people in the stadium, the noise level increases by about 1.5 decibels.
  2. 120.2 decibels. Sample reasoning: It is a reasonable value since the data seem to fit a linear model well.
  3. The \(y\)-intercept is \((0, 22.7)\), which means a stadium with no people in it will have a maximum noise level of 22.7 decibels. Sample reasonings:
    • This is actually reasonable since a whisper is about 20 decibels.
    • This is not reasonable since it should be silent with no people in the stadium.
    • This is not reasonable because the point is so far from the data that it is unlikely that the linear model will be accurate.
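
The contrast between the estimate at 65,000 fans and the value at the \(y\)-intercept can be sketched as follows. As an assumption, the horizontal axis range of the plot (60 to 80 thousand people) stands in for the range of the observed data, since the actual data values are not listed. The helper `is_near_data` is hypothetical and exists only for this illustration.

```python
def max_noise_level(people_thousands):
    """Evaluate the linear model y = 1.5x + 22.7 (decibels), x in thousands of people."""
    return 1.5 * people_thousands + 22.7


# Assumption: the plot's horizontal axis range approximates the observed data range.
AXIS_MIN, AXIS_MAX = 60, 80


def is_near_data(people_thousands):
    """Return True when the input lies inside the plotted range of crowd sizes."""
    return AXIS_MIN <= people_thousands <= AXIS_MAX


for x in (65, 0):
    location = "within" if is_near_data(x) else "far outside"
    print(f"x = {x}: model gives {max_noise_level(x):.1f} dB ({location} the plotted range)")
```

The input 65 falls inside the plotted range, while 0 is far from it, which is the concern raised in the last sample reasoning.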
Section B Check
Section B Checkpoint
Lesson 7
The Correlation Coefficient

While residuals can help pick the best line to fit the data among all lines, we still need a way to determine the strength of a linear relationship. Scatter plots of data that are close to the best-fit line are better modeled by the line than are scatter plots of data that are farther from the line.

The correlation coefficient is a convenient number that can be used to describe the strength and direction of a linear relationship. Usually represented by the letter \(r\), the correlation coefficient can take values from -1 to 1. The sign of the correlation coefficient is the same as the sign of the slope for the best-fit line. The closer the correlation coefficient is to 0, the weaker the linear relationship. The closer the correlation coefficient is to 1 or -1, the better a linear model fits the data.

Figure: Seven scatter plots, each with origin O and a horizontal axis labeled with its correlation coefficient:

  • r = -1. The data has a linear model with a negative slope.
  • r = -0.7. The data is slightly scattered and trends downward with a negative slope.
  • r = -0.4. The data is a scattered cloud that trends slightly downward.
  • r = 0.02. The data is a scattered cloud with no visible trend.
  • r = 0.3. The data is a scattered cloud that trends slightly upward.
  • r = 0.8. The data is slightly scattered and trends upward with a positive slope.
  • r = 1. The data has a linear model with a positive slope.

While it is possible to try to fit a linear model to any data, we should always look at the scatter plot to see if there is a possible linear trend. The correlation coefficient and residuals can also help determine whether the linear model makes sense to use to estimate the situation. In some cases, another type of function might be a better fit for the data, or the two variables we are examining may be uncorrelated, and we should look for connections using other variables.
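
The lesson describes how to read \(r\) but does not give a formula for it. As a sketch only, the code below uses the standard Pearson formula and applies it to small hypothetical data sets invented for this illustration (they do not come from the lesson), to show that the sign follows the direction of the trend and that points lying exactly on a line give 1 or -1.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r for paired lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, made up purely for illustration.
xs = [1, 2, 3, 4, 5]
on_rising_line = [2 * x + 1 for x in xs]   # exactly linear, positive slope
on_falling_line = [10 - 3 * x for x in xs] # exactly linear, negative slope
scattered_rising = [3.1, 4.8, 7.4, 8.9, 11.2]  # roughly linear, upward trend

r_rising_line = correlation_coefficient(xs, on_rising_line)
r_falling_line = correlation_coefficient(xs, on_falling_line)
r_scattered = correlation_coefficient(xs, scattered_rising)

print(round(r_rising_line, 3), round(r_falling_line, 3), round(r_scattered, 3))
```

In practice a calculator or statistics software would normally compute \(r\). The hand-written function is here only to show what goes into the number.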

What Is a Correlation Coefficient? (1 problem)
  1. What information does a correlation coefficient tell us about the data in a scatter plot?
  2. Which value best estimates the value for the correlation coefficient of the scatter plot:
    -1, -0.8, -0.2, 0.2, 0.8, or 1? Explain your reasoning.

Figure: Scatter plot in the xy-plane with origin O. The horizontal axis runs from 0 to 14 by 2s, and the vertical axis runs from 0 to 32 by 4s. The best-fit line is described as running from approximately (4, 30) to near (0.5, 14). The data is slightly scattered and trends downward with a negative slope.

Show Solution

Sample response:

  1. The sign of the correlation coefficient matches the sign of the slope of the best-fit line. The closer the correlation coefficient value is to 0, the worse the fit of the best-fit line. The closer the correlation coefficient is to 1 or -1, the better the best-fit line fits the data.
  2. -0.8, since the data appears to be decreasing and a line is an okay fit for the data, but not a perfect one.
Section C Check
Section C Checkpoint
Unit 3 Assessment
End-of-Unit Assessment