Outliers

Student Summary

In statistics, an outlier is a data value that is unusual in that it differs quite a bit from the other values in the data set.

Outliers occur in data sets for a variety of reasons including, but not limited to:

  • Errors in the data that result from the data collection or data entry process.
  • Results in the data that represent unusual values that occur in the population.

Outliers can reveal cases worth studying in detail or errors in the data collection process. In general, they should be included in any analysis done with the data.

A value is an outlier if it is

  • More than 1.5 times the interquartile range greater than Q3 (if \(x > \text{Q3} + 1.5 \cdot \text{IQR}\)).
  • More than 1.5 times the interquartile range less than Q1 (if \(x < \text{Q1} - 1.5 \cdot \text{IQR}\)).

In this box plot, the minimum and the maximum are both outliers, so the data set contains at least two outliers.

<p>Box plot from 1 to 25 by 1’s. Whisker from 1 to 9. Box from 9 to 13 with vertical line at 10. Whisker from 13 to 24. Above the box plot, 2 horizontal segments from 3 to 9 and from 13 to 19, each labeled 1.5 dot IQR.</p>
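
The cutoffs marked by the two segments above the box plot can be checked by hand. This is only a worked sketch using the quartiles read from the plot (Q1 = 9 and Q3 = 13):

```latex
\[
\begin{aligned}
\text{IQR} &= 13 - 9 = 4 \\
1.5 \cdot \text{IQR} &= 1.5 \cdot 4 = 6 \\
\text{Q1} - 1.5 \cdot \text{IQR} &= 9 - 6 = 3 \\
\text{Q3} + 1.5 \cdot \text{IQR} &= 13 + 6 = 19
\end{aligned}
\]
```

The minimum, 1, is less than 3, and the maximum, 24, is greater than 19, so each of them meets the definition of an outlier.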

It is important to identify the source of outliers because outliers can affect measures of center and variability in significant ways. The box plot displays the resting heart rate, in beats per minute (bpm), of 50 athletes taken five minutes after a workout.

<p>Box plot from 50 to 120 by 10’s. Heartbeats per minute. Whisker from 55 to 62. Box from 62 to 76 with vertical line at 70. Whisker from 76 to 112. Dotted line, labeled 1.5 times IQR, from 76 to 97.</p>

Some summary statistics include:

  • mean: 69.78 bpm
  • standard deviation: 10.71 bpm
  • minimum: 55 bpm
  • Q1: 62 bpm
  • median: 70 bpm
  • Q3: 76 bpm
  • maximum: 112 bpm

It appears that the maximum value of 112 bpm may be an outlier. Because the interquartile range is 14 bpm (\(76 - 62 = 14\)), \(\text{Q3} + 1.5 \cdot \text{IQR} = 97\), and 112 is greater than 97, we should label the maximum value as an outlier. A search through the actual data set could confirm that this is the only outlier.
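
The same check can be sketched in Python. The function names `outlier_fences` and `is_outlier` are made up for this illustration and are not part of the lesson. The sketch uses only the summary statistics listed above, so it can test the minimum and maximum but not the other 48 values.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs from the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x, q1, q3):
    """True if x lies beyond either cutoff."""
    lower, upper = outlier_fences(q1, q3)
    return x < lower or x > upper


# Summary statistics from the heart rate data, in bpm.
q1, q3 = 62, 76
minimum, maximum = 55, 112

lower, upper = outlier_fences(q1, q3)
print(lower, upper)                 # 41.0 97.0
print(is_outlier(maximum, q1, q3))  # True
print(is_outlier(minimum, q1, q3))  # False
```

The lower cutoff of 41 bpm is below the minimum of 55 bpm, so no value on the low end of this data set is an outlier.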

After reviewing the data collection process, it is discovered that the heart rate measurement of 112 bpm was taken one minute after a workout instead of five minutes after. The outlier should be deleted from the data set because it was not obtained under the right conditions.

Once the outlier is removed, the box plot and summary statistics are:

<p>Box plot from 50 to 120 by 10’s. Heartbeats per minute. Whisker from 55 to 61. Box from 61 to 75.5 with vertical line at 70. Whisker from 75.5 to 85.</p>

  • mean: 68.92 bpm
  • standard deviation: 8.9 bpm
  • minimum: 55 bpm
  • Q1: 61 bpm
  • median: 70 bpm
  • Q3: 75.5 bpm
  • maximum: 85 bpm

The mean decreased by 0.86 bpm and the median remained the same. The standard deviation decreased by 1.81 bpm, which is about 17% of its previous value. Based on the standard deviation, the data set with the outlier removed shows much less variability than the original data set containing the outlier. Because the mean and standard deviation use all of the numerical values, removing one very large data point can affect these statistics in important ways.
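
The arithmetic behind these comparisons can be sketched in Python using only the reported summary statistics. Because the reported mean is rounded to two decimal places, the reconstructed total is approximate. The variable names are chosen for this illustration.

```python
n = 50              # number of athletes in the original data set
mean_with = 69.78   # reported mean with the outlier, in bpm
sd_with = 10.71     # reported standard deviation with the outlier
sd_without = 8.9    # reported standard deviation without the outlier
removed = 112       # the deleted measurement, in bpm

# The mean is the total divided by the count, so the total is mean * count.
total_with = mean_with * n
mean_without = (total_with - removed) / (n - 1)

mean_change = mean_with - mean_without
sd_change = sd_with - sd_without
sd_percent = 100 * sd_change / sd_with

print(round(mean_without, 2))  # 68.92
print(round(mean_change, 2))   # 0.86
print(round(sd_change, 2))     # 1.81
print(round(sd_percent))       # 17
```

Removing one value out of 50 shifts the mean by almost a full beat per minute because that value was far above the rest.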

The median remained the same after the removal of the outlier and the IQR increased slightly (from 14 bpm to 14.5 bpm). These measures of center and variability are much more resistant to change than the mean and standard deviation are. The median and IQR measure the middle of the data based on the number of values rather than the actual numerical values themselves, so the loss of a single value will not often have a great effect on these statistics.
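
This resistance can be seen on a small data set. The sketch below uses ten made-up values (not the athletes' data). It assumes quartiles are found as the medians of the lower and upper halves of the sorted data, and it assumes the sample standard deviation. Other software may use different quartile conventions and give slightly different results.

```python
from statistics import mean, median, stdev


def quartiles(values):
    """Q1, median, Q3 using medians of the lower and upper halves.

    When the count is odd, the middle value is left out of both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    lower_half = s[:half]
    upper_half = s[half + len(s) % 2:]
    return median(lower_half), median(s), median(upper_half)


def summarize(values):
    q1, med, q3 = quartiles(values)
    return {
        "mean": mean(values),
        "sd": stdev(values),  # sample standard deviation (an assumption)
        "median": med,
        "iqr": q3 - q1,
    }


# Hypothetical heart rates in bpm; 112 is the unusually large value.
with_outlier = [55, 58, 61, 64, 70, 70, 74, 76, 85, 112]
without_outlier = [x for x in with_outlier if x != 112]

before = summarize(with_outlier)
after = summarize(without_outlier)

print(before["median"], after["median"])    # 70.0 70
print(before["iqr"], after["iqr"])          # 15 15.5
print(before["mean"], round(after["mean"], 2))  # 72.5 68.11
print(after["sd"] < before["sd"])           # True
```

In this made-up data set, the median does not change and the IQR moves by half a beat per minute, while the mean drops by more than 4 bpm.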

The source of any possible errors should always be investigated. If the measurement of 112 beats per minute was found to have been taken under the right conditions and merely came from an athlete whose heart rate did not slow as much as the other athletes' heart rates, it should not be deleted, so that the data reflect the actual measurements. If the situation cannot be revisited to determine the source of the outlier, it should not be removed. To avoid tampering with the data and to report accurate results, data values should not be deleted unless they can be confirmed to be an error in the data collection or data entry process.

Standards

Building On
HSS-ID.A.1

S-ID.1

Addressing
HSS-ID.A.1

S-ID.1

HSS-ID.A.2

HSS-ID.A.3

S-ID.2

S-ID.3

Building Toward
HSS-ID.A.3

S-ID.3