
 Appendix: The Anatomy of a Box Plot

Box plots are a graphical statistical method for displaying the distribution of a data set across its range of values. They also help identify whether any outliers exist in the data set.

Figure 1 shows a box plot of the population of U.S. states and the District of Columbia. The data is based on an estimate of 2019 populations from the United States Census Bureau.

The least populated state, Wyoming, is represented by the first data point on the left of the plotted set of points. California, the most populated state, is the right-most data point. All other state populations fall in between and are represented by gray data points, except for several at the right-hand end of the chart.

Figure 1. Anatomy of a box plot.

The term “box plot” refers to the box in the chart, which encloses approximately half of the data points plotted along a horizontal or vertical axis. The lower boundary (or left boundary in a horizontal orientation) is called the lower quartile. This boundary line is located at approximately the 25th percentile of the data values. The upper boundary (or right boundary in a horizontal orientation) is called the upper quartile and is placed at approximately the 75th percentile of the data values.

The distance between the lower and upper boundaries is referred to as the interquartile range. A third line or point is drawn inside the box at the median value of the data set.
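As a rough illustration of the quartiles, median, and interquartile range described above, here is a minimal Python sketch. The sample values are made up for illustration and are not the census data, and the helper name box_summary is hypothetical. It assumes the "inclusive" interpolation method of the standard library's statistics.quantiles; plotting applications may use a different quartile convention, which is one reason the quartiles are described as approximate.

```python
import statistics

def box_summary(values):
    """Return the lower quartile, median, upper quartile, and interquartile range."""
    # n=4 splits the data into four groups and returns the three cut points.
    # method="inclusive" is an assumed convention; other tools may differ slightly.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}

# Hypothetical sample data, not the state populations from Figure 1.
sample = [2, 4, 5, 7, 8, 9, 11, 12, 14, 40]
summary = box_summary(sample)
print(summary)  # {'q1': 5.5, 'median': 8.5, 'q3': 11.75, 'iqr': 6.25}
```

Here the box would span from 5.5 to 11.75, with the median line drawn at 8.5.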

Whiskers often extend from the box. The location of the whiskers can vary depending on the plotting application. The whisker endpoints are also called the lower and upper adjacent values, respectively. If the interquartile range is represented as r, in most instances the whiskers extend to:

  • the largest data point that is less than or equal to the value of the upper quartile plus 1.5r; and

  • the smallest data point that is greater than or equal to the value of the lower quartile minus 1.5r.
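The two whisker rules above can be sketched in a few lines of Python. This is only one possible reading: the helper name adjacent_values and the sample data are hypothetical, and the quartiles come from statistics.quantiles with the "inclusive" method, which is an assumption rather than the convention of any particular plotting application.

```python
import statistics

def adjacent_values(values):
    """Return the (lower, upper) whisker endpoints under the 1.5r rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    r = q3 - q1  # interquartile range
    lower_limit = q1 - 1.5 * r
    upper_limit = q3 + 1.5 * r
    # Whiskers stop at actual data points, not at the computed limits.
    lower_adjacent = min(v for v in values if v >= lower_limit)
    upper_adjacent = max(v for v in values if v <= upper_limit)
    return lower_adjacent, upper_adjacent

# Hypothetical sample data, not the state populations from Figure 1.
sample = [2, 4, 5, 7, 8, 9, 11, 12, 14, 40]
low, high = adjacent_values(sample)
print(low, high)  # 2 14
```

With this sample, the upper limit is 11.75 + 1.5 × 6.25 = 21.125, so the upper whisker ends at 14, the largest data point that does not exceed it.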

Sometimes data points extend past the respective whisker boundaries. When this occurs, these data points are often considered outliers.
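To make the outlier rule concrete, the following minimal Python sketch flags every point that falls more than 1.5 times the interquartile range outside the quartiles, which is the same as lying beyond the whisker endpoints. The function name find_outliers and the sample values are hypothetical, and the "inclusive" quartile method is an assumed convention.

```python
import statistics

def find_outliers(values):
    """Return the data points that lie beyond the whisker limits."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    r = q3 - q1  # interquartile range
    lower_limit = q1 - 1.5 * r
    upper_limit = q3 + 1.5 * r
    return [v for v in values if v < lower_limit or v > upper_limit]

# Hypothetical sample data, not the state populations from Figure 1.
sample = [2, 4, 5, 7, 8, 9, 11, 12, 14, 40]
outliers = find_outliers(sample)
print(outliers)  # [40]
```

Whether a flagged point is a genuine anomaly or simply a large legitimate value, as with the most populous states, remains a judgment call.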

***

NOTE:
In most box plots, the data points are not plotted except those located beyond the whiskers or adjacent values. These are the outlier values. I included the data points in this example for several reasons:

  • I think, for this discussion, they help those new to box plots understand where the data points for all 50 states and the District of Columbia lie within the sections of the diagram; and

  • they are easy to include with the Datagraph for macOS application I use. I think they add a little more clarity to the data spread. If they are easily added, don’t create excessive clutter, help novices understand the chart, and don’t distract from its overall function, why not?

