Box plots

Figure 1 below shows the distribution of the normalized frequencies for each of the eight swear/taboo word forms as box plots, visualizing the variation in swear/taboo word usage between different episodes/series.1 Frequencies were normalized per 10,000 words per episode – for instance, a raw frequency of three in an episode that contains 3,110 words corresponds to a normalized frequency of 9.6.
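The normalization can be reproduced with a few lines of Python. This is only a minimal sketch of the arithmetic described above; the function name normalize_per_10k is hypothetical and is not part of the original analysis, which was carried out in Excel and SPSS.

```python
def normalize_per_10k(raw_count: int, episode_words: int) -> float:
    """Return the number of occurrences per 10,000 words in an episode.

    Hypothetical helper for illustration only.
    """
    if episode_words <= 0:
        raise ValueError("episode_words must be positive")
    return raw_count / episode_words * 10_000


# The worked example from the text: three occurrences in 3,110 words.
print(round(normalize_per_10k(3, 3110), 1))  # 9.6
```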

The X-axis shows the word form, while the Y-axis displays the normalized frequency values of the word forms in each episode. The rectangle or box represents the interquartile range (IQR), while the thick line inside the rectangle shows the median value of the data.2 The horizontal lines outside, but linked to the rectangle, mark the highest value (line above rectangle) and the lowest value (line below rectangle) that lie no more than 1.5 times the interquartile range from the edge of the rectangle. Circles above the rectangle indicate outliers, while stars show extremes.3
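To make these elements concrete, the sketch below computes the box and the ends of the two lines for a small set of values. The values are invented for illustration and are not taken from the dataset; the function name box_summary is hypothetical; and the quartiles are computed with Python's statistics.quantiles (method "inclusive"), which is an assumption and may give slightly different box edges from those SPSS draws.

```python
from statistics import quantiles


def box_summary(values):
    """Return the quantities a box plot draws (hypothetical helper).

    Quartiles use linear interpolation ('inclusive' method); this is an
    assumption and may differ from the method used by SPSS.
    """
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # The lines end at the most extreme observed values that still lie
    # within 1.5 x IQR of the edge of the box.
    whisker_low = min(v for v in data if v >= q1 - 1.5 * iqr)
    whisker_high = max(v for v in data if v <= q3 + 1.5 * iqr)
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
    }


# Invented normalized frequencies for seven episodes.
example = [2.0, 4.0, 5.0, 6.0, 7.0, 9.0, 30.0]
summary = box_summary(example)
print(summary)
```

In this invented example the value 30.0 lies beyond the upper line, so it would be plotted as a separate point rather than included in the line.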

The larger the box and the extent between the minimum and maximum lines (the ‘whiskers’), the wider the range of variation in the data; i.e., the more varied the dataset. Thus, in Figure 1 fuck and fucking show the most variation and bitch, damn and hell show the least variation. In addition, the episode from Eastbound & Down (with a ‘sweary’ character in the form of Kenny) is an outlier for two of the forms (ass and shit) and an extreme case for fucking, while the episode from True Blood is an outlier for bitch, damn and hell.

Figure 1 Eight swear/taboo word forms (boxplots)

In sum, the boxplots (and the histograms) show that there is a considerable amount of variation in the data (between episodes) and that different swear/taboo words behave very differently. Generalizations about all swear/taboo words in TV series should not ignore or downplay such variation.

Notes

1 Zero occurrences are not included in this dataset – in other words, this visualization relates to those episodes where the respective word occurs at least once. This results in a variable number of episodes being included in the box plot for each word form. All calculations are based on normalized frequency (occurrence of the word form per 10,000 words in the episode) – after data were analysed as described in this document. Visualizations were produced using SPSS (simple boxplot); data were imported into SPSS from Excel using the default settings. I am very grateful to Dr Kayla Maloney, Data Analysis Officer at the University of Sydney, for help and advice throughout this process. For information on SPSS boxplots I also consulted http://web.mit.edu/11.220/www05/brushup/spss/boxplot.htm.

2 The interquartile range is based on dividing the values in a dataset into four sections (quartiles), and contains the middle 50% of records. The interquartile range is based on the dataset itself and depends on arranging the values in order of magnitude:

The lower quartile is found one quarter of the way along a dataset when the values have been arranged in order of magnitude; the upper quartile is found three quarters along the dataset. Therefore, the upper quartile lies half way between the median and the highest value in the dataset whilst the lower quartile lies halfway between the median and the lowest value in the dataset. The inter-quartile range is found by subtracting the lower quartile from the upper quartile.

(https://www2.le.ac.uk/offices/ld/resources/numerical-data/variability)

The median = ‘the central score of the distribution, with half of the scores being above the median and half falling below.’ (Oakes 1998: 2). It can be considered a proxy for overall frequency. That is, a dataset with a higher median is likely to have a higher overall normalized frequency of occurrence.

3 Outliers = values lying between 1.5 and 3 times the interquartile range beyond the edge of the box; extremes = values lying more than 3 times the interquartile range beyond the edge of the box.
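The distinction can be expressed as a simple rule. The following is a minimal sketch, assuming distances are measured from the nearer edge of the box; the function name classify_value and the quartile values used in the example are hypothetical and not drawn from the dataset.

```python
def classify_value(value: float, q1: float, q3: float) -> str:
    """Label a value as 'extreme', 'outlier' or 'within whiskers'.

    Hypothetical helper: distance is measured from the nearer edge of the
    box (q1 or q3) in units of the interquartile range.
    """
    iqr = q3 - q1
    # Distance beyond the box; zero for values inside the box.
    distance = max(q1 - value, value - q3, 0.0)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "outlier"
    return "within whiskers"


# Invented quartiles: lower quartile 4.5, upper quartile 8.0 (IQR 3.5).
for v in (9.0, 15.0, 30.0):
    print(v, classify_value(v, q1=4.5, q3=8.0))
```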

References

Oakes, Michael 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.