When Is It Improper to Use Bar Charts?
The issue
The fix
Conclusion

…and possible ways to fix it

Image generated by Canva text to image tool

Don’t get me wrong, bar charts can be a great tool for data visualization, especially when used for displaying counts or proportions. However, using them the wrong way can lead to unintentional (or even worse, intentional) data misinterpretation. The specific issue I will be talking about today is using bar charts to present aggregated summary statistics such as means or medians.

The main problem here is the lack of detail: bar charts can oversimplify, leaving out important information such as variance, distribution, outliers, and trends. In this post I’ll illustrate the problem using a series of examples and propose potential solutions. So as not to interrupt the flow of the post, the code for the charts is provided at the end for those who are interested 🙂

The wine quality dataset

Photo by Kym Ellis on Unsplash

For this post, I will be using the wine quality dataset¹, available through the UCI ML repository. Although the dataset includes many wine properties, we’ll focus on the total sulfur dioxide measurements.

Sulfur, commonly added to wine as sulfur dioxide, plays an important role in winemaking due to its preservative qualities. Acting as an antioxidant, it helps prevent the wine’s oxidation, safeguarding it from discoloration and undesired flavor alterations. Its antimicrobial characteristics also protect the wine against spoilage from bacteria and yeasts, preserving the intended taste and quality.

Let’s illustrate the problem by plotting a simple bar chart comparing total sulfur dioxide levels between red and white wines.

Image by Author

Okay, maybe it isn’t fair to bash bar charts using the above example, because the basic chart looks so ugly that it’s off-putting with no further argument needed. Let’s first make it a bit prettier by tweaking some aesthetic properties.

Image by Author

Much better. Now, back to the problem at hand. What does the chart tell us? Well, obviously, the sulfur levels appear to be much higher for white wines. This was to be expected due to the differences in the winemaking process between red and white wines.

Red wines are fermented with their skins, providing natural antioxidants that help protect the wine from oxidation. In contrast, white wines are typically made by pressing the grapes and removing the skins prior to fermentation. This leaves them more vulnerable to oxidation, requiring additional protection in the form of sulfur dioxide.

Although the overall effect is discernible, the bar chart gives us no information about the distribution of values in each group, or the number of observations in each group.

This can partially be addressed by adding the number of observations above the bars and adding error bars to indicate the standard deviation of each group.

Image by Author

This might be enough if the underlying distribution of values is symmetrical, but that doesn’t have to be the case, which makes standard deviation a poor choice as a dispersion statistic. Nothing more can be added to a bar chart to fix this without turning it into a different type of chart altogether. This means that bar charts are not ideal for presenting this type of data.
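To make the point about asymmetric data concrete, here is a minimal sketch in base R. The values are made up for illustration (they are not taken from the wine dataset) and are deliberately right-skewed. The sketch compares the symmetric interval that a mean ± SD error bar would draw with the quartiles of the same data.

```r
# Hypothetical right-skewed measurements (made-up values, not the wine data)
values <- c(5, 6, 7, 8, 9, 10, 12, 15, 40, 120)

avg <- mean(values)
spread <- sd(values)

# Symmetric interval that a mean +/- SD error bar would draw
sd_interval <- c(lower = avg - spread, upper = avg + spread)

# Quartiles describe the same data without assuming symmetry
quartiles <- quantile(values, probs = c(0.25, 0.5, 0.75))

print(sd_interval)
print(quartiles)
print(range(values))
```

With these numbers the lower end of the SD interval falls below zero, which is lower than any observed value and impossible for a concentration, while the quartiles stay inside the observed range.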

So, what are the possible alternatives? I’ll go through a few in the rest of the post.

Here I offer 4 possible alternatives that I believe are better and more transparent solutions.

1. Jittered points

The first option is to add the actual individual observations to the chart.

Image by Author

This can be a great alternative if the number of observations is relatively small. However, in this particular case, it feels rather cumbersome on its own due to the very large number of wines in the dataset.

2. Boxplots with specified means

The second alternative is using boxplots with the added twist of showing the means in addition to the medians (which are displayed by a flat line within the central box by default). Although boxplots give us an idea of the underlying distribution by showing quartiles, I like the extra information the mean offers. This is because a large and easily visible difference between the mean and the median immediately tells us whether the distribution is skewed and in which direction.

Image by Author
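The mean-versus-median reading can be sketched in a few lines of base R. The helper `skew_direction()` and the three small vectors are hypothetical, written only for this illustration. The rule is a quick heuristic rather than a formal skewness test, and it can mislead for unusual distributions.

```r
# Hypothetical helper: guess the skew direction from the mean-median gap
skew_direction <- function(x) {
  gap <- mean(x) - median(x)
  if (gap > 0) {
    "right"   # mean pulled up by a long upper tail
  } else if (gap < 0) {
    "left"    # mean pulled down by a long lower tail
  } else {
    "none"
  }
}

# Made-up samples for illustration
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)
left_skewed  <- 21 - right_skewed   # mirror image of the sample above
symmetric    <- c(1, 2, 3, 4, 5)

print(skew_direction(right_skewed))
print(skew_direction(left_skewed))
print(skew_direction(symmetric))
```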

3. Violin plots with medians

Violin plots are great because they let us in on the shape of the underlying distributions, making it possible to easily detect anomalies such as bimodality or skewness. One might argue that boxplots do this implicitly as well. Although I agree up to a point, we also have to keep in mind that a person needs to be taught how to read a boxplot, whereas that’s not the case with violin plots.

I also like to add information on the median because the violins leave a lot of unused space, so why not 🙂

Image by Author
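As a rough illustration of what a violin plot can reveal and a bar of means cannot, the base R sketch below builds two made-up samples with the same mean and median, one with a single peak and one with two. The helper `count_peaks()` is hypothetical, and its 10% height threshold is an arbitrary choice for this example. It counts local maxima of a kernel density estimate, which is the same kind of curve a violin plot draws.

```r
# Two made-up samples built from evenly spaced normal quantiles (deterministic)
unimodal <- qnorm(ppoints(200), mean = 50, sd = 10)
bimodal <- c(qnorm(ppoints(100), mean = 30, sd = 4),
             qnorm(ppoints(100), mean = 70, sd = 4))

# Hypothetical helper: count local maxima of a kernel density estimate,
# ignoring negligible bumps below 10% of the highest peak
count_peaks <- function(x) {
  d <- density(x)
  is_peak <- which(diff(sign(diff(d$y))) == -2) + 1
  sum(d$y[is_peak] > 0.1 * max(d$y))
}

print(c(mean_unimodal = mean(unimodal), mean_bimodal = mean(bimodal)))
print(c(median_unimodal = median(unimodal), median_bimodal = median(bimodal)))
print(c(peaks_unimodal = count_peaks(unimodal),
        peaks_bimodal = count_peaks(bimodal)))
```

A bar chart of the means or medians would draw these two groups identically, while the density shapes differ clearly.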

4. Violin plots with jittered points

Okay, this one isn’t really a standalone option, but rather a combination of options 1 and 3. For our specific case, this would be my pick, but that doesn’t mean it would be ideal for all possible scenarios, as that depends on specifics of the problem such as the number of groups for comparison, total number of points, group dispersions, …

Notice that I didn’t try to combine boxplots with individual points. That is intentional, as I feel that such a combination would defeat the purpose of the boxplot. Namely, boxplots display individual points only if they lie more than 1.5 interquartile ranges above the upper border or below the lower border of the central box. This can be used as a simple method for outlier detection, and it would be obscured by adding too many other points.

Image by Author
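The 1.5 interquartile range rule mentioned above can be sketched in base R as follows. The helper `iqr_outliers()` and the sample values are hypothetical, and the quartiles come from R's default `quantile()` method, so the result may differ slightly from what a particular plotting function flags.

```r
# Hypothetical helper: return values lying more than k IQRs outside the box
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75))
  box_height <- q[[2]] - q[[1]]            # interquartile range
  lower_fence <- q[[1]] - k * box_height
  upper_fence <- q[[2]] + k * box_height
  x[x < lower_fence | x > upper_fence]
}

# Made-up values with one unusually low and one unusually high observation
values <- c(1, 20, 22, 23, 24, 25, 26, 28, 60)

print(iqr_outliers(values))
```

Here the quartiles are 22 and 26, so the fences sit at 16 and 32, and only 1 and 60 are returned.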

This post discusses a specific issue with using bar charts to present aggregate group statistics, using the wine quality dataset to provide hands-on examples. After illustrating the problem, 4 possible alternatives are presented and their advantages and drawbacks are discussed.

Finally, remember, the primary goal of any data visualization is to accurately and effectively convey information. Always choose the type of visualization that best fits the data and the message you want to communicate.

I hope you found the post useful. If you have any comments, feel free to leave a reply to the post. And, of course, if you liked what you read, please clap and follow me for more similar content.

Footnotes

¹P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553, 2009. (CC BY 4.0)

Code for generating the charts

library(tidyverse)

# Read the red and white wine files, label each with its type, and stack them
wine <- read_delim("winequality-red.csv",
                   delim = ";", escape_double = FALSE, trim_ws = TRUE) %>%
  mutate(Type = "Red") %>%
  bind_rows(read_delim("winequality-white.csv",
                       delim = ";", escape_double = FALSE, trim_ws = TRUE) %>%
              mutate(Type = "White")) %>%
  mutate(Type = factor(Type)) %>%
  # Reshape to long format and keep only the total sulfur dioxide values
  pivot_longer(`fixed acidity`:`quality`,
               names_to = "Parameter", values_to = "Value") %>%
  filter(Parameter == "total sulfur dioxide") %>%
  select(-Parameter)

# Per-type summary statistics used by all charts below
wine_summary <- wine %>%
  group_by(Type) %>%
  summarise(Median = median(Value), Mean = mean(Value),
            SD = sd(Value), N = n())

# Basic bar chart
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_col() +
  labs(x = "Wine type", y = "Total sulfur levels")

# Aesthetically pleasing bar chart
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_col(aes(fill = Type), width = 0.8) +
  scale_fill_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# Bar chart with error bars (mean +/- SD) and the number of observations per group
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_col(aes(fill = Type), width = 0.8) +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.15) +
  geom_label(aes(y = 200, label = N), fill = "gray97") +
  scale_fill_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# Jittered points chart
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_jitter(data = wine, aes(x = Type, y = Value, col = Type), alpha = 0.4) +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.15) +
  geom_point(shape = 4, size = 2, stroke = 2) +
  geom_label(aes(y = 450, label = N), fill = "gray97") +
  scale_color_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# Boxplot with the mean added as a cross
wine_summary %>%
  ggplot(aes(Type, Mean)) +
  geom_boxplot(data = wine, aes(x = Type, y = Value, col = Type)) +
  geom_point(shape = 4, size = 2, stroke = 2) +
  geom_label(aes(y = 450, label = N), fill = "gray97") +
  scale_color_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# Violin plot with the median added as a cross
wine_summary %>%
  ggplot(aes(Type, Median)) +
  geom_violin(data = wine, aes(x = Type, y = Value, col = Type)) +
  geom_point(shape = 4, size = 2, stroke = 2) +
  geom_label(aes(y = 450, label = N), fill = "gray97") +
  scale_color_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")

# Violin plot with added jittered points
wine_summary %>%
  ggplot(aes(Type, Median)) +
  geom_violin(data = wine, aes(x = Type, y = Value), fill = "gray92") +
  geom_jitter(data = wine, aes(x = Type, y = Value, col = Type), alpha = 0.1) +
  geom_point(shape = 4, size = 2, stroke = 2) +
  geom_label(aes(y = 450, label = N), fill = "gray97") +
  scale_color_manual(values = c("#b11226", "#F4E076")) +
  labs(x = "Wine type", y = "Total sulfur levels") +
  theme_bw() +
  theme(legend.position = "none")
