This article is a follow-up to my earlier piece: The Dangers of Deceptive Data: Confusing Charts and Misleading Headlines. That first article focused on how charts can be used to mislead, diving right into a form of data presentation widely used in public discourse.
In this article, I go a bit deeper, exploring how a misunderstanding of statistical ideas creates a breeding ground for being deceived by data. Specifically, I'll walk through how correlation, base rates, summary statistics, and misinterpretation of uncertainty can lead people astray.
Let’s get right into it.
Correlation ≠ Causation
Let's start with a classic to get in the right mindset for some more complex ideas. From the earliest statistics classes in grade school, we're all told that correlation is not equal to causation.
If you do a little bit of Googling or reading, you will find "statistics" showing a high correlation between cigarette consumption and average life expectancy [1]. Interesting. Does that mean we should all start smoking to live longer?
Of course not. We're missing a confounding factor: buying cigarettes requires money, and wealthier countries understandably have higher life expectancies. There is no causal link between cigarettes and longevity. I like this example because it is so blatantly misleading and illustrates the point well. In general, it's important to be wary of any data that only shows a correlational link.
From a scientific standpoint, a correlation can be identified through observation, but the only way to claim causation is to actually conduct a randomized trial that controls for potential confounding factors, a fairly involved process.
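To make the role of the confounder concrete, here is a minimal Python sketch with entirely made-up numbers: a single "wealth" variable drives both cigarette consumption and life expectancy, producing a strong correlation between the two even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical confounder: national wealth (arbitrary units)
wealth = rng.normal(loc=50, scale=15, size=1000)

# Both quantities depend on wealth, not on each other
cigarette_consumption = 0.8 * wealth + rng.normal(0, 5, size=1000)
life_expectancy = 45 + 0.5 * wealth + rng.normal(0, 3, size=1000)

# A strong correlation appears despite there being no causal link
corr = np.corrcoef(cigarette_consumption, life_expectancy)[0, 1]
print(f"Correlation: {corr:.2f}")
```

The correlation here is purely an artifact of the shared dependence on wealth; randomizing or controlling for wealth would make it disappear.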
I chose to start here because, while introductory, this idea also highlights a key concept that underpins understanding data effectively: The data only shows what it shows, and nothing else.
Keep that in mind as we move forward.
Remember Base Rates
In 1978, Dr. Stephen Casscells and his team famously asked a group of 60 physicians, residents, and students at Harvard Medical School the following question:
"If a test to detect a disease whose prevalence is 1 in 1,000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person's symptoms or signs?"
Though presented in medical terms, this question is really about statistics. Accordingly, it also has clear connections to data science. Take a moment to think about your own answer to this question before reading further.
The answer is (roughly) 2%. Now, if you read through this quickly (and aren't up to speed with your statistics), you may have guessed significantly higher.
This was certainly the case with the medical school folks. Only 11 of the 60 people answered the question correctly, with 27 of the 60 going as high as 95% in their response (presumably just subtracting the false positive rate from 100).
It is easy to assume that the actual value should be high because of the positive test result, but this assumption contains a crucial reasoning error: it fails to account for the extremely low prevalence of the disease in the population.
Said another way, if only one in every 1,000 people has the disease, this must be taken into account when calculating the probability that a randomly tested person has the disease. The probability does not rely on the positive test result alone. As soon as the test accuracy falls below 100%, the influence of the base rate comes into play quite significantly.
Formally, this reasoning error is known as the base rate fallacy.
To see this more clearly, imagine that only one in every 1,000,000 people had the disease, but the test still had a false positive rate of 5%. Would you still assume that a positive test result immediately indicates a 95% probability of having the disease? What if it were 1 in a billion?
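Here is a short sketch of the arithmetic behind that 2% figure, using Bayes' theorem. One assumption on my part: the test catches every true case (100% sensitivity), since the original question only specifies the false positive rate.

```python
def p_disease_given_positive(prevalence, false_positive_rate, sensitivity=1.0):
    """P(disease | positive test) via Bayes' theorem."""
    p_positive = (sensitivity * prevalence
                  + false_positive_rate * (1 - prevalence))
    return sensitivity * prevalence / p_positive

# The Casscells question: prevalence of 1 in 1,000, 5% false positive rate
print(p_disease_given_positive(1 / 1_000, 0.05))          # ~0.0196, roughly 2%

# The thought experiments above: rarer diseases, same test
print(p_disease_given_positive(1 / 1_000_000, 0.05))      # ~0.00002
print(p_disease_given_positive(1 / 1_000_000_000, 0.05))  # vanishingly small
```

The rarer the disease, the more a positive result is dominated by false positives, which is exactly what the base rate fallacy overlooks.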
Base rates are extremely important. Remember that.
Statistical Measures Are NOT Equivalent to the Data
Let's take a look at the following quantitative data sets (13 of them, to be precise), all of which are visualized as scatter plots. One is even in the shape of a dinosaur.

Do you see anything interesting about these data sets?
I'll point you in the right direction. Here is a set of summary statistics for the data:
| Statistic | Value |
| --- | --- |
| X Mean | 54.26 |
| Y Mean | 47.83 |
| X Standard Deviation | 16.76 |
| Y Standard Deviation | 26.93 |
| Correlation | -0.06 |
If you're wondering why there is only one set of statistics, it's because they are all the same. Each one of the 13 charts above has the same mean, standard deviation, and correlation between variables.
This famous collection of 13 data sets is known as the Datasaurus Dozen [5], and it was published some years ago as a stark example of why summary statistics cannot always be trusted. It also highlights the value of visualization as a tool for data exploration. In the words of renowned statistician John Tukey,
"The greatest value of a picture is when it forces us to notice what we never expected to see."
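If you want to check this yourself, here is a minimal pandas sketch. It assumes you have a local copy of the data as a tab-separated file with dataset, x, and y columns; the file name and column names are assumptions based on how the data is commonly distributed.

```python
import pandas as pd

# Assumed local copy of the Datasaurus Dozen (columns: dataset, x, y)
df = pd.read_csv("DatasaurusDozen.tsv", sep="\t")

# Per-shape summary statistics: nearly identical across all 13 data sets
summary = df.groupby("dataset").agg(
    x_mean=("x", "mean"),
    y_mean=("y", "mean"),
    x_sd=("x", "std"),
    y_sd=("y", "std"),
)
summary["correlation"] = (
    df.groupby("dataset")[["x", "y"]]
      .apply(lambda g: g["x"].corr(g["y"]))
)
print(summary.round(2))
```

Plotting each group as its own scatter plot is what actually reveals the dinosaur; the table of summary statistics never will.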
Understanding Uncertainty
To conclude, I want to discuss a slight variation on deceptive data, one that is equally important: mistrusting data that is actually correct. In other words, false deception.
The following chart is taken from a study analyzing the sentiment of headlines from left-leaning, right-leaning, and centrist news outlets [6]:

There is quite a bit going on in the chart above, but there is one particular aspect I want to draw your attention to: the vertical lines extending from each plotted point. You may have seen these before. Formally, they are called error bars, and they are one way that scientists depict uncertainty in the data.
Let me say that again. In statistics and data science, "error" is synonymous with "uncertainty." Crucially, it does not mean something is wrong or incorrect about what is being shown. When a chart depicts uncertainty, it depicts a carefully calculated measure of the range of a value and the level of confidence at various points within that range. Unfortunately, many people take it to mean that whoever made the chart is essentially guessing.
This is a serious error in reasoning, and the damage is twofold: not only does the data at hand get misinterpreted, but the presence of this misconception also contributes to the harmful societal belief that science is not to be trusted. Being upfront about the limitations of data should actually increase our confidence in a claim's reliability, but mistaking that openness for an admission of foul play leads to the opposite effect.
Learning how to interpret uncertainty is difficult but incredibly important. At a minimum, a good place to start is understanding what the so-called "error" is actually trying to convey.
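As a small illustration of what an error bar typically encodes, here is a sketch that computes group means with 95% confidence intervals from synthetic data and draws them with matplotlib. The outlet names and numbers are made up; the point is that the bar's length is a calculated statement of confidence, not a guess.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic "headline sentiment" scores for three made-up outlet groups
groups = ["Outlet A", "Outlet B", "Outlet C"]
samples = [rng.normal(loc=m, scale=0.4, size=200) for m in (-0.2, 0.0, 0.1)]

means = [s.mean() for s in samples]
# 95% confidence interval half-width: about 1.96 standard errors of the mean
errors = [1.96 * s.std(ddof=1) / np.sqrt(len(s)) for s in samples]

plt.errorbar(groups, means, yerr=errors, fmt="o", capsize=4)
plt.ylabel("Mean sentiment score")
plt.title("Error bars encode calculated uncertainty")
plt.show()
```

A longer bar does not mean the analyst is less trustworthy; it means the data supports a wider range of plausible values, and the chart is being honest about that.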
Recap and Final Thoughts
Here’s a cheat sheet for being wary of deceptive data:
- Correlation ≠ causation. Look for the confounding factor.
- Remember base rates. The probability of a phenomenon is influenced by its prevalence in the population, no matter how accurate your test is (unless accuracy is 100%, which is rare).
- Beware of summary statistics. Means and medians will only take you so far; you need to explore your data.
- Don't misunderstand uncertainty. It isn't a mistake; it's a carefully considered description of confidence levels.
Remember these, and you'll be well positioned to tackle the next data science problem that makes its way to you.
Until next time.
References
[1] Alberto Cairo
[2] https://pmc.ncbi.nlm.nih.gov/articles/PMC4955674
[4] https://visualizing.jp/the-datasaurus-dozen
[6] https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276367