The Problem with Statistics

Jonathan Taylor
5 min read · Aug 14, 2020

Statistics can be one of the most divisive and harmful misinformation tools, and I have seen them misused all over Facebook. I have attempted to make sense of the apparent conflict between statistics and reality. I knew nothing about data science when I began to write this, and after researching, I realized that I am woefully unprepared for the attempt. That said, I decided to give it a try. (Note: I know that I have a bias toward the existence and ubiquity of systemic racism. This article is a general critique, but on second reading, my examples betray that bias.)

The Impetus

Candace Owens claimed that 75% of black homes are without a father. She said that if you believe that there are strong male role models in Black America, you are entering “the land of the delusional.”

That’s a damning statistic. Larry Elder recently tweeted: “Assume there’s a vaccine against white racism. Would 70% of black kids STILL be raised in fatherless homes?”

The problem is that this statistic is based on census data, and while it is technically correct, it doesn't say what they claim. Yes, 75% of black homes have unmarried parents, but an unmarried home is not the same as a fatherless home. In fact, 45% of black children live with unmarried parents who share the same household. That figure is a little less damning and on par with white figures. How about shared custody? That accounts for 25% of children. In reality, the percentage of fatherless homes is higher in black America, but not to the extent presented to us. The statistic, while technically correct, is used to deceive us.
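
To see how far the headline number drifts from the claim, here is a back-of-the-envelope sketch in Python. It assumes the quoted shares are disjoint slices of the same headline figure, which the actual census categories may not support; treat it as an illustration of the logic, not a recalculation of the data.

```python
# Back-of-the-envelope sketch using the shares quoted above.
# ASSUMPTION: the shares are disjoint slices of one headline figure;
# the real census categories may overlap or use different bases.
unmarried      = 0.75  # headline: homes with unmarried parents
cohabiting     = 0.45  # unmarried parents, same household -> father present
shared_custody = 0.25  # father present part-time

# "Unmarried" is not "fatherless": remove the arrangements where a
# father is actually in the picture.
father_absent = unmarried - cohabiting - shared_custody
print(f"father truly absent: at most {father_absent:.0%}")  # ~5%, not 75%
```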

Whether the statistic is about police brutality, the number of out-of-state protesters, or job numbers, it can easily be used to support whatever side you are on. How? Because statistics are, by definition, nuanced and subject to interpretation. They NEVER speak for themselves. This is a feature, not a bug.

The Nerdy Data

The first is Statistical Significance. Significance only tells a researcher that an observed result is unlikely to be pure chance; the researcher still has to decide what it means and report it. A study may show a thing, but that thing being statistically significant doesn't necessarily mean it is meaningful or of any practical importance.
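
Here is a quick sketch of that gap, with invented numbers: give two groups a trivially small difference, collect a huge sample, and the test comes back "significant" anyway. This is a toy example, not data from any study cited here.

```python
# Toy example (invented data): statistically significant != important.
# A 0.1-point difference on a 15-point spread is noise in practice,
# but a million samples make the p-value microscopic anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=15.0, size=1_000_000)
group_b = rng.normal(loc=100.1, scale=15.0, size=1_000_000)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"p-value: {p_value:.1e}")                             # tiny -> "significant!"
print(f"actual gap: {group_b.mean() - group_a.mean():.3f}")  # ~0.1 -> trivial
```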

A good example is the previously mentioned fatherless-homes statistic. Both commentators assigned significance to the part of the figure that backed up their narrative. In reality, that raw census figure had no statistical significance at all; it was a description, not a tested result.

Secondly, there are Irrelevant Plots. These often happen when a researcher enters a study with a bias in need of confirmation. An irrelevant plot is like a photo of a seemingly giant banana that turns out to be normal-sized once a quarter is placed in the frame for scale. We need a baseline to determine the relevance of a statistic. Unfortunately, benchmarks can be impossible to agree upon.

Let’s take the use of non-lethal force against minorities. Is it on the rise or not? Well, it depends: are we looking at the black population alone? Then no, it’s steady. Do we include “unknown race” and “other”? Then yes, it’s definitely on the rise. Many of those recorded as “unknown race” are black. Many are not. Jamaicans are often listed as “other” when they could just as easily be put in the “black” category. The baseline is so squishy that I could, right now, quote you statistics showing rates of non-lethal force improving, or make them appear to be rising egregiously. If I’m Larry Elder or Candace Owens, then it’s improving. If I’m a Black Lives Matter representative, then it’s a call to arms. See how easily one statistic is manipulated for my cause?
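
Here is that squishy baseline in miniature. The counts below are invented for illustration; the point is only that the same records read as "steady" or "rising" depending on which categories you fold in.

```python
# Invented counts for illustration -- not real use-of-force data.
# The trend you report depends entirely on which buckets you include.
by_race_2018 = {"black": 500, "unknown": 40, "other": 30}
by_race_2019 = {"black": 502, "unknown": 90, "other": 60}

# Baseline 1: black population alone -> basically flat.
print("narrow:", by_race_2018["black"], "->", by_race_2019["black"])

# Baseline 2: fold in "unknown" and "other" -> definitely on the rise.
print("broad: ", sum(by_race_2018.values()), "->", sum(by_race_2019.values()))
```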

The third is Correlation Does Not Equal Causation. If you see a stat saying crime is higher in areas where black people live, it’s easy to assume that the black population is the cause of the crime. Unfortunately, causation is the most challenging thing a data scientist must decipher, and data alone can rarely establish it. Determining causation often requires an in-depth investigation: boots on the ground. Interviews, historical research, control groups, and experiments are not the purview of most data scientists. Data is.

What if crime is higher in a predominantly black neighborhood because that neighborhood has low-income housing? And what if there was a time that the neighborhood wasn’t predominantly black, and the crime was still higher in that area? What if the area became mostly black because the more impoverished minority community relocated there because of the affordable housing? If all of these are considered, the conclusion could easily be reported: “Black populations are not solving a historic crime problem they inherited in certain neighborhoods.” Much less damning, and much more ridiculous.
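That hypothetical is easy to simulate. In the sketch below (all numbers invented), housing cost drives both who moves into a neighborhood and its crime rate; the two end up strongly correlated even though neither was generated from the other.

```python
# Toy confounder simulation (invented data). housing_cost is the hidden
# cause; minority_share and crime_rate each depend on it, not on each other.
import numpy as np

rng = np.random.default_rng(1)
housing_cost   = rng.uniform(0.0, 1.0, size=5_000)
minority_share = (1 - housing_cost) + rng.normal(0, 0.1, size=5_000)
crime_rate     = (1 - housing_cost) + rng.normal(0, 0.1, size=5_000)

corr = np.corrcoef(minority_share, crime_rate)[0, 1]
print(f"correlation: {corr:.2f}")  # ~0.9, yet neither variable caused the other
```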

The fourth factor is the Yule-Simpson Effect. Defined, it’s when several groups of data each suggest one thing, but the conclusion reverses when the groups are combined. “In 1973, admission rates were investigated at the University of California, Berkeley’s graduate schools. Women sued the university for the gender gap in admissions. With each school examined separately (law, medicine, engineering, etc.), women were admitted at a higher rate than men! However, the average suggested that men were actually admitted at a much higher rate than women.” (https://www.statisticshowto.com/what-is-simpsons-paradox/)
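
The reversal is easier to believe with numbers in front of you. The counts below are invented to mimic the Berkeley pattern (they are not the 1973 data): women out-admit men in every department, yet men out-admit women in the pooled total, because women disproportionately applied to the harder department.

```python
# Yule-Simpson sketch with invented counts (not the 1973 Berkeley data).
# dept -> {group: (admitted, applied)}
depts = {
    "easy_dept": {"women": (9, 10),   "men": (240, 300)},  # 90% vs 80%
    "hard_dept": {"women": (30, 100), "men": (2, 10)},     # 30% vs 20%
}

# Per department: women are admitted at the higher rate in BOTH.
for name, groups in depts.items():
    rates = {g: f"{adm / app:.0%}" for g, (adm, app) in groups.items()}
    print(name, rates)

# Pooled across departments: men come out far ahead (~78% vs ~35%).
for g in ("women", "men"):
    admitted = sum(d[g][0] for d in depts.values())
    applied  = sum(d[g][1] for d in depts.values())
    print(g, f"{admitted / applied:.0%}")
```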

It’s confusing as hell, but it exists. When applied to the issue at hand: I see pundits currently drawing a correlation that areas with black governors see the highest rates of police shootings of minorities. Really? Please take a second and think about it in terms of Correlation Does Not Equal Causation. Think about Irrelevant Plots and Statistical Significance. Now add in the Yule-Simpson Effect and ask: did you look closely at the data? When you combine the data points, it turns out the inverse is true of police shootings as a whole in a few of those cities.

Finally, there’s Sampling, which comes down to the integrity of the one compiling the data. I can sample only the vegan community and come back with a statistic that says nobody eats meat. How do I protect that finding? I hide the specifics of the sampled population. This one speaks for itself and can be the most damning proof of confirmation bias.
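
The vegan survey is one filter away. In this sketch (population invented for illustration), sampling only vegans "proves" nobody eats meat, while the honest population-wide number tells the opposite story.

```python
# Invented population for illustration: ~5% vegan, everyone else eats meat.
import random

random.seed(0)
population = [{"vegan": is_vegan, "eats_meat": not is_vegan}
              for is_vegan in random.choices([True, False], weights=[5, 95], k=10_000)]

# Biased sample: survey only the vegan community.
vegans_only = [p for p in population if p["vegan"]]
print(sum(p["eats_meat"] for p in vegans_only) / len(vegans_only))  # 0.0
print(sum(p["eats_meat"] for p in population) / len(population))    # ~0.95
```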

Conclusion

Our reaction to a statistic shouldn’t be outrage, action, or a Facebook post. It should be to ask questions about the statistic. If it proves reliable, such as prison inmate population statistics, then act on it. But if you share a statistic that backs up your view without asking the questions first, that very act is the harmful promulgation of a potentially false narrative.

Don’t be that guy.


Jonathan Taylor is a Creative Director in Austin, a pilot, an ordained minister, and a centrist researcher trying to find a way to connect the intangible to the tangible.