New York Times Misleads on COVID-19 Cases in Small U.S. Counties

By Aaron O. Ellis

Monday, October 26, 2020

Last week, the New York Times published an excellent series of visualizations tracking the COVID-19 outbreak in rural areas across the United States. At the very end of the article, however, they added a misleading table of the counties with the highest average daily case rates. The intent was to show that the most recent cases are hitting rural areas harder than urban and suburban areas of the country, and they did so by highlighting the counties with fewer than 10,000 people.

In reaction, Professor Scott E. Page of the University of Michigan tweeted:

Professor Page is exactly correct. Unfortunately, while his ire is appropriate when directed at the New York Times - they should know better - a more detailed explanation is warranted for the underlying statistics.

But first, a disclaimer: I am merely demonstrating a property of statistics using COVID-19 data as a background. The spread of COVID-19 is far from random, and cases should never be treated as a simplistic probability distribution.

A Simulation

The U.S. Census Bureau maintains an estimate of county populations, most recently for 2019. The 2020 data will no longer be an estimate, as the census was taken this year, but the data is still being tabulated.

Let’s take a look at this dataset using pandas:

import pandas
df = pandas.read_csv('co-est2019-alldata.csv', encoding='ISO-8859-1')
df.shape
# (3193, 164)

counties = df[df['SUMLEV'] == 50]
counties = counties.filter(['CTYNAME', 'STNAME', 'POPESTIMATE2019'])
counties.rename(columns={"CTYNAME": "County", "STNAME": "State", "POPESTIMATE2019": "Population"}, inplace=True)
counties.set_index(['County', 'State'], inplace=True)
counties.shape
# (3142, 1)

The dataset contains a lot of information, including population estimates for prior years and state-level aggregations. Narrowing the dataset down, we can determine some basic statistics:

  • 3,142 total counties
  • Total population of 328,239,523
  • The least populated is Kalawao County, Hawaii, with 86 people
  • The most populated is Los Angeles County, California, with 10,039,107 people
  • The counties have a mean population of 104,468 and median of 25,726 indicating a heavy right-hand skew
  • There are 718 counties - about 23% - with fewer than 10,000 people

Shown visually in a histogram with a bar for each 10,000 people:

[Figure: Counties]

When this was written, there had been 468,793 new COVID-19 cases in the United States in the past week, which is about 0.1428% of the total population. That is a daily average of 66,970 - or 2.04 daily cases per 10,000 people.
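The arithmetic behind those figures is easy to reproduce. A minimal sketch using only the two totals quoted above (the variable names are illustrative, not from the original analysis):

```python
# Totals quoted in the article.
weekly_cases = 468_793
us_population = 328_239_523

weekly_share = weekly_cases / us_population            # fraction of residents with a new case
daily_cases = weekly_cases / 7                         # average new cases per day
daily_per_10k = daily_cases / us_population * 10_000   # daily cases per 10,000 people

print(f"{weekly_share:.4%}")      # weekly share of the population, as a percentage
print(round(daily_cases))         # daily average
print(round(daily_per_10k, 2))    # daily cases per 10,000 people
```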

For our simulation, we’ll distribute these cases randomly throughout the United States. Once again: this is not how COVID-19 spreads; this is to simply illustrate a property of statistics.

import random
sample = random.sample(range(328239523), 468793)

We can then create bins that represent each county:

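# Sort smallest-first, then give each county a contiguous block of person IDs.
by_population = counties.sort_values(by=['Population'])
last = 0
intervals = []
for value in by_population['Population'].values:
    intervals.append((last, last + value))
    last += value
# closed='left' makes each bin [start, end), so person 0 falls in the first county.
bins = pandas.IntervalIndex.from_tuples(intervals, closed='left')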

And place the sample set into those bins, determining an average daily case rate per 10,000 people:

out = pandas.cut(sample, bins, right=False)
by_population['Weekly Cases'] = out.value_counts().values
by_population['Avg Daily/10k'] = by_population['Weekly Cases'] / by_population['Population'] * 10000 / 7.0
by_population.sort_values(by=['Avg Daily/10k'], ascending=False, inplace=True)
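For readers without the Census file at hand, the same binning idea can be sketched with the standard library alone. The five populations below are hypothetical toy values, not Census data, and `bisect` stands in for `pandas.cut`; treat it as an illustration of the technique rather than a reproduction of the analysis above.

```python
import bisect
import itertools
import random

# Hypothetical toy populations (NOT Census data), sorted smallest first.
populations = [500, 2_000, 10_000, 50_000, 1_000_000]
total = sum(populations)

# Cumulative upper edges: county i owns person IDs in [edges[i-1], edges[i]).
edges = list(itertools.accumulate(populations))

rng = random.Random(0)               # seeded so the run is repeatable
weekly_rate = 0.001428               # weekly share of residents with a case (from the article)
n_cases = round(total * weekly_rate)
cases = rng.sample(range(total), n_cases)   # distinct person IDs, chosen uniformly

# bisect_right returns the index of the first edge strictly greater than the ID,
# which is exactly the county whose half-open block contains that person.
counts = [0] * len(populations)
for person in cases:
    counts[bisect.bisect_right(edges, person)] += 1

for pop, count in zip(populations, counts):
    print(pop, count, round(count / pop * 10_000 / 7, 2))
```

Changing the seed a few times shows the rate for the smallest toy counties jumping around far more than the rate for the largest one.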

Here are the 20 counties with the highest case rates from our simulation - with full names redacted so no one confuses this for real data. Notice anything interesting?

County           State          Population  Weekly Cases  Avg Daily/10k
L* County        Nebraska              664             4           8.61
L* County        Texas                 169             1           8.45
S* Municipality  Alaska              1,183             6           7.24
H* County        New Mexico            625             3           6.86
H* County        Nebraska              682             3           6.28
G* County        Texas               1,409             6           6.08
K* County        Texas                 762             3           5.62
P* County        Montana             1,077             4           5.31
G* County        Washington          2,225             8           5.14
A* County        California          1,129             4           5.06
Q* County        Georgia             2,299             8           4.97
L* County        Texas               3,233            11           4.86
C* County        Montana             1,252             4           4.56
P* County        West Virginia       8,247            26           4.50
S* County        South Dakota        6,376            20           4.48
W* County        North Dakota        3,834            12           4.47
F* County        Nebraska            2,979             9           4.32
P* County        West Virginia       6,969            20           4.10
C* County        Oklahoma            2,137             6           4.01
F* County        North Dakota        3,210             9           4.01

Shown visually, with counties that have fewer than 10,000 people in orange:

[Figure: Case rate by county]

Note that the striations are caused by case counts being discrete and not continuous - you can’t have half a case!

So our simulation shows that counties with smaller populations have both the highest and the lowest case rates, which illustrates Professor Page’s point.

But Why?

Let’s look at the county that had the highest case rate in our simulation. With a population of just 664, the county would need only a single case in the entire week to exceed the national average of 2.04 daily cases per 10,000 people! We can determine the chance of that occurring in our simulation using a binomial distribution, given that any one resident had a 0.1428% chance of having a case in the preceding week:

from scipy.stats import binom

rv = binom(664, 0.001428)  # 664 residents, each with a 0.1428% weekly chance of a case
rv.pmf(1)
# 0.3676444783149298

So there was about a 36.8% chance of the county having one new case during the week. Once again, this is possible because our cases are randomly determined - in reality, cases are far from random.

But that’s the probability of exactly one case. We can find the probability of one or more cases with:

1 - binom(664, 0.001428).pmf(0)
# 0.6128215783303926

That raises the chance that this county would exceed the national average case rate to 61.3%.

And the chance of 4 or more cases?

rv = binom(664, 0.001428)
1 - (rv.pmf(0) + rv.pmf(1) + rv.pmf(2) + rv.pmf(3))
# 0.01589406080799649

So there was a 1.6% chance of this county having more than 4 times the national average!
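As a cross-check that does not depend on scipy, the same three probabilities can be computed directly from the binomial formula, P(X = k) = C(n, k) p^k (1 - p)^(n - k). This is a small sketch with the standard library; the helper name `pmf` is illustrative.

```python
from math import comb

n, p = 664, 0.001428   # county population and weekly per-resident case probability

def pmf(k):
    """Binomial probability of exactly k cases among n residents."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

exactly_one = pmf(1)                               # exactly one case
one_or_more = 1 - pmf(0)                           # at least one case
four_or_more = 1 - sum(pmf(k) for k in range(4))   # at least four cases

print(exactly_one, one_or_more, four_or_more)
```

The three values should agree with the scipy results above to many decimal places.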

To represent all these likelihoods visually:

[Figure: Small county case count likelihoods]

To match that rate, the largest county in America, Los Angeles, would need 60,477 cases or more in a week (4 / 664 × 10,039,107 ≈ 60,476.5, rounded up). And what is the probability of that occurring if the cases are random? Once again we can use the binomial distribution:

rv = binom(10039107, 0.001428)
1 - sum([rv.pmf(x) for x in range(0, 60477)])
# 1.2370320878751784e-08

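Or about 0.000001237%, at least as printed. That figure is really floating-point rounding error left over from summing more than 60,000 terms that total almost exactly 1; the true probability is far smaller than a 64-bit float can represent. In actuality, Los Angeles had 10,801 cases over the previous week, which was below the 14,335 we’d expect if the county had the national case rate.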
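One way to get a feel for how extreme that threshold is: measure it in standard deviations from the expected count. The sketch below uses the binomial mean and standard deviation with the same 0.001428 probability as the code above. Reading the result as a z-score leans on a normal approximation, which is only a rough gauge this deep in the tail, so treat it as a sense of scale rather than a probability.

```python
from math import sqrt

n, p = 10_039_107, 0.001428        # Los Angeles County population, weekly case probability

expected = n * p                   # mean of Binomial(n, p)
std_dev = sqrt(n * p * (1 - p))    # standard deviation of Binomial(n, p)

def z_score(cases):
    """How many standard deviations a weekly count sits from the expected count."""
    return (cases - expected) / std_dev

z_threshold = z_score(60_477)   # count needed to match the small county's rate
z_actual = z_score(10_801)      # count Los Angeles actually reported

print(round(expected), round(std_dev, 1), round(z_threshold), round(z_actual, 1))
```

A threshold hundreds of standard deviations above the mean is why the honest answer here is "indistinguishable from zero" rather than any particular printed number.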

The more sample points we have, the easier it is to determine if a conclusion is significant. And that’s what the New York Times forgot: if you’re going to compare small populations to large populations, then naturally your small populations will be more varied to the extremes - both high and low!
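The effect can be made precise under the same random-case assumption used in the simulation. If a county of n people has X cases with X following a binomial distribution, the spread of its observed rate relative to the true rate shrinks like one over the square root of the expected case count. A short derivation, with the two counties discussed above plugged in (p = 0.001428 throughout):

```latex
\begin{align*}
X &\sim \mathrm{Binomial}(n, p), \qquad \hat{r} = \frac{X}{n} \\
\mathbb{E}[\hat{r}] &= p, \qquad
\mathrm{SD}(\hat{r}) = \sqrt{\frac{p(1-p)}{n}} \\
\frac{\mathrm{SD}(\hat{r})}{\mathbb{E}[\hat{r}]} &= \sqrt{\frac{1-p}{np}} \\
n = 664 &: \quad \sqrt{\frac{0.998572}{0.948}} \approx 1.03 \\
n = 10{,}039{,}107 &: \quad \sqrt{\frac{0.998572}{14{,}335.8}} \approx 0.0083
\end{align*}
```

In other words, the small county's weekly rate typically misses the true rate by about 100% of its value, while the Los Angeles rate typically misses by under 1%.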

Outbreak Likelihoods

What if we use these likelihoods to re-examine the original question: are recent COVID-19 cases hitting rural areas harder? Thankfully, the New York Times does an excellent job with regard to data provenance, and uploaded the case counts used in the original article to a GitHub repository.

Let’s create a table with the counties that have the least likely case rates above the national average:

County       State         Weekly Cases  Population  Likelihood  Avg Daily/10k
Norton       Kansas                 279       5,361         ~0%          74.35
Grand Forks  North Dakota           699      69,451         ~0%          14.38
Burleigh     North Dakota           925      95,626         ~0%          13.82
Dodge        Wisconsin              836      87,839         ~0%          13.60
Sheboygan    Wisconsin            1,027     115,340         ~0%          12.72
Marathon     Wisconsin            1,125     135,692         ~0%          11.84
Winnebago    Wisconsin            1,339     171,907         ~0%          11.13
El Paso      Texas                6,494     839,238         ~0%          11.05
Outagamie    Wisconsin            1,451     187,885         ~0%          11.03
Fond du Lac  Wisconsin              783     103,403         ~0%          10.82
Minnehaha    South Dakota         1,312     193,134         ~0%           9.70
Mobile       Alabama              2,632     413,210         ~0%           9.10
Brown        Wisconsin            1,626     264,542         ~0%           8.78
Cass         North Dakota         1,075     181,923         ~0%           8.44
Lubbock      Texas                1,510     310,569         ~0%           6.95
Waukesha     Wisconsin            1,789     404,198         ~0%           6.32
Milwaukee    Wisconsin            4,145     945,726         ~0%           6.26
Utah         Utah                 2,394     636,235         ~0%           5.38
Salt Lake    Utah                 4,078   1,160,437         ~0%           5.02
Cook         Illinois            11,597   5,150,233         ~0%           3.22

This table is not a ranking by likelihood, since all these counties have likelihoods approaching zero! These zero values are a strong indication that the random distribution we’ve chosen for our model is completely inappropriate. We’ve also run into an unexpected limit: the probabilities are so small that they sit at or below the smallest positive value our computer’s floating-point numbers can represent.
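One common workaround for that floating-point limit is to stay in log space, where even absurdly small probabilities are ordinary negative numbers. The sketch below is not the method used to build the table; it is a standard-library illustration, with hypothetical helper names `log_pmf` and `log_tail`, applied to two rows of the table under the same naive binomial model.

```python
from math import exp, lgamma, log, log1p

P_WEEK = 0.001428  # weekly per-resident case probability used throughout the article

def log_pmf(k, n, p=P_WEEK):
    """Natural log of the Binomial(n, p) probability of exactly k cases."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log1p(-p)

def log_tail(k, n, p=P_WEEK, terms=500):
    """Approximate log P(X >= k) by summing the first `terms` tail terms in log space.

    Far above the mean the terms shrink geometrically, so a few hundred suffice.
    """
    logs = [log_pmf(j, n, p) for j in range(k, min(n, k + terms) + 1)]
    peak = max(logs)
    # log-sum-exp: factor out the largest term so the exponentials cannot underflow together
    return peak + log(sum(exp(v - peak) for v in logs))

# (weekly cases, population) for two rows of the table above.
norton = log_tail(279, 5_361)
cook = log_tail(11_597, 5_150_233)
print(norton, cook)
```

With these inputs, Cook County's log probability comes out well below Norton County's, even though Norton's case rate is far higher: a modest excess over millions of people is harder to produce by chance than a large excess over a few thousand. That ordering is only as meaningful as the model behind it, which, as noted, is a poor fit for real outbreaks.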

We can visualize these likelihoods for all counties, with counties of fewer than 10,000 people once again in orange:

[Figure: Likelihoods]

In the above visualization, the counties with case rates near the national average will be in the center of the y-axis. Counties with rates higher than expected will be towards the top of the y-axis, and counties with lower rates towards the bottom.

We could also re-examine what it means for a county to be rural. Instead of using a population limit, we can use the definition of the Federal Office of Rural Health Policy: a rural county is any county that is not part of a metropolitan area.

We can create a new visualization, with our newly defined rural counties still in orange:

[Figure: Likelihoods with rural counties]

It’s difficult to draw any conclusion from this visualization. While there appear to be more rural counties with likelihoods above the national average and more urban counties below it, a more rigorous numerical analysis is needed. If anything, we’ve just introduced new biases by using a naive model.

Perhaps with a more appropriate probability distribution these results could have more meaning, but our analysis will always suffer from the original mistake committed by the New York Times. To draw definitive conclusions about the spread of COVID-19 we need more sample points, and in a county-by-county analysis the counties with smaller populations will always have fewer. Forgetting this underlying property of statistics led them to publish a misleading table of data.