Excel spreadsheet limit leads to 16,000 Covid-19 cases left off daily count

Microsoft Excel is useful for many things, but like all software, it has its limits, which in this case led to 15,841 positive Covid-19 tests going unrecorded by Public Health England. For the Guardian, Alex Hern reports:

In this case, the Guardian understands, one lab had sent its daily test report to PHE in the form of a CSV file – the simplest possible database format, just a list of values separated by commas. That report was then loaded into Microsoft Excel, and the new tests at the bottom were added to the main database.

But while CSV files can be any size, Microsoft Excel files can only be 1,048,576 rows long – or, in older versions which PHE may have still been using, a mere 65,536. When a CSV file longer than that is opened, the bottom rows get cut off and are no longer displayed. That means that, once the lab had performed more than a million tests, it was only a matter of time before its reports failed to be read by PHE.
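The truncation is easy to guard against before anything touches Excel. A minimal sketch (the function name and row limits are mine; the limits are Excel's documented worksheet maximums quoted above) that counts a CSV's rows and reports how many a spreadsheet would silently drop:

```python
import csv

# Excel worksheet row limits, header included
XLSX_MAX_ROWS = 1_048_576  # modern .xlsx
XLS_MAX_ROWS = 65_536      # legacy .xls

def check_csv_fits_excel(path, max_rows=XLS_MAX_ROWS):
    """Count rows in a CSV and report how many Excel would cut off."""
    with open(path, newline="") as f:
        n_rows = sum(1 for _ in csv.reader(f))
    dropped = max(0, n_rows - max_rows)
    return n_rows, dropped
```

A check like this, run as part of the ingestion step, turns a silent undercount into a loud failure.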

The gut reaction is to dunk on Excel, but the whole infrastructure sounds off. Excel wasn’t meant to handle that many rows of data, and as a non-Excel person, I’m pretty sure it’s been that way forever.

Why are these records manually entered and transferred to a database? Why is the current solution to work off this single file that holds all of the data?

I bet the person (or people) tasked with entering new rows into the database aren’t tasked with thinking about the data. Who finally noticed, after a week, that no new records had come in?

Such important data. So many questions.

It’s not so much an Excel problem as it is a data problem, and what looked like a downward trend was actually going up.


FDA commissioner corrects his misinterpretation of reduced mortality

Talking about a possible plasma treatment for Covid-19, Food and Drug Administration Commissioner Stephen Hahn misinterpreted results from a Mayo Clinic study. The study notes a possible 35% reduction in mortality rate, which Hahn took to mean that if 100 people were sick with Covid-19, 35 lives would be saved.

For The Washington Post, Aaron Blake discusses why the interpretation is incorrect:

The vast majority of people who get the virus will recover with or without plasma. The 35 percent figure comes into play among those who die — a much smaller group. That would still be a huge development if borne out. But strictly speaking, the treatment would have saved about 3 out of 100 coronavirus patients, not 35. And given the smaller numbers we’re talking about, the finding is much closer to the margin of error — even as the preliminary study finds the effect to be statistically significant.

And even then, the claim doesn’t make sense. The data that he and Trump were referring to compared not a treatment group against a control group, but higher levels of plasma treatment against lower levels. The group receiving lower levels died at a rate of 11.9 per 100, while 8.7 percent of those receiving higher levels died.

Hahn later corrected himself.

See also Christopher Ingraham’s quick explanation of relative versus absolute risk. And this visual explainer from 2015 by NYT’s The Upshot should also be helpful in understanding the difference.
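The arithmetic makes the relative-versus-absolute distinction concrete. Working from the two mortality rates quoted above (11.9 and 8.7 deaths per 100), a quick sketch:

```python
# Mortality rates quoted in the article: deaths per 100 patients
# in the lower- and higher-level plasma groups.
low, high = 11.9, 8.7

# Absolute risk reduction: percentage points of all patients.
absolute_reduction = low - high          # about 3.2 per 100

# Relative risk reduction: fraction of the baseline risk.
relative_reduction = absolute_reduction / low  # about 27%

print(f"Absolute: {absolute_reduction:.1f} per 100 patients")
print(f"Relative: {relative_reduction:.0%} of baseline mortality")
```

From these particular numbers the relative reduction is roughly 27 percent, not 35 (the headline figure comes from the study's own analysis), but the point stands either way: a large relative reduction in a small baseline risk translates to about 3 lives per 100 patients, not 35.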


Algorithm leads to arrest of the wrong person

Even though there was supposedly a human in the decision-making process, and the man in the surveillance photo wasn’t actually Robert Julian-Borchak Williams, he still ended up handcuffed in front of his own home. Kashmir Hill reporting for The New York Times:

This is what technology providers and law enforcement always emphasize when defending facial recognition: It is only supposed to be a clue in the case, not a smoking gun. Before arresting Mr. Williams, investigators might have sought other evidence that he committed the theft, such as eyewitness testimony, location data from his phone or proof that he owned the clothing that the suspect was wearing.

In this case, however, according to the Detroit police report, investigators simply included Mr. Williams’s picture in a “6-pack photo lineup” they created and showed to Ms. Johnston, Shinola’s loss-prevention contractor, and she identified him. (Ms. Johnston declined to comment.)


Face depixelizer with machine learning, and some assumptions

In crime shows, they often have this amazing tool that turns a low-resolution, pixelated image of a person’s face to a high-resolution, highly accurate picture of the perp. Face Depixelizer is a step towards that with machine learning — except it seems to assume that everyone looks the same.

There might still be some limitations.


Bad bar chart

Welcome to Whose Bar Chart Is It Anyway, where the geometries are made up and the numbers don’t matter. [via @dannypage]


Bad denominator

With coronavirus testing, many governments have used the percentage of tests that came back positive over time to gauge progress and decide whether or not it’s time to reopen. To calculate the percentage, they divide confirmed cases by total tests. The denominator — total tests — often comes from the CDC, which apparently hasn’t done a good job calculating it, because not all tests measure the same thing.

Alexis C. Madrigal and Robinson Meyer for The Atlantic:

Mixing the two tests makes it much harder to understand the meaning of positive tests, and it clouds important information about the U.S. response to the pandemic, Jha said. “The viral testing is to understand how many people are getting infected, while antibody testing is like looking in the rearview mirror. The two tests are totally different signals,” he told us. By combining the two types of results, the CDC has made them both “uninterpretable,” he said.
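A toy calculation shows how the mixing dilutes the signal. The counts below are entirely hypothetical, invented only to illustrate the denominator problem:

```python
# Hypothetical daily test counts (not real CDC numbers).
viral_tests, viral_positive = 10_000, 900        # diagnostic tests for current infection
antibody_tests, antibody_positive = 5_000, 250   # serology tests for past infection

# Positivity for current infections should use viral tests only.
viral_rate = viral_positive / viral_tests

# Lumping both test types into one denominator changes the answer.
mixed_rate = (viral_positive + antibody_positive) / (viral_tests + antibody_tests)

print(f"Viral-only positivity: {viral_rate:.1%}")
print(f"Mixed positivity:      {mixed_rate:.1%}")
```

In this made-up example the mixed rate comes out lower than the viral-only rate, which is exactly the kind of flattering distortion that makes a reopening threshold look closer than it is.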



Poor comparison between two bar charts

A chart from Business Insider makes a poor attempt to compare the death rates, by age, for the common flu against Covid-19:

The age groups on the horizontal axes are different, so you can’t make a fair side-by-side comparison. For example, the flu chart has a 50-64 age group. The Covid-19 chart has a 50-59 group and a 60-69 group.
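Even if you tried to reconcile the bins, you couldn’t do it by eyeballing the bars, because combining two groups’ rates requires population weights. A sketch with made-up rates and populations (none of these numbers come from the chart):

```python
# Hypothetical death rates (%) and populations for the Covid-19 chart's bins.
rate_50_59, pop_50_59 = 1.3, 42_000_000
rate_60_69, pop_60_69 = 3.6, 38_000_000

# Population-weighted rate for a combined 50-69 group:
combined = (rate_50_59 * pop_50_59 + rate_60_69 * pop_60_69) / (pop_50_59 + pop_60_69)

# A naive average weights both groups equally and gives a different answer:
naive = (rate_50_59 + rate_60_69) / 2
```

And even a correctly weighted 50–69 rate still doesn’t line up with the flu chart’s 50–64 bin; without finer-grained data, the two charts simply can’t be compared bar for bar.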

Ann Coulter’s interpretation of the chart might be worse than the chart itself:


The values for people under 60, other than for the “under 30” group, are greater for Covid-19 than for the flu. Coulter’s interpretation is wrong no matter which way you cut it. Also, the article that the chart comes from points out the opposite.

I get it. It’s Twitter. There will be mistakes. But at least correct or delete them, instead of dangling them out there for people to spread.

For those making charts, please think about how others will interpret them. These are weird times and we don’t need to add more confusion. For those sharing charts, please think for a second before you put them out there.


Misinterpreted or misleading fire maps

With all of the maps of the fires in Australia, be sure to check out this piece by Georgina Rannard for BBC News on how some of the maps can easily be misinterpreted when seen out of context.


Study retracted after finding a mistaken recoding of the data

A study found that a hospital program significantly reduced the number of hospitalizations and emergency department visits. Great. But then the researchers realized that the data was recoded incorrectly, and the program actually increased hospitalizations and emergency department visits. Not so great.

They retracted their paper:

The identified programming error was in a file used for preparation of the analytic data sets for statistical analysis and occurred while the variable referring to the study “arm” (ie, group) assignment was recoded. The purpose of the recoding was to change the randomization assignment variable format of “1, 2” to a binary format of “0, 1.” However, the assignment was made incorrectly and resulted in a reversed coding of the study groups. Even though the data analyst created and conducted some test analysis programs, they were of the type that did not show any labeling of the arm categories, only the “arm” variable in a regression, for example.
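The bug class is easy to reproduce. A minimal sketch (hypothetical data, not the study’s) of the reversed recoding the retraction describes, plus the kind of label check that would have caught it:

```python
# Hypothetical randomization column using the study's "1, 2" arm coding.
raw_arms = [1, 2, 1, 2, 2]
labels = {1: "control", 2: "intervention"}

correct_map = {1: 0, 2: 1}       # intended recoding to binary "0, 1"
reversed_map = {1: 1, 2: 0}      # the kind of reversal described

recoded = [correct_map[a] for a in raw_arms]

# A test that only looks at the numeric "arm" variable can't see a
# reversal; checking codes against human-readable labels can.
for arm, code in zip(raw_arms, recoded):
    assert (code == 1) == (labels[arm] == "intervention")
```

The lesson the retraction itself points at: test the recoding against labeled categories, not just against a regression that treats the variable as an anonymous number.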

Here’s the original, now-retracted study. And here’s the revised one.

Data can be tricky and could lead to unintended consequences if you don’t handle it correctly. Be careful out there.


The ‘impeach this’ map has some issues

Philip Bump explains why the “impeach this” map is a bit dubious:

By now, this criticism of electoral maps is taught in elementary schools. Or, at least, it should be. Those red counties in Montana, North Dakota, South Dakota and Wyoming, for example, are home to 1.6 million 2016 voters — fewer than half of the number of voters in Los Angeles County. Trump won 1 million votes in those states, beating Hillary Clinton by a 580,000-vote margin. In Los Angeles, Clinton beat Trump by 1.7 million votes.
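A back-of-envelope check of the quoted figures makes the area-versus-votes point plain. Using only the numbers Bump cites:

```python
# Figures quoted above (approximate, from the 2016 election).
four_states_trump_margin = 580_000   # Trump's net margin across MT, ND, SD, WY
la_county_clinton_margin = 1_700_000 # Clinton's net margin in LA County alone

# One blue county outweighs four wall-to-wall red states at the ballot box.
net = la_county_clinton_margin - four_states_trump_margin
print(f"LA County's margin exceeds the four states' combined by {net:,} votes")
```

Acres don’t vote; the map’s sea of red encodes land, not people.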

As Alberto Cairo already went into, it’s not so much that the map itself is incorrect. It’s a binary map: it shows which counties voted more for one candidate than the other. The problem is the context in which the map is used. It’s the visualization equivalent of pulling a quote out of context and letting people see what they want to see.