Simpson’s Paradox in vaccination data

This chart, made by someone who is against vaccinations, shows a higher mortality rate for those who are vaccinated versus those who are not. Strange. It shows real data from the Office for National Statistics in the UK. As explained by Stuart McDonald, Simpson’s Paradox is at play:

[W]ithin the 10-59 age band, the average unvaccinated person is much younger than the average vaccinated person, and therefore has a lower death rate. Any benefit from the vaccines is swamped by the increase in all-cause mortality rates with age.

If you’re unfamiliar, Simpson’s Paradox is when a trend appears in separate groups but then disappears or reverses when you combine the groups. In this case, the confounding factors of age and vaccine uptake make the above chart useless.
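Here is a minimal sketch of how the reversal can happen. The counts are hypothetical, picked only to make the effect visible, and are not the ONS figures. The 10-39 and 40-59 sub-bands and the function names are also my own assumptions for illustration.

```python
# Hypothetical counts, chosen only to illustrate the effect. NOT ONS data.
# (age sub-band, vaccination status) -> (people, deaths)
groups = {
    ("10-39", "unvaccinated"): (900_000, 90),
    ("10-39", "vaccinated"): (200_000, 10),
    ("40-59", "unvaccinated"): (100_000, 200),
    ("40-59", "vaccinated"): (800_000, 800),
}


def rate_per_100k(people, deaths):
    """Deaths per 100,000 people."""
    return deaths * 100_000 / people


def subgroup_rates(groups):
    """Death rate within each (age sub-band, status) cell."""
    return {key: rate_per_100k(p, d) for key, (p, d) in groups.items()}


def combined_rates(groups):
    """Death rate by status after pooling the age sub-bands together."""
    totals = {}
    for (_, status), (people, deaths) in groups.items():
        pooled_people, pooled_deaths = totals.get(status, (0, 0))
        totals[status] = (pooled_people + people, pooled_deaths + deaths)
    return {s: rate_per_100k(p, d) for s, (p, d) in totals.items()}


for key, rate in subgroup_rates(groups).items():
    print(key, rate)
print(combined_rates(groups))
```

In these made-up numbers, the vaccinated rate is half the unvaccinated rate inside each sub-band, yet the pooled vaccinated rate comes out higher, because most of the vaccinated people sit in the older, higher-mortality sub-band.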


Scientists with bad data

Tim Harford warns against bad data in science:

Some frauds seem comical. In the 1970s, a researcher named William Summerlin claimed to have found a way to prevent skin grafts from being rejected by the recipient. He demonstrated his results by showing a white mouse with a dark patch of fur, apparently a graft from a black mouse. It transpired that the dark patch had been coloured with a felt-tip pen. Yet academic fraud is no joke.


‘Less than 10 percent’ outdoors

The CDC said that “less than 10 percent” of coronavirus cases were from outdoor transmission. David Leonhardt for The New York Times explains why, in all likelihood, that number is way too high and leads to public confusion:

If you read the academic research that the C.D.C. has cited in defense of the 10 percent benchmark, you will notice something strange. A very large share of supposed cases of outdoor transmission have occurred in a single setting: construction sites in Singapore.

In one study, 95 of 10,926 worldwide instances of transmission are classified as outdoors; all 95 are from Singapore construction sites. In another study, four of 103 instances are classified as outdoors; again, all four are from Singapore construction sites.

This obviously doesn’t make much sense. It instead appears to be a misunderstanding that resembles the childhood game of telephone, in which a message gets garbled as it passes from one person to the next.
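For a sense of scale, here is a quick sketch that turns the two study counts quoted above into percentages. The dictionary labels and the function name are mine, not the studies’.

```python
# (outdoor instances, total instances), as quoted above from the two studies
studies = {
    "first study": (95, 10_926),
    "second study": (4, 103),
}


def outdoor_share_percent(outdoor, total):
    """Share of transmission instances classified as outdoors, in percent."""
    return outdoor / total * 100


for name, (outdoor, total) in studies.items():
    print(f"{name}: {outdoor_share_percent(outdoor, total):.2f}% outdoors")
# first study: 0.87% outdoors
# second study: 3.88% outdoors
```

Both shares sit well under 10 percent, so the benchmark is technically true while suggesting a much bigger number than the counts support.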


Excel spreadsheet limit leads to 16,000 Covid-19 cases left off daily count

Microsoft Excel is useful for many things, but it has its limitations (like all software), which led to an undercount of 15,841 Covid-19 positive tests recorded by Public Health England. For the Guardian, Alex Hern reports:

In this case, the Guardian understands, one lab had sent its daily test report to PHE in the form of a CSV file – the simplest possible database format, just a list of values separated by commas. That report was then loaded into Microsoft Excel, and the new tests at the bottom were added to the main database.

But while CSV files can be any size, Microsoft Excel files can only be 1,048,576 rows long – or, in older versions which PHE may have still been using, a mere 65,536. When a CSV file longer than that is opened, the bottom rows get cut off and are no longer displayed. That means that, once the lab had performed more than a million tests, it was only a matter of time before its reports failed to be read by PHE.

The gut reaction seems to be to dunk on Excel, but the whole infrastructure sounds off. Excel wasn’t meant to handle that many rows of data, and as a non-Excel person, I think it’s been like that forever.

Why are these records manually entered and transferred to a database? Why is the current solution to work off this single file that holds all of the data?

I bet the person (or people) tasked with entering new rows into the database aren’t tasked with thinking about the data. Who eventually noticed no new records were recorded after a week?

Such important data. So many questions.

It’s not so much an Excel problem as it is a data problem, and what looked like a downward trend was actually going up.
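A check along these lines might have flagged the problem early. This is only a sketch: the row limits are the ones quoted in the Guardian report, while the function names and the tiny sample report are hypothetical and say nothing about how PHE’s pipeline actually works.

```python
import csv
import io

XLSX_MAX_ROWS = 1_048_576  # limit quoted in the Guardian report
XLS_MAX_ROWS = 65_536      # limit for older versions, per the same report


def count_csv_rows(handle):
    """Count the records in an open CSV file, header row included."""
    return sum(1 for _ in csv.reader(handle))


def rows_lost_in_spreadsheet(row_count, max_rows=XLSX_MAX_ROWS):
    """How many rows would fall past the spreadsheet's row limit."""
    return max(0, row_count - max_rows)


# Tiny in-memory stand-in for a lab report (hypothetical columns)
sample = io.StringIO("test_id,result\n1,positive\n2,negative\n")
n_rows = count_csv_rows(sample)
print(n_rows, rows_lost_in_spreadsheet(n_rows))         # 3 0
print(rows_lost_in_spreadsheet(70_000, XLS_MAX_ROWS))   # 4464
```

Comparing the row count of the source file against what ended up in the destination is a small check, and it doesn’t depend on anyone noticing a suspicious trend.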


FDA commissioner corrects his misinterpretation of reduced mortality

Talking about a possible plasma treatment for Covid-19, the Food and Drug Administration Commissioner Stephen Hahn misinterpreted results from the study. The study from the Mayo Clinic notes a possible 35% reduction in mortality rate, and Hahn said that if 100 people were sick with Covid-19, 35 lives would be saved.

For The Washington Post, Aaron Blake discusses why the interpretation is incorrect:

The vast majority of people who get the virus will recover with or without plasma. The 35 percent figure comes into play among those who die — a much smaller group. That would still be a huge development if borne out. But strictly speaking, the treatment would have saved about 3 out of 100 coronavirus patients, not 35. And given the smaller numbers we’re talking about, the finding is much closer to the margin of error — even as the preliminary study finds the effect to be statistically significant.

And even then, the claim doesn’t make sense. The data that he and Trump were referring to compared those receiving plasma treatments not to a control group, but between higher and lower levels of plasma treatments. The group with lower levels died at a rate of 11.9 people out of 100, while 8.7 percent died with higher levels.

Hahn later corrected himself.

See also Christopher Ingraham’s quick explanation of relative versus absolute risk. And this visual explainer from 2015 by NYT’s The Upshot should also be helpful in understanding the difference.
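To make the relative-versus-absolute distinction concrete, here is a small sketch using the two death rates quoted above (11.9 and 8.7 per 100). The function names are mine. Note that these two numbers alone don’t reproduce the 35% figure, and I make no claim here about how the study arrived at it.

```python
def absolute_reduction(baseline_rate, treated_rate):
    """Difference in death rates: the share of all patients who are saved."""
    return baseline_rate - treated_rate


def relative_reduction(baseline_rate, treated_rate):
    """Reduction expressed as a fraction of the baseline death rate."""
    return (baseline_rate - treated_rate) / baseline_rate


low_dose = 11.9 / 100   # death rate with lower plasma levels, as quoted
high_dose = 8.7 / 100   # death rate with higher plasma levels, as quoted

arr = absolute_reduction(low_dose, high_dose)
rrr = relative_reduction(low_dose, high_dose)

print(f"Absolute: {arr * 100:.1f} fewer deaths per 100 patients")  # 3.2
print(f"Relative: {rrr:.0%} lower death rate")                     # 27%
```

Saying “35 lives saved out of 100” reads a relative reduction as if it were an absolute one. With rates like these, the absolute difference is about 3 per 100, which matches Blake’s point.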


Algorithm leads to arrest of the wrong person

Even though there was supposedly a person in the decision-making process and the surveillance photo wasn’t actually of Robert Julian-Borchak Williams, he still ended up handcuffed in front of his own home. Kashmir Hill reporting for The New York Times:

This is what technology providers and law enforcement always emphasize when defending facial recognition: It is only supposed to be a clue in the case, not a smoking gun. Before arresting Mr. Williams, investigators might have sought other evidence that he committed the theft, such as eyewitness testimony, location data from his phone or proof that he owned the clothing that the suspect was wearing.

In this case, however, according to the Detroit police report, investigators simply included Mr. Williams’s picture in a “6-pack photo lineup” they created and showed to Ms. Johnston, Shinola’s loss-prevention contractor, and she identified him. (Ms. Johnston declined to comment.)


Face depixelizer with machine learning, and some assumptions

In crime shows, they often have this amazing tool that turns a low-resolution, pixelated image of a person’s face into a high-resolution, highly accurate picture of the perp. Face Depixelizer is a step towards that with machine learning, except it seems to assume that everyone looks the same.

There might still be some limitations.


Bad bar chart

Welcome to whose bar chart is it anyway: where the geometries are made up and the numbers don’t matter. [via @dannypage]


Bad denominator

With coronavirus testing, many governments have used the percentage of tests that came back positive over time to gauge progress and decide whether or not it’s time to reopen. To calculate the percentage, they divide confirmed cases by total tests. The denominator, total tests, often comes from the CDC, which apparently hasn’t done a good job calculating it, because not all tests are the same.

Alexis C. Madrigal and Robinson Meyer for The Atlantic:

Mixing the two tests makes it much harder to understand the meaning of positive tests, and it clouds important information about the U.S. response to the pandemic, Jha said. “The viral testing is to understand how many people are getting infected, while antibody testing is like looking in the rearview mirror. The two tests are totally different signals,” he told us. By combining the two types of results, the CDC has made them both “uninterpretable,” he said.
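Here is a rough sketch of why the pooled number is hard to read. All of the counts are hypothetical, made up for illustration, and the function name is mine.

```python
def positivity(positives, tests):
    """Fraction of tests that came back positive."""
    return positives / tests


# Hypothetical counts, NOT CDC data: (positives, tests)
viral = (1_000, 10_000)    # tests for current infection
antibody = (100, 5_000)    # tests for past infection

# Pooling both test types into one numerator and one denominator
mixed = (viral[0] + antibody[0], viral[1] + antibody[1])

print(f"Viral only:    {positivity(*viral):.1%}")     # 10.0%
print(f"Antibody only: {positivity(*antibody):.1%}")  # 2.0%
print(f"Mixed:         {positivity(*mixed):.1%}")     # 7.3%
```

The mixed rate matches neither signal, and it moves whenever the balance between the two test types changes, even if nothing about the spread of the virus does.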



Poor comparison between two bar charts

A chart from Business Insider makes a poor attempt to compare the death rates, by age, for the common flu against Covid-19:

The age groups on the horizontal axes are different, so you can’t make a fair side-by-side comparison. For example, the flu chart has a 50-64 age group. The Covid-19 chart has a 50-59 group and a 60-69 group.

Ann Coulter’s interpretation of the chart might be worse than the chart itself:


The values for people under 60, other than for the “under 30” group, are greater for Covid-19 than for the flu. Coulter’s interpretation is wrong no matter which way you cut it. Also, the article that the chart comes from points out the opposite.

I get it. It’s Twitter. There will be mistakes. But at least correct or delete them, instead of leaving them dangling out there for people to spread.

For those making charts, please think about how others will interpret them. These are weird times and we don’t need to add more confusion. For those sharing charts, please think for a second before you put them out there.
