Statistical limits

Reviewing Deborah Stone’s Counting and Tim Harford’s The Data Detective, Hannah Fry discusses the usefulness of data and its limitations for The New Yorker:

Numbers are a poor substitute for the richness and color of the real world. It might seem odd that a professional mathematician (like me) or economist (like Harford) would work to convince you of this fact. But to recognize the limitations of a data-driven view of reality is not to downplay its might. It’s possible for two things to be true: for numbers to come up short before the nuances of reality, while also being the most powerful instrument we have when it comes to understanding that reality.

This builds on Fry’s similarly themed article from a couple of years ago, as well as her book Hello World.

Data is limited, and the better we understand those limitations, the better use we can get out of what’s there.


Excel spreadsheet limit leads to 16,000 Covid-19 cases left off daily count

Microsoft Excel is useful for many things, but like all software, it has its limitations. One of them led Public Health England to undercount positive Covid-19 tests by 15,841. For the Guardian, Alex Hern reports:

In this case, the Guardian understands, one lab had sent its daily test report to PHE in the form of a CSV file – the simplest possible database format, just a list of values separated by commas. That report was then loaded into Microsoft Excel, and the new tests at the bottom were added to the main database.

But while CSV files can be any size, Microsoft Excel files can only be 1,048,576 rows long – or, in older versions which PHE may have still been using, a mere 65,536. When a CSV file longer than that is opened, the bottom rows get cut off and are no longer displayed. That means that, once the lab had performed more than a million tests, it was only a matter of time before its reports failed to be read by PHE.

The gut reaction seems to be to dunk on Excel, but the whole infrastructure sounds off. Excel wasn’t meant to handle that many rows of data, and as a non-Excel person, I think it’s been like that forever.
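For a sense of how cheap a safeguard could be, here is a minimal Python sketch that counts the rows in a CSV and reports how many would fall past a worksheet's row limit. The two limits come from the Guardian quote above. The function names and the sample data are hypothetical, made up for illustration, and say nothing about how PHE's actual pipeline works.

```python
import csv
import io

# Worksheet row limits cited in the Guardian report.
XLSX_MAX_ROWS = 1_048_576  # current Excel format
XLS_MAX_ROWS = 65_536      # older Excel format


def count_csv_rows(handle):
    """Count records in an open CSV file, including any header row."""
    return sum(1 for _ in csv.reader(handle))


def rows_lost_in_excel(row_count, max_rows=XLSX_MAX_ROWS):
    """Return how many rows would fall past the worksheet row limit."""
    return max(0, row_count - max_rows)


# Hypothetical report: one header line plus 70,000 made-up test records.
sample = io.StringIO(
    "id,result\n" + "".join(f"{i},positive\n" for i in range(70_000))
)
total = count_csv_rows(sample)

print(total)                                    # 70001
print(rows_lost_in_excel(total, XLS_MAX_ROWS))  # 4465
print(rows_lost_in_excel(total))                # 0
```

A check like this, run before anything gets opened in a spreadsheet, would flag a file that is too long instead of silently dropping the bottom rows.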

Why are these records manually entered and transferred to a database? Why is the current solution to work off this single file that holds all of the data?

I bet the person (or people) tasked with entering new rows into the database aren’t tasked with thinking about the data. Who eventually noticed no new records were recorded after a week?

Such important data. So many questions.

It’s not so much an Excel problem as it is a data problem, and what looked like a downward trend was actually going up.
