Scraping public data ruled legal

For TechCrunch, Zack Whittaker reporting:

In its second ruling on Monday, the Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act, or CFAA, which governs what constitutes computer hacking under U.S. law.

The Ninth Circuit’s decision is a major win for archivists, academics, researchers and journalists who use tools to mass collect, or scrape, information that is publicly accessible on the internet. Without a ruling in place, long-running projects to archive websites no longer online and using publicly accessible data for academic and research studies have been left in legal limbo.

Spatula, a Python library for maintainable web scraping

This looks promising:

While it is often easy, and tempting, to write a scraper as a dirty one-off script, spatula makes an attempt to provide an easy framework that most scrapers fit within without additional overhead.

This reflects the reality that many scraper projects start small but grow quickly, so reaching for a heavyweight tool from the start often does not seem practical.

The initial overhead imposed by the framework should be as light as possible, providing benefits even for authors that do not wish to use every feature available to them.
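
Here’s a minimal sketch of what a scraper in that style might look like, based on spatula’s list-page pattern; the URL and CSS selector are hypothetical stand-ins:

```python
from spatula import CSS, HtmlListPage


class EmployeeList(HtmlListPage):
    # Hypothetical page holding the data; swap in a real URL.
    source = "https://example.com/staff"
    # Every element matched here becomes one call to process_item().
    selector = CSS("table#employees tbody tr")

    def process_item(self, item):
        # item is an lxml element for a single table row.
        first, last, position = list(item)
        return dict(
            first=first.text_content(),
            last=last.text_content(),
            position=position.text_content(),
        )
```

If that class lives in employees.py, running something like `spatula scrape employees.EmployeeList` should fetch the page and emit one record per row; check the spatula docs for the exact invocation.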

Although, without my dirty one-off scripts, what will I put in my tmp data folder?

Practical tips for scraping data

It’s an unpleasant feeling when you have an idea for a project and the data you need is sitting right in front of you on a bunch of random-looking webpages instead of in a nice, delimited file. You can either forget about your idea (which is what most people do), record the data manually, or take the automated route with a bit of scraping know-how.

I often find myself taking the tedious, manual way out, but sometimes scraping is clearly the better option. David Eads from the NPR Visuals Team describes how they use a model-control approach to scraping data.

Step 1: Find the data and figure out the HTML and/or JavaScript format and pattern. Step 2: Set up a way to parse and spit out the formatted data. Step 3: Optimize.
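
As a rough sketch of steps 1 and 2 using requests and BeautifulSoup (the URL, table id, and two-column layout here are hypothetical):

```python
import csv

import requests
from bs4 import BeautifulSoup

# Step 1: fetch the page whose HTML pattern you worked out in the browser.
response = requests.get("https://example.com/stats", timeout=30)
response.raise_for_status()

# Step 2: parse the repeating structure and spit out delimited data.
soup = BeautifulSoup(response.text, "html.parser")
rows = soup.select("table#stats tbody tr")

with open("stats.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "value"])
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        writer.writerow(cells[:2])
```

Step 3 then tends to mean caching responses, rate limiting, and handling the rows that don’t match the pattern.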

Oh, and before all that, make sure it’s legal.
