News organizations blocking OpenAI

Ben Welsh has a running list of the news organizations blocking OpenAI crawlers:

In total, 532 of 1,147 news publishers surveyed by the homepages.news archive have instructed OpenAI, Google AI or the non-profit Common Crawl to stop scanning their sites, which amounts to 46.4% of the sample.

The three organizations systematically crawl web sites to gather the information that fuels generative chatbots like OpenAI’s ChatGPT and Google’s Bard. Publishers can request that their content be excluded by opting out via the robots.txt convention.
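For example, a publisher that wanted to opt out of all three crawlers might add rules like these to its robots.txt. This is only a sketch: the user-agent tokens are the ones the crawlers publish (GPTBot for OpenAI, Google-Extended for Google's AI training, CCBot for Common Crawl), and each site's actual rules vary.

```
# Hypothetical robots.txt entries opting out of AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```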

On the web, it used to be that you would write or make something and there would be a link to the thing. Other websites could link to the thing, and people would go to the place with the thing. With this recent AI wave, a lot of the thing ends up elsewhere and no one sees the original place.

Fun times ahead.

Scraping data without programming

Maybe you’ve wished you could quickly grab the data on a webpage and instantly have it in a structured format. But you don’t know how to program, or you do and just don’t want to go through the trouble of writing another one-off script. Samantha Sunne provides a short guide for scraping without programming, mainly with Google Sheets.
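For instance, Google Sheets’ built-in IMPORTHTML and IMPORTXML functions can pull a table, list, or set of elements off a page with a single formula. The URL here is just a placeholder:

```
=IMPORTHTML("https://example.com/rankings", "table", 1)
=IMPORTXML("https://example.com/rankings", "//h2")
```

The first formula grabs the first HTML table on the page; the second pulls every h2 heading via an XPath query. Paste one into a cell and the sheet fills itself in.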

Scraping public data ruled legal

For TechCrunch, Zack Whittaker reporting:

In its second ruling on Monday, the Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act, or CFAA, which governs what constitutes computer hacking under U.S. law.

The Ninth Circuit’s decision is a major win for archivists, academics, researchers and journalists who use tools to mass collect, or scrape, information that is publicly accessible on the internet. Without a ruling in place, long-running projects to archive websites no longer online and using publicly accessible data for academic and research studies have been left in legal limbo.

Spatula, a Python library for maintainable web scraping

This looks promising:

While it is often easy, and tempting, to write a scraper as a dirty one-off script, spatula makes an attempt to provide an easy framework that most scrapers fit within without additional overhead.

This reflects the reality that many scraper projects start small but grow quickly, so reaching for a heavyweight tool from the start often does not seem practical.

The initial overhead imposed by the framework should be as light as possible, providing benefits even for authors that do not wish to use every feature available to them.

Although, without my dirty one-off scripts, what will I put in my tmp data folder?

Practical tips for scraping data

It’s an unpleasant feeling when you have an idea for a project and the data you need is sitting right in front of you on a bunch of random-looking webpages instead of a nice, delimited file. You can forget about your idea (which is what most people do), record the data manually, or take an automated route with a bit of scraping know-how.

I often find myself taking the tedious, manual route out, but sometimes scraping is clearly the best option. David Eads from the NPR Visuals Team describes how they use a model-control approach to scraping data.

Step 1: Find the data and figure out the HTML and/or JavaScript format and pattern.
Step 2: Set up a way to parse and spit out the formatted data.
Step 3: Optimize.
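This isn't NPR's actual code, but here's a minimal Python sketch of what the first two steps tend to look like in practice, using requests and BeautifulSoup with a placeholder URL and CSS selector:

```python
# Step 1: fetch the page and target the repeating HTML pattern.
# Step 2: parse that pattern and spit out delimited data.
# The URL and selector below are placeholders, not a real target.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/reports"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []
for tr in soup.select("table.results tr"):  # the pattern found in step 1
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# Write the structured output as a delimited file.
with open("results.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```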

Oh, and before all that, make sure it’s legal.
