News organizations blocking OpenAI

Ben Welsh has a running list of the news organizations blocking OpenAI crawlers:

In total, 532 of 1,147 news publishers surveyed by the homepages.news archive have instructed OpenAI, Google AI or the non-profit Common Crawl to stop scanning their sites, which amounts to 46.4% of the sample.

The three organizations systematically crawl web sites to gather the information that fuels generative chatbots like OpenAI’s ChatGPT and Google’s Bard. Publishers can request that their content be excluded by opting out via the robots.txt convention.
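For example, a publisher that wanted to opt out of all three crawlers might add rules like these to its robots.txt. This is only a sketch: the user-agent tokens are the ones the crawlers publish (GPTBot for OpenAI, Google-Extended for Google's AI training, CCBot for Common Crawl), and each site's actual rules vary.

```
# Hypothetical robots.txt entries opting out of AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```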

On the web, it used to be that you would write or make something and there would be a link to the thing. Other websites could link to the thing, and people would go to the place with the thing. With this recent AI wave, a lot of the thing ends up elsewhere and no one sees the original place.

Fun times ahead.

Scraping data without programming

Maybe you’ve wished you could quickly grab the data on a webpage and instantly have it in a structured format. But you don’t know how to program, or you do and just don’t want to go through the trouble of writing another one-off script. Samantha Sunne provides a short guide for scraping without programming, mainly with Google Sheets.
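For instance, Google Sheets’ built-in IMPORTHTML and IMPORTXML functions can pull a table, list, or set of elements off a page with a single formula. The URL here is just a placeholder:

```
=IMPORTHTML("https://example.com/rankings", "table", 1)
=IMPORTXML("https://example.com/rankings", "//h2")
```

The first formula grabs the first HTML table on the page; the second pulls every h2 heading via an XPath query. Paste one into a cell and the sheet fills itself in.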

Scraping public data ruled legal

For TechCrunch, Zack Whittaker reporting:

In its second ruling on Monday, the Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act, or CFAA, which governs what constitutes computer hacking under U.S. law.

The Ninth Circuit’s decision is a major win for archivists, academics, researchers and journalists who use tools to mass collect, or scrape, information that is publicly accessible on the internet. Without a ruling in place, long-running projects to archive websites no longer online and using publicly accessible data for academic and research studies have been left in legal limbo.

Spatula, a Python library for maintainable web scraping

This looks promising:

While it is often easy, and tempting, to write a scraper as a dirty one-off script, spatula makes an attempt to provide an easy framework that most scrapers fit within without additional overhead.

This reflects the reality that many scraper projects start small but grow quickly, so reaching for a heavyweight tool from the start often does not seem practical.

The initial overhead imposed by the framework should be as light as possible, providing benefits even for authors that do not wish to use every feature available to them.

Although, without my dirty one-off scripts, what will I put in my tmp data folder?

Practical tips for scraping data

It’s an unpleasant feeling when you have an idea for a project and the data you need is sitting right in front of you on a bunch of random-looking webpages instead of a nice, delimited file. You can forget about your idea (which is what most people do), record the data manually, or take an automated route with a bit of scraping know-how.

I often find myself taking the tedious, manual route out, but sometimes scraping is clearly the best option. David Eads from the NPR Visuals Team describes how they use a model-control approach to scraping data.

Step 1: Find the data and figure out the HTML and/or JavaScript format and pattern.
Step 2: Set up a way to parse and spit out the formatted data.
Step 3: Optimize.
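This isn't NPR's actual code, but here's a minimal Python sketch of what the first two steps tend to look like in practice, using requests and BeautifulSoup with a placeholder URL and CSS selector:

```python
# Step 1: fetch the page and target the repeating HTML pattern.
# Step 2: parse that pattern and spit out delimited data.
# The URL and selector below are placeholders, not a real target.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/reports"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []
for tr in soup.select("table.results tr"):  # the pattern found in step 1
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# Write the structured output as a delimited file.
with open("results.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```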

Oh, and before all that, make sure it’s legal.
