Analysis of compound curse words used on Reddit

As you know, Reddit is typically a sophisticated place of kind and pleasant conversation. So Colin Morris analyzed the usage of compound pejoratives in Reddit comments:

The full “matrix” of combinations is surprisingly dense. Of the ~4,800 possible compounds, more than half occurred in at least one comment. The most frequent compound, dumbass, appears in 3.6 million comments, but there’s also a long tail of many rare terms, including 444 hapax legomena (terms which appear only once in the dataset), such as pukebird, fartrag, sleazenozzle, and bastardbucket.

Stay classy.
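For the curious, here's a rough sketch of how you might compute that matrix density and find the hapaxes, given a frequency table of compounds. The word lists and counts below are tiny placeholders, not Morris's actual data:

```python
from collections import Counter

# Placeholder prefix/suffix lists and counts; Morris's real lists
# cover far more terms (~4,800 possible compounds).
prefixes = ["dumb", "puke", "fart", "sleaze", "bastard"]
suffixes = ["ass", "bird", "rag", "nozzle", "bucket"]
counts = Counter({"dumbass": 3_600_000, "pukebird": 1, "fartrag": 1,
                  "sleazenozzle": 1, "bastardbucket": 1})

# Density of the prefix x suffix matrix: share of possible
# compounds that occur in at least one comment.
possible = [p + s for p in prefixes for s in suffixes]
attested = [c for c in possible if counts[c] > 0]
print(f"density: {len(attested)}/{len(possible)}")

# Hapax legomena: compounds that appear exactly once.
hapaxes = sorted(term for term, n in counts.items() if n == 1)
print("hapaxes:", hapaxes)
```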


AI says if you’re the a**hole

There’s a subreddit where people share a story and ask if they’re the asshole. WTTDOTM and Alex Petros trained AI models on those responses, so you can enter your own story and see what each model says:

AYTA responses are auto-generated and based on different datasets. The red model has only been trained on YTA responses and will always say you are at fault. The green model has only been trained on NTA responses and will always absolve you. And the white model was trained on the pre-filtered data. Have fun!

Unfortunately you only get three responses from your input, one from each model. It would’ve been fun if the AI tried to make a final call.
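For a sense of how three models end up with three personalities, here's a minimal sketch of the data split the site describes: one pool of only YTA responses, one of only NTA, and one unfiltered. The file format and field names are assumptions, not the actual AYTA pipeline:

```python
import json

def split_by_verdict(path):
    """Split a JSON-lines file of responses into three training pools."""
    red, green, white = [], [], []
    with open(path) as f:
        for line in f:
            resp = json.loads(line)        # assumed: {"text": ..., "verdict": "YTA"}
            white.append(resp["text"])     # unfiltered pool -> the "white" model
            if resp["verdict"] == "YTA":
                red.append(resp["text"])   # always-at-fault "red" model's data
            elif resp["verdict"] == "NTA":
                green.append(resp["text"]) # always-absolving "green" model's data
    return red, green, white
```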


Wanna know somefing?

From Reddit user wequiock_falls, “What I’m about to learn about after my kid says, ‘Wanna know somefing?’ Data collected over the course of 7 days.”

Sounds about right.


Celebrity name spelling test

Colin Morris culled common misspellings on Reddit and made the data available on GitHub. For The Pudding, Russell Goldenberg and Matt Daniels took it a step further so that you too can see how bad you are at spelling celebrity names.


Looking for common misspellings

Some words are harder to spell than others, and on the internet, people sometimes flag their uncertainty by following a word with “(sp?)”. Colin Morris collected all the words in Reddit threads marked with this uncertainty. Download the data on GitHub.
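For a sense of the mechanics, here's a quick sketch of the kind of pattern matching involved. The regex is a simplified guess, not necessarily Morris's actual method:

```python
import re

# Grab the word immediately preceding an "(sp?)" marker.
SP_PATTERN = re.compile(r"\b([\w'-]+)\s*\(sp\?\)", re.IGNORECASE)

comment = "I love prosciutto (sp?) but can never spell it."
print(SP_PATTERN.findall(comment))  # ['prosciutto']
```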


A story of humanity in the pixels of a Reddit April Fool’s experiment

On April Fool’s Day, Reddit launched a blank canvas where users could each add one colored pixel every few minutes. It ran for 72 hours, and the evolution of the space as a whole was awesome.

What if you look more closely at the individual images, edits, and battles for territory? Even more interesting. sudoscript breaks participants into three groups — the creators, the protectors, and the destroyers — who fight for the ideal Place. In the process, among the Dickbutt variations, penis jokes, and Pokémon characters, it’s a story of humanity. [via Moritz]


Time-lapse of community-edited pixels

For April Fool’s Day, Reddit ran a subreddit, r/place, that let users edit pixels in a 1,000 by 1,000 blank space for 72 hours. Users could only edit one pixel every ten minutes, which forced patience and cooperation. This is the time-lapse of the effort.

Kind of great. It’s fun to watch the edits of thousands converge. It’s a complete hodgepodge, but somehow it all fits together in the relatively small space.

See also the edit heatmap by Reddit user JorgeGT that shows the number of edits per pixel.
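If you want to roll your own heatmap, the core computation is just counting edits per pixel. Here's a minimal sketch with made-up coordinates standing in for the real placement logs:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder edit log: one (x, y) pair per placed pixel. The real
# r/place logs also record timestamp and color for each placement.
rng = np.random.default_rng(0)
edits = rng.integers(0, 1000, size=(100_000, 2))

heat = np.zeros((1000, 1000), dtype=np.int64)
np.add.at(heat, (edits[:, 1], edits[:, 0]), 1)  # count edits per pixel

plt.imshow(np.log1p(heat), cmap="inferno")  # log scale tames hot spots
plt.title("Edits per pixel (log scale)")
plt.show()
```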


Subreddit math with r/The_Donald helps show topic breakdowns

Trevor Martin for FiveThirtyEight used latent semantic analysis to do math with subreddits, specifically r/The_Donald.

We’ve adapted a technique that’s used in machine learning research — called latent semantic analysis — to characterize 50,323 active subreddits based on 1.4 billion comments posted from Jan. 1, 2015, to Dec. 31, 2016, in a way that allows us to quantify how similar in essence one subreddit is to another. At its heart, the analysis is based on commenter overlap: Two subreddits are deemed more similar if many commenters have posted often to both. This also makes it possible to do what we call “subreddit algebra”: adding one subreddit to another and seeing if the result resembles some third subreddit, or subtracting out a component of one subreddit’s character and seeing what’s left.

Hm.
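The arithmetic itself is easy to picture once each subreddit is a vector. Here's a toy sketch of the idea, with invented commenter counts standing in for the real 1.4 billion comments and the LSA dimensionality reduction:

```python
import numpy as np

# Each subreddit as a vector of per-commenter activity; the names and
# counts here are invented for illustration.
subs = {
    "r/politics":   np.array([5.0, 0.0, 2.0, 1.0]),
    "r/conspiracy": np.array([1.0, 4.0, 0.0, 3.0]),
    "r/The_Donald": np.array([4.0, 3.0, 1.0, 3.0]),
}
unit = {name: v / np.linalg.norm(v) for name, v in subs.items()}

# "Subreddit algebra": add two subreddits and renormalize.
combo = unit["r/politics"] + unit["r/conspiracy"]
combo /= np.linalg.norm(combo)

# Cosine similarity: does the sum resemble some third subreddit?
print(float(combo @ unit["r/The_Donald"]))
```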


I’m doing a Reddit AMA

I'm doing a Reddit AMA tomorrow hosted by the DataIsBeautiful subreddit. It'll be at 1:30pm EST on August 27, 2015.

In case you're unfamiliar with the AMA (ask me anything), it's just a fun Q&A thing, where you ask me questions on Reddit, and I pause to think of something good to say. I might type some answers.

Ask me about visualization, data, blogging, graduate school, my hate of commuting, my kid's poop habits, beer, or whatever else. I'm game.


Download data for 1.7 billion Reddit comments

There's been all sorts of weird stuff going on at Reddit lately, but who's got time for that when you can download about 1.7 billion comments left on Reddit from October 2007 through May 2015?

This is an archive of Reddit comments from October of 2007 until May of 2015 (complete month). This reflects 14 months of work and a lot of API calls. This dataset includes nearly every publicly available Reddit comment. Approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues.

Timestamps, comment IDs, controversiality scores, and of course the comment text. It's 5 gigabytes compressed and available over torrent.

Git er done.
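If you do grab it, the monthly files are bz2-compressed JSON, one comment per line. Here's a quick sketch for peeking at one; the filename assumes the archive's RC_YYYY-MM naming, and it's worth double-checking the field names against your copy:

```python
import bz2
import json

# Stream a month's worth of comments straight from the compressed file.
with bz2.open("RC_2015-05.bz2", "rt") as f:
    for i, line in enumerate(f):
        c = json.loads(line)
        # Assumed fields: created_utc, id, controversiality, body.
        print(c["created_utc"], c["id"], c["controversiality"], c["body"][:60])
        if i == 4:  # just peek at the first few comments
            break
```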
