Guide for working with machine learning datasets

As part of the Knowing Machines research project, A Critical Field Guide for Working with Machine Learning Datasets, by Sarah Ciston, offers advice for working through the life cycle of complex and large datasets:

Machine learning datasets are powerful but unwieldy. They are often far too large to check all the data manually, to look for inaccurate labels, dehumanizing images, or other widespread issues. Despite the fact that datasets commonly contain problematic material — whether from a technical, legal, or ethical perspective — datasets are also valuable resources when handled carefully and critically. This guide offers questions, suggestions, strategies, and resources to help people work with existing machine learning datasets at every phase of their lifecycle. Equipped with this understanding, researchers and developers will be more capable of avoiding the problems unique to datasets. They will also be able to construct more reliable, robust solutions, or even explore promising new ways of thinking with machine learning datasets that are more critical and conscientious.

Plus points for framing the guide in a spreadsheet layout.


Generating music from text

Researchers at Google built a model that generates music based on brief text descriptions:

We introduce MusicLM, a model generating high-fidelity music from text descriptions such as “a calming violin melody backed by a distorted guitar riff”. MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.

I’m not entirely sure I like where this road goes, but the results are impressive.


Visual explanations for machine learning

As part of a teaching initiative by Amazon, MLU-Explain is a series of interactive explainers on core machine learning concepts. Learn about training sets, decision trees, random forests, and more. Seems like a good way to spend a Friday night if you ask me.


Joke machine learning projects to advance your career

In an automated job climate that analyzes resumes and inspects social profiles, it can be a challenge to find the job that’s right for you. Luckily, Jess Peter for The Pudding put together a satirical set of tools to combat the recruiting bots. Generate a fake resume with a specified level of experience, define a profile pic for your socials, and then use that fake image of your face for the video interview.

I wonder if someone has ever done this in real life. This had to have happened at least once, right?


Analysis of Facebook groups before January 6

The Washington Post and ProPublica analyzed Facebook group posts that disputed election results:

To determine the extent of posts attacking Biden’s victory, The Post and ProPublica obtained a unique dataset of 100,000 groups and their posts, along with metadata and images, compiled by CounterAction, a firm that studies online disinformation. The Post and ProPublica used machine learning to narrow that list to 27,000 public groups that showed clear markers of focusing on U.S. politics. Out of the more than 18 million posts in those groups between Election Day and Jan. 6, the analysis searched for words and phrases to identify attacks on the election’s integrity.

The more than 650,000 posts attacking the election — and the 10,000-a-day average — is almost certainly an undercount. The ProPublica-Washington Post analysis examined posts in only a portion of all public groups, and did not include comments, posts in private groups or posts on individuals’ profiles. Only Facebook has access to all the data to calculate the true total — and it hasn’t done so publicly.

Read more about the methodology behind the analysis.
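As a rough illustration of the phrase-matching step described in the excerpt, here is a minimal Python sketch. The phrase list, the post structure, and the `flag_posts` helper are all hypothetical, and the sample posts are invented; the actual terms and pipeline are in the linked methodology. The calendar years are also an assumption, since the excerpt only says "Election Day" and "Jan. 6".

```python
import re
from datetime import date

# Hypothetical phrases for illustration only; not the list used in the analysis.
PHRASES = ["stop the steal", "rigged election", "voter fraud"]
PATTERN = re.compile("|".join(re.escape(p) for p in PHRASES), re.IGNORECASE)


def flag_posts(posts, start, end):
    """Return posts dated within [start, end] whose text contains any phrase."""
    return [
        post for post in posts
        if start <= post["date"] <= end and PATTERN.search(post["text"])
    ]


# Assumed dates: Election Day 2020 and January 6, 2021.
ELECTION_DAY = date(2020, 11, 3)
JAN_6 = date(2021, 1, 6)

# Invented sample posts, not real data.
posts = [
    {"date": date(2020, 11, 5), "text": "Everyone share this: STOP THE STEAL"},
    {"date": date(2020, 12, 1), "text": "Holiday recipes thread"},
    {"date": date(2021, 2, 1), "text": "More voter fraud claims"},  # outside the window
]
flagged = flag_posts(posts, ELECTION_DAY, JAN_6)

# Rough check of the quoted daily average: 650,000 posts over the window.
days = (JAN_6 - ELECTION_DAY).days
daily_average = 650_000 / days
```

With these assumed dates the window is 64 days, which puts the average a little above 10,000 posts per day, in line with the figure quoted above.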


New paper: Machine learning to predict the source of campylobacteriosis using whole genome data

This study, published in October in PLOS Genetics, brings together machine learning, large bacterial isolate collections and whole genome sequencing to address the general problem of how to trace the source of human infections.

Specifically, we investigated campylobacteriosis, a common infection of animal origin causing ~1.5 million cases of gastroenteritis and 10,000 hospitalizations every year in the United States alone. We show that our combined machine learning/genomics analyses:

  • Improve the accuracy with which infections can be traced back to farm reservoirs.
  • Identify evolutionary shifts in bacterial affinity for livestock host species.
  • Detect changes in human infection capability within related strains.

These results will improve understanding not only of Campylobacter but of bacterial pathogens more generally, because these technologies can readily be applied to other important species.

This paper builds on previous work published by the group, including our well-cited Tracing the source of campylobacteriosis (Wilson et al. 2008, PLOS Genetics 4:e1000203). The use of these methods for tracing infection has influenced public health policy and contributed to reducing disease burden.

This work demonstrates the potential for modern genomics and artificial intelligence approaches to address common and serious problems that affect our everyday lives. The awareness of the importance of infection to society has rarely been higher than in 2021, and while the current pandemic imposes an acute global problem, other infections continue to present long-term threats to health and productivity.

This work was led by Nicolas Arning, in collaboration with David Clifton and Sam Sheppard.

Machine learning explained at five difficulty levels

For their 5 Levels series, Wired brought in Hilary Mason to explain machine learning at five levels of difficulty. Mason’s explanations are super helpful at every level.


Noah Kalina’s averaged face over 7,777 days

Noah Kalina has been taking a picture of himself every day since January 11, 2000. He posted time-lapse videos in 2007, 2012, and 2020. Last year marked the project's 20th anniversary.

Usually Kalina’s videos are a straight-up time-lapse using every photo. But in this collaboration with Michael Notter, 7,777 Days shows a smoother passage of time. Notter used machine learning to align the face pictures, and then each frame shows a 60-day average, which focuses on an aging face instead of everything else in the background.
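The averaging step can be sketched in a few lines of NumPy. This is only an illustration and not Notter's code: it assumes the photos are already aligned and stacked into one array, and it uses a trailing window, which may not match how the video's frames were actually computed. The function name `rolling_average_frames` is made up for this example.

```python
import numpy as np


def rolling_average_frames(frames, window=60):
    """Average each run of `window` consecutive aligned frames.

    frames: array of shape (n_days, height, width, channels).
    Returns an array of shape (n_days - window + 1, height, width, channels),
    where output frame i is the mean of input frames i .. i + window - 1.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.shape[0] < window:
        raise ValueError("need at least `window` frames")
    # Cumulative sums with a leading zero frame let each window sum
    # be computed as the difference of two cumulative sums.
    csum = np.cumsum(frames, axis=0)
    csum = np.concatenate([np.zeros_like(frames[:1]), csum], axis=0)
    return (csum[window:] - csum[:-window]) / window


# Tiny synthetic example: five 2x2 RGB "photos" with constant values 0..4.
demo = np.arange(5, dtype=float).reshape(5, 1, 1, 1) * np.ones((1, 2, 2, 3))
smoothed = rolling_average_frames(demo, window=3)
```

Because the face stays in the same place across aligned photos while the background changes from day to day, averaging keeps the face sharp and blurs everything else. For thousands of full-resolution photos, holding the whole stack in memory like this would likely be impractical, so a real pipeline would process frames in chunks.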


Machine learning to find movie ideas

Speaking of A.I. and fiction, Adam Epstein for Quartz reported on how Wattpad, the platform for people to share stories, uses machine learning to find potential movies:

Wattpad uses a machine-learning program called StoryDNA to scan all the stories on its platform and surface the ones that seem like candidates for TV or film development. It works on both macro and micro levels, analyzing big-picture audience engagement trends to identify the genres picking up steam, while also looking at the specific stories that got popular quickly and calculating what made them so appealing.

The tool can break stories down to their vocabularies and sentence structures (a story’s “DNA,” if you will) and then compare those to other stories to deduce what really makes a work of fiction popular. It also looks at how often users comment on stories and, when they do, what exactly they’re saying. Its goal is to examine all these clues to uncover the precise combination of story elements—genre, emotion, grammar, the list goes on—that hooks audiences to the point they’ll follow its journey onto a visual medium.

Maybe I’m just getting old, but this sounds terrible.


Machine learning to find a recipe for a baked good that’s half cake and half cookie

Last year, around the time when people were baking a lot of things, Sarah Robinson used machine learning to find a recipe for a “cakie”:

Like many people, I’ve been entertaining myself at home by baking a ton and talking about my sourdough starter as if it were a real person. I’m pretty good at following recipes, but I decided I wanted to take things one step further and understand the science behind what differentiates a cake from a bread or a cookie. I also like machine learning so I thought: what if I could combine it with baking??!

Robinson provides the final recipe at the end, so first, I need to try this recipe. Second, what other foods and beverages can this apply to?
