Jobs of a data scientist

Roger Peng outlines four main roles of a data scientist:

If you’re reading this and find yourself saying “I’m not an X” where X is either scientist, statistician, systems engineer, or politician, then chances are that is where you are weak at data science. I think a good data scientist has to have some skill in each of these domains in order to be able to complete the basic data analytic iteration.

The good thing about data science is that you can apply the skills to different fields and tasks. It’s also one of the challenges when you’re in the early phases of learning, because you have to figure out what to work on. Peng’s four roles should point you in the right direction.

See also: Peng’s tentpoles of data science.


Audit advanced data science course online

Jeff Leek and Roger Peng started their course Advanced Data Science at Johns Hopkins University. It’s meant for JHU students, but you can learn from the weekly course material for free:

The class is not designed to teach a set of statistical methods or packages – there are a ton of awesome classes, books, and tutorials about those things out there! Rather the goal is to help you to organize your thinking around how to combine the things you have learned about statistics, data manipulation, and visualization into complete data analyses that answer important questions about the world around you.

So you know the methods and tools (or how to learn them on your own), but you want to learn more about putting it all together.

Nice. I could probably use a refresher.

You can get the weekly updates here.


What is R, what it was, and what it will become

Roger Peng provides a lesson on the roots of R and how it got to where it is now:

Chambers was referring to the difficulty in naming and characterizing the S system. Is it a programming language? An environment? A statistical package? Eventually, it seems they settled on “quantitative programming environment”, or in other words, “it’s all the things.” Ironically, for a statistical environment, the first two versions did not contain much in the way of specific statistical capabilities. In addition to a more full-featured statistical modeling system, versions 3 and 4 of the language added the class/methods system for programming (outlined in Chambers’ Programming with Data).

I’m starting to feel my age, as some of the “history” feels more like recent experience.

You can also watch Peng’s keynote in the video version.


Exploring data to form better questions

Feeding off the words of John Tukey, Roger Peng proposes a search for better questions in analysis:

The goal in this picture is to get to the upper right corner, where you have a high quality question and very strong evidence. In my experience, most people assume that they are starting in the bottom right corner, where the quality of the question is at its highest. In that case, the only thing left to do is to choose the optimal procedure so that you can squeeze as much information out of your data. The reality is that we almost always start in the bottom left corner, with a vague and poorly defined question and a similarly vague sense of what procedure to use. In that case, what’s a data scientist to do?

Story of my life.


Following your gut, following the data

The Wall Street Journal highlighted a disagreement between data and business at Netflix. Ultimately, the business side “won.” However, maybe that’s the wrong framing. Roger Peng describes the differences between analysis and the full truth:

There’s no evidence in the reporting that the content team didn’t believe the data or the analysis. It’s just that their fear of damaging a relationship with an actor overruled whatever desire they might have had to maximize clicks or views. The logic was probably along the lines of “We may take a hit in the short-run but we will benefit from this relationship in the long-run.” Whether that’s true or not is unclear, but it’s a tricky question to answer with data. It’s not even clear to me how you would formulate that question.

Data often pitches itself as the path to definitive answers, but most of the time it gives you possibilities and weighted suggestions. Follow blindly, and you end up with creepy, algorithmically generated YouTube videos.


Past and future of data analysis

Roger Peng, a biostatistics professor at Johns Hopkins University, talks about the past and future of data analysis, using music as a metaphor for the path.
