Data Science

I keep this section separate from the statistics one because, when I was learning how to run statistics in R, I got frustrated when the documents I found did not directly cover the specific tests I needed to learn about. This split may not generalize, but it makes sense to me. Sorry!

A ModernDive into R & the Tidyverse

By Chester Ismay and Albert Y. Kim. Foreword by Kelly S. McConville

What is this?

Excerpt from ebook: Over the course of this book, you will develop your “data science toolbox,” equipping yourself with tools such as data visualization, data formatting, data wrangling, and data modeling using regression.

In particular, this book will lean heavily on data visualization. In today’s world, we are bombarded with graphics that attempt to convey ideas. We will explore what makes a good graphic and what the standard ways are used to convey relationships within data. In general, we’ll use visualization as a way of building almost all of the ideas in this book.

  1. Link for free ebook here: https://moderndive.com/
  2. Link to repo here: https://github.com/moderndive/ModernDive_book
  3. Links to buy it here: Amazon or CRC Press using promo code ASA18 for a discounted price.

Advanced Data Science 2020

By Jeff Leek and Roger D. Peng

Added Sep 12th, 2020

What is this?

This course is designed for PhD students at Johns Hopkins Bloomberg School of Public Health. We are usually pretty flexible about permitting outside students but we want everyone to be aware of the goals and assumptions so no one feels like they are surprised by how the class works.

The primary goal of the course is to teach you how to deconstruct, perform, and communicate professional data analyses across diverse media.

The goal is to help you to organize your thinking around how to combine the things you have learned about statistics, data manipulation, and visualization into complete data analyses that answer important questions about the world around you.

  1. Link to ebook here: http://jtleek.com/ads2020/

Building reproducible analytical pipelines with R

By Bruno Rodrigues

Added Mon Apr 24th, 2023

What is this?

Excerpt from e-book: The aim of this book is to teach you how to use some of the best practices from software engineering and DevOps to make your projects robust, reliable and reproducible. It doesn’t matter if you work alone, in a small or in a big team. It doesn’t matter if your work gets (peer-)reviewed or audited: the techniques presented in this book will make your projects more reliable and save you a lot of frustration!

  1. Link to ebook here: https://raps-with-r.dev/

Data Analysis for the Life Sciences with R

By Rafael A. Irizarry and Michael I. Love

What is this?

Excerpt from ebook: This book will cover several of the statistical concepts and data analytic skills needed to succeed in data-driven life science research. We go from relatively basic concepts related to computing p-values to advanced topics related to analyzing high-throughput data. While statistics textbooks focus on mathematics, this book focuses on using a computer to perform data analysis. Instead of explaining the mathematics and theory, and then showing examples, we start by stating a practical data-related challenge. This book also includes the computer code that provides a solution to the problem and helps illustrate the concepts behind the solution.

  1. Link to ebook: https://leanpub.com/dataanalysisforthelifesciences

Data Science for Economists Seminar

By Grant McDermott at the University of Oregon

What is this?

Excerpt from site: This is a graduate economics seminar taught by Grant McDermott at the University of Oregon.

Please read the syllabus before you go through any of the lectures. This will detail software requirements and installation, and give you a better sense of the aims and scope of the course. I also have an “FAQ” section at the end that covers frequently asked questions (or, at least, potentially asked questions). Speaking of which, here follow answers to some questions that are more specifically related to this repo.

  1. Link to lectures repo: https://github.com/uo-ec607/lectures#data-science-for-economists

Data Science for Psychologists: A Refreshed Exploratory & Graphical Data Analysis in R

By S. Mason Garrison

Added Fri Apr 28th, 2023

What is this?

Excerpt from site: This website is designed to accompany Mason Garrison’s Data Science for Psychologists (DS4P). DS4P is a graduate-level quantitative methods course at Wake Forest University. This class assumes zero knowledge of programming, computer science, linear algebra, probability, or really anything fancy. I encourage anyone who is quant-curious to work their way through these course notes. The course notes include lectures, worked examples, readings, activities, and labs.

  1. Link to website: https://datascience4psych.github.io/DataScience4Psych/

Data Science in a Box

By Mine Çetinkaya-Rundel

What is it?

Excerpt from site: The core content of the course focuses on data acquisition and wrangling, exploratory data analysis, data visualization, inference, modelling, and effective communication of results. Time permitting, the course also introduces additional concepts and tools like interactive visualization and reporting, text analysis, and Bayesian inference.

  1. Link to site: https://datasciencebox.org/
  2. Link to repo: https://github.com/rstudio-education/datascience-box

Data Science in Education using R

By Ryan A. Estrellado, Emily A. Bovee, Jesse Mostipak, Joshua M. Rosenberg, & Isabella C. Velásquez

Added on Apr 15th, 2021

What is it?

Excerpt from site: We wrote this book assuming you’re at the start of your journey learning R and using data science in your education job. The book takes you from installing R to practicing more advanced data science skills like text analysis.

If you’ve never written a line of R code, we welcome you to the community! We wrote this book for you. Consider reading the book cover to cover and doing all the analysis walkthroughs. Remember that you’ll get more from a few minutes of practice every day than you will from long hours of practice every once in awhile. Typing code every day, even if it doesn’t always run, is a daily practice that invites learning and “a-ha” moments.

  1. Link to e-book: https://datascienceineducation.com/

Data Science with R

By Danielle Navarro

What is this?

Data Science course by the wonderful Danielle Navarro.

  1. Link to e-course here: https://robust-tools.djnavarro.net/

Data Skills for Reproducible Science

What is it?

Excerpt from site: This course provides an overview of skills needed for reproducible research and open science using the statistical programming language R. Students will learn about data visualisation, data tidying and wrangling, archiving, iteration and functions, probability and data simulations, general linear models, and reproducible workflows. Learning is reinforced through weekly assignments that involve working with different types of data.

  1. Course here: https://psyteachr.github.io/msc-data-skills/

Data Wrangling in the Tidyverse

By Nick Huntington-Klein

Added Fri May 5th, 2023

What is this?

Excerpt from the site: This video series covers the basics of data wrangling using the tidyverse, aimed at my data communications class. I assume some very, very basic R background, but really not much.

This is a simplified and adapted version of my data wrangling workshops, which are available in full on my channel for tidyverse (R), data.table (R), and pandas (Python).

  1. Link to videos here: https://www.youtube.com/watch?v=EOpb3VsWmck
  2. Slides to follow along with the videos are here: https://nickch-k.github.io/DataCommSl...
  3. Code the instructor works on in the videos can be found here: https://github.com/NickCH-K/DataCommS...

Exploratory Data Analysis with R 💯

By Roger D. Peng

What is this?

Excerpt from e-book: This book covers the essential exploratory techniques for summarizing data with R. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data you have. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing informative data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data.

  1. Link to e-book here: https://bookdown.org/rdpeng/exdata/

Getting Started with naniar

By Nicholas Tierney

What is this?

Excerpt from site: This vignette aims to work with the following three questions, using the tools developed in naniar and another package, visdat. Namely, how do we (1) start looking at missing data, (2) explore missingness mechanisms, and (3) model missingness?

  1. Link to site here: https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html

grateful: Facilitate citation of R packages 💯

By Francisco Rodriguez-Sanchez & Connor P. Jackson

Added Mon Apr 24th, 2023

What is it?

Excerpt from package: The goal of grateful is to make it very easy to cite R and the R packages used in any analyses, so that package authors receive their deserved credit. By calling a single function, grateful will scan the project for R packages used and generate a BibTeX file containing all citations for those packages.

grateful can then generate a new document with citations in the desired output format (Word, PDF, HTML, Markdown). These references can be formatted for a specific journal, so that we can just paste them directly into our manuscript or report.

  1. Link to package: https://pakillo.github.io/grateful/

Introduction to Data Science Fall 2019

By University of Edinburgh

What is this?

Excerpt from site: Gain experience in data collection, wrangling, and visualization, exploratory data analysis, predictive modeling, and effective communication of results while working on problems and case studies inspired by and based on real-world questions. The course will focus on the R statistical computing language.

  1. Link to e-course here: https://introds.org/

Introduction to Data Science

By Rafael A. Irizarry

What is this?

Excerpt from e-book: This book introduces concepts from probability, statistical inference, linear regression and machine learning and R programming skills.

  1. Link to e-book here: https://leanpub.com/datasciencebook

Mastering Spark with R

By Javier Luraschi, Kevin Kuo, & Edgar Ruiz

What is this?

Excerpt from ebook: This chapter presented Spark as a modern and powerful computing platform, R as an easy-to-use computing language with solid foundations in statistical methods, and sparklyr as a project bridging both technologies and communities. In a world in which the total amount of information is growing exponentially, learning how to analyze data at scale will help you to tackle the problems and opportunities humanity is facing today. However, before we start analyzing data, Chapter 2 will equip you with the tools you will need throughout the rest of this book. Be sure to follow each step carefully and take the time to install the recommended tools, which we hope will become familiar resources that you use and love.

  1. Link to free ebook: https://therinspark.com/

R for Data Science

By Garrett Grolemund & Hadley Wickham

What is this?
Excerpt from site: This is the website for “R for Data Science”. This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualize it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you’ll learn how to clean data and draw plots—and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You’ll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You’ll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualising, and exploring data.

  1. Web-book here: https://r4ds.had.co.nz/
  2. Want to buy it?: Amazon Link
  3. Want to give back? You can donate here: https://www.doc.govt.nz/kakapo-donate

R Programming for Data Science

By Roger D. Peng

What is this?

Excerpt from site: This book is about the fundamentals of R programming. You will get started with the basics of the language, learn how to manipulate datasets, how to write functions, and how to debug and optimize code. With the fundamentals provided in this book, you will have a solid foundation on which to build your data science toolbox.

  1. Web-book here: https://bookdown.org/rdpeng/rprogdatascience/

Reproducible science workshop: A one-day workshop with R and RStudio

By Olivier Gimenez

Added Fri Oct 8th, 2021

What is this?

A one-day workshop covering the following topics: motivations, manipulating data in the tidyverse, visualising data in the tidyverse, writing dynamic and reproducible documents with R Markdown, versioning with Git and GitHub in RStudio, and take-home messages.

  1. Link to materials and videos here: https://oliviergimenez.github.io/reproducible-science-workshop/

Social Data Science with R

By Daniel Anderson, Brendan Cullen, & Ouafaa Hmaddi

Added Thu Dec 31st, 2020

What is this?

Excerpt from e-book: Here’s an intro about why R is great and the cool things you can do with it and new problems you can address.

  1. Link to e-book here: https://www.sds.pub/index.html

Text Mining with R: A Tidy Approach

By Julia Silge and David Robinson

What is this?

Excerpt from ebook: This book serves as an introduction of text mining using the tidytext package and other tidy tools in R. The functions provided by the tidytext package are relatively simple; what is important are the possible applications. Thus, this book provides compelling examples of real text mining problems.

  1. Link to free ebook here: https://www.tidytextmining.com/
  2. Buy the book here: Amazon
  3. Link to repo here: https://github.com/dgrtwo/tidy-text-mining

Using the tidyverse with Databases

By Vebash Naidoo

What is this?

Excerpt from site: You know R, especially the dplyr 📦. Even though the dplyr 📦 is so well written to mimic the SQL syntax - select(), group_by(), left_join() etc. there is still a cognitive load when you switch between using R syntax, and SQL syntax (ask me, who has often written == in SQL syntax on Athena only to wonder why I am getting an error 🤐).

You only have so much memory in your local environment, and may want your RDBMS to do the heavy lifting (most of the computation), and only pull data into R when you need to (e.g. pull in aggregated data to create plots for a report).

In this tutorial you will learn how to use dbplyr, which is a database back-end of dplyr, to execute queries directly in your RDBMS all the while writing R tidyverse syntax 😮 ⭐.

  1. Blog Part 1 here: https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/
  2. Blog Part 2 here: https://sciencificity-blog.netlify.app/posts/2020-12-20-using-the-tidyverse-with-dbs-partii/