Dplyr Cheat Sheet R



Dplyr provides a grammar for manipulating tables in R. This cheatsheet will guide you through the grammar, reminding you how to select, filter, arrange, mutate, summarise, group, and join data frames and tibbles. ( Previous version) Updated January 17. Before we get started, get the Cheat Sheet. Datapasta is great for importing raw data from HTML tables on the web. But, you’ll still need to learn how to transform / wrangle the data and produce visualizations. For those topics, I’ll use the Ultimate R Cheat Sheet to refer to dplyr and ggplot2 code in my workflow.

4 min read2020/04/16

Motivation

I use R to access data held in Microsoft SQL Server databases on a daily basis. As a result of running into problems, I’ve realized I don’t have an understanding of the roles different components, notably dplyr, dbplyr, odbc and dbi each play in the process. This contributes to the fact that my efforts to resolve or mitigate issues are often an inefficient combination of Google searches and trial and error. Additionally, as I find or develop workarounds, I am unwilling to promote them with others because I don’t fully understand the cause of the issue and, as a result, I am not confident I have addressed the problem at the appropriate level.

Specifically, this is most motivated by the, “Invalid descriptor index,” error documented here.

I am writing what I learn to solidly my thinking and to help others who may experience the same challenge.

Scope

Given that the source of my motivation is encountering problems when working with data in Microsoft SQL Server databases, and that I prefer to use packages in the tidyverse, this investigation will be focused on how these packages work together to collect data from databases managed by Microsoft SQL server.

I won’t be writing about other options like RODBC, RJDBC or database-specific packages.

I will also not include comments about database management systems (DBMS) other than Microsoft SQL Server.

While it’s possible to generalize many of the concepts I write about here to other DBMS systems, I will not explicitly call them out. There are plenty of resources that do that. I aim to be very focused on how these components interact in a tidyverse and Microsoft SQL Server environment in the hopes it will help paint a simpler, clearer picture for others working in that same configuration.

Additionally, I almost never write data to a DBMS, and I suspect this is the case for many people working as analysts in Enterprise environments. In light of that I will be focused on how these components work together to extract data from SQL, and not how they write data to it.

Role of dplyr and dbplyr packages

dbplyr is the database back-end for dplyr - it does not need to be loaded explicitly, it is loaded by dplyr when working with data in a database.

dbplyr translates dplyr syntax into Microsoft SQL Server specific SQL code so dplyr can be used to retrieve data from a database system without the need to write SQL code.

dbplyr relies on the DBI and odbc packages as an intermediaries for connections with a SQL Server database.

Dplyr Cheat Sheet R

Dplyr can also pass on explicitly written (not translated from dplyr) SQL code to DBI.

dbplyr generates, or captures, the SQL code that is then passed into the front-end of the database stack provided by DBI, odbc and the ODBC driver

Role of the DBI package

DBI segments the connectivity to the SQL database into a, “front-end,” and a, “back-end.”

DBI implements a standardized front-end to dbplyr, and the odbc package acts as a driver for DBI to interface with SQL Server.

An example of front-end functionality provided by DBI…

  • connect/disconnect to the database
  • create and execute statements in the DBMS
  • extract results/output from statements
  • error/exception handling
  • information (meta-data) from database objects
  • transaction management (optional)

I think of the DBI package as the front end for interactive user at the R console, a script or package, into the other components, odbc and an ODBC driver, that make it possible to extract data from SQL server.

Dplyr Cheat Sheet RDplyr Cheat Sheet RCheat

Role of the odbc package

The odbc package provides the DBI back-end to any odbc driver connection, including those for Microsoft SQL Server.

Tidyr Cheat Sheet

This enables a connection to any database with ODBC drivers available.

Cheat

I think of the odbc package as the “back-end” of DBI and the “front-end” into the ODBC driver.

Role of ODBC drivers

Open Database Connectivity (ODBC) drivers are the last leg of the link between dplyr and SQL Server. They are what enable the odbc package to interface with SQL server.

I think of the SQL Server ODBC driver as the “front-end” into SQL server, and again, back to the user, script or package

Summary

  • Dbplyr translates dplyr syntax into SQL code, or captures explicitly written SQL, and hands it off to DBI
  • DBI uses odbc to interface with the SQL Server ODBC driver
  • ODBC driver communicates with SQL server and retrieves results

User or package code -> DBI -> odbc -> SQL Server ODBC driver -> SQL server

Tidyverse Cheat Sheet Pdf

Sources and notes

Data Wrangling Cheat Sheet

I haven’t written anything new here, just focused it on the configuration I use day to day in the hopes it helps someone else.

Most was gleaned from the following and I’d recommend reviewing them for a broader perspective, and deeper insights into specific areas:

Rstudio Cheat Sheet Dplyr

  • vignette('DBI', package = 'DBI')
  • vignette('dbplyr', 'package = 'dbplyr)