Connecting to PAD using R/RStudio

Created by Rebecca Sokol-Snyder, Modified on Tue, 14 Jan at 4:17 PM by Ben Deverman

Many analysts and data practitioners like to explore and manipulate data in R, and you can easily do this with PAD! The bigrquery package lets you connect to your PAD project directly from R.


Installation

You can install bigrquery from CRAN or GitHub, depending on your preference. Just run one of the following two commands using the installation tool of your choice.


Using CRAN:

 install.packages("bigrquery")

Or from GitHub:

 devtools::install_github("r-dbi/bigrquery")

Then, load the package in each R session by running

 library("bigrquery")

Authorization

To authorize R to access the data in your PAD Project, use the function:

bq_auth()

We recommend against caching OAuth credentials between R sessions.
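If you want to make that choice explicit, bq_auth() accepts a cache argument as well as an email argument to pre-select an account. A minimal sketch (the email address below is a placeholder, not a real account):

```r
library(bigrquery)

# Authorize without caching the OAuth token between sessions
bq_auth(cache = FALSE)

# Optionally pre-select the account to skip the browser account chooser
# (replace the placeholder address with your cta-tech.app account)
bq_auth(email = "you@cta-tech.app", cache = FALSE)
```

Passing cache = FALSE means you will be asked to re-authorize in each new R session, which is the behavior recommended above.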


NOTE: Make sure your cta-tech.app Google Chrome profile is open before running this function. If you try to authorize access from another browser or Chrome profile, you will get an error message saying you have made a ‘malformed’ request. If this happens, switch to your cta-tech.app Chrome profile and try authorization again.




A new tab will open asking you to authorize Tidyverse API Packages to access your BigQuery data; answer yes.



If you have the httr package installed, click ‘Continue’ to authorize access. If you do not have httr installed, you’ll be given an authorization code instead; copy, paste, and submit it at the prompt in your R console.


You are now connected!


Querying your PAD Data

bigrquery offers three different ways to interact with your PAD data:


1. Low-level API: these provide thin wrappers over the underlying REST API. All the low-level functions start with bq_, and mostly have the form bq_noun_verb(). This level of abstraction is most appropriate if you’re familiar with the REST API and you want to do something not supported in the higher-level APIs.

billing <- bq_test_project() # replace this with your project ID as a string
sql <- "SELECT year, month, day, weight_pounds FROM `publicdata.samples.natality`"

tb <- bq_project_query(billing, sql)
bq_table_download(tb, n_max = 10)
#> First chunk includes all requested rows.
#> # A tibble: 10 x 4
#>     year month   day weight_pounds
#>    <int> <int> <int>         <dbl>
#>  1  1969     1    20          7.00
#>  2  1969     1    27          7.69
#>  3  1969     6    19          6.75
#>  4  1969     5    30          6.19
#>  5  1969    11     9          7.87
#>  6  1969     5    25          7.06
#>  7  1969     7    25          7.94
#>  8  1969     9    11          7.06
#>  9  1969     7    13          6.00
#> 10  1969     9    27          8.13

2. Using DBI: The DBI package wraps the low-level API and makes working with BigQuery like working with any other database system. This is the most convenient layer if you want to execute SQL queries in BigQuery or upload smaller amounts (i.e. <100 MB) of data.

library(DBI)
con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = bq_test_project() # replace this with your project ID as a string
)
con
dbListTables(con)
#> [1] "github_nested"   "github_timeline" "gsod"            "natality"      
#> [5] "shakespeare"     "trigrams"        "wikipedia"

dbGetQuery(con, "SELECT year, month, day, weight_pounds FROM `publicdata.samples.natality`", n = 10)
#> First chunk includes all requested rows.
#> # A tibble: 10 x 4
#>     year month   day weight_pounds
#>    <int> <int> <int>         <dbl>
#>  1  1969     1    20          7.00
#>  2  1969     1    27          7.69
#>  3  1969     6    19          6.75
#>  4  1969     5    30          6.19
#>  5  1969    11     9          7.87
#>  6  1969     5    25          7.06
#>  7  1969     7    25          7.94
#>  8  1969     9    11          7.06
#>  9  1969     7    13          6.00
#> 10  1969     9    27          8.13
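The same DBI connection can also be used to upload a small data frame with dbWriteTable(). A sketch, assuming a dataset in your PAD project that you have write access to (the project and dataset names below are placeholders):

```r
library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "your-project-id",  # placeholder: your PAD project ID
  dataset = "your_dataset"      # placeholder: a dataset you can write to
)

# Upload a small (<100 MB) data frame as a new table
dbWriteTable(con, "mtcars_copy", mtcars, overwrite = TRUE)

# Read it back to confirm the upload
dbGetQuery(con, "SELECT COUNT(*) AS n FROM mtcars_copy")
```

For anything much larger than 100 MB, prefer loading data into BigQuery through Google Cloud Storage rather than through DBI.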

3. Using dplyr: The dplyr package lets you treat BigQuery tables as if they are in-memory data frames. This is the most convenient layer if you don’t want to write SQL, but instead want dbplyr to write it for you.

library(dplyr)
con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = bq_test_project() # replace this with your project ID as a string
)
natality <- tbl(con, "natality")

natality %>%
  select(year, month, day, weight_pounds) %>%
  head(10) %>%
  collect()
#> # A tibble: 10 x 4
#>     year month   day weight_pounds
#>    <int> <int> <int>         <dbl>
#>  1  1969    10     6          3.25
#>  2  1969     5    11          5.75
#>  3  1969     6    29          7.94
#>  4  1969     3     7          8.38
#>  5  1970     4    26          6.38
#>  6  1971    10     6          6.69
#>  7  1971     2    23          6.69
#>  8  1971     8    12          7.37
#>  9  1969     9     3          5.25
#> 10  1969     4    25          6.62
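If you want to see the SQL that dbplyr generates before anything runs in BigQuery, call show_query() on the lazy table instead of collect(). A sketch using the same natality table as above:

```r
library(dplyr)

natality %>%
  select(year, month, day, weight_pounds) %>%
  head(10) %>%
  show_query()
# Prints the generated SQL; no query is executed in BigQuery
# until you call collect() (or otherwise force evaluation)
```

This is a handy way to sanity-check the translated query (and its cost) before running it against your PAD project.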

Troubleshooting:

If you are switching between different Google accounts when using R/RStudio, you may run into authentication issues. If you see errors about being unable to refresh your token, an invalid_grant, or a token that has been expired or revoked, we recommend trying the steps below (note that this may have implications for other parts of your R/RStudio setup):

1. Locate your gargle folder and delete its cache folder
2. Locate your gcloud application_default_credentials file (depending on your system, location will vary) and delete it if it exists. This R troubleshooting link may be helpful in locating it. 
3. Update your bigrquery package if needed.
4. Re-try authenticating with the desired Google account.
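Steps 1, 2, and 4 can be scripted from R using gargle's helpers. A sketch, assuming default cache locations (verify the paths on your system before deleting anything; the gcloud credentials path below is a common location, not guaranteed):

```r
# 1. Inspect gargle's OAuth token cache, then clear it
gargle::gargle_oauth_sitrep()  # lists cached tokens and the cache path
unlink(gargle::gargle_oauth_cache(), recursive = TRUE)

# 2. Remove gcloud application default credentials if present
#    (a common location; adjust for your system)
adc <- file.path("~", ".config", "gcloud",
                 "application_default_credentials.json")
if (file.exists(adc)) file.remove(adc)

# 4. Re-authenticate with the desired Google account
bigrquery::bq_auth()
```

If the errors persist after clearing both caches, update bigrquery and gargle (step 3) and try again.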
