Data manipulation with R
Posted on Mon 19 October 2020 in R
Introduction
In my previous post I demonstrated how to obtain a prostate cancer dataset with genomic information in the form of gene expression quantification and created a local PostgreSQL database to hold the data.
Here, I will use R to connect to the PostgreSQL database, retrieve and then prepare the CGC data to perform a Differential Expression Analysis for sequence count data in the CGC dataset. This demonstration is on a RStudio project running in Windows 10, but the same steps can be followed on an Unix system.
As always, the code demonstrated here is available on my portfolio on GitHub.
Setting up .Renviron file
In my working directory, I create a text file named .Renviron
to store the credentials of the PostgreSQL database with the following information:
userid = "<USER_NAME>"
pwd = "<PASSWORD>"
Replace <USER_NAME>
and <PASSWORD>
with your credentials used to access the PostgreSQL database. Usually, the default username is postgres
and the password is defined during PostgreSQL installation.
Install/Load packages
Then I open a RStudio session and create a project in the folder containing the .Renviron
file. Now, I need to load the packages I will use today. You can install them using install.packages()
function if you do not have them installed yet:
# Run only once
install.packages(c("RPostgres", "here", "tidyverse"))
Note that to install more than one package at once we must use the concatenate c()
command to pass the packages names and they must be quoted and separated by commas — a R vector. The package dependencies will be installed as well.
Now I load the packages into the R session:
library(RPostgres)
library(here)
library(tidyverse)
The RPostgres
is needed to connect to the PostgreSQL database I created before. The here
package handles file paths. tidyverse
is a powerful collection of packages for data manipulation. By using it, it will actually load several other packages, such as dplyr
, stringr
and tidyr
. It is one of the packages I use the most. Check RPostgres
documentation here and tidyverse
documentation here.
Set up connection to tcga database
I now set up the connection to the database:
con <- dbConnect(
RPostgres::Postgres(),
dbname = "tcga",
host = "localhost",
port = 5432,
user = Sys.getenv("userid"),
password = Sys.getenv("pwd")
)
The Sys.getenv()
will retrieve the information in the .Renviron
file. Remember to not share this file so your credentials remain secret. The connection credentials are now stored in the con
object.
Retrieve and pivot gene_counts_cases table
With the command below, I retrieve the table containing the case-identified, raw gene counts I created last time:
cases <- dbGetQuery(con, "SELECT * FROM gene_counts_cases")
The dbGetQuery()
is one of the functions from DBI
package, a dependency of RPostgres
. Check DBI
documentation here.
The information is now stored in the cases
object, which has three columns: case_id
, gene_id
and gene_counts
. Therefore, each row is a combination of a case, a gene and a gene count — it is a “long” format.
I now will reformat the table to a “wide” format (pivot) because it is a requirement for the differential expression analysis. The pivoted table will have GxN dimensions, where G are the number of rows (the number of gene transcripts quantified) and N the number of columns (the number of cases).
I will use the pivot_wider()
function of the tidyr
package to do the job (WARNING: COMPUTATION INTENSIVE STEP):
cases_pivoted <-
cases %>% pivot_wider(
names_from = case_id,
values_from = gene_count,
values_fill = 0,
names_repair = "check_unique",
values_fn = mean
)
cases_pivoted
is the name of the object that will hold the pivoted table. The %>%
is dplyr
‘s syntax. It means that we are piping the contents of cases
object into the pivot_wider()
function and its arguments.
The names_from
argument tells which column will be pivoted to generate new columns. The values_from
argument tells which column hold the values that will fill the new pivoted table. The values_fill
argument will substitute any missing data for a zero. The names_repair
argument checks that each new column has a unique name. Finally, the values_fn
argument indicates the function that must be applied to the values filling the new pivoted table. Note that I used the mean
function because during this step I noticed that some cases were associated with more than one gene expression quantification file. Therefore, I had to take the mean of these extra gene counts to correctly generate the pivoted table.
Since I calculated means for the values, I then rounded to the next integer all numerical data in the table with the help of dplyr
‘s mutate()
, across()
and where()
functions and round()
, which is one of R’s standard (base) functions:
counts <- cases_pivoted %>% mutate(across(where(is.numeric), round, 0))
The where(is.numeric)
ensures that I only manipulated numeric data in the table. The across()
function applies the same transformation (rounding in this case) to multiple columns. Finally, mutate()
adds the new variables (rounded columns), replacing the existing ones.
Retrieve and de-duplicate follow_up table
Now I set aside the counts
table for a moment to prepare the sample classifications. I realized that the follow_up
table had duplicate data for some reason. Thus I connected to the database and de-duplicate the followup_primarytherapyoutcomesuccess_1
column by using string aggregation (PostgreSQL’s STRING_AGG
function). I also changed its name to outcome
for simplicity, and saved the results as the followup_dedup
table. This is the command I used:
dbExecute(
con,
"CREATE TABLE followup_dedup AS SELECT case_id, STRING_AGG(followup_primarytherapyoutcomesuccess_1, ',') AS outcome FROM follow_up GROUP BY case_id"
)
Note how I used DBI
‘s dbExecute()
instead of dbGetQuery()
, since I will not retrieve the table into the R session.
Now I create other table to link the cases ID numbers with the de-duplicated outcomes, saving the result as the outcomes
table:
dbExecute(
con,
"CREATE TABLE outcomes AS SELECT case_id, outcome FROM allcases INNER JOIN followup_dedup USING(case_id)"
)
I now retrieve the outcomes
table into the R session:
outcomes <- dbGetQuery(con, "SELECT * FROM outcomes")
Create sample classification and new labels
I will now create a new column named class
in the outcomes table with a simplified case/control classification based on the outcome
column with the help of dplyr
‘s mutate()
and case_when()
, which vectorizes multiple if/else statements (it is an R equivalent of the SQL CASE WHEN
statement):
outcomes <- outcomes %>% mutate(class = case_when(
str_detect(outcome, "Complete") ~ "control",
str_detect(outcome, "Partial") ~ "case",
str_detect(outcome, "Disease") ~ "case"
))
Now I will create a new label for the cases for simplification. I created a column named new_names
in the outcomes
table. The new names were created by joining the classification created in the previous step with the 12 last characters of the case_id
:
outcomes <- outcomes %>% mutate(new_names = paste0(str_sub(case_id, -12), "_", class))
Finally, let’s apply the new case labels, substituting the old ones with the help of dplyr
‘s recode()
function:
colnames(counts) <- dplyr::recode(colnames(counts), !!!setNames(as.character(outcomes$new_names), outcomes$case_id))
Convert gene_ids into row names, then delete gene_id column
The table is almost in the state required for differential expression analysis. I just need to convert the gene_id
column into row names of the data frame:
counts <- as.data.frame(counts)
row.names(counts) <- counts$gene_id
The gene_id
is not needed anymore. I can delete it:
counts <- subset(counts, select = -c(gene_id))
The data frame is now ready for differential expression analysis for sequence count data. The features (gene IDs) are embedded on the R object row names, whereas each column corresponds to a individual sample. The row/column intersection are therefore, the raw counts of gene expression.
Check the dimensions (GxN) of the data frame with the dim()
function and see that there 60483 rows and 236 columns, meaning that 60483 transcripts where quantified in samples obtained from 236 individuals with prostate cancer.
dim(counts)
# Output:
[1] 60483 236
See below first few rows and columns of the finalized counts
data frame:
In a future post I will demonstrate the differential expression analysis per se.
Conclusion
- I demonstrated how to connect to a local PostgreSQL database with
Rpostgres
andDBI
packages; - Reorganized data by pivoting a long data frame to a wider data frame with gene IDs in rows and samples in columns with
dplyr
package functions; - Labeled columns as cases or controls for the differential expression analysis, also with
dplyr
package.
Go back to Working with Cancer Genomics Cloud datasets in a PostgreSQL database Part 1 and Part 2.
Subscribe to my RSS feed, Atom feed or Telegram channel to keep you updated whenever I post new content.