Integrating R and Python with reticulate
Posted on Sun 20 March 2022 in R
Introduction
reticulate is an R package that allows interoperability between R and Python. I recently discovered this package, and I have been excited to efficiently run Python scripts inside an R session, bringing the best of both worlds.
In this post, I will demonstrate reticulate
with two scripts. First, I will start an R session with an R script. Then I will call a Python script inside the R session and manipulate the Python output.
The demonstration output will be a data frame containing exon coordinates of two genes and the nucleotide sequences of those exons.
As always, I will post the code of this demo at my portfolio.
R script: part 1/2
I will start by listing the R packages I will use for the demonstration. Install them with install.packages()
or BiocManager::install()
if they are Bioconductor packages (install BiocManager
with install.packages()
as well).
library(here)
library(tidyverse)
library(reticulate)
library(glue)
# Biocondutctor packages
library(GenomicRanges)
library(BSgenome)
library(BSgenome.Hsapiens.UCSC.hg38)
Next, using here()
package, I will create the full path of a GFF3
file. The GFF3
format stores genomic features in a text file to help represent genomic data. This file in question contains features of the whole human genome. Since it is a big file, download it at the NCBI’s FTP site.
gffPath <- here("GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gff")
Edit the path using here()
‘s syntax if necessary.
To keep things simple, I create a vector with just two genes:
genes <- c("HTT", "FMR1")
Now I can start working with the Python script. If you use conda
environments to work with Python as I do, you can select the environment with reticulate
‘s use_condaenv()
function. In this case, I will use my bioenv
environment.
use_condaenv("bioenv")
Using here()
again, I set up the path for the Python script. Then, I use reticulate
‘s source_python()
function to run the Python script.
pythonScript <- here("reticulate_demo_Python_side.py")
source_python(pythonScript)
Let me show you the contents of this script.
Python script
The modules I will use for the demonstration will be pyranges
and pandas
. I surmise pyranges
calls pandas
in the background, but explicit is better than implicit.
import pyranges as pr
import pandas as pd
(Remember to install all the necessary Python modules in the environment beforehand with conda
or pip
).
conda install -c bioconda pyranges
# or
pip install pyranges
Then, I read the GFF3
file into the Python session using pyranges
‘s read_gff3()
function. Since I saved the file path into R’s gffPath
object, I must use the r.
prefix to bring it into the Python session, like this:
grch38_gff = pr.read_gff3(r.gffPath)
Next, I append the .df
suffix into the grch38_gff
Python object to convert it into a pandas
data frame to facilitate the search for the genes I established into the R session:
df = grch38_gff.df
I will search for the genes with pandas
‘s str.contains()
method. To do this, I must create a regex string. I will append a $
to the end of each gene name to match the whole gene string:
suffix = "$"
Next, I will use a list comprehension and Python’s join()
method to create the search string :
search_string = "|".join([gene + suffix for gene in r.genes])
The search_string
object turns up like this (remember that the pipe |
character means “or” in regex):
print(search_string)
# outputs:
# HTT$|FMR1$
Then I can finally filter the data frame to include only those genes:
df = df[(df["gene"].str.contains(search_string)) & (df["Feature"] == "exon")]
The line above concludes the Python script. Now let us return to the R script to wrap things up.
R script: part 2/2
We can retrieve Python objects into the R session similarly to the other around. We must use the prefix py$
in R to get the objects generated by Python (the ones outputted by the script).
Therefore, I will retrieve the data frame df
object and assign it to the exonsCoordinates
R object:
exonsCoordinates <- py$df
Before working with it, let me source a function that will retrieve the exon sequences using the GenomicRanges
(pyranges
is Python’s GenomicRanges
analog), BSgenome
and BSgenome.Hsapiens.UCSC.hg38
packages.
source(here("src", "getSequence.R"))
# Function code (getSequence.R file):
getSequence <- function(chr, start, end) {
gr <- GenomicRanges::GRanges(glue::glue("{chr}:{start}-{end}"))
refBase <- BSgenome::getSeq(BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, gr)
refBase <- as.data.frame(refBase)$x[1]
return(refBase)
}
Important: The first use of BSgenome/BSgenome.Hsapiens.UCSC.hg38 will prompt the download of human genomic sequence caches to be saved into a specific location on your computer. Be sure you have sufficient data allowances and space.
Finally, using tidyverse
(dplyr
) pipe %>%
syntax, I create a new column in the data frame (mutate()
) using the Chromosome
, Start
, and End
columns as arguments to my custom getSequence()
function. The new column will contain the exon sequences. Observe the use of rowwise()
and ungroup()
. Since my function is not vectorized, I must use it row-wise instead of column-wise. The first function ensures this. The second function restores the data frame to its column-wise nature after getSequence()
finishes its job. Then, I select just some of the columns of the data frame with dplyr
‘s select()
function:
exonsCoordinatesClean <- exonsCoordinates %>%
rowwise() %>%
mutate(sequence = getSequence(Chromosome, Start, End)) %>%
ungroup() %>%
select(Chromosome, Feature, Start, End, Score, Strand, Frame, ID, gene, sequence)
Conclusion
In conclusion, I :
- Demonstrated how to run a Python script with
reticulate
without leaving an active R session; - Explained how to retrieve Python objects into an R session and vice-versa.
- Demonstrated a simple function that retrieves nucleotide sequences from human genome using chromosome coordinates.
Subscribe to my RSS feed, Atom feed or Telegram channel to keep you updated whenever I post new content.
References
Index of /genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
here | A Simpler Way to Find Your Files
Setting Up Your Unix Computer for Bioinformatics Analysis
GitHub - biocore-ntnu/pyranges: Performant Pythonic GenomicRanges
pandas - Python Data Analysis Library