Training and Evaluating a Neural Network Model

Posted on Mon 22 April 2024 in Python • Tagged with PyTorch, machine learning, transcriptomics

Introduction

In my previous post, I trained an XGBoost machine-learning model with single-cell RNA-Seq (scRNA-Seq) data to differentiate cell identity (parental cells versus paclitaxel-resistant cells) based on transcriptomic patterns.

As an exercise, I decided to use the same input data to experiment with other machine-learning models. In this post …


Continue reading

Analyzing scRNA-Seq data with XGBoost

Posted on Wed 10 April 2024 in R • Tagged with Bioinformatics, gene expression

Introduction

Breast cancer is one of the leading causes of morbidity and mortality worldwide. In 2022, 2.3 million women were diagnosed with breast cancer and about 670,000 died from the disease, according to the World Health Organization.

Traditional breast cancer treatment with chemotherapy may be complicated …


Continue reading

Parallelization with R

Posted on Mon 31 July 2023 in R • Tagged with Bioinformatics, gene expression, edgeR, furrr

Introduction

Some computations can be carried out in parallel. Certain large tasks can be divided into independent subtasks that are solved at the same time, rather than one after another.

I find the native R parallel functions such as mclapply(), or those …


Continue reading

Genomic plots with circlize

Posted on Sat 29 April 2023 in R • Tagged with circlize, genomics, data visualization

Introduction

Genomics is undoubtedly a complex science. The human genome is huge, with more than 3 billion base pairs, about 20,000 protein-coding genes, several million variants, and many more interesting characteristics. The visualization of genomic/omics data is challenging due to the sheer volume of information. Circular plots …


Continue reading

Parsing the ClinVar XML file with pandas

Posted on Sat 04 February 2023 in Python • Tagged with pandas, ClinVar, genomics, variants

Introduction

ClinVar is one of the databases maintained by the USA’s National Center for Biotechnology Information (NCBI). ClinVar archives reports of relationships among human genetic variants and phenotypes (usually genetic disorders). Any organization, such as a laboratory, hospital, or clinic, can submit data to ClinVar. The core idea of ClinVar is to aggregate …


Continue reading

Opening files of size larger than RAM with pandas

Posted on Mon 27 June 2022 in Python • Tagged with pandas, Genomics, Bioinformatics

Introduction

Dealing with big files is routine for everyone working in genomics. FASTQ, VCF, BAM, and GTF/GFF3 files, to name a few, can range from a few hundred megabytes to several gigabytes in size. Usually, we can use cloud services to configure computing instances with a lot of …


Continue reading

Integrating R and Python with reticulate

Posted on Sun 20 March 2022 in R • Tagged with reticulate, gff, GenomicRanges, pyranges, BSgenome

Introduction

reticulate is an R package that allows interoperability between R and Python. I recently discovered this package, and I have been excited to efficiently run Python scripts inside an R session, bringing the best of both worlds.

In this post, I will demonstrate reticulate with two scripts. First, I …


Continue reading

Genomic Analysis With Hail

Posted on Fri 09 July 2021 in Python • Tagged with Bioinformatics, Genomics, Hail

Introduction

Hello, long time no see! Since I last posted, many things have happened. Since March, I have been working as a Post-Doc Researcher, hired by the Hospital Israelita Albert Einstein (HIAE, São Paulo, Brazil) to work for the Projeto Genomas Raros (“Rare Genomes Project”, GRAR from here on), a public-private partnership …


Continue reading

Making an Interactive Map with Shiny and Leaflet in R

Posted on Thu 18 February 2021 in R • Tagged with shiny, leaflet, data visualization, web app

Introduction

Shiny is an R package developed and maintained by the RStudio team. With Shiny, anyone can build interactive web apps to aid data visualization. Here I present a simple template of an interactive Brazilian map displaying fictitious allelic frequencies with sample sizes across the country. It is a useful …


Continue reading

How to Query Ensembl BioMart with Python

Posted on Tue 19 January 2021 in Python • Tagged with Bioinformatics, Ensembl, BioMart, omics, data mining

Introduction

Recently, my colleagues and I wrote a manuscript involving a meta-analysis of RNA-Seq studies. One of my tasks in this project was to perform a Gene Ontology (GO) enrichment analysis: “[G]iven a set of genes that are up-regulated under certain conditions, an enrichment analysis will find which GO …


Continue reading

Machine Learning with Python: Supervised Classification of TCGA Prostate Cancer Data (Part 1 - Making Features Datasets)

Posted on Thu 05 November 2020 in Python • Tagged with Bioinformatics, gene expression, machine learning, supervised classification

Introduction

In a previous post, I showed how to retrieve The Cancer Genome Atlas (TCGA) data from the Cancer Genomics Cloud (CGC) platform. I downloaded gene expression quantification data, created a relational database with PostgreSQL, and created a dataset uniting the raw quantification data for 675 differentially expressed genes identified …


Continue reading

Machine Learning with Python: Supervised Classification of TCGA Prostate Cancer Data (Part 2 - Making a Model)

Posted on Thu 05 November 2020 in Python • Tagged with Bioinformatics, gene expression, machine learning, supervised classification

Introduction

In a previous post, I showed how to retrieve The Cancer Genome Atlas (TCGA) data from the Cancer Genomics Cloud (CGC) platform. I downloaded gene expression quantification data, created a relational database with PostgreSQL, and created a dataset uniting the raw quantification data for 675 differentially expressed genes identified …


Continue reading

Differential Expression Analysis with edgeR in R

Posted on Mon 26 October 2020 in R • Tagged with Bioinformatics, gene expression, edgeR

Introduction

In my previous post, I demonstrated how to organize the CGC prostate cancer data into a format suited to differential expression analysis (DEA).

Nowadays, DEA usually arises from high-throughput sequencing of a collection (library) of RNA molecules expressed by single cells or tissue given their conditions upon collection and …


Continue reading

Data manipulation with R

Posted on Mon 19 October 2020 in R • Tagged with Bioinformatics, gene expression, SQL, PostgreSQL

Introduction

In my previous post I demonstrated how to obtain a prostate cancer dataset with genomic information in the form of gene expression quantification and created a local PostgreSQL database to hold the data.

Here, I will use R to connect to the PostgreSQL database, retrieve and then prepare the …


Continue reading

Meta-analysis and Meta-regression with R

Posted on Tue 13 October 2020 in R • Tagged with meta-analysis, statistical analysis, COVID-19, SARS-CoV-2, acute kidney injury

Introduction

In December 2019, reports of severe acute respiratory syndrome in Wuhan, China, were linked to a novel coronavirus, now known as SARS-CoV-2, and the disease it causes was termed coronavirus disease 2019 (COVID-19).

The World Health Organization declared the COVID-19 outbreak a Public Health Emergency of …


Continue reading

Working with Cancer Genomics Cloud datasets in a PostgreSQL database (Part 1)

Posted on Mon 12 October 2020 in SQL • Tagged with Bioinformatics, gene expression quantification, copy number variation, Windows

Introduction

Recently, I have been looking for publicly available genomics datasets to test machine learning models in Python. During my searches for such a “toy dataset”, I came upon the Cancer Genomics Cloud (CGC) initiative.

Anyone can register with CGC and gain access to massive open-access public datasets, like The …


Continue reading

Working with Cancer Genomics Cloud datasets in a PostgreSQL database (Part 2)

Posted on Mon 12 October 2020 in SQL • Tagged with Bioinformatics, gene expression quantification, copy number variation, Windows

Introduction

Recently, I have been looking for publicly available genomics datasets to test machine learning models in Python. During my searches for such a “toy dataset”, I came upon the Cancer Genomics Cloud (CGC) initiative.

Anyone can register with CGC and gain access to massive open-access public datasets, like The …


Continue reading

FASTQ to Annotation (Part 4)

Posted on Tue 06 October 2020 in Unix • Tagged with Bioinformatics, genomic variation, entrez-direct, EDirect

Introduction

In a previous post, I showed how to configure an Ubuntu system to install Bioinformatics programs.

Now, using the environment I created, I will demonstrate a bash script, FastQ_to_Annotation.sh, that takes next-generation sequencing (NGS) raw reads from human whole genome sequencing as input and produces …


Continue reading

FASTQ to Annotation (Part 3)

Posted on Mon 05 October 2020 in Unix • Tagged with Bioinformatics, genomic variation, entrez-direct, EDirect

Introduction

In a previous post, I showed how to configure an Ubuntu system to install Bioinformatics programs.

Now, using the environment I created, I will demonstrate a bash script, FastQ_to_Annotation.sh, that takes next-generation sequencing (NGS) raw reads from human whole genome sequencing as input and produces …


Continue reading

FASTQ to Annotation (Part 2)

Posted on Fri 02 October 2020 in Unix • Tagged with Bioinformatics, genomic variation, entrez-direct, EDirect

Introduction

In a previous post, I showed how to configure an Ubuntu system to install Bioinformatics programs.

Now, using the environment I created, I will demonstrate a bash script, FastQ_to_Annotation.sh, that takes next-generation sequencing (NGS) raw reads from human whole genome sequencing as input and produces …


Continue reading

FASTQ to Annotation (Part 1)

Posted on Thu 01 October 2020 in Unix • Tagged with Bioinformatics, genomic variation, entrez-direct, EDirect

Introduction

In my previous post, I showed how to configure an Ubuntu system to install Bioinformatics programs.

Now, using the environment I created, I will demonstrate a bash script, FastQ_to_Annotation.sh, that takes next-generation sequencing (NGS) raw reads from human whole genome sequencing as input and produces …


Continue reading

Setting Up Your Unix Computer for Bioinformatics Analysis

Posted on Wed 30 September 2020 in Unix • Tagged with Bioinformatics

Introduction

In this post, I will show how I set up my Unix machine to use Bioinformatics programs and tools. I am currently using Ubuntu 20.04 LTS (Focal Fossa) 64-bit on Windows Subsystem for Linux (WSL2) on Windows 10, so no GUI today!

The code and files used …


Continue reading