Analyzing scRNA-Seq data with XGBoost

Posted on Wed 10 April 2024 in R • Tagged with Bioinformatics, gene expression

Introduction

Breast cancer is one of the leading causes of morbidity and mortality around the world. In 2022, 2.3 million women were diagnosed with breast cancer and about 670,000 died from the disease, according to the World Health Organization.

Traditional breast cancer treatment with chemotherapy may be complicated …


Continue reading

Parallelization with R

Posted on Mon 31 July 2023 in R • Tagged with Bioinformatics, gene expression, edgeR, furrr

Introduction

Some computations can be carried out in parallel. Certain large tasks can be divided into independent ones, allowing them to be solved at the same time rather than one after another.
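The idea of dividing a large task into independent ones can be sketched in a few lines. This is only an illustration in Python using the standard-library `concurrent.futures` module, not the R approach discussed in the post; the function names (`sum_of_squares`, `split_into_chunks`, `parallel_sum_of_squares`) and the toy workload are assumptions made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """One independent task: it needs nothing from the other chunks."""
    return sum(x * x for x in chunk)


def split_into_chunks(values, n_chunks):
    """Divide a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum_of_squares(values, n_workers=4):
    """Hypothetical helper: run each chunk as a separate task, then combine."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() submits every chunk at once and returns results in input order
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


values = list(range(1000))
total = parallel_sum_of_squares(values)
```

A thread pool is used here only to keep the sketch short. For CPU-bound work in Python, `ProcessPoolExecutor` from the same module exposes the same `map()` interface and would likely be the better fit.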

I find the native R parallel functions such as mclapply(), or those …


Continue reading

Opening files of size larger than RAM with pandas

Posted on Mon 27 June 2022 in Python • Tagged with pandas, Genomics, Bioinformatics

Introduction

Dealing with big files is routine for everyone working in genomics. FASTQ, VCF, BAM, and GTF/GFF3 files, to name a few, can range from a few hundred megabytes to several gigabytes in size. Usually, we can use cloud services to configure computing instances with a lot of …


Continue reading

Genomic Analysis With Hail

Posted on Fri 09 July 2021 in Python • Tagged with Bioinformatics, Genomics, Hail

Introduction

Hello, long time no see! Many things have happened since I last posted. Since March I have been working as a Post-Doc Researcher, hired by the Hospital Israelita Albert Einstein (HIAE, São Paulo, Brazil) to work for the Projeto Genomas Raros (“Rare Genomes Project”, GRAR from here on), a public-private partnership …


Continue reading

How to Query Ensembl BioMart with Python

Posted on Tue 19 January 2021 in Python • Tagged with Bioinformatics, Ensembl, BioMart, omics, data mining

Introduction

Recently, my colleagues and I wrote a manuscript involving a meta-analysis of RNA-Seq studies. One of my tasks in this project was to perform a Gene Ontology (GO) enrichment analysis: “[G]iven a set of genes that are up-regulated under certain conditions, an enrichment analysis will find which GO …


Continue reading

Machine Learning with Python: Supervised Classification of TCGA Prostate Cancer Data (Part 1 - Making Features Datasets)

Posted on Thu 05 November 2020 in Python • Tagged with Bioinformatics, gene expression, machine learning, supervised classification

Introduction

In a previous post, I showed how to retrieve The Cancer Genome Atlas (TCGA) data from the Cancer Genomics Cloud (CGC) platform. I downloaded gene expression quantification data, created a relational database with PostgreSQL, and created a dataset uniting the raw quantification data for 675 differentially expressed genes identified …


Continue reading

Machine Learning with Python: Supervised Classification of TCGA Prostate Cancer Data (Part 2 - Making a Model)

Posted on Thu 05 November 2020 in Python • Tagged with Bioinformatics, gene expression, machine learning, supervised classification

Introduction

In a previous post, I showed how to retrieve The Cancer Genome Atlas (TCGA) data from the Cancer Genomics Cloud (CGC) platform. I downloaded gene expression quantification data, created a relational database with PostgreSQL, and created a dataset uniting the raw quantification data for 675 differentially expressed genes identified …


Continue reading

Differential Expression Analysis with edgeR in R

Posted on Mon 26 October 2020 in R • Tagged with Bioinformatics, gene expression, edgeR

Introduction

In my previous post I demonstrated how to organize the CGC prostate cancer data into a format suited to differential expression analysis (DEA).

Nowadays, DEA usually arises from high-throughput sequencing of a collection (library) of RNA molecules expressed by single cells or tissue given their conditions upon collection and …


Continue reading

Data manipulation with R

Posted on Mon 19 October 2020 in R • Tagged with Bioinformatics, gene expression, SQL, PostgreSQL

Introduction

In my previous post I demonstrated how to obtain a prostate cancer dataset with genomic information in the form of gene expression quantification and created a local PostgreSQL database to hold the data.

Here, I will use R to connect to the PostgreSQL database, retrieve and then prepare the …


Continue reading

Working with Cancer Genomics Cloud datasets in a PostgreSQL database (Part 1)

Posted on Mon 12 October 2020 in SQL • Tagged with Bioinformatics, gene expression quantification, copy number variation, Windows

Introduction

Recently I have been looking for publicly available genomics datasets to test machine learning models in Python. During my searches for such a “toy dataset”, I came upon the Cancer Genomics Cloud (CGC) initiative.

Anyone can register in CGC and gain access to massive open-access public datasets, like The …


Continue reading

Working with Cancer Genomics Cloud datasets in a PostgreSQL database (Part 2)

Posted on Mon 12 October 2020 in SQL • Tagged with Bioinformatics, gene expression quantification, copy number variation, Windows

Introduction

Recently I have been looking for publicly available genomics datasets to test machine learning models in Python. During my searches for such a “toy dataset”, I came upon the Cancer Genomics Cloud (CGC) initiative.

Anyone can register in CGC and gain access to massive open-access public datasets, like The …


Continue reading

FASTQ to Annotation (Part 4)

Posted on Tue 06 October 2020 in Unix • Tagged with Bioinformatics, genomic variation, entrez-direct, EDirect

Introduction

In a previous post, I showed how to configure an Ubuntu system to install Bioinformatics programs.

Now, using the environment I created, I will demonstrate a bash script, FastQ_to_Annotation.sh, that takes next-generation sequencing (NGS) raw reads from human whole genome sequencing as input and produces …


Continue reading

FASTQ to Annotation (Part 3)

Posted on Mon 05 October 2020 in Unix • Tagged with Bioinformatics, genomic variation, entrez-direct, EDirect

Introduction

In a previous post, I showed how to configure an Ubuntu system to install Bioinformatics programs.

Now, using the environment I created, I will demonstrate a bash script, FastQ_to_Annotation.sh, that takes next-generation sequencing (NGS) raw reads from human whole genome sequencing as input and produces …


Continue reading

FASTQ to Annotation (Part 2)

Posted on Fri 02 October 2020 in Unix • Tagged with Bioinformatics, genomic variation, entrez-direct, EDirect

Introduction

In a previous post, I showed how to configure an Ubuntu system to install Bioinformatics programs.

Now, using the environment I created, I will demonstrate a bash script, FastQ_to_Annotation.sh, that takes next-generation sequencing (NGS) raw reads from human whole genome sequencing as input and produces …


Continue reading

FASTQ to Annotation (Part 1)

Posted on Thu 01 October 2020 in Unix • Tagged with Bioinformatics, genomic variation, entrez-direct, EDirect

Introduction

In my previous post, I showed how to configure an Ubuntu system to install Bioinformatics programs.

Now, using the environment I created, I will demonstrate a bash script, FastQ_to_Annotation.sh, that takes next-generation sequencing (NGS) raw reads from human whole genome sequencing as input and produces …


Continue reading

Setting Up Your Unix Computer for Bioinformatics Analysis

Posted on Wed 30 September 2020 in Unix • Tagged with Bioinformatics

Introduction

In this post I will show how I set up my Unix machine to use Bioinformatics programs and tools. I am currently using 64-bit Ubuntu 20.04 LTS (Focal Fossa) on Windows Subsystem for Linux (WSL2) under Windows 10, so no GUI today!

The code and files used …


Continue reading