How to Query Ensembl BioMart with Python
Posted on Tue 19 January 2021 in Python
Introduction
Recently, my colleagues and I wrote a manuscript involving a meta-analysis of RNA-Seq studies. One of my tasks in this project was to perform a Gene Ontology (GO) enrichment analysis: “[G]iven a set of genes that are up-regulated under certain conditions, an enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for that gene set”. In other words, I could verify which cellular pathways were active under the experimental conditions.
After I finished the GO analysis, I had a spreadsheet with a list of GO terms: brief descriptions of the cellular process performed by each pathway. While thinking about the pathways, I started to wonder: “which proteins participate in each pathway?” So I decided to mine the Ensembl BioMart to find the genes encoding those proteins.
Ensembl is a huge project by the European Bioinformatics Institute and the Wellcome Trust Sanger Institute that provides databases of annotated genomes for several (mainly vertebrate) species. BioMart is one of their data mining tools. In this post, I will describe how I used Python to query BioMart. I will introduce a simple function that generates GNU Wget commands to retrieve query results via RESTful access. Then, I will show how I aggregated the data to meet my objective (i.e. to list all genes participating in a biological pathway represented by a GO term).
The code and example files presented here are available in my portfolio.
Prepare identifiers
I created a project folder where I saved the Python script with this demonstration’s code (ensembl_rest.py) and two subfolders. The data folder holds the example data: a spreadsheet named go_demo.xlsx. The src folder contains the functions’ code, which I will describe later.
.
├── data
│ └── go_demo.xlsx
├── src
│ ├── files_to_pandas.py
│ ├── lists_ensembl.R
│ └── query_biomart.py
└── ensembl_rest.py
The spreadsheet contains just two rows of data (so that the computation can be completed quickly). They are derived from the output of a GO enrichment analysis performed with the goana function from the limma R package. The GO ids were the identifiers (search keywords) I used to query BioMart. You may use any identifier recognized by BioMart (more on that later). Just be sure they are in a clearly named column so you can read them into Python.
Load modules and set file paths
These are the modules I used:
import glob
import subprocess
from collections import defaultdict
from pathlib import Path
import dask.dataframe as dd
import pandas as pd
from dask import delayed
from src.files_to_pandas import files_to_pandas
from src.query_biomart import query_biomart
The first four modules are from Python’s standard library. I have already used dask before. If you do not have dask or pandas installed, do it now with pip. I installed openpyxl as well, since it is needed to read and write Excel spreadsheets.
pip install "dask[complete]" pandas openpyxl
Observe that I am explicitly loading the functions by prefixing the module names with src. (remember that every Python script can be treated as a module).
Using pathlib.Path, I set the folder and file paths, with the project folder as the root.
project_folder = Path().resolve() # project root
data_folder = project_folder / "data"
go_list = data_folder / "go_demo.xlsx"
Then I loaded the identifiers file into a pandas.DataFrame with the help of openpyxl.
df = pd.read_excel(go_list, engine="openpyxl")
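For orientation, the loaded data frame would look roughly like the following two-row example (a hypothetical stand-in: only the go_ids column name and the two GO ids are confirmed by the rest of this post):

# Hypothetical stand-in for the contents of go_demo.xlsx
demo = pd.DataFrame({"go_ids": ["GO:0002790", "GO:0019932"]})
print(demo)
#        go_ids
# 0  GO:0002790
# 1  GO:0019932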
Configure the queries: datasets, filters, values and attributes
Now let’s examine an excerpt from the query_biomart() function code. It contains the necessary modules, the function’s arguments, and some of their types hinted with the help of the typing module:
import re
from typing import List
def query_biomart(
    filter_names: List[str],
    values: List[str],
    attribute_names: List[str],
    dataset_name: str = "hsapiens_gene_ensembl",
    formatter: str = "TSV",
    header: bool = False,
    uniqueRows: bool = False
) -> str:
    # ... function body here ...
There is a description of each argument below:
- filter_names (List[str]): A list of strings that define restrictions on the query.
- values (List[str]): A list of strings containing the values that will be used for the filters. Must have the same length as filter_names.
- attribute_names (List[str]): A list of strings that define the data characteristics to be retrieved for the filtered data.
- dataset_name (str, optional): A string indicating which dataset will be queried. Each species has its own dataset. Defaults to “hsapiens_gene_ensembl” (humans).
- formatter (str, optional): A string indicating how the output must be formatted. Options: “HTML”, “CSV”, “TSV” and “XLS”. Defaults to “TSV”.
- header (bool, optional): A Boolean indicating whether the output must have a header. Defaults to False.
- uniqueRows (bool, optional): A Boolean indicating whether the output must have unique rows (deduplicated data). Defaults to False.
Thus I searched BioMart:
- In the Homo sapiens dataset (“hsapiens_gene_ensembl”),
- Filtering by GO terms (“go_parent_term”),
- Using GO ids (for example, GO:0002790) as search keywords (values),
- And wanted to retrieve the gene name attribute (“external_gene_name”).
Here is an example of the function with the inputs above. All defaults were maintained, except for uniqueRows, which I set to True to remove duplicate data:
query_biomart(filter_names=["go_parent_term"],
              values=["GO:0002790"],
              attribute_names=["external_gene_name"],
              uniqueRows=True)
See the Appendix to help you see which information can be searched in BioMart.
The output of the function is a string: a ready-to-use command for the GNU Wget Unix program. The query string inside the command is in XML format and is sent to Ensembl’s servers. The query result will then be saved to a file named after the values (search keywords) of the query. The extension of the file depends on the formatter argument.
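For reference, here is a minimal sketch of how such a function can assemble the command. This is not the exact code from src/query_biomart.py: the endpoint URL and XML layout follow BioMart’s documented REST interface, and the file-naming rule is inferred from the output files shown later.

import re
from typing import List

BIOMART_URL = "http://www.ensembl.org/biomart/martservice?query="

def query_biomart(filter_names: List[str], values: List[str],
                  attribute_names: List[str],
                  dataset_name: str = "hsapiens_gene_ensembl",
                  formatter: str = "TSV", header: bool = False,
                  uniqueRows: bool = False) -> str:
    """Build a GNU Wget command that queries BioMart via its REST interface."""
    filters = "".join(f'<Filter name="{n}" value="{v}"/>'
                      for n, v in zip(filter_names, values))
    attributes = "".join(f'<Attribute name="{n}"/>'
                         for n in attribute_names)
    xml = ('<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
           f'<Query virtualSchemaName="default" formatter="{formatter}" '
           f'header="{int(header)}" uniqueRows="{int(uniqueRows)}" count="">'
           f'<Dataset name="{dataset_name}" interface="default">'
           f'{filters}{attributes}</Dataset></Query>')
    # Name the output file after the search keywords, dropping characters
    # that are awkward in file names (e.g. "GO:0002790" -> "GO0002790");
    # the real function likely maps the formatter to the file extension
    outfile = re.sub(r"\W", "", "".join(values)) + ".txt"
    return f"wget -O {outfile} '{BIOMART_URL}{xml}'"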
I created a Bash script named commands.sh inside the data folder to back up and document my work. I did this with a for loop. In other words, Python wrote for me a series of Wget commands, one for each keyword in the identifiers data frame (the go_ids column, df["go_ids"]), alongside the desired filter and attribute:
with open(data_folder / "commands.sh", "w", newline="\n") as commands:
    commands.write("#!/bin/bash" + "\n")
    for go_id in df["go_ids"]:
        commands.write(query_biomart(filter_names=["go_parent_term"],
                                     values=[go_id],
                                     attribute_names=["external_gene_name"],
                                     uniqueRows=True) + "\n")
See a screenshot of the commands.sh file below. Notice the shebang (#!) line: it tells the system to execute the file with Bash.
Execute the queries
I executed the commands.sh file with the help of Python’s subprocess library. Notice I indicated the path of the file by converting the pathlib.Path() address to a string with str(), and changed the working directory with the cwd argument so that the query results would be downloaded into the data folder as well.
subprocess.call(str(data_folder / "commands.sh"), cwd=data_folder)
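Note that calling the script directly requires its execute permission bit to be set, which Python’s open() does not do. If the call fails with a permission error, a small workaround is to invoke Bash explicitly:

# Alternative invocation: run the script through bash, so the execute
# permission bit on commands.sh does not matter
subprocess.call(["bash", str(data_folder / "commands.sh")], cwd=data_folder)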
Then I waited for the downloads to complete. The screenshot below shows part of the wget progress.
After a while, two files named GO0002790.txt and GO0019932.txt were created, each containing a list of the genes related to its GO id:
.
├── data
│ ├── commands.sh
│ ├── go_demo.xlsx
│ ├── GO0002790.txt
│ ├── GO0019932.txt
│ └── lists_ensembl.xlsx
├── src
│ ├── files_to_pandas.py
│ ├── lists_ensembl.R
│ └── query_biomart.py
└── ensembl_rest.py
Label and unify data into a pandas.DataFrame
Similarly to my previous post, I needed to build a relation of GO id -> genes, so I adapted and renamed the old function I used on that occasion. The new function is named files_to_pandas. It takes a file and converts it into a pandas.DataFrame, creating a new column to hold the file name (in our case, the files were named after the values, i.e. the GO ids).
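The actual code lives in src/files_to_pandas.py; a minimal sketch could look like this (assumptions: the results are plain one-column TSV files, and the GO id is rebuilt from the file name so it matches the go_ids column later on):

from pathlib import Path
import pandas as pd

def files_to_pandas(filename):
    """Read one BioMart result file and label its rows with its GO id."""
    df = pd.read_csv(filename, sep="\t", header=None, names=["gene_name"])
    # Rebuild the GO id from the file stem, e.g. "GO0002790" -> "GO:0002790"
    # (an assumption; the real function may store the raw file name instead)
    stem = Path(filename).stem
    df["filename"] = stem[:2] + ":" + stem[2:]
    return df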
The idea is that I needed to create one data frame for each file. Thus, I used the glob method to get an iterable of file names, and used dask’s delayed function to assemble the several pandas.DataFrames generated by the files_to_pandas function into a unified dask.DataFrame.
file_list = data_folder.glob("*.txt")
dfs = [delayed(files_to_pandas)(filename) for filename in file_list]
ddf = dd.from_delayed(dfs)
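As a side note, dask can also glob and read the files directly. Something like the following should be roughly equivalent, although the filename column would then hold the full file path rather than the bare GO id:

# Possible shortcut: read all result files at once and record each row's
# source file in a "filename" column (full path, not just the GO id)
ddf_alt = dd.read_csv(str(data_folder / "*.txt"), sep="\t", header=None,
                      names=["gene_name"], include_path_column="filename")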
Reorganize data with defaultdict
I could have stopped at the previous step, but I surmised that I could condense the data frame into fewer rows by aggregating all genes of the same GO id into a single row. So I had the idea of using Python’s defaultdict, a subclass of the dict class. First, I initialize an empty defaultdict. It will have GO ids as keys and lists of gene names as values:
go_to_genes = defaultdict(list)
Then I simply had to loop through the rows of the dask.DataFrame, inserting the keys into the defaultdict and appending the gene names to the value lists one by one.
for go_id, gene in zip(ddf["filename"], ddf["gene_name"]):
    go_to_genes[go_id].append(gene)
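One caveat: zipping two dask columns iterates each of them separately, so the task graph is computed more than once. Since the result easily fits in memory here, a variant that materializes both columns into pandas first (replacing the loop above) avoids the extra work:

# Equivalent to the loop above, but computes the dask graph only once
pdf = ddf[["filename", "gene_name"]].compute()
for go_id, gene in zip(pdf["filename"], pdf["gene_name"]):
    go_to_genes[go_id].append(gene)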
Save the defaultdict into a new pandas.DataFrame
Finally, I created another data frame to store the contents of the defaultdict. Notice how I used the .items() method: it is the correct way to access the key-value pairs of a defaultdict.
go_genes = pd.DataFrame(go_to_genes.items(), columns=["go_ids", "gene_name"])
Error checking
Just to be sure that all queries worked correctly, I used a simple loop to check the contents of the gene name column:
for go_id, gene in zip(go_genes["go_ids"], go_genes["gene_name"]):
    if "Query" in str(gene):
        print(f"Error in {go_id}")
print("Error check done.")
I used “Query” as an example because BioMart errors may contain this word in their messages, such as:
"Query ERROR: caught BioMart::Exception::Database: Could not connect to mysql database ..."
Of course, other strings can be used. In retrospect, “ERROR” would have been an even better option than “Query”. If an error had been detected, the loop would have printed the GO id of the failed query (caused by a network error, for example), and I would have retried that query.
Annotate original data frame
No errors were found, so I could finally annotate my original data frame by joining the two frames on the "go_ids" column. I even created a new column ("gene_string") to convert the Python lists into nicely formatted, comma-delimited strings of gene names:
df_annotated = df.set_index("go_ids").join(go_genes.set_index("go_ids"))
df_annotated["gene_string"] = [", ".join(map(str, element)) for element in df_annotated["gene_name"]]
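A small hedge: if a query had returned no genes, the joined gene_name cell would hold NaN instead of a list, and the comprehension above would raise an error. A defensive variant could be:

# Defensive variant (assumption: a GO id may have no matching genes,
# leaving NaN instead of a list after the join)
df_annotated["gene_string"] = [
    ", ".join(map(str, element)) if isinstance(element, list) else ""
    for element in df_annotated["gene_name"]
]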
I then saved the result into a new tab (named go annotated) inside my go_demo.xlsx file:
with pd.ExcelWriter(go_list, mode="a") as writer:
    df_annotated.to_excel(writer, sheet_name="go annotated", index=True)
Conclusion
In this post, I:
- Introduced a function to create customized, ready-to-use query strings for REST API access;
- Provided a file with the descriptors of the data deposited in BioMart (lists_ensembl.xlsx);
- Demonstrated how to use Python to invoke GNU Wget to download the queries’ results;
- Demonstrated how to aggregate and manipulate the query results using pandas, dask and defaultdict.
Feel free to use the query_biomart() function to mine data from BioMart as you wish!
Subscribe to my RSS feed, Atom feed or Telegram channel to stay updated whenever I post new content.
Appendix
I prepared a spreadsheet named lists_ensembl.xlsx, also stored in the data folder. Ensembl has a lot of datasets, filters and attributes, so examine the other rows if you are interested. I produced the spreadsheet with the help of the biomaRt R package. If you prefer R over Python, be sure to try querying BioMart with it. The lists_ensembl.R file contains the R code that generated the spreadsheet.
References
Wget - GNU Project - Free Software Foundation
Representational state transfer - Wikipedia
goana function | R Documentation
Working with Cancer Genomics Cloud datasets in a PostgreSQL database (Part 2)
pathlib — Object-oriented filesystem paths — Python 3.9.1 documentation
openpyxl - A Python library to read/write Excel 2010 xlsx/xlsm files — openpyxl 3.0.6 documentation
collections — Container datatypes | Python 3.9.1 documentation