Setting Up Your Unix Computer for Bioinformatics Analysis

Posted on Wed 30 September 2020 in Unix

Introduction

In this post I will show how I set up my Unix machine to use Bioinformatics programs and tools. I am currently using Ubuntu 20.04 LTS (Focal Fossa) 64-bit on Windows Subsystem for Linux 2 (WSL2) on Windows 10, so no GUI today!

The code and files used here can be retrieved from this post's corresponding folder on my portfolio.

Preparing the system

First, it is recommended that we upgrade the system. Open a terminal on your machine and type or paste the following commands, pressing Enter after each one (type your password whenever asked):

sudo apt-get update
sudo apt-get upgrade

Next, I install some useful libraries and build tools so that software installed later will build and work properly. Some of these (e.g. default-jdk, the Java development kit) may already be installed on your system, but installing them again does no harm:

sudo apt-get install -y curl unzip build-essential ncurses-dev
sudo apt-get install -y byacc zlib1g-dev python-dev git cmake
sudo apt-get install -y default-jdk ant
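
A quick way to confirm that the installation worked is to check that the main executables can be found on the PATH. The sketch below assumes the packages above provide the commands listed (for example, build-essential provides gcc and make, and default-jdk provides java); check_tools is only an illustrative helper name.

```shell
# Report whether each named command can be found on the PATH.
# check_tools is a hypothetical helper name, not part of apt or Ubuntu.
check_tools() {
  local status=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Commands assumed to come from the packages installed above.
check_tools curl unzip gcc make git cmake java ant \
  || echo "Some tools are missing; re-run the apt-get commands."
```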

Installing (mini)conda

Now I will install miniconda. What is miniconda? Miniconda is a minimal installer for conda, a package and environment management system. Every program we install on our computers depends on other programs to work. So if a program X needs a program Y to work, X may stop working if Y gets an update that is, for some reason, incompatible with it.

Environments were developed to solve this kind of problem: they isolate groups of programs, ensuring that only compatible versions of software work together. Miniconda serves to create and manage such environments. The best practice is to create one environment for each specific use. In my case, I installed miniconda to create an environment and populate it with tools used for several Bioinformatics analyses. Other people can create environments for other uses, with the specific programs they need. Another advantage of miniconda is that the configuration files for environments can be shared with others, which helps with backup and reproducibility.

Without further ado, let’s finally install miniconda. Since I am using a Unix with Python 3.7.7 pre-installed, the version of the installer is this one. Check the installation page if you have a different Python version.

You can download the installer from your browser or via command line:

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh > Miniconda3-latest-Linux-x86_64.sh
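
Before running a downloaded installer, it may be worth checking its integrity. This is a minimal sketch, assuming sha256sum (from GNU coreutils, available on Ubuntu) and assuming the download page lists a SHA-256 hash for your installer that you copy by hand. verify_sha256 is an illustrative helper name, and the hash in the usage comment is a placeholder, not a real value.

```shell
# Compare a file's SHA-256 digest with an expected value.
# verify_sha256 is a hypothetical helper; sha256sum comes from GNU coreutils.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "$file" | cut -d ' ' -f 1)"
  if [ "$actual" = "$expected" ]; then
    echo "checksum ok: $file"
  else
    echo "checksum MISMATCH: $file" >&2
    return 1
  fi
}

# Usage (replace the placeholder with the hash published for your installer):
# verify_sha256 Miniconda3-latest-Linux-x86_64.sh "<sha256-from-download-page>"
```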

Then, go to the folder where the installer was downloaded and run the script:

bash Miniconda3-latest-Linux-x86_64.sh

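chmod +x Miniconda3-latest-Linux-x86_64.sh # needed once so the file can be executed directly
./Miniconda3-latest-Linux-x86_64.sh # same effect as the bash command above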

When the installation finishes, I must initialize conda (the path below assumes the default install location in the home folder):

~/miniconda3/condabin/conda init

Close the terminal and open it again. Miniconda should now be ready to use. Check by typing the following and pressing Enter:

conda

Next, I add two channels. Channels are “the locations where packages are stored”. Miniconda comes with the defaults channel pre-configured. The two channels I add here, bioconda and conda-forge, host Bioinformatics and data analysis programs that may not be present in the defaults channel, so I must add them.

Configuring miniconda channels

Once again in the terminal, enter the following commands:

conda config --add channels bioconda
conda config --add channels conda-forge

Miniconda assigns priorities to the channels in its list. When we install a program, miniconda searches the higher-priority channels first, then the channels with lower priority. “Different channels can have the same package” and you can “safely put channels at the bottom of your channel list to provide additional packages that are not in the default channels”, as stated on the official website. The --add flag puts the respective channel (bioconda, then conda-forge) at the top of the priority list. If you want to give a channel lower priority, putting it at the bottom of the list, use the --append flag instead. Thus, after the commands above, the order of channel priorities in our new miniconda installation will be: conda-forge, bioconda and, lastly, defaults.
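
To make the ordering concrete, here is a toy model in plain shell of how the two flags change the list. It does not call conda at all: add_channel and append_channel are illustrative stand-ins for conda config --add channels and conda config --append channels. On a real installation, conda config --show channels prints the actual list.

```shell
# Toy model of conda's channel list: the leftmost entry has the highest priority.
# add_channel / append_channel are hypothetical names that only mimic
# 'conda config --add channels X' and 'conda config --append channels X'.
channels="defaults"   # a fresh Miniconda starts with the defaults channel

add_channel()    { channels="$1 $channels"; }   # --add puts the channel on top
append_channel() { channels="$channels $1"; }   # --append puts it at the bottom

add_channel bioconda
add_channel conda-forge
echo "$channels"   # prints: conda-forge bioconda defaults
```

Because each --add goes to the top, the channel added last (conda-forge) ends up with the highest priority.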

Create an environment for Bioinformatics programs

Now that miniconda is configured, I will create the environment that will receive the programs. I will name it bioenv. You can choose whatever name you like!

conda create -y --name bioenv python=3.6

Activating and deactivating an environment

With the bioenv created, I must activate it:

conda activate bioenv

I need to perform this step every time I want to use the programs that I will install in this environment. If you do not need to use the environment for the moment, simply deactivate it:

conda deactivate

Simply activate it again when needed.
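
In scripts, it can help to confirm that the expected environment is active before running any tool. This is a minimal sketch that relies on the CONDA_DEFAULT_ENV variable, which conda sets when an environment is activated; require_env is an illustrative helper name.

```shell
# Succeed only if the named conda environment is currently active.
# require_env is a hypothetical helper; CONDA_DEFAULT_ENV is set by 'conda activate'.
require_env() {
  local wanted="$1"
  if [ "${CONDA_DEFAULT_ENV:-}" = "$wanted" ]; then
    echo "environment '$wanted' is active"
  else
    echo "environment '$wanted' is not active; run: conda activate $wanted" >&2
    return 1
  fi
}

# Usage at the top of an analysis script:
# require_env bioenv || exit 1
```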

Installing programs

Now we can finally install our programs. Activate the environment again (only if you have deactivated it). Download the bioenv.txt file from my GitHub repository. This file contains a selection of the most used Bioinformatics programs (hat tip to Dr. István Albert). Then run:

cat bioenv.txt | xargs conda install -y
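
If a package list contains comments or blank lines, they can be filtered out before the names reach conda. This is a sketch under that assumption (the actual bioenv.txt may simply hold one package name per line); list_packages is an illustrative helper name.

```shell
# Print package names from a list file, dropping '#' comments,
# trailing whitespace and blank lines.
# list_packages is a hypothetical helper; the file layout is an assumption.
list_packages() {
  sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Preview what would be installed, then install:
# list_packages bioenv.txt
# list_packages bioenv.txt | xargs conda install -y
```

Previewing the list first makes it easier to spot a typo in a package name before conda starts resolving dependencies.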

Backing up and restoring your environment configuration

Miniconda has a special command to back up your environment configuration. Activate (if needed) the environment you want to back up and enter the command:

conda env export | grep -v "^prefix:" > bioenv.yml

This produces a YAML file in the current working folder listing the name, channels and packages of your environment. Again, I named the file bioenv.yml, but you can choose whatever name you like. Note that if you already have a bioenv.yml in your directory, it will be overwritten, so be careful.
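
One way to avoid losing an earlier export is to move any existing file aside first. This is a minimal sketch, assuming a timestamp suffix is acceptable for the backup name; keep_old_copy is an illustrative helper name.

```shell
# If the target file already exists, rename it with a timestamp suffix
# so that a new export cannot overwrite it.
# keep_old_copy is a hypothetical helper name.
keep_old_copy() {
  local target="$1"
  if [ -e "$target" ]; then
    mv "$target" "$target.$(date +%Y%m%d-%H%M%S).bak"
  fi
}

# Usage: call this first, then run the conda env export command.
# keep_old_copy bioenv.yml
```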

To restore this environment on your computer, or on another computer, first install miniconda again, and then use the command:

conda env create -f bioenv.yml

The -f flag tells conda to create the environment from the configuration in the bioenv.yml file. The first line of the yml file sets the new environment’s name, so you can change it in the file if you like. The file also lists the channels configured in the previous installation of miniconda, so the packages are retrieved from the same channels.
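
As a small illustration, the environment name can be read from the file before restoring it. The sample file written below is a made-up, shortened stand-in for a real export (its entries are illustrative only), and env_name is a hypothetical helper.

```shell
# Write a shortened, made-up example of an exported environment file.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
name: bioenv
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.6
EOF

# Print the value of the top-level 'name:' key.
env_name() {
  sed -n 's/^name:[[:space:]]*//p' "$1"
}

env_name "$sample"   # prints: bioenv
```

To restore under a different name without editing the file, conda env create also accepts the -n flag, for example conda env create -f bioenv.yml -n another-name (another-name being a placeholder).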

Conclusion

This is how I configured my system so I could use the major Bioinformatics tools out there. In summary, I:

  • Prepared a Unix (Ubuntu) system;
  • Installed miniconda, an environment manager;
  • Configured channels so I could retrieve desired software;
  • Created an environment, showed how to activate and deactivate it, and finally installed software in it;
  • Showed how to back up your environment for safekeeping or sharing with others.

In future posts I will demo some uses of the programs I installed in the new environment.

References

My Portfolio

Trying the New WSL 2. It’s Fast! (Windows Subsystem for Linux) | DigitalOcean

Miniconda

Miniconda installer

Miniconda & Conda documentation

Managing channels; conda 4.8.4.post65+1a0ab046 documentation

The Biostar Handbook: 2nd Edition