Machine Learning with Python: Supervised Classification of TCGA Prostate Cancer Data (Part 1 - Making Features Datasets)
Posted on Thu 05 November 2020 in Python
Introduction
In a previous post, I showed how to retrieve The Cancer Genome Atlas (TCGA) data from the Cancer Genomics Cloud (CGC) platform. I downloaded gene expression quantification data, created a relational database with PostgreSQL, and built a dataset uniting the raw quantification data for 675 differentially expressed genes identified by edgeR with race, age at diagnosis, and tumor size at diagnosis.
In this post, I will use Python to prepare feature datasets that I will later use to build a classification model with machine learning tools, especially the scikit-learn module. Check its documentation here.
The dataset and code presented here are available in my portfolio.
Prepare workspace
Since I will generate a statistical model, it is important to test it on never-seen-before data to assess its predictive validity. Thus, I will use a custom script, make_features.py, to transform the dataset and split it into separate train and test datasets. I will fit the model, use it to predict the prostate cancer risk of the subjects in the test dataset, and then compare the predictions with the actual status (control: prostate cancer in remission / case: prostate cancer progressing) to see how well the model predicts.
Let's examine make_features.py. First, I import the necessary modules:
from pathlib import Path
import janitor as jn
import pandas as pd
from sklearn import preprocessing, model_selection
pathlib is a module from the Python standard library that helps with file path management. janitor is a module for data clean-up and manipulation. pandas is also used for data manipulation and transformation, especially when the data is stored in data frames. sklearn is an alias for scikit-learn, a powerhouse of machine learning functions and utilities. Install all non-standard-library modules with your Python package manager, usually pip, or through Anaconda if you are using a conda environment with Python.
Next, I set up some constants that I will use later:
RANDOM_SEED = 42
TEST_SIZE = 0.30
INDEX = 676
CORR_THRESHOLD = 0.80
- RANDOM_SEED: An integer used to initialize the pseudo-random number generator. It helps generate reproducible outputs across different sessions/computers;
- TEST_SIZE: A decimal (float) number indicating the size of the test dataset. Currently, it is 30% of the samples in the complete dataset;
- INDEX: An integer indicating a slicing index. I will explain it later;
- CORR_THRESHOLD: A float number indicating a threshold for eliminating correlated variables from the dataset; it helps reduce overfitting by mitigating multicollinearity.
Next, I indicate the path of the dataset, which I saved as a CSV file:
project_folder = Path().resolve().parent.parent
CSV_FILE = project_folder / "data" / "interim" / "prostate_cancer_dataset.csv"
See below the structure of the current working directory:
.
├── data
│ ├── interim
│ │ └── prostate_cancer_dataset.csv
│ └── processed
├── models
└── src
├── features
│ └── make_features.py
└── models
└── make_model.py
Load data into Pandas
Now, I create a pandas.DataFrame using the read_csv function and assign it to the df object:
df = pd.read_csv(CSV_FILE)
Check the column names with the columns attribute:
df.columns
The status column contains the sample labels: 0 means “control” and 1 means “case”. The other columns are the variables (or features) of each sample.
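Before any transformation, it is worth taking a quick look at the class balance. A quick check (not part of make_features.py) uses pandas' value_counts method on the status column:
# How many controls (0) and cases (1) are in the full dataset
print(df["status"].value_counts())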
Dummy-encode categorical variables
I can now transform the data, making it suitable for machine learning modeling. Luckily, the TCGA dataset has no missing values and no gross errors. Otherwise, I would have to impute missing values and correct the errors. Now I will recode the categorical variables in the dataset. To check which variables are categorical, use the pandas dtypes attribute. The variables labeled with object are usually categorical. If you see object beside the name of a variable that you know is not categorical, there may be missing data or some other error.
df.dtypes
With the categorical columns identified, I dummy-encode them using pandas' get_dummies function; drop_first=True drops the first level of each categorical variable to avoid redundant columns:
df = pd.get_dummies(df, drop_first=True)
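To make the effect of drop_first=True concrete, here is a toy example with a hypothetical single-column frame (not the TCGA data):
# Toy frame with one categorical column, encoded the same way as above
toy = pd.DataFrame({"race": ["white", "black or african american", "asian"]})
print(pd.get_dummies(toy, drop_first=True))
# One level is dropped and becomes the implicit reference category:
# a sample from that level has 0 in all remaining indicator columns.
Dropping the first level avoids perfectly collinear (redundant) indicator columns.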
Split datasets
With the transformed dataset, let's go ahead and split it into training and testing datasets.
X, y = jn.get_features_targets(df, target_columns="status")
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_SEED
)
The get_features_targets function from the janitor module takes the df object and separates the sample labels from the variables (features). The features are assigned to the object X, which is returned as a pandas.DataFrame, and the status column (see the target_columns argument) is put in the y object, which is returned as a pandas.Series.
The second function, model_selection.train_test_split from sklearn, splits the X and y objects into their training and testing counterparts, yielding four objects: X_train, X_test, y_train and y_test. Note that the TEST_SIZE and RANDOM_SEED constants I set up at the beginning of the script are used here. They indicate that 30% of the dataset must go into the test dataset (and therefore 70% into the training dataset), and passing an integer to the random_state argument ensures that the splitting outputs can be reproduced.
Also note the stratify argument. It ensures that the same proportion of cases and controls is drawn for the training and testing datasets. For this, I pass the object containing the labels, which is y.
The commands below print the number of cases and controls for each dataset:
print(f"Training Set has {sum(y_train)} Positive Labels (cases) and {len(y_train) - sum(y_train)} Negative Labels (controls)")
print(f"Test Set has {sum(y_test)} Positive Labels (cases) and {len(y_test) - sum(y_test)} Negative Labels (controls)")
The output:
Training Set has 38 Positive Labels (cases) and 127 Negative Labels (controls)
Test Set has 16 Positive Labels (cases) and 55 Negative Labels (controls)
Normalize data
Now I will standardize (scale, or Z-score normalize) the numerical columns. For each numerical feature, I will calculate its mean and standard deviation. Then, for each observed value of the variable, I will subtract the mean and divide by the standard deviation.
After the dummy encoding, the categorical variables were moved to the end of the data frame. I manually checked the column names and identified the slicing index. I then sliced the list of column names and stored the result in the num_cols object:
num_cols = list(X.columns)[:INDEX]
That's why I conveniently saved the index number in the INDEX constant at the beginning of the script. It helps code reusability with different data, which is one of the advantages of avoiding so-called “magic numbers”.
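As a side note, the slicing index does not have to be found by hand. Below is a sketch of an alternative (not what make_features.py does) that would be computed on the raw df, before the get_dummies call; it assumes the categorical columns have object dtype, as checked with dtypes above, and that the target column is named status:
# Numeric feature columns = everything that is neither categorical nor the target
cat_cols = list(df.select_dtypes(include="object").columns)
num_cols = [col for col in df.columns if col not in cat_cols and col != "status"]
Since get_dummies only renames the encoded categorical columns, these names remain valid in X_train and X_test.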
With the numeric columns identified, I then use the StandardScaler class from sklearn's preprocessing module:
sca = preprocessing.StandardScaler()
X_train[num_cols] = sca.fit_transform(X_train[num_cols])
X_test[num_cols] = sca.transform(X_test[num_cols])
I call the fit_transform() method on the X_train dataset, passing the num_cols list to indicate which columns need to be normalized (using the bracket [] notation) and saving the resulting transformation under the same names in the X_train object. Thus, for all intents and purposes, I am replacing the old columns with the normalized columns.
Note that for the X_test dataset I use a different method, transform(). This is because I am using only the parameters fitted by fit_transform() on the train dataset to normalize the test dataset. This way I am sure that the scaling is not “contaminated” by the test data, which is not supposed to be seen before the classification model is fitted.
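If you want to confirm that only the training statistics are used, the fitted scaler exposes them as attributes (a quick check, not part of make_features.py):
# Means and standard deviations learned from X_train only,
# in the same order as num_cols
print(sca.mean_[:5])
print(sca.scale_[:5])
Both X_train and X_test are transformed with exactly these values.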
Remove correlated features
Now I will filter out correlated features. Please note that I would not normally do this with genomic data, but since this is just an exercise, I will show how to do it so you can apply it to your own projects when necessary. Check the code below (hat tip to Dario Radečić for the snippet):
correlated_features = set()
correlation_matrix = X_train.corr()
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > CORR_THRESHOLD:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)
The commands above calculate the pairwise correlation between all features in the dataset, creating a correlation_matrix. If the absolute correlation between two features is above CORR_THRESHOLD (currently 0.80), the loop stores the name of one of them in the correlated_features set, ensuring that no names are repeated. If you want to know how many features were flagged as correlated, run the command below to print the count to the console:
print(f"Number of correlated feature pairs: {len(correlated_features)}")
Then I can convert the set into a list and pass it to pandas' drop() method:
X_train = X_train.drop(list(correlated_features), axis=1)
X_test = X_test.drop(list(correlated_features), axis=1)
I dropped the columns and saved the data frames under the same names for convenience. Note the axis=1 argument: it tells drop() to look for the list elements among the columns and then remove them.
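An equivalent, arguably more readable spelling uses the columns keyword instead of axis=1 (just an alternative, not what the script uses):
X_train = X_train.drop(columns=list(correlated_features))
X_test = X_test.drop(columns=list(correlated_features))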
Serializing datasets for future use
Finally, the train and test datasets are ready for use! To save them to disk for later use, I use pandas' to_pickle method. A “pickled” Python object is serialized: it is converted into a series of bytes (a byte stream). This byte stream can later be “unpickled” to restore the object exactly as it was when it was pickled. This is useful for backing up data, sharing it with others, and so on. Using pathlib.Path notation, I save the objects into the “data/processed/” folder:
X_train.to_pickle(project_folder/"data"/"processed"/"X_train")
X_test.to_pickle(project_folder/"data"/"processed"/"X_test")
y_train.to_pickle(project_folder/"data"/"processed"/"y_train")
y_test.to_pickle(project_folder/"data"/"processed"/"y_test")
Note: pickled objects are Python-specific; non-Python programs may not be able to reconstruct them. WARNING: never, NEVER, unpickle data you do not trust. As the Python documentation says: “It is possible to construct malicious pickle data which will execute arbitrary code during unpickling”.
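When it is time to fit the model, the datasets can be restored with pandas' read_pickle function, for example (a sketch, not part of make_features.py):
# Restore the serialized training data exactly as it was saved
X_train = pd.read_pickle(project_folder / "data" / "processed" / "X_train")
y_train = pd.read_pickle(project_folder / "data" / "processed" / "y_train")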
Now, my folders are like this:
.
├── data
│ ├── interim
│ │ └── prostate_cancer_dataset.csv
│ └── processed
│ ├── X_train
│ ├── X_test
│ ├── y_train
│ └── y_test
├── models
└── src
├── features
│ └── make_features.py
└── models
└── make_model.py
Conclusion
In this post, I showed how to:
- Import a CSV dataset into pandas;
- Dummy-encode categorical data;
- Split the dataset into train/test datasets;
- Normalize data (Z-scores);
- Serialize (pickle) the datasets for future use.
Go to Part 2, where I show how to use these datasets to generate a classification model for predicting the risk of prostate cancer disease progression with the make_model.py script.
Subscribe to my RSS feed, Atom feed or Telegram channel to stay updated whenever I post new content.
References
The Cancer Genome Atlas Program
scikit-learn: machine learning in Python - scikit-learn 0.23.2 documentation
Feature Selection in Python — Recursive Feature Elimination
pickle — Python object serialization | Python 3.9.0 documentation