In clinical applications, feature selection is a crucial step for selecting low-dimensional biomarkers from the high-dimensional data generated by omics technologies. Here we introduce GARBO, a novel multi-island adaptive genetic algorithm that simultaneously optimizes accuracy and set size in omics-driven biomarker discovery problems.

The GARBO algorithm is currently implemented in Python.


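Install Python 2.7, then activate your virtual environment and make sure the following Python modules are available within it: sklearn, multiprocessing, matplotlib, operator, bisect, deap, numpy, random, skfuzzy, cPickle, pandas, scipy, sys, getopt. Of these, multiprocessing, operator, bisect, random, cPickle, sys and getopt ship with Python 2.7 and need no separate installation.

source activate 'your_virtual_env'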
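
Before launching a run, it can help to confirm that every module listed above is importable from the active environment. The following is a minimal sketch, not part of GARBO itself; the helper name `find_missing_modules` is hypothetical.

```python
import importlib

# Module names exactly as listed in the installation note above.
REQUIRED_MODULES = [
    "sklearn", "multiprocessing", "matplotlib", "operator", "bisect",
    "deap", "numpy", "random", "skfuzzy", "cPickle", "pandas", "scipy",
    "sys", "getopt",
]


def find_missing_modules(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Calling `find_missing_modules(REQUIRED_MODULES)` inside the environment returns an empty list when everything is in place, and otherwise the names that still need to be installed.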


To run the GARBO islands in parallel on a desktop computer:

export NGEN=100 # Number of GA iterations
export NPOP=50 # Number of chromosomes to generate for each niche
export MINL=30 # Initial minimum length of the generated chromosomes
export MAXL=50 # Initial maximum length of the generated chromosomes
export NN=2 # Number of niches
export RN=1 # Set RN=1 to compile an initial ranking of the features, or RN=0 to start with no ranking information
export INPUT_FILE="data_ccle_erl_ge.csv" # Omics dataset (samples as columns, features as rows). The last feature must be named 'class' and corresponds to the target label
export OUTPUT_DIR="MRNA_run_1"    # Folder that will contain one file of serialized Python objects per niche
nohup python -g $NGEN -p $NPOP -s $MINL -l $MAXL -n $NN -r $RN -i $INPUT_FILE -o $OUTPUT_DIR > output_mrna.log &
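
The layout expected for INPUT_FILE (features as rows, samples as columns, last feature named 'class') can be checked before a long run. The sketch below is an illustration only: the function name `check_input_layout` is hypothetical, and it assumes a comma delimiter, a header row of sample names, and feature names in the first column, none of which is confirmed above.

```python
import csv


def check_input_layout(lines):
    """Check a CSV with features as rows and samples as columns.

    `lines` is any iterable of CSV text lines (for example an open file).
    Assumed, not confirmed: comma delimiter, a header row of sample names,
    and feature names in the first column.
    Returns (n_features, n_samples), not counting the 'class' row.
    """
    rows = [row for row in csv.reader(lines) if row]
    if len(rows) < 3:
        raise ValueError("expected a header, a feature row and a 'class' row")
    header, body = rows[0], rows[1:]
    if body[-1][0] != "class":
        raise ValueError("the last row must be named 'class'")
    n_samples = len(header) - 1
    for row in body:
        if len(row) - 1 != n_samples:
            raise ValueError("row %r does not have %d values" % (row[0], n_samples))
    return len(body) - 1, n_samples


# Tiny made-up example: two features, three samples, then the label row.
example = [
    ",s1,s2,s3",
    "geneA,0.1,0.5,0.9",
    "geneB,1.2,0.3,0.7",
    "class,0,1,1",
]
print(check_input_layout(example))  # (2, 3)
```

For a real dataset, pass an open file handle instead of the in-memory list.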

To run GARBO multiple times for cross-validation (note that the single islands will run in parallel):

First, we create a bash script to launch one GARBO-task (e.g. '').

#!/bin/bash
#SBATCH --mail-type=END
#SBATCH --mail-user=##################
source activate garbo
python -g $NGEN -p $NPOP -s $MINL -l $MAXL -n $NN -r $RN -i $INPUT_FILE -o $OUTPUT_DIR
conda deactivate

Then, we create a second bash script indicating the settings for a parallel run. For instance, with the following settings, we run GARBO 10 times (e.g. 10-fold cross-validation), and each run uses 10 islands, for a total of 100 jobs.

#!/bin/bash
#SBATCH --partition parallel
#SBATCH --time 2-00:00:00
#SBATCH --mem-per-cpu=5000
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --array=1-10
export NGEN=500
export NPOP=500
export MINL=30
export MAXL=50
export NN=10
export RN=1
export INPUT_FILE="train_ge_$SLURM_ARRAY_TASK_ID.csv"
srun -o ccle_erl_ge$SLURM_ARRAY_TASK_ID.out -e ccle_ge_$SLURM_ARRAY_TASK_ID.err
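
The array job above expects one training file per task (train_ge_1.csv to train_ge_10.csv), but how those files are produced is not described here. As one possible sketch, assuming a plain shuffled round-robin split over sample names, the training samples for each fold could be chosen as follows; the function name and the splitting scheme are illustrative assumptions, not the procedure used by the authors.

```python
import random


def training_samples_per_fold(samples, k=10, seed=0):
    """Map each fold id (1..k) to the samples used for training in that fold.

    Fold ids start at 1 so that they line up with SLURM_ARRAY_TASK_ID
    when the job array is declared as 1-k.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Round-robin assignment: every k-th shuffled sample is held out together.
    held_out = dict(
        (fold, set(shuffled[fold - 1::k])) for fold in range(1, k + 1)
    )
    # Training set of a fold = all samples except those held out in that fold.
    return dict(
        (fold, [s for s in samples if s not in held_out[fold]])
        for fold in range(1, k + 1)
    )
```

Each sample is held out in exactly one fold, so writing the columns selected for fold `i` to `train_ge_<i>.csv` gives one input file per array task.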

To read the output of GARBO, the user needs to load a Python object structure from a pickle file created by GARBO (.pkl). This Python object structure consists of four data structures:

  • the last population of chromosomes (biomarker sets);
  • a list of the populations of chromosomes saved during the GA iterations;
  • a logbook reporting the evaluation metrics of each GA iteration;
  • the weights used to build the final feature ranking.
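
import cPickle

def readDataResult(pathname, nn = 10):
    # Load the output of each island; files are named <pathname><i>.pkl.
    ga_out_all = []
    list_all_chr = []
    for i in range(nn):
        with open(pathname + str(i) + '.pkl', 'rb') as f:
            # Assumed: the four structures are pickled one after another.
            ga_out = [cPickle.load(f) for _ in range(4)]
        ga_out_all.append(ga_out)
        # Collect the populations saved during the GA iterations.
        list_all_chr = list_all_chr + ga_out[1]
    return ga_out_all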
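
The `range(4)` loop in `readDataResult` suggests that each island file holds four objects pickled one after another. The round trip below is a small sketch of that assumed layout using placeholder data; the helper names, the file name and the contents of `dummy` are hypothetical and do not reflect real GARBO output.

```python
import os
import tempfile

try:
    import cPickle as pickle  # Python 2.7, as used by GARBO
except ImportError:
    import pickle  # fallback so the sketch also runs on Python 3


def write_island_file(path, objects):
    """Pickle each object in turn into a single file."""
    with open(path, "wb") as f:
        for obj in objects:
            pickle.dump(obj, f)


def read_island_file(path, n_objects=4):
    """Read back `n_objects` consecutively pickled objects."""
    with open(path, "rb") as f:
        return [pickle.load(f) for _ in range(n_objects)]


# Placeholder stand-ins for the four structures described above.
dummy = [
    [["geneA", "geneB"]],                 # last population
    [[["geneA"]], [["geneA", "geneB"]]],  # populations saved along the way
    [{"gen": 0}, {"gen": 1}],             # logbook
    {"geneA": 0.7, "geneB": 0.3},         # ranking weights
]

path = os.path.join(tempfile.mkdtemp(), "0.pkl")
write_island_file(path, dummy)
last_pop, saved_pops, logbook, weights = read_island_file(path)
```

Unpacking the result into four names keeps the order of the structures explicit, which is easier to follow than indexing into a list by position.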

Contact Information

Vittorio Fortino, Dario Greco