Hyperparameter Optimisation with Slurm
Over the past few weeks, the Computer Vision Lab has been moving over to Slurm, rather than manually running jobs on a particular machine. One of the things that excites me about this is the ability to schedule jobs, go home, and come back to results. This is much nicer than logging in repeatedly to see whether a job has finished so I can start another one.
In particular, it allows me (and anyone else in the lab, if they want) to much more easily tune hyperparameters for the deep learning models I'm working on, without any framework-specific tooling. So far, I have only played around with grid search using some shell scripts, but a Bayesian approach shouldn't be too tricky (with the help of awk) either! Who needs Python…
Here is a toy example:
#!/usr/bin/bash
LR=(1e-4 1e-5)
DECAY=(0.5 0.3 0.2 0.1 0.05 0.01 0.005)

mkdir -p runs
for train_lr in "${LR[@]}"; do
    for train_lr_decay in "${DECAY[@]}"; do
        export train_lr train_lr_decay
        sbatch --gres=gpu hypertrain.sh
    done
done
This queues up a bunch of instances of `hypertrain.sh`, which is what actually invokes the training code.
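An alternative worth noting is Slurm's job arrays, which can express the same grid as a single submission (`sbatch --array=0-13 --gres=gpu …`) rather than a loop of `sbatch` calls. Below is a sketch of how the flat array index could be decoded back into a hyperparameter pair; the script name `hypertrain_array.sh` and the fallback index of 0 are my own choices.

```shell
#!/usr/bin/bash
# Sketch: one job array task per grid point.
# Submit with: sbatch --array=0-13 --gres=gpu hypertrain_array.sh
LR=(1e-4 1e-5)
DECAY=(0.5 0.3 0.2 0.1 0.05 0.01 0.005)

# Decode the flat task index into an (lr, decay) pair.
# Falls back to 0 when run outside Slurm, for testing.
i=${SLURM_ARRAY_TASK_ID:-0}
train_lr=${LR[$(( i / ${#DECAY[@]} ))]}
train_lr_decay=${DECAY[$(( i % ${#DECAY[@]} ))]}
export train_lr train_lr_decay
echo "task $i -> lr=$train_lr decay=$train_lr_decay"
```

With 2 learning rates and 7 decay values there are 14 grid points, hence `--array=0-13`.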
#!/bin/bash
dest=runs/lr=$train_lr/lrdecay=$train_lr_decay
mkdir -p "$dest"
cp -r base/* "$dest"

pushd "$dest"
echo "Learning Rate: $train_lr"
echo "LR Decay: $train_lr_decay"
th main.lua
popd

test_model_file=$dest/model_20.t7 \
    th test.lua | \
    awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' \
    > "$dest/accuracy"
As you can see, I export some variables in the submission script: Slurm makes a copy of the current environment when queuing a job, so `hypertrain.sh` sees them. I could pass command-line arguments instead. Once training has completed for a certain number of epochs, I run a test script to check performance on a validation set, and dump the average (computed with awk) to a file called `accuracy`.
Once my experiment has completed, I can check the scores for all selected hyperparameters with another script:
#!/usr/bin/bash
eval "$(grep -E '^[A-Z]+=' train.sh)"

for train_lr in "${LR[@]}"; do
    for train_lr_decay in "${DECAY[@]}"; do
        dest=runs/lr=$train_lr/lrdecay=$train_lr_decay
        echo "$train_lr" "$train_lr_decay" \
            "$(cat "$dest"/accuracy*)"
    done
done | column -t
For a Bayesian approach, my plan is to dump the scores to a single results file, along with the hyperparameters. Each time a training run and its test finish, the same script can queue up a new experiment based on those results. A tree of experiments!
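A sketch of the bookkeeping side of that plan, sticking with awk as promised. The file name `results.txt`, its column layout (lr, decay, score), and the sample scores are all illustrative; picking the best-so-far is only the greedy starting point, and a real Bayesian step would propose a new candidate from these observations instead.

```shell
#!/usr/bin/bash
# Sketch: a shared results file with one line per finished run,
# columns: learning rate, decay, validation score (illustrative data).
cat > results.txt <<'EOF'
1e-4 0.5 0.61
1e-4 0.1 0.68
1e-5 0.5 0.57
EOF

# Find the best run so far with awk; this is where a Bayesian
# rule would instead propose the next hyperparameter pair.
best=$(awk '$3 > max { max = $3; lr = $1; decay = $2 }
            END { print lr, decay }' results.txt)
echo "best so far: $best"
```

Appending a line to `results.txt` at the end of each job, then re-running a chooser like this, gives exactly the "queue a new experiment from the same script" loop described above.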
Want to leave a comment?
Comments and feedback are welcome by email (aaron@nospam-aaronsplace.co.uk).