Hyperparameter Optimisation with Slurm
Over the past few weeks, the Computer Vision Lab has been moving over to Slurm, rather than manually running jobs on a particular machine. One of the things that excites me about this is the ability to schedule jobs, go home, and come back to results. This is much nicer than logging in repeatedly to see whether a job has finished so I can start another one.
In particular, it allows me (and anyone else in the lab, if they want) to much more easily tune hyperparameters for the deep learning models I'm working on, without any framework-specific tooling. So far, I have only played around with grid search using some shell scripts, but a Bayesian approach shouldn't be too tricky (with the help of awk) either! Who needs Python…
Here is a toy example:
#!/usr/bin/bash
LR=(1e-4 1e-5)
DECAY=(0.5 0.3 0.2 0.1 0.05 0.01 0.005)

mkdir -p runs
for train_lr in "${LR[@]}"; do
    for train_lr_decay in "${DECAY[@]}"; do
        export train_lr train_lr_decay
        sbatch --gres=gpu hypertrain.sh
    done
done
This queues up a bunch of instances of `hypertrain.sh`, which is what actually invokes the training code.
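An alternative worth noting is Slurm's job arrays, which can express the same grid as a single submission (`sbatch --array=0-13 --gres=gpu …`) rather than a loop of `sbatch` calls. Below is a sketch of how the flat array index could be decoded back into a hyperparameter pair; the script name `hypertrain_array.sh` and the fallback index of 0 are my own choices.

```shell
#!/usr/bin/bash
# Sketch: one job array task per grid point.
# Submit with: sbatch --array=0-13 --gres=gpu hypertrain_array.sh
LR=(1e-4 1e-5)
DECAY=(0.5 0.3 0.2 0.1 0.05 0.01 0.005)

# Decode the flat task index into an (lr, decay) pair.
# Falls back to 0 when run outside Slurm, for testing.
i=${SLURM_ARRAY_TASK_ID:-0}
train_lr=${LR[$(( i / ${#DECAY[@]} ))]}
train_lr_decay=${DECAY[$(( i % ${#DECAY[@]} ))]}
export train_lr train_lr_decay
echo "task $i -> lr=$train_lr decay=$train_lr_decay"
```

With 2 learning rates and 7 decay values there are 14 grid points, hence `--array=0-13`.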
#!/bin/bash
dest=runs/lr=$train_lr/lrdecay=$train_lr_decay
mkdir -p "$dest"
cp -r base/* "$dest"

pushd "$dest"
echo "Learning Rate: $train_lr"
echo "LR Decay: $train_lr_decay"
th main.lua
popd

test_model_file=$dest/model_20.t7 \
    th test.lua | \
    awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' \
    > "$dest/accuracy"
As you can see, I export some variables in the submission script: Slurm makes a copy of the current environment when queuing a job, so `hypertrain.sh` sees them. I could pass command-line arguments instead. Once training has completed for a certain number of epochs, I run a test script to check performance on a validation set, and dump the average (computed with awk) to a file called `accuracy`.
Once my experiment has completed, I can check the scores for all selected hyperparameters with another script:
#!/usr/bin/bash
eval "$(grep -E '^[A-Z]+=' train.sh)"

for train_lr in "${LR[@]}"; do
    for train_lr_decay in "${DECAY[@]}"; do
        dest=runs/lr=$train_lr/lrdecay=$train_lr_decay
        echo "$train_lr" "$train_lr_decay" \
            "$(cat "$dest"/accuracy*)"
    done
done | column -t
For a Bayesian approach, my plan is to dump the scores to a single results file, along with the hyperparameters. Each time a training run and its test finish, the same script can queue up a new experiment based on those results. A tree of experiments!
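A sketch of the bookkeeping side of that plan, sticking with awk as promised. The file name `results.txt`, its column layout (lr, decay, score), and the sample scores are all illustrative; picking the best-so-far is only the greedy starting point, and a real Bayesian step would propose a new candidate from these observations instead.

```shell
#!/usr/bin/bash
# Sketch: a shared results file with one line per finished run,
# columns: learning rate, decay, validation score (illustrative data).
cat > results.txt <<'EOF'
1e-4 0.5 0.61
1e-4 0.1 0.68
1e-5 0.5 0.57
EOF

# Find the best run so far with awk; this is where a Bayesian
# rule would instead propose the next hyperparameter pair.
best=$(awk '$3 > max { max = $3; lr = $1; decay = $2 }
            END { print lr, decay }' results.txt)
echo "best so far: $best"
```

Appending a line to `results.txt` at the end of each job, then re-running a chooser like this, gives exactly the "queue a new experiment from the same script" loop described above.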
Want to leave a comment?
Comments and feedback are welcome by email (aaron@nospam-aaronsplace.co.uk).