CVL Cluster Maintenance

Posted <2020-10-14 Wed 21:41> by Aaron S. Jackson.

#+DRAFT

On Monday I will be updating the software on the lab's GPU cluster. Aside from CentOS updates, this will also include updating Slurm. We're currently running 18.08.6-2 and will be upgrading to 20.02.5-1. There are a few nice improvements, most notably support for partial allocation of Nvidia cards. Currently, we are only able to allocate an entire card per job. Of course, we can still run multiple processes per card, but it requires extra effort and some foresight about what you want to run before you write your submission script. Another nice addition is the ability to allocate GRES per task, as opposed to per node. This is handy for embarrassingly parallel CUDA jobs, such as inference over large datasets. The final feature I am excited about is the built-in REST API, which allows job submission. I regularly deploy my models via APIs and, as yet, I've never bothered to write a nice interface between front-ends and Slurm. The built-in Slurm API should make things much easier.
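As a rough sketch of what per-task GRES could look like in a submission script: this assumes the upgraded cluster accepts sbatch's --gpus-per-task option (which depends on the cons_tres select plugin being enabled). The partition name gpu, the memory figure, and infer.py with its flags are placeholders, not anything from the real cluster.

```shell
#!/bin/bash
# Sketch only: the partition "gpu", the memory request and infer.py (with its
# --shard/--num-shards flags) are hypothetical placeholders.

# Print a batch script that requests one GPU per task instead of per node.
write_inference_job() {
    local ntasks="${1:-4}"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=inference
#SBATCH --partition=gpu
#SBATCH --ntasks=${ntasks}
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=4G

# Each task processes one shard of the dataset, selected by its rank.
srun bash -c 'python infer.py --shard "\$SLURM_PROCID" --num-shards "\$SLURM_NTASKS"'
EOF
}

# Print the generated script for inspection.
write_inference_job 4
```

On a login node the output could be piped straight into sbatch, which reads the batch script from standard input when no file is given.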

This is perhaps one of my most carefully planned upgrades. I usually make a list of things I want to do and just work my way through it. This time I have set up an ESXi host and created a dev cluster using the exact same Ansible configuration, but with a different inventory file. With this, I've been able to fully test the upgrade from our current version of Slurm to the new one. Hopefully this will reduce the number of issues I run into during the maintenance period.

I initially intended to manage the ESXi host through Ansible, but this seems to require vCenter, which we do not have. Instead, I configured eight bare-bones CentOS 7 virtual machines and created a snapshot of each called initial. After enabling SSH on the ESXi host, I am able to fully deploy and test the cluster by running a simple script:

#!/bin/bash

# Set RESET_VM to any non-empty value to revert every VM to its snapshot first.
if [ -n "$RESET_VM" ] ; then
    ssh root@asjdev.cs.nott.ac.uk <<EOF
vim-cmd vmsvc/getallvms | grep asjdev | \
    while read id name _ ; do
    snap_id=\$(vim-cmd vmsvc/snapshot.get \$id | grep -A1 initial | tail -n1 | \
                        awk -F':' '{ print \$2 }')
    vim-cmd vmsvc/snapshot.revert \$id \$snap_id 0
    vim-cmd vmsvc/power.on \$id
    done
EOF
fi

# Wait until every node answers a ping again.
for i in asjdev01 asjdev02 asjdev03 asjdev04 \
          asjdev05 asjdev06 asjdev07 asjdev08 ; do
    while ! ping -c1 "$i" ; do sleep 5 ; done
done

ansible-playbook --forks 8 \
         -i dev \
         --ask-vault-pass \
         playbook.yml

# Run hostname across seven tasks and hash the sorted output.
sum=$(
(
    ssh -i ~/.ssh/rex.id_rsa root@asjdev01.cs.nott.ac.uk <<EOF
cd /home/
srun -A cvldev -n7 --mem=128m -q cpu -p cpu hostname
EOF
) | sort -n | md5sum | awk '{ print $1 }')

[[ $sum == 'd15b2f7f90be7af1a1127a332226cabe' ]] && \
    echo "Slurm OK!" && exit 0
echo 'Something is wrong with that Slurm config...'
exit 1

The script is not particularly tidy, but when RESET_VM is set it essentially logs into the ESXi host, grabs a list of VMs matching the name asjdev, and then looks up the ID of each VM's initial snapshot. It then reverts to that snapshot and powers on the VM. Very nice! The script then waits until all VMs respond to ping before deploying the cluster through a sequence of playbooks. Finally, it logs into the login node of the Slurm setup, runs hostname across seven tasks, and compares a checksum of the sorted output against a known value.
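Two pieces of the script can be tried offline, without an ESXi host or a Slurm cluster. The sketch below pulls them out into functions. The snapshot listing is only an assumed example of what vim-cmd vmsvc/snapshot.get prints, and node01 and node02 are placeholder hostnames. Sorting before hashing matters because srun does not guarantee the order in which tasks print. The hard-coded checksum in the script was presumably produced the same way from a known-good run.

```shell
#!/bin/bash
# Sketch only: the listing below is an assumed example of snapshot.get
# output, and node01/node02 are placeholder hostnames.

# Read a snapshot listing on stdin and print the ID on the line that follows
# the first line mentioning the given snapshot name.
snapshot_id_for() {
    local name="$1"
    grep -A1 -- "$name" | tail -n1 | \
        awk -F':' '{ gsub(/[ \t]/, "", $2); print $2 }'
}

# Hash a list of hostnames on stdin, independent of the order they arrive in.
hostname_digest() {
    sort | md5sum | awk '{ print $1 }'
}

# Assumed shape of the snapshot listing for a VM with a single snapshot.
sample_listing() {
    cat <<'EOF'
Get Snapshot:
|-ROOT
--Snapshot Name        : initial
--Snapshot Id        : 1
--Snapshot State       : powered off
EOF
}

echo "snapshot id: $(sample_listing | snapshot_id_for initial)"

# The same hostnames in a different order give the same digest.
expected=$(printf '%s\n' node01 node02 | hostname_digest)
actual=$(printf '%s\n' node02 node01 | hostname_digest)
[[ $actual == "$expected" ]] && echo "digests match"
```

To regenerate the expected checksum after adding or removing nodes, the real hostnames from a healthy run can be fed to hostname_digest in the same way.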

The packages Ansible role is a bit bandwidth-intensive, so I'm thinking about mirroring those packages somewhere local. I will feel less guilty about the traffic if I decide to integrate this with GitLab CI.


Tags: work linux