Slurm and Hardware Accelerated OpenGL

Posted <2020-04-03 Fri 23:48> by Aaron S. Jackson.

Until recently our lab's GPU cluster had only really been used for deep learning. There hasn't been any need to mess about with OpenGL, or anything non-CUDA, even for Blender, since rendering could be done using Cycles, which doesn't need OpenGL. The latest version of Blender, while apparently offering a huge amount of extra functionality, needs an up-to-date version of OpenGL (3.3), which is far too new for the Mesa OpenGL.

I would have thought such a setup would be fairly well documented online. Either I'm using the wrong search terms or they keep all of this secret? Not sure! From my research online, most HPC setups seem to allocate an entire node to each job. This is not practical for us, since we have up to eight GPUs in a server and only 10 or so servers. I suspect that in such a setup an X server would be running on all GPUs continuously. Anyway, getting it working wasn't too bad, but it does require a few things to be just right.

Despite the recent security bug with X, the setuid bit needs to be set.
The xserver PAM configuration needs to be modified.
A rather messy X configuration needs to be generated.
A utility script will simplify job submission, but needs to be written!

I'll go into a little more detail on each, and dump any config files or scripts where they might be useful, in the following sections.

Implementation Details

setuid

Back in 2018 there was a bug in the way the X server handled logging, which allowed a regular user to overwrite the shadow password file and so remove the root password entirely. The setuid bit makes the process start as the owner of the program, in this case root, so the bug was easily avoided by removing that bit. It has since been patched, so hopefully re-setting the setuid bit is not an issue.

chmod u+s /usr/bin/Xorg

Note: I'm not 100% sure this is still required, having made the modifications to the xserver PAM configuration. I will test one day and update this post. Anyway, since the X server has been patched, this is not currently a big deal as far as I know.
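
To confirm the bit is actually in place after running chmod, the shell's -u file test is enough. The following is a minimal sketch; has_setuid is a hypothetical helper name used only for illustration, and the path is the one used in this post.

```shell
#!/bin/bash

# has_setuid is a hypothetical helper: it succeeds if the setuid bit is set
# on the given file. It relies only on the standard `test -u` check.
has_setuid() {
    [ -u "$1" ]
}

# Report on the path used in this post, if it exists on this machine.
if [ -e /usr/bin/Xorg ]; then
    if has_setuid /usr/bin/Xorg; then
        echo "/usr/bin/Xorg is setuid"
    else
        echo "/usr/bin/Xorg is NOT setuid"
    fi
fi
```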

PAM

On most Linux distributions (possibly all?), starting the X server requires the user to have logged in from the console. This is, of course, the standard approach to using a desktop PC. For security reasons, this is enforced through the Pluggable Authentication Modules setup. Modifying the /etc/pam.d/xserver file to the following should do the trick.

#%PAM-1.0
auth       required     pam_permit.so
account    required     pam_permit.so
session    optional     pam_keyinit.so force revoke

The default PAM config has the auth section set so that pam_rootok is sufficient and pam_console is required. Setting it to pam_permit simply allows access regardless.

Xorg config

For this setup to work, a separate X Layout has to be generated for each GPU. This is because GPUs are assigned on a per-job basis, and so there needs to be a separate X session for every GL job submitted.

The script below iterates through all NVIDIA GPUs and generates an X server layout, tied to each PCI Bus ID, in the format that the X server expects. It also generates a virtual screen and display, which is required if you don't want to have a monitor hooked up to each GPU! The mouse and keyboard inputs can be shared by all layouts.

#!/bin/bash

ID=0

lspci | grep NVIDIA | grep VGA | \
    awk '{ print $1 }' | \
    while read pci_id ; do

    device=PCI:$((0x${pci_id%:*})):0:0
    name=$(lspci -s $pci_id | sed 's/.*\[//' | sed 's/\].*//')

    cat <<EOF
Section "ServerLayout"
  Identifier "Layout$ID"
  Screen 0 "Screen$ID"
  InputDevice    "Keyboard0" "CoreKeyboard"
  InputDevice    "Mouse0" "CorePointer"
  Option "IsolateDevice" "$device"
EndSection
Section "Monitor"
  Identifier     "Monitor$ID"
  VendorName     "Unknown"
  ModelName      "Unknown"
  HorizSync       28.0 - 33.0
  VertRefresh     43.0 - 72.0
  Option         "DPMS"
EndSection
Section "Device"
    Identifier     "Device$ID"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "$name"
    BusID          "$device"
    Screen          0
EndSection
Section "Screen"
    Identifier     "Screen$ID"
    Device         "Device$ID"
    Monitor        "Monitor$ID"
    DefaultDepth    24
    Option         "UseDisplayDevice" "none"
    SubSection "Display"
        Depth           24
        Modes          "1920x1080"
    EndSubSection
EndSection

EOF

    ID=$((ID + 1))
done


cat <<EOF
Section "InputDevice"
  Identifier     "Mouse0"
  Driver         "mouse"
  Option         "Protocol" "auto"
  Option         "Device" "/dev/input/mice"
  Option         "Emulate3Buttons" "no"
  Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
  Identifier     "Keyboard0"
  Driver         "kbd"
EndSection
EOF

This script can then be called with the output being redirected to the usual /etc/X11/xorg.conf file.
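
The one subtle step in the generator above is the bus ID conversion: lspci prints PCI addresses in hexadecimal, while the BusID entry in xorg.conf is written in decimal. The sketch below isolates that conversion so it can be tried by hand. xorg_bus_id is a hypothetical helper name, and unlike the generator (which hardcodes the device and function as 0:0) it converts all three fields, which is an assumption about what other hardware might need.

```shell
#!/bin/bash

# xorg_bus_id is a hypothetical helper: it converts an lspci-style address
# ("bus:device.function", hexadecimal, optionally prefixed with a PCI domain)
# into the decimal "PCI:bus:device:function" form used in xorg.conf.
xorg_bus_id() {
    local addr=$1

    # Drop a leading PCI domain such as "0000:" if one is present.
    case $addr in
        *:*:*) addr=${addr#*:} ;;
    esac

    local bus=${addr%%:*}     # "3b:00.0" -> "3b"
    local rest=${addr#*:}     # "3b:00.0" -> "00.0"
    local dev=${rest%%.*}     # "00.0"    -> "00"
    local fn=${rest#*.}       # "00.0"    -> "0"

    # 16#N tells bash arithmetic to read N as base 16.
    echo "PCI:$((16#$bus)):$((16#$dev)):$((16#$fn))"
}

xorg_bus_id 3b:00.0    # prints PCI:59:0:0
```

Because the domain prefix is stripped, the same helper should also accept the longer form that nvidia-smi reports, such as 00000000:3B:00.0.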

Job Submission Utility Script

This script is pretty disgusting. Let me just dump it below; hopefully the comments will do a good enough job of explaining what is going on. The general idea is that we want to figure out which physical GPU has been assigned to the Slurm job, since under normal circumstances we only have a device ordinal, which is offset by cgroups. Hence, requesting a single GPU will always leave CUDA_VISIBLE_DEVICES set to 0. We can figure out which GPU has been assigned by reading its PCI ID from the output of nvidia-smi, which will only show the allocated GPU. This PCI ID can then be used to look up the X server layout in the X configuration, allowing us to start the display on the correct GPU.

Attempting to start an X server on a GPU not assigned to the Slurm job will fail, again thanks to cgroups limiting our access. Finally, we wrap our x11vnc server in a subshell and put it into the background so that it can restart once a user disconnects. I may add a password option to all of this, instead of specifying the nopw option, but to be honest, for our setup this is not particularly crucial. Of course, firewall rules are added to allow our login node access to all ports on all compute nodes, making port forwarding over SSH easy for the user.

# Make sure we are running with a GPU allocated
if [ -z "$CUDA_VISIBLE_DEVICES" ] ; then
    echo "No GPU requested. Exiting."
    exit 1
fi

# Utility function for getting user's email address.
function get_email () {
    ldapsearch -h ldap.cs.nott.ac.uk -x \
        -b "dc=cs,dc=nott,dc=ac,dc=uk" "uid=$1" | grep "mail:" | \
        head -n1 | awk -F':' '{ print $2 }'
}


# First we need to figure out the physical ID of the GPU we have been
# assigned by Slurm.
PCI_ID=$(nvidia-smi --query-gpu=gpu_bus_id --format=csv \
         | tail -n1 | cut -b10-)
PCI_ID=PCI:$((0x${PCI_ID%:*})):0:0

# We need to map this physical ID to a X11 Layout which has been
# pre-mapped in the Xorg config file.
PRIMARY_GPU=$(cat /etc/X11/xorg.conf | \
          grep -B4 ".*IsolateDevice.*${PCI_ID}" | \
          grep -o 'Layout[0-9]' | grep -o '[0-9]')

echo "Using GPU $PRIMARY_GPU on $(hostname)"
export DISPLAY=:${PRIMARY_GPU}

# Console redirection is required in the case of X as it will not
# start unless there is a pseudo-terminal allocated to it. Easiest
# thing to do is to redirect /dev/null into it.
echo "Starting X Server..."
X -layout Layout${PRIMARY_GPU} $DISPLAY </dev/null 2>/dev/null &
sleep 3

RESOLUTION=${RESOLUTION:-1900x950}
echo "Setting a screen resolution of $RESOLUTION"
xrandr --fb $RESOLUTION
sleep 2

# There is a chance that the assigned PORT might be unavailable, but
# this seems fairly unlikely given that there are a maximum of 8 X
# sessions on any of our GPU servers.
PORT=$(( 5900 + RANDOM / 500 ))

echo "Starting VNC server on port $PORT..."
(
    # This is nested as a subshell to allow it to reopen if the client
    # disconnects.
    while true ; do
        x11vnc -q -nopw -rfbport $PORT 2>/dev/null
    done
) &

printf "%70s\n" | tr ' ' '-'
echo "To access VNC, set up the following port forwarding:"
echo "   ssh -L5900:$HOSTNAME:$PORT $USER@${SLURM_SUBMIT_HOST}"
echo "and VNC to localhost:5900"
printf "%70s\n" | tr ' ' '-'

mail -s "VNC info for $SLURM_JOB_ID" \
    `get_email $USER` <<EOF
To access VNC, set up the following port forwarding:

   ssh -L5900:$HOSTNAME:$PORT $USER@${SLURM_SUBMIT_HOST}

and VNC to localhost:5900
EOF

echo "Job starting now."

# libGL is not symlinked to the nvidia library, so we'll just stick it
# in the LD_PRELOAD variable to ensure it is loaded on each program
# start.
export LD_PRELOAD=/lib64/libGLX_nvidia.so.0

# A window manager is not strictly necessary but it does make the
# whole thing quite a bit easier to use without the overhead of a full
# desktop environment.
xfwm4 2>/dev/null &
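
The layout lookup in the script above depends on the Identifier sitting exactly four lines before the IsolateDevice option, and on there being a single-digit layout number. As a hedged alternative, the sketch below does the same lookup with awk by tracking the current ServerLayout section. layout_for_bus_id is a hypothetical helper name, and it assumes a config in the format produced by the generator earlier in this post.

```shell
#!/bin/bash

# layout_for_bus_id is a hypothetical helper: it prints the numeric suffix of
# the ServerLayout whose IsolateDevice option equals the given bus ID.
#   $1: path to an xorg.conf in the format generated earlier in this post
#   $2: bus ID in Xorg form, e.g. "PCI:59:0:0"
layout_for_bus_id() {
    awk -v want="$2" '
        # Remember when we are inside a ServerLayout section.
        $1 == "Section" && $2 == "\"ServerLayout\"" { in_layout = 1; id = "" }
        $1 == "EndSection"                          { in_layout = 0 }

        # Record the layout number from a line like: Identifier "Layout3"
        in_layout && $1 == "Identifier" {
            id = $2
            gsub(/"/, "", id)
            sub(/^Layout/, "", id)
        }

        # Compare the isolated device exactly, then stop at the first match.
        in_layout && $1 == "Option" && $2 == "\"IsolateDevice\"" {
            dev = $3
            gsub(/"/, "", dev)
            if (dev == want) { print id; exit }
        }
    ' "$1"
}

# Usage, assuming PCI_ID has been computed as in the script above:
#   PRIMARY_GPU=$(layout_for_bus_id /etc/X11/xorg.conf "$PCI_ID")
```

The exact string comparison means a bus ID that is merely a prefix of another one prints nothing rather than the wrong layout.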

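The job script above picks its VNC port at random and accepts the small chance of a collision. If that ever becomes a problem, one option is to probe the candidate port first. This is only a sketch under assumptions: port_in_use and pick_vnc_port are hypothetical helper names, the probe uses bash's /dev/tcp redirection (a bash feature, not a real device file), and there is still a small window between the check and x11vnc binding the port.

```shell
#!/bin/bash

# port_in_use is a hypothetical helper: it succeeds if something accepts TCP
# connections on the given local port. The connection is opened in a subshell
# so the file descriptor is closed again as soon as the check finishes.
port_in_use() {
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# pick_vnc_port is a hypothetical helper: it draws from the same 5900-5965
# range as the job script and retries a few times while the candidate is taken.
pick_vnc_port() {
    local port attempt
    for attempt in 1 2 3 4 5 6 7 8 9 10; do
        port=$(( 5900 + RANDOM / 500 ))
        if ! port_in_use "$port"; then
            echo "$port"
            return 0
        fi
    done
    return 1   # gave up: every candidate tried was busy
}

# Usage in place of the plain assignment:
#   PORT=$(pick_vnc_port) || { echo "No free VNC port found."; exit 1; }
```
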
Putting the Pieces Together

Handling the configuration for all GPU servers can be managed elegantly using a short Ansible playbook:

---
- name: setup Xorg stuff
  hosts: nodes_gpu
  tasks:
    - name: enable setuid Xorg
      shell: chmod u+s /usr/bin/Xorg
    - name: generate Xorg config
      shell: /usr2/sbin/xconfig > /etc/X11/xorg.conf
    - name: Modify pam xserver config
      copy:
        content: |
          #%PAM-1.0
          auth       required     pam_permit.so
          account    required     pam_permit.so
          session    optional     pam_keyinit.so force revoke
        dest: /etc/pam.d/xserver

Just for the sake of clarity, /usr2 is our NFS-mounted file system for statically compiled binaries, source code and scripts. Our OpenGL utility script is also stored in this file system, and so an example Slurm job submission script would look similar to this:

#!/bin/bash

#SBATCH --gres gpu
source /usr2/share/gl.sbatch # The messy script above

blender

Fortunately all this complexity (or mess, if you'd rather call it that) is easily abstracted away from the user, who can simply check their email and find the command to execute to set up VNC port forwarding. In terms of managing access to GPUs, it's exactly as it would be for CUDA stuff. It also allows multiple users on a single GPU node, just as it would for running CUDA jobs. Finally, it does not require an X server to be running for each GPU continuously. Idle X servers only take up a small amount of video RAM, but that's memory which might be needed by a user's CUDA job.

Quite a fun thing to hack together.
