
Distributed Learning in PyTorch

Posted <2018-12-08 Sat 18:44> by Aaron S. Jackson.

This is very old - it's almost definitely all changed now!

This is just a quick brain dump about getting PyTorch to use multiple nodes. I am assuming that the following is available:

- a cluster managed by Slurm, with multiple GPU nodes;
- a home directory shared across the nodes (in my case via NFS);
- a fast network (e.g. 10GbE) between the nodes;
- CUDA installed on each node.

Let's start then.

Quick thoughts on PyTorch

I'm still using Lua Torch for my research. I think it is a beautifully designed framework. However, it seems to be gradually slipping into something which is no longer maintained. I'm willing to try PyTorch now that it has hit its version 1 release, but I'm also going to look into Deep Learning 4 Java with a Clojure wrapper.

PyTorch Installation

I refuse to use (Ana|Mini)conda and as such installed PyTorch using pip in a Python 3.6 virtualenv. Our nodes have CUDA 8 pre-installed and are running CentOS 7.4.
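
For reference, the setup looked roughly like this; a minimal sketch, and the exact wheel selection for CUDA 8 depends on the install matrix on pytorch.org:

# Create an isolated Python 3.6 environment and install PyTorch into it.
virtualenv -p python3.6 pytorch-env
source pytorch-env/bin/activate
pip install torch torchvision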

Running something

I got this running with the PyTorch ImageNet example. You'll need to download either the full ImageNet or ImageNet-200 from the website to run it.
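
The example lives in the pytorch/examples repository (I assume this is what the original link pointed to); the script in question is imagenet/main.py:

# Grab the official examples and move into the ImageNet one.
git clone https://github.com/pytorch/examples
cd examples/imagenet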

I made the following changes to the example's main.py, which allow it to use the environment variables provided by Slurm (in this diff, the < lines are my version and the > lines are the original):

< parser.add_argument('-j', '--workers', default=8, type=int, metavar='N',
> parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
< parser.add_argument('--world-size', default=int(os.environ['SLURM_NPROCS']), type=int,
> parser.add_argument('--world-size', default=-1, type=int,
< parser.add_argument('--rank', default=int(os.environ['SLURM_PROCID']), type=int,
> parser.add_argument('--rank', default=-1, type=int,
<     import socket
<     print(socket.gethostname() + ' .... ' + str(args.rank))
<     if args.rank > 0:
<         import time
<         time.sleep(5)
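
The two Slurm variables those defaults rely on are set per task. A quick way to check what each task will see (a throwaway sketch, not part of the training setup):

# Prints one line per task: hostname, rank, and world size.
srun -n2 -N2 bash -c 'echo "$(hostname): rank=$SLURM_PROCID of $SLURM_NPROCS"'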

I have a script, which I'll call launch.sh here, that detects some network settings and starts the Python script. The script assumes your fast network sits in a particular subnet; make sure you change the subnet prefix to match your own setup.



#!/bin/bash

# Placeholders -- change these to match your own setup.
SUBNET=10.10.10.          # prefix of the fast network's subnet
MASTER_FILE=$HOME/master  # lives on NFS, so visible to every node

# Tell Gloo to use the 10GbE.
export GLOO_SOCKET_IFNAME=$(ifconfig | grep -B1 $SUBNET | \
                                head -n1 | \
                                awk -F':' '{ print $1 }')

# Figure out which node is going to be king. The king writes its
# address to the shared file; everyone else waits for it to appear.
if [ $SLURM_PROCID -eq 0 ] ; then
    IP=$(ifconfig | grep $SUBNET | awk '{ print $2 }')
    echo "Master IP addr $IP"
    echo "tcp://$IP:23456" > $MASTER_FILE  # the port is an arbitrary choice
else
    while [ ! -f $MASTER_FILE ] ; do
        sleep 1
    done
fi
MASTER=$(cat $MASTER_FILE)

# main.py is the modified ImageNet example; the trailing path is a
# placeholder for the dataset directory.
python main.py --arch resnet50 \
       --batch-size 64 \
       --dist-url $MASTER \
       --dist-backend gloo \
       --multiprocessing-distributed \
       /path/to/imagenet

Also note that the master node's IP address is set in --dist-url. The master's URL is distributed to the other nodes via a file in my home directory, which is the same on all machines (NFS). The following command allocates two nodes, each with four GPUs. The Python script itself will use all four GPUs, hence we start two instances of the script, spread across two nodes.

srun --nodelist=beast,rogue -n2 -N2 --gres=gpu:4 -q unlimited ./launch.sh

This should work but doesn't perform very well. I have found it performs much better if I run two instances per machine (and therefore each GPU is used twice), by specifying -n4. Our cluster is too busy to get away with testing this on a larger number of GPUs.
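
For reference, that variant looks like this (same hypothetical launch.sh as above, now with two tasks per node):

srun --nodelist=beast,rogue -n4 -N2 --gres=gpu:4 -q unlimited ./launch.sh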

Wanting to leave a comment?

Comments and feedback are welcome by email.

Tags: computervision

