New ZFS Storage for the Computer Vision Lab
At my suggestion, we recently purchased a new file server for the Computer Vision Lab. This has become necessary because we are moving our GPU machines over to Slurm - no more direct logins! Without centralising dataset storage, coordinating which machine a job should run on becomes just as much of a burden as logging into a machine directly. Our decision to move over to a scheduler was primarily driven by greedy GPU usage, but it has many other advantages (in my opinion) too.
ZFS
The file server itself is a CyberStore 224S machine with an external CyberStore 212S JBOD connected via a direct-attached SAS3 link. The server runs FreeBSD 11 and houses six 4TB SSDs intended for caching (more on this later), while the JBOD contains twelve 12TB 7200RPM SAS disks. This raw 144TB of storage is divided into two pools:
    NAME          STATE     READ WRITE CKSUM
    db            ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        da13      ONLINE       0     0     0
        da14      ONLINE       0     0     0
        da15      ONLINE       0     0     0
      raidz1-1    ONLINE       0     0     0
        da16      ONLINE       0     0     0
        da17      ONLINE       0     0     0
        da18      ONLINE       0     0     0
    cache
      da1         ONLINE       0     0     0

    home          ONLINE       0     0     0
      raidz2-0    ONLINE       0     0     0
        da7       ONLINE       0     0     0
        da8       ONLINE       0     0     0
        da9       ONLINE       0     0     0
        da10      ONLINE       0     0     0
        da11      ONLINE       0     0     0
        da12      ONLINE       0     0     0
At this point, you may be wondering why db is configured differently from home. My justification is as follows: RAIDZ2 can be a bit of a performance killer in terms of IOPS. The db pool requires very fast reads and its data is, in general, replaceable, so it is configured as a stripe of two RAIDZ1 vdevs. The data in home, while backed up, would be much more annoying to recover, and its I/O demand is significantly lower; a single RAIDZ2 vdev should therefore provide a fair balance between IOPS and data integrity.
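For reference, a sketch of zpool create commands that would produce this layout (device names as in the status output above; any additional pool properties set at creation time are omitted):

```sh
# Dataset pool: a stripe of two RAIDZ1 vdevs, plus one SSD as L2ARC.
zpool create db \
    raidz1 da13 da14 da15 \
    raidz1 da16 da17 da18 \
    cache da1

# Home pool: a single RAIDZ2 vdev across six disks.
zpool create home raidz2 da7 da8 da9 da10 da11 da12
```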
Currently we only have one of the six SSDs hooked up as cache to db. I will add more and keep an eye on the hit/miss rates of both the L1 and L2 ARC. I am tempted to make one of the SSDs a log device for home, but we'll see how we get on. A common argument against spending money on L2ARC is that the money could be better spent on RAM. In many cases, yes, absolutely, this is true. However, there is a very specific type of workload this pool will be used for: random I/O over large(ish) datasets for the purposes of training convolutional neural networks. We'd be lucky to fit one person's dataset into RAM, and it would just get flushed out by another user's job. While adoption so far is still quite low, we currently have an L2ARC hit rate of about 65%. This will of course drop slightly as time goes on, but at that point I will likely add a second SSD and keep an eye on performance.
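Adding another cache device later is a one-liner, and the ARC/L2ARC hit and miss counters can be read straight from the ZFS kstats; a sketch, with the device name as a placeholder:

```sh
# Attach a second SSD as an additional L2ARC device (da2 is a placeholder).
zpool add db cache da2

# Rough hit/miss check from the ZFS kstats on FreeBSD.
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses
```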
Networking
Our GPU machines run CentOS. This is probably not the most popular choice among computer vision research groups (or perhaps even our own!), but in general it is rock solid. However, there is a cost to this "stability": old drivers! The ixgbe kernel module shipped with the 3.10 kernel for the Intel 82599 card is incredibly out of date, so it is necessary to update it to a newer version.
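The update itself is the usual out-of-tree build of Intel's ixgbe driver source; roughly the following on each CentOS machine (the version number is purely illustrative):

```sh
VER=5.3.8            # illustrative version number, not the one we actually used
tar xzf ixgbe-${VER}.tar.gz
cd ixgbe-${VER}/src
make install         # requires the matching kernel-devel package

# Rebuild the initramfs so the new module is picked up at boot, then reload it
# (do this from the console, not over the network interface in question).
dracut --force
modprobe -r ixgbe && modprobe ixgbe
```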
Additionally, we have had many issues with one particular type of motherboard in conjunction with this card - the ASRock EP2C612 WS. Urgh. Things seem relatively stable now, but it required disabling ASPM with the kernel option pcie_aspm=off.
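On CentOS 7 that option goes onto the kernel command line via grub; a sketch:

```sh
# Append pcie_aspm=off to GRUB_CMDLINE_LINUX in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX="... pcie_aspm=off"
# then regenerate the grub config and reboot.
# (On an EFI install the output path is /boot/efi/EFI/centos/grub.cfg instead.)
grub2-mkconfig -o /boot/grub2/grub.cfg
```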
On the FreeBSD side of things, a driver update was also required. Before updating we had two kernel panics; since updating there hasn't been another, and I hope it stays that way. Updating the driver had two other noticeable effects. The first, which is great, is that all four devices, ix0, ix1, ix2 and ix3, now show up. For some reason, after the first reboot, only two or three of them would appear; immediately after upgrading the driver and rebooting, the missing devices reappeared. The second is that LACP load balancing for outgoing traffic went out the window. It seems the updated driver likes to use hardware-accelerated hashing of the packets, but only pays attention to Layer 2 information. This is no good for NFS, and perhaps many other types of traffic. To correct this, I had to recreate the lagg with the use_flowid option, which forces the hashing onto the CPU with an insignificant performance penalty.
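For completeness, the lagg is defined in /etc/rc.conf along these lines (interface names match the above, the address is illustrative); the flowid behaviour can then be adjusted per lagg with ifconfig, or globally via the net.link.lagg.default_use_flowid sysctl:

```sh
# /etc/rc.conf (illustrative) - LACP lagg across the four 10GbE ports.
ifconfig_ix0="up"
ifconfig_ix1="up"
ifconfig_ix2="up"
ifconfig_ix3="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1 laggport ix2 laggport ix3 192.168.1.10/24"
```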
When updating the driver for FreeBSD, make sure you check the README. For this particular version it is necessary to change hw.intr_storm_threshold to 0.
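That is, on the file server:

```sh
# Disable interrupt storm detection, as per the driver's README.
sysctl hw.intr_storm_threshold=0
# Persist the setting across reboots.
echo 'hw.intr_storm_threshold=0' >> /etc/sysctl.conf
```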
So, in general, I can't say the NIC we chose was the best choice. It certainly has some very irritating issues, most, if not all, of which I think I have managed to mitigate.
The more stable side of the networking comes from the Ubiquiti EdgeSwitch 16XG, which is a very impressive switch for the price. We had been advised against it because of its small (2MB) buffer, which could potentially overflow and result in dropped packets. We haven't had any problems with this. I assume that kind of issue would mainly occur when trying to squeeze traffic from multiple hosts into a single adapter on one host; in our use case the opposite is true, with most data flowing out of a lagg to many hosts. No problems so far!
Performance Optimisations
On the NFS clients, I've found that setting the NFS block size to the maximum supported under CentOS is the fastest option. This is 131072 bytes for both reads and writes.
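Concretely, that just means pinning rsize and wsize in the mount options; an illustrative /etc/fstab entry, with the server name and mount point as placeholders:

```sh
# /etc/fstab on a CentOS client (hostname and paths are placeholders).
fileserver:/db  /db  nfs  rsize=131072,wsize=131072,hard,noatime  0 0
```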
Most of the optimisations have been made on the server side; they are listed here and collected into a snippet after the list:
- kern.maxvnodes=17027808: vfs.numvnodes exceeded the default limit very easily, even under fairly low load, which resulted in intense staggering of read operations across all clients.
- vfs.zfs.prefetch_disable=1: I am still unsure about whether this one is a sensible idea. The success rate of ZFS prefetches appeared very low in our application, so I figure this might save some IOPS.
- vfs.zfs.l2arc_write_max=134217728 and vfs.zfs.l2arc_write_boost=134217728: the defaults are potentially too slow and prevent the L2ARC from warming up quickly enough. I might fine-tune these later.
- zfs set atime=off db: we do not need to know when random dataset files were last accessed; recording this was just a waste of I/O.
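Collected together, this is roughly how they are applied and persisted (whether a given ZFS tunable must instead go in /boot/loader.conf depends on the FreeBSD release):

```sh
# Apply at runtime; add the same lines to /etc/sysctl.conf to persist.
sysctl kern.maxvnodes=17027808
sysctl vfs.zfs.prefetch_disable=1
sysctl vfs.zfs.l2arc_write_max=134217728
sysctl vfs.zfs.l2arc_write_boost=134217728

# Dataset property, applied once to the pool.
zfs set atime=off db
```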
Performance Testing
I submitted the following to four nodes via Slurm:
find /db/pszaj -type f -print0 | xargs -P64 -0 -I{} dd if={} of=/dev/null
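For reference, roughly what the wrapping batch script looks like, launched once per node; the partition name and CPU count are illustrative:

```sh
#!/bin/bash
#SBATCH --job-name=read-bench
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=1
#SBATCH --cpus-per-task=64

# Read every file under the dataset directory, 64 dd processes at a time,
# throwing the data away; this exercises parallel reads over NFS.
find /db/pszaj -type f -print0 | xargs -P64 -0 -I{} dd if={} of=/dev/null
```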
The contents of my db directory are very mixed. There are several hundred thousand large 7-10MB files, and a similar number of small 100KB files.
Outgoing traffic from the lagg on the file server peaked at a conservative 17Gbps, with an average of perhaps 13Gbps, so let's say I am fairly happy with the performance. Of course, I ran this a few times first, which allowed the L2ARC to warm up, but this is the expected behaviour when training a CNN, so I feel comfortable with it. Steady synchronous reads will, of course, be faster.
Perhaps unsurprisingly, I am not too concerned about the write performance in this case.
Wanting to leave a comment?
Comments and feedback are welcome by email (aaron@nospam-aaronsplace.co.uk).