New ZFS Storage for the Computer Vision Lab
At my suggestion, we recently purchased a new file server for the Computer Vision Lab. This has become necessary because we are moving our GPU machines over to Slurm - no more direct logins! Without centralising dataset storage, coordinating which machine a job should run on becomes just as much of a burden as logging into a machine directly. Our decision to move over to a scheduler was primarily driven by greedy GPU usage, but it has many other advantages (in my opinion) too.
ZFS
The file server itself is a CyberStore 224S machine with an external CyberStore 212S JBOD connected via a direct-attached SAS3 link. The server runs FreeBSD 11 and houses six 4TB SSDs intended for caching (more on this later), while the JBOD contains twelve 12TB 7200RPM SAS disks. This raw 144TB of storage is divided into two pools:
    NAME          STATE     READ WRITE CKSUM
    db            ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        da13      ONLINE       0     0     0
        da14      ONLINE       0     0     0
        da15      ONLINE       0     0     0
      raidz1-1    ONLINE       0     0     0
        da16      ONLINE       0     0     0
        da17      ONLINE       0     0     0
        da18      ONLINE       0     0     0
    cache
      da1         ONLINE       0     0     0

    home          ONLINE       0     0     0
      raidz2-0    ONLINE       0     0     0
        da7       ONLINE       0     0     0
        da8       ONLINE       0     0     0
        da9       ONLINE       0     0     0
        da10      ONLINE       0     0     0
        da11      ONLINE       0     0     0
        da12      ONLINE       0     0     0
At this point, you may be wondering why db is configured differently from home. My justification is as follows: RAIDZ2 can be a bit of a performance killer in terms of IOPS. The db pool requires very fast reads and its data is, in general, replaceable, so it is configured as a stripe of two RAIDZ1 vdevs. The data in home, while backed up, would be much more annoying to recover, and its I/O demand is significantly lower; a single RAIDZ2 vdev should therefore provide a fair balance between IOPS and data integrity.
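For reference, a sketch of zpool create commands that would produce this layout (device names as in the status output above; any additional pool properties set at creation time are omitted):

```sh
# Dataset pool: a stripe of two RAIDZ1 vdevs, plus one SSD as L2ARC.
zpool create db \
    raidz1 da13 da14 da15 \
    raidz1 da16 da17 da18 \
    cache da1

# Home pool: a single RAIDZ2 vdev across six disks.
zpool create home raidz2 da7 da8 da9 da10 da11 da12
```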
Currently we only have one of the six SSDs hooked up as cache to db. I will add more and keep an eye on the hit/miss rates of both the L1 and L2 ARC. I am tempted to make one of the SSDs a log device for home, but we'll see how we get on. A common argument against spending money on L2ARC is that the money could be better spent on RAM. In many cases, yes, absolutely, this is true. However, there is a very specific type of workload this pool will be used for: random I/O over large(ish) datasets for the purposes of training convolutional neural networks. We'd be lucky to fit one person's dataset into RAM, and it would just get flushed out by another user's job. While adoption so far is still quite low, we currently have an L2ARC hit rate of about 65%. This will of course drop slightly as time goes on, but at that point I will likely add a second SSD and keep an eye on performance.
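Adding another cache device later is a one-liner, and the ARC/L2ARC hit and miss counters can be read straight from the ZFS kstats; a sketch, with the device name as a placeholder:

```sh
# Attach a second SSD as an additional L2ARC device (da2 is a placeholder).
zpool add db cache da2

# Rough hit/miss check from the ZFS kstats on FreeBSD.
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses
```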
Networking
Our GPU machines run CentOS. This is probably not the most popular choice among computer vision research groups (or perhaps even our own!), but in general it is rock solid. However, there is a cost to this "stability": old drivers! The ixgbe kernel module shipped with the 3.10 kernel for the Intel 82599 card is incredibly out of date, so it is necessary to update it to a newer version.
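The update itself is the usual out-of-tree build of Intel's ixgbe driver source; roughly the following on each CentOS machine (the version number is purely illustrative):

```sh
VER=5.3.8            # illustrative version number, not the one we actually used
tar xzf ixgbe-${VER}.tar.gz
cd ixgbe-${VER}/src
make install         # requires the matching kernel-devel package

# Rebuild the initramfs so the new module is picked up at boot, then reload it
# (do this from the console, not over the network interface in question).
dracut --force
modprobe -r ixgbe && modprobe ixgbe
```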
Additionally, we have had many issues with one particular type of motherboard in conjunction with this card - the ASRock EP2C612 WS. Urgh. Things seem relatively stable now, but it required disabling ASPM with the kernel option pcie_aspm=off.
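On CentOS 7 that option goes onto the kernel command line via grub; a sketch:

```sh
# Append pcie_aspm=off to GRUB_CMDLINE_LINUX in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX="... pcie_aspm=off"
# then regenerate the grub config and reboot.
# (On an EFI install the output path is /boot/efi/EFI/centos/grub.cfg instead.)
grub2-mkconfig -o /boot/grub2/grub.cfg
```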
On the FreeBSD side of things, a driver update was also required. Before updating we had two kernel panics; since updating there hasn't been another, and I hope it stays that way. Updating the driver had two other noticeable effects. The first, which is great, is that all four devices, ix0, ix1, ix2 and ix3, now show up. For some reason, after the first reboot, only two or three of them would appear; immediately after upgrading the driver and rebooting, the missing devices reappeared. The second is that LACP load balancing for outgoing traffic went out the window. It seems the updated driver likes to use hardware-accelerated hashing of the packets, but only pays attention to Layer 2 information. This is no good for NFS, and perhaps many other types of traffic. To correct this, I had to recreate the lagg with the use_flowid option, which forces the hashing onto the CPU with an insignificant performance penalty.
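For completeness, the lagg is defined in /etc/rc.conf along these lines (interface names match the above, the address is illustrative); the flowid behaviour can then be adjusted per lagg with ifconfig, or globally via the net.link.lagg.default_use_flowid sysctl:

```sh
# /etc/rc.conf (illustrative) - LACP lagg across the four 10GbE ports.
ifconfig_ix0="up"
ifconfig_ix1="up"
ifconfig_ix2="up"
ifconfig_ix3="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1 laggport ix2 laggport ix3 192.168.1.10/24"
```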
When updating the driver for FreeBSD, make sure you check the README. For this particular version it is necessary to change hw.intr_storm_threshold to 0.
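That is, on the file server:

```sh
# Disable interrupt storm detection, as per the driver's README.
sysctl hw.intr_storm_threshold=0
# Persist the setting across reboots.
echo 'hw.intr_storm_threshold=0' >> /etc/sysctl.conf
```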
So, in general, I can't say the NIC we chose was the best choice. It certainly has some very irritating issues, most, if not all, of which I think I have managed to mitigate.
The more stable side of the networking comes from the Ubiquiti EdgeSwitch 16XG, which is a very impressive switch for the price. We had been advised against it because of its small (2MB) buffer, which could potentially overflow and result in dropped packets. We haven't had any problems with this. I assume that kind of issue would mainly occur when trying to squeeze traffic from multiple hosts into a single adapter on one host; in our use case the opposite is true, with most data flowing out of a lagg to many hosts. No problems so far!
Performance Optimisations
On the NFS clients, I've found that setting the NFS block size to the maximum supported under CentOS is the fastest option. This is 131072 bytes for both reads and writes.
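Concretely, that just means pinning rsize and wsize in the mount options; an illustrative /etc/fstab entry, with the server name and mount point as placeholders:

```sh
# /etc/fstab on a CentOS client (hostname and paths are placeholders).
fileserver:/db  /db  nfs  rsize=131072,wsize=131072,hard,noatime  0 0
```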
Most of the optimisations have been made on the server side; they are listed here and collected into a snippet after the list:
- kern.maxvnodes=17027808: vfs.numvnodes exceeded the default limit very easily, even under fairly low load, which resulted in intense staggering of read operations across all clients.
- vfs.zfs.prefetch_disable=1: I am still unsure about whether this one is a sensible idea. The success rate of ZFS prefetches appeared very low in our application, so I figure this might save some IOPS.
- vfs.zfs.l2arc_write_max=134217728 and vfs.zfs.l2arc_write_boost=134217728: the defaults are potentially too slow and prevent the L2ARC from warming up quickly enough. I might fine-tune these later.
- zfs set atime=off db: we do not need to know when random dataset files were last accessed; recording this was just a waste of I/O.
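Collected together, this is roughly how they are applied and persisted (whether a given ZFS tunable must instead go in /boot/loader.conf depends on the FreeBSD release):

```sh
# Apply at runtime; add the same lines to /etc/sysctl.conf to persist.
sysctl kern.maxvnodes=17027808
sysctl vfs.zfs.prefetch_disable=1
sysctl vfs.zfs.l2arc_write_max=134217728
sysctl vfs.zfs.l2arc_write_boost=134217728

# Dataset property, applied once to the pool.
zfs set atime=off db
```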
Performance Testing
I submitted the following to four nodes via Slurm:
find /db/pszaj -type f -print0 | xargs -P64 -0 -I{} dd if={} of=/dev/null
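For reference, roughly what the wrapping batch script looks like, launched once per node; the partition name and CPU count are illustrative:

```sh
#!/bin/bash
#SBATCH --job-name=read-bench
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=1
#SBATCH --cpus-per-task=64

# Read every file under the dataset directory, 64 dd processes at a time,
# throwing the data away; this exercises parallel reads over NFS.
find /db/pszaj -type f -print0 | xargs -P64 -0 -I{} dd if={} of=/dev/null
```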
The contents of my db directory are very mixed. There are several hundred thousand large 7-10MB files, and a similar number of small 100KB files.
Outgoing traffic from the lagg on the file server peaked at a conservative 17Gbps, with an average of perhaps 13Gbps, so let's say I am fairly happy with the performance. Of course, I ran this a few times first, which allowed the L2ARC to warm up, but this is the expected behaviour when training a CNN, so I feel comfortable with it. Steady synchronous reads will, of course, be faster.
Perhaps unsurprisingly, I am not too concerned about the write performance in this case.
Wanting to leave a comment?
Comments and feedback are welcome by email (aaron@nospam-aaronsplace.co.uk).