Problems with PCIe switching
In the Computer Vision Lab, we recently purchased a new GPU machine ("Havok"), which has been sent back due to issues with the PCIe switching. This machine is built on the SuperMicro TRT2 system. We have another machine ("Storm"), which is built on the SuperMicro TRT system, the generation before the TRT2. The difference between these two machines is the PCIe switching. In the TRT2, the eight primary PCIe sockets are connected to one CPU via two 96-lane PCIe switches. In the TRT, there is a tree of PCIe switches, so only two cards are connected to each switch.
One would expect the TRT2 to be much faster. Fewer switches surely means lower latency, and if it is the newer model, why wouldn't it be faster? Unfortunately this does not appear to be the case, which is why the machine has been returned. For example, copying a contiguous 64MB block of memory 1000 times between two GPUs on the same switch (GPUs 0 and 1) takes 34 seconds. Copying the same block between two GPUs on different switches (GPUs 0 and 7) takes 7 seconds. The same test on Storm, a TRT machine, takes 7 seconds via a single switch (GPUs 0 and 1), 8 seconds via two switches (GPUs 0 and 3), and 17 seconds via three switches and the QPI bus (GPUs 0 and 7), which is the progression one would expect as the path gets longer.
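To put the timings above on a common scale, they can be converted into a rough effective throughput. The following is a minimal sketch, not part of our test: it assumes that every one of the 1000 iterations moves the full 64MB block and that the whole wall-clock time is spent copying, so the results are estimates only. The names `BLOCK_BYTES`, `COPIES`, `timings_s` and `throughput_gb_per_s` are made up for this illustration.

```python
# Rough effective throughput implied by the quoted wall-clock timings.
# Assumption: all 1000 iterations copy the full block and the entire
# runtime is copy time, so these figures are estimates only.

BLOCK_BYTES = 256 * 256 * 256 * 4  # 256^3 floats, 4 bytes each = 64 MiB
COPIES = 1000

# Seconds per run, as reported in the text above.
timings_s = {
    "Havok (TRT2), one switch, GPUs 0 and 1": 34,
    "Havok (TRT2), two switches, GPUs 0 and 7": 7,
    "Storm (TRT), one switch, GPUs 0 and 1": 7,
    "Storm (TRT), two switches, GPUs 0 and 3": 8,
    "Storm (TRT), three switches + QPI, GPUs 0 and 7": 17,
}


def throughput_gb_per_s(seconds, copies=COPIES, block_bytes=BLOCK_BYTES):
    """Return the implied throughput in GB/s (1 GB = 1e9 bytes)."""
    return copies * block_bytes / seconds / 1e9


for label, seconds in timings_s.items():
    print(f"{label}: {throughput_gb_per_s(seconds):.2f} GB/s")
```

Under these assumptions the same-switch path on Havok works out to roughly 2 GB/s, against roughly 9.6 GB/s for the cross-switch path on the same machine, which is about a fivefold gap.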
Training a network on the same GPU selections should yield similar degradation. The script we used for the copy test is built on the Torch scientific computing framework:
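require 'cutorch'

-- Torch device indices are 1-based, so device 1 is GPU 0.
cutorch.setDevice(1)
-- 256 * 256 * 256 floats at 4 bytes each is the 64MB block.
local orig = torch.FloatTensor(256, 256, 256):cuda()

for i = 1, 1000 do
   -- Alternate the current device between devices 1 and 2.
   cutorch.setDevice(i % 2 + 1)
   -- Clone the block; the copy is allocated on the current device.
   local tmp = orig:clone()
end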
This is definitely something we should watch for in the future. My first assumption was that the PCIe bus is simplex and doesn't have enough bandwidth to accommodate both the push and pull of data between GPUs on the same switch, whereas going between two GPUs on separate switches allows data to be pushed and pulled at the same time. However, Wikipedia tells me that PCIe is a differentially signalled, full-duplex bus: each lane consists of two differential pairs (four wires), one pair for each direction, between each card and switch. So, while this is not a limitation of the bus itself, it might still be a limitation of the 96-lane PCIe switches used in the TRT2 that the 48-lane PCIe switches used in the TRT do not share.
We still don't know exactly what the problem is. We are hoping our supplier can figure it out based on the results we have demonstrated to them. Still, this is quite an interesting problem, perhaps even a total oversight on SuperMicro's side.