My objective is to eliminate performance bottlenecks in our iSCSI setup. We currently have several XenServer hosts running virtual machines from our iSCSI storage, as well as Hyper-V virtual machines on separate LUNs. Over the last few months we have seen increasing delay on our iSCSI network.
You can check the delay of iSCSI traffic in several ways, but what I noticed on our virtual machines running Debian was the I/O delay. Run the following command on one of the virtual machines:
vmstat 10 5
The wa column (the last column in older vmstat versions; newer versions print st after it) shows the percentage of CPU time spent waiting for I/O. If this is higher than 15-20 you should be worried. We had this on several virtual machines, causing huge delays in services and response times. Various suggestions pointed to switching on flow control on the switch ports of both the XenServer hosts and the iSCSI storage.
We managed to fix this situation with the following actions:
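To avoid reading the column by eye, the check can be scripted. What follows is only a minimal sketch and not part of our original fix: avg_iowait is a hypothetical helper name, it finds the wa column by its header name rather than by position, and it skips the first sample because the first vmstat report shows averages since boot.

```shell
# Print the average of the "wa" column from vmstat output read on stdin.
# avg_iowait is a hypothetical helper name, not a standard tool.
avg_iowait() {
  awk '
    # The header row starts with "r" and names the columns; find "wa".
    $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
    # Data rows start with a number. Skip the first one (averages since boot).
    col && $1 ~ /^[0-9]+$/ { n++; if (n > 1) { sum += $col; cnt++ } }
    # Print the truncated average, or fail if there was nothing to average.
    END { if (cnt) printf "%d\n", sum / cnt; else exit 1 }
  '
}

# Usage on a virtual machine:
#   vmstat 10 5 | avg_iowait
```

If the printed number stays above the 15-20 range mentioned above, the machine is spending a lot of its time waiting on storage.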
On the XenServer hosts
If you dump traffic on your XenServer host and see checksum errors, and/or you notice a drop in network performance, you can disable TCP offloading. Offloading needs to be disabled on the host’s network interfaces as well as on the instances’ VIFs and PIFs. Here is an example of how to find these checksum errors and how to manually disable TCP offloading on the host’s network adapter.
Run the following command:
tcpdump -i eth0 -v -nn | grep incorrect
If there are checksum problems, the output will look like this:
16:38:07.676943 IP (tos 0x0, ttl 64, id 60844, offset 0, flags [DF], proto TCP (6), length 60) xxx.xxx.xxx.xxx.46455 > yyy.yyy.yyy.yyy.80: S, cksum 0x4b8f (incorrect (-> 0x1f20), XXXXXXXXXX:YYYYYYYYYY(0) win 5840 <mss 1460,sackOK,timestamp 731623207 0,nop,wscale 4>
16:38:07.711402 IP (tos 0x0, ttl 64, id 28467, offset 0, flags [DF], proto TCP (6), length 52) xxx.xxx.xxx.xxx.41269 > yyy.yyy.yyy.yyy.80: ., cksum 0x6a18 (incorrect (-> 0xa1f1), 1:1(0) ack 773 win 552 <nop,nop,timestamp XXXXXXXX YYYYYYYY>
16:38:07.726013 IP (tos 0x0, ttl 64, id 60845, offset 0, flags [DF], proto TCP (6), length 40) xxx.xxx.xxx.xxx.46455 > yyy.yyy.yyy.yyy.80: ., cksum 0x4b7b (incorrect (-> 0x328c), 1:1(0) ack 1 win 5840
What you are really looking for is something like incorrect (-> 0x6e35), which shows that the checksum tcpdump saw in the packet does not match the value it calculated. These results point to TCP offloading issues, which we need to deal with.
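To get a feel for how widespread the problem is, the incorrect lines can be counted against all lines that carry a checksum. This is a rough sketch under assumptions: count_bad_cksum is a hypothetical helper name, and it expects the tcpdump -v text format shown above on stdin.

```shell
# Count lines with a failed checksum versus all lines that report a checksum.
# count_bad_cksum is a hypothetical helper name; input is tcpdump -v text.
count_bad_cksum() {
  awk '
    /cksum/ { total++ }
    /cksum 0x[0-9a-f]+ \(incorrect/ { bad++ }
    END { printf "%d/%d\n", bad, total }
  '
}

# Usage on the XenServer host (capture 1000 packets, then stop):
#   tcpdump -i eth0 -v -nn -c 1000 | count_bad_cksum
```

The result is printed as bad/total, so 0/1000 would mean no checksum mismatches were seen in that sample.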
Disable offloading on the interfaces:
ethtool -K eth0 rx off tx off sg off tso off ufo off gso off gro off lro off
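Settings made with ethtool -K do not survive a reboot, and some drivers refuse to change certain features, so it is worth checking what actually took effect. ethtool -k (lowercase k) prints the current state. The sketch below filters that output for the features touched above that are still on; offloads_still_on is a hypothetical helper name, and it assumes the usual "feature-name: on/off" lines.

```shell
# List the offload features from the ethtool -K command above that are still on.
# offloads_still_on is a hypothetical helper name; input is `ethtool -k` output.
offloads_still_on() {
  awk -F': ' '
    /^(rx-checksumming|tx-checksumming|scatter-gather|tcp-segmentation-offload|udp-fragmentation-offload|generic-segmentation-offload|generic-receive-offload|large-receive-offload):/ && $2 ~ /^on/ { print $1 }
  '
}

# Usage on the XenServer host:
#   ethtool -k eth0 | offloads_still_on
```

No output means every one of those features is reported as off for that interface.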
On the switches
After changing the offloading options on the XenServer hosts, it is time to configure the switches correctly. On all ports connected to the iSCSI storage device and/or the XenServer iSCSI NICs you should enable ‘flow control’. After enabling this we saw the CPU waiting time on the VMs go down and everything started to run smoothly.
– HP ProCurve 28xx series switches are not the best choice for iSCSI; if you need more ‘switching power’, replace the 2810 with the 2910-al models.
– If you use iSCSI with 2 ports (as a bond), look into trunking on the switch.
– If you want a failover network, enable and configure spanning-tree on the HP switches.