Host disk write latency errors – troubleshooting – part 2

Written by Simon Eady
Published on 15/3/2013 - Read in about 2 min (412 words)

As some of you read previously, I had been experiencing disk latency issues on our SAN and tried many initial methods to troubleshoot and understand the root cause. Due to other more pressing issues this was placed aside until we started to experience VMs being occasionaly restarted by vSphere HA as the lock had been lost on a given VMDK file. (NOT GOOD!!)

The Environment:-

3x vSphere 5.1 Hosts

2x 4port Nics 1GBe (allowing 2x iSCSi vmkernel ports per host for redundancy)

Dedicated Switching (isolated from the LAN) for iSCSi and vMotion (on seperate respective VLANs)

_MSA2312i SAN G2 (with 4 Shelves)

The iSCSi Multipathing policy was set to Round Robin.

SIOC is enabled.

_

After a great deal of digging I resolved to contacting VMware support whom pointed me in turn to the SAN as the Host log files had the following..

<span style="color: #ff0000;">2013-02-20T14:35:38.026Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0x6796 from world 0 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124401f31c0, 91852) to dev "naa.600508e00000000078c4a59a76937603" on path "vmhba1:C1:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE 2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x85, CmdSN 0xab from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x4d, CmdSN 0xac from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0xad from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.</span>

So duely armed I contacted HP support whom immediately escalated the issue internally. During this time I had a very helpful conversation with a good friend @VirtualisedReal whom pointed me in the direction of the HP MSA best practice document. I applied the subnetting configuration it suggested, which seperates the iSCSi ports A1 & B1 from A2 & B2 on seperate subnets and also configured each of the hosts 2 iSCSi vmkernel ports to point to the seperate paired iSCSi ports on the SAN.

When HP did eventually come back to me they suggested the SAN was perfectly fine, However! enough time had passed since the iSCSi port configuration change that I could already see a noticable drop in latency.

I waited another week (and since then) and I am very glad to say the latency is considerably lower with no reoccurance of the locks being lost on VM vmdk files.

Share this post