DefinIT Because if IT were easy, everyone would do it…

15 March 2013

Host disk write latency errors – troubleshooting – part 2

Posted by Simon Eady

As some of you read previously, I had been experiencing disk latency issues on our SAN and tried many initial methods to troubleshoot and understand the root cause. Due to other, more pressing issues this was set aside until we started to see VMs occasionally being restarted by vSphere HA because the lock on a given VMDK file had been lost. (NOT GOOD!!)

The Environment:
3x vSphere 5.1 hosts
2x 4-port 1GbE NICs (allowing 2x iSCSI vmkernel ports per host for redundancy)
Dedicated switching (isolated from the LAN) for iSCSI and vMotion (on separate VLANs)
MSA2312i G2 SAN (with 4 shelves)
The iSCSI multipathing policy was set to Round Robin.
SIOC is enabled.

After a great deal of digging I resolved to contact VMware support, who in turn pointed me to the SAN, as the host log files contained the following:

<span style="color: #ff0000;">2013-02-20T14:35:38.026Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0x6796 from world 0 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124401f31c0, 91852) to dev "naa.600508e00000000078c4a59a76937603" on path "vmhba1:C1:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x85, CmdSN 0xab from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x4d, CmdSN 0xac from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0xad from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.</span>
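For anyone reading entries like these, the hex fields can be translated using the standard SCSI code tables. The following is a minimal Python sketch, not a VMware tool: the `decode` function and the lookup tables are my own illustration, they cover only the values that appear in the excerpt above, and the regular expression handles only the `ScsiDeviceIO` line format. It translates the codes; it does not diagnose the latency.

```python
import re

# Standard SCSI values; only those seen in the excerpt are listed.
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEY = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}
OPCODE = {0x1A: "MODE SENSE(6)", 0x4D: "LOG SENSE", 0x85: "ATA PASS-THROUGH(16)"}

# Matches the ScsiDeviceIO format only: Cmd(<addr>) <opcode>, ... D:<status> ... sense data
PATTERN = re.compile(
    r"Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?"
    r"D:(?P<d>0x[0-9a-f]+) P:0x[0-9a-f]+ "
    r"Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)",
    re.IGNORECASE,
)


def label(table, value):
    """Look up a code, falling back to the raw hex if it is not in the table."""
    return table.get(value, f"unknown ({value})")


def decode(line):
    """Return a readable summary of one ScsiDeviceIO log line, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    v = {name: int(text, 16) for name, text in m.groupdict().items()}
    return {
        "command": label(OPCODE, v["op"]),
        "device_status": label(DEVICE_STATUS, v["d"]),
        "sense_key": label(SENSE_KEY, v["key"]),
        "additional_sense": label(ASC_ASCQ, (v["asc"], v["ascq"])),
    }


# First line of the excerpt above.
SAMPLE = (
    '2013-02-20T14:35:38.026Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, '
    'CmdSN 0x6796 from world 0 to dev "naa.600508e00000000078c4a59a76937603" failed '
    'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.'
)

if __name__ == "__main__":
    print(decode(SAMPLE))
```
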

So, duly armed, I contacted HP support, who immediately escalated the issue internally. During this time I had a very helpful conversation with a good friend, @VirtualisedReal, who pointed me in the direction of the HP MSA best practice document. I applied the subnetting configuration it suggested, which puts iSCSI ports A1 & B1 on one subnet and A2 & B2 on another, and configured each host's two iSCSI vmkernel ports to point to the separate paired iSCSI ports on the SAN.
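To make the layout concrete, here is a small Python sketch of the pairing. Every address, subnet and vmkernel name in it is hypothetical and chosen purely for illustration; none of them come from my environment or from the HP document. It simply shows that, with two subnets, each vmkernel port only sees the controller ports on its own subnet.

```python
import ipaddress

# Hypothetical addressing, for illustration only.
SUBNETS = {
    "iscsi-1": ipaddress.ip_network("10.0.1.0/24"),
    "iscsi-2": ipaddress.ip_network("10.0.2.0/24"),
}
# SAN controller ports: A1/B1 on one subnet, A2/B2 on the other.
SAN_PORTS = {"A1": "10.0.1.10", "B1": "10.0.1.11", "A2": "10.0.2.10", "B2": "10.0.2.11"}
# One host's two iSCSI vmkernel ports, one per subnet.
HOST_VMK = {"vmk1": "10.0.1.21", "vmk2": "10.0.2.21"}


def subnet_of(ip):
    """Return the name of the subnet containing ip, or None if it is in neither."""
    addr = ipaddress.ip_address(ip)
    for name, network in SUBNETS.items():
        if addr in network:
            return name
    return None


def reachable_targets(vmk_ip):
    """List the SAN ports that share a subnet with the given vmkernel address."""
    subnet = subnet_of(vmk_ip)
    return sorted(port for port, ip in SAN_PORTS.items() if subnet_of(ip) == subnet)


if __name__ == "__main__":
    for vmk, ip in HOST_VMK.items():
        print(vmk, "->", reachable_targets(ip))
```
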

When HP did eventually come back to me they said the SAN was perfectly fine. However, enough time had passed since the iSCSI port configuration change that I could already see a noticeable drop in latency.

I waited another week, and since then I am very glad to say the latency has been considerably lower, with no recurrence of locks being lost on VM VMDK files.

6 March 2013

PowerCLI Script to set RDM LUNs to Perennially Reserved – Fixes Slow Boot of ESXi 5.1 with MSCS RDMs

Posted by Sam McGeown

I've previously posted around this topic as part of another problem, but having had to figure out the process again I think it's worth re-posting a proper script for this. VMware KB 1016106 is snappily titled "ESXi/ESX hosts with visibility to RDM LUNs being used by MSCS nodes with RDMs may take a long time to boot or during LUN rescan" and describes the situation where ESXi (5.1 in my case) takes a huge amount of time to boot because it's attempting to gain a SCSI reservation on an RDM disk used by MS Clustering Services. It also details the fix.

The process is fairly simple, but a bit labour-intensive if you're doing it manually on a large cluster.

  1. Retrieve the ScsiCanonicalName for each RDM
  2. Set the configuration for each RDM on each Host to "PerenniallyReserved"
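To show why this gets tedious by hand, here is a rough Python sketch of what the two steps amount to. It is not the PowerCLI script, and it does not talk to vCenter: the host names and canonical names are hypothetical placeholders, and the function only builds the `esxcli storage core device setconfig` command from the KB for every host and RDM pair.

```python
def perennially_reserved_commands(hosts, canonical_names):
    """Build, per host, the esxcli command that flags each RDM as perennially reserved.

    hosts: iterable of host names (hypothetical placeholders here).
    canonical_names: iterable of naa.* identifiers, one per RDM (step 1 above).
    """
    devices = sorted(set(canonical_names))  # an RDM shared by several VMs appears once
    return {
        host: [
            f"esxcli storage core device setconfig -d {naa} --perennially-reserved=true"
            for naa in devices
        ]
        for host in hosts
    }


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    hosts = ["esxi01.example.local", "esxi02.example.local"]
    rdms = ["naa.0000000000000001", "naa.0000000000000002", "naa.0000000000000001"]
    for host, commands in perennially_reserved_commands(hosts, rdms).items():
        print(host)
        for command in commands:
            print("  " + command)
```

With two hosts and two RDMs that is already four commands; the count grows as hosts multiplied by RDMs, which is the part worth scripting.
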