So then! Of late my attention has been drawn to Cloud Credibility, which is a fantastic place to help validate your own and others' cloud expertise by completing various tasks.
Among other things, it encourages you to read white papers, carry out lab work (Hands-on Labs) and watch training and informational materials, rewarding you and your team with points. What is also great is that points really do mean prizes!
I have recently become a part of a team (Team - DefinIT) with the following well-known virtualisation bloggers and vExperts:
Barry Coombs - Virtualised Reality
Michael Poore - vSpecialist
Sam McGeown - DefinIT
This presents another great aspect of Cloud Credibility, as it encourages teamwork through tasks and social/technical interactions.
If you haven't signed up, I would strongly recommend you do so!
As some of you read previously, I had been experiencing disk latency issues on our SAN and tried several methods to troubleshoot and understand the root cause. Due to other more pressing issues this was put aside until we started to experience VMs being occasionally restarted by vSphere HA because the lock on a given VMDK file had been lost. (NOT GOOD!!)
3x vSphere 5.1 Hosts
2x 4-port 1GbE NICs (allowing 2x iSCSI vmkernel ports per host for redundancy)
Dedicated switching (isolated from the LAN) for iSCSI and vMotion (on separate respective VLANs)
MSA2312i SAN G2 (with 4 Shelves)
The iSCSI multipathing policy was set to Round Robin.
SIOC is enabled.
After a great deal of digging I resorted to contacting VMware support, who in turn pointed me to the SAN, as the host log files contained the following:
<span style="color: #ff0000;">2013-02-20T14:35:38.026Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0x6796 from world 0 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124401f31c0, 91852) to dev "naa.600508e00000000078c4a59a76937603" on path "vmhba1:C1:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x85, CmdSN 0xab from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x4d, CmdSN 0xac from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0xad from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.</span>
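For anyone wanting to read entries like these rather than just forward them to support, here is a rough Python sketch that pulls out the command opcode, device and sense data and translates them using standard SCSI codes. The regular expression and the function name `decode_entries` are my own assumptions, and the lookup tables only cover the values seen in the excerpt above. It decodes what the host logged and nothing more; it does not identify a root cause.

```python
import re

# Standard SCSI meanings, limited to the values present in the excerpt above.
DEVICE_STATUS = {0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ADDITIONAL_SENSE = {
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}
OPCODES = {0x1A: "MODE SENSE(6)", 0x4D: "LOG SENSE", 0x85: "ATA PASS-THROUGH(16)"}

# Assumed pattern: matches both the ScsiDeviceIO and the NMP style of entry.
ENTRY = re.compile(
    r'Cmd(?:\(0x[0-9a-f]+\))? (?P<opcode>0x[0-9a-f]+)'
    r'.*?dev "(?P<device>[^"]+)"'
    r'.*?H:(?P<host>0x[0-9a-f]+) D:(?P<dev>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+)'
    r' Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)'
)


def decode_entries(text):
    """Return one dict per log entry found in text."""
    results = []
    for m in ENTRY.finditer(text):
        opcode = int(m["opcode"], 16)
        dev_status = int(m["dev"], 16)
        key = int(m["key"], 16)
        asc_pair = (int(m["asc"], 16), int(m["ascq"], 16))
        results.append({
            "device": m["device"],
            "command": OPCODES.get(opcode, hex(opcode)),
            "device_status": DEVICE_STATUS.get(dev_status, hex(dev_status)),
            "sense_key": SENSE_KEYS.get(key, hex(key)),
            "sense": ADDITIONAL_SENSE.get(asc_pair, "unknown"),
        })
    return results


# Two entries copied from the excerpt above.
SAMPLE = (
    '2013-02-20T14:35:38.026Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, '
    'CmdSN 0x6796 from world 0 to dev "naa.600508e00000000078c4a59a76937603" failed '
    'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.\n'
    '2013-02-20T14:35:38.030Z cpu8:51055)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 '
    '(0x4124401f31c0, 91852) to dev "naa.600508e00000000078c4a59a76937603" on path '
    '"vmhba1:C1:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE\n'
)

for entry in decode_entries(SAMPLE):
    print(entry)
```

Each entry comes back as a dictionary, so it is easy to count failures per device or per command if you feed it a whole vmkernel log.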
So, duly armed, I contacted HP support, who immediately escalated the issue internally. During this time I had a very helpful conversation with a good friend, @VirtualisedReal, who pointed me in the direction of the HP MSA best practice document. I applied the subnetting configuration it suggested, which places iSCSI ports A1 & B1 on one subnet and A2 & B2 on another, and configured each host's two iSCSI vmkernel ports to point to the separate pairs of iSCSI ports on the SAN.
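To make the subnet split a little more concrete, here is a small Python sketch that checks a planned layout before you touch the SAN. All of the IP addresses, the `vmk1`/`vmk2` names and the helper functions are hypothetical and purely illustrative; they are not the values from my environment or from the HP document. The idea is simply that each vmkernel port should share a subnet with exactly one port on controller A and one on controller B.

```python
import ipaddress

# Hypothetical addressing: A1/B1 on one subnet, A2/B2 on another.
SAN_PORTS = {
    "A1": "10.10.1.10/24",
    "B1": "10.10.1.11/24",
    "A2": "10.10.2.10/24",
    "B2": "10.10.2.11/24",
}

# Hypothetical vmkernel ports for a single host, one per subnet.
HOST_VMKS = {
    "vmk1": "10.10.1.101/24",
    "vmk2": "10.10.2.101/24",
}


def reachable_ports(vmk_cidr, san_ports):
    """List the SAN ports that sit in the same subnet as a vmkernel port."""
    vmk_net = ipaddress.ip_interface(vmk_cidr).network
    return sorted(
        name for name, cidr in san_ports.items()
        if ipaddress.ip_interface(cidr).ip in vmk_net
    )


def check_layout(host_vmks, san_ports):
    """Map every vmkernel port to the SAN ports on its subnet."""
    return {vmk: reachable_ports(cidr, san_ports) for vmk, cidr in host_vmks.items()}


def is_split_correctly(layout):
    """True if each vmk sees one port per controller and no port is shared."""
    seen = set()
    for ports in layout.values():
        controllers = sorted(p[0] for p in ports)
        if controllers != ["A", "B"] or seen & set(ports):
            return False
        seen |= set(ports)
    return True


layout = check_layout(HOST_VMKS, SAN_PORTS)
print(layout, is_split_correctly(layout))
```

If both vmkernel ports ended up on the same subnet, `is_split_correctly` would return `False`, which is the situation the subnetting change is meant to avoid.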
When HP did eventually come back to me they suggested the SAN was perfectly fine. However, enough time had passed since the iSCSI port configuration change that I could already see a noticeable drop in latency.
I waited another week (and have kept watching since), and I am very glad to say the latency is considerably lower, with no recurrence of locks being lost on VM VMDK files.
PowerCLI Script to set RDM LUNs to Perennially Reserved – Fixes Slow Boot of ESXi 5.1 with MSCS RDMs
I've previously posted around this topic as part of another problem, but having had to figure out the process again I think it's worth re-posting a proper script for this. VMware KB 1016106 is snappily titled "ESXi/ESX hosts with visibility to RDM LUNs being used by MSCS nodes with RDMs may take a long time to boot or during LUN rescan" and describes the situation where ESXi (5.1 in my case) takes a huge amount of time to boot because it's attempting to gain a SCSI reservation on an RDM disk used by MS Clustering Services. It also details the fix.
The process is fairly simple, but a bit labour-intensive if you're doing it manually on a large cluster.
- Retrieve the ScsiCanonicalName for each RDM
- Set the configuration for each RDM on each Host to "PerenniallyReserved"
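The two steps above can be sketched as follows. This is not the PowerCLI script itself; it is a minimal Python illustration that takes a list of hosts and RDM canonical names you have already collected and builds the `esxcli storage core device setconfig` command the KB describes for each combination. The host names, NAA identifiers and the function name `build_commands` are hypothetical placeholders, and the sketch only builds command strings; running them on each host is left to you.

```python
def build_commands(hosts, canonical_names):
    """Build the per-host esxcli commands that mark each RDM as perennially reserved."""
    # De-duplicate, since a shared RDM is normally attached to several MSCS nodes.
    unique_names = sorted(set(canonical_names))
    return {
        host: [
            f"esxcli storage core device setconfig -d {naa} --perennially-reserved=true"
            for naa in unique_names
        ]
        for host in hosts
    }


# Hypothetical inputs for illustration only.
HOSTS = ["esx01.example.local", "esx02.example.local"]
RDM_NAMES = [
    "naa.60000000000000000000000000000001",
    "naa.60000000000000000000000000000002",
    "naa.60000000000000000000000000000001",  # same LUN seen via a second VM
]

for host, commands in build_commands(HOSTS, RDM_NAMES).items():
    print(host)
    for command in commands:
        print("  " + command)
```

Afterwards, `esxcli storage core device list -d <naa id>` on a host should report "Is Perennially Reserved: true" for each device.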
The voting is now open for your favourite VMware virtualization blogs over at vmware-land.com.
With 200+ blogs now up and running, with content covering every aspect from PowerCLI to VDI, technical deep dives and general VMware topical blogging, there is a very strong chance you will have read an article in at least a few of them.