As some of you read previously, I had been experiencing disk latency issues on our SAN and tried a number of initial methods to troubleshoot and understand the root cause. Due to other, more pressing issues this was put aside until we started to see VMs occasionally being restarted by vSphere HA because the lock on a given VMDK file had been lost. (NOT GOOD!!)
The environment is as follows:
- 3x vSphere 5.1 hosts
- 2x 4-port 1GbE NICs (allowing 2x iSCSI vmkernel ports per host for redundancy)
- Dedicated switching (isolated from the LAN) for iSCSI and vMotion (on separate VLANs)
- MSA2312i G2 SAN (with 4 shelves)
- iSCSI multipathing policy set to Round Robin
- SIOC enabled
After a great deal of digging I resorted to contacting VMware support, who in turn pointed me to the SAN, as the host log files contained the following:
2013-02-20T14:35:38.026Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0x6796 from world 0 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124401f31c0, 91852) to dev "naa.600508e00000000078c4a59a76937603" on path "vmhba1:C1:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x85, CmdSN 0xab from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x4d, CmdSN 0xac from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0xad from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
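For anyone wanting to read entries like these without a lookup table to hand, here is a rough Python sketch that decodes the device status and sense data. The regular expression and the `decode` helper are my own illustration rather than anything VMware ships, and the tables only cover the standard SCSI codes that appear in the log above.

```python
import re

# Standard SCSI codes; only the values seen in the log above are listed.
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ADDITIONAL_SENSE = {
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}
OPCODES = {0x1A: "MODE SENSE(6)", 0x4D: "LOG SENSE", 0x85: "ATA PASS-THROUGH(16)"}

# Matches both the ScsiDeviceIO form "Cmd(0x...) 0x1a," and the NMP form "Cmd 0x85 (...)".
PATTERN = re.compile(
    r"Cmd(?:\(0x[0-9a-f]+\))? (?P<op>0x[0-9a-f]+)\b.*?"
    r"H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+) "
    r"Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)",
    re.IGNORECASE,
)


def decode(line):
    """Return a dict of decoded fields, or None if the line is not a sense-data entry."""
    m = PATTERN.search(line)
    if m is None:
        return None
    op = int(m["op"], 16)
    dev = int(m["d"], 16)
    key = int(m["key"], 16)
    asc = (int(m["asc"], 16), int(m["ascq"], 16))
    return {
        "opcode": OPCODES.get(op, hex(op)),
        "device_status": DEVICE_STATUS.get(dev, hex(dev)),
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional_sense": ADDITIONAL_SENSE.get(asc, "unknown"),
    }
```

Run against the entries above, every one decodes to CHECK CONDITION with an ILLEGAL REQUEST sense key. That generally means the target rejected a command or field it does not support, which is not the same thing as a failed read or write.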
So, duly armed, I contacted HP support, who immediately escalated the issue internally. During this time I had a very helpful conversation with a good friend, @VirtualisedReal, who pointed me in the direction of the HP MSA best practice document. I applied the subnetting configuration it suggested, which puts iSCSI ports A1 & B1 on one subnet and A2 & B2 on another, and configured each host's two iSCSI vmkernel ports so that each one points to a different pair of iSCSI ports on the SAN.
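To make the pairing concrete, here is a small Python sketch that checks which SAN ports each vmkernel port can reach on its own subnet. All of the IP addresses and the `vmk1`/`vmk2` names are made up for illustration; substitute your own.

```python
import ipaddress

# Hypothetical addressing, for illustration only (not taken from my environment).
SAN_PORTS = {
    "A1": "10.0.1.10/24",
    "B1": "10.0.1.11/24",
    "A2": "10.0.2.10/24",
    "B2": "10.0.2.11/24",
}
HOST_VMKS = {"vmk1": "10.0.1.101/24", "vmk2": "10.0.2.101/24"}


def reachable_ports(vmk_cidr, san_ports):
    """Return the SAN port names that sit in the same subnet as the vmkernel port."""
    net = ipaddress.ip_interface(vmk_cidr).network
    return sorted(
        name
        for name, cidr in san_ports.items()
        if ipaddress.ip_interface(cidr).ip in net
    )


if __name__ == "__main__":
    for vmk, cidr in HOST_VMKS.items():
        print(vmk, "->", reachable_ports(cidr, SAN_PORTS))
```

With the addresses above, `vmk1` should list A1 and B1 and `vmk2` should list A2 and B2, so each vmkernel port still has a path to both controllers.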
When HP did eventually come back to me they suggested the SAN was perfectly fine. However, enough time had passed since the iSCSI port configuration change that I could already see a noticeable drop in latency.
I waited another week and, both then and since, I am very glad to say the latency has been considerably lower, with no recurrence of the locks being lost on VM VMDK files.
PowerCLI Script to set RDM LUNs to Perennially Reserved – Fixes Slow Boot of ESXi 5.1 with MSCS RDMs
I've previously posted around this topic as part of another problem, but having had to figure out the process again I think it's worth re-posting a proper script for this. VMware KB 1016106 is snappily titled "ESXi/ESX hosts with visibility to RDM LUNs being used by MSCS nodes with RDMs may take a long time to boot or during LUN rescan" and describes the situation where ESXi (5.1 in my case) takes a huge amount of time to boot because it's attempting to gain a SCSI reservation on an RDM disk used by MS Clustering Services. It also details the fix.
The process is fairly simple, but a bit labour intensive if you're doing it manually on a large cluster.
- Retrieve the ScsiCanonicalName for each RDM
- Set the configuration for each RDM on each Host to "PerenniallyReserved"
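As a rough illustration of the second step, the Python sketch below builds the `esxcli` command that KB 1016106 describes for every host and device combination. It only generates the command strings; the host names and the device ID in the example are placeholders, and how you run the commands (SSH, PowerCLI's esxcli interface and so on) is left to you.

```python
def perennial_reservation_commands(hosts, rdm_devices):
    """Yield (host, command) pairs marking each RDM device as perennially reserved.

    hosts and rdm_devices are plain lists of strings supplied by the caller;
    duplicate device IDs are collapsed so each is set once per host.
    """
    devices = sorted(set(rdm_devices))
    for host in hosts:
        for device in devices:
            yield (
                host,
                f"esxcli storage core device setconfig -d {device} "
                "--perennially-reserved=true",
            )


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    example_hosts = ["esx01.example.local", "esx02.example.local"]
    example_rdms = ["naa.0000000000000000000000000000aaaa"]
    for host, command in perennial_reservation_commands(example_hosts, example_rdms):
        print(f"{host}: {command}")
```

Afterwards, `esxcli storage core device list -d <device>` on each host should report the device as perennially reserved.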
Voting is now open for your favourite VMware virtualization blogs over at vmware-land.com
With 200+ blogs now up and running, covering everything from PowerCLI to VDI, technical deep dives and general VMware topical blogging, there is a very strong chance you will have read an article in at least a few of them.
Today, while creating new VMs from a template, I got the error "the server fault invalidargument had no message" when editing the VM settings. The settings were modified successfully, but the error appeared whether or not a change had actually been made to the VM's settings.
A quick search of the web suggested removing the VM from the inventory and re-adding it from the datastore. For many this fixed the issue, but not for me.
Another suggested removing the host from the cluster and re-adding it, which I did, and still no joy. Finding little else to go on, I elected to simply restart the host the VM/template was originally on.
Lo and behold this fixed the issue!
Had a strange one after deploying an XP VM from a template today - the VM would not power on and threw the following error:
An error was received from the ESX host while powering on VM [VM name].
cpuid.coresPerSocket must be a number between 1 and 8
Digging around on Google, the error seemed to be related to over-allocating vCPUs (e.g. assigning 8 vCPUs to a VM on a host with 4 physical CPU cores). This was a single vCPU machine on a 12 processor host, so not likely to be that! It did give me the idea that maybe the VMX had an error, so I edited the VM hardware, added an extra CPU and saved the config. I then edited it back to a single CPU and powered on the machine - it worked!
Examining the vmx showed that the coresPerSocket was set to zero which is incorrect:
cpuid.coresPerSocket = "0"
And after the change, the numvcpus setting was added and coresPerSocket updated:
cpuid.coresPerSocket = "1"
numvcpus = "1"
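If you want to check a template's vmx before deploying from it, a quick Python sketch along these lines would catch the bad value. The parsing is deliberately simple (`key = "value"` lines only), the helper names are my own, and the 1 to 8 range is taken straight from the error message above.

```python
import re

# Matches simple vmx entries of the form: key = "value"
VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Parse vmx text into a dict of setting name -> string value."""
    settings = {}
    for line in text.splitlines():
        m = VMX_LINE.match(line)
        if m:
            settings[m.group(1)] = m.group(2)
    return settings


def cores_per_socket_problem(settings):
    """Return a description of the problem, or None if the value looks fine.

    An absent key is not flagged in this sketch.
    """
    raw = settings.get("cpuid.coresPerSocket")
    if raw is None:
        return None
    try:
        value = int(raw)
    except ValueError:
        value = None
    if value is None or not 1 <= value <= 8:
        return f'cpuid.coresPerSocket = "{raw}" is not a number between 1 and 8'
    return None
```

Feeding it the contents of the broken vmx should return the message, while the corrected one returns `None`.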
Fortunately, it's a simple fix and once I'd updated the template, not something that will bother me again!
So VMware's Support Assistant is pretty awesome and it's free! I thought I'd do a quick run-through of the installation and setup for anyone who is interested. It's fairly straightforward, and if you raise a lot of calls or have multiple calls on the go it's a real time saver!
VMware's official page for the Support Assistant is here - https://www.vmware.com/products/datacenter-virtualization/vcenter-support-assistant/overview.html
I'm very pleased to say that as of 21st December, I passed my VCP510 exam and am now VCP5 qualified! It's something that I've wanted to do for a long time (since VCP3) but have never been able to get funding for the required course. My current employer sent me on the vSphere 5 Fast Track course earlier this year, so I was all set to take the exam.
My exam experience was somewhat marred by a very poor first attempt which I narrowly failed. The exam I sat had dozens of spelling and grammatical mistakes, inaccuracies and other problems, and I spent far too long commenting on those rather than concentrating on the questions. Fortunately I was eventually able to speak with VMware Education and they issued me with an exam voucher (they will also be releasing a new version of the exam soon, which I'm assured will resolve these problems). My second attempt was a lot better and I smashed the 300 point pass mark by 128 points, which went some way to restoring confidence in my own knowledge of the subject!
I'm now looking forward to studying for the VCAP-DCA and DCD exams with a view to completing them in 2013...
Without wishing to bore the pants off any would-be reader, I shall summarize my ruminations as someone who is still quite new to the VMware world.
The first thing that comes to mind is a couple of recent meetings I have had with VMware, where I learned that they are now very keen to engage with 'the rest of us'. By that I mean those of us working in SMEs, as we represent well over 50% of their business revenue. For me personally this was excellent news, as we have already invested heavily in the VMware product range and plan to carry on doing so in the future. The recent release of VMware suites was a good step forward, but I still feel they need to do a lot better in communicating to SMEs about their vast (and ever increasing) product range, as there are many gems that can often go unnoticed. Our discovery of vCops earlier this year was a good example of this.