VMware vSphere – DRS only shows one host in cluster

I rebuilt an ESX host in my HA/DRS cluster today, following my build procedure to configure it according to VMware best practices and internal guidelines. When the host was fully configured and up-to-date, I added it to the cluster and enabled HA and DRS. Then I went to generate some DRS recommendations to balance the load and ease off my overstretched host, but no recommendations were made.

I couldn’t manually migrate any VMs either – it was odd, because both hosts were added into the cluster, and could ping and vmkping each other from the console.

I also received email alerts:

[VMware vCenter – Alarm Host error] Error detected on [HOST] in [Data Center]: Agent can’t send heartbeats.msg size: 1266, sendto() returned: Operation not permitted

It turned out that the default VMKernel networks were named slightly differently on each host, which was blocking communication between them. Because one was named “VMKernel” and the other “VMKernel 2”, migrations failed, and so did DRS. The hosts added into the cluster OK, and DRS even showed as “imbalanced” on the Cluster summary screen; it was only DRS migrations and vMotion that wouldn’t work.

With the VMKernels renamed to exactly the same thing, DRS kicked off no problem, as did a manual migration.

So the moral of the story is this: name ALL networks in the same cluster identically. It makes sense when you think that the VM needs to see its Virtual Machine Network on each host – why should the Service Console and VMKernel be any different?
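
If you keep a list of the network names configured on each host, a mismatch like this is easy to spot before it bites. The following is only a minimal sketch: the host names (esx01, esx02), the inventory dictionary, and the function find_mismatched_networks are all made up for illustration, and it does not talk to vCenter or any VMware API. It just compares names exactly, including case and spacing, which is the level of strictness that mattered here.

```python
def find_mismatched_networks(host_networks):
    """Return {network_name: [hosts missing it]} for names not present on every host.

    host_networks maps a host name to the list of network names configured on it.
    Names are compared exactly (case and whitespace included).
    """
    all_hosts = sorted(host_networks)

    # Collect every network name seen on any host.
    all_names = set()
    for names in host_networks.values():
        all_names.update(names)

    # For each name, list the hosts that do not have it.
    mismatches = {}
    for name in sorted(all_names):
        absent = [host for host in all_hosts if name not in host_networks[host]]
        if absent:
            mismatches[name] = absent
    return mismatches


# Hypothetical inventory, typed in by hand for this example.
inventory = {
    "esx01": ["Service Console", "VMKernel", "Virtual Machine Network"],
    "esx02": ["Service Console", "VMKernel 2", "Virtual Machine Network"],
}

for name, hosts in find_mismatched_networks(inventory).items():
    print(f"'{name}' is missing on: {', '.join(hosts)}")
```

With the inventory above, this flags both “VMKernel” and “VMKernel 2” as each being present on only one host, while the identically named Service Console and Virtual Machine Network pass without comment.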

Certificate errors when connecting Gateway Server or non-domain Agent to System Center Operations Manager 2007 R2

This was a bit of an odd one. I was adding a Gateway Server to a newly rebuilt SCOM 2007 R2 Root Management Server when I kept encountering this error:

The certificate specified in the registry at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings cannot be used for authentication.  The error is The credentials supplied to the package were not recognized(0x8009030D).

I followed the Microsoft install and setup guides exactly, and it’s not my first time either – but I’d never seen that one before.

It turns out that it’s a quirk with Certificate Services and how you request your certificate. I used the Certificate Services website on my Server 2003 Enterprise Root Certificate Authority to request the correct certificate, based on the OperationsManager template I created. Crucially, there wasn’t an option to install the certificate into the Machine/Personal certificate store, so it went into User/Personal instead. This meant that when it came to exporting and then re-importing the certificate, the private key was not correct.

Requesting the certificate through the MMC Certificates Snap-in and restarting the Health Service resolves the issue.