Host disk write latency errors – troubleshooting – part 2

As some of you read previously, I had been experiencing disk latency issues on our SAN and tried many initial methods to troubleshoot and understand the root cause. Due to other, more pressing issues this was put aside until we started to see VMs being occasionally restarted by vSphere HA because the lock on a given VMDK file had been lost. (NOT GOOD!!)

The Environment:-
3x vSphere 5.1 Hosts
2x 4-port 1GbE NICs (allowing 2x iSCSI vmkernel ports per host for redundancy)
Dedicated switching (isolated from the LAN) for iSCSI and vMotion (each on its own VLAN)
MSA2312i SAN G2 (with 4 shelves)
The iSCSI multipathing policy was set to Round Robin.
SIOC is enabled.

After a great deal of digging I resorted to contacting VMware support, who in turn pointed me to the SAN, as the host log files contained the following:

2013-02-20T14:35:38.026Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0x6796 from world 0 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124401f31c0, 91852) to dev "naa.600508e00000000078c4a59a76937603" on path "vmhba1:C1:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x85, CmdSN 0xab from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x4d, CmdSN 0xac from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-02-20T14:35:38.030Z cpu8:51055)ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0xad from world 91852 to dev "naa.600508e00000000078c4a59a76937603" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
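For reference, the status fields in lines like these can be decoded mechanically. The following is a minimal Python sketch, not a VMware tool: the function name `decode_sense` is my own, the regular expression assumes the exact "H:.. D:.. P:.. Valid sense data: .." layout shown above, and the lookup tables cover only a handful of standard SCSI codes.

```python
import re

# Matches the host/device/plugin status and the sense triplet (key, ASC, ASCQ).
SENSE_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+) "
    r"Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)

# Small, deliberately incomplete tables of standard SCSI values.
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
                 0x18: "RESERVATION CONFLICT"}
SENSE_KEY = {0x2: "NOT READY", 0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR",
             0x5: "ILLEGAL REQUEST", 0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE",
            (0x24, 0x00): "INVALID FIELD IN CDB"}


def decode_sense(line):
    """Return a dict describing the status fields, or None if the line has none."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    return {
        "host_status": host,
        "plugin_status": plugin,
        "device_status": DEVICE_STATUS.get(device, hex(device)),
        "sense_key": SENSE_KEY.get(key, hex(key)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
    }


if __name__ == "__main__":
    sample = ('ScsiDeviceIO: 2316: Cmd(0x4124401f31c0) 0x1a, CmdSN 0x6796 from world 0 '
              'to dev "naa.600508e00000000078c4a59a76937603" failed '
              'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.')
    print(decode_sense(sample))
```

Anything not in the tables is returned as a raw hex value rather than guessed at.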

So, duly armed, I contacted HP support, who immediately escalated the issue internally. During this time I had a very helpful conversation with a good friend, @VirtualisedReal, who pointed me in the direction of the HP MSA best practice document. I applied the subnetting configuration it suggested, which places iSCSI ports A1 & B1 on one subnet and A2 & B2 on another, and configured each host’s two iSCSI vmkernel ports so that each one points to a different pair of iSCSI ports on the SAN.

When HP did eventually come back to me they suggested the SAN was perfectly fine. However, enough time had passed since the iSCSI port configuration change that I could already see a noticeable drop in latency.

I waited another week, and I am very glad to say that since then the latency has been considerably lower, with no recurrence of the locks being lost on VM VMDK files.

Site to Site VPN Tunnel traffic flow problems

Firewalls being used – Sonicwall 3500 & Cisco 506e

Several months ago we relocated, and it was then necessary to set up a Site to Site VPN tunnel with another network. (In this instance the other network was not directly managed by us.)

Once the tunnel had been created and traffic tests had passed, all looked well. However, after several hours (or less in some cases) traffic stopped flowing, yet both firewalls reported the tunnel as “up”. We reviewed the Phase 1 and Phase 2 settings and tweaked the Sonicwall VPN settings in the hope of remedying it.

Options on the Sonicwall such as “Enable IKE Dead Peer Detection” & “Enable Keep Alive” were enabled and disabled to try and find a fix for the VPN traffic flow problem.

Interestingly, during the troubleshooting process we found that if we manually restarted the VPN tunnel it would resume with no issue, but obviously this was hardly a practical fix.

Liaising with the other site, we also experimented with the Phase 1 and Phase 2 lifetime settings, with no success.

It was then that we had a small eureka moment: we decided to check the time servers each firewall referenced. It transpired that the time server referenced by the Cisco firewall was out of sync (it was an internally hosted NTS).

After the offending NTS had been re-sync’d we decided to completely recreate the VPN tunnel double checking the settings as we went along. The VPN Tunnel came up with no issues and has been stable ever since.

I would add that if we encounter a problem like this again, I would simply point both firewalls to the same NTS, but as one of the firewalls in this case was managed by a third party this was not an option.

Configuring a Guest wireless network with restricted access to Production VLANs

It’s a fairly common requirement – setting up a guest WiFi network that is secure from the rest of your LAN. You need secure WLAN access for the domain laptops, with full access to the Server and Client VLANs, but you also need a guest WLAN for visitors to the office which only allows internet access. Since the budget is limited, this must all be accomplished via a single Access Point – for this article, the access point is a Cisco WAP4410N.

Migrating the HP Systems Insight Manager 6.x database

We run two monitoring systems where I work: the first is HP SIM and the second is Microsoft System Center Operations Manager. Currently, they and their databases all reside on a single, rather battered server, “MONITOR1”.

I’ve installed a new SQL Server 2008 server “SQL1” on Windows Server 2008 to take some of the load, and take advantage of the 64-bit OS and SQL installation.

Both servers are part of the domain “DOMAIN”

The process goes something like this:

  1. Add the user that SIM runs as to the SQL server logins. For me that’s “DOMAIN\Insight.Manager”
  2. Create a new database on SQL1 with exactly the same name as the MONITOR1 database for SIM. Since my 6.x install is an upgraded 5.x install, the database is called “Insight_v50_0_16732390”.
  3. Add the SIM user account to the new database with DBO permissions.
  4. Stop the HP SIM service on MONITOR1
  5. Right click “Insight_v50_0_16732390” on MONITOR1 and Export. Export all the tables to SQL1…and wait a long time for the data to transfer.
  6. While you’re waiting, you can edit the following files (c:\Program Files\HP\Systems Insight Manager\Config\) – database.props and database.admin. Change any references for MONITOR1 to SQL1.
  7. Once it’s completed, stop the SQL server on MONITOR1 and start the HP SIM services again – fire up the SIM homepage to check everything is running OK.
  8. If it all checks out, remove the old database and if it’s no longer needed, uninstall the SQL server too.
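Step 6 can be scripted if you would rather not edit the files by hand. This is a minimal Python sketch under a few assumptions: that database.props and database.admin are plain text files, that the server name appears literally (and in the same case) wherever it is referenced, and that the helper names `repoint_config` and `repoint_files` are purely illustrative. It keeps a .bak copy of each file before changing it.

```python
from pathlib import Path


def repoint_config(text, old_host="MONITOR1", new_host="SQL1"):
    """Return the config text with every literal old_host replaced by new_host."""
    return text.replace(old_host, new_host)


def repoint_files(config_dir, names=("database.props", "database.admin"),
                  old_host="MONITOR1", new_host="SQL1"):
    """Back up and rewrite each named file in config_dir; return the changed paths."""
    changed = []
    for name in names:
        path = Path(config_dir) / name
        original = path.read_text()
        updated = repoint_config(original, old_host, new_host)
        if updated != original:
            # Keep the untouched original alongside, e.g. database.props.bak
            path.with_suffix(path.suffix + ".bak").write_text(original)
            path.write_text(updated)
            changed.append(path)
    return changed


if __name__ == "__main__":
    # Directory as given in step 6; run with the HP SIM service stopped.
    config_dir = r"c:\Program Files\HP\Systems Insight Manager\Config"
    for path in repoint_files(config_dir):
        print("updated", path)
```

Check the rewritten files by eye before starting the HP SIM service again, since a plain text replacement would also touch any unrelated string that happens to contain the old server name.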

Trace cables the easy way with Cisco CDP on Windows

No matter how good your network diagrams are, sometimes you need to verify the port your server/desktop is in. Cisco Discovery Protocol is a great tool for network admins when you need to quickly map routers and switches, and if you’ve got an ESX server connected you’ll see that it picks up CDP info too – but the vast majority of my managed systems are Windows.

Here’s how to use TCPDUMP for Windows by MicroOLAP to extend that functionality to your Windows boxes.

Firstly you need to find the interface number of the network adaptor you are trying to find CDP data for.  Use this command:

tcpdump -D

Which gives you a list of the interfaces on the computer:


My actual NIC is the third one in the list, so I can run the command:

tcpdump -i 3 -nn -v -s 1500 -c 1 ether[20:2] == 0x2000

-i n [the interface to capture on, given as its number in the list; for me, 3]

-nn [don’t convert addresses or port numbers to names, which speeds things up]

-v [verbose mode, otherwise we won’t see all the packet details]

-s 1500 [set the snapshot length, i.e. the maximum number of bytes captured per packet; 1500 matches the default MTU, which is more than enough for a CDP packet]

-c 1 [capture one packet only, since the filter already limits the capture to CDP]

ether[20:2] == 0x2000 [match the two bytes at offset 20 of the Ethernet frame against the hex value 0x2000, the SNAP protocol ID that identifies CDP]


Some output is omitted, but you can see that the name of the switch and the port are both in there.

Easier than tracing a cable!
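For anyone curious about what the ether[20:2] == 0x2000 filter is actually matching, here is a minimal Python sketch of the same test applied to raw frame bytes. It assumes the usual CDP framing (802.3 header, then LLC/SNAP with the Cisco OUI); the function names and the source MAC address are made up for illustration.

```python
CDP_SNAP_PID = b"\x20\x00"  # the 0x2000 value the capture filter looks for


def is_cdp_frame(frame: bytes) -> bool:
    """Mirror the filter: compare the two bytes at offset 20 with 0x2000."""
    return len(frame) >= 22 and frame[20:22] == CDP_SNAP_PID


def build_sample_header() -> bytes:
    """Build just the first 22 bytes of a CDP frame to show where offset 20 falls."""
    dst = bytes.fromhex("01000ccccccc")   # CDP multicast destination (6 bytes)
    src = bytes.fromhex("001122334455")   # hypothetical switch MAC (6 bytes)
    length = (8).to_bytes(2, "big")       # 802.3 length field, placeholder value (2 bytes)
    llc = bytes.fromhex("aaaa03")         # LLC header announcing SNAP (3 bytes)
    oui = bytes.fromhex("00000c")         # Cisco OUI (3 bytes)
    # 6 + 6 + 2 + 3 + 3 = 20 bytes so far, so the protocol ID sits at offset 20.
    return dst + src + length + llc + oui + CDP_SNAP_PID


if __name__ == "__main__":
    print(is_cdp_frame(build_sample_header()))
```

The offset arithmetic in the comments is the whole point: 20 bytes of addressing and LLC/SNAP header come first, which is why the filter reads two bytes starting at position 20.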

CCNA Qualified

After some pretty heavy investment in terms of time and money, I’ve passed my ICND2 exam and am now qualified as a Cisco Certified Network Associate (anyone else find it odd that you’re not even considered a professional by Cisco at this level?!)

I do consider the Cisco qualifications significantly more valuable than the others that I hold, simply because of the difficulty of the exams. I do find them “honest” in that they’re not trick questions, and you don’t need a technique to pass – just in-depth knowledge.

Anyway, I think I’ll take a few weeks before I look to my next study/exam.

Cisco Qualified!

As is normally the case when I’m studying, I haven’t had time to post much on here lately. I’ve been studying to pass the ICND1 exam (snappily titled “Interconnecting Cisco Network Devices Part 1”)

I’m really pleased to say that neglecting this site paid off, or rather the study did – I passed with a score of 930! It was a LOT harder than I had expected; I thought I’d walk out after 20 minutes! It does now mean that I am CCENT. I’ll be taking the ICND2 exam early in the new year, which will move me up to CCNA.

Also in the exams category, I’m taking a beta exam “PRO: Design & Deploy Messaging Solutions with Microsoft Exchange Server 2010”. Another snappy title and another bundle of fun!


Teaming NICs with ESX 3.5 and Cisco Switches in an aggregate.

Here’s the setup. We have a core switch of 2 Cisco 3750s, connected together for fault tolerance as a single logical switch; we also have several ESX 3.5 hosts with 4 Gigabit Ethernet NICs installed each. The Virtual Machines will all be on VLAN 8 (reserved for internal servers) and the VMKernel will be on VLAN 107 (reserved for VMKernel traffic like VMotion).  I want to create a load balanced, fault tolerant aggregate of these four NICs over the Core Switch.

Configure ESX server’s vSwitch

Configuring the vSwitch is actually pretty simple, but there are a couple of gotchas, so don’t skip this bit! First thing to note is that if you are making changes to the vSwitch and the Service Console is on that vSwitch you can quite easily lock yourself out. Make sure you configure this correctly, first time! In this setup, I am adding all 4 NICs to vSwitch0, which will be the only vSwitch. I’ll then use Port Groups to assign VLANs and Active/Passive configurations to the VMKernel/Service Console.

First things first then – assign the four NICs to the vSwitch. This is done in the Configuration Tab in VMware Infrastructure Client, then the Networking page. Edit the properties of your vSwitch, then select the Network Adaptor tab. Add all the NICs you wish to team in there (they may already be in there, depending on your setup). You should end up with something that looks like this (note that I’ve not assigned any VLAN yet):


Now you need to configure the NIC teaming, so edit the vSwitch Properties and under the Ports tab select the vSwitch. Click edit, and then go to the NIC teaming tab. Set Load Balancing to “Route based on IP hash” (the switch configuration has to match this later) and configure the teaming options like this:

That’s the easy part over and done with! Time to move onto the Cisco!

Configuring the Cisco Core Switch

Firstly, we need to log on to the switch and enter enable mode; I’m going to assume you know how to do this – if not, you really shouldn’t be attempting this setup!

Determine the switch’s trunk load balancing setup by using the command “show etherchannel load-balance”. It should look something like this:

If the protocol is NOT src-dst-ip, then you won’t be able to establish a trunk connection with the ESX server, so change it with the global configuration command “port-channel load-balance src-dst-ip”. This now matches the “Route based on IP hash” setting you configured in ESX. Although ESX has a setting for MAC based hashing, as does the Cisco, I was unable to get it to work.
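To see why both ends need to hash on the same thing, here is a deliberately simplified Python model of IP-based link selection. It is an illustration only: the exact hash used by ESX and by the 3750 is not reproduced here, and the function name `pick_link` and the addresses are hypothetical. The idea it shows is that the source and destination addresses are combined and reduced to a link index, so a given pair of hosts always lands on the same physical NIC.

```python
import ipaddress


def pick_link(src_ip, dst_ip, link_count):
    """Simplified model: XOR the two IPv4 addresses and take the remainder."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % link_count


if __name__ == "__main__":
    vm = "10.0.8.10"  # hypothetical VM address
    for client in ("10.0.8.20", "10.0.8.21", "10.0.8.22", "10.0.8.23"):
        print(client, "-> link", pick_link(vm, client, 4))
```

Two things follow from the model: the result is the same in both directions (XOR is symmetric), and a single conversation never uses more than one link, so the aggregate spreads many conversations rather than speeding up any one of them.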

Moving on. You need to create a Port-Channel interface for the trunk (this is a virtual interface that binds the 4 GigabitEthernet interfaces together). As I’ve got other Port-channels in use for connections to other switches, I’m setting up port-channel 40. Move to config mode (conf t) and then enter the setup:

interface Port-channel40
 description VMTEST01 Aggregate
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 8
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk

“description” simply adds a description. “switchport trunk encapsulation dot1q” sets the encapsulation of the trunk to 802.1Q. “switchport trunk native vlan 8” means that any traffic without a VLAN tag will be automatically assigned to VLAN 8. “switchport mode trunk” obviously designates that we want a trunk, rather than access. “switchport nonegotiate” stops the port sending Dynamic Trunking Protocol (DTP) frames, so the trunk is static rather than negotiated (the aggregation protocols, LACP and PAgP, are a separate matter handled by the channel-group command on the member ports). “spanning-tree portfast trunk” lets the trunk port enter the forwarding state immediately, bypassing the listening and learning states (i.e. if the link goes down and then comes back up, it will do so quickly).

With the Port-channel configured, you now need to edit your GigabitEthernet ports and assign them to the Port-channel. For each port in the trunk, enter the following config (this example is port 8 on the master switch in my stack, hence 1/0/8):

interface GigabitEthernet1/0/8
 description VMTEST01 VMNIC1
 switchport trunk encapsulation dot1q
 switchport trunk native vlan 8
 switchport mode trunk
 switchport nonegotiate
 channel-group 40 mode on
 spanning-tree portfast trunk

The difference between that and the Port-channel setup? “channel-group 40 mode on” assigns the port to port-channel 40 in static mode, with no LACP or PAgP negotiation.
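Since the same block has to be entered for every member port, it can be handy to generate it. The following Python sketch simply templates the member-port config shown above; the function name is my own, and apart from GigabitEthernet1/0/8 the interface names in the example are hypothetical placeholders for wherever your other three NICs are patched.

```python
def member_port_config(interface, description, channel_group=40, native_vlan=8):
    """Return the IOS config text for one member port of the static aggregate."""
    lines = [
        f"interface {interface}",
        f" description {description}",
        " switchport trunk encapsulation dot1q",
        f" switchport trunk native vlan {native_vlan}",
        " switchport mode trunk",
        " switchport nonegotiate",
        f" channel-group {channel_group} mode on",
        " spanning-tree portfast trunk",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Only the first entry comes from the example above; the rest are placeholders.
    ports = [
        ("GigabitEthernet1/0/8", "VMTEST01 VMNIC1"),
        ("GigabitEthernet1/0/9", "VMTEST01 VMNIC2"),
        ("GigabitEthernet2/0/8", "VMTEST01 VMNIC3"),
        ("GigabitEthernet2/0/9", "VMTEST01 VMNIC4"),
    ]
    for interface, description in ports:
        print(member_port_config(interface, description))
        print("!")
```

The output is only text to paste into config mode; review it against your own port layout before applying it.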

Once all four NICs are assigned you might have to wait a few minutes for every layer of the connection to settle down before the trunk comes up. To check the status of the etherchannel you can use the command “show etherchannel 40 summary”, replacing 40 with whichever number you assigned to your port-channel.

I hope this helps navigate the minefield that I found to be setting up the NIC teaming!

Using AC97 audio with Windows 7

Like thousands of other IT pros out there, I'm testing Windows 7 out on my laptop – since I don't want to mess around with my main PC, it's running on some older kit. The problem with that is that there aren't many Vista drivers around for the hardware – why would there be, it's not even supposed to be able to run Vista?! It does, however, run Windows 7 very admirably (just one of the many improvements).

The only problem was the sound card: the only drivers available from Dell for the onboard sound were for XP, and they crash in both Vista and 7. The sound card is compatible with Intel's generic AC97, so it didn't take long to find a Vista compatible AC97 driver from Realtek which will run any AC97 hardware, regardless of the actual manufacturer.