vNetworking hardships

I have been spending more and more time on the virtual networking side of a VMware ESXi infrastructure lately, trying to track down a strange error in which one of two VM hosts drops its connection to a dvSwitch.  At first I had no idea what could be causing it: all host networking looked healthy, the host console reported everything was fine, the log files showed no networking faults, and the Cisco switches reported no errors.  A hard shutdown of the host was the only thing that seemed to bring the machine and its VM guests back to form, and hard booting a production VM host is not what I call a fun time.   We thought a faulty switch was somehow taking down the dvSwitch, so I removed the networking from that switch and ran home-run connections straight to the core switching cluster.

The fortunate failure hit when I left the management port off the vDS (vNetwork Distributed Switch) and kept it on a vSS (vNetwork Standard Switch): the machine stayed up, but the virtual guest machines lost their networking.   The host reported an "Out of Sync" error.

This got me researching how the vNetwork Distributed Switch (vDS) works.   Distributed switching depends on the shared storage infrastructure, and an "Out of Sync" error indicates that the .dvsdata folder and its underlying files are out of sync with the hidden virtual switch residing on each VM host.   So I opened a call with VMware support, and so far I seem to have stumped them too.  :)
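If you want to see what the host itself thinks the distributed switch looks like, ESXi keeps a local cache of the dvSwitch data in addition to the per-datastore folder.  A rough sketch of where to look (commands and paths as I understand them on ESXi 4.x from the host shell; datastore name is a placeholder, and capitalization of the hidden folder may vary by build):

```
# Dump the host's local view of the distributed switch and its dvPorts
net-dvs -l

# Host-local cache of the dvSwitch/dvPort state
ls -l /etc/vmware/dvsdata.db

# Hidden per-datastore dvSwitch data that must stay in sync with each host
ls /vmfs/volumes/<datastore>/.dvsData
```

Comparing the host-side view against the datastore-side data is one way to confirm which side of the "Out of Sync" condition you are looking at.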

What I have learned so far about the vDS:
pNIC configuration - make sure portfast is enabled on the physical switch ports.

Brandon Riley's blog about vSphere troubleshooting (I got some tool names from the course and checked out my TrainSignal Troubleshooting DVD)

Here are two great overviews of the vDS, including information about the timeout / sync issue.

Marc o'Polo and discussion around failover modes

Frank Denneman's blog about load-based teaming

Definitely some fascinating reading on how the vDS works, and it also gave me a significantly better understanding of the troubleshooting side of this.  The issue at hand, in my opinion, could reside in one of two places.  Either these two hosts, which are the only ones without PowerPath/VE installed, are not receiving the shared storage updates frequently enough, or the trunk ports on the Cisco switch are not set up with the spanning-tree portfast trunk command to accommodate the "Route based on physical NIC load" failover policy I am using.
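For reference, enabling portfast on a trunk port facing an ESXi host looks something like this on a Cisco IOS switch (the interface name and description are illustrative; adjust for your hardware):

```
! Example uplink port for an ESXi host
interface GigabitEthernet1/0/10
 description ESXi-host-uplink
 switchport mode trunk
 ! Skip the STP listening/learning states on this trunk port,
 ! so a NIC failover is not stuck behind ~30 seconds of convergence
 spanning-tree portfast trunk
```

Without portfast, a failover event that moves traffic to another pNIC can leave guests dark while spanning tree converges on the new port.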

I have to wait and see on the PowerPath side (we are short on licensing, and more has been ordered) and on the network switch config, which will take a long time since every port has to be changed.

I will update when the solution arises.

--update--

So after re-examining the logs and correlating timestamps with the networking issues, it looks like a backup / storage read-and-write collision.  SQL maintenance plans, VMDK disk backups, and standard file system backups all running at the same time were clobbering the vDS data on the shared storage, but only on the machines using a single path (i.e., the non-PowerPath/VE servers).

We moved the networking from the vDS back to a vSS and are in the midst of installing PowerPath/VE on the last two hosts.

--final update--

I wish I could report a spectacular fix, but I waited to write this update until I was sure the issue was actually resolved.  During one of the host outages I noticed a hardware memory DIMM error stating "Uncorrectable ECC error" on every memory stick in the machine, and ONLY in the VMware logs (the vendor's diagnostics reported everything was a-OK).  Even so, as part of troubleshooting we replaced all the RAM in the box and waited again.  Upon reboot, the same error reared its ugly head.  A call went in to the vendor to replace the memory controller (i.e., the mainboard), which they swapped that same day.

It has been about four weeks since the last outage, and there have been little to no blips from the host since.  I learned a few tips about critical log file locations and about the warning signs for these types of issues, but the biggest reminder I got from this experience is that "check the physical layer first" troubleshooting should ALWAYS prevail.
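For anyone chasing a similar ghost, these are the log locations I found most useful (classic ESXi 4.x paths; later versions consolidate most of these under /var/log/*.log, so verify against your build):

```
/var/log/vmkernel          # vmkernel log - where the "Uncorrectable ECC error" surfaced
/var/log/vmware/hostd.log  # host management agent
/var/log/vmware/vpxa.log   # vCenter agent on the host
/var/log/messages          # general host messages
```

The key lesson is that the hardware error only appeared here, not in the vendor's own diagnostics.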
