System failure, looking for answers

Need some help, guys. We had a major system failure last week, and I'm a bit confused about what happened and why I was unable to recover from it. I lost a bit of my confidence in vSphere, and I'd like to restore it.

A bit about the environment: ESXi 5.0 U1 running on a 3-node cluster of HP Gen8 servers, with an HP P4500 iSCSI SAN for VMFS datastore storage. ESXi is installed on HP's CF cards in the servers. This cluster has been up and running for about 6 months, without a single hiccup before this.

On Wednesday we migrated 2 of our database VMs (SQL and Progress) to this 5.0 cluster; the VMs went onto different hosts in the cluster. They had previously been running on an older 2-node cluster with ESXi 4.1 U2. The machines migrated fine, the tools updates went fine, they rebooted, and everything seemed fine.

About 2 hours after the migration, we started to receive calls that our Progress database was down and users could not connect. Then we started getting more calls: other machines were inaccessible. Glancing at vCenter, I could see that all the VMs in question were on the same host, and I was unable to open the console in vSphere for any VM on that host. The host showed as connected and showed all its VMs as connected, but I could not open the console or a remote desktop session to any of the VMs on that host. I started digging, and sure enough this host had dropped all of its iSCSI SAN network connections. The paths showed as dead. The NIC ports showed as active and the switch ports showed activity, but the connections were down. I could still ping the host's management address, however, and the vMotion ports were up.

At this point, I started trying to vMotion VMs off that host, beginning with the Progress database. It would not migrate; it just sat at 8%, preparing to migrate. I tried other VMs with the same result. I started to wonder why HA had not kicked in and why I couldn't move anything. At this stage the host started disconnecting from the cluster. I could still ping the host, but vSphere showed it as disconnected. I couldn't move my VMs, and I couldn't get to the host via vCenter, via the vSphere Client pointed directly at the host, or via the DCUI.

So I called VMware support, got an engineer on the line with me, and it became clear we were going to have to power cycle that host and crash all the VMs running on it. That wasn't a very pleasant answer for me, because this Progress database is our main production system and I was afraid of corruption. We had no choice, so we did it. When the host came back up, fortunately the VMs came up fine. The VMware engineer dug through the logs and said the cause was a particular NIC driver with known issues. He showed me the KB article on it. We updated those drivers on all hosts, and that was that.

My problem with this is: how did the cluster run fine for 6 months without a problem, and how come the redundant path to the SAN did not keep the connection alive when the path with the bad NIC driver failed? I have 2 different paths to the iSCSI SAN, on 2 different NICs. The other card had no known driver issue. Why did both paths fail, rather than just one, because of a driver problem on one of the cards? More worrying still, how is it that I could not migrate anything, and why did HA not kick in?

Sorry for the novel here, but without all the details you only get part of the story. My biggest concern is why I couldn't move anything. In the event of a host failure, what are you supposed to do to migrate the machines if they won't migrate via the vSphere Client? We were down for about 2.5 hours, and a lot of questions were thrown at me by senior management as to why my "highly available, redundant system" took hours to recover...

Does anyone here have any ideas, thoughts on how I could have handled it differently, or reasons why I should be confident everything is fine now?

Thanks for your time

Kevin

trink408 wrote:

Thanks Matt.

I guess part of the problem is that I don't completely understand how HA or vMotion works. I was under the impression that once a VM was unreachable, HA would kick in and move the VM? The VMs were not responding to pings or accessible by remote desktop. I couldn't power them off or do anything through vCenter either.

If they had no usable link to storage, this seems plausible.  HA simply does not address storage failures.  Not at all.  It IS possible to enable VM-level HA, but it isn't on by default; it must be enabled on a per-VM basis.  You should read Duncan Epping's book on this; it's the bible of HA.

So if something happens to the storage connection, you really have no way to vMotion anything off, and your only option is to kill the host / the VMs running on it, and then migrate them or let HA move them?

Correct.

I didn't know that I wouldn't be able to vMotion a VM if the host had lost its connection to storage. The other hosts in the cluster could still see the storage.

The source host is still in control of those VMDKs as far as the other hosts are concerned (they see a lock on the file), and when they ask the host whether it's still alive, it answers (because it hasn't lost its network or powered down, which are the failure modes that HA is designed to handle). So they leave it in charge.
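To make that concrete, here is a toy Python model of the decision described above. It is only a sketch of the reasoning in this thread, not VMware's actual HA algorithm or API; `HostStatus` and `ha_restarts_vms` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostStatus:
    """Simplified, hypothetical view of a host as the rest of the cluster sees it."""
    powered_on: bool
    answers_network_heartbeat: bool
    storage_paths_alive: int  # tracked here only to show that the rule ignores it


def ha_restarts_vms(host: HostStatus) -> bool:
    """Toy rule: the VMs are restarted elsewhere only if the host looks dead.

    Storage health is deliberately not consulted, mirroring the point
    that this kind of HA does not address storage failures.
    """
    host_looks_dead = (not host.powered_on) or (not host.answers_network_heartbeat)
    return host_looks_dead


# The situation in this thread: host is up and answering, but has no storage paths.
storage_dead_host = HostStatus(powered_on=True,
                               answers_network_heartbeat=True,
                               storage_paths_alive=0)

# A host that has actually powered off.
crashed_host = HostStatus(powered_on=False,
                          answers_network_heartbeat=False,
                          storage_paths_alive=2)

print(ha_restarts_vms(storage_dead_host))
print(ha_restarts_vms(crashed_host))
```

In this model the host with zero live storage paths is still treated as healthy, which is why nothing gets restarted until the host is actually powered off.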

The reason it took so long was that I didn't kill the host, out of concern for the database. I was hoping the VMware engineer might have a way to shut things down gracefully. So he spent some time looking around and trying to determine what was going on. Ultimately we just power cycled the host, so I could have done that much earlier, and if we face this again in the future, I will.

As long as you follow the vendor's advice (most decent databases keep transaction logs, etc.) and have good backups, there's no real risk of data corruption.  VMware does not significantly change the I/O path, so you have the same exposure as on a physical host.

I still don't understand why both paths were marked as dead and the host completely lost its connection to the SAN, or why it eventually showed as disconnected in vCenter.

Well, the host went offline because it got stuck in an all-paths-down scenario, which is common in 5.0.  5.1 addresses this problem somewhat.  I don't know why all the paths went down, but I suspect you have a misconfiguration somewhere... normally you'd expect at least 4 paths in a properly set up LeftHand system.  Check with HP to make sure you are following best practices.
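As a rough illustration of what "all paths down" means compared with an ordinary single-path failure, here is a small Python sketch. The path states are made-up sample data, not real output from any VMware tool, and `classify_paths` is a hypothetical helper name.

```python
def classify_paths(states):
    """Classify a device's paths from a list of state strings.

    Each entry is assumed to be either "active" or "dead"; these
    labels are illustrative and do not come from a real tool.
    """
    if not states:
        return "no paths"
    alive = sum(1 for s in states if s == "active")
    if alive == 0:
        return "all paths down"   # nothing left to fail over to
    if alive < len(states):
        return "degraded"         # redundancy absorbed a failure
    return "healthy"


# Two NICs, one path each, both dead: the situation described in this thread.
print(classify_paths(["dead", "dead"]))

# What a single bad NIC would normally look like with working multipathing.
print(classify_paths(["active", "dead"]))

# Four paths, all up.
print(classify_paths(["active", "active", "active", "active"]))
```

A driver problem on one card would normally be expected to produce the "degraded" case; seeing every path dead at once is what points toward a configuration problem rather than a single bad NIC.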

Maybe the whole IP stack got corrupted or something?

Possible, of course, but unlikely.  I've never heard of that happening.

I appreciate the help and the insight. I figured the problem was a lack of understanding on my part, and I fully accept that.

HA and vMotion are complex.  At least now you can go back to management and tell them why it went down (because it was a failure scenario outside the scope of the software) and maybe ask for money to build a proper SQL cluster.  I definitely recommend Duncan Epping's book: http://www.amazon.com/VMware-vSphere-Clustering-Technical-Deepdive/dp/1463658133/ref=la_B002YJMRCY_1_2?ie=UTF8&qid=1363622697&sr=1-2

Tags: VMware
