A VMware mistake may shutdown thousands of virtual infrastructures
This morning, VMware customers who upgraded their virtual data centers to the new Infrastructure 3.5 Update 2 (build 103908) had an awful surprise: any virtual machine that is turned off cannot be powered on again, and any attempt to execute a VMotion (the live migration of a VM from one host to another) fails.
The reason behind this huge and unprecedented issue is an error in the license expiration time.
The only way to work around the problem at the moment is to disable the Network Time Protocol (NTP) client and set the date back to August 10, as promptly suggested by a customer here.
Of course, this countermeasure affects log consistency and any tool that analyzes the VirtualCenter events for different purposes (performance monitoring, trend analysis, capacity planning calculation, etc.).
More importantly, this issue affects the availability of those infrastructures whose IT administrators are on vacation (and there are many on August 12) and cannot perform any recovery.
The users from all around the world are reporting failures of parts of their systems and, in some cases, even a complete outage.
VMware has over 200,000 enterprise customers (100% of Fortune 100 and 95% of Fortune 500), and it claimed that 59% of them use VMotion in production.
The company didn’t provide any statistics about how many have already deployed Update 2, but the license fault could have impacted thousands of them.
VMware is aware of the issue but couldn’t provide any immediate solution.
At the moment it seems that the entire VMware Knowledge Base has collapsed.
Customers calling the support line receive only a brief message saying that the problem will be solved within 36 hours.
Additionally, VMware removed the capability to download any affected product.
The existence of such issue is more than enough to undermine the credibility of the company (which already made some mistakes in the past) in a complex moment of its successful history.
A 36 hours timeframe to provide a solution is just an unacceptable answer for all those enterprises that deploy virtualization in production.
The whole thing may severely damage the stock performance of today.
Update: The license of VMware ESXi 3.5 U2 (build 103909) is reported as affected by the same problem.
Second update: To further aggravate the situation, today is the so called Microsoft Patch Tuesday, so a number of guest operating systems are being automatically (or manually for those unaware of the issue) rebooted.
As if this were not enough, any customer running a VDI environment certainly allows its end users to reboot their virtual desktops any time they want.
Third update: As the VMware Knowledge Base is still unavailable, probably due to overload, virtualization.info publishes the original KB article about this issue.
Fourth update: The issue also impacts ESX 3.5 Update 1 with certain patches.
The full details are available in the comment section of this post, thanks to the effort of a virtualization.info reader.
Suddenly the problem is no longer a matter of early adoption.
Fifth update: As promptly reported in the comments section, VMware’s new CEO, Paul Maritz, published an apology on the official blog, announcing that a patch has been released:
…I am sure you’re wondering how this could happen. We failed in two areas:
- Not disabling the code in the final release of Update 2
- Not catching it in our quality assurance process
We are doing everything in our power to make sure this doesn’t happen again. VMware prides itself on the quality and reliability of our products, and this incident has prompted a thorough self-examination of how we create and deliver products to our customers. We have kicked off a comprehensive, in-depth review of our QA and release processes, and will quickly make the needed changes…
Maritz couldn’t have wished for a worse start in his new role in the company. Nonetheless this is a great opportunity: the co-founder and former VMware CEO, Diane Greene, was often accused of being unable to grow her company into a big enterprise capable of competing against Microsoft.
In handling this incident, Maritz has his first chance to demonstrate that he’s the right person to do better than Greene.
Sixth update: VMware is still unable to republish the ESX 3.5 and ESXi 3.5 Update 2 images for fresh installations.
Their availability is expected by August 13, 2008 at 6pm PST.
Seventh update: VMware just informed its customers that it cannot deliver a new, patched image of the product for the planned deadline.
The images are now planned for release August 14, between 2am and 8am PDT.
Eighth update: A number of enterprise customers may be unable to apply the first patch released (see Fifth update above) for a number of reasons:
- Unable to schedule a maintenance window
- Internal change control procedures
- No available server to VMotion running VMs onto
VMware is aware of these constraints and has informed its customers that it is developing a second procedure to apply the patch, called the U2 Alternative Install Process (U2 AIP), which will be available on demand by calling Support.
At the moment (August 15, 2008) there is no release date for this new patch installation procedure.
Meanwhile the full patched images are finally available online and all the download links have been reactivated.
The new build numbers are:
- ESX 3.5 Update 2 – 110268
- ESXi 3.5 Installable Update 2 – 110271
68 Comments
Anonymous
Tuesday, August 12, 2008 2:35:00 PM
but anyway, massimo is right, its not their first fault ... in a very very hard time ...
I'm thinking VMware should either spend a little money in their QA department or replace them with competent staff. These are both examples of items that should have been easily detected before the product ships.
The point is valid to let others test before you, yet that isn't important with this issue.
The issue is that VMware messed up. The enterprise admin did not. Assign blame where it belongs.
And no, I didn't get burned. Haven't loaded it at any customers yet. "waiting" :)
How many production servers are going down today? How many millions of dollars are being lost today?
My office only has a few VMware servers and you can guess there won't be anymore after this.
The guidance being given is to move clocks back two days? ARE YOU SERIOUS? IN THE REAL WORLD, WE HAVE SLAs AND ARE LEGALLY REQUIRED TO KEEP OUR SERVERS TIME SYNCED.
This is a CATASTROPHE.
----------------------------------------
Problem:
An issue has been discovered by many VMware customers and partners with ESX/ESXi 3.5 Update 2 where Virtual Machines fail to power on or VMotion successfully. This problem began to occur on August 12, 2008 for customers that had upgraded to ESX 3.5 Update 2. The problem is caused by a build timeout that was mistakenly left enabled for the release build.
Affected Products:
- VMware ESX 3.5 Update 2 & ESXi 3.5 Update 2 (pre-Update 2 releases are not impacted by this problem).
- Reports of problems with ESX 3.5 U1 with the following patches applied.
1. ESX350-200806201-UG
2. ESX350-200806202-UG
3. ESX350-200806217-UG
- No other VMware products are affected.
What has been done?:
- Product and Web teams pulled the ESX 3.5 Update 2 bits from the download pages last night so no more customers will be able to download the broken build.
- VMware Engineering teams have isolated the cause of the problem and are working around the clock to deliver updated builds and patches for impacted customers.
- A Knowledgebase article has been published (http://kb.vmware.com/kb/1006716), but traffic to the knowledgebase is causing time outs. A new static page has been published at http://www.vmware.com/support/esx35u2_supportalert.html that customers and partners will be able to view.
- The phone system has been updated to advise customers of the problem
- VMware partners have been notified of the issue.
Workarounds:
1) Do not install ESX 3.5 U2 if it has been downloaded from VMware's website or elsewhere prior to August 12, 2008.
2) Set the host time to a date prior to August 12, 2008. This workaround has a number of very serious side effects that could impact product environments. Any Virtual Machines that sync time with the ESX host and serve time sensitive applications would be broken. These include, but are not limited to database servers, mail servers, & domain administration systems.
Next Steps:
VMware to send an email to all customers who have downloaded this version. This effort is underway and should happen before 11 am today.
Resolution:
VMware Engineering has isolated the root cause and is working to produce patches for impacted customers today (August 12, 2008) and will likely re-release ESX 3.5 Update 2. More information will be provided as the product teams have it.
Such a mistake should not be possible (e.g., someone CAN forget something, but the system should not ALLOW a release build to be made with #define EXPIRE 1)
B) Product Management:
Why the hell put a timebomb in betas anyway... who cares if someone doesn't upgrade it and goes in production with it? They deserve trouble if they do (pop a warning = ok, make product useless = not ok)
C) IT/website:
A knowledge base is supposed to be up and running during this kind of event? VMware has to understand that there is NO EXCUSE for not being able to maintain a KB site 365 days a year (/., theregister or whatever). VMware wants to show they can beat Microsoft, Redhat, Citrix and whoever else ? Fine, learn how to run a simple website first. If your KB solution cannot hold the load, ditch it for what I care (anyway the relevance is crap and google indexing would not hurt).
So, if I may, let's quote the email of Paul Maritz the 9th of July (1 month ago):
"As such, I will call upon our leadership team to be more empowered in decision making, as well as drive down accountability and decision making at all levels in the company."
Now would be a *very* good time to start "driving down accountability" Paul...
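The build-system point made in the comment above (a release build should never be possible with the expiry flag switched on) can be illustrated with a small pre-build check. This is only a sketch under stated assumptions: `BuildConfig`, `validate_build` and the `channel` / `expire_enabled` fields are hypothetical names invented for this example, and nothing here reflects how VMware's actual build pipeline works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildConfig:
    """Hypothetical build settings; the field names are illustrative only."""
    channel: str          # "beta" or "release"
    expire_enabled: bool  # stands in for a flag such as `#define EXPIRE 1`


class BuildConfigError(Exception):
    """Raised when a configuration must not be built."""


def validate_build(config: BuildConfig) -> None:
    """Refuse any release build that still carries the expiry flag."""
    if config.channel not in ("beta", "release"):
        raise BuildConfigError(f"unknown channel: {config.channel!r}")
    if config.channel == "release" and config.expire_enabled:
        raise BuildConfigError("release builds must not enable expiry")


if __name__ == "__main__":
    # A time-limited beta is accepted.
    validate_build(BuildConfig(channel="beta", expire_enabled=True))
    # A release with the flag left on is rejected before anything is built.
    try:
        validate_build(BuildConfig(channel="release", expire_enabled=True))
    except BuildConfigError as err:
        print(f"build rejected: {err}")
```

Running a check like this as the first step of the build means a forgotten flag stops the build instead of reaching customers.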
VMware Who? I dont trust them. As listed in the above statement, They have helped me confirm why.
The Update patch bundles will be released separately later in the week.
So if you are one of the unlucky ones to have applied the Update 2 patch, now you will have to wait longer than noon PST to resolve the issue.
Evaluate Hyper-V or Citrix XenEnterprise. The performance gained by going with XenEnterprise caught all of my IT staff off guard. Much better performance.
If you are using ESX/ESXi in pre-production, again, look at XenEnterprise or Microsoft Hyper-v. You'll need a management application so look no further than VMLogix. They have a labmanager that supports all top 3 hypervisors.
Then call up your Vmware rep. Tell him that you went w/a competitor. This is the only way VMware will get better.
But, also consider they whacked their Founder, Diane Greene and brought in a Microsoft suit. The entire VMware culture will and has changed, for the worse.
2) it fails to start; it's not the same issue as described in the subject
3) it's still a critical issue and the kb site should work 24/7
4) xen is in the same court (path to upgrade between 4.0 & 4.1 is not so smooth)
If you want to look at the future, look at XenServer, HA, Multipathing, NetApp connector, XenMotion and a real high available XenCenter!!! I can tell you it is a lot cheaper, solid as a rock and performs even better than VMware.
Hello Anonymous: It's possible we have different perceptions on Citrix, but that doesn't make my perception any more ignorant than yours. Perhaps if you knew of my experiences, you would understand my point of view.
If you want to talk further about ignorance, check the grammar in your post "(your)".
Jas
Who would have thought their own engineering team could have created the perfect trojan horse. It is that ridiculous, one would even have to be sceptical that it was an inside job to upset the apple cart? maybe one last line of code from diane?
Citrix + HyperV aren't even at the same stage as ESX2.5 - so please stop adding stupid citrix + hyperv comments.
HyperV doesn't even have a VMotion equivalent.
Citrix has only just introduced HA functionality - and please - remember the first release of XenMotion !!! advised not to be done during business hours !!!
Also, Transparent page memory sharing - or memory dedupe as it is now fashionably called. Citrix + HyperV can't do it ! A simple thing like VMs sharing identical pages of memory !!
VMware, you are like my Alfa Spider. You cost me a lot of money, technology advanced - when things go wrong - it hurts - a lot of people knock you - but I am prepared to put up with a little pain for the privilege of having such a state of the art setup for the last 5 years.
But please release the patch soon !!
Advocating caution and diligence with changes on enterprise infrastructure on one post and then recommending Hyper-V and Citrix on another? You guys SHOULD be posting anonymously.... and btw, Jason's been posting on VMware communities before most of you made it out of high school.
Jacques Cronje
Won't the hosting server get a reboot if it gets patched, therefore powering off the VMs and thus causing them to require a power on, causing them to break?
At 5:17 (European time) I received an email from vmware stating that there is a Patch released for this issue. A general message to the community has also been released.
http://www.vmware.com/go/esxexpresspatches
Regards,
Leo Bink
Sure VMware f*cked up big time here, but if i switched products on that count i would have dropped Citrix and Microsoft a long time ago.
I still strongly agree with Anonymous Wednesday, August 13, 2008 12:02:00 AM
Of course, no one on my team is crazy enough to put in a patch or update until it's been in release for a while either....short of it solving an issue. That's certainly a bit too trusting for my tastes.
Hyper V is utter crap, xen is a lot better, but neither are anywhere near vmware's level...so I'm not switching, or advising a switch over what amounts to a lot of administrators ignoring the cardinal rule of not updating until the patch is matured.
Microsoft is turfing like a demon now though, it's fun to watch. Every forum or thread I see has at least one turf of hyper v "Vmware sucks so bad, I'm moving to hyper v, because performance is awesome!"....Yeah right. Someone apparently hasn't loaded down a hyper v server and watched as it vomited on itself. If you have the need of a cheaper virtualization product that doesn't do vmotion or true HA, go with the free esx product(and still, use your damned brain and wait before patching--it's not a toy, and it's not shiny--so why act like it is?!?)
Reality is, as bad as this situation is, had this problem happened to Hyper-V, it probably would have required a reboot. A reboot in Hyper-V means taking off your workload using "quick migration", which means tens of seconds or minutes of downtime per VM. Assuming your VMotion was still operable here, there would have been zero downtime caused by this patch. Even if you were unlucky and both your source and destination were impacted by this issue not allowing you to VMotion, being able to use VMotion for installing patches is certainly the norm and therefore installing VMware patches normally does not result in any downtime.
This incident reminds us that it is not easy to be an OS vendor. VMware is experiencing some growing pains. Hopefully for their sake they jettison the smug attitude about taking on the world as the "new operating system" and replace it with a bit of humility and responsibility that comes with their wide adoption.
All currently supported by XenServer - Xen embedded, XenMotion, High Availability, NetApp Storage Migration. Continuous Availability with Marathon for XenServer, not to mention SVI (When?) Citrix Provisioning Server is already doing it! XenServer is becoming VMware's Roadmap :)
Hello Anonymous: It's possible we have different perceptions on Citrix, but that doesn't make my perception any more ignorant than yours. Perhaps if you knew of my experiences, you would understand my point of view.
Sorry....I'll get on my knees and praise the almighty VM gods.
You need to get out more
We had not loaded the update so saw no impact.
1) Patch Tuesday should have very little effect on anyone here. The problem manifests itself when you power on a VM, not when you reboot a VM. How many people apply patches and then power their servers off, then on rather than reboot.
2) For those people that did have a problem with a VM being powered down and then not being able to power it back up, I'm guessing it was probably one or a couple of VMs rather than dozens or hundreds in a single company. I don't know why anyone would power off dozens or hundreds of production VMs at the same time before trying to power one back on.
3) I don't know about Citrix products, but I know from past experience that MS has put out updates that caused more problems than this in the past. With Microsoft's patching history, I can't see going years in a Hyper-V environment w/o at least one major downtime issue caused by a MS patch.
I agree with the Anonymous post from Wednesday, August 13, 2008 7:25:00 PM, what was the actual impact?
How many people saw actual production downtime?
In the instances of actual production downtime, how many VM's was it?
Outside of anyone who was unfortunate enough to have a hardware failure requiring vmotion, I am guessing the actual impact was very minimal compared to the firestorm of people proclaiming the sky was falling.
Personally I manage 9 ESX hosts with 90+ production VMs and did not have a single second of production impact from this update.
I have been reading your blog for quite a long time, but to be honest you tend to be more speculative than objective; this last post of yours is too tabloid for my taste (although I don't want to underestimate the importance of this VI3 bug).
"A VMware mistake may shutdown thousands of virtual infrastructures": this bug won't shutdown anything
"To further aggravate the situation, today is the so called Microsoft Patch Tuesday": a soft reboot is not a problem
And so on...
IMHO your credibility is more important in the long run, personally I would prefer virtualization.info as objective as possible and constructive in case of problems (regardless of vendor).
I think facts are more important than generating feelings, but it's your job, it's up to you.
Best regards,
Attila
I'm not talking about the bug which caused the license violation. I'm talking about the fact that when a license violation occurs they take extreme action. All software, including licensing components, will have bugs.
I worked with a product where a licensing related defect caused 2 soldiers in Iraq to lose their lives. ESX isn't a toy, it's basic infrastructure. VMware can't predict how people will use the software and how mission critical some features will be. And it is not acceptable for VMware to cause a customer downtime so that they can protect their revenue. The ramifications of a widespread failure like this are huge.
Once I took the time to really gauge the impact and speak with my VMware SE and TAM, I quickly realized this was not the disaster that the author of this blog had claimed... I simply then waited for the patch to come in and have just now completed the remediation on my ESX35U2 farm. I suffered no production downtime or end-user impact.
To the author, as you well know many are just now coming to grips with virtualization as a concept, and your posting caused a lot of unnecessary confusion in my shop over the past 24 hours. I respect what you do, I only ask that you be a bit more careful jumping on an issue while facts are far from complete.
I read your blog almost every day and so far it has been very objective, maybe the most complete and informative blog on the topic. I just can't understand why such a misleading title for this issue. I use VMware (and Xen as well) and I have to say that I was pretty alarmed when I first read your article.
As already pointed out by other posters, there is no shutdown. The problem occurs only when you power off (not reboot) a VM, then you would be unable to power it on again. There is a significant difference.
I expect a correction on your blog.
I once respected your blog. No more.
If you're going to build a product which in ANY of its incarnations involves time-bombed code, then design it so that the product and any management console prominently display a licence expiry message for at least one month before functionality is disabled. That way, users and the vendors get advance warnings.
Now, it could be that the code paths get so screwed up that this sort of mechanism could fail.
Oh, and if you're going to time-bomb your betas make them expire well before the RTM date, so if the wrong code gets into the RTM build it is found immediately.
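The advance-warning idea in the comment above can be sketched as a three-state check rather than a hard on/off switch. This is a minimal illustration, not VMware's licensing logic: `LicenceStatus`, `licence_status` and `WARNING_WINDOW` are hypothetical names, the 30-day window simply mirrors the "at least one month" suggestion, and the August 12, 2008 expiry is used only because it is the date the problem appeared.

```python
from datetime import date, timedelta
from enum import Enum

# Assumption: a 30-day notice period, mirroring "at least one month" above.
WARNING_WINDOW = timedelta(days=30)


class LicenceStatus(Enum):
    VALID = "valid"
    WARNING = "warning"  # everything still works, but a notice is displayed
    EXPIRED = "expired"


def licence_status(today: date, expiry: date) -> LicenceStatus:
    """Classify a licence so that expiry is announced before it bites."""
    if today >= expiry:
        return LicenceStatus.EXPIRED
    if expiry - today <= WARNING_WINDOW:
        return LicenceStatus.WARNING
    return LicenceStatus.VALID


if __name__ == "__main__":
    expiry = date(2008, 8, 12)  # illustrative expiry date
    for day in (date(2008, 7, 1), date(2008, 7, 20), date(2008, 8, 12)):
        print(day.isoformat(), licence_status(day, expiry).value)
```

With a warning state in place, an expiry date that slipped into a shipping build would show up on every console weeks before any power-on or VMotion was refused.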
This is tabloid, emotive and all written in the negative. I'm honestly wondering if Microsoft is helping you with your editing. Just in this one article
- "may shutdown thousands of virtual infrastructures" It hasn't
- "The users from all around the world are reporting failures" As above
- "The existence of such issue is more than enough to undermine the credibility of the company" First time something like this has happened
- "A 36 hours timeframe to provide a solution is just an unacceptable answer". I think their response has been excellent
- "The whole thing may severely damage the stock performance of today" Stock has gone up.
- "VMware is still unable to republish"
- "VMware just informed its customers that it cannot deliver a new, patched image of the product for the planned deadline."
Like other readers, I prefer objectivity, facts and less sensationalism.
I will still read the site, and learn from it. It would take many incidents to rock my faith in him, as it would vmware.
VMware screwed up in colossal fashion. Deal with it and shut the hell up.
The biggest VMware bug in years happened to have... hardly any impact at all on production systems.
I'm sure most VMware employees hate EMC and Microsoft.
Good way to express your feelings in support of Diane...
Sure VMware made a big mistake, but admins should NOT readily roll out any system changes without a solid test, approval, change, release, review process first - go ITIL !


