A VMware mistake may shut down thousands of virtual infrastructures

Posted by Alessandro Perilli   |   Tuesday, August 12, 2008   |   68 Comments

This morning, VMware customers who upgraded their virtual data centers to the new Infrastructure 3.5 Update 2 (build 103908) had an awful surprise: any virtual machine that is turned off cannot be powered on again, and any attempt to execute a VMotion (the live migration of a VM from one host to another) fails.

The reason behind this huge and unprecedented issue is an error in the license expiration time.

The only way to work around the problem at the moment is to disable the Network Time Protocol (NTP) client and set the date back to August 10, as promptly suggested by a customer here.

Of course, this countermeasure affects log consistency and any tool that analyzes VirtualCenter events for other purposes (performance monitoring, trend analysis, capacity planning calculation, etc.).

More importantly, this issue affects the availability of those infrastructures whose IT administrators are on vacation (and there are many on August 12) and cannot perform any recovery.

Users from all around the world are reporting failures of parts of their systems and, in some cases, even complete outages.

VMware has over 200,000 enterprise customers (100% of Fortune 100 and 95% of Fortune 500), and it claimed that 59% of them use VMotion in production.
The company didn’t provide any statistics about how many have already deployed Update 2, but the license fault could have impacted thousands of them.

VMware is aware of the issue but couldn’t provide any immediate solution.
At the moment it seems that the entire VMware Knowledge Base has collapsed.
Customers calling the support line only receive a brief message saying that the problem will be solved within 36 hours.
Additionally, VMware removed the capability to download any affected product.

The existence of such an issue is more than enough to undermine the credibility of the company (which has already made some mistakes in the past) at a complex moment in its successful history.
A 36-hour timeframe to provide a solution is simply an unacceptable answer for all those enterprises that deploy virtualization in production.

The whole thing may severely damage the company's stock performance today.


Update: The license of VMware ESXi 3.5 U2 (build 103909) is reported as affected by the same problem.


Second update: To further aggravate the situation, today is the so-called Microsoft Patch Tuesday, so a number of guest operating systems are being rebooted automatically (or manually, by those unaware of the issue).

As if this were not enough, any customer running a VDI environment certainly allows its end users to reboot their virtual desktops any time they want.


Third update: As the VMware Knowledge Base is still unavailable, probably due to overload, virtualization.info is publishing the original KB article about this issue.

VMwareKB


Fourth update: The issue also impacts ESX 3.5 Update 1 with certain patches. 
The full details are available in the comment section of this post, thanks to the effort of a virtualization.info reader.

Suddenly the problem is no longer a matter of early adoption.


Fifth update:
As promptly reported in the comments section, VMware’s new CEO, Paul Maritz, published an apology on the official blog, announcing that a patch has been released:

…I am sure you’re wondering how this could happen.  We failed in two areas:

  • Not disabling the code in the final release of Update 2
  • Not catching it in our quality assurance process 

We are doing everything in our power to make sure this doesn’t happen again.  VMware prides itself on the quality and reliability of our products, and this incident has prompted a thorough self-examination of how we create and deliver products to our customers.  We have kicked off a comprehensive, in-depth review of our QA and release processes, and will quickly make the needed changes…

Maritz could not have wished for a worse start in his new role at the company. Nonetheless this is a great opportunity: the co-founder and former VMware CEO, Diane Greene, was often accused of being unable to grow her company into a big enterprise capable of competing against Microsoft.

In handling this incident, Maritz has his first chance to demonstrate that he’s the right person to do better than Greene.


Sixth update: VMware is still unable to republish the ESX 3.5 and ESXi 3.5 Update 2 images for fresh installations.
Their availability is expected by August 13, 2008 at 6pm PST.


Seventh update: VMware just informed its customers that it cannot deliver a new, patched image of the product by the planned deadline.

The images are now planned for release August 14, between 2am and 8am PDT.


Eighth update: A number of enterprise customers may be unable to apply the first patch released (see Fifth update above) for a number of reasons:

  • Unable to schedule a maintenance window
  • Internal change control procedures
  • No available server to VMotion running VMs onto

VMware is aware of these constraints and informed its customers that it is developing a second procedure to apply the patch, called U2 Alternative Install Process (U2 AIP), which will be available on demand by calling Support.
At the moment (August 15, 2008) there is no release date for this new patch installation procedure.

Meanwhile the full patched images are finally available online and all the download links have been reactivated.
The new build numbers are:

  • ESX 3.5 Update 2 – 110268
  • ESXi 3.5 Installable Update 2 – 110271

68 Comments

Anonymous Anonymous Tuesday, August 12, 2008 2:35:00 PM  
Or as a typical Enterprise should do, wait a few weeks or months to deploy the latest update so as to avoid these types of problems.
Anonymous Anonymous Tuesday, August 12, 2008 2:45:00 PM  
The issue will not shut down anything. It will however, prevent you from powering on new VM's. We are all waiting for a patch!
Anonymous Anonymous Tuesday, August 12, 2008 2:54:00 PM  
Rebooting a VM is no problem, as long as you don't power off the VM.
Anonymous Anonymous Tuesday, August 12, 2008 3:43:00 PM  
I second comment #1. Every enterprise admin with a little change management understanding would wait a few weeks before deploying such a critical patch in a mission-critical enterprise. Even as VI gets simpler to manage, the change process needs to be stricter.

but anyway, massimo is right, its not their first fault ... in a very very hard time ...
Anonymous Anonymous Tuesday, August 12, 2008 4:03:00 PM  
Pegasus error after installing ESX 3.5 update 1, unable to power on vm's after installing ESX 3.5 update 2, I can't wait until update 3! :-P

I'm thinking VMware should either spend a little money on their QA department or replace them with competent staff. These are both examples of items that should have been easily detected before the product ships.
Anonymous Anonymous Tuesday, August 12, 2008 4:14:00 PM  
This only affects VirtualCenter. You can use the VI Client to connect directly to a host and power it on from there.
Anonymous Anonymous Tuesday, August 12, 2008 4:18:00 PM  
I'm sorry, my comment about logging into hosts...that's a *different* bug.
Anonymous Anonymous Tuesday, August 12, 2008 4:38:00 PM  
Based on the comment from the first guy that a typical enterprise should wait. Wait how long? As far as I know, August 12 is an arbitrary date. It could have been any date.

The point is valid to let others test before you, yet that isn't important with this issue.

The issue is that VMware messed up. The enterprise admin did not. Assign blame where it belongs.

And no, I didn't get burned. Haven't loaded it at any customers yet. "waiting" :)
Anonymous Anonymous Tuesday, August 12, 2008 5:15:00 PM  
VMware forums are now down also.
Anonymous Anonymous Tuesday, August 12, 2008 5:49:00 PM  
VMware has screwed up big time. Their QA has been steadily and noticeably declining as they try and rush product after product out the door. This is a perfect example.

How many production servers are going down today? How many millions of dollars are being lost today?

My office only has a few VMware servers and you can guess there won't be anymore after this.

The guidance being given is to move clocks back two days? ARE YOU SERIOUS? IN THE REAL WORLD, WE HAVE SLAs AND ARE LEGALLY REQUIRED TO KEEP OUR SERVERS TIME SYNCED.

This is a CATASTROPHE.
Anonymous Anonymous Tuesday, August 12, 2008 6:33:00 PM  
If the simple answer is to change the date and your system(s) are hosed anyways, then what's the harm in moving the date back; getting the server back online; REMOVE the update; restart the server; put it back into production under the correct date and time? It's not working anyways so what's the deal with the SLA (if the server is "off" then how does it still maintain that it's time synced?). Move the server to an offline or NON production environ to resolve the issue and move forward. Ok So VM screwed up, learn to adapt and move forward. Take the initiative and show your bosses that you do have a brain cell and can resolve issues on your own already. Don't go blasting companies and not do anything about your current predicament because you chose not to wait awhile for any hapless bugs to be found.
Anonymous Anonymous Tuesday, August 12, 2008 6:34:00 PM  
If you think VMware bugs are bad, just wait until they start crawling out of the Microsoft product.
Anonymous Anonymous Tuesday, August 12, 2008 6:45:00 PM  
Time to look at Hyper V - it rocks!
Anonymous Jason Boche Tuesday, August 12, 2008 7:04:00 PM  
I consider jumping to another (inferior) product a little premature at this point. There are no guarantees a competing product won't make a similar mistake (it's not that hard to imagine when you consider Citrix is one of the competitors). You'd be willing to cut off your nose to spite your face? At least wait until the dust settles and we receive the final resolution and explanation from VMware.
Anonymous Anonymous Tuesday, August 12, 2008 7:21:00 PM  
Has anyone tried to "break" licensing by unloading the licensing service on your licensing server? I'd be curious to see what effect this has on the problem.
Anonymous Anonymous Tuesday, August 12, 2008 7:39:00 PM  
We've been using VMware for years and it's performed wonderfully, but I feel that I owe it to my company to take a look at all solutions and in this case I'm glad that I did. Since I have deployed Hyper V it has performed flawlessly and I'm saving my company lots of money in the process. We are moving ahead with more Hyper V deployments after taking a closer look.
Anonymous Anonymous Tuesday, August 12, 2008 7:51:00 PM  
update from my VMware rep .... looks like versions of 3.5u1 w/ certain patches may be impacted, too.

----------------------------------------
Problem:

An issue has been discovered by many VMware customers and partners with ESX/ESXi 3.5 Update 2 where Virtual Machines fail to power on or VMotion successfully. This problem began to occur on August 12, 2008 for customers that had upgraded to ESX 3.5 Update 2. The problem is caused by a build timeout that was mistakenly left enabled for the release build.

Affected Products:
- VMware ESX 3.5 Update 2 & ESXi 3.5 Update 2 (pre-Update 2 releases are not impacted by this problem).
- Reports of problems with ESX 3.5 U1 with the following patches applied.
1. ESX350-200806201-UG
2. ESX350-200806202-UG
3. ESX350-200806217-UG

- No other VMware products are affected.

What has been done?:

- Product and Web teams pulled the ESX 3.5 Update 2 bits from the download pages last night so no more customers will be able to download the broken build.
- VMware Engineering teams have isolated the cause of the problem and are working around the clock to deliver updated builds and patches for impacted customers.
- A Knowledgebase article has been published (http://kb.vmware.com/kb/1006716), but traffic to the knowledgebase is causing time outs. A new static page has been published at http://www.vmware.com/support/esx35u2_supportalert.html that customers and partners will be able to view.
- The phone system has been updated to advise customers of the problem
- Vmware partners have been notified of the issue.

Workarounds:
1) Do not install ESX 3.5 U2 if it has been downloaded from VMware's website or elsewhere prior to August 12, 2008.
2) Set the host time to a date prior to August 12, 2008. This workaround has a number of very serious side effects that could impact production environments. Any Virtual Machines that sync time with the ESX host and serve time-sensitive applications would be broken. These include, but are not limited to, database servers, mail servers, & domain administration systems.

Next Steps:
VMware to send an email to all customers who have downloaded this version. This effort is underway and should happen before 11 am today.

Resolution:

VMware Engineering has isolated the root cause and is working to produce patches for impacted customers today (August 12, 2008) and will likely re-release ESX 3.5 Update 2. More information will be provided as the product teams have it.
Anonymous Anonymous Tuesday, August 12, 2008 7:54:00 PM  
Go for HyperV. There will have to be a fundamental shift in file system design for HyperV to support VMotion style migrations. "Quick Migration" is dumb and is nothing more than Microsoft Clustering Services. The entire LUN has to be brought offline, which means one VM per LUN or you lose a bunch at once if you want to "migrate" the server.
Anonymous Anonymous Tuesday, August 12, 2008 8:31:00 PM  
A) Build/engineering:
Such a mistake should not be possible (e.g.: someone CAN forget something, but the system should not ALLOW building a release build with #define EXPIRE 1)

B) Product Management:
Why the hell put a timebomb in betas anyway... who cares if someone doesn't upgrade it and goes into production with it? They deserve trouble if they do (pop a warning = ok, make product useless = not ok)

C) IT/website:
A knowledge base is supposed to be up and running during this kind of event. VMware has to understand that there is NO EXCUSE for not being able to maintain a KB site 365 days a year (/., theregister or whatever). VMware wants to show they can beat Microsoft, Redhat, Citrix and whoever else? Fine, learn how to run a simple website first. If your KB solution cannot hold the load, ditch it for all I care (anyway the relevance is crap and google indexing would not hurt).

So, if I may, let's quote the email of Paul Maritz the 9th of July (1 month ago):

"As such, I will call upon our leadership team to be more empowered in decision making, as well as drive down accountability and decision making at all levels in the company."

Now would be a *very* good time to start "driving down accountability" Paul...
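
The build-side guard asked for in point A of the comment above can be sketched in a few lines. This is only an illustration under stated assumptions, not VMware's actual build system: `BuildConfig`, `ReleaseGateError`, `check_release_gate`, and the `channel` / `expiry_date` fields are hypothetical names, and Python stands in for whatever tooling a real build pipeline would use.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class BuildConfig:
    """Hypothetical build settings; the field names are illustrative only."""
    channel: str                         # "beta" or "release"
    expiry_date: Optional[date] = None   # time-bomb date, if one is set


class ReleaseGateError(Exception):
    """Raised when a configuration must not be shipped."""


def check_release_gate(config: BuildConfig) -> None:
    """Refuse any release-channel build that still carries an expiry date."""
    if config.channel == "release" and config.expiry_date is not None:
        raise ReleaseGateError(
            f"release build has expiry date {config.expiry_date.isoformat()}"
        )
```

Run as a mandatory step before packaging, a check like this turns a forgotten flag into a failed build rather than a shipped product; a beta configuration with an expiry date still passes.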
Anonymous Anonymous Tuesday, August 12, 2008 8:51:00 PM  
XenServer from Citrix is absolutely SOLID! XenServer 5.0 is in beta2 now even has HA and XenConvert (Physical 2 Virtual conversion)capabilities built in.

VMware Who? I dont trust them. As listed in the above statement, They have helped me confirm why.
Anonymous Anonymous Tuesday, August 12, 2008 9:03:00 PM  
From vmware kb page

The Update patch bundles will be released separately later in the week.

So if you are one of the unlucky ones to have applied the Update 2 patch, now you will have to wait longer than noon PST to resolve the issue.
Anonymous Anonymous Tuesday, August 12, 2008 9:06:00 PM  
Maybe VMware should buy some NetScalers for their website...... :-p
Anonymous Anonymous Tuesday, August 12, 2008 9:13:00 PM  
First off, throw VMware away. Too expensive, tech support very unresponsive and there are better alternatives.

Evaluate Hyper-V or Citrix XenEnterprise. The performace gained by going with XenEnterprise caught all of my IT staff off guard. Much better performance.

If you are using ESX/ESXi in pre-production, again, look at XenEnterprise or Microsoft Hyper-v. You'll need a management application so look no further than VMLogix. They have a labmanager that supports all top 3 hypervisors.

Then call up your Vmware rep. Tell him that you went w/a competitor. This is the only way VMware will get better.

But, also consider they whacked their Founder, Dianne Green and brought in a Microsoft suit. The entire VMware culture will and has changed, for the worse.
Anonymous Anonymous Tuesday, August 12, 2008 9:15:00 PM  
Jason.....your clueless on your comment about Citrix. you shouldn't make ignorant comments.
Anonymous Anonymous Tuesday, August 12, 2008 9:34:00 PM  
1) there is a workaround
2) it fails to start; it's not the same issue as described in the subject
3) it's still a critical issue and the kb site should work 24/7

4) xen is in the same court (path to upgrade between 4.0 & 4.1 is not so smooth)
Anonymous Anonymous Tuesday, August 12, 2008 9:54:00 PM  
ESX 2.5.x is as solid as a rock, but when 3.x.x came the shit happened at VMWare. It is loaded with little bugs and needs more patches than a Windows server; that's why they built an update server! The price of VMWare is astronomical, and support is also very expensive, but if you need them they are not there to help you. It will take 36 hours to get it fixed, and then you will have to download the ISO file, together with thousands of others! The patch is released next week. Seriously, this demonstrates exactly how seriously VMWare takes this little mistake.

If you want to look at the future, look at XenServer, HA, Multipathing, NetApp connector, XenMotion and a real highly available XenCenter!!! I can tell you it is a lot cheaper, solid as a rock and performs even better than VMWare.
Anonymous Anonymous Tuesday, August 12, 2008 9:57:00 PM  
A simple fix to this is to buy Xen Server.
Anonymous Anonymous Tuesday, August 12, 2008 9:58:00 PM  
Dont you mean XenServer 4.2? And the beta is very impressive, I'm thinking about switching.
Anonymous Anonymous Tuesday, August 12, 2008 9:59:00 PM  
And when will XenServer 5.0 be available etc.??
Anonymous Anonymous Tuesday, August 12, 2008 10:43:00 PM  
Just over a month from now.
Anonymous Anonymous Tuesday, August 12, 2008 10:49:00 PM  
Citrix Fanboys....Citrix write some of the buggiest software I know. I should know, I have been working with it for over 10 years. Every roll-up causes huge problems....clients that take years to fix etc. Yes, VMware have dropped the ball here....but I bet it won't happen again. Citrix on the other hand repeat the same mistakes over and over again.
Anonymous Jason Boche Tuesday, August 12, 2008 11:04:00 PM  
"Jason.....your clueless on your comment about Citrix. you shouldn't make ignorant comments."

Hello Anonymous: It's possible we have different perceptions on Citrix, but that doesn't make my perception any more ignorant than yours. Perhaps if you knew of my experiences, you would understand my point of view.

If you want to talk further about ignorance, check the grammar in your post "(your)".

Jas
Anonymous Anonymous Tuesday, August 12, 2008 11:40:00 PM  
It is always best practice for organisations to roll out a new code release in dev/test then commit to production. However, having said that, how long do you test: 1 week, 1 month, 1 year? Nothing can detract from the fact that this is a serious VMware blunder.
Who would have thought their own engineering team could have created the perfect trojan horse. It is so ridiculous that one would even have to wonder whether it was an inside job to upset the apple cart? Maybe one last line of code from Diane?
Anonymous Anonymous Wednesday, August 13, 2008 12:02:00 AM  
A sad sad day for VMware, however, I will still continue to use VMware and recommend VMware to my entire client base. They are so far ahead of any competitor - SRM, DPM, ESXi, Storage VMotion. Look what's round the corner: CA - Continuous Availability (Active-Active cluster VMs), SVI, Linked Clones, true Thin Provisioned vdisks.

Citrix + HyperV arent even at the same stage as ESX2.5 - so please stop adding stupid citrix + hyperv comments.

HyperV doesnt even have a VMotion equivalent.

Citrix has only just introduced HA functionality - and please - remember the first release of XenMotion !!! advised not to be done during business hours !!!

Also, Transparent page memory sharing - or memory dedupe as it is now fashionably called. Citrix + HyperV cant do it ! A simple thing like VMs sharing identical pages of memory !!

VMware, you are like my Alfa Spider. You cost me a lot of money, you are technologically advanced, when things go wrong it hurts, and a lot of people knock you, but I am prepared to put up with a little pain for the privilege of having such a state-of-the-art setup for the last 5 years.

But please release the patch soon !!
Anonymous Anonymous Wednesday, August 13, 2008 12:37:00 AM  
Check your facts before "publishing" something... Microsoft Patch Tuesday does not affect this nor does it have anything to do with this. I just rebooted one of my VMs on an affected server with no problem whatsoever. The issue occurs ONLY when attempting to POWER ON a VM. If the VM is powered on already, it will continue to remain operational unless something powers it off. Granted, not the best thing, but not quite as bad as you make it sound. Let's see how fast they get a fix out, and then compare that to a monthly "Patch Tuesday".
Anonymous Jacques Cronje Wednesday, August 13, 2008 1:00:00 AM  
They will let anyone post on the Internet these days.........
Advocating caution and diligence with changes on enterprise infrastructure on one post and then recommending Hyper-V and Citrix on another? You guys SHOULD be posting anonymously.... and btw, Jason's been posting on VMware communities before most of you made it out of high school.

Jacques Cronje
Anonymous Anonymous Wednesday, August 13, 2008 4:42:00 AM  
Glad that I migrated everything to Hyper-V last week...
Anonymous Anonymous Wednesday, August 13, 2008 5:21:00 AM  
Wow, even in a potentially bad situation, VMware still just rocks. The resolution is out: ESX doesn't even require a host reboot to resolve. ESXi requires a reboot, but can VMotion migrate all powered on VMs to another unaffected host so your VMs still have zero downtime while the host is being rebooted. This is the coolest technology ever.
Anonymous Anonymous Wednesday, August 13, 2008 7:37:00 AM  
The Patch Tuesday thing..

Wont the hosting server get a reboot if it gets patched, therefore powering off the VM's and thus causing them to require a power on causing them to break?
Anonymous Leo Bink leo@bink.nu Wednesday, August 13, 2008 7:38:00 AM  
ALl,

At 5:17 (European time) I received an email from vmware stating that there is a Patch released for this issue. A general message to the community has also been released.

http://www.vmware.com/go/esxexpresspatches

Regards,

Leo Bink
Anonymous Dycell Wednesday, August 13, 2008 10:08:00 AM  
I'm always interested in alternatives other than VMware. But the problem is that there is no alternative. I am a certified Xenserver administrator and i think the product sucks. New features include "multipath" and "fibre shared storage"?? This should have been included in the first release.

Sure VMware f*cked up big time here, but if i switched products on that count i would have dropped Citrix and Microsoft a long time ago.

I still strongly agree with Anonymous Wednesday, August 13, 2008 12:02:00 AM
Anonymous Anonymous Wednesday, August 13, 2008 11:10:00 AM  
Is VMware paying for these kind of comments from "Anonymous Wednesday, August 13, 2008 5:21:00 AM"?
Anonymous Anonymous Wednesday, August 13, 2008 11:32:00 AM  
I personally, can forgive this one incident. Nearly a hundred hosts, over 2000 guests, and I have had no major issues(knock on wood) in years.

Of course, no one on my team is crazy enough to put in a patch or update until it's been in release for a while either....short of it solving an issue. That's certainly a bit too trusting for my tastes.

Hyper V is utter crap, xen is a lot better, but neither are anywhere near vmware's level...so I'm not switching, or advising a switch over what amounts to a lot of administrators ignoring the cardinal rule of not updating until the patch is matured.

Microsoft is turfing like a demon now though, it's fun to watch. Every forum or thread I see has at least one turf of hyper v "Vmware sucks so bad, I'm moving to hyper v, because performance is awesome!"....Yeah right. Someone apparently hasn't loaded down a hyper v server and watched as it vomited on itself. If you have the need of a cheaper virtualization product that doesn't do vmotion or true HA, go with the free esx product(and still, use your damned brain and wait before patching--it's not a toy, and it's not shiny--so why act like it is?!?)
Anonymous Paul Wednesday, August 13, 2008 11:36:00 AM  
That happens on both sides e.g. did Microsoft pay for this comment "Glad that I migrated everything to Hyper-V last week..."

Reality is, as bad as this situation is, had this problem happened to Hyper-V, it probably would have required a reboot. A reboot in Hyper-V means taking off your workload using "quick migration", which means tens of seconds or minutes of downtime per VM. Assuming your VMotion was still operable here, there would have been zero downtime caused by this patch. Even if you were unlucky and both your source and destination were impacted by this issue, not allowing you to VMotion, being able to use VMotion for installing patches is certainly the norm, and therefore installing VMware patches normally does not result in any downtime.
Anonymous Anonymous Wednesday, August 13, 2008 2:48:00 PM  
Well, I think the best solution is to stop arguing, and find the remedy.
Anonymous Anonymous Wednesday, August 13, 2008 3:12:00 PM  
Some of the leaders at Vmware have talked about the death of Windows, and in operating systems in general. Now that ESX is essentially an operating system in its own right, they need to prove they are up to the task of supporting a broadly-deployed OS. This requires a lot of sophistication from any company, across many organizational/functional dimensions.

This incident reminds us that it is not easy to be an OS vendor. VMware is experiencing some growing pains. Hopefully for their sake they jettison the smug attitude about taking on the world as the "new operating system" and replace it with a bit of humility and responsibility that comes with their wide adoption.
Blogger Gareth James Wednesday, August 13, 2008 4:31:00 PM  
"They are so far ahead of any competitor"?
All currently supported by Xenserver - Xen embedded, XenMotion, High Availability, NetApp Storage Migration. Continuous Availability with Marathon for Xenserver, not to mention SVI (When?) Citrix Provisioning Server is already doing it! Xenserver is becoming Vmware's Roadmap :)
Anonymous Anonymous Wednesday, August 13, 2008 6:15:00 PM  
"Jason.....your clueless on your comment about Citrix. you shouldn't make ignorant comments."

Hello Anonymous: It's possible we have different perceptions on Citrix, but that doesn't make my perception any more ignorant than yours. Perhaps if you knew of my experiences, you would understand my point of view.


Sorry....I'll get on my knees and praise the almighty VM gods.

You need to get out more
Anonymous Anonymous Wednesday, August 13, 2008 6:49:00 PM  
why no one mention Virtuozzo? OS virtualization is much easier to manage and support.
Anonymous Anonymous Wednesday, August 13, 2008 7:25:00 PM  
Soooo.. now that the smoke is clearing, what was the actual impact? How many of you actually saw production downtime? How many will stop using VMWare?

We had not loaded the update so saw no impact.
Anonymous Anonymous Wednesday, August 13, 2008 8:59:00 PM  
I had a bad experience yesterday: my Oracle server stopped, and when I powered on the machine I received the error. Fortunately I read about the bug at virtualization.info and I "fixed" it. I will give VMware a chance because Citrix has a lot of bugs too, and Microsoft, everybody knows. Let's give VMware one last chance.
Anonymous Fridge Wednesday, August 13, 2008 9:09:00 PM  
I'm guessing that most of the Anonymous comments bashing VMware are from people who don't like / use VMware anyways rather than being from someone who is actually using VMware and had problems caused by this update.

1) Patch Tuesday should have very little effect on anyone here. The problem manifests itself when you power on a VM, not when you reboot a VM. How many people apply patches and then power their servers off, then on rather than reboot.

2) For those people that did have a problem with a VM being powered down and then not being able to power it back up, I'm guessing it was probably one or a couple of VM's rather than dozen's or hundred's in a single company. I don't know why anyone would power off dozen's or hundreds of production VM's at the same time before trying to power one back on.

3) I don't know about Citrix products, but I know from past experience that MS has put out updates that caused more problems than this in the past. With Microsoft's patching history, I can't see going years in a Hyper-V environment w/o at least one major downtime issue caused by a MS patch.

I agree with the the Anonymous post from Wednesday, August 13, 2008 7:25:00 PM, what was the actual impact?

How many people saw actual production downtime?

In the instances of actual production downtime, how many VM's was it?

Outside of anyone who was unfortunate enough to have a hardware failure requiring vmotion, I am guessing the actual impact was very minimal compared to the firestorm of people proclaiming the sky was falling.

Personally I manage 9 ESX hosts with 90+ production VMs and did not have a single second of production impact from this update.
Anonymous Anonymous Wednesday, August 13, 2008 9:33:00 PM  
Virtuozzo??? are you serious? it's for kids not for enterprises... maybe it's easier, but worst when it comes to security and reliability
Anonymous Attila Bognár Wednesday, August 13, 2008 10:21:00 PM  
Hi Alessandro,

I have been reading your blog for quite a long time, but to be honest you tend to be more speculative than objective; this last post of yours is too tabloid for my taste (although I don't want to underestimate the importance of this VI3 bug).

"A VMware mistake may shutdown thousands of virtual infrastructures": this bug won't shutdown anything

"To further aggravate the situation, today is the so called Microsoft Patch Tuesday": a soft reboot is not a problem

And so on...

IMHO your credibility is more important in the long run, personally I would prefer virtualization.info as objective as possible and constructive in case of problems (regardless of vendor).

I think facts are more important than generating feelings, but it's your job, it's up to you.

Best regards,

Attila
Anonymous Anonymous Wednesday, August 13, 2008 10:38:00 PM  
I've skimmed through the posts and I'm a bit surprised not to see people questioning the fact that there is code in ESX that will prevent powering up a VM or migrating it when there is a license violation.

I'm not talking about the bug which caused the license violation. I'm talking about the fact that when a license violation occurs they take extreme action. All software, including licensing components, will have bugs.

I worked with a product where a licensing-related defect caused 2 soldiers in Iraq to lose their lives. ESX isn't a toy, it's basic infrastructure. VMware can't predict how people will use the software and how mission critical some features will be. And it is not acceptable for VMware to cause a customer downtime so that they can protect their revenue. The ramifications of a widespread failure like this are huge.
Anonymous Anonymous Wednesday, August 13, 2008 10:59:00 PM  
VMware rocks - even when MSF try to bad mouth!!
Anonymous Paul Thursday, August 14, 2008 1:51:00 AM  
Support your comments Attila
Anonymous Joe M. Thursday, August 14, 2008 3:18:00 AM  
I have to say, keeping cool in situations like this is very important. This posting did much to confuse a lot of folks at my workplace who did think the sky had fallen...

Once I took the time to really gauge the impact and speak with my VMware SE and TAM, I quickly realized this was not the disaster that the author of this blog had claimed... I simply then waited for the patch to come in and have just now completed the remediation on my ESX35U2 farm. I suffered no production downtime or end-user impact.

To the author, as you well know many are just now coming to grips with virtualization as a concept, and your posting caused a lot of unnecessary confusion in my shop over the past 24 hours. I respect what you do, i only ask that you be a bit more careful jumping on an issue while facts are far from complete.
Anonymous Anonymous Thursday, August 14, 2008 9:23:00 AM  
Alessandro,

I read your blog almost every day and so far it has been very objective, maybe the most complete and informative blog on the topic. I just can't understand why such a misleading title for this issue. I use VMware (and Xen as well) and I have to say that I was pretty alarmed when I first read your article.
As already pointed out by other posters, there is no shutdown. The problem occurs only when you power off (not reboot) a VM, then you would be unable to power it on again. There is a significant difference.
I expect a correction on your blog.
Anonymous Jad Thursday, August 14, 2008 9:56:00 AM  
I agree with the comment that this blog is just becoming less and less objective. In this article, and especially the one about the Diane Greene firing, Alessandro gave in to sensationalism and speculation, instead of presenting the truth.

I once respected your blog. No more.
Blogger Phil Thursday, August 14, 2008 11:44:00 AM  
Note to ALL software vendors.

If you're going to build a product which in ANY of its incarnations involves time-bombed code then design it so that the product and any management console prominently displays a licence expiry message for at least one month before functionality is disabled. That way, users and the vendors get advance warnings.

Now, it could be that the code paths get so screwed up that this sort of mechanism could fail.

Oh, and if you're going to time-bomb your betas make them expire well before the RTM date, so if the wrong code gets into the RTM build it is found immediately.
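
The warning window described above can be sketched roughly as follows. This is only an illustration in Python, not how any VMware product works: the function name `licence_status`, the 30-day window, and the choice to treat the expiry date itself as expired are all assumptions.

```python
from datetime import date, timedelta

# Assumption: "at least one month" is modelled as a 30-day warning window.
WARNING_WINDOW = timedelta(days=30)


def licence_status(today: date, expiry: date,
                   window: timedelta = WARNING_WINDOW) -> str:
    """Classify a licence as 'ok', 'warning' or 'expired' (hypothetical helper).

    'warning' is the state in which the product and its management console
    would display a prominent expiry message while still working normally.
    """
    if today >= expiry:
        # Assumption: the licence stops being valid on the expiry date itself.
        return "expired"
    if today >= expiry - window:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Illustrative expiry date only.
    expiry = date(2008, 8, 12)
    for day in (date(2008, 7, 1), date(2008, 7, 20), date(2008, 8, 12)):
        print(day.isoformat(), licence_status(day, expiry))
```

With a check like this, a time-bomb left in a release build would show up as a visible "warning" state weeks before anything is disabled.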
Anonymous Anonymous Thursday, August 14, 2008 11:48:00 AM  
Alessandro,

This is tabloid, emotive and all written in the negative. I'm honestly wondering if Microsoft is helping you with your editing. Just in this one article

- "may shutdown thousands of virtual infrastructures" It hasn't
- "The users from all around the world are reporting failures" As above
- "The existence of such issue is more than enough to undermine the credibility of the company" First time something like this has happened
- "A 36 hours timeframe to provide a solution is just an unacceptable answer". I think their response has been excellent
- "The whole thing may severely damage the stock performance of today" Stock has gone up.
- "VMware is still unable to republish"
- "VMware just informed its customers that it cannot deliver a new, patched image of the product for the planned deadline."

Like other readers, I prefer objectivity, facts and less sensationalism.
Anonymous Anonymous Thursday, August 14, 2008 12:52:00 PM  
Agreed, but Alessandro has been a rock for years now. Like Vmware, you cannot allow a short term problem to denigrate the person or company.

I will still read the site, and learn from it. It would take many incidents to rock my faith in him, as it would vmware.
Anonymous Anonymous Thursday, August 14, 2008 4:21:00 PM  
All the VMware shills can go home now. Alessandro has been a leader in this industry for years. Period.

VMware screwed up in colossal fashion. Deal with it and shut the hell up.
Anonymous Anonymous Thursday, August 14, 2008 7:25:00 PM  
All the anti-VMware shills can go home now. VMware has been a rock for years now. Period.

The biggest VMware bug in years happened to have... hardly any impact at all on production systems.
Anonymous VirtualSynic Saturday, August 16, 2008 8:15:00 AM  
Shows you what a disgruntled employee can do.
I'm sure most VMware employees hate EMC and Microsoft.
Good way to express your feelings in support of Dianne...
Anonymous Anonymous Wednesday, November 12, 2008 4:55:00 PM  
Wow ! I can't believe admins would roll out any patch/upgrade without a solid test plan and change control. Duh, create a few simple actions in your test plan to make sure that the product is functional, like powering off/on a VM, vmotion or whatever.

Sure VMWare made a big mistake, but admins should NOT readily roll out any system changes without a solid test, approval, change, release, review process first - go ITIL !
Anonymous Anonymous Saturday, November 15, 2008 5:43:00 PM  
mr anonymous you are a moron. this was timebombed licensing. I'm sure it worked fine for a while. When do you expect someone to roll a patch to a test system, wait 30-60 days then test again to make sure there is no timebomb?
