Outage Update 10/23/2008
Posted by Blue Box Group on Thu Oct 23 15:50:00 UTC 2008
Today’s time line:
* 2:00pm: Problem background and current state
* 4:02pm: UPS hardware status update
* 4:22pm: Photos of the Capacitor
* 9:35pm: Transition scheduled
* 11:07pm: Board Replaced, Second UPS Still Failing
* 11:26pm: Transition postponed
UPDATE 1 - 2:00PM: At this point we’ve gotten a few more technical details about yesterday’s outages from our facilities management provider.
First, some more background:
Our primary data center facility has 3 large Uninterruptible Power Supplies (UPSes) which under normal circumstances run in-line with utility power. All machines housed on this floor of the facility (both ours and those of the other internet technology companies on the same floor of the building) have their power backed by these three large UPSes. In normal running circumstances, our machines are effectively plugged directly into a couple of large UPSes, which are in turn connected to utility power. In the event that utility power cuts out, there is no transition of power to the UPSes, since our machines are already running off the UPSes. This has allowed us to survive several utility power fluctuations without a hitch.
All Blue Box Group machines are powered off UPS 1 and UPS 2 in the facility. (UPS 3 was added this summer as part of an expansion of the facility itself and so far none of our equipment is plugged into it.)
In the event that a UPS fails (or shows warning signs of failing), the generator system automatically kicks in, and load is automatically transferred to the generator, bypassing the UPS having issues. In this situation, there is a transition of power, but as long as the transfer switch in place and the electronics controlling the transition are functioning correctly, no power fluctuations should happen. And, once the transition is complete, the UPSes are completely out of the loop and can be repaired without risking a service outage.
Naturally, if regular maintenance needs to be done on the UPS hardware itself, technicians trigger the process to move over to generator power so that they can safely perform work on the UPSes without risking a service outage.
When the UPSes are behaving normally and are otherwise charged up and ready to support the facility’s load, technicians trigger the process which transfers the load back to the UPSes, which are powered by utility power.
(And just so you know how things happen in the event of a utility power outage lasting more than a few minutes: the generator system is fired up and its output is fed to the UPSes. Again, in this situation our machines are plugged directly into the UPSes, so no power transition occurs on our end.)
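The three power paths described above can be summarized in a small sketch. This is only an illustration of the description in this post, not the facility's actual control logic; the names `PowerPath`, `racks_on_ups` and `load_side_transfer` are made up for the example.

```python
from enum import Enum


class PowerPath(Enum):
    """Where the racks draw power from (hypothetical names for this sketch)."""
    UTILITY_VIA_UPS = "utility -> UPS -> racks"      # normal operation
    GENERATOR_VIA_UPS = "generator -> UPS -> racks"  # long utility outage
    GENERATOR_BYPASS = "generator -> racks"          # UPS fault or UPS maintenance


def racks_on_ups(path: PowerPath) -> bool:
    """True when the racks are fed from the UPS output."""
    return path in (PowerPath.UTILITY_VIA_UPS, PowerPath.GENERATOR_VIA_UPS)


def load_side_transfer(before: PowerPath, after: PowerPath) -> bool:
    """True when the racks themselves move between UPS power and bypass.

    Only these changes pass through the transfer switch, so only these
    depend on the switch and its control electronics working correctly.
    """
    return racks_on_ups(before) != racks_on_ups(after)


# A utility outage changes what feeds the UPSes, not what feeds the racks.
print(load_side_transfer(PowerPath.UTILITY_VIA_UPS, PowerPath.GENERATOR_VIA_UPS))
# Moving to bypass for maintenance, and moving back, both switch the racks' feed.
print(load_side_transfer(PowerPath.UTILITY_VIA_UPS, PowerPath.GENERATOR_BYPASS))
print(load_side_transfer(PowerPath.GENERATOR_BYPASS, PowerPath.UTILITY_VIA_UPS))
```

The point of the sketch is that a utility outage never switches what the racks are plugged into, while going to generator bypass and coming back from it both do.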
In order to make sure the system will work in the event of an unexpected utility power outage, and in order to be able to do work on the UPS systems, our facilities management periodically runs the system through the process of transferring the load to the generator and back. They use a 3rd party power company which specializes in putting this exact equipment through these exact transitions for this type of work.
Yesterday afternoon (10/22/2008), they were in the process of completing one of these tests of the system and had reached the point where they were ready to transfer the load from the generator back to the UPSes when we had our power event. Apparently the transfer switches and associated electronics did not function as expected: when the transition attempt was made, UPS 3 took over its load without problems, but UPSes 1 and 2 took the load for an instant and then immediately transferred back to generator power, for a reason that is still unknown. This resulted in an outage of less than one second for all systems connected to UPSes 1 and 2 (which, unfortunately, is everything that Blue Box Group owns in this facility).
The outage was short enough that many of the machines in our network kept running without problems simply because there was enough power in the form of capacitance in their power supplies to keep them going. To other machines, this outage looked like a “brown out” or low-voltage situation which put them into an unstable state which required us to reboot them after the fact. Other machines saw this as a complete loss of power and rebooted themselves as soon as power was restored.
Unfortunately, many of our distribution switches were affected by this power fluctuation and went offline until we could reboot them, which caused problems for any machines connected to them that did happen to survive the power fluctuation.
In the midst of repairing the damage caused by the first power fluctuation, Blue Box Group met briefly with the facilities management people to discuss what happened. It was believed at the time that the size of the load on the system was the leading cause of the transition switch misfire, so they went about powering down non-vital or non-production equipment that did not belong to Blue Box Group.
Our facility management decided to attempt the transition back to UPS power again at 8:08pm. This resulted in a power fluctuation similar to the 1:22pm outage. However, the power blip this time was much shorter than the first one, so the only equipment affected was three of our distribution switches, which came back quickly after they had been rebooted. Blue Box Group personnel were on-site for this outage as well.
The rest of the timeline is as detailed in yesterday’s updates.
Current state of things:
Our network is up and stable, but is currently still running on generator power.
Our facilities management company has flown in several experts from the UPS hardware manufacturer, as well as a myriad of testing, load-testing, and other diagnostic equipment in order to troubleshoot the problem of the load transfer from the generator to UPSes 1 and 2.
At this time the underlying cause of the transition switch failure is still unknown. However, as I type this they are running the equipment through many tests and repairing any problems they find. We will update this system status page once we have details on exactly what component of this system failed.
Our facilities management company expects to have the problem diagnosed and repaired later this afternoon/evening. Once the problem is repaired, they will be running the equipment through several transitions mimicking the one that failed using a dummy-load generation unit (essentially, this is a really, really big electric heater) before an attempt is made on the live, production data center load.
The generator has enough fuel for 4 days of load. Even so, they’ve already scheduled daily refueling of the generator reserve tanks (and in fact, the first refueling has already occurred). The generator is being inspected for proper function every hour by facilities management staff, and by the generator manufacturer once daily.
Our best estimate at this point is that things should be in a state to attempt another transition back to UPS power with high confidence in success sometime after midnight PST tonight. We will update this page with an exact time once we have this from our facilities management company. Blue Box Group personnel will be on-site to oversee our network when this transition occurs.
A couple of conclusions:
It’s important that power back-up equipment be periodically tested to ensure that it is functioning correctly in the case of a “real” emergency. It’s also obvious that these tests shouldn’t in and of themselves cause a “real” emergency. Unfortunately when making these tests, there is an unavoidable chance that that might be exactly what happens.
We’re working with our facilities management company to reduce the chances of catastrophe when maintenance on critical systems like this needs to happen. Once this emergency is over and we’ve had a chance to define the best possible controls and practices around this maintenance, we’ll be updating this page with more information.
If you are still experiencing any residual problems with your systems or services possibly caused by yesterday’s outages, please don’t hesitate to contact us by either dialing 800-613-4305x1 or e-mailing support@blueboxgrp.com. Again, Blue Box Group personnel will be on-site for the transition should it occur tonight, and we will be updating this page with the exact time this is supposed to happen, once we know the time.
UPDATE 2 - 4:02PM: Our facilities management has informed us that the UPS experts they’ve brought in have determined that there is a failed board in one of the failing UPSes which is the likely cause of the outages yesterday. A replacement board is being flown in and should be here by 8:00pm. Once the replacement board is installed they will run the UPSes through several transition cycles using the dummy-load equipment. If these tests work well, and there are no other indications of potentially unfixed problems, they will proceed with the actual transition back to UPS power tonight at midnight. We should have the go / no-go decision by around 9:30pm this evening. We’ll update this post as we know more.
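As a rough illustration of the go / no-go rule described in this update, here is a minimal sketch. It is not the facility's actual procedure: the names `CycleResult` and `go_for_transition` are hypothetical, and the threshold of 3 clean cycles per UPS is an assumption standing in for "several".

```python
from dataclasses import dataclass


@dataclass
class CycleResult:
    """One dummy-load transition cycle (hypothetical record for this sketch)."""
    ups: str               # which UPS was cycled, e.g. "UPS 1"
    passed: bool           # the transfer completed without a power fluctuation
    anomaly: bool = False  # any other sign of a potentially unfixed problem


def go_for_transition(results, required_ups=("UPS 1", "UPS 2"), min_cycles=3):
    """Return True only if every required UPS has enough clean test cycles.

    min_cycles=3 is an assumed value; the update only says "several".
    """
    for ups in required_ups:
        cycles = [r for r in results if r.ups == ups]
        if len(cycles) < min_cycles:
            return False  # not tested enough yet: no-go
        if any((not r.passed) or r.anomaly for r in cycles):
            return False  # a failed or suspicious cycle: no-go
    return True


clean = [CycleResult("UPS 1", True) for _ in range(3)]
clean += [CycleResult("UPS 2", True) for _ in range(3)]
print(go_for_transition(clean))                                   # all cycles clean
print(go_for_transition(clean + [CycleResult("UPS 2", False)]))   # one failed cycle
```

The rule is deliberately conservative: a single failed or suspicious cycle on either UPS, or too few cycles, yields a no-go.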
UPDATE 3 - 4:22PM: Here is a photo of the current suspect for this problem. This is part of the circuit board of the secondary UPS that communicates with the static switch. The replacement board is en route.

UPDATE 4 - 9:35PM: UPS 1 has been run through 3 transition cycles at 100% load (more than what is currently in production) without any problems. The new board for UPS 2 has arrived and is being installed. Testing transition cycles at 100% load will begin shortly. Barring any failures, or even slight problems with the new hardware during the testing that will occur shortly, we will be “Go” for the transition of data center load back onto UPSes 1 and 2. So far, the time this will happen is “sometime between midnight and 1:00am” Pacific time. We’ll keep you posted if we get a more definite time for the transition. Expect to see an update once UPS 2 has been tested several times.
UPDATE 5 - 11:09PM: The board in UPS 2 has been replaced, but UPS 2 doesn’t seem to be passing the load tests. MGE is still investigating and will decide whether to press on tonight or pause and continue troubleshooting tomorrow. We will update this post when we know more.
UPDATE 6 - 11:26PM: Our facility will continue to troubleshoot the issue but they will not be making the transition tonight. We will have more information for you in the morning.

