October 2008 Archive

Posted by Jesse Proudman on Wed Mar 31 15:28:00 UTC 2010


Ruby REXML Security Update

Posted by Blue Box Group about 1 year ago...

This is just a notice to our customers to pay attention to the REXML security problem found in Ruby. It’s simple to patch and we recommend all customers update their code. Please see this URL for more information:

http://www.ruby-lang.org/en/news/2008/08/23/dos-vulnerability-in-rexml/

  • Blue Box Tech Support

Outage under investigation

Posted by Blue Box Group about 1 year ago...

As of 1pm, our primary data center appears to have experienced a power loss. We are in the process of investigating this outage and repairing any equipment that is offline. We will keep this page updated as we learn more.

Sequence of events:

* 1:22pm - First power drop occurs.
* 1:40pm - Confirmed power is online and the majority of machines (80%) are up.
* 2:00pm - On site staff continue to bring any downed servers online.
* 2:30pm - On site staff are working on a customer by customer basis fixing lingering issues.
* 3:00pm - On site staff continue to work on a customer by customer basis fixing lingering issues.
* 3:39pm - All machines that were powered off or did not come back are now online.
* 5:18pm - All services on all machines should be restored.
* 8:08pm - Our facility attempts the cut over again. This was unsuccessful.
* 8:19pm - All networking issues created by the last attempt have been resolved. No machines were down.
* 9:00pm - Our facility has decided to pause on working on this issue for the evening. All services are presently stable.
* 9:25pm - Final Wrap-up for 10/22/2008. More information forthcoming tomorrow.

UPDATE - 1:40PM: We have narrowed the outage down to a power loss to a portion of our datacenter. Power has been restored, and we are in the process of verifying that all machines are coming back online. If your machine or website is still inaccessible, please open a ticket in our support system to describe what you are seeing. We will continue to update this post throughout the afternoon as we have further information.

UPDATE - 2:00PM: We have staff on site working to bring any offline servers back online. If your server is having issues but is accessible via SSH, please note that at the moment our first priority is to bring up any servers that are inaccessible; if you have development/technical staff on hand, you may want to see if they can have a look at the issue – otherwise feel free to open a ticket and we will take care of it as soon as we can.

UPDATE - 2:30PM: We are continuing to troubleshoot issues as they arise. Support tickets are still your best bet for any specific issues.

UPDATE - 3:00PM: We are continuing to bring sites and servers back online. If you’ve opened a ticket with us, keep an eye out for an e-mail – we will be sending notifications as we have confirmation that sites and services are back online.

UPDATE - 3:39PM: All machines are now confirmed to be powered on. We are still working to resolve servers that are online but not serving pages properly.

Also, we have solved a problem on our mail server that was causing delays in sending and receiving mail for many of our users. If you are still having mail issues, please let us know through our ticketing system.

UPDATE - 5:18PM: All known services on customer machines are now back online. If you are continuing to have a problem with your website, please open a support ticket and mark it urgent. One of our staff will take a look right away.

To answer your questions about what happened: the short story is that while our facility management was doing quarterly maintenance on the power grid, there was a momentary drop in power that caused a number of our machines and access switches to reboot. In total, the majority of our customers were down for less than an hour, and the remainder, whose machines lost power, were down for approximately 2 hours.

The long story will be documented tomorrow and posted here. We will also be issuing a credit per the terms of our SLA that should be sent out tomorrow as well.

UPDATE - 8:08PM - Our facility attempted to cut back from the generator after repairing what they believed to be the problem, and it appears their designed solution did not work. We do not appear to have lost power to any customer equipment during that cut over; however, there appeared to be an issue with 2 access switches. We continue to have staff on site.

UPDATE - 8:19PM - All issues with the two previous access switches have been repaired. All network access for all customers is functioning normally.

UPDATE - 9:00PM - Our facility has suspended work tonight to prevent any further outages. Additional personnel will be brought in from the manufacturer of the UPS system tomorrow to try to diagnose the actual issue. We will continue to operate on generator power until the cut over is made. We will have staff on site all of tomorrow to resolve any issues that arise. We will update this ticket when we have more information available.

UPDATE - 9:24pm - Our facility has said they will not be making any more changes to the power distribution system tomorrow. They will be bringing in more manpower from the UPS manufacturer, as well as load testing equipment to test the UPS systems without putting the facility’s load back on them. When the final cut over is made, it will be made after hours and we will let you know well beforehand through this support page.

Again, we apologize for all the trouble you’ve experienced today. Electricity is a tricky beast and if you look at major data center failures in the last year, you’ll see that most of them are due to some form of electrical issue. We feel fortunate that we were able to restore all our equipment within 2 hours. We also feel fortunate that this has been our first major outage affecting a sizable portion of our customer base. Our response plans worked as they should.

When everything is said and done and this whole situation has stabilized, we will complete our full assessment and provide it to each customer along with SLA credits for all outages incurred as a result of this.

If you as a customer have any comments on how we could have handled this situation better for your needs, we are very interested in hearing them. Please open a support request and we will work with you to make sure your concerns are addressed.

Thank you

  • Blue Box Group

Outage Update 10/23/2008

Posted by Blue Box Group about 1 year ago...

Today’s time line:

* 2:00pm: Problem background and current state
* 4:02pm: UPS hardware status update
* 4:22pm: Photos of the Capacitor
* 9:35pm: Transition scheduled
* 11:07pm: Board Replaced, Second UPS Still Failing
* 11:26pm: Transition postponed



UPDATE 1 - 2:00PM: At this point we’ve gotten a few more technical details about yesterday’s outages from our facilities management provider.

First, some more background:

Our primary data center facility has 3 large Uninterruptible Power Supplies (UPSes), which under normal circumstances run in-line with utility power. All machines housed on this floor in the facility (both ours and the other internet technology companies on the same floor of the building) have their power backed by these three large UPSes. In normal running circumstances, our machines are effectively plugged directly into a couple of large UPSes, which are in turn connected to utility power. In the event that utility power cuts out, there is no transition of power to the UPSes since our machines are already running off the UPSes – this has allowed us to survive several utility power fluctuations without a hitch.

All Blue Box Group machines are powered off UPS 1 and UPS 2 in the facility. (UPS 3 was added this summer as part of an expansion of the facility itself and so far none of our equipment is plugged into it.)

In the event that a UPS fails (or shows warning signs of failing), the generator system automatically kicks in, and load is automatically transferred to the generator, bypassing the UPS having issues. In this situation, there is a transition of power, but as long as the transfer switch in place and the electronics controlling the transition are functioning correctly, no power fluctuations should happen. And, once the transition is complete, the UPSes are completely out of the loop and can be repaired without risking a service outage.

Naturally, if regular maintenance needs to be done on the UPS hardware itself, technicians trigger the process to move over to generator power so that they can safely perform work on the UPSes without risking a service outage.

When the UPSes are behaving normally and are otherwise charged up and ready to support the facility’s load, technicians trigger the process which transfers the load back to the UPSes, which are powered by utility power.

(And just so you know how things happen in the event of a utility power outage lasting more than a few minutes: The generator system is fired up, and is fed to the UPSes (again, in this situation, our machines are plugged directly into the UPSes, so no power transition on our end occurs).)
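The power paths described above can be summarized in a small sketch. This is only an illustrative model of the behavior as we have described it, not the facility's actual control logic; the names `LoadPath`, `load_path`, and `load_sees_transfer` are made up for this example, and brief utility fluctuations (which the UPSes simply absorb) are not modeled.

```python
from enum import Enum


class LoadPath(Enum):
    """Hypothetical labels for the three power paths described in this post."""
    UTILITY_VIA_UPS = "utility -> UPS -> load"
    GENERATOR_VIA_UPS = "generator -> UPS -> load"
    GENERATOR_BYPASS = "generator -> load (UPS bypassed)"


def load_path(utility_ok: bool, ups_ok: bool, maintenance: bool = False) -> LoadPath:
    """Return the path feeding the machines under this simplified model."""
    # A failing UPS, or planned UPS maintenance, moves the load to the
    # generator and takes the UPS out of the loop.
    if not ups_ok or maintenance:
        return LoadPath.GENERATOR_BYPASS
    # A sustained utility outage: the generator feeds the UPS, and the
    # machines stay plugged into the UPS.
    if not utility_ok:
        return LoadPath.GENERATOR_VIA_UPS
    return LoadPath.UTILITY_VIA_UPS


def load_sees_transfer(before: LoadPath, after: LoadPath) -> bool:
    """True only when the UPS enters or leaves the path feeding the load."""
    bypass = LoadPath.GENERATOR_BYPASS
    return (before is bypass) != (after is bypass)


if __name__ == "__main__":
    normal = load_path(utility_ok=True, ups_ok=True)
    outage = load_path(utility_ok=False, ups_ok=True)
    maint = load_path(utility_ok=True, ups_ok=True, maintenance=True)
    print(load_sees_transfer(normal, outage))  # False
    print(load_sees_transfer(normal, maint))   # True
    print(load_sees_transfer(maint, normal))   # True
```

The last case, moving from generator bypass back onto the UPSes, is the kind of transition that depends on the transfer switch and its controlling electronics working correctly.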

In order to make sure the system will work in the event of an unexpected utility power outage, and in order to be able to do work on the UPS systems, our facilities management periodically runs the system through the process of transferring the load to the generator and back. They use a 3rd party power company which specializes in putting this exact equipment through these exact transitions for this type of work.

Yesterday afternoon (10/22/2008), they were in the process of completing one of these tests of the system and had reached the point where they were ready to transfer the load from the generator back to the UPSes when we had our power event. Apparently the transfer switches and associated electronics did not function as expected: when the transition attempt was made, UPS 3 took over its load without problems, but UPSes 1 and 2 took the load for an instant and then immediately transferred back to generator power, for a reason that is still unknown. This resulted in an outage of less than one second for all systems connected to UPSes 1 and 2 (which, unfortunately, is everything that Blue Box Group owns in this facility).

The outage was short enough that many of the machines in our network kept running without problems simply because there was enough power in the form of capacitance in their power supplies to keep them going. To other machines, this outage looked like a “brown out” or low-voltage situation which put them into an unstable state which required us to reboot them after the fact. Other machines saw this as a complete loss of power and rebooted themselves as soon as power was restored.

Unfortunately, many of our distribution switches were affected by this power fluctuation and went offline until we could reboot them– which caused problems for any machines connected to them that did happen to survive the power fluctuation.

In the midst of repairing the damage caused by the first power fluctuation, Blue Box Group met briefly with the facilities management people to discuss what happened – it was believed at the time that the size of the load on the system was the leading cause of the transition switch misfire, so they went about powering down non-vital or non-production equipment that did not belong to Blue Box Group.

Our facility management again decided to attempt the transition back to UPS power at 8:08pm. This resulted in a power fluctuation similar to the 1:22pm outage; however, the power blip this time was much shorter than the first one, so the only equipment affected was three of our distribution switches, which came back quickly after they had been rebooted. Blue Box Group personnel were on-site for this outage as well.

The rest of the timeline here is as was detailed in yesterday’s updates.

Current state of things:

Our network is up and stable, but is currently still running on generator power.

Our facilities management company has flown in several experts from the UPS hardware manufacturer, as well as a myriad of testing, load-testing, and other diagnostic equipment in order to troubleshoot the problem of the load transfer from the generator to UPSes 1 and 2.

At this time the underlying cause of the transition switch failure is still unknown. However, as I type this they are running the equipment through many tests and repairing any problems they find. We will update this system status page once we have details on exactly what component of this system failed.

Our facilities management company expects to have the problem diagnosed and repaired later this afternoon/evening. Once the problem is repaired, they will be running the equipment through several transitions mimicking the one that failed using a dummy-load generation unit (essentially, this is a really, really big electric heater) before an attempt is made on the live, production data center load.

The generator has enough fuel for 4 days of load, however, they’ve already scheduled daily refueling of the generator reserve tanks (and in fact, the first refueling has already occurred). The generator is being inspected for proper function every hour by facilities management staff, and by the generator manufacturer once daily.

Our best estimate at this point is that things should be in a state to attempt another transition back to UPS power with high confidence in success sometime after midnight PST tonight. We will update this page with an exact time once we have this from our facilities management company. Blue Box Group personnel will be on-site to oversee our network when this transition occurs.

A couple of conclusions:

It’s important that power back-up equipment be periodically tested to ensure that it is functioning correctly in the case of a “real” emergency. It’s also obvious that these tests shouldn’t in and of themselves cause a “real” emergency. Unfortunately when making these tests, there is an unavoidable chance that that might be exactly what happens.

We’re working with our facilities management company in order to ensure that we can reduce the chances of catastrophe when maintenance on critical systems like this needs to happen – once this emergency is over and we’ve had a chance to define the best possible controls and practices around this maintenance, we’ll be updating this page with more information.

If you are still experiencing any residual problems with your systems or services possibly caused by yesterday’s outages, please don’t hesitate to contact us by either dialing 800-613-4305x1 or e-mailing support@blueboxgrp.com. Again, Blue Box Group personnel will be on-site for the transition should it occur tonight, and we will be updating this page with the exact time this is supposed to happen, once we know the time.


UPDATE 2 - 4:02PM: Our facilities management has informed us that the UPS experts they’ve brought in have determined that there is a failed board in one of the failing UPSes which is the likely cause of the outages yesterday. A replacement board is being flown in and should be here by 8:00pm. Once the replacement board is installed they will run the UPSes through several transition cycles using the dummy-load equipment. If these tests work well, and there are no other indications of potentially unfixed problems, they will proceed with the actual transition back to UPS power tonight at midnight. We should have the go / no-go decision by around 9:30pm this evening. We’ll update this post as we know more.
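For the technically curious, the go / no-go rule described in this update could be sketched roughly as follows. This is a hypothetical illustration only, not the facility's actual procedure: the names `CycleResult` and `go_for_transition`, the default of three cycles per UPS, and the 100% load threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CycleResult:
    """One dummy-load transition cycle (hypothetical record format)."""
    ups: str           # e.g. "UPS 1"
    load_percent: int  # dummy load applied during the cycle
    passed: bool       # did the UPS take and hold the load?


def go_for_transition(
    results: Iterable[CycleResult],
    required_ups: List[str],
    min_cycles: int = 3,   # assumption: "several" cycles means at least three
    open_issues: int = 0,  # any other indications of unfixed problems
) -> bool:
    """Return True only if every required UPS passed enough full-load cycles."""
    if open_issues > 0:
        return False
    results = list(results)
    for ups in required_ups:
        # Only count cycles run at full (100% or greater) dummy load.
        cycles = [r for r in results if r.ups == ups and r.load_percent >= 100]
        if len(cycles) < min_cycles or not all(r.passed for r in cycles):
            return False
    return True


if __name__ == "__main__":
    ups1 = [CycleResult("UPS 1", 100, True)] * 3
    # UPS 1 alone has passed, but UPS 2 has no test results yet.
    print(go_for_transition(ups1, ["UPS 1", "UPS 2"]))  # False
```

Under a rule like this, a single failed cycle on either UPS, or a UPS that has not yet been tested, is enough for a "no-go".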

UPDATE 3 - 4:22PM: Here is a photo of the current suspect for this problem. This is part of the circuit board of the secondary UPS that communicates with the static switch. The replacement board is en route.

[Photo: Bad Capacitor!]

UPDATE 4 - 9:35PM: UPS 1 has been run through 3 transition cycles at 100% load (more than what is currently in production) without any problems. The new board for UPS 2 has arrived and is being installed. Testing transition cycles at 100% load will begin shortly. Barring any failures, or even slight problems with the new hardware during the testing that will occur shortly, we will be “Go” for the transition of data center load back onto UPSes 1 and 2. So far, the time this will happen is “sometime between midnight and 1:00am” Pacific time. We’ll keep you posted if we get a more definite time for the transition. Expect to see an update once UPS 2 has been tested several times.

UPDATE 5 - 11:09PM: The board in UPS2 has been replaced, but UPS2 doesn’t seem to be passing the load tests. MGE is still investigating and will decide whether to press on tonight or pause and continue troubleshooting tomorrow. We will update this post when we know more.

UPDATE 6 - 11:26PM: Our facility will continue to troubleshoot the issue but they will not be making the transition tonight. We will have more information for you in the morning.


Outage Update 10/24/2008

Posted by Blue Box Group about 1 year ago...

Today’s time line:

* 9:21am: New Parts On The Way
* 3:50pm: Another defect found
* 5:09pm: UPS 2 functional
* 9:00pm: “Go” for transition to utility power
* 9:58pm: Go Time
* 10:09pm: We’re back on UPS Power!



UPDATE 1 - 9:21am: We continue to run on generator power this morning after our facility was unable to repair UPS2 last night. Our facility took a fuel delivery yesterday and has enough fuel to run on generator until Monday. Additional fuel deliveries have been scheduled.

After further study of the problems with UPS2 from last evening, MGE has decided to fly in a few more repair parts from California. Those should arrive this morning on a commercial flight. Once they’re here, they will be installed and UPS2 will undergo the same level of testing UPS1 underwent yesterday. Assuming that goes well, the planned transfer will occur tonight after 12am. We will keep you posted throughout the day as we find out more.


UPDATE 2 - 3:50pm: A small defect has been found in the static switch used to transfer the load. This switch is being replaced. Also, today’s refuelling has occurred– there is now enough fuel on site to get us through Tuesday on generator power.


UPDATE 3 - 5:09pm: The repairs to UPS 2 are complete and it has passed load transfer tests. Technicians are now in the process of testing UPS 1 and UPS 2 together to make sure they fail over together correctly under 100% load. The replacement of the defective static switch part will happen after 7:00pm this evening, and this is a very simple replacement. At this time the cut-over back to utility power has been tentatively scheduled for 10:00pm this evening. Again, we’ll post more as we know more.


UPDATE 4 - 9:00pm: The parts for the static switch have arrived and are being installed. We should be good to go for the transition back to utility power in one hour.


UPDATE 5 - 10:00pm: Our staff is in our data center and our facility is waiting on the arrival of a few other people, and they will make the cut over within the next few minutes. We will post here when it’s complete. They have done a great job researching the cause of the failure, which we will pass on in our full write up on Monday.


UPDATE 6 - 10:10pm: The transfer was a complete success and we are now running on UPS backed power. Thank you for your patience throughout this process. As promised, we will have a full documented write up of the entire outage available on Monday along with information on SLA credits.

  • Blue Box Group Tech Support