February 2009 Archive

Posted by Jesse Proudman on Wed Mar 31 15:28:00 UTC 2010


February 12th Network Outage

Posted by Blue Box Group about 1 year ago...

We experienced a network outage beginning at approximately 2:00 PM PST and ending around 2:45 PM PST. We are currently investigating the cause of this outage and will have more information available in the near future. Please keep checking the system status page for continued updates.

Please note, this outage has also taken out our corporate phones.

If you’re noticing a problem, please open a ticket at https://support.blueboxgrp.com.

Update 3:00pm - Our office phones are now back online. Please call us if you are noticing any lingering issues. We are continuing to diagnose the issue and will provide an update here in 15 minutes.

Update 3:15pm - We have a fairly solid understanding of the cause of this outage, but we are continuing to identify a few elements. Please expect our formal write-up by 4pm.

Update 4pm - As of 4pm, here is what we have learned: At approximately 2pm Pacific, the passive member of one of our redundant F5 load balancer pairs switched from passive to active, even though the existing active member was still active and operating as expected. This caused the two devices to conflict over the many shared MAC addresses they maintain. It is these shared MAC addresses that allow the “high availability” functionality of most “high availability” devices to work. Unfortunately, this dual announcement of MAC addresses sent our distribution layer switches into a spanning tree loop as they both tried to re-learn the MAC address / IP address relationships of these dual announcements.
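To make the failure mode concrete, here is a minimal Python sketch of what a dual MAC announcement looks like from a switch's point of view: the same address keeps being learned on different ports. The event format, the port names, the MAC addresses and the `find_flapping_macs` helper are all hypothetical illustrations, not data or tooling from our switches or F5 units.

```python
from collections import defaultdict

def find_flapping_macs(learn_events, min_moves=2):
    """Return {mac: sorted ports} for MACs that changed port at least min_moves times.

    learn_events is an iterable of (mac, port) tuples in the order a switch
    learned them (a hypothetical format chosen for this sketch).
    """
    last_port = {}
    moves = defaultdict(int)
    ports_seen = defaultdict(set)
    for mac, port in learn_events:
        ports_seen[mac].add(port)
        if mac in last_port and last_port[mac] != port:
            moves[mac] += 1  # the address was re-learned on a different port
        last_port[mac] = port
    return {mac: sorted(ports_seen[mac])
            for mac, count in moves.items() if count >= min_moves}

# Made-up events: both load balancers claim the same shared MAC address.
events = [
    ("02:00:00:aa:bb:01", "port1"),  # active unit
    ("02:00:00:aa:bb:01", "port2"),  # standby unit also announces it
    ("02:00:00:aa:bb:01", "port1"),
    ("02:00:00:cc:dd:02", "port3"),  # ordinary host, never moves
    ("02:00:00:aa:bb:01", "port2"),
]
flapping = find_flapping_macs(events)
print(flapping)
```

An address that alternates between two ports like this is the signature we would expect when two devices both believe they are the active member of a pair.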

At 2:20, we discovered the root cause of the issue (the secondary load balancer) and physically unplugged the offending load balancer’s power. This appeared to resolve the issue for a moment, and our network came back online for a short period of time (5-10 minutes), only to drop back offline shortly thereafter.

By 2:40, we identified a looping condition in spanning tree between our two redundant core distribution switches. The resulting packet loop drove up CPU usage on these switches and prevented them from switching packets. We unplugged our second redundant switch to break the loop and force spanning tree to properly rebuild on our primary distribution switch. This action broke the cycle and allowed the spanning tree tables to properly rebuild on the primary switch.
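The reason unplugging the second switch helped can be pictured as a graph problem: any link that joins two switches which are already connected closes a loop, and spanning tree exists to block exactly those links. The Python sketch below uses a simple union-find pass to list such redundant links. It is not how the spanning tree protocol itself elects a root or blocks ports, and the switch names and topology are hypothetical.

```python
def find_redundant_links(links):
    """Return the links that would close a loop if every link forwarded traffic.

    links is a list of (switch_a, switch_b) pairs, processed in order. A link
    whose two endpoints are already connected is reported as redundant.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            redundant.append((a, b))  # endpoints already connected: a loop
        else:
            parent[root_a] = root_b   # merge the two connected groups
    return redundant

# Hypothetical topology: two distribution switches and one access switch
# with a redundant uplink to each.
topology = [
    ("dist1", "dist2"),
    ("dist1", "access1"),
    ("dist2", "access1"),
]
print(find_redundant_links(topology))

# Removing the second distribution switch leaves no possible loop.
without_dist2 = [link for link in topology if "dist2" not in link]
print(find_redundant_links(without_dist2))
```

With the second switch removed there is no redundant link left, so there is nothing for spanning tree to get wrong while it rebuilds.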

As of 4pm, our secondary distribution switch remains disconnected from the network and will remain as such in the short term. Further, our secondary load balancer remains unplugged until it can be inspected by F5 to ensure that it is in proper operating order.

We will post here prior to making any recovery efforts to bring the secondary network gear online. We expect that to occur tomorrow night after 12am but will post here before any action is initiated.

It’s worth pointing out that this event was very similar in nature to what happened to Slashdot last week, as written up by Pingdom. Fortunately, our staff was able to resolve what can often become a multi-hour outage very quickly through the use of the network discovery tools we have previously implemented.

As with every outage that we experience, our staff will step back and take a look at what could have been done to expedite the recovery from an event such as this. We will post any further modifications or changes we make to our policy here as they are determined.

Please note, we will be issuing SLA credits for this outage to all customers including shared hosting. We expect these to be distributed on Friday, February 13th.

Update 5:00pm - Further analysis has indicated that this issue did not affect all customers. Rather, it was limited to specific VLANs. It does appear that there was a full system-wide outage for approximately 10 minutes. We will continue to research this issue throughout the evening and will provide more information tomorrow.

Thank you for your patience and your business.

  • Blue Box Tech Support

February 12th Network Outage Follow Up

Posted by Blue Box Group about 1 year ago...

We will be posting updates on our discovery of the February 12th outage in this entry.

Update - February 13th, 2:15pm - We are continuing an exhaustive diagnosis of the outage, going line by line through all the log files of the associated devices. We want to make sure we have a full explanation and plan to avoid this issue in the future before we post. We will be delaying our full write up until next week.

Update - February 19th, 5:15pm - We believe we have a fairly strong grasp of what caused this outage and how to resolve it. We are now building a mock network in our office to try to replicate the problem and confirm our suspicions. We expect to have that testing complete by next week and will then modify our document accordingly and send out our credits and our write-up.

Update - February 24th, 5:45pm - We have completed our write up of this incident and have sent it to our customer base via email. The problem was traced down to what is believed to be a faulty piece of hardware and was caused by one of our standby F5 units going active. If you did not receive your copy of the write up, or have further questions, please do contact us.

Thank you for your patience.

  • Blue Box Tech Support

February 16th Upstream Network Issue

Posted by Blue Box Group about 1 year ago...

One of our upstream network providers had a problem this morning that caused a failure in their routers. We immediately shut down our announcements through their network and are waiting on BGP to reconverge for those routes to disappear. This is only affecting customers who were previously connecting through that path.

This is unrelated to the February 12th outage. All of our internal systems are functioning normally.

Update 9:02am - Routes are now properly updated across the global routing table.

Update 9:13am - The root cause of this event, a mis-configured router from Europe sending bad data into the global BGP table, is being discussed on NANOG (North American Network Operators Group). The event is apparently causing a number of transit providers trouble this morning.

Update 9:28am - Here are some examples of the effects of this type of update. Our routers stayed operational, but upstream routers bounced traffic during this event. That BGP session bounce is what caused the BGP reconvergence and why we shut down our connection with that provider.

Update 9:35am - More news is coming out on NANOG and through our communications channels with our upstream provider that this issue affected some major backbone transit providers.

Update 6:42pm - The story has finally been picked up by Slashdot. You can read about it here: http://tech.slashdot.org/article.pl?sid=09/02/16/2233207

Update 7pm - For those of you who are curious, here’s what our bandwidth usage looked like during the period. You can see the highlighted sections where we shut down our BGP session with the broken upstream provider, and you can see the dips in traffic where other providers throughout the internet had problems with their routers.

Fortunately, our routers stayed online and were not brought down by the Cisco bug that caused most of the damage. They continued to route all traffic they received just fine.

Update 10:15pm - Two more interesting articles have been posted about this event. One was posted by Renesys and one by Arbor Networks.

  • Blue Box Tech Support

UPS System Maintenance

Posted by Blue Box Group about 1 year ago...

On Thursday, March 5th starting at 06:00 hrs PST and ending at 18:30 hrs PST, our primary facility will perform annual maintenance on their UPS systems. This maintenance will include replacing the batteries in UPS 1 and upgrading the capacitors in UPS 3. Our equipment utilizes power from both UPS systems.

For the duration of this maintenance, the data center electrical load will be transferred to generator power. This event should cause no outage for any of our customers. However, with the events of October fresh in our mind, we are writing to inform you of the pending work, and to let you know of the precautions our facility is taking to ensure that a similar power failure does not happen again.

This maintenance window has been selected by our facility based on factors designed to provide the greatest degree of protection for its tenants. These factors include the availability of the most senior technicians from their UPS vendor to oversee the transition on site, and the ability to mobilize parts and additional service personnel in the extremely unlikely event of a component malfunction or failure during the maintenance. Additionally, by servicing both UPS systems during the same maintenance window, the facility provider eliminates an additional operation of our maintenance bypass switch thus limiting exposure to a potential voltage drop to data center critical equipment.

The schedule for this maintenance is as follows:

06:00 - Data center temperature reduction: Our facility will force the temperature of the data center toward the lower portion of the ASHRAE allowable envelope prior to transferring power to the back-up generator. When the transfer takes place, the HVAC systems will power-cycle, causing a small (3 to 5 degree Fahrenheit) rise in temperature in the data center. By lowering the overall space temperature beforehand, we can be assured that equipment temperatures will not be adversely impacted by the 10-minute restart period of the HVAC systems.

07:00 to 07:30 - Generator pre-flight evaluation: Check fluids, connections, air inlets/exhaust, fuel supply, fuel lines, filters and separators. Log results.

07:30 - Generator start-up and warm-up: Start the generator and monitor performance stats, check for fluid leaks, supply artificial load and evaluate voltage, amperage, frequency and engine stats against established baselines.

08:00 - Transfer from grid power to generator power and wrap power around UPS: Manually operate the ATS and UPS maintenance bypass to force the data center electrical load to the generator. At this time our facility’s HVAC system will power cycle as described above.

08:00 to 12:00 - Replace batteries in UPS 1: The UPS system must be completely powered off for life safety while the batteries are removed and replaced.

12:00 to 12:30 - Power on UPS system and perform artificial load testing: Following the removal and replacement of components, the UPS systems will be powered on and connected to an artificial load (not data center equipment, servers or other infrastructure) for testing. With this equipment, our facility will test the transfer process and mechanism as well as simulate a load of 80% of the UPS system capacity. This will catch any failure similar to the one that occurred last October prior to putting live load on the system.

13:00 - Transfer data center load back to UPS 1: Once the process and mechanism have been validated, our facility will transfer the data center load back to UPS 1. Power will continue to be supplied to the UPS from the back-up generator.

13:30 to 17:30 - Replace capacitors in UPS 3: The UPS system must be completely powered off for life safety while the capacitors are removed and replaced.

17:30 to 18:00 - Power on UPS 3 and transfer data center load: Once the capacitors have been replaced and charged, our facility will transfer the data center load back onto UPS 3.

18:00 - Transfer data center live load back to grid power: Once all testing has been completed and both UPS systems are online and functioning within specifications, our facility will transfer the data center live load back to grid power.

18:00 to 18:30 - Generator Cool-Down and Post-Flight Evaluation: Following the operation of our generator, our facility will perform the same evaluations they performed during the pre-flight as well as allow the generator to cool prior to shutting it down.
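For anyone who wants to sanity-check a timeline like this, the schedule can be written down as data and checked mechanically. The Python sketch below is only an illustration: the step names are shortened from the schedule above, single-time entries are treated as instantaneous (an assumption of the sketch), and `check_schedule` is a hypothetical helper, not a tool used by our facility.

```python
from datetime import datetime

# (start, end, step) taken from the schedule above. Entries that list only
# one time are given the same start and end, which is an assumption.
SCHEDULE = [
    ("06:00", "06:00", "Temperature reduction"),
    ("07:00", "07:30", "Generator pre-flight evaluation"),
    ("07:30", "07:30", "Generator start-up and warm-up"),
    ("08:00", "08:00", "Transfer to generator power"),
    ("08:00", "12:00", "Replace batteries in UPS 1"),
    ("12:00", "12:30", "Power on UPS and artificial load testing"),
    ("13:00", "13:00", "Transfer load back to UPS 1"),
    ("13:30", "17:30", "Replace capacitors in UPS 3"),
    ("17:30", "18:00", "Power on UPS 3 and transfer load"),
    ("18:00", "18:00", "Transfer live load back to grid power"),
    ("18:00", "18:30", "Generator cool-down and post-flight evaluation"),
]

def parse(hhmm):
    """Parse an HH:MM string into a datetime on an arbitrary fixed date."""
    return datetime.strptime(hhmm, "%H:%M")

def check_schedule(schedule):
    """Return (window_length, problems) for a list of (start, end, step)."""
    problems = []
    previous_start = None
    for start, end, step in schedule:
        if parse(end) < parse(start):
            problems.append(f"{step}: ends before it starts")
        if previous_start is not None and parse(start) < previous_start:
            problems.append(f"{step}: starts before the previous step")
        previous_start = parse(start)
    # Window runs from the first step's start to the last step's end.
    window = parse(schedule[-1][1]) - parse(schedule[0][0])
    return window, problems

window, problems = check_schedule(SCHEDULE)
print(window, problems)
```

The computed window matches the announced 06:00 to 18:30 maintenance period, and the steps are listed in a consistent order.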

One to two weeks following this maintenance, our facility will have to repeat this process to validate the condition of the new batteries and replace any questionable jars. This preventive maintenance will follow the same procedure as above but require significantly less time to complete. A notice will be posted to our system status page prior to this future maintenance.

We recognize that you may be sensitive to such power work based on the issue that occurred in October. We’re extremely confident in the testing measures our facility is taking in this case to ensure the freak incident that occurred does not occur again. We’re also happy to schedule a conference call between our customers and the facility management to answer any further questions about this work. If you are interested in that call, please send an email to jesse.proudman (at) blueboxgrp.com and we will send you details.

Update March 5th - 8:24 AM (Thu Mar 05 16:24:04 UTC 2009) - As of 8am Pacific, our on site staff have found out from our facilities provider that this work will be postponed due to a complication with one of their UPS vendors. This work will be rescheduled in the next 2-3 weeks and we will post the updated time frame here when it becomes available.

Update March 9th - 10:48 AM (Mon Mar 09 17:48:06 UTC 2009) - This work has been rescheduled for March 19th. We will use this thread to continue to track this work.

Update March 19th - 8:23 AM (Thu Mar 19 15:23:47 UTC 2009) - Our staff are on site and our facility is proceeding with this work as planned. As of 8:23am, the UPS system has been removed from the loop and we are operating on generator power. All is going according to plan.

Update March 19th - 12:55 PM (Thu Mar 19 19:55:38 UTC 2009) - Our facility has completed the work on the UPS unit powering customer services without a problem. They will be doing some additional work on a secondary UPS unit that powers our networking core this afternoon, and our staff will remain on site.

Update March 19th - 3:53 PM (Thu Mar 19 22:53:58 UTC 2009) - This work has been completed with no problems.

This work can be considered closed and was a complete success.

  • Blue Box Tech Support

Domain Platform Maintenance

Posted by Blue Box Group about 1 year ago...

On the following date and time, our domain platform provider will be doing maintenance on both the registration / renewal servers and the domain management servers. During this time frame, access to those resources will be intermittent.

Date: Saturday, February 28 - Sunday, March 1, 2009
Time: 23:00 - 03:00 UTC

Thank you

  • Blue Box Tech Support

Network Maintenance: February 25, 2009

Posted by Blue Box Group about 1 year ago...

On Wednesday, February 25 starting at 11:00pm PST, we will be performing the first of a series of upgrades to our network core to address some of the issues discovered during the February 12th network event. While no noticeable network outage is expected for any customer services during this time, the work we are doing will directly affect our network core, and therefore an increased risk of outage will exist. Work may continue until 3:00am PST. During this time:

  • Our secondary distribution switch will be upgraded
  • Access-level switches will have one side of their redundant uplinks moved to the new hardware.
  • The BIG-IP unit with the possible hardware fault will be taken offline.

Update Feb. 26, 2:52am: Tonight’s network maintenance is complete. We accomplished all the above tasks, and except for one customer machine that was inadvertently unplugged when moving network gear, there was no impact on customer services.

If you are experiencing any service problems as a result of tonight’s work, please open an urgent support ticket and we’ll take care of the problem ASAP.


vsh17 lock up

Posted by Blue Box Group about 1 year ago...

Virtual Private Server host vsh17.blueboxgrid.com locked up today at around 4:40pm. We are presently in the process of rebooting the box and applying what we believe will be a permanent fix to a rare but known problem with OpenVZ on AMD hardware.

Update at 4:55pm: vsh17 is back up and running and the fix has been applied. All VPSes that were on this host are back up and running again.

(Web only post)


Secondary Name Server (ns2) Offline

Posted by Blue Box Group about 1 year ago...

This outage is of a redundant system that will not make any sites inaccessible.

The facility where we host our secondary name server (ns2.blueboxgrid.com) is experiencing a system wide outage with their routing core that has temporarily made that server inaccessible. That server is part of a redundant pair and the primary server (ns1.blueboxgrid.com) is up and is resolving names just fine. There is no present impact as a result of this outage. This outage began at 01:37:26 am. The facility has staff working on the issue but is unable to provide an ETA for recovery. We will update this post when the issue has been resolved.
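As a rough illustration of why losing one member of a redundant name server pair has no visible impact, here is a minimal Python sketch of a client that tries each server in turn and uses the first one that answers. Real resolvers are more sophisticated than this. The `resolve_with_failover` and `fake_query` functions are hypothetical, the lookup is simulated rather than sent over the network, and the returned address is a documentation-range placeholder, not a real record.

```python
def resolve_with_failover(name, servers, query):
    """Try each name server in order; return (server, answer) from the first that responds.

    query(server, name) is a caller-supplied function that returns an answer
    or raises OSError when the server cannot be reached.
    """
    errors = {}
    for server in servers:
        try:
            return server, query(server, name)
        except OSError as exc:
            errors[server] = exc  # remember the failure and try the next one
    raise RuntimeError(f"no name server answered for {name}: {errors}")

def fake_query(server, name):
    # Stand-in for a real DNS lookup: ns2 is unreachable, ns1 answers.
    if server == "ns2.blueboxgrid.com":
        raise OSError("network unreachable")
    return "192.0.2.10"  # placeholder address from the documentation range

server, answer = resolve_with_failover(
    "example.com",
    ["ns2.blueboxgrid.com", "ns1.blueboxgrid.com"],
    fake_query,
)
print(server, answer)
```

Even when the unreachable server is tried first, the lookup still succeeds through the remaining member of the pair.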

Update 7:48am - The provider’s network at this second facility came back online at 03:22:14 am and as of that time, our name server began responding to requests normally. We are waiting on the write up from that facility as to what happened to their network and will share that information here.

Update 12:50pm - Here is the update as provided by the facility provider:

Our facility experienced network-wide connectivity issues between approximately 1:30 AM and 3:00 AM PST. During this time, customers may have experienced issues connecting to their sites. The event was a result of a legacy configuration on our switched network that interacted badly with configuration changes made by one of our customers. During the event, our core network switches were experiencing an unusually high processor load that inhibited their ability to forward traffic.

Network Maintenance: February 26, 2009

Posted by Blue Box Group about 1 year ago...

We will be continuing work on the network core to address issues discovered as a result of the February 12th outage. This network maintenance will start at 11:00pm on Thursday, February 26 and may continue until 3:00am. While no outage is expected as a result of this work, an increased risk of outage exists for the duration of the maintenance.

During this round of maintenance we will be performing the following:

  • Our other distribution switch will be upgraded (note that it is already acting as ‘secondary’ as a result of the network maintenance on Feb. 25)
  • Access-level switches will have the other side of their redundant uplinks moved to the new hardware.
  • Our two OpenBSD firewalls will be removed from the network core, and the border firewall functions will be moved over to our new core routing gear.

Update Feb 26, 2:57am PST (10:57 UTC): We’re done performing network maintenance for the night. All of the above goals were accomplished. Some customers (notably VPS customers on our shared VPS infrastructure and shared hosting customers) may have noticed around 30 seconds of downtime as gateway addresses were moved from the OpenBSD firewalls to the core routing gear. At this point we’ve addressed the lion’s share of the concerns brought up by February 12th’s network outage. We will be addressing the last of these in the coming weeks; however, none of this affects the core routing infrastructure or network topology, so network maintenance events may not be posted for this.

If any customers are experiencing any problems on the network as a result of tonight’s work, we urge you to open urgent support tickets so we can address them ASAP.


vsh17 Second Lock Up

Posted by Blue Box Group about 1 year ago...

vsh17 appears to have locked up again. It appears the problem which caused a lock up 2 days ago persists. We’re rebooting the machine again now, and once it is up we will begin live migrations of VPSes hosted on it to other VPS hosts on our network.

Update 7:19am Pacific (3:19 pm UTC) - vsh17 is back online and we are beginning the live migration process for affected customers. Customers should not notice any additional downtime as part of the migration. We will update this thread when those migrations are complete.

Update 10:01am (Fri Feb 27 18:01:56 UTC 2009) - All customers from vsh17 have been migrated off onto other hardware. We will consider this issue resolved.


Upstream Network Maintenance

Posted by Blue Box Group about 1 year ago...

On March 1st, two of our upstream providers will be completing network maintenance during different windows. Neither should be service affecting.

The first event will begin at 01:00 PST (Sunday March 1 09:00:00 UTC 2009) and will be complete by 03:00 PST (Sunday March 1 11:00:00 UTC 2009). The provider will be doing a software upgrade on our directly connected router. During this time, we will be shutting down our announcements through this provider to prevent any misconfiguration in their systems as a result of this work from affecting our network. Our other carriers will carry our traffic without problem during this window.

The second event will begin at 22:00 PST (Monday March 2 06:00:00 UTC 2009) and will be complete by 23:59 PST. During this event, one of our carriers will be turning up another provider on one of their routers in their facility. This work is not being done on our directly connected switch, and customers should see no direct impact as a result beyond a brief period of routing reconvergence when the additional carrier is brought online.
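The PST and UTC times quoted for these two windows can be cross-checked with a few lines of Python. This sketch treats PST as a fixed UTC-8 offset, which holds for these dates but deliberately ignores daylight saving time, and `pst_to_utc` is a hypothetical helper written for this example.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset PST (UTC-8). This ignores daylight saving time, which is
# acceptable here only because both windows fall before the switch to PDT.
PST = timezone(timedelta(hours=-8), "PST")

def pst_to_utc(year, month, day, hour, minute=0):
    """Convert a PST wall-clock time to an aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=PST)
    return local.astimezone(timezone.utc)

first_start = pst_to_utc(2009, 3, 1, 1)    # 01:00 PST on March 1
first_end = pst_to_utc(2009, 3, 1, 3)      # 03:00 PST on March 1
second_start = pst_to_utc(2009, 3, 1, 22)  # 22:00 PST on March 1

for label, moment in [("first start", first_start),
                      ("first end", first_end),
                      ("second start", second_start)]:
    print(label, moment.isoformat())
```

Note that the second window starts late on March 1st in Pacific time, which is already March 2nd in UTC.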

Please let us know if you have any questions.

Update Mon Mar 02 07:00:19 UTC 2009 - Both events have been completed with no impact on customer service.

  • Blue Box Group Tech Support