Scaling up in a hurry: A practical example

Posted by Nathan Kaiser on Wed Feb 11 09:32:00 UTC 2009


When I tell people I work in the server hosting business, I explain that, along with Ruby on Rails, we specialize in helping businesses scale their applications: from those which see little traffic and sit comfortably on a shared-hosting server, all the way up to sites which see tens of thousands of unique visitors per hour and require large clusters of powerful machines to meet the demand. One of the questions they frequently ask is, “How do you do that?”

There is no simple answer to this question. I usually start by telling people, “It depends on the customer’s application,” which is true, but not very satisfying. A more precise (if no more satisfying) answer would be to say, “We do it by discovering and alleviating bottlenecks.” Doing this generally requires intimate knowledge of the application in question and the underlying subsystems on which it relies– which can vary a lot from application to application. Having a good team of smart, experienced technical people who are quick on their feet always helps. Sometimes we can help streamline logical processes within an application or adjust configurations so that major hardware upgrades are unnecessary; sometimes scaling up the hardware is the best solution to the problem. Really, it depends.

Still, we’d like to be able to give people a better idea about how the process might be done from a practical perspective. And since we had the opportunity to do this yesterday for one of our customers, we thought it would be a good opportunity to write about how we did it in this case.

(As a side note, I should say that we generally have a bit more time to go about the scaling up process– we keep trend graphs and otherwise monitor our customers’ machines specifically so that we do have a window of warning to deal with scaling issues before they become critical. Occasionally, though, an event can happen which can throw an otherwise moderate-to-low usage web site into a situation where it sees 10 to 20 times its usual traffic load overnight. It is this exact scenario that we bumped into yesterday.)

Some background on the “normal” state of affairs for this website:

  • Moderate site traffic: Generally about 5-6 Mbps of outbound traffic at peak on most days, and less than a thousand unique website visitors in an hour.
  • Server has 2 CPU cores, 2GB of RAM, and 160GB of SCSI disk space in a hardware RAID-1
  • Website is apache+php, including SSL components and is mainly driven by a Wordpress back-end along with a custom PHP engine.
  • The site is primarily an e-commerce site, but it also hosts a good deal of general content outside the e-commerce section.


And here’s what happened today:

Our monitoring system first warns us that the site is starting to time out on web page loads, that database query times are above a slow-query threshold, and that the server load is starting to rise rapidly. Other monitoring checks start timing out. Our technicians start to look into the problem; a few minutes into this, the server effectively becomes unresponsive (it’s not down, just taking minutes to execute simple commands). A quick check of the traffic and trend graphs confirms that the machine is seeing a traffic spike unlike anything it’s seen before. Since the site is effectively down anyway, we initiate a reboot of the server to clear out the processing load and allow our technicians access to make changes to the server configuration.

As soon as the server is up, it becomes apparent that the machine is running itself into the ground by spawning too many apache children, thus using too much memory and thrashing swap on the disks. Whenever swap is being actively used on a server, it is always the performance bottleneck. We estimate the amount of RAM each apache child is using and adjust the apache configuration such that it will never spawn so many children that it runs into swap again. We also turn off “KeepAlive” in the server configuration so that apache children aren’t tied up waiting for idle clients to respond. This stabilizes the server so that it doesn’t become unresponsive, and for a while the site handles the load again. However, page load times are being adversely affected by the traffic spike. We start investigating the site code, check to see whether the site itself is under a denial of service attack, and initiate contact with the customer to keep them informed about what we’re seeing and to see whether they’re aware of what’s going on.
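To make the "estimate the RAM per apache child" step concrete, here is a minimal Python sketch of the arithmetic. The numbers are assumptions for illustration only (the per-child footprint and the memory reserved for the database and operating system were not recorded in this write-up), and the function name is hypothetical.

```python
def max_apache_children(total_ram_mb, reserved_mb, per_child_mb):
    """Largest number of apache children that fits in RAM without using swap.

    total_ram_mb: physical memory on the server
    reserved_mb:  memory set aside for everything else (database, OS, caches)
    per_child_mb: estimated resident memory of one apache child
    """
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0 or per_child_mb <= 0:
        raise ValueError("memory figures must leave room for at least one child")
    # Integer division rounds down, so the cap errs on the safe side.
    return usable_mb // per_child_mb


# Assumed figures: a 2GB server, 768MB reserved, 32MB per child.
limit = max_apache_children(total_ram_mb=2048, reserved_mb=768, per_child_mb=32)
print(limit)
```

The result is the sort of value one would then use as the upper bound on child processes in the apache configuration. Rounding down is deliberate: running a few children short costs far less than touching swap.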

As page load times become steadily worse, we get confirmation from the customer that an event has happened which has triggered the sudden increase in traffic, and with that we know the site is not being DOSed. Turning on server-status pages for apache, we’re able to confirm that apache has spawned its configured maximum number of child processes. At this point, the number of apache child processes is the performance bottleneck– however, we can’t increase this without running the server into swap again, so we start looking into alternate solutions. Site traffic is plateauing at about 8 Mbps outbound– about twice the server’s normal load. Most of the traffic appears to be hits to either very specific graphics, or Wordpress pages.
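The check against the server-status page can also be scripted. The sketch below assumes the machine-readable form of apache's status output, which reports lines such as "BusyWorkers: N" and "IdleWorkers: N". The sample text and the limit of 40 children are made up for illustration, and fetching the page over HTTP is left out.

```python
def parse_server_status(text):
    """Turn 'Key: value' lines from a machine-readable status page into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not 'Key: value' pairs
            fields[key.strip()] = value.strip()
    return fields


def children_saturated(status_text, max_children):
    """True when every configured apache child is busy serving a request."""
    fields = parse_server_status(status_text)
    return int(fields["BusyWorkers"]) >= max_children


# Hypothetical sample output, not captured from the customer's server.
SAMPLE_STATUS = """Total Accesses: 1200
BusyWorkers: 40
IdleWorkers: 0
"""

print(children_saturated(SAMPLE_STATUS, max_children=40))
```

When this reports saturation while memory is already fully committed, the remaining options are the ones explored next: make each request cheaper, or keep requests away from apache altogether.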

We decide to deploy nginx as a proxy layer in front of apache. Nginx is a very fast, lightweight web server which excels at handling a large number of end-user connections. In our opinion, it is also one of the best web proxies available today. Further, it’s event-driven: a handful of worker processes can each juggle many connections at once, so it doesn’t need to spawn a child per client in order to handle a large number of clients. (Technically, apache can run in a threaded mode– which might enable it to return performance numbers that rival nginx’s. However, not all the php modules our customer’s application is using are thread-safe, so that isn’t an option for us.)

Our nginx installation checks to see if the request is a static file on disk (say a style sheet, image, or javascript file). If it is, nginx serves the file out directly. If it is not, nginx then passes the request onto apache. Apache generates the dynamic content and then sends the data back to nginx. Nginx can then spoon feed that content out to users without tying up additional connections within Apache.
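The routing decision described above is simple enough to sketch in a few lines of Python. This is only an illustration of the logic, not the nginx configuration we deployed; the function name and the returned labels are hypothetical.

```python
import os


def route_request(doc_root, url_path):
    """Decide whether a request is served from disk or passed to the backend.

    Returns ("static", file_path) when the URL maps to a regular file under
    doc_root, otherwise ("proxy", url_path) to hand the request to apache.
    """
    # Drop any query string, then map the URL onto the document root.
    relative = url_path.split("?", 1)[0].lstrip("/")
    root = os.path.realpath(doc_root)
    candidate = os.path.realpath(os.path.join(root, relative))

    # Refuse paths that escape the document root (e.g. via "..").
    inside_root = os.path.commonpath([root, candidate]) == root

    if inside_root and os.path.isfile(candidate):
        return ("static", candidate)
    return ("proxy", url_path)
```

For example, a request for a style sheet that exists on disk comes back as "static", while a request for a PHP-generated page falls through to "proxy". The part the sketch cannot show is the "spoon feeding": the proxy accepts the full response from apache quickly and frees that apache child while slow clients are still downloading.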

After deploying nginx, site load times are much improved, but still not as fast as they should be. Site traffic is now plateauing at around 11 Mbps outbound. Further investigation shows that the number of apache children is still being maxed out in many cases, but not all the time. It turns out the CPU is now 100% busy and has become the bottleneck. And, traffic is still ramping up, so page load times are getting steadily longer.

Looking closer at the application code and the hits the site is receiving, we discover that the Wordpress content is now the largest dynamic module being accessed and the processing of each Wordpress page is causing the CPU load to shoot through the roof. We work with the customer to quickly install the Super Cache plugin into Wordpress. This plugin allows for creation of a static cache that scales back the number of times the server needs to create certain pages and hopefully alleviates the load on the CPU. The server is now pushing about 15-20 Mbps outbound, and it’s still climbing at an exponential rate as the lunch rush approaches. Page load times are in the 2-5 second range: Acceptable given the circumstances but not good. (And yes– a common trend with many websites we see is a large spike in traffic around 12:00pm PST as presumably a lot of workers around the country, and apparently especially on the west coast, all go on their lunch breaks and start surfing the web. We kid you not.)

Watching CPU utilization, we notice that while the CPU load has been mostly alleviated, we are still seeing occasional periods of 100% usage. Further, looking at the number of utilized apache children, while on average the usage is somewhat lower, we occasionally still see the spikes of all apache children in use. We posit that although the static cache is helping, the overhead of loading the Apache listeners to read the cache is still enough to be the bottleneck.

So, we alter nginx’s configuration to access the static cache directly where appropriate without loading the wordpress PHP code. This modification involves the implementation of some fancy nginx rewrite rules, as the Super Cache plugin uses a somewhat unusual file path for its cache directories. At this point, our performance issues are mostly resolved– page load times are reduced to sub-second intervals. The lunch rush hits and site traffic jumps to 90+ Mbps outbound (it’s also possible that sudden TV publicity of the event which triggered this whole load spike in general has just occurred and people are going online to read about it. This is a pretty typical traffic pattern for TV publicity. But we don’t have confirmation on exactly what caused the sudden jump to 90+ Mbps: It probably wasn’t just the optimizations we’ve made or the lunch rush.) But we’re not quite at the last bottleneck: This server has a 100 Mbps uplink to the internet and we’re closing in on it fast.
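For readers wondering what the rewrite rules have to decide, here is a rough Python sketch of the lookup. It assumes the commonly documented Super Cache layout (a cached page stored as index.html under wp-content/cache/supercache/, then the host name, then the request path) and the usual bypass conditions for non-GET requests, query strings, and logged-in or commenting visitors. The function name and cookie prefixes are illustrative assumptions, not a copy of our production rules.

```python
import os

# Assumed cookie-name prefixes marking visitors who must get a fresh page.
BYPASS_COOKIE_PREFIXES = ("comment_author_", "wordpress_logged_in", "wp-postpass_")


def supercache_candidate(doc_root, host, method, uri, query_string, cookie_names):
    """Return the cache file path to try for this request, or None to bypass.

    A None result means the request should go to apache and Wordpress as usual.
    """
    # Only plain GET requests without a query string are safe to serve from cache.
    if method != "GET" or query_string:
        return None
    # Visitors with a session or comment cookie see personalized pages.
    if any(name.startswith(BYPASS_COOKIE_PREFIXES) for name in cookie_names):
        return None
    # Never build a path from a URI that tries to climb out of the cache tree.
    if ".." in uri.split("/"):
        return None
    return os.path.join(
        doc_root, "wp-content", "cache", "supercache",
        host, uri.strip("/"), "index.html",
    )
```

If the returned file exists, the front-end server can send it straight from disk; if it is missing or the function returns None, the request is proxied to apache exactly as before. Keeping the fallback path intact is what makes this change low-risk to deploy in the middle of a traffic spike.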

So, we configure a gigabit uplink for the server and move it to this link almost instantly. The server itself saw a peak of around 115 Mbps of outbound traffic (sustained over the 5-minute polling interval), and continues to push 20+ Mbps well into the evening. And all this, on just 2GB of memory and 2 CPU cores.

2000% scaling in 24 hours

Now, we’re not saying that we can wave our hands and magically make every website suddenly handle 20 times its load, and all that without upgrading the hardware. This was in many ways a special case which lent itself to a proxy / caching solution– and this approach probably wouldn’t have been applicable to the vast majority of the websites we host (because these usually already have a proxy / caching layer in them).

However, what we are saying is that we can help businesses scale their websites effectively. And we’re the people to call on if suddenly you’re seeing traffic many times what you ever expected.

I should also mention that in handling today’s load spike crisis for this customer, we didn’t need to pull out some of the more versatile tools in our toolbox. Some of these include:

  • Virtualization: With this, we have a lot more flexibility in what we can do, from quickly and easily cloning an existing server, to doing live-migrations of the server to more powerful hardware, to dynamically scaling the capabilities of any server up or down depending on its needs– all with little to no down time of your application.
  • Load balancing: When you can’t squeeze any more juice out of a single machine, the ultimate solution to the ramping-up problem is to scale horizontally. And running behind our capable F5 hardware load balancers makes adding and removing servers from your application cluster seamless.
  • Content Delivery Network (CDN): Blue Box Group is in final beta testing of a CDN partnership with Internap in order to provide CDN services to our customers. If a significant portion of your website content can be cached (as is usually the case with most web applications we encounter), then it makes a lot of sense to off-load the work of serving it to a large array of capable caching servers distributed all over the globe. In the scenario above, being able to use our CDN to cache both images and Wordpress content from the get-go should have prevented the site from ever really seeing an outage.

Anyway, there you have it: An example of one customer that needed to scale up quickly. And didn’t even have to contact us because we were already on the problem. Really, this is the sort of thing we get to do all the time. Which is why I love working at Blue Box Group so much.

–Stephen Balukoff
Blue Box Group
(Additional content contributed by Jesse Proudman)