Wednesday, 25 April 2012

Stress-testing a web service: The basics

Takeaway: Nick Hardiman describes his preparations to stress-test his simple Drupal-based web service built on an Amazon EC2 machine.

I built a Drupal installation on Amazon EC2, and now I want to find out how scalable my new Drupal service is. To do this, I am going to run my first set of load tests (AKA stress tests), to push my service to the limit.
When more requests arrive at my website, how does it behave? Does it handle the increased load without affecting the performance? Does it remain reliable?
Right now, I don’t know. I cannot offer a fully operational cloud service without knowing how it scales, so I need to kick off a few tests.
I use a few free tools to torture the homepage of my new customer service, rather than using a commercial service like Loadstorm. I put my service under increasingly heavy loads to see what happens and measure the results.
Later, I will need to fix the problems I find.

This service is born to lose

One small Amazon EC2 machine is weak. A small physical machine may handle a reasonable web workload, but a small EC2 machine won’t. These tests illustrate why you need to be able to scale up (use a bigger machine) and scale out (use many machines).
I know my service will perform poorly. It’s so poor it can’t even afford to pay attention. This is how I ensure awful service.
  • I use a small EC2 machine type. Amazon VMs come in different sizes, from micro to massive. A small VM has few resources, which makes it a tight fit for a web service.
  • I only use one EC2 machine. I have two identical VMs to share the work, but I run the tests on one machine only.
  • MySQL has not been tuned for Drupal. I’ve made no changes to buffer sizes, hunted for no slow queries, and run no engine checks.
  • I don’t use a cache. Caching with an application like Memcached or Varnish is a popular way of increasing speed. Just turning on the Drupal cache makes responses many times faster.

Figure A


My load testing toolkit

I could use a third-party service with a web interface to run more comprehensive tests, such as the JMeter cloud or SOASTA CloudTest, but I don’t really need anything that clever for now.
I generate and monitor the extra load using a few applications. These are all command line tools, so they are not very intuitive.
  • top (top process statistics). This is a process monitor that shows me what is happening to the system.
  • vmstat (virtual memory statistics). This is like top, but instead of listing processes it reports system-wide memory, swap, I/O and CPU activity as rows of numbers.
  • ab (Apache HTTP server benchmarking tool). This is a web site load generator. I use ab to give my service an increasingly hard time.
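An "increasingly hard time" can be planned before anything is run. The sketch below only builds the ab command lines, lightest load first, and does not execute them. The target URL, the concurrency levels and the `requests_per_client` value are assumptions chosen for illustration; `-n` (total requests) and `-c` (concurrent clients) are standard ab options.

```python
import shlex


def ab_commands(url, concurrency_levels, requests_per_client=10):
    """Return one ab command line per concurrency level, lightest first."""
    commands = []
    for clients in sorted(concurrency_levels):
        # ab refuses to run if -c is larger than -n, so scale -n with -c.
        total_requests = clients * requests_per_client
        args = ["ab", "-n", str(total_requests), "-c", str(clients), url]
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


# Hypothetical target. ab expects a path in the URL, so keep the trailing slash.
for line in ab_commands("http://example.com/", [1, 5, 10, 20]):
    print(line)
```

Printing the commands first makes it easy to review the ramp, and to paste one line at a time into the CLI while top runs in another window.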

top (top process statistics)

The top command is one of the ten most useful Linux commands. Top displays information about processes and what they are doing to the system.
I want to get a baseline by using top while the system is idle, before I run the load tests.

root@ip-1-2-3-4:~# top

top - 17:12:40 up 15 days,  3:33,  2 users,  load average: 0.46, 0.69, 0.34
Tasks:  71 total,   1 running,  70 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1737564k total,  1117352k used,   620212k free,   168248k buffers
Swap:  3020212k total,        0k used,  3020212k free,   518672k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      20   0  8356  800  672 S  0.0  0.0   0:12.68 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:01.08 ksoftirqd/0

Understanding the numbers is difficult. The top command fills up my command line interface with a lot of data packed into two halves.
  • The top half - about six rows - is a dense display of information about the state of the system, such as how busy it’s been, memory used and uptime.
  • The bottom half is a list of the top processes. They are ordered by how much CPU they use, with the biggest CPU hog first.
The procedure for using top is pretty straightforward.
  1. Open a CLI.
  2. Run the top command. A display like the one above appears.
  3. Watch the numbers. Every few seconds a few of the numbers change.
  4. When you’ve had enough, type the letter q to quit. The command prompt appears.
  5. Close the CLI.
There is a lot of information here: it is compressed to pack a lot into a small space. The more you use top the more numbers you can understand. It’s a bit like staring at a stereogram until a 3D picture appears.
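For anyone who would rather let a script do the staring, here is a minimal sketch that pulls a few numbers out of the top half of the display. It assumes the header layout shown in this article (newer versions of top format the CPU and memory rows differently), and the function name and dictionary keys are my own invention.

```python
import re

# The header rows copied from the top display in this article.
SAMPLE = """\
top - 17:12:40 up 15 days,  3:33,  2 users,  load average: 0.46, 0.69, 0.34
Tasks:  71 total,   1 running,  70 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1737564k total,  1117352k used,   620212k free,   168248k buffers
Swap:  3020212k total,        0k used,  3020212k free,   518672k cached
"""


def parse_top_header(text):
    """Extract load averages, CPU percentages and swap use from top's header."""
    summary = {}
    load = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", text)
    if load:
        one, five, fifteen = (float(value) for value in load.groups())
        summary.update(load_1m=one, load_5m=five, load_15m=fifteen)
    cpu = re.search(r"^Cpu\(s\):(.*)$", text, re.MULTILINE)
    if cpu:
        # Each field looks like "100.0%id": a percentage, then a two-letter label.
        for value, label in re.findall(r"([\d.]+)%(\w+)", cpu.group(1)):
            summary["cpu_" + label] = float(value)
    swap = re.search(r"^Swap:\s*(\d+)k total,\s*(\d+)k used", text, re.MULTILINE)
    if swap:
        summary["swap_used_kb"] = int(swap.group(2))
    return summary


print(parse_top_header(SAMPLE))
```

To feed it live data instead of the sample, top can be run once in batch mode with `top -b -n 1`, which prints the display as plain text and exits.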

My first measurements

Even before I run my first load test, I can make some useful observations about my EC2 machine.
In the example above I see an idle system. The CPU is 100% idle and no swap space is being used. It’s pretty obvious to a system administrator that this EC2 machine is doing nothing.
The load average is 0.46. This is the first of three figures, which average the load over the last 1, 5 and 15 minutes. The load average is an estimate of how much the box is doing compared to what it can handle: 1 is roughly one CPU working flat out, but keeping up with its workload.
Strangely, this idle box is putting in about half a CPU’s worth of effort. Shouldn’t a box doing nothing have a load average of zero? Part of the answer is that the load average looks back over time, so it lags behind the instantaneous CPU figures. On EC2, another possible factor is the hypervisor taking my VM’s capacity and giving it to other busier (and maybe higher-paying) customers, although the st (steal) column reads 0.0% in this snapshot. It’s the same theory that an airline uses when it over-sells seats on a plane, relying on some passengers to not show up.
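The rule of thumb that a load average of 1 means one CPU working flat out can be turned into a small calculation. This is only a sketch: the function names are hypothetical, and the single CPU is an assumption about a small EC2 machine rather than something the top display above confirms.

```python
def load_per_cpu(load_average, cpu_count):
    """Scale a load average by the number of CPUs available to do the work."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def is_saturated(load_average, cpu_count):
    """True when there is at least one CPU's worth of work per CPU."""
    return load_per_cpu(load_average, cpu_count) >= 1.0


# 0.46 is the one-minute figure from the idle box; one CPU is assumed.
print(load_per_cpu(0.46, 1), is_saturated(0.46, 1))
```

On a live Linux box, `os.getloadavg()` and `os.cpu_count()` from Python's standard library can supply the two inputs.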