One of the things I enjoy doing most is performance testing and tuning web systems, especially high-performance transactional systems. I’m interested in all aspects of making these systems go fast and scale high. J2EE, .NET, Rails - I don’t care. From performance specification and capacity planning to design, development, testing and the inevitable iterative system tuning and benchmarking process. I think it’s all a kick in the pants.
I learned much of my knowledge when I was the architect for a large J2EE product for the financial services industry. The product supported thousands of concurrent users in a call center, operated on hundreds of millions of records and targeted sub-second response times. Oh yeah! If I was Tim Allen at this point I might mutter, “Grrrr...grunt grunt grunt.....yeeaahhh (reaching down and grabbing my belt buckle), I know a little something about performance!”. But enough bad visual imagery.
These days I’m working with teams trying to get their systems to go fast. Or fast enough. In fact, tomorrow morning I’m off to a client to do some Saturday performance testing of their new system a few weeks before it’s scheduled to go live. Like most organizations I work with, they’ve been over-focused on features and under-focused on performance, and the initial rounds of performance testing bear this out. Simply said, there’s a big performance gap between where they are now and where they need to be to handle their peak traffic. And their peak traffic is expected to occur within a week or two after they go live. Super.
I could go on and on about what should have been done and when, but it’s a bit late to start that now. Instead, I’d rather talk about how you can avoid these problems in the first place. So, how does an agile engineer deal with performance qualities for a system?
It starts with clear specifications of performance qualities before design begins. There, I said it - “specification before design begins”. Now, some of you agile purists may say, “that’s not very agile, that’s big up front design!”. But I say, “Bullshit. You’re getting ready to spend hundreds of thousands (or millions) of dollars of someone else’s money - take a few minutes and document your stakeholders’ expectations of performance before you start writing user stories and cranking out code. You’ll be thankful you did.”
While the best agile code bases can be refactored later to meet performance qualities, it’s much more expensive (trust me, I’ve got first hand experience here, even with lots of automated tests). Plus, sooner or later you’ll have to figure out how fast the system needs to be anyway, so might as well ask up front before design and development begins.
Contrary to popular belief, specifying a system’s performance qualities doesn’t have to be heavyweight; in fact, it can be done in a lightweight “agile” manner. The point of this post is to show you, an agile engineer, how to do this.
The 3 Key Performance Qualities
One could argue that there are lots of meaningful performance qualities for systems, but if I had to pick the most important ones, I’d say they are:
Availability
Response Time
Peak Throughput
Availability refers to when the system is fully operational and accessible by your users. Response Time is how long a user waits between when they click a link or button and when they get a correct result. Throughput refers to how many transactions a system can process in a given period of time. Sometimes this can be thought of as concurrent users, but as I’ve learned it’s not the number of users, it’s their pace of work that matters most. I prefer to measure throughput using transactions.
I’d argue that Scalability, Reliability and Recoverability are important qualities too, but if I had to start somewhere, I’d start with these three. Given that clear specifications are the first step, here’s how I would define these performance qualities using Planguage.
Let’s start with the simplest first:
Peak Throughput
Scale: Transactions per second (TPS) with 99% of Response Times below constraint values
Meter: Mercury LoadRunner test script running Standard Load scenario
There are a few things to understand about this scale of measure:
The meter assumes you’ve defined a Standard Load scenario for your system, meaning a realistic mix of transactions that your users perform. How to define a Standard Load is beyond this post, but for existing systems, looking through your current web traffic to examine user traffic patterns can often yield clues about how to recreate your standard load scenario. In short, it should mimic the real world way in which your system is used (including type of transaction mix and pace of work).
Transactions per second (TPS) implies that you develop an application-specific notion of a “transaction”. This could be performing a search, retrieving a report, making a payment, pulling up a customer summary screen, or simply an application page view. You may have lighter and heavier transactions, but the mix of transactions in your standard load scenario will take this into consideration.
99% of Response Times below constraint values implies you’ve defined target and constraint response times (such as a target of < 1 second and constraint of > 3 seconds) and that you have the ability to determine the TPS at the 99th percentile of response times. Depending on your testing tool, this can be difficult and you may be able to relax this to 95% or 90%, but resist the urge to use “average” or “median” response time. These simply mean that roughly half the people got some level of performance while the other half did not. This is not a good measure.
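To see why the average and median can hide a problem, here is a minimal Python sketch. The response times are made-up sample data (95 fast requests and 5 slow ones), and the nearest-rank percentile helper is just one common way to compute a percentile, not necessarily what your load testing tool does.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds: 95 fast requests, 5 very slow ones.
response_times = [0.4] * 95 + [6.0] * 5

mean_rt = statistics.mean(response_times)      # looks healthy
median_rt = statistics.median(response_times)  # looks healthy
p99_rt = percentile(response_times, 99)        # exposes the slow tail

print(f"mean={mean_rt:.2f}s median={median_rt:.2f}s p99={p99_rt:.2f}s")
```

With this sample data the mean and median both sit comfortably under a 1 second target, while the 99th percentile is well past a 3 second constraint. Five users out of every hundred had a miserable experience and the "average" never told you.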
One other thing, note it’s called “Peak Throughput”. You need to design systems to handle worst case (peak) loads, not average loads. One simple way to figure this out is to find out the peak hour of the peak day in a year and use this to determine TPS. For example, if you had 100,000 page views in the peak hour of last year, and you consider page views as your transactions, then your TPS would be:
100,000 per hour = 1,666.67 per minute = 27.78 TPS
Of course if you’re getting 20% more traffic this year, you’d want to account for that:
Target [2008]: 33.33 TPS <- 20% increase over 2007 peak
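If you'd rather not do this arithmetic by hand, the same calculation can be sketched in a few lines of Python. The function name and the growth_rate parameter are my own inventions for illustration; the numbers are the ones from the example above.

```python
SECONDS_PER_HOUR = 3600


def peak_tps(peak_hour_transactions, growth_rate=0.0):
    """Convert a peak-hour transaction count into transactions per second,
    optionally scaled by an expected growth rate (0.20 means 20% more traffic)."""
    return peak_hour_transactions * (1 + growth_rate) / SECONDS_PER_HOUR


baseline_2007 = peak_tps(100_000)                  # last year's peak hour
target_2008 = peak_tps(100_000, growth_rate=0.20)  # with 20% growth

print(f"2007 peak:   {baseline_2007:.2f} TPS")
print(f"2008 target: {target_2008:.2f} TPS")
```

Note that this spreads the peak hour evenly across its 3,600 seconds. If your traffic is bursty within that hour, the true per-second peak will be higher, so treat the result as a floor rather than a ceiling.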
If our target response time was < 1 second and constraint (failure) was > 3 seconds, this means that our peak throughput target (or constraint...depending) would be 33.33 TPS with 99% of response times < 3 seconds. The reason we don’t specify 100% is because of things like garbage collection and other anomalies that might happen. This means we’re guaranteeing that 99 out of 100 users will have a response time of < 3 seconds. Depending on your environment, this may or may not be appropriate, but you get the general idea and can adjust as necessary.
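Once the numbers are written down, checking a load test run against them is mechanical. Here is a rough sketch, assuming your testing tool can hand you the measured TPS and a list of individual response times. The function and parameter names are hypothetical, and the defaults are simply the values from this example.

```python
def meets_peak_throughput(measured_tps, response_times,
                          target_tps=33.33, constraint_s=3.0, required_pct=99):
    """Return True when the run sustained at least target_tps and at least
    required_pct percent of response times were below constraint_s seconds."""
    if not response_times:
        return False
    within = sum(1 for t in response_times if t < constraint_s)
    # Integer comparison avoids floating point surprises at the boundary.
    enough_fast = within * 100 >= required_pct * len(response_times)
    return measured_tps >= target_tps and enough_fast


# Hypothetical runs: response times in seconds.
good_run = [0.5] * 99 + [4.0]       # 99% under 3 seconds
slow_tail = [0.5] * 98 + [4.0] * 2  # only 98% under 3 seconds

print(meets_peak_throughput(35.0, good_run))   # throughput and tail both fine
print(meets_peak_throughput(35.0, slow_tail))  # tail too slow
print(meets_peak_throughput(30.0, good_run))   # throughput too low
```

Only the first run passes. The other two each fail one half of the specification, which is the point of stating throughput and response time together in a single scale.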
OK, this post is getting long, I’ll stop here and do Part 2 in a subsequent post and cover Response Time and Availability.
Hopefully you’ve already started to see that we’re able to create a very clear, quantified specification for Throughput with a few lines of text. This lightweight-but-sufficient specification will help us design and build high performance systems and yet be agile in our approach to specification. This, in a nutshell, is the essence of agile engineering.
What do you think?
