The 3 Key Performance Qualities for all web systems (Part 1)

Posted by: Ryan Shriver on 11/07/2008
One of the things I enjoy doing most is performance testing and tuning web systems, especially high-performance transactional systems. I’m interested in all aspects of making these systems go fast and scale high. J2EE, .NET, Rails - I don’t care. From performance specification and capacity planning to design, development, testing and the inevitable iterative system tuning and benchmarking process. I think it’s all a kick in the pants.

I learned much of my knowledge when I was the architect for a large J2EE product for the financial services industry. The product supported thousands of concurrent users in a call center, operated on hundreds of millions of records and targeted sub-second response times. Oh yeah! If I was Tim Allen at this point I might mutter, “Grrrr...grunt grunt grunt.....yeeaahhh (reaching down and grabbing my belt buckle), I know a little something about performance!”. But enough bad visual imagery.

These days I’m working with teams trying to get their systems to go fast. Or fast enough. In fact, tomorrow morning I’m off to a client to do some Saturday performance testing of their new system a few weeks before it’s scheduled to go live. Like most organizations I work with, they’ve been over-focused on features and under-focused on performance, and the initial rounds of performance testing bear this out. Simply said, there’s a big performance gap between where they are now and where they need to be to handle their peak traffic. And their peak traffic is expected to occur within a week or two after they go live. Super.

I could go on and on about what should have been done and when, but it’s a bit late to start that now. Instead, I’d rather talk about how you can avoid these problems in the first place. So, how does an agile engineer deal with performance qualities for a system?

It starts with clear specifications of performance qualities before design begins. There, I said it - “specification before design begins”. Now, some of you agile purists may say, “that’s not very agile, that’s big up front design!”. But I say, “Bullshit. You’re getting ready to spend hundreds of thousands (or millions) of dollars of someone else’s money - take a few minutes and document your stakeholders’ expectations of performance before you start writing user stories and cranking out code. You’ll be thankful you did.”

While the best agile code bases can be refactored later to meet performance qualities, it’s much more expensive (trust me, I’ve got firsthand experience here, even with lots of automated tests). Plus, sooner or later you’ll have to figure out how fast the system needs to be anyway, so you might as well ask up front before design and development begin.

Contrary to popular belief, specifying a system’s performance qualities doesn’t have to be heavyweight. In fact, it can be done in a lightweight “agile” manner. The point of this post is to show you, an agile engineer, how to do this.

The 3 Key Performance Qualities

One could argue that there are lots of meaningful performance qualities for systems, but if I had to pick the most important ones, I’d say they are:

Availability
Response Time
Peak Throughput

Availability refers to when the system is fully operational and accessible by your users. Response Time is how long a user waits between when they click a link or button and when they get a correct result. Throughput refers to how many transactions a system can process in a given period of time. Sometimes this can be thought of as concurrent users, but as I’ve learned it’s not the number of users, it’s their pace of work that matters most. I prefer to measure throughput using transactions.

I’d argue that Scalability, Reliability and Recoverability are important qualities too, but if I had to start somewhere, I’d start with these three. Given that clear specifications are the first step, here’s how I would define these performance qualities using Planguage.

Let’s start with the simplest first:

Peak Throughput
Scale: Transactions per second (TPS) with 99% of Response Times below constraint values
Meter: Mercury LoadRunner test script running Standard Load scenario

There are a few things to understand about this scale of measure:

The meter assumes you’ve defined a Standard Load scenario for your system, meaning a realistic mix of transactions that your users perform. How to define a Standard Load is beyond this post, but for existing systems, looking through your current web traffic to examine user traffic patterns can often yield clues about how to recreate your standard load scenario. In short, it should mimic the real world way in which your system is used (including type of transaction mix and pace of work).
Transactions per second (TPS) implies that you develop an application-specific notion of a “transaction”. This could be performing a search, retrieving a report, making a payment, pulling up a customer summary screen, or simply an application page view. You may have lighter and heavier transactions, but the mix of transactions in your standard load scenario will take this into consideration.
99% of Response Times below constraint values implies you’ve defined target and constraint response times (such as a target of < 1 second and constraint of > 3 seconds) and that you have the ability to determine the TPS at the 99th percentile of response times. Depending on your testing tool, this can be difficult and you may be able to relax this to 95% or 90%, but resist the urge to use “average” or “median” response time. A median only tells you that roughly half the requests were at least that fast, and an average can hide a long tail of slow responses. Neither is a good measure.
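
To make the point about averages concrete, here is a minimal Python sketch. The sample data is invented for illustration, and the nearest-rank `percentile` helper is my own assumption about how to compute the figure; your load testing tool may use a different percentile method.

```python
import math
from statistics import mean, median


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds: 90 fast requests, 10 very slow ones.
response_times = [0.4] * 90 + [6.0] * 10

avg = mean(response_times)            # looks fine against a 1 second target
med = median(response_times)          # looks even better
p99 = percentile(response_times, 99)  # exposes the slow tail

print(f"average: {avg:.2f}s, median: {med:.2f}s, 99th percentile: {p99:.2f}s")
```

With this made-up data the average and median both come in under a 1 second target, while the 99th percentile lands well past a 3 second constraint. That gap is the reason to specify a percentile.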

One other thing: note it’s called “Peak Throughput”. You need to design systems to handle worst case (peak) loads, not average loads. One simple way to figure this out is to find the peak hour of the peak day in a year and use this to determine TPS. For example, if you had 100,000 page views in the peak hour of last year, and you consider page views as your transactions, then your TPS would be:

100,000 per hour = 1,666.67 per minute = 27.78 TPS

Of course if you’re getting 20% more traffic this year, you’d want to account for that:

Target [2008]: 33.33 TPS <- 20% increase over 2007 peak
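
If you want to redo this arithmetic with your own numbers, a small Python sketch follows. The function name `peak_tps` and its parameters are hypothetical, and it assumes transactions are spread evenly across the peak hour, which real traffic rarely is.

```python
def peak_tps(transactions_in_peak_hour, growth_rate=0.0):
    """Convert a peak-hour transaction count into transactions per second,
    optionally scaled by expected year-over-year growth (0.20 means 20%)."""
    per_second = transactions_in_peak_hour / 3600  # 3600 seconds in an hour
    return per_second * (1 + growth_rate)


baseline_2007 = peak_tps(100_000)
target_2008 = peak_tps(100_000, growth_rate=0.20)

print(f"2007 peak: {baseline_2007:.2f} TPS")  # 27.78
print(f"2008 target: {target_2008:.2f} TPS")  # 33.33
```

Because traffic inside the peak hour is usually bursty, you may want to treat the result as a floor rather than a ceiling.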

If our target response time was < 1 second and constraint (failure) was > 3 seconds, this means that our peak throughput target (or constraint...depending) would be 33.33 TPS with 99% of response times < 3 seconds. The reason we don’t specify 100% is because of things like garbage collection and other anomalies that might happen. This means we’re guaranteeing that 99 out of 100 users will have a response time of < 3 seconds. Depending on your environment, this may or may not be appropriate, but you get the general idea and can adjust as necessary.
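
As a rough illustration of how a test run could be checked against this kind of specification, here is a Python sketch. The `ThroughputSpec` class, the `meets_spec` function, and the sample runs are all hypothetical; this is not how LoadRunner or any other tool reports results, just one way to express the pass/fail rule described above.

```python
from dataclasses import dataclass


@dataclass
class ThroughputSpec:
    target_tps: float          # required transactions per second
    constraint_seconds: float  # response time that counts as failure
    percentile: int = 99       # percent of responses that must beat the constraint


def meets_spec(spec, response_times, test_duration_seconds):
    """Return True if the run sustained the target TPS while the required
    percentage of response times stayed below the constraint."""
    total = len(response_times)
    measured_tps = total / test_duration_seconds
    within = sum(1 for t in response_times if t < spec.constraint_seconds)
    # Integer comparison avoids floating point rounding on the percentage.
    enough_fast = within * 100 >= spec.percentile * total
    return measured_tps >= spec.target_tps and enough_fast


spec = ThroughputSpec(target_tps=33.33, constraint_seconds=3.0)

# Hypothetical 60 second runs of 2,000 transactions each (about 33.33 TPS).
good_run = [0.8] * 1985 + [4.0] * 15  # 99.25% under 3 seconds
bad_run = [0.8] * 1960 + [4.0] * 40   # 98% under 3 seconds

print(meets_spec(spec, good_run, 60))  # True
print(meets_spec(spec, bad_run, 60))   # False
```

Both runs hit the same throughput, but only the first keeps 99% of responses under the constraint, so only the first meets the specification.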

OK, this post is getting long, I’ll stop here and do Part 2 in a subsequent post and cover Response Time and Availability.

Hopefully you’ve already started to see that we’re able to create a very clear, quantified specification for Throughput with a few lines of text. This lightweight-but-sufficient specification will help us design and build high performance systems and yet be agile in our approach to specification. This, in a nutshell, is the essence of agile engineering.

What do you think?
About Ryan Shriver

Ryan Shriver is a Managing Consultant with Dominion Digital, a Virginia-based Business & Technology Consulting firm where he's a leader in their Agile practice (dominiondigital.com/agile). He helps organizations and teams transition to Agile ways of thinking about solving problems, ranging from new product lines to operational performance improvements. Ryan's solutions typically use some combination of people, process and technology to deliver measurable results.

With a deep background in software architecture and enterprise Java, Ryan understands the challenges and issues facing development teams to deliver predictable results. His approach to getting senior leaders to define measurable objectives and priorities for their organizations, projects and development teams helps bring focus to the highest priority initiatives. Using agile methods like Scrum, Ryan helps teams iteratively deliver value quickly to the business...often in a matter of weeks.

Ryan's experiences with diverse companies and teams are the basis for his presentations on Agile subjects.