Sunday, July 23, 2006

Un-writing code

Recently I got a chance to clean up a real mess in our system-monitoring code. Our application was starting to run very slowly when hundreds of users were connected to a certain service. This was puzzling, as the service was intended to handle thousands of users simultaneously -- why was it bogged down with such a small load?

This had been a nagging problem for several months, and many developers had gone into the system to try to fix it. Some improvements had been made, but we were nowhere near the needed level of performance. The general consensus was that we needed a bigger server -- throw hardware at the problem, as the software was just fine.

I spent a few days reading the code, then talking with developers. After some simple code changes, the system was easily able to handle tens of thousands of users, without having to upgrade the hardware.

What was the problem, and why hadn't it been discovered earlier?

Here's what I found -- you will easily see how the problem developed ...

The first developer to work on this system wrote some prototype code that efficiently handled twelve users. He had designed a simple linear array of user records; when a certain operation needed to scan through the list of users, that table was built, sorted (using a bubble sort), and then traversed to produce the report.
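(For the curious, here's a minimal sketch of the shape that prototype took -- in Python, with invented names, since the real code was neither Python nor exactly this: a fixed-size table, filled in, bubble-sorted, then traversed for the report.)

MAX_USERS = 12  # the prototype's hard-wired capacity

def bubble_sort(records, key):
    # Classic bubble sort: keep swapping adjacent out-of-order entries.
    n = len(records)
    for i in range(n):
        for j in range(n - 1 - i):
            if key(records[j]) > key(records[j + 1]):
                records[j], records[j + 1] = records[j + 1], records[j]

def generate_report(connected_users):
    table = list(connected_users)[:MAX_USERS]        # build the fixed-length table
    bubble_sort(table, key=lambda rec: rec["name"])  # fine for twelve entries
    return [(rec["name"], rec["last_seen"]) for rec in table]  # traverse for the report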

A few weeks later, a different developer came in, recognized that the code could only handle twelve users, and rewrote it to use an array of arbitrary size. In the rewrite, the array was created each time the report needed to be made: it would be allocated at whatever length was necessary, its contents filled in, the array sorted, and the report generated. Note that the bubble-sort method was retained.

After that, a developer rewrote the code yet again ... this developer apparently felt that fixed-length arrays were evil, so a doubly-linked list was used. Bubble sort was retained, but the comparison function was rewritten so that more fields of the user record were examined.

After this rewrite, the code appeared to scale to any number of users -- the catch, as you can guess, was that the bubble sort made performance lag noticeably. Bubble sort does on the order of n² comparisons for n records, so a report covering a few hundred users already meant tens of thousands of calls to that beefed-up comparison function, and every additional user made it worse.

This lag prompted the next set of changes -- somebody noticed that the system was taking too long to produce the report. The front-end web page that displayed the report wasn't getting results back fast enough from the back-end code we're discussing here, so it was timing out and displaying an empty report.

The next fix was to lengthen the timeout: wait five seconds, instead of one second, for this particular report to be generated.

And this is where I stepped in, to fix the whole mess.

First, I noticed that the web front-end applied its own sort method to the result set, so there was no need to do back-end sorting -- the code which generated the report could just traverse the user-list, and pass an unsorted array of user records to the web front end. Problem solved.
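In other words, the whole back-end report generator collapsed to something like this (again a sketch with invented names, not the actual code):

def generate_report(user_list):
    # No back-end sorting: the web front end sorts the result set itself,
    # so we just walk the live user list and hand the records over as-is.
    return [(user["name"], user["last_seen"]) for user in user_list]

One linear pass, no comparison function, and no more waiting on the sort.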

Well, almost solved. I went back to each developer who had touched the code, and gave them a thorough description of how poorly the entire project had gone -- and reinforced the need to really understand the problem before writing code.

And thus I was able to save the project, just by removing a whole bunch of unnecessary code.

Thursday, June 01, 2006

MSProject is Evil

OK, here's my Microsoft Project rant. But before I get into the profanity, let's look at the problem that MSProject tries to solve ...

Software scheduling is hard. It is arguably harder than software development itself. How can I say that? Well, if software development were a fully solved problem, it would be predictable, quantifiable, and most importantly, schedulable. Software development, especially the innovative software engineering that drives most high-tech companies today, is inherently unpredictable. You don't know how long or how difficult the development process will be until you are finished, and only then can you look backward and see how long it took.

One way to do predictable software development is to exhaustively analyze the problem before the development process begins. A team can study, debate, and document the project before the first line of code is written. This team identifies all of the hazard areas, proposes ways to address those hazards, and writes a full specification of the project and the development process.

How long will this analysis phase take? You don't know. So, we're back to the original problem: until you know the difficulty of the project, you won't know how long the project will take. How long will it take to understand the difficulty? Nobody can predict that either.

Enter Microsoft Project.

MSProject provides a framework for breaking large problems down into smaller ones. Each task can be assigned an expected duration, a start date, and one or more "resources" (i.e., developers).

Once an MSProject schedule is created, you have a picture of the future -- you know how long the project will take, who is involved in the project, and what everyone's role will be for the remainder of the project.

The problem is that this schedule is completely false, because it is based upon faulty assumptions. For example, the schedule shows that Bill will be working on the "Faucet" class from August 7th to August 10th. It is now June 1st. If everything works correctly and everyone does everything according to the schedule, then indeed Bill will start working on "Faucet" on August 7th.

But Bill will never get to do that. Guaranteed. Bill will have to fix a bug that comes up on August 6th, or Sally won't finish the "Handle" class which is a predecessor to "Faucet", or Jim will decide, in July, that a new set of requirements needs to be fulfilled and the entire schedule has to be rewritten.

The fundamental problem with MSProject is false accuracy. MSProject allows you to believe that the Faucet class will take four days. What is that based upon? Guesswork. Hunches. Not science. The net effect is that MSProject attempts to apply metrics to what is inherently un-quantifiable.

What is the alternative to MSProject? Agile development, plus a healthy dose of maturity and restraint, at least for the sort of cutting-edge projects I've worked on. Life at the cutting edge means unpredictability. Don't try to squeeze open-ended projects into a fixed schedule, unless you've allowed for the unpredictability via wide error bands.

MSProject can be effectively applied to "cookie-cutter" projects, where the development effort involves small modifications to existing code, or replication of work previously done. In those cases, you can quantify the effort needed to do the work, and you can do it with some accuracy.

MSProject is essentially a way of automating Gantt charts, and Gantt charts are a staple of the construction industry. Construction relies on practices, materials, and methods that have been refined over hundreds of years, and its projects typically replicate known work rather than break new ground.

So leave MSProject for those environments, and don't try to apply precision and accuracy to unquantifiable activities.

End of rant.

Sunday, April 09, 2006

Brooks' Law and debugging

Fred Brooks' book "The Mythical Man-Month" has many gems, the brightest of which is the aphorism "Adding manpower to a late software project makes it later."

And, there is what Eric S. Raymond calls "Linus' Law": "Given enough eyeballs, all bugs are shallow."

Is it possible to weave these two principles into a method to expedite software development? Sure -- employ a mob of debuggers. Debugging (and, to an equal extent, QA testing) benefits from scale. Once primary development of a software project is complete, it is possible and profitable to have a large number of folks banging away at the application and reporting bugs. You can also have a squadron of engineers peering through the code looking for the flaws that lie behind the bugs. And these folks can write up their discoveries for review by the original developers.

In a future column, I'll talk about the debugging process in detail, but the above process does work. Often, the squadron examining the code can be junior developers who are learning the structure of the code. Or you can have your seasoned geniuses participating: either way (or in some mixture), you're subjecting the code to a peer review that may not have been possible or convenient during the original development cycle.

If you're Microsoft, just call it a "beta test", and give your product to tens of thousands of motivated users and developers. Or, call it a "technology preview". Just don't call it an "eternal beta".

Saturday, March 18, 2006

Crystal Clarity

In most American high-end department stores, there is a housewares department with a section that sells fine china and crystal. The display of crystal drinkware, vases, and other glassware is often the most radiant location in the store. The exquisitely-cut goblets and candlesticks are carefully arranged under high-intensity halogen spotlights, with a matte black backdrop providing a dramatic contrast.

The image is of precise, almost mathematical purity. The eye is dazzled by the reflected and refracted light. You want to transplant these sparkling objects to your own house, and let their light shine in your everyday surroundings.

I feel the same way about code: textbooks show algorithms implemented in a clean, clear fashion, expressing a purity of thought that I love to transport into my commercial and hobby programs. Further, I want to clean, polish, and refine my own code so that it expresses the same clarity as the textbook examples.

That's why I've enjoyed applying Martin Fowler's techniques from "Refactoring" to my own code. Intentionally or not, he shows how code can be refined, buffed, and polished until it has the same simplicity and brilliance as a vase by Waterford, Swarovski, or Steuben.
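To give one tiny, made-up example of the kind of polishing I mean (the names and the little billing domain here are invented purely for illustration), here is Fowler's "Extract Method" in Python -- the calculation gets a name of its own, and the remaining function reads like a statement of intent:

# Before: one function does both the arithmetic and the printing.
def print_invoice(order):
    total = 0
    for item in order.items:
        total += item.price * item.quantity
    print("Invoice for", order.customer)
    print("Amount due:", total)

# After Extract Method: the calculation is named, and the intent is visible.
def amount_due(order):
    return sum(item.price * item.quantity for item in order.items)

def print_invoice(order):
    print("Invoice for", order.customer)
    print("Amount due:", amount_due(order))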

Thursday, March 09, 2006

The Peter Principle

"The Mythical Man Month" is one of the Ancient Tomes of software engineering, revered for its wisdom. Another book which should be canonized is "The Peter Principle" by Laurence J. Peter.

Peter's book describes the inevitable consequence of modern organizational dynamics: "In a hierarchy every employee tends to rise to his level of incompetence."

How can we apply this maxim to software engineering? Peter used this principle to explain the inefficiencies and irrationalities of bureaucracy, but we see similar effects in any mature body of software. Without ongoing scrutiny and diligence, software-maintenance efforts produce results that are bloated, unclear, and poorly organized. This isn't the fault of the software itself, but of how it is designed, written, and maintained.

Often, the first version of a software program is written in haste, to solve a particular problem. Over time, the problem definition changes and broadens, and the software expands to solve the larger problem. If this is done carefully, the result matches the problem definition and retains its clarity. But so often the result is exactly the opposite: casual bugfixing, hasty extensions, and plain inattention leave the program riddled with inefficiencies, redundant code, and unused code. The result is software performing far below its potential.

Organizationally, the Peter Principle ensures that a collection of humans will function in a less-than-optimal manner. In software, the Peter Principle ensures that code will contain bugs, be inefficient, and be difficult to understand.

How do we escape the Peter Principle? A much less popular book, "The Peter Prescription", outlines some remedies. In the realm of software engineering, the current vogue is extreme programming, which -- through oversight, test cases, and a culture of continual monitoring and improvement -- attempts to keep the code clean.

Monday, February 27, 2006

Antibodies

The old saw that "bureaucracies grow to fill the need of a growing bureaucracy" means that eventually your organization will acquire a few (or more) antibodies.

What's an antibody? I first heard the term at Sun Microsystems, referring (in a derogatory fashion) to chair-warming bureaucrats who are resistant to change. Change, to an antibody, represents danger rather than opportunity. Hence, the antibodies swarm around the new idea and attempt to neutralize it. With the new idea successfully destroyed, the antibody can return to its relaxed, inactive state.

At Adobe Systems, we used a different term: the "yeahbuts". A yeahbut, upon hearing a new idea, would respond with "Yeah, but ..." and then a set of canned reasons why innovation wasn't practical, prudent, or necessary at this particular time.

The problem with antibodies is that they're hard to get out of the system, short of simply firing them. So often, the antibody will cloak its resistance in a set of plausible justifications:

"We haven't completed our analysis of the product requirements, so it is really premature to commit to this course of action."

"My staff is fully occupied with maintenance on the most-recent release, and we haven't had time to review your proposal."

or, even more insidious, the trap of false comparatives:

"Your plan to put out mouse traps sounds very similar to Pete's proposal to fumigate the flooded toolshed, and we simply can't undertake the poison-abatement process with our current staff. Therefore, I won't approve your request for new mouse-traps."

Antibodies find strength in numbers, and it isn't hard to find them clustered around the drinking fountain, discussing tactics for resisting the latest "change agent".

Tuesday, February 07, 2006

This year's must-attend conference ...

Let's all go to Waterfall 2006, and use the processes of yesterday to build the products of tomorrow!