Un-writing code
Recently I got a chance to clean up a real mess in our system-monitoring code. Our application was starting to run very slowly when hundreds of users were connected to a certain service. This was puzzling, as the service was intended to handle thousands of users simultaneously -- why was it bogged down with such a small load?
This had been a nagging problem for several months, and many developers had gone into the system to try to fix it. Some improvements had been made, but we were nowhere near the needed level of performance. The general consensus was that we needed a bigger server -- throw hardware at the problem, as the software was just fine.
I spent a few days reading the code, then talking with developers. After some simple code changes, the system was easily able to handle tens of thousands of users, without having to upgrade the hardware.
What was the problem, and why hadn't it been discovered earlier?
Here's what I found -- you will easily see how the problem developed ...
The first developer to work on this system wrote some prototype code that efficiently handled twelve users. He had designed a simple linear array of user-records, and when a certain operation needed to scan through the list of users, that table got built, sorted (using a bubble sort), and then traversed to make the report.
A few weeks later, a different developer came in, recognized that the code was only able to handle twelve users, and rewrote the code to use an array of arbitrary size. The rewrite created the array at the moment the report was requested: the array was allocated with whatever length was necessary, filled in, sorted, and then used to generate the report. Note that the bubble-sort method was retained.
After that, a developer rewrote the code yet again ... this developer apparently felt that fixed-length arrays were evil, so a doubly-linked list was used. Bubble-sort was retained, but the comparison function was re-written so that more fields of the user record were examined.
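After this rewrite, the code could hold any number of users, so it seemed to scale arbitrarily -- the problem, as you can guess, was the bubble-sort. Its running time grows with the square of the number of records, so what was harmless for twelve users caused the performance to lag noticeably with hundreds.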
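To make the cost concrete, here is a minimal Python sketch that counts the comparisons a bubble sort performs on a worst-case (reverse-ordered) list. It is not the original code, which is not shown here; the record layout and the names bubble_sort and worst_case_comparisons are made up for illustration.

```python
def bubble_sort(records, key):
    """Sort records in place with bubble sort; return the comparison count."""
    comparisons = 0
    n = len(records)
    for end in range(n - 1, 0, -1):
        swapped = False
        for i in range(end):
            comparisons += 1
            if key(records[i]) > key(records[i + 1]):
                records[i], records[i + 1] = records[i + 1], records[i]
                swapped = True
        if not swapped:
            break  # already sorted, stop early
    return comparisons


def worst_case_comparisons(n):
    """Build n hypothetical user records in reverse order and sort them."""
    records = [{"name": f"user{i:06d}"} for i in range(n, 0, -1)]
    return bubble_sort(records, key=lambda r: r["name"])


if __name__ == "__main__":
    for n in (12, 100, 1000):
        print(n, worst_case_comparisons(n))
```

In the worst case the count is n(n-1)/2: 66 comparisons for twelve users, but 499,500 for a thousand. A richer comparison function, like the one added in the linked-list rewrite, makes every one of those comparisons more expensive.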
This lag prompted the next set of changes -- somebody noticed that the system was taking too long to produce the report. The front-end webpage which displayed the report was not getting its results fast enough from the back-end code we're discussing here, so the front-end webpage code was timing out and displaying an empty report.
The next fix was to lengthen the timeout: wait five seconds, instead of one second, for this particular report to be generated.
And this is where I stepped in, to fix the whole mess.
First, I noticed that the web front-end applied its own sort method to the result set, so there was no need to do back-end sorting -- the code which generated the report could just traverse the user-list, and pass an unsorted array of user records to the web front end. Problem solved.
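The shape of that fix can be sketched in a few lines of Python. This is only an illustration under assumptions: the UserRecord fields and the build_report function are hypothetical, and the real back end was not written this way as far as this article says.

```python
from dataclasses import dataclass


@dataclass
class UserRecord:
    # Hypothetical fields; the real user record is not described in detail.
    name: str
    connected_seconds: int


def build_report(users):
    """Traverse the user list once and return rows in their existing order.

    No sorting happens here, on the assumption that the front end
    sorts the result set itself before displaying it.
    """
    return [
        {"name": u.name, "connected_seconds": u.connected_seconds}
        for u in users
    ]
```

The back end's work is now a single pass over the user list, so its cost grows in direct proportion to the number of users instead of with the square of it.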
Well, almost solved. I went back to each developer who had touched the code, and gave them a thorough description of how poorly the entire project had gone -- and reinforced the need to really understand the problem before writing code.
And thus I was able to save the project, just by removing a whole bunch of unnecessary code.
