Taking Over Before a Deadline
My cow-orker, Greg, had worked for some time on a software component when he was dragged onto another part of the project at short notice, and I was asked to take over his work.
There wasn’t much to do on the component – perhaps a couple of weeks’ work – but the deadline was a big one: the component had to be finished by the end of the last Friday of the month to be part of a major delivery. I actually completed the last of the work on the Thursday afternoon before the deadline. All that remained was to “check in” the files – to mark them as officially delivered so that the weekend compilation process would include them in the next version of the software.
Slow Diff
Have I mentioned that the Rational 1000 was slow?
One of the slowest operations was check-in. A check-in was like an elaborate save command that told other developers that a particular file was ready to be used.
The check-in operation would produce a list of differences between the latest version of the file and the previous one; storing such deltas is a standard way of compressing the storage of multiple versions of the code. Unfortunately, the diff function (the one that found the differences) was particularly inefficient on the Rational.
A simple check-in would take at least 1 or 2 minutes to run. I had heard complaints that a complex check-in would take over an hour to run!
Performing the Check-In
Rather than wait, I left the check-in process running when I went home on Thursday night, patting myself on the back for finishing a full day early before a major deadline.
When I came in on Friday morning, I found the check-in process had frozen up. That wasn’t such an unusual occurrence. I simply aborted it, and ran it again.
About an hour later, a horrid thought occurred to me… To explain it, I need to give some background.
Unit Testing
The Rational 1000 had a built-in Unit Test infrastructure.
All the unit tests for a component could be placed in a folder. In another folder, you could put the “Golden Results” – a set of files containing the expected output (with a few regular expressions to handle expected variations like timestamps).
With one button, all the tests would be run, and the output redirected to another folder. It would then be compared against the Golden Results, and an appropriate report produced.
These types of unit-testing environments have become popular now, but at the time this was a great improvement over the other IDEs I had used.
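To give a flavour of the compare step, here is a minimal sketch of the idea in Ada – my reconstruction with invented file names, not the Rational’s actual tooling, and ignoring the regular-expression handling:

with Ada.Text_IO; use Ada.Text_IO;

procedure Compare_Results is
   Golden, Actual : File_Type;
   Mismatches     : Natural := 0;
   Line_No        : Natural := 0;
begin
   Open (Golden, In_File, "golden_results.txt");
   Open (Actual, In_File, "test_output.txt");
   -- Walk both files in step, reporting every line that differs.
   while not End_Of_File (Golden) and then not End_Of_File (Actual) loop
      Line_No := Line_No + 1;
      if Get_Line (Golden) /= Get_Line (Actual) then
         Mismatches := Mismatches + 1;
         Put_Line ("Mismatch at line" & Natural'Image (Line_No));
      end if;
   end loop;
   if End_Of_File (Golden) /= End_Of_File (Actual) then
      Put_Line ("Files differ in length");
   elsif Mismatches = 0 then
      Put_Line ("All output matches the Golden Results");
   end if;
   Close (Golden);
   Close (Actual);
end Compare_Results;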
Unit Testing Approach
Greg used an interesting tactic in the generation of his Golden Results. He computed the results of various functions, and put a lot of effort into outputting their values in the same format as Ada literals. (If you are familiar with the Python repr() function, you’ll understand the kind of code he wrote.)
His unit tests generated 20,000 lines of output, which he added to the Golden Results, and then methodically worked through each line, reviewing its accuracy. As he reviewed each data structure and confirmed that it was the result he was happy with, he replaced the output code with an assert statement (or at least an equivalent if statement).
My Ada is pretty rusty, but let me try an example. The first round of the code might have looked like this:
result := Shakespeare.Get_First_Name();
Ada.Text_IO.Put_Line("""" & result & """");
which would output:
"William"
Having carefully checked that William was the right result, Greg substituted in the generated literals:
result := Shakespeare.Get_First_Name();
if result = "William" then
   Ada.Text_IO.Put_Line("OK");
else
   Ada.Text_IO.Put_Line("Failed");
end if;
which would output:
OK
It was really easy to see whether the tests worked after this. Instead of 20,000 lines full of hard-to-check values, there were 500 lines of “OK” printed. A nice trick, Greg!
Horrid Realisation
So my horrid thought was the realisation that Rational’s incredibly slow diff function, running as part of the check-in process, was trying hard to match up the lines between 20,000 lines of Ada-ish code and 500 lines of output that simply said “OK”. There was nothing in common between the two files, but the check-in process didn’t know that.
The check-in process I had aborted in the morning hadn’t frozen overnight. It was still alive when I killed it! Repeating the check-in process was going to take over 16 hours, but I had about 7 left before my deadline! I wasn’t going to make it, because I couldn’t save Greg’s file in time!
Coda
Careful grovelling got me special dispensation from my boss’s boss to deliver the code without delivering the Golden Results. I moved the two versions of the file onto a nearby VAX VMS machine and ran the “diff” command there; it ran in under five seconds!
The experience left a bad taste in my mouth. A year or two later, I got the opportunity to choose an algorithm to analyse as part of an algorithms course. I chose “diff”, in an attempt to understand the tradeoffs between choices in the diff algorithm that might explain the large discrepancy. I was left unsatisfied; even the most basic implementations of diff have no scalability issues handling very large files that are nothing like each other. I still have no idea why the Rational diff command was so slow.
Comment by Aristotle Pagaltzis on January 15, 2006
Maybe it was simply a naïve implementation. After all, the most straightforward approach that would occur to someone who doesn’t know any better has polynomial complexity – as is the case with almost any algorithm that involves comparisons, now that I think about it. (String search and sorting are other examples, off the top of my head.)
Comment by Julian on January 15, 2006
I agree it was a naïve implementation, but it was worse than the canonical naïve implementation.
The canonical naïve solution is O(sizeOfFile1 * sizeOfFile2) and, perhaps surprisingly, isn’t related to the number of differences.
The small file would have been 500 lines * 3 characters = approx 1500 bytes.
The large file would have been 20,000 lines * approximately 15 characters = 300,000 bytes.
Sqrt(1500 bytes * 300,000 bytes) ≈ 21,200 bytes, or about 21 KB.
So it should have taken about the same time as comparing two identical 21 KB files to each other – hardly a job worth more than 16 hours of elapsed time (which included more than 12 hours of CPU time).
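For the record, here is a minimal sketch (mine, not Rational’s) of that canonical approach – the dynamic-programming longest-common-subsequence table at the heart of a naive diff. The nested loop fills every one of the M × N cells, whether the inputs are identical or utterly unrelated:

with Ada.Text_IO;

procedure Naive_Diff_Cost is
   -- Stand-ins for the two inputs; a real diff compares lines rather
   -- than characters, but the cost argument is the same.
   A : constant String := "OK OK OK OK";
   B : constant String := "unrelated Ada-ish output";
   -- The classic longest-common-subsequence table: (M+1) x (N+1) cells.
   LCS : array (0 .. A'Length, 0 .. B'Length) of Natural :=
     (others => (others => 0));
begin
   -- Every cell gets filled in, so the work is always M * N,
   -- independent of how many differences there actually are.
   for I in 1 .. A'Length loop
      for J in 1 .. B'Length loop
         if A (I) = B (J) then
            LCS (I, J) := LCS (I - 1, J - 1) + 1;
         else
            LCS (I, J) := Natural'Max (LCS (I - 1, J), LCS (I, J - 1));
         end if;
      end loop;
   end loop;
   Ada.Text_IO.Put_Line
     ("LCS length:" & Natural'Image (LCS (A'Length, B'Length)));
end Naive_Diff_Cost;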
Reference: My Uni analysis that I might post here, if I can convert the LaTeX to HTML.
Comment by Aristotle Pagaltzis on January 15, 2006
Ah. I didn’t have numbers on the size of your input; that indeed sounds minuscule.
Maybe it did something like what the user search code in the control panel of Ultimate Bulletin Board 5.x did: it called a function to get a list of usernames, then it went over the list in a loop in which it called another function to get the details about the user in question. Which would be sane enough if the function that returned account details didn’t do its job by opening the user accounts database flatfile, reading and parsing it all and storing the entire thing in memory, only to return a single record and throw away all the other data.
This is how you turn any problem of linear complexity into an algorithm of quadratic complexity.
(And how you inflate the runtime of a search function on a system with 3,000 users from 7 seconds to 3 minutes.
These are actual times, measured after and before my refactoring. The additional half magnitude not accounted for by the linear vs. quadratic complexities is due to the fact that, unsurprisingly, the old code also did a lot of useless work by not storing parsed data in an appropriate data structure, instead painstakingly gluing everything back into strings at the end of each step, only to unravel them from scratch during the next.
Needless to say, besides being orders of magnitude faster, my code was also much shorter (by about 3× if memory serves), and at the same time it was actually readable, because the (trivial) intent was explicit.)
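Sketched in Ada to match the examples above (UBB itself was Perl, and the file names here are invented for illustration), the shape of that mistake is:

with Ada.Text_IO; use Ada.Text_IO;

procedure Quadratic_Lookup is

   -- Returns one user's record -- by opening and scanning the
   -- *entire* accounts flatfile on every single call: O(N) per lookup.
   function Get_Details (Name : String) return String is
      Accounts : File_Type;
   begin
      Open (Accounts, In_File, "accounts.txt");
      while not End_Of_File (Accounts) loop
         declare
            Line : constant String := Get_Line (Accounts);
         begin
            if Line'Length >= Name'Length
              and then Line (Line'First .. Line'First + Name'Length - 1) = Name
            then
               Close (Accounts);
               return Line;
            end if;
         end;
      end loop;
      Close (Accounts);
      return "";
   end Get_Details;

   Names : File_Type;
begin
   -- N lookups, each paying a full O(N) scan: O(N**2) overall, where a
   -- single linear pass over the accounts file would have done.
   Open (Names, In_File, "usernames.txt");
   while not End_Of_File (Names) loop
      Put_Line (Get_Details (Get_Line (Names)));
   end loop;
   Close (Names);
end Quadratic_Lookup;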
Maybe someone at Rational wrote the R1000’s diff algorithm in a similarly “ingenious” way.