Wednesday, June 27, 2007

Late Late Late Night Benediction

So I'm sitting here at work, it's ten o'clock, there's a new build to test, and the lab computer goes belly-up, BSOD, down for the count. At least until reboot.

Why am I sitting here in the lab at ten o'clock at night?

Because the software has to go out tomorrow.

(Actually, it was supposed to go out last Friday, but that was a billion bugs ago, and there's only a million more to go.)

It was thought to be 'done' this afternoon around 4 p.m., when the tests passed on the Test Rig across the street, and everyone in the group laughed and smiled and threw up their hands in celebration, and went home.

But something in the log file nagged at me, something that should not have been in there. The test was reporting a Pass even though the log data showed very clearly that it had failed.

Why wasn't anyone else worried?

They're all so eager to just be done with it. This project has dragged on long enough, and we have to get something down to Chicago next week to throw in the thermal chambers, and right now they're probably willing to put the box in the chamber even if it doesn't boot up, just to get it out of the way.

But not me. I'm a bloody perfectionist.

So I'm slogging through the code, adding debug routines here and there to try and figure out what's going on (this is, after all, someone else's code, so it takes a while to figure out what was going on -- so much for documentation!!), loading up the box and re-running the tests, gathering more data, analyzing it, going back over the code, feeling like I'm banging my head into a brick wall -

When suddenly it becomes clear. There's a major disconnect between the guy who wrote the software, and the guys who expected to use it. The guy who wrote it, he expected the code to be run once, and continue running forever. The guys who expect to use it, they are going to be starting and stopping and re-starting this software throughout the whole test sequence. But it wasn't written in such a way as to handle that case. It's definitely a run-once kind of code.

Oops.

So the next question becomes, how can this code be massaged to work the right way, without having to rewrite the entire thing from scratch? I'd really like to go home now...

---
{much later, after much coding & caffeine}

Massaged the code for awhile, to no avail. Still failing. Ran it in both manual and automatic mode. Thought it was fixed because it was passing in manual mode, but then it failed in automatic mode. Weird. Tossed in some more debug code.

---
{a little while later}

Something odd going on here. It passes twice, then fails on the third try. And it seems to fail ONLY when restarting. Like there's something getting hosed up during the restart. But what's so special about the restart? The process is just supposed to suspend itself, not discard all the memory. But it's acting like something in memory is getting overwritten while the process is in a suspended state. So much for the partitioned system.

---
{really, really late}

It's really odd, but I don't have time for odd things right now. I'm tired and want to go home. Looking objectively at the problem, it's obvious that something is getting messed up during the first re-run of the test, something left in an unknown state so that it causes the comparison logic to fail. And the logic is failing in the middle of a loop, so it's a good guess that restarting the loop might reset the logic somehow. But there's no easy way to drop out of the loop ... or is there?

OK, so if I put a special variable in the failure-detection logic, I can use that to ignore any failures right after a restart. It gets set by the restart call, then when the comparison fails, it simply ignores the failure and keeps ignoring it until the loop finishes. Then the special variable gets reset, the comparison logic gets reset, and we get a 'clean slate'.

Let's see if it works...

---
{a little while later - midnight}

It works! Hooray!

Now let's zip this puppy up and get on home...

1 comment:

Anonymous said...

glad to hear you are concerned about your work being done properly.
some of the outfits who have done stuff for the hospital have not been so diligent.