Vigorous Debuggery

posted in The Bag of Holding

Published July 15, 2006

I'm on the verge of screaming.

Either that, or breaking down and weeping in a high-pitched wail until I drown in a pool of my own tears. I'm still trying to decide.

The last couple of days have been a nightmarish marathon of convoluted debugging work. It started yesterday, at some early hour, I can't recall when. (Things have literally been so crazy that I don't know when I've been awake, when I've been asleep, or anything. I just remember a long blur and my butt getting really sore from sitting so long.)

The first bug was with some modifications to some XML loading code. There was a node in the XML tree that had been loading since Day One and working great. I needed to add a couple of subnodes to it for a new feature. Things seemed to work fine; my ad-hoc unit tests showed that everything was running great, provided the XML was loaded correctly.

Naturally, the XML wasn't loaded correctly.

I spent hours poking around and rearranging code, trying to find where the load was going wrong. I finally convinced myself that the loader was indeed to blame, but I was staring right at the code - there was clearly a loop that iterated through each subnode and loaded it.

I output the number of subnodes before running the loop; the count was correct. I manually output the entire subnode tree fragment to convince myself it was loading. Then, in desperation, I added a simple line to the loop that would write a log entry each time the loop iterated.

It only ran once.

This prompted me to jump about my apartment in a furious rage, shouting obsenities like Samuel L. Jackson himself, and pounding on various household objects with a soft rag (even in my psychotic outburts I try not to break stuff if I can avoid it). After doing my impression of a gorilla that just got shot in the ass for a few minutes, I sat down again, ready to tear the loop apart and totally rewrite it - some crazy evil voodoo was breaking the laws of physics. I mean, the loop constraints clearly said to loop through each subnode - that worked fine (I even made a separate dummy loop with those constraints just to convince myself they worked). The subnode loading code worked. The function wasn't being aborted early. It was almost as if the loop was being explicitly aborted after the first iteration.

... and then I found it. For some unfathomably moronic reason, I had a break; line at the end of the loop, hidden out of sight between a couple of closing brackets, just lurking in the shadows, waiting until it could lunge out and slice off the ankles of passerby.

I honestly can't tell you how stupid I felt. I meekly ripped out all my debug logging (the more recent of which was littered with rude remarks about the nature of deterministic computing machines) and checked in the bug fix - one whole line of code removed.

Damnit.

As if that weren't bad enough, today had to happen. I originally wasn't going to work at all this weekend - just play some games, go run around in the Big Blue Room a bit, and catch up on some reading. The old fashioned kind of reading, with the little analog thingies - you know, that have all the ink and slices of dead trees and stuff. It was going to be a good weekend. It was going to be the way weekends should be, which is to say, not full of work.

I woke up last night around 10 PM, thanks to my body's utter lack of a functioning circadian rhythm machine (I think it has a QPC dualcore bug and I haven't updated the drivers yet). I waffled around for a bit until around 5 AM, trying to force myself into not working. Unfortunately, I'm a bit obsessive, and I'm also a day or two behind on my estimated progress - largely due to stupid bugs, which I won't mention.

Oh wait, I already told you about that. Anyways.

So finally about 5, I convinced myself to tackle this next feature. It's no big deal, you know, just tweak a couple things, refactor a few functions, and wham - all set. Nothing to it. I had complete plans for the changes written out on my notepad, with every single step needed to implement and test the additions. Foolproof.

It went smoothly for a while, until I got to the last bullet point on the list. Up until now, this system has only understood statically timed sequences of events - the XML tells it when every single operation starts, and when it stops. When the XML is loaded, a lot of relative timing offsets are converted into absolute time, and condensed into a "playback operations list," which is basically just a queue of "start this process, stop that process" commands and a timestamp of when they are to be executed.

The plan for this morning was to add the ability for commands to be triggered dynamically from code, rather than have statically defined timings. This was broken into two parts: starting events dynamically, and running a started process until an event occurs, and then shutting it down.

Around 8 AM, I had the first half totally working - it was tested out fully and running beautifully. Not a hitch. So, being an idiot, I told myself that I could get the second bit done in maybe an hour or so, and then go watch THX1138 and get on with all the relaxing and not-working I was planning on.

Around 1 PM, the debugging had become so frustrating that I came over here to write up my tale of woe, hoping that in the process of explaining my predicament I might figure out what I had overlooked.

What I had discovered was truly bizarre. An operation with a dynamic duration would never get started. It would load from disk correctly, and when the start event was raised, the commands to start the operation would be inserted into the playback op list correctly. They would even be popped off the queue and executed correctly. The entire command to start the operation worked flawlessly - no problems, no errors, all the expected values spewed out into the debug logs.

Except that, unlike every other op-start command, this one had no effect. Nothing would appear on the screen. A few seconds later, the op-stop command would be triggered, the process would be terminated, and everything would clean up. There were no errors to indicate that the start had failed - the shutdown was clean, and produced output that indicated that the operation had indeed been running in the background. Except nothing ever showed on the screen.

My attempt to articulate the story of the bug almost worked, except that just when I was getting to the good stuff, I got distracted. My stupid pea-brain decided that it wanted to figure out what random keystrokes did - in my browser window with the half-written entry.

One of the combinations I happened to press was Ctrl+W.

DAMNIT.

I was too drained to even be mad - after all the frustration of the morning and the last couple of days, I was just depressed. Heavy of heart and despairing for my life (well, ok, just despairing for my weekend) I turned back to my IDE and began to poke things again.

I double-checked everything for the command, running it over in my mind, and trying to see exactly what possible flow of logic could cause what I was seeing. There just was no way - by all measurable signs, the process was running.

But how could it have been running if it didn't draw anything onto the screen?

I was close to giving up and just spazzing out again, when I happened to scroll past a function that I wrote ages ago, and long since signed off as working correctly. Something about it niggled at the back of my subconscious, so I checked it over - hadn't been changed, and still passed all the tests. Nothing wrong with it.

Then I stopped and thought about what precisely it was doing. Due to the structure of the data, there is precisely one case where a node in the tree doesn't handle all of its own operations - one operation is handled by the parent node, and propagated down to the child node.

That couldn't have been the bug, though, because it was passing tests! No way was that function at fault.

But what if it was?

I checked over quickly what would happen if something went wrong - say, for instance, if it failed to find a child node and propagate the data. The net effect, in that impossible and totally not-happening case, was that the clipping region for the node would be set to a 0 pixels square.

In other words, if that particular function failed to find a child node, it would have the net effect that the child node would run perfectly as expected - with the sole exception that it would show nothing on the screen.

DAMNIT!

I checked up on the function that searched for the child node, and sure enough, there it was - a tiny little assumption that the child nodes would all be statically timed. Removing that assumption was a two line change.

(And, as if to celebrate the happy news, one of my favorite songs - Sleep by Conjure One - just started playing on DI. Woohoo!)

A quick compile and test seems to indicate that I've nailed the little rat bastard - except I can't tell for sure. Somewhere along the line, out of desperation, I commented out the ability to stop processes from running, on the off-chance that the dynamic process was getting killed before it could display anything. So now I have two processes trying to draw their output into the same rectangle, and all I get is chaos. D'oh.

This is where things get truly agonizing - I think I've nailed it, but I don't know. It should just be a quick recompile and retest to find out for sure. Except when I say "quick" I mean "butt-ass freaking slower than frozen molassess running uphill on the surface of a neutron star." Or, in slightly more clear terms, "not quick."

A compile can take between 30 seconds and 2 minutes, depending on what got affected. Launching the game and getting it to a point where I can run the test takes about 3-4 minutes. After all that, I can run the actual test, which takes 15 seconds. And due to data dependencies in the sytem I'm working on, I can't split it out into a separate app/testing framework to get it all developed.

This is the main reason why I'm looking forward to a new development machine so much - anything to reduce the perceived slowness and torment of the compile/debug/run cycle. It's enough to make me wish we were using Lisp or Haskell or something so I wouldn't have to recompile and restart and all.

So... with everything restored (I hope) to its correct logic, I set up one more run. I'm stupid enough to hope this one will be the last, but cynical enough to expect that it won't be.

Boy, can I call 'em or what? This time, everything runs correctly, and the dynamic process starts as expected. It even stays running! ... and running... and stays running. Either the command to stop it isn't being obeyed, or the event that triggers the stop command isn't being thrown correctly. According to the debug logs, the event isn't firing.

Now I have to trace the call upwards a layer, to see why it isn't being fired from the client code. It looks like there's a data screwup in the raw scripts, and unfortunately the guy who understands those scripts (and the code that runs them) is off not-working today. He's most likely doing exactly what I was supposed to be doing today... bastard [razz]

Thankfully, after a little bit of thought, I figured out my own mistake. A short tweak to the data, removal of some debug logging and small hackery, and we're off for (yet another) test run. Bizarrely, this new test data is churning out a lot of bogus negative numbers in weird places.

WILL IT NEVER END?!?

It takes quite a bit more time to straighten out a series of stupid typos and brainless mistakes in the test data. Finally, I reach a nasty inversion of my earlier problem: dynamic-length processes won't stop. The event fires, the operations to stop the process are queued and executed... and the bugger just keeps on running.

Finally, the guy who knows the scripts shows up online, and we hash out the problem and a working fix. There were actually two or three little things all going wrong at the same time, which led to the bogus behavior.

At long last, everything is debugged, and running as intended. The last scraps of temporary debug spam are snipped out, and it's time for one final test run to reassure myself that it hasn't spontaneously decided to break again in the meantime.

For a pleasant change of pace, it works flawlessly.

So here it is, 4PM. I've just finished a marathon of close to 11 hours of work I swore (on things very... precious to me, like my Yoda plushy collection, which I only own in my imagination) that I would not do today. Crap.

This story - and indeed pretty much all of this past 24 hours - has dragged on more than I really care to think about. I was planning on going off and cashing in on my hard-earned relaxation, but it seems the universe just isn't done taking a piss all over me yet.

Now there's a huge thunderstorm kicking up, and it looks like my gaming-slash-catharsis session is about to be killed before it begins.

Now I'm going to go cry.

Previous Entry The Desktop Metaphor Done Right

Next Entry Gaming, and such

0 likes 5 comments

Comments

Mike Bossy

Dude, I feel for ya. Your personal trip to hell sounds a lot like my recent one.

July 15, 2006 03:51 PM

mldaalder

I'm wondering, how does your boss know when your working and for how long?

From what I've read, you work from home through a VPN. So, does that mean that your boss looks at the VPN logs and sees your progress and pays you from that?

Or do you fill in a timesheet?
At school we are forced to fill in a time sheet, but unless I keep interrupting myself (yeah, the zone stuff you mentioned), I can't keep it accurate.

July 15, 2006 05:45 PM

ApochPiQ

My schedule is pure flextime - there is no tracking of my work hours at all. I suppose if someone wanted to get really anal they could check the VPN connection logs and maybe source control, but neither one would mean much; I'm logged into the VPN 24/7 on auto-redial for convenience, and I check in stuff sporadically depending on the nature of the work I'm doing. Today, for instance, would have shown a single checkin in a few tweaked lines, which isn't really representative of the massive amount of work I clocked [wink]

The idea basically is that everyone on the team knows what they need to get done, and we get it done. Aside from informal sorta-weekly status reports, there's extremely little in the way of management overhead. It's a great arrangement, really.

July 15, 2006 06:03 PM

Samsonite

/pat [smile]

July 15, 2006 06:38 PM

Daerax

aww have a /hug. we are here for you both now and after you are smacking yourself in the face for the silliness of the error.

July 15, 2006 11:24 PM

You must log in to join the conversation.

Don't have a GameDev.net account? Sign up!

ApochPiQ

Author

🎉 Celebrating 25 Years of GameDev.net! 🎉

Vigorous Debuggery

Comments

ApochPiQ

Latest Entries

A Few Farewells

Code Reuse In Actual Practice

Source-Level Debugging For Epoch Programs

Using Poison to Reverse Engineer Code

Using Poison to Reverse Engineer Code

Debugging Information Success

Debugging Information Success

Debugging Epoch Programs

Debugging Epoch Programs

Epoch 64-bit compiler progress

🎉 Celebrating 25 Years of GameDev.net! 🎉

Vigorous Debuggery

Comments

ApochPiQ

Latest Entries

A Few Farewells

Code Reuse In Actual Practice

Source-Level Debugging For Epoch Programs

Using Poison to Reverse Engineer Code

Using Poison to Reverse Engineer Code

Debugging Information Success

Debugging Information Success

Debugging Epoch Programs

Debugging Epoch Programs

Epoch 64-bit compiler progress

Reticulating splines