Sunday, April 28, 2013

Chasing stack overflows

Recently, one of my programs crashed with a segmentation fault error.

For the user of a program, bugs and crashes are bad. For a programmer, bugs and crashes are OK. Almost everything I learn comes from bugs and crashes. If I had known a better way to implement something, I probably would have done it that way in the first place. Watching your program crash and burn or compute at a snail's pace is great motivation to write better code. And making a code change that fixes your buggy program is a powerful way to learn. One might call this crash-driven development.

Here are a few things I learned:
1. Get comfortable using debuggers and examining core files. Setting up the debugger took so much time and mental energy that by the time I was ready to analyze anything, I had already mentally checked out. I took a look at the stack, dumped some local variables, and more or less gave up. Using a debugger should be as easy as opening a text file and reading what is going on.

It turns out I could have found the reason for the crash if I had spent a little more time with the debugger. However, I gave up too soon, because I don't have much experience working with debuggers. I figured I'd be better off just reading my code, running tests and going from there. Unfortunately, the crash wasn't really reproducible on my development machine, so using the debugger with the saved crash environment was the way to go.

2. Prove the bug before implementing a fix. After I had spent some time hopelessly trying to reproduce the crash, a coworker told me that the machine where my program crashed was set up with a relatively small default stack size*. He suggested that I raise the stack size. This led me into a whirlwind of reading, code fixes, and tests that produced conclusions like: if I overflow my stack, my program will crash. While that statement is true, it didn't prove that a stack overflow was the cause of my crash. I should have spent more time definitively proving or disproving the suspected cause.

3. Ask for help. One colleague suggested that I had a stack size issue, which wasn't the case. Another suggested that I refactor my code a bit to make it clearer what was happening at the time of the crash. This second discussion led me to re-evaluate a different section of my code and somehow see the problem. Even though the second colleague didn't see the bug, just discussing the issue with him led me to the solution. I have no idea how the brain actually works, but somehow explaining my bug to another person who just asks questions is enough to rewire my brain to see previously hidden problems.
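
On the stack size theory from point 2: a cheap first step toward proving or ruling it out is to look at the stack limit a process actually runs with. The following is only a minimal sketch, assuming a Unix-like machine and Python's standard `resource` module. The helper names `describe_limit` and `stack_limits` are made up for illustration, and the limits reported are those inherited by the script itself, so it would need to be started the same way as the crashing program.

```python
import resource


def describe_limit(value):
    """Render an rlimit value (in bytes) as text; RLIM_INFINITY means no limit."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def stack_limits():
    """Return the (soft, hard) stack size limits of this process as text."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"soft stack limit: {soft}")
    print(f"hard stack limit: {hard}")
```

Running this on both the development machine and the machine that crashed shows whether the limits really differ. A small limit by itself does not prove a stack overflow; it only tells you whether the theory is worth testing further.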

Just as a final note, the cause of my bug was an off-by-one error. I was reading from a dynamically generated set of data records that could have a variable number of fields and records. I was using the wrong bounds, and sometimes (only on one type of machine, and sporadically on some queries) the program would crash. The fix turned out to be a one-line code change.
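
To make the shape of that kind of bug concrete, here is a hypothetical sketch in Python. It is not the original code: the function names and the list-of-lists record layout are assumptions chosen for illustration. The buggy version uses an upper bound that is one too large when walking the fields of each record; the fix is a change to that single line.

```python
def read_fields_buggy(records):
    """Hypothetical buggy reader: the loop bound walks one past the last field."""
    values = []
    for record in records:
        # Off by one: valid indices run from 0 to len(record) - 1.
        for i in range(len(record) + 1):
            values.append(record[i])
    return values


def read_fields_fixed(records):
    """Same reader with the bound corrected (the one-line change)."""
    values = []
    for record in records:
        for i in range(len(record)):
            values.append(record[i])
    return values


if __name__ == "__main__":
    # Records with a variable number of fields, including an empty one.
    records = [["a", "b"], ["c"], []]
    print(read_fields_fixed(records))
    try:
        read_fields_buggy(records)
    except IndexError as exc:
        print(f"buggy reader failed: {exc}")
```

Python checks every index, so the buggy version fails with an `IndexError` on the very first record. In a language without bounds checking, the same mistake reads whatever happens to sit in memory past the end of the data, which would be consistent with a crash that shows up only on some machines and some queries.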

*Whenever I face a problem I'm not familiar with (in this case, stack sizes and stack overflows), I typically turn to Google to research it. How can I reproduce a stack overflow? What signs are there that I might be doing something bad? Are there messages that will help me identify what is going on? Unfortunately, searching the internet for information on stack overflows has been rendered virtually impossible by the ubiquitous programming Q&A site, Stack Overflow.
