7 Keys to Systematic Debugging

Learn to recognise the various bug breeds and build awareness of how they appear in packs.

This article is part of my Confessions of an Unintentional CTO book, which is currently available to read for free online.

1. Know That the Foundation of Debugging Is Legible Information Gleaned at the Exact Moment When the Bug Strikes

It’s hardly a point of contention that grappling with hex numbers is far less effective than working with the same information represented as human-readable ASCII. Similarly, a graph representation of your commit history can convey in seconds what might take hours to decipher through manual inspection of linear file changes. Thus we say that legibility is critical in debugging. This requirement of legibility applies rather extensively: to program inputs, outputs, internal states, source code, external libraries, etc.

It’s also self-evident that information available about your program’s state before it crashed pales in comparison to information gleaned at the exact moment of meltdown.

Both of these taken together should form your bedrock of debugging informational requirements. The next question, then, is where to get access to current, legible info. I’ll start you off with a laundry list of helpful tools (tinted, unavoidably, by my experience as a Rails programmer). This is neither prescriptive nor exhaustive—just a "best of" from my perspective.

(Rails programmers might want to supplement the following list with the specific recommendations I make in my Comprehensive Guide to Debugging Rails.)

1. The mighty debugger

In my education as a web developer, my biggest regret was not learning to pilot a debugger earlier. The difference between programming without a debugger and with one is like the difference between walking across Siberia on foot and flying over it in Air Force One. There may not be any silver bullets in software development, but there certainly is a gold one. Debuggers shuttle you right into the conflict zone and grant you a first-hand view of what the problem is.

One debugger is highly unlikely to be enough. Most web applications are written in two languages, for example JavaScript for the frontend and a language like Ruby for the backend. That’s two debuggers so far. But you’re not done yet! Remember that these programming languages may themselves be built in lower-level languages. And occasionally you’ll become snared at the higher level and the only way to become unstuck is to debug the underlying layer. For example, any Ruby programmer worth their salt knows that the C debugger is a godsend for analysing frozen Ruby programs.

As well as learning the ins and outs of commands within the various debuggers, it’s also worth thinking about how and when these debuggers will be launched. If, for example, a bug is sporadic in nature or laborious to re-trigger, you wouldn’t want to miss the opportunity to launch a debugger as soon as you get the chance. An example might clarify what I mean here: I usually equip my programs with the ability to launch into their debuggers whenever I send them OS-level signals. That enables me to jump in and debug whenever I notice odd behaviour.
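As a minimal sketch of the signal idea in Ruby: the module name `DebugOnSignal`, the choice of `USR1`, and the `checkpoint` method are all assumptions made for illustration. The trap handler only sets a flag, since very little is safe to do inside a signal handler; the program then opens a console at the next convenient checkpoint.

```ruby
# Hypothetical helper: request a debugging session by sending the process a signal.
module DebugOnSignal
  @requested = false

  class << self
    # Install the handler. It only flips a flag, because very little
    # is safe to do from inside a signal handler.
    def install(signal = "USR1")
      Signal.trap(signal) { @requested = true }
    end

    # Call this from a convenient spot (e.g. once per loop iteration or job).
    # Yields to the block, and returns true, only if a signal has arrived
    # since the last checkpoint.
    def checkpoint
      return false unless @requested

      @requested = false
      yield
      true
    end
  end
end
```

Calling `DebugOnSignal.install` at boot and `DebugOnSignal.checkpoint { binding.irb }` inside the main loop would then let `kill -USR1 <pid>` open a console in the running process.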

2. Convenient text editor/debugger commands for jumping to the source code of any function in your code or in the code of the libraries you use

You have to be able to first see source code in order to be able to understand and fix it. As such, you’ll definitely want lightning-fast tools for navigating to the broken code. Way too many programmers resort to awkward measures when investigating bugs they find in external libraries. Ensure that your code teleporting techniques extend this far—for god’s sake, you don’t want to be browsing GitHub in the heat of a bug hunt!

3. Interactive code consoles (REPLs) available in all environments (testing, production, development), complete with access to your actual data

You will frequently load up these tiny code laboratories to test out hypotheses about this or that, to inspect state, to write the first draft of new functions, and so on. Having your real data on hand lets you debug real problems.

4. Direct access to the underlying components of the stack

For example: When investigating a bug that has something to do with how your Ruby on Rails app interfaces with its database, you’ll certainly want to interface with your database (e.g. Postgres) directly. You’d better ensure you know how to do this.

5. Fluency with your revision history tools

Assuming your team is using source control properly, the project’s revision history should illuminate past programmers’ motivations behind various pieces of functionality. By studying this history, you’ll be better able to pinpoint problem areas, such as where the code that was written failed to match the original committer’s intention.

The specific abilities you’ll want from your toolset are as follows: You’ll want a way to attribute each and every line in a file both to a particular programmer and to a particular commit message. You’ll also want a way to search through the entire history of commits for a particular token (e.g. a function name, class name, variable name, etc.) and get an overview of every change that ever affected this token.
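With Git, both abilities map onto standard commands: `git blame` for line-by-line attribution and `git log -S` (the "pickaxe") for commits that added or removed occurrences of a token. The Ruby wrappers below are a hypothetical sketch; the helper names are mine, and only the Git commands themselves are real.

```ruby
# Hypothetical helpers that build argument lists for two standard Git commands.

# Who last touched each line of a file, and in which commit.
def blame_command(path)
  ["git", "blame", "--", path]
end

# Every commit that added or removed an occurrence of `token`, with patches.
def token_history_command(token)
  ["git", "log", "-p", "-S#{token}"]
end

# Run a command in the current repository and return its output as a string.
def run_git(command)
  IO.popen(command, err: File::NULL, &:read)
end
```

For example, `puts run_git(token_history_command("create_from_s3_upload"))` would print each change to the number of occurrences of that method name, assuming the current directory is a Git repository.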

Finally, you’ll definitely want a three-way GUI merge tool for commit conflicts. Three-way tools are superior because they make it a hell of a lot easier to understand what’s going on when two differing versions of a file have a common ancestor. I use KDiff3.

6. Colour coding for all your languages

Colour coding should be available not only within your text editor but also in your REPL, your debugger, your exception report traces, and your revision control repository. Colour coding quite literally highlights aspects of your code’s structure, thereby boosting your comprehension speed—this is important when time is of the essence, as is often the case when debugging.

7. Diagnostic tools for external software services

The modern web application relies on a constellation of supporting actors. Since these actors ultimately form parts of your stack, you will need to be able to peep inside the internals of these external programs. Within this category of software I include things like Memcached, full-text search engines, Google Analytics, and your payment provider’s financial transaction history.

8. Operating system-level diagnostic tools

You’ll need to be knowledgeable in these tools so as to inspect whether processes are running properly, whether hardware resources are being suspiciously drained, whether the right files are getting accessed, whether the appropriate operating system calls are being made by a library, and whether the correct environmental variables are set.

I’ve made a list of useful operating system-level debugging tools here.

9. Easily perused logs

These ought to (1) track sufficient data for piecing together the trajectory of the most puzzling of bugs, and (2) have sufficiently advanced filtering and data-transformation abilities so that you get answers to your questions in usable, information-dense forms. I use Logentries and it has never let me down.
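One way to make logs easier to filter and transform is to write one structured record per line. The following is a sketch using Ruby’s standard `Logger`; the `build_json_logger` name and the field names are assumptions for illustration, not something any particular log service requires.

```ruby
require "json"
require "logger"
require "time"

# Hypothetical factory: a logger that writes one JSON object per line,
# so that entries can later be filtered by field rather than by regex.
def build_json_logger(io = $stdout)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, message|
    payload = message.is_a?(Hash) ? message : { message: message.to_s }
    JSON.generate({ level: severity, time: time.utc.iso8601 }.merge(payload)) + "\n"
  end
  logger
end
```

A call such as `build_json_logger.info({ event: "upload_failed", file_id: 42 })` then emits a line that tools can query by `event` or `file_id`.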

10. Performance measurement tools

These metaphorical stopwatches and gas meters help you identify and remove performance bottlenecks and resource drains. They are a necessity when determining why something is running too slowly.

The keyword to search for here is "code profiler". Learn how to use one of these for each language in your web app.

11. Tools for grouping, parsing, and summarising exception reports

If the actions of one thousand users execute a buggy path in your production code, then a thousand near-identical exception reports will get generated. To avoid administrative hell, you’ll need some appropriate gear to sort through this mess. There are plenty of online services that’ll do this for you. I use Rollbar and am very happy with it.

12. Well-placed pre- and post-assertions peppered within your codebase

Assertions are a godsend for debugging because they alert you to aberrations which might not be sufficient to trigger full-blown exceptions but which nevertheless communicate that something is askew.
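Ruby has no built-in `assert` outside its test libraries, so the sketch below defines a tiny one. The `assert` helper, the `AssertionFailed` class, and the `apply_discount` example are all hypothetical, chosen only to show a pre-assertion on the input and a post-assertion on the result.

```ruby
# Hypothetical assertion helper: raises when the block evaluates to false or nil.
class AssertionFailed < StandardError; end

def assert(message = "assertion failed")
  raise AssertionFailed, message unless yield
end

# Example: prices are integer cents, percent is a whole number.
def apply_discount(price_cents, percent)
  # Pre-assertion: catch nonsensical input before it spreads.
  assert("percent must be between 0 and 100") { (0..100).cover?(percent) }

  discounted = price_cents - (price_cents * percent / 100)

  # Post-assertion: the result must never exceed the original price.
  assert("discount must not raise the price") { discounted <= price_cents }
  discounted
end
```

In production you might prefer to log or report a failed assertion rather than raise; that choice depends on how costly a halt is compared with continuing in a suspect state.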

13. A means of detecting and disentangling memory leaks

Memory leaks are insidious problems that wreak havoc on a program’s performance and reliability and can take weeks to debug if you don’t know what you’re doing.

Google "debugging memory leaks {YOUR PROGRAMMING LANGUAGE}" and learn the techniques now so you won’t get stuck in an emergency situation.

14. Terminal and IDE plugins that display (and therefore remind you of) important contextual information.

Here I have in mind things such as: (1) which source control branch you’re currently on, or (2) which version of your programming language is currently active (this can change as you navigate from folder to folder).

If the cause of your bug is that the wrong Ruby version was active, then you’ll know this straight away if you have a command line prompt that always informs you about what version is currently active.
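The same check can also live inside the program. The following sketch compares the running interpreter against a project’s `.ruby-version` file; the function name is hypothetical, and it assumes the file holds a plain version string, optionally prefixed with `ruby-`.

```ruby
# Hypothetical check: compare the running Ruby with a project's .ruby-version file.
# Returns nil when they match (or when no file exists), otherwise a warning string.
def ruby_version_mismatch(project_dir = Dir.pwd, running: RUBY_VERSION)
  version_file = File.join(project_dir, ".ruby-version")
  return nil unless File.exist?(version_file)

  expected = File.read(version_file).strip.sub(/\Aruby-/, "")
  return nil if expected == running

  "Expected Ruby #{expected} (from .ruby-version) but running #{running}"
end
```

Printing the result of `ruby_version_mismatch` at boot, when it is not nil, gives you the reminder even on machines where the prompt is not configured.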

15. Command histories (OS prompt and REPL)

Having a decent history of previous commands on hand gives you an effective audit trail for helping you get to the bottom of a problem. And even if it doesn’t, it will, at least, inform you of the steps necessary for recreating the problem. The command histories of all users on your production server should ideally be pooled so that the CTO can see what other team members have run.

Knowing the ins and outs of these individual tools is but the first step. Atop this foundation, you need to learn to connect the various components together in clever, time-saving ways. For example, instead of needing to leave the debugger to read the source code of a library, you’ll want the capability to view the source code from within the debugger. Instead of wishing you had inserted a debugger before you ran a ten-minute calculation that crashed, you want to have had functionality that automatically loaded up a debugging window as soon as the exception occurred—obviating the need to rerun the slow code. Instead of comparing two versions of a function by hand, you’ll want to pipe the same output to a side-by-side coloured diff representation typical of Git, thereby rendering the task so easy that a four-year-old could spot the difference.
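To illustrate the second of these, here is a rough sketch of opening a console at the point of failure using Ruby’s built-in `TracePoint`. The `with_post_mortem` name and its `opener` parameter are assumptions for illustration; dedicated debugger gems offer more robust versions of the same idea.

```ruby
# Hypothetical helper: run a block and, if it raises, hand the binding captured
# at the raise site to `opener` (by default an IRB session) before re-raising.
def with_post_mortem(opener: ->(raise_binding, _error) { raise_binding.irb })
  raise_sites = {}.compare_by_identity
  # Remember the binding of the first place each exception object was raised.
  trace = TracePoint.new(:raise) do |tp|
    raise_sites[tp.raised_exception] ||= tp.binding
  end
  trace.enable
  yield
rescue StandardError => error
  trace&.disable
  site = raise_sites[error] if raise_sites
  opener.call(site, error) if site
  raise
ensure
  trace&.disable
end
```

Wrapping the slow calculation as `with_post_mortem { run_ten_minute_calculation }` (a made-up method name) would drop you into a console with the local variables from the frame that raised, without rerunning anything.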

2. Develop a Reasonable Spider Sense about Who or What to Blame

This section is not intended to turbocharge your finger-pointing abilities; rather it serves the more benign purpose of guiding you toward a bug’s most probable cause, thereby speeding up said bug’s extrication.

1. You

It’s been my experience that the more questionable the programmer, the more likely they are to jump to the conclusion that other people’s code caused a bug and not their own code. But generally speaking, this isn’t true. We are the most likely causes of our own bugs: not the battle-tested web framework, not the programming language used by millions of professionals, and almost certainly not the operating system whose stability is so assured that it powers spaceships. Thus, before attributing blame to an external entity, you really ought to rule out the possibility that you (or your team) introduced the bug yourselves.

2. Recent change

When debugging, a solid opening gambit is to review the most recent commits affecting the sector of code that ran into issues. Since certain branches of code are executed exceedingly rarely, the most "recent" change may be months and months old, even if there were much more recent modifications to other files or classes that had nothing to do with the bug.

Change, in the sense I mean here, doesn’t just refer to code edits; it also refers to novel ways of interacting with the same code. By way of example, a sector of my software remained unchanged for a year, only to fail out of the blue. Upon investigation I learned that an employee administering the website through its admin dashboard had recently switched to a new workflow. This had the effect of exposing a latent bug. This bug managed to remain invisible for so long because it resided in a code path that hadn’t been used until the administrator changed his workflow.

Using similar arguments, the same could be said for bugs triggered by unexpected/malformed user input. Once more, the program’s code stays the same, whereas the program’s users interact with it in a novel way (i.e. by feeding it surprising new inputs). This exposes an already existent yet undiscovered flaw.

Version updates are also a source of change. When you bump an external library—be it a JS framework, a Ruby gem, or an operating system tool like ssh or sed—you should remain moderately suspicious of this component for the ensuing days. In particular, watch out for changes to a library’s API. Even though the library may be bug free, your usage of it may become outdated. Automated tests, reference to and discipline with major/minor version numbering, and attention to the library’s change logs should steer you clear of these problems.

3. Resource exhaustion

In contrast to the previous section, there’s a class of bug that suddenly shows itself following a long period of system stability. These bugs tend to be due to maxing out some system resources (like RAM, hard drive space, or some finite repository of unique tokens, as might happen with regard to file names). These bugs might equally well be due to maxed out software services (such as outgoing email credits, indexed entries in a paid search service, or memcached allocations).

If your time budget allows for it (or the security needs of your domain demand it), you should anticipate that certain resources will one day get used up and have fail-safes in place.

4. Less popular/unmaintained libraries

Third party libraries—in particular, ones that are less commonly used by the community or no longer maintained—should be treated with a dose of suspicion, moderated according to how trustworthy you perceive a given package. (Though you should have already ruled out self-caused error before pointing the finger at a third-party library.)

3. Learn to Speed Read Backtraces

This section will be obvious for seasoned programmers, so if that’s you, then feel free to skip; I’m writing for the benefit of programmers at the start of their careers.

When something blows up in software, the interpreter usually spits out an error report known as a backtrace. This can be hundreds of lines long (depending on the size of the stack) and is often replete with references to methods, classes, and files you never knew existed in your program.

Even though these backtraces look about as user-friendly as the small print in a mobile phone contract, they are nevertheless a godsend when debugging—as long as you know how to speed read them.

Below I’ve included a heavily truncated backtrace. As you can see, it’s quite a bewildering beast:

A NoMethodError occurred in background at 2015-03-12 15:43:01 UTC :

undefined method `notes_files' for nil:NilClass
/app/app/models/notes_file.rb:445:in `format_and_set_released_on'
/app/vendor/bundle/ruby/2.1.0/gems/activesupport-3.2.21/lib/active_support/callbacks.rb:407:in `_run__829461286475176671__create__3353320491559728596__callbacks'
/app/vendor/bundle/ruby/2.1.0/gems/activesupport-3.2.21/lib/active_support/callbacks.rb:405:in `__run_callback'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/paperclip-3.4.2/lib/paperclip/attachment.rb:380:in `valid_assignment?'
/app/lib/patches/paperclip_attachment_decorator.rb:26:in `assign'
/app/vendor/bundle/ruby/2.1.0/gems/paperclip-3.4.2/lib/paperclip.rb:199:in `block in has_attached_file'
/app/vendor/bundle/ruby/2.1.0/gems/activerecord-3.2.21/lib/active_record/attribute_assignment.rb:85:in `block in assign_attributes'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/activerecord-3.2.21/lib/active_record/associations/collection_proxy.rb:46:in `build'
/app/app/models/notes_file.rb:315:in `create_from_s3_upload'
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/performable_method.rb:26:in `perform'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/backend/base.rb:102:in `invoke_job'
(eval):3:in `block in invoke_job_with_newrelic_transaction_trace'
/app/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.7.0.177/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:339:in `perform_action_with_newrelic_trace'
(eval):2:in `invoke_job_with_newrelic_transaction_trace'
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:206:in `block (2 levels) in run'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:206:in `block in run'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/benchmark.rb:294:in `realtime'
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:205:in `run'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:153:in `block (4 levels) in start'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/benchmark.rb:294:in `realtime'
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:152:in `block (3 levels) in start'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/tasks.rb:9:in `block (2 levels) in '
/app/vendor/bundle/ruby/2.1.0/gems/rake-10.4.2/lib/rake/task.rb:240:in `call'
/app/vendor/bundle/ruby/2.1.0/gems/rake-10.4.2/lib/rake/task.rb:240:in `block in execute'
. . . `standard_exception_handling'
/app/vendor/bundle/ruby/2.1.0/gems/rake-10.4.2/lib/rake/application.rb:75:in `run'
. . .
/app/vendor/bundle/ruby/2.1.0/bin/rake:23:in `'

How should you tackle this deluge of information? The first step should be to pay attention to folder names. Underneath this paragraph I reprinted the same backtrace, except this time with the first instance of any folder bolded:

A NoMethodError occurred in background at 2015-03-12 15:43:01 UTC :
undefined method `notes_files' for nil:NilClass
/app/app/models/notes_file.rb:445:in `format_and_set_released_on'
/app/vendor/bundle/ruby/2.1.0/gems/activesupport-3.2.21/lib/active_support/callbacks.rb:407:in `_run__829461286475176671__create__3353320491559728596__callbacks'
/app/vendor/bundle/ruby/2.1.0/gems/activesupport-3.2.21/lib/active_support/callbacks.rb:405:in `__run_callback'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/paperclip-3.4.2/lib/paperclip/attachment.rb:380:in `valid_assignment?'
/app/lib/patches/paperclip_attachment_decorator.rb:26:in `assign'
/app/vendor/bundle/ruby/2.1.0/gems/paperclip-3.4.2/lib/paperclip.rb:199:in `block in has_attached_file'
/app/vendor/bundle/ruby/2.1.0/gems/activerecord-3.2.21/lib/active_record/attribute_assignment.rb:85:in `block in assign_attributes'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/activerecord-3.2.21/lib/active_record/associations/collection_proxy.rb:46:in `build'
/app/app/models/notes_file.rb:315:in `create_from_s3_upload'
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/performable_method.rb:26:in `perform'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/backend/base.rb:102:in `invoke_job'
(eval):3:in `block in invoke_job_with_newrelic_transaction_trace'
/app/vendor/bundle/ruby/2.1.0/gems/newrelic_rpm-3.7.0.177/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:339:in `perform_action_with_newrelic_trace'
(eval):2:in `invoke_job_with_newrelic_transaction_trace'
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:206:in `block (2 levels) in run'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/timeout.rb:91:in `block in timeout'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:206:in `block in run'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/benchmark.rb:294:in `realtime'
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:205:in `run'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:153:in `block (4 levels) in start'
/app/vendor/ruby-2.1.2/lib/ruby/2.1.0/benchmark.rb:294:in `realtime'
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:152:in `block (3 levels) in start'
. . .
/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/tasks.rb:9:in `block (2 levels) in '
/app/vendor/bundle/ruby/2.1.0/gems/rake-10.4.2/lib/rake/task.rb:240:in `call'
/app/vendor/bundle/ruby/2.1.0/gems/rake-10.4.2/lib/rake/task.rb:240:in `block in execute'
. . . `standard_exception_handling'
/app/vendor/bundle/ruby/2.1.0/gems/rake-10.4.2/lib/rake/application.rb:75:in `run'
. . .
/app/vendor/bundle/ruby/2.1.0/bin/rake:23:in `'

By training your attention on folders, you can simplify the backtrace, thereby helping you locate the triggers of a bug.

Now, given what I said earlier about blaming yourself for errors before pointing the finger at third parties, you should first focus on the folders that contain code you yourself wrote. All my third-party dependencies went into /app/vendor, so I just filter these entries out. (This could be done with a tool like grep.) Upon completion, I am left with this greatly reduced backtrace:

A NoMethodError occurred in background at 2015-03-12 15:43:01 UTC :
undefined method `notes_files' for nil:NilClass
/app/app/models/notes_file.rb:445:in `format_and_set_released_on'
/app/app/models/notes_file.rb:315:in `create_from_s3_upload'

Besides reducing the backtrace length to manageable proportions, this act of filtering did something else helpful: Entries that were previously separated by hundreds of lines of noise (i.e. all the /app/vendor entries) now appear on consecutive lines, making the bug’s causal chain easier to establish. We clearly see that #create_from_s3_upload was called, which led to #format_and_set_released_on getting called, which blew up because it called #notes_files on a nil object.
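The filtering step can be sketched in a few lines of Ruby. The frames below are copied from the backtrace above; the `app_frames` helper and its `/app/app/` default are assumptions for illustration. Note that merely rejecting `/app/vendor` entries would also leave the `/app/lib/patches` frame and the `(eval)` frames, so this sketch selects by the application’s own directory to reproduce the reduced trace.

```ruby
# A few frames copied from the backtrace above.
BACKTRACE = [
  "/app/app/models/notes_file.rb:445:in `format_and_set_released_on'",
  "/app/vendor/bundle/ruby/2.1.0/gems/activesupport-3.2.21/lib/active_support/callbacks.rb:405:in `__run_callback'",
  "/app/lib/patches/paperclip_attachment_decorator.rb:26:in `assign'",
  "/app/vendor/bundle/ruby/2.1.0/gems/activerecord-3.2.21/lib/active_record/associations/collection_proxy.rb:46:in `build'",
  "/app/app/models/notes_file.rb:315:in `create_from_s3_upload'",
  "(eval):2:in `invoke_job_with_newrelic_transaction_trace'",
  "/app/vendor/bundle/ruby/2.1.0/gems/delayed_job-4.0.0/lib/delayed/worker.rb:205:in `run'",
].freeze

# Hypothetical helper: keep only frames under the application's own code directory.
def app_frames(backtrace, app_root: "/app/app/")
  backtrace.select { |frame| frame.start_with?(app_root) }
end

puts app_frames(BACKTRACE)
```

Run against these sample frames, the script prints only the two `notes_file.rb` lines, in their original order.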

4. Never Nest Bug Hunts

The act of debugging has you dealing with code from your past, and as is often the case with reminiscence, you won’t always like what you see. When you revisit old code, you may find yourself overwhelmed by urges to make "quick" refactors and improvements. Even more arresting is the need to fix veritable bugs you spot along the way.

This urge must be resisted. It is better to make quick GitHub issues and then stay on your original course. Why? Because getting sidetracked causes you to lose the mental context accompanying the search for the original bug. This isn’t just about lost time in context switching: More problematically, a context switch creates the risk of introducing new errors, as would happen, for example, if you forgot to remove slow logging statements or if you forgot to excise temporary code intended for debugging purposes which would have disastrous effects if released into the wild. In addition to these risks, nesting bug hunts also creates a mess in version control. Your commits will become crowded and lose their atomicity, as evidenced by when the dreaded conjunction "and" shows up in your commit messages.

5. Recognise the Various Bug Breeds

1. Sloppiness Bugs

This category contains bugs due to misspelled constants, missing method parameters, and so on. Generally they are attributable to sloppy work or poor knowledge of the interfaces in use. Luckily these bugs are curable in seconds and with just a few keystrokes. Fixing these bugs usually requires little more than proofreading the code and referring to its documentation.

2. Unanticipated Environment Bugs

Programs are designed with specific ranges of inputs and environmental circumstances in mind. These assumptions may have been consciously woven into the code’s design or they may merely be tacitly assumed without anyone consciously considering the issue.

Regardless of where the assumptions came from, they are soon put to the test by real-world usage of the program. Problems occur when the program encounters unanticipated situations. By way of example, I’ve run into trouble with the following:

  • I did not anticipate that uploaded filenames might contain quotation marks, e.g. "Essay on 'Love in Ancient Rome'".

  • I assumed file upload sizes would be 10MB at most. But some users uploaded files that were two orders of magnitude bigger (1GB+).

  • I assumed the network would always be up, whereas in reality it turned out to be patchy. This unreliability broke the connection between my code and remote supporting services.

These bugs are special in that they are essentially due to incorrect speccing. The best prevention here is a combination of careful thought and domain-specific knowledge.
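For the patchy-network case, one common mitigation is to retry with an increasing delay. The helper below is a hypothetical sketch rather than what I actually shipped; its name, its defaults, and the choice of exception classes are assumptions you would adapt to your own client library.

```ruby
# Hypothetical helper: retry a block when the network misbehaves,
# doubling the wait between attempts.
def with_retries(attempts: 3, base_delay: 0.5, on: [IOError, SystemCallError],
                 sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    yield
  rescue *on
    # Give up once the attempt budget is spent and let the error surface.
    raise if tries >= attempts

    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end
```

`SystemCallError` is the parent of the `Errno::` family (for example `Errno::ECONNRESET`), which is why it appears in the default list. The `sleeper` parameter exists only so the waiting can be replaced in tests.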

3. Hubristic Understanding Bugs

This breed of bugs is due to our misplaced belief that we understand something when we in fact do not. This certitude leads to bugs which are tricky to solve, for they require us to question deeply held assumptions. As such, the solutions to these bugs may be invisible to us, owing to our own hubris.

Here’s an example of one such bug that slipped by me unnoticed. I have a function which is supposed to tell me whether I refunded a customer in full or partially. I discovered a bug where some of my refunds were being categorised as "Partial" when in fact they were full refunds. Here’s the relevant code:

```ruby
# Classifies a refund by comparing its amount with the order's amount.
def refund_type
  if refund.amount == order.amount
    "Full"
  else
    "Partial"
  end
end
```

This function definition seemed correct so I quickly ruled it out as the source of error and began looking elsewhere. But as it turned out, I later discovered these lines were responsible for the bug. This is because the numerical type contained in the variable refund.amount was of the class BigDecimal, whereas the type in order.amount was a Float. To my disbelief, it transpired that operations of equality are unstable when done between differing numerical types:

```ruby
require "bigdecimal"

# Results as I observed them; they may differ across Ruby and bigdecimal versions.
BigDecimal("67.8") == "67.8".to_f
# => true

BigDecimal("67.9") == "67.9".to_f
# => false
```

I could have spotted this bug if I had distrusted my own assumptions and instead loaded up a debugger and worked my way through the affected code line by line, mentally pre-computing what I would expect in each evaluation before checking the computer’s actual output. The awareness and care needed to unearth this bug demonstrates that a sizeable part of acquiring debugging ability is embracing scepticism. Perhaps the programmers with the most reliable code are those who harbour the most self-doubt.
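One possible repair, sketched under the assumption that both amounts can be normalised to `BigDecimal` before comparing, looks like this. The `to_decimal` helper is hypothetical, and the function takes the two amounts as arguments so that the sketch stays self-contained.

```ruby
require "bigdecimal"

# Hypothetical helper: convert a numeric value to BigDecimal via its string form,
# so that a Float like 67.9 becomes exactly the decimal 67.9.
def to_decimal(value)
  value.is_a?(BigDecimal) ? value : BigDecimal(value.to_s)
end

# Compare like with like: both sides are BigDecimal by the time == runs.
def refund_type(refund_amount, order_amount)
  to_decimal(refund_amount) == to_decimal(order_amount) ? "Full" : "Partial"
end
```

Storing every monetary amount as the same type in the first place (decimals, or integer cents) would remove the need for the conversion altogether.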

4. Hidden Universe Bugs

This breed of bugs (which overlaps with Hubristic Understanding Bugs above) is caused by things you never even considered as potential sources of error, such as constraints you were completely unaware of, system layers and protocols you never realised were present, or classes of problems you had no idea were even tractable.

These bug hunts take devastatingly long periods of time to solve. Their resolution necessitates learning how increasingly fundamental topics work—be that operating system process models, DNS resolution systems, database indexing algorithms, byte encoding rules, or debugging tools for software layers far removed from your day-to-day work. In a sense, these bugs invoke a revolution in understanding—a paradigm shift in the truest Kuhnian sense. Anyone who successfully solves a Hidden Universe Bug emerges as a stronger programmer.

In “normal” bug hunts—if you’ll permit such a thing to exist—you know roughly what’s wrong. It’s usually possible to arrive at a resolution by scouring internet forums, or personally eyeballing the code, or whipping out a debugger and revisiting your assumptions with its assistance. In these cases, you typically encounter helpful entities like Google-able exception names, standardised error codes from a payment provider’s API, or function output that shows itself to be clearly wrong in some understandable manner.

By contrast, with bugs caused by hidden universes, you may have narrowed the bug’s cause down to a single line yet still remain clueless as to what’s wrong. You will feel as if an invisible demon is haunting your processor.

Revisiting the example about numerical types from the above section on Hubristic Understanding Bugs, the very first time a programmer encounters such an error, it is a hidden universe bug because it’s not natural for a human to think that the equality operator could break in this way. This programmer’s idealistic mental model of numbers doesn’t match the thornier machine implementation, and he or she must journey to the murky details of floating-point arithmetic before being able to cure the broken code. But the next time this programmer encounters such a bug, it has transformed into a known, effable entity—an issue with floating-point arithmetic. The programmer’s perceptual understanding has evolved here, just as it does with these other, perhaps familiar, situations:

  • Results from multithreaded counters that were formerly inexplicably incorrect become perceived as "race conditions caused by processor context switches".

  • A web server stopping in the middle of the night for no apparent reason becomes an "init.d script that didn’t restart Nginx following an emergency reboot of the server".

  • Mysterious database records that turn up during browser integration tests become "artefacts from database transactions".
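To make the first of these concrete, here is a small sketch of a counter that stays correct under multithreading. The `SafeCounter` class is made up for illustration; the point is that the mutex turns the read, increment, and write into one indivisible step, so a context switch can no longer land in the middle.

```ruby
# Hypothetical counter: the mutex makes read-increment-write a single atomic step.
class SafeCounter
  attr_reader :value

  def initialize
    @value = 0
    @lock = Mutex.new
  end

  def increment
    @lock.synchronize { @value += 1 }
  end
end

counter = SafeCounter.new
threads = 8.times.map do
  Thread.new { 1_000.times { counter.increment } }
end
threads.each(&:join)
puts counter.value # prints 8000
```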

It is for these bugs—those attributable to hidden universes—that I most especially direct the advice in the later section "Pay Attention to All Oddities".

5. Clumpy Bugs

Clumpy bugs consist of knots of small, seemingly unrelated issues which collude into forming a more spectacular failure than any individual component could cause on its own.

The principal danger arising from these bugs is the premature presumption that there was a unitary cause of failure, as would happen when the programmer analysing the broken code notices something amiss, presumes this was the sole cause of the reported failure, and then writes a unit test for this particular component before going on to fix the narrow issue. Not until after deploying will this programmer discover that the primary bug still exists, due to there being outstanding unresolved issues outside the remit of the newly written unit test.

The first step to handling clumpy bugs is to recognise that any given bug can be caused not just by single errors but also by groups of errors. We incorporate this realisation into our work by writing integration tests to confirm bug resolution instead of relying on mere unit tests which may be too narrow in scope for clumpy bugs.

6. Pay Attention to All Oddities

Bug hunts bring you on a whirlwind inspection tour of databases, configuration files, function return values, log entries, logic flows, etc. As you wade through these torrents of information, it’s only natural to be tempted to discard seemingly unrelated oddities. Let’s say you are debugging something related to the Product model and you happen to notice something weird in the User model, say a flow of logic that is harmless but nevertheless weird. Or while scanning your logs you see that Memcached gave a warning.

Are such considerations worthy of your time?

They most certainly are. More often than not, these oddities are related to your bug, albeit in roundabout and unexpected ways. Remember: If code always followed your expectations, then you wouldn’t have bugs in the first place! As such, it is dangerous to assume that the aberrations you "happen" to notice when on a particular bug hunt are unrelated to the primary problem. Indeed, the fact that you notice these oddities on this particular bug hunt indicates that they inhabit the same time- and code-space as the bug under investigation.

The best course of action when you encounter such a beast is to make a note of it as you go along and consider whether there is any way it may have caused your bug. Even if it didn’t, your written record gives you something to revisit afterward, e.g. for creating separate issues for potential separate bugs.

7. Never Forget that Bugs Are Pack Animals

The word "bug" suggests undeserved individuality for the kinds of errors that typically infect a software system. Bugs are better thought of as families, as pack animals. Whenever you discover one cockroach in your apartment, you can bet there are more of these critters lurking nearby.

You should especially be on the lookout for additional related bugs whenever you’ve just squashed a bug attributable to a past misunderstanding. For example, the numerical types and refunds bug above could only be resolved by realising that floating-point math deviates from the platonic math you learned at school. This new understanding should compel you to revisit all floating-point calculations in your codebase and to check for other failures attributable to your old worldview.

In a similar vein, misunderstandings about your preferred web framework’s API are likely to be replicated everywhere else you access the same part of the API. Do a search for good measure before finishing up your debugging session.

Incidentally, this same kind of reasoning should also alert you to the possibility that whenever you find one malformed data record, there are likely to be others—or perhaps even a corrupt function that’s manufacturing defective records en masse. You ought to systematically check that no other records in the database are similarly afflicted and that no functions are actively polluting your data.
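Such a sweep can be as simple as running one predicate over every record. The sketch below uses plain hashes so that it runs on its own; the `audit_records` name and the `:id` and `:released_on` keys are assumptions for illustration, and in a Rails app you would iterate over the model’s records instead.

```ruby
# Hypothetical audit: given any enumerable of records and a block describing
# the defect, report how many records share it plus a few example ids.
def audit_records(records, sample_size: 5)
  afflicted = records.select { |record| yield(record) }
  { count: afflicted.size, sample_ids: afflicted.first(sample_size).map { |r| r[:id] } }
end

records = [
  { id: 1, released_on: "2015-03-01" },
  { id: 2, released_on: nil },
  { id: 3, released_on: "" },
]

# Defect under investigation: a missing or blank release date.
report = audit_records(records) { |r| r[:released_on].nil? || r[:released_on].empty? }
puts report[:count]
```

A non-zero count tells you the malformed record was not a one-off, and the sample ids give you starting points for tracing which function produced them.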

