Viruses, Errors, and You

Not that I would ever wish this upon anyone, but let’s pretend that you just became one of the unlucky 1.5 million people (at the time of this writing) to be infected with COVID-19. Since this post is primarily about error handling (don’t worry, I’ll get there), let’s think about the errors that may have occurred up to the point when you became sick. In other words: what went wrong, and what mistakes were made, to cause this undesirable end result?

    • Maybe Patient Zero was a jerk who coughed on dozens of other people, accelerating the initial spread.
    • Maybe China’s government bungled the initial response, allowing the virus to leave its borders and/or misleadingly downplaying the threat.
    • Maybe your own government didn’t act quickly enough to enact social distancing and ramp up testing.
    • Maybe testing was ramped up, but there weren’t enough supplies or staff members to process those tests.
    • Maybe one of your coworkers or family members ignored social distancing rules and got you infected.
    • Or maybe none of the above happened, and you just screwed it up for yourself by trying to break the doorknob licking world record.

Naturally, at the top of that list, we could put “COVID-19 came into existence.” And with the exception of that mutation “error,” what all of the above problems have in common is that they are preventable. If the right people take the right actions, at the right times, the number of cases of COVID-19 — perhaps including yours — can be reduced.

By this point, however, containment isn’t practical, and with the long incubation period, it may never have been. (You could argue that, if China had acted aggressively and early enough, COVID-19 could’ve been completely contained. My analogy kind of breaks down on that issue, so I won’t get into it.) In any system that is large or complex enough, you will inevitably have failures — people make mistakes, governments dawdle, our scientific understanding proves insufficient. Members of society at large will get sick, with or without showing any symptoms, and everyone else has to figure out how to deal with that.

What does all of this have to do with Spycursion? One of my (many) quarantine projects has been improving the game’s error handling. Just like a pandemic, any large software project has a massive number of moving parts, any number of which can and will fail. Preventing failures entirely is a lost cause, just like relying on nearly 8 billion humans to follow hygiene guidelines flawlessly is a lost cause.

To hammer this point home, consider a few possible sources of errors we, as the game developers, might have to deal with:

    • Server-side infrastructure problems, such as a database becoming unavailable
    • A game client making invalid API calls
    • A game client making valid API calls that violate game rules — in other words, a player trying to cheat
    • Some poorly-written Slang software making an incorrect function call
    • A player running a piece of Slang software incorrectly, like putting “1 / 0” into a calculator program

You might notice that some of these error sources are entirely out of our control! We cannot control what players will do, but we can make sure that Spycursion still does The Right Thing™ when faced with an unexpected situation. This is the essence of error handling.

But what exactly is The Right Thing™ when it comes to, say, a database failure? What happens when the game server calls login_player and the result is error: database unreachable? We don’t immediately show this message to the player verbatim, partly because it’s meaningless to them, but also because the problem may still be fixable. Ideally, this database will have a backup copy ready to go, so the code can retry the same call against a different database server, and everything is hunky-dory.

If the backup is down too, then the player can’t log in, and we have no choice but to tell them something. So that message, error: database unreachable, bubbles up through a series of constructs, which I won’t go into in detail but which often look something like try { stuff } catch(errors) { other-stuff }. (Common Lisp has its own powerful tools for handling errors, but that’s another post for another time.) In the end, the player receives a less confusing message, like: “Oh crap, something went wrong with Spycursion, but don’t worry, we’re gonna fix it ASAP, promise, please don’t talk smack about us on Twitter, kthxbai.”
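To make the database scenario a bit more concrete, here is a rough sketch in Python (not the Common Lisp the game actually uses, and not Spycursion’s real code). Only the name login_player and the message text come from the description above; the DatabaseUnreachable class, the server dictionaries, and the two helper functions are made up purely for illustration.

```python
class DatabaseUnreachable(Exception):
    """Hypothetical error type for a database server that cannot be reached."""


def login_player(server, username):
    # Stand-in for the real call. Here a "server" is just a dict with an
    # "up" flag, which is an assumption made for this sketch.
    if not server["up"]:
        raise DatabaseUnreachable(f"error: database unreachable ({server['name']})")
    return f"{username} logged in via {server['name']}"


def login_with_failover(servers, username):
    # Try each server in order; the first one that works wins.
    last_error = None
    for server in servers:
        try:
            return login_player(server, username)
        except DatabaseUnreachable as err:
            last_error = err  # remember the failure and try the next server
    # Every server failed, so let the error bubble up to the caller.
    raise DatabaseUnreachable("all database servers are down") from last_error


def handle_login_request(servers, username):
    # Top level: the last place to catch the error before the player sees it.
    try:
        return login_with_failover(servers, username)
    except DatabaseUnreachable:
        # The raw message means nothing to a player, so show a friendlier one.
        return "Something went wrong with Spycursion, but we're working on it."
```

With a dead primary and a live backup, handle_login_request quietly succeeds through the backup. Only when both are down does the player see anything at all, and even then it is the friendly message rather than the raw error.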

The Slang calculator program is an interesting scenario, and somewhat unique to Spycursion, because it deals with player-written code. We do provide (simple) features in Slang to deal with errors and warnings, so a well-written calculator program would throw a “divide by zero” error on its own. But there’s always the possibility of a poorly-written calculator that forces us to deal with that division by zero. In that scenario, we still handle the error, but not necessarily without a penalty to the player who wrote the faulty Slang code. (Don’t say I didn’t warn you…)
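The difference between those two calculators can be sketched in Python as well. To be clear, this is not Slang and not how Spycursion actually runs player code: SlangError, run_player_program, and the “penalized” flag are all hypothetical names, and the penalty itself is only a placeholder for whatever the game decides to do.

```python
class SlangError(Exception):
    """Hypothetical error raised on purpose by well-behaved player code."""


def careful_divide(a, b):
    # A well-written calculator checks its input and reports the problem itself.
    if b == 0:
        raise SlangError("divide by zero")
    return a / b


def sloppy_divide(a, b):
    # A poorly-written calculator just divides and hopes for the best.
    return a / b


def run_player_program(program, *args):
    """Run player-written code; return (result, message, penalized)."""
    try:
        return program(*args), "ok", False
    except SlangError as err:
        # The program handled its own mistake, so there is nothing to penalize.
        return None, f"program error: {err}", False
    except ZeroDivisionError:
        # The program let the error escape, so the host cleans up and may
        # penalize the author (the penalty is an assumption of this sketch).
        return None, "program crashed: divide by zero", True
```

Either way, the host survives “1 / 0”. The only question is whether the program’s author dealt with it first.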

In all cases, a lower-level error bubbles up to higher levels until it is either handled or we give up and send a pleading message to the player. Going back to our unfortunate COVID-19 scenario, perhaps you can see a similar process playing out:

Error: Virus created

Error: Containment failure

Error: Not enough tests

Error: Unable to process tests

Error: Social distancing breach detected

Error: Too many doorknobs licked

Just like a well-designed software system, a well-designed public health system will reduce illness by handling these “errors.” Well, except that last one. There’s still no cure for stupidity.