The Art of Failing Gracefully

Solomon Hawk, Senior Developer

Article Categories: #Code, #Front-end Engineering, #Back-end Engineering, #User Experience, #Accessibility, #Security, #Performance, #UI Design

Posted on August 16, 2022

What happens when things fall apart is a critical part of users’ journeys with software. It’s often an inflection point that deeply influences their perception of whether their time is being wasted. Not all failures can be anticipated, and how we architect our applications, from systems to interfaces, must reflect that truth. We have a responsibility to make a good-faith effort to give users the best possible experience when they’re dealing with issues that arise, whether anticipated or not.

Error handling gets complicated quickly, and a systematic approach is crucial for building resilient applications. It’s tempting to prioritize the Happy Path while deferring work on failure modes, but when time and budget are limited, that deferred work may never happen. You run the risk of ending up with an application that treats errors imprecisely, alienates users, and frustrates customer support and engineers.

Prioritizing handling failures gracefully also requires buy-in from project team members across disciplines. Design and UX experts need to account for an application’s failure modes in order to support a holistic error handling approach that gives users the best possible experience even when things don’t go as planned.

Here are some of the things I try to be mindful of when architecting applications to handle errors gracefully.

Every error has 2 sides.

The user’s experience, which involves interacting with your application’s UI
The engineer’s experience, which involves observing, investigating, and trying to understand the failure

We’ve all encountered a “Sorry, it broke!” page, maybe even built one or two. Unless something completely unexpected occurred (e.g., HTTP 500), this kind of result is imprecise and frustrating. It fails to communicate the “why” and the “what next” which we owe our users, assuming we respect them, their time, and their choice to use our products.

On the other side, engineers need sufficient information to understand what went wrong, why, and how to reproduce the problem in order to fix it. Poor observability leads to gaps in coverage, which compromise our ability to understand the failure modes of our applications. Insufficient metadata or reproducibility undermines our ability to determine appropriate steps to mitigate the problem.

Not all errors can be anticipated.

Errors come in one of two flavors:

expected errors (that are recoverable), such as authentication/authorization errors, validation errors
unexpected errors (that are unrecoverable), such as rendering errors, or service availability errors

For recoverable errors, we need to let the user know what went wrong and give them options for how to proceed.

Help the user understand why the task they were trying to do failed and what steps they need to take in order to avoid additional failures
If there’s a context-specific action the user can take in order to recover from the failure state, ideally we should provide it to them (e.g., if a component fails to load data, provide a button to “try again” or instruct a user to try reloading the application)
If there isn’t, we should have a fallback strategy for resolving the issue (such as retrying automatically) or hiding the component that is in a compromised state
Example: validation errors when saving a form, or a calendar or map widget that fails to load data
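
One way to make the "try again" and "hide the component" options concrete is a small state machine for a data-loading widget. This is a minimal sketch, not a prescribed design: the state names, the event names, and the limit of three attempts are all assumptions made for illustration.

```typescript
// Hypothetical state model for a widget that loads data (all names are illustrative).
type WidgetState =
  | { status: "loading"; attempt: number }
  | { status: "ready" }
  | { status: "failed"; attempt: number } // the UI shows a "Try again" button
  | { status: "hidden" };                 // we gave up and removed the widget

type WidgetEvent = "load_succeeded" | "load_failed" | "retry_clicked";

const MAX_ATTEMPTS = 3; // assumed limit, tune for your application

function nextState(state: WidgetState, event: WidgetEvent): WidgetState {
  if (state.status === "loading" && event === "load_succeeded") {
    return { status: "ready" };
  }
  if (state.status === "loading" && event === "load_failed") {
    // After the last allowed attempt, fall back to hiding the widget.
    return state.attempt >= MAX_ATTEMPTS
      ? { status: "hidden" }
      : { status: "failed", attempt: state.attempt };
  }
  if (state.status === "failed" && event === "retry_clicked") {
    return { status: "loading", attempt: state.attempt + 1 };
  }
  return state; // ignore events that do not apply to the current state
}
```

A view layer would render the retry button only in the "failed" state and render nothing in the "hidden" state, so the rest of the page stays usable.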

For unrecoverable errors, we should apologize and explain that something failed, and ideally give users some assurance that we are aware of the failure and are taking appropriate steps to resolve the issue.

It’s our responsibility to disclose that there was an unexpected problem, and give the user some options for how to proceed (such as waiting and trying again later, filing a bug report, or contacting support directly)
Due to their nature, there is often no clear path to get back to a known, working state
Where appropriate, we should provide users with visibility into the health of our systems and the status of triage (whether through a hosted service such as https://status.io/ or a self-hosted status page) and publish post-mortems explaining how we are adapting and evolving our software to become more resilient over time
Example: the server returns a 500 Internal Server Error, because a critical service could not be communicated with
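
The two flavors described above could be modeled as a tagged union, so the code that chooses what to show the user has to handle both cases. This is only a sketch: the type, its fields, and the wording of the messages are hypothetical and not taken from any library.

```typescript
// Hypothetical error model: names and fields are illustrative.
type AppError =
  | {
      kind: "expected";
      code: "unauthenticated" | "forbidden" | "invalid_input";
      userMessage: string; // what went wrong, in the user's terms
      nextStep: string;    // what the user can do about it
    }
  | { kind: "unexpected"; cause: unknown };

// Choose the text shown to the user for either flavor of error.
function describeForUser(error: AppError): string {
  switch (error.kind) {
    case "expected":
      // Recoverable: explain the "why" and the "what next".
      return `${error.userMessage} ${error.nextStep}`;
    case "unexpected":
      // Unrecoverable: apologize and offer options without leaking internals.
      return "Sorry, something went wrong on our end. Please try again later or contact support.";
  }
}
```

The `cause` of an unexpected error never reaches the user here; it is meant for logs and error monitoring.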

Most applications share a similar set of common, known failure modes.

Not all software will need to account for every one of these, and some software will have additional failure modes to consider. However, these are common enough that we can leverage well-established best practices for addressing them.

Authentication

💡 A user tries to take an action or access a page and we need to identify them in order to respond, but they aren’t logged in.

This kind of failure is fairly well understood and most applications take a standard approach: return an HTTP 401 Unauthorized response (which, despite its name, signals missing or invalid authentication) or redirect the user to a login page. Ideally, after logging in, the user is returned to the place they were trying to go or the action is re-attempted.
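
Returning the user to where they were going usually means carrying the original destination through the login flow. A minimal sketch, assuming a `/login` route and a `returnTo` query parameter (both hypothetical names):

```typescript
// Hypothetical helper: "/login" and "returnTo" are assumed names.
function loginUrlFor(requestedPath: string): string {
  // Accept only same-site paths ("/something"), so the parameter cannot be
  // abused to send users to another origin after they log in.
  const isLocalPath = /^\/(?![\/\\])/.test(requestedPath);
  const returnTo = isLocalPath ? requestedPath : "/";
  return `/login?returnTo=${encodeURIComponent(returnTo)}`;
}

// loginUrlFor("/reports/42?tab=summary")
//   → "/login?returnTo=%2Freports%2F42%3Ftab%3Dsummary"
```

The login handler should apply the same check again when it reads `returnTo`, because the query string is user-controlled.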

Authorization

💡 A user attempts to take an action or visit a page that requires permissions that they do not have.

Many systems have multiple user types with limited permissions. When architected properly, this kind of failure shouldn’t happen often and the standard approach is sufficient. Our applications have enough context to avoid surfacing UI that we know an authenticated user does not have permission to interact with.

The standard approach includes returning an HTTP 403 Forbidden response and/or displaying an “Access denied” page. For some applications a better solution may involve helping a user request additional permissions from an administrator.

If your application exposes UI to users that frequently results in 403 responses, consider employing more granular role or permission checking to prevent those scenarios from arising in the first place.
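
More granular permission checking can be as simple as asking "can this user do X?" before rendering a control. The roles, permissions, and function names below are hypothetical; they sketch the idea rather than any particular authorization library.

```typescript
// Hypothetical roles and permissions, for illustration only.
type Permission = "report:read" | "report:delete" | "user:invite";
type Role = "viewer" | "editor" | "admin";

const ROLE_PERMISSIONS: Record<Role, readonly Permission[]> = {
  viewer: ["report:read"],
  editor: ["report:read", "report:delete"],
  admin: ["report:read", "report:delete", "user:invite"],
};

interface User {
  name: string;
  roles: Role[];
}

// True if any of the user's roles grants the permission.
function can(user: User, permission: Permission): boolean {
  return user.roles.some((role) => ROLE_PERMISSIONS[role].indexOf(permission) !== -1);
}

// Only surface the actions this user is allowed to perform.
function visibleActions(user: User, actions: Permission[]): Permission[] {
  return actions.filter((action) => can(user, action));
}
```

Hiding a control is a courtesy to the user, not a security boundary: the server still has to enforce the same rule and answer with a 403 when it is violated.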

Validation

💡 Incomplete or invalid user-supplied data (often when interacting with forms).

In order to build reliable, resilient applications we must take care to maintain the integrity of our data, which requires us to validate user input before creating or updating records. Validation errors are quite common and handling them gracefully is fairly well understood. Ideally we can give users direct, actionable guidance on what changes they need to make in order for their request to succeed. WebAIM provides some advice for accessibility considerations. The WCAG has some guidance on handling user input in forms which includes some helpful suggestions such as:

Be forgiving of different input formats. Accept multiple common formats for fields such as telephone numbers, dates or times. Do not use form field types that are overly restrictive (such as number inputs for zip codes).
Require user confirmation for irreversible actions. For example, when a user makes a request to delete something, require them to confirm that action before committing the permanent side effect.
Provide undo functionality. Instead of immediately deleting records, consider soft-deletion where records are marked as deleted but can be later recovered if necessary.
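
As an illustration of being forgiving about input formats, the sketch below normalizes a phone number before validating it and returns an actionable message when it cannot. It assumes 10-digit North American numbers, which is an assumption made for the example; the function and type names are hypothetical.

```typescript
type FieldResult =
  | { ok: true; value: string }
  | { ok: false; message: string };

// Assumes 10-digit North American numbers; adjust the rule for your own domain.
function parsePhone(input: string): FieldResult {
  // Drop spaces, dots, dashes, parentheses, and a leading "+".
  const digits = input.replace(/\D/g, "");
  // Accept an optional leading country code of 1.
  const national =
    digits.length === 11 && digits.charAt(0) === "1" ? digits.slice(1) : digits;
  if (national.length !== 10) {
    return { ok: false, message: "Enter a 10-digit phone number, for example 555-010-1234." };
  }
  return { ok: true, value: national };
}
```

With this approach "(555) 010-1234", "555.010.1234", and "+1 555 010 1234" are all accepted and stored in one canonical form, while a number that is too short gets a message that says what to change.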

Rendering

💡 While preparing a response to a request, the application fails to render the response.

These (often fatal) bugs can be caused by an incorrect assumption about the shape of data, failure to load data, or even a typo in rendering code. There are a number of decisions that can help mitigate the risk of these kinds of problems arising:

Strongly typed programming languages give us some compile-time validation that we haven’t violated any contracts when accessing data but are only as smart as the static analysis allows. When interfacing with external systems, there’s still a risk that the types we define are incorrect or become invalidated by a future change.
Defensive programming techniques can give our rendering code some flexibility in handling unanticipated changes at the cost of complexity. Conditional property access, presence or empty checking and fallbacks or defaults may help us avoid errors but what to do when data is missing can be less clear. Is it better to show a UI with missing or placeholder data than to show users an error? The answer to this question is likely specific to your application and domain.
Framework-supported features can give us alternate, powerful ways of accounting for unanticipated problems with rendering. Good examples are Error Boundaries in React and SolidJS and the errorCaptured hook in Vue. This kind of feature allows us to handle errors in component hierarchies in an exception-like way and declare fallback UIs to display if the framework encounters a rendering error.
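
The first two points can be combined: static types describe the shape we expect, and a runtime check confirms that data from an external system actually has that shape before rendering code touches it. A minimal sketch, with a hypothetical `Profile` payload and a placeholder fallback:

```typescript
// Hypothetical shape of an API payload for a profile card.
interface Profile {
  name: string;
  avatarUrl?: string;
}

// Runtime check: static types alone cannot vouch for data from another system.
function isProfile(data: unknown): data is Profile {
  if (typeof data !== "object" || data === null) return false;
  const record = data as Record<string, unknown>;
  const avatarUrl = record["avatarUrl"];
  return (
    typeof record["name"] === "string" &&
    (avatarUrl === undefined || typeof avatarUrl === "string")
  );
}

// Fall back to a placeholder instead of throwing while rendering.
function profileLabel(data: unknown): string {
  return isProfile(data) ? data.name : "Unknown user";
}
```

Whether a placeholder like "Unknown user" is better than an explicit error state is exactly the application-specific question raised above; the guard only makes that decision explicit.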

Service Availability

💡 A service our software depends on is temporarily unavailable.

Many kinds of software, especially on the web, rely on other services for things like authentication, data persistence, background processing, sending emails, push notifications, and many other features. If these services are critical, like authentication, then a service interruption can mean your application is temporarily unusable. For other things like sending emails, a service interruption doesn’t necessarily mean your application is unusable. In scenarios like this you may be able to enqueue the email to be retried at a later time and notify your users of the delay.
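
The "enqueue and retry later" idea might look like the in-memory outbox below. It is a sketch under loud assumptions: the `Transport` function is hypothetical and is kept synchronous for brevity (a real email client would be asynchronous), and a production queue would be persisted rather than held in memory.

```typescript
interface Email {
  to: string;
  subject: string;
}

// Hypothetical transport: throws when the email service is unavailable.
type Transport = (email: Email) => void;

class Outbox {
  private pending: Email[] = [];

  // Try to send now; if the service is down, keep the email for later.
  send(email: Email, transport: Transport): "sent" | "queued" {
    try {
      transport(email);
      return "sent";
    } catch {
      this.pending.push(email);
      return "queued";
    }
  }

  // Retry everything that is queued; whatever still fails stays queued.
  flush(transport: Transport): number {
    const retrying = this.pending;
    this.pending = [];
    let delivered = 0;
    for (const email of retrying) {
      if (this.send(email, transport) === "sent") delivered++;
    }
    return delivered;
  }

  get size(): number {
    return this.pending.length;
  }
}
```

The `"queued"` return value is the hook for the user-facing half of the story: it lets the caller tell the user that the email is delayed rather than pretending it was sent.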

Observability is critical in triaging and fixing problems.

Without tools to monitor and capture issues, investigating problems and fixing them is difficult or impossible. This is especially true in distributed systems (such as microservices-oriented architectures).

Error monitoring:

fixing problems begins with knowing when they happen, and services like Sentry are a key component of building resilient systems
it’s important to include sufficient metadata alongside errors when sending them to an error monitoring service so that engineers can better understand the context in which the error occurred
capturing application state at the time of an error can be challenging in distributed systems or microservices-oriented architectures, requiring more sophisticated methods of logging and stitching together session information (events tagged with transaction IDs can help connect the states of different systems)
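
To make "sufficient metadata" and "events tagged with transaction IDs" concrete, here is a sketch of a report built before handing an error to a monitoring service. The report shape and function name are hypothetical; a real service such as Sentry has its own SDK and schema, which this does not attempt to reproduce.

```typescript
// Hypothetical report shape; a monitoring service will define its own schema.
interface ErrorReport {
  message: string;
  stack: string | undefined;
  transactionId: string;           // shared by every service handling the request
  context: Record<string, string>; // for example route, user id, release
  occurredAt: string;              // ISO 8601 timestamp
}

function buildErrorReport(
  thrown: unknown,
  transactionId: string,
  context: Record<string, string>,
  now: Date = new Date()
): ErrorReport {
  // Anything can be thrown in JavaScript, so normalize before reading fields.
  const error = thrown instanceof Error ? thrown : new Error(String(thrown));
  return {
    message: error.message,
    stack: error.stack,
    transactionId,
    context,
    occurredAt: now.toISOString(),
  };
}
```

If every service attaches the same `transactionId` to its logs and reports, an engineer can search for that one value and see the request's path across systems.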

Replaying user sessions:

services like LogRocket or Datadog give us the ability to replay user sessions and experience first-hand what the application’s behavior was when something went wrong
session replay helps build empathy, and it can also be a crucial tool for analyzing the cause-effect chain in order to understand why something broke

Avoiding regressions:

when a bug is found and fixed, it’s important to add tests that cover that specific scenario if possible
associating that test with a reported bug (by linking it to a ticket number, for example) helps communicate its purpose and importance to future engineers
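
A regression test tied to its bug report might look like the sketch below. Everything here is invented for illustration: the `clampPage` function, the bug it guards against, and the ticket number TICKET-123. It uses a plain function that throws instead of a particular test framework.

```typescript
// Function under test: after the fix, out-of-range pages are clamped
// to the nearest valid page instead of producing an empty result.
function clampPage(requested: number, totalPages: number): number {
  if (totalPages < 1) return 1;
  return Math.min(Math.max(Math.floor(requested), 1), totalPages);
}

// Regression test for TICKET-123 (hypothetical ticket number):
// requesting a page past the end used to render an empty list.
function testTicket123PageBeyondLastIsClamped(): void {
  const actual = clampPage(99, 5);
  if (actual !== 5) {
    throw new Error(`TICKET-123 regressed: expected page 5, got ${actual}`);
  }
}

testTicket123PageBeyondLastIsClamped();
```

Putting the ticket number in both the test name and the failure message means a future engineer who breaks the test can find the original report without digging through history.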

Final Thoughts

There are many considerations that impact our ability to deliver applications that handle errors gracefully. The responsibility to account for these failure scenarios is distributed across project teams and product owners.

Engineers are well-positioned to navigate the technical decisions that affect the level of risk of some of these scenarios occurring and to employ techniques at the system level which facilitate debugging.

Design, UX, and product are well-positioned to account for users’ experiences by providing domain-specific workflows in failure scenarios that give users easy remediation or escalation while communicating transparently and respectfully.

Ultimately, all of these things need to come together for an application to deliver a great user experience even when things fall apart, which is often a crucial inflection point in users’ journeys. Handled poorly, these situations alienate and frustrate users; handled well, they build loyalty and respect.

Stay tuned for more technical deep dives related to handling errors in React applications.
