Software Postmortem

A Walkthrough of a Database Failure

Introduction

A software postmortem, also known as a postmortem analysis or a retrospective, is a review of a software project or system after it has been completed or has experienced a significant failure or incident. Its purpose is to identify the root causes of the failure or incident, learn from the experience, and make improvements that prevent similar issues in the future.

A comprehensive postmortem covers these four parts:

  1. Issue Summary

  2. Timeline

  3. Root Cause and Resolution

  4. Corrective and Preventative Measures

We are going to use a database failure as an example of how to write a good postmortem. Let's get started.

MY DATABASE POSTMORTEM

  1. Issue Summary

On October 15, 2023, from 3:00 PM to 5:00 PM (EST), our user authentication service was down for 2 hours. This affected approximately 80% of our users, preventing them from logging into their accounts. The root cause was a database misconfiguration: the user authentication table was accidentally dropped during routine maintenance.

  2. Timeline

  • 3:00 PM: The issue was detected when customers started complaining about login issues.

  • 3:15 PM: The customer support team escalated the issue to the engineering team.

  • 3:30 PM: The engineering team started investigating the issue, initially suspecting a code bug.

  • 4:00 PM: After ruling out a code bug, the team started investigating potential issues with the database.

  • 4:30 PM: The team identified a misconfiguration in the database as the root cause.

  • 5:00 PM: The dropped table was restored from a recent backup and the database was reconfigured correctly, resolving the incident.
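
A timeline is easier to act on once the gaps between events are measured. The snippet below is a minimal sketch that computes those gaps from the entries above; the event labels and the helper names (`minutes_between`, `phase_durations`) are made up for illustration and are not part of any standard postmortem tooling.

```python
from datetime import datetime

# Timeline entries copied from the list above (times in 24-hour form).
# The short labels are our own shorthand for each entry.
TIMELINE = [
    ("15:00", "detected"),
    ("15:15", "escalated"),
    ("15:30", "investigation started"),
    ("16:00", "code bug ruled out"),
    ("16:30", "root cause identified"),
    ("17:00", "resolved"),
]


def minutes_between(start, end):
    """Minutes elapsed between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(timeline):
    """Return (from_label, to_label, minutes) for each consecutive pair of events."""
    return [
        (a_label, b_label, minutes_between(a_time, b_time))
        for (a_time, a_label), (b_time, b_label) in zip(timeline, timeline[1:])
    ]


total_downtime = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
time_to_root_cause = minutes_between(TIMELINE[0][0], TIMELINE[4][0])

for start, end, minutes in phase_durations(TIMELINE):
    print(f"{start} -> {end}: {minutes} min")
print(f"Time to root cause: {time_to_root_cause} min")
print(f"Total downtime: {total_downtime} min")
```

For this incident, 90 of the 120 minutes of downtime passed before the root cause was identified, which suggests that faster diagnosis is where most of the time could be saved.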

  3. Root Cause and Resolution

The issue was caused by a misconfiguration in our database. Specifically, the user authentication table was accidentally dropped during routine maintenance, causing all authentication requests to fail.

The issue was resolved by restoring the dropped table from a recent backup and reconfiguring the database correctly. This immediately restored the user authentication service.
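
To make the failure and the fix concrete, here is a small sketch that reproduces both in miniature. It assumes an in-memory SQLite database as a stand-in for the real one (the postmortem does not name the database system), and the `users` table, its columns, and the helper names are hypothetical.

```python
import sqlite3


def make_auth_db():
    """Create a toy authentication database with one user (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (username TEXT PRIMARY KEY, password_hash TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "hash-placeholder"))
    conn.commit()
    return conn


def can_authenticate(conn, username):
    """Return True if the user can be looked up; False if the lookup fails."""
    try:
        rows = conn.execute(
            "SELECT 1 FROM users WHERE username = ?", (username,)
        ).fetchall()
    except sqlite3.OperationalError:
        # Raised when the table is missing ("no such table: users").
        return False
    return len(rows) > 0


live = make_auth_db()

# Take a backup before maintenance, using SQLite's online backup API.
backup = sqlite3.connect(":memory:")
live.backup(backup)

before = can_authenticate(live, "alice")

# The accident: the table is dropped during maintenance.
live.execute("DROP TABLE users")
live.commit()
during = can_authenticate(live, "alice")

# The fix: copy the backup over the live database.
backup.backup(live)
after = can_authenticate(live, "alice")

print(before, during, after)
```

A production restore would go through the database's own backup tooling, but the shape is the same: every lookup fails while the table is missing, and lookups succeed again once the backup is restored.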

  4. Corrective and Preventative Measures

To prevent similar incidents in the future, we need to improve our database management and maintenance procedures. Specific tasks include:

  • Restoring the dropped table from a backup (the immediate fix, completed during the incident).

  • Developing a comprehensive incident response plan, including clear roles and responsibilities for each team member.

  • Implementing stricter access controls to prevent accidental modifications to the database.

  • Training the engineering team on proper database maintenance procedures.
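
As one illustration of what stricter access controls could look like, the sketch below blocks `DROP TABLE` on a connection. It again assumes SQLite as a stand-in and uses the `set_authorizer` hook from Python's `sqlite3` module; the `users` table and the callback name `deny_table_drops` are hypothetical. On a server database the equivalent would more likely be role-based permissions that withhold drop privileges from the accounts used for routine maintenance.

```python
import sqlite3


def deny_table_drops(action, arg1, arg2, db_name, source):
    """Authorizer callback: refuse DROP TABLE, allow everything else."""
    if action == sqlite3.SQLITE_DROP_TABLE:
        return sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_OK


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")

# From here on, this connection cannot drop tables.
conn.set_authorizer(deny_table_drops)

try:
    conn.execute("DROP TABLE users")
    drop_blocked = False
except sqlite3.DatabaseError:
    # SQLite rejects the statement as not authorized.
    drop_blocked = True

table_still_exists = bool(
    conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'users'"
    ).fetchall()
)

print(drop_blocked, table_still_exists)
```

The point of the design is that a destructive statement fails loudly at the moment it is issued, rather than succeeding and being discovered later through customer complaints.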