DEV Community

Tony

Postmortem

Issue Summary:

  • Duration of the outage: May 7, 2023, 8:15 AM EDT to 9:00 AM EDT (45 minutes).
  • Impact: The second web server was completely unavailable. 100% of the users trying to access it were unable to do so for the duration of the outage.
  • Root cause: A misconfigured firewall rule was blocking incoming traffic to the second web server.

Timeline:

  • May 7, 2023, at 8:15 AM EDT: A monitoring alert showed that the second server was not responding to requests.
  • Actions taken: The team first investigated the second server's logs and found no issues with the server itself. They then checked the load balancer logs and saw that requests were not reaching the second server.
  • Misleading investigation/debugging paths: The team initially suspected a fault on the second server and spent time checking its logs. This was a dead end because the problem was not with the server but with incoming traffic being blocked before it arrived.
  • Escalation: The incident was escalated to the network engineering team to investigate the network configuration.
  • Resolution: The network engineering team identified the misconfigured firewall rule that was blocking incoming traffic to the second server and corrected it. Service was restored by 9:00 AM EDT.
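A quick connectivity probe can help separate "the server is unhealthy" from "traffic never reaches the server" early in an investigation like the one above. The following is a minimal sketch, not the tooling the team used. The function names, the hint wording, and the hostname in the usage note are all assumptions made for illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer within the timeout.
        return "timeout"
    except OSError:
        # Anything else: no route, DNS failure, and so on.
        return "error"


# Rough interpretation of each outcome. These are heuristics only,
# not a definitive diagnosis.
HINTS = {
    "open": "connection accepted; the network path and the listener both work",
    "refused": "host reachable but connection rejected; check the service on the server",
    "timeout": "no reply; packets may be dropped by a firewall, or the host is down",
    "error": "other network error, for example no route or a DNS failure",
}


def diagnose(host: str, port: int, timeout: float = 3.0) -> str:
    """Return a one-line, human-readable summary of a single probe."""
    result = probe_tcp(host, port, timeout)
    return f"{host}:{port} -> {result}: {HINTS[result]}"
```

Running something like `print(diagnose("web-02.example.internal", 80))` from the load balancer host (the hostname is hypothetical) gives a first hint. A "timeout" while the server's own logs look clean points toward the network path rather than the server, although a firewall that actively rejects traffic can also produce "refused".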

Root cause and resolution:

  • Root cause: A misconfigured firewall rule was blocking incoming traffic to the second server. The server itself was healthy, which is why its logs showed no errors.
  • Resolution: The network engineering team identified the misconfigured rule and corrected it, which restored incoming traffic to the server.

Corrective and preventative measures:

  • To prevent similar issues, the team will carry out regular network configuration reviews to verify that firewall rules are correctly configured.
  • Tasks to address the issue:
    • Review firewall rules for all servers to ensure that they are correctly configured.
    • Implement additional monitoring of incoming traffic to detect further misconfigurations or anomalies.
    • Review incident response procedures so that misconfigurations are identified and resolved quickly.
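Part of a firewall rule review can be automated by checking a rule set against the traffic flows that must always be allowed. The sketch below assumes a simplified rule model: rules are evaluated in order, the first match wins, and unmatched traffic is denied. Real firewalls differ in their semantics, and the `Rule` class, the function names, and every address shown are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional, Tuple

Flow = Tuple[str, str, int]  # (source IP, destination IP, destination port)


@dataclass(frozen=True)
class Rule:
    action: str            # "allow" or "deny"
    source: str            # source network in CIDR notation
    dest: str              # destination network in CIDR notation
    port: Optional[int]    # destination port, or None for any port


def first_match(rules: List[Rule], src: str, dst: str, port: int,
                default: str = "deny") -> str:
    """Return the action of the first rule matching the flow (assumed semantics)."""
    for rule in rules:
        if (ip_address(src) in ip_network(rule.source)
                and ip_address(dst) in ip_network(rule.dest)
                and rule.port in (None, port)):
            return rule.action
    return default


def blocked_flows(rules: List[Rule], required: List[Flow]) -> List[Flow]:
    """List the required flows that the rule set would not allow."""
    return [flow for flow in required if first_match(rules, *flow) != "allow"]


# Hypothetical example: a load balancer at 10.0.0.5 must reach two web servers.
rules = [
    Rule("deny", "0.0.0.0/0", "10.0.0.12/32", 80),    # overly broad deny
    Rule("allow", "10.0.0.5/32", "10.0.0.0/24", 80),
]
required = [("10.0.0.5", "10.0.0.11", 80), ("10.0.0.5", "10.0.0.12", 80)]
print(blocked_flows(rules, required))
```

With these made-up rules, the check reports the flow to 10.0.0.12 as blocked, because the broad deny rule matches before the allow rule is reached. Running a check like this on every rule change, or on a schedule, is one way to catch a misconfiguration before users do.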
