Search Cluster Issue
Incident Report for SimpleReach
Postmortem

Customer Impact

Beginning around 1700 ET on Monday 8/1, all queries running through our search cluster began to slow down. Within about five more minutes we lost one of the nodes, and shortly thereafter the rest of them went down as well. This caused the dashboards, tweets, and certain API queries to become unavailable. Everything was operational again by 1755 ET on Monday 8/1.

Incident Description

We received our first notification of dashboards being slow to load at 1701 ET on Monday 8/1. An overly complex query on the search cluster was consuming far more memory than normal as the nodes attempted to service it. At roughly 1715, we lost one of the nodes in the cluster to an out-of-memory (OOM) error. The remaining machines in the search cluster were unable to absorb the rest of the query load and subsequently became unresponsive.
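
For illustration only, here is a minimal sketch of the kind of pre-flight memory check that could defer a heavy query before it pushes nodes toward OOM, assuming the search cluster exposes an Elasticsearch-style stats API over HTTP. The cluster URL, threshold, and field names below are assumptions for the example, not a description of our production setup.

    # Hypothetical pre-flight guard: defer a heavy query when any search node's
    # JVM heap is already under pressure. Assumes an Elasticsearch-style stats
    # API; the URL and threshold are illustrative.
    import requests

    SEARCH_CLUSTER = "http://search-cluster.internal:9200"  # assumed URL
    HEAP_PERCENT_LIMIT = 85                                  # assumed threshold

    def cluster_has_headroom(base_url=SEARCH_CLUSTER, limit=HEAP_PERCENT_LIMIT):
        """Return True only if every node's heap usage is below the limit."""
        stats = requests.get(f"{base_url}/_nodes/stats/jvm", timeout=5).json()
        for node_id, node in stats["nodes"].items():
            heap_used = node["jvm"]["mem"]["heap_used_percent"]
            if heap_used >= limit:
                print(f"node {node_id} is at {heap_used}% heap; deferring query")
                return False
        return True

    if __name__ == "__main__":
        if cluster_has_headroom():
            print("cluster has headroom; safe to dispatch the heavy query")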

Steps Taken To Resolution

At 1701, once the first alert was received, we immediately shut down all extraneous queries to our search cluster. While things initially became more responsive, the cluster was evidently still degrading. At 1715, the first machine left the cluster with an OOM error. We immediately restarted search on that machine. The other instances in the search cluster took considerably longer to recover; it wasn't until roughly 1730 that we started to see some of them become responsive again.

At 1730, we began a rolling restart of the entire search cluster to clear out any remnants of the problematic query. Nearly all of the machines in the cluster were back online by 1750, except for the initial problem node. We restarted that node one more time and it eventually came back online. Once we saw no further slow-running queries in the execution logs, we allowed the rest of the production load back onto the search cluster.
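
A rolling restart of this kind can be sketched as follows: restart one node at a time and wait for the cluster to report healthy before touching the next. This assumes an Elasticsearch-style /_cluster/health endpoint; the host names and the restart command are placeholders, not our actual tooling.

    # Rolling-restart sketch. Assumes an Elasticsearch-style /_cluster/health
    # endpoint; host names and the restart command are placeholders.
    import subprocess
    import time

    import requests

    SEARCH_CLUSTER = "http://search-cluster.internal:9200"  # assumed URL
    SEARCH_NODES = ["search-01", "search-02", "search-03"]   # illustrative hosts

    def wait_for_green(base_url, timeout_s=600):
        """Poll cluster health until it reports 'green' or the timeout expires."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            health = requests.get(f"{base_url}/_cluster/health", timeout=5).json()
            if health.get("status") == "green":
                return
            time.sleep(10)
        raise TimeoutError("cluster did not return to green in time")

    def rolling_restart():
        for host in SEARCH_NODES:
            # Placeholder restart command; substitute the real service name.
            subprocess.run(["ssh", host, "sudo systemctl restart search"], check=True)
            wait_for_green(SEARCH_CLUSTER)

    if __name__ == "__main__":
        rolling_restart()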

Scope

The scope of the problem was limited to application (web, API, and reporting) availability. From 1700 to 1730, these were available only in a limited fashion, depending on the specific page or endpoint being requested. After 1730, nearly all pages and API endpoints were available, and by 1740 everything except the one problem node was fully operational. There was no loss of data at any point during this incident.

Lessons Learned

The following outlines the major takeaways from this incident:

  • Ensure query execution times have strict monitoring and alerting thresholds around them (see the sketch below).
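
As a rough illustration of that takeaway, the sketch below reads cumulative per-node search counters and raises an alert when average query latency crosses a threshold. It assumes an Elasticsearch-style _nodes/stats API; the threshold and the alerting hook are placeholders. A production check would track the change in these counters between samples rather than the lifetime average.

    # Query-latency check sketch. Assumes an Elasticsearch-style _nodes/stats
    # API; the threshold and alert hook are illustrative.
    import requests

    SEARCH_CLUSTER = "http://search-cluster.internal:9200"  # assumed URL
    AVG_QUERY_MS_THRESHOLD = 500                             # assumed threshold

    def average_query_latency_ms(base_url=SEARCH_CLUSTER):
        """Average search latency across all nodes, from cumulative counters."""
        stats = requests.get(f"{base_url}/_nodes/stats/indices", timeout=5).json()
        total_ms = 0
        total_queries = 0
        for node in stats["nodes"].values():
            search = node["indices"]["search"]
            total_ms += search["query_time_in_millis"]
            total_queries += search["query_total"]
        return total_ms / total_queries if total_queries else 0.0

    def check_and_alert():
        latency = average_query_latency_ms()
        if latency > AVG_QUERY_MS_THRESHOLD:
            # Placeholder for the real alerting integration (pager, chat, email).
            print(f"ALERT: average search latency {latency:.1f} ms exceeds "
                  f"{AVG_QUERY_MS_THRESHOLD} ms")

    if __name__ == "__main__":
        check_and_alert()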

We know that you rely on SimpleReach for accurate analytics and apologize for any inconvenience this has caused. We will continue to work hard to reduce the impact of incidents like this in the future.

Posted Aug 01, 2016 - 18:26 EDT

Resolved
All operations have returned to normal.
Posted Aug 01, 2016 - 18:14 EDT
Monitoring
The search cluster is back online. We are looking into what happened.
Posted Aug 01, 2016 - 17:57 EDT
Investigating
There is an issue with the search cluster which is preventing the site from loading. Data collection is unaffected. We will update as we get more information.
Posted Aug 01, 2016 - 17:32 EDT