Beginning around 1700 ET on Monday 8/1, all queries running through our search cluster began to slow down. Within about fifteen minutes, we lost one of the nodes, and shortly thereafter the rest of them died. This made dashboards, tweets, and certain API queries unavailable. Everything was operational again by 1755 ET on Monday 8/1.
We received our first notification of dashboards being slow to load at 1701 ET on Monday 8/1. An excessively complex query on the search cluster was consuming an outsized amount of memory as the cluster attempted to service it. About fifteen minutes later, we lost one of the nodes in the cluster to an out-of-memory (OOM) error. The remaining machines in the search cluster could not absorb the rest of the query and subsequently became unresponsive.
At 1701, once the first alert was received, we immediately shut down all extraneous queries to the search cluster. While things initially became more responsive, the cluster was still mid-downturn. At 1715, the first machine left the cluster with an OOM error. We immediately restarted search on that machine. The other instances in the search cluster took much longer to recover; it wasn't until roughly 1730 that we started to see some of the machines become responsive again.
At 1730, we began a rolling restart of the entire search cluster, hoping to remove any remnant of the problematic query. Nearly all of the machines in the cluster were back online by 1750, except the initial problem node. We restarted that node once more, and it eventually came back online. Once the execution logs showed no further slow-running queries, we allowed the rest of the production load back onto the search cluster.
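For readers unfamiliar with the technique, a rolling restart cycles nodes one at a time, waiting for each to report healthy before touching the next, so the cluster as a whole stays up. The sketch below is a generic illustration, not our actual tooling: the node names and the `restart_node` / `is_healthy` helpers are hypothetical stand-ins for whatever restart command and health check a given cluster exposes.

```python
import time


def rolling_restart(nodes, restart_node, is_healthy,
                    poll_interval=5, timeout=300, sleep=time.sleep):
    """Restart nodes one at a time, waiting for each to become
    healthy before moving on, so the cluster stays available."""
    for node in nodes:
        restart_node(node)
        waited = 0
        # Poll the node's health check until it recovers or we give up.
        while not is_healthy(node):
            if waited >= timeout:
                raise TimeoutError(f"{node} did not recover within {timeout}s")
            sleep(poll_interval)
            waited += poll_interval


# Example with stub helpers; real ones would invoke the service's
# restart command and hit its health endpoint.
restarted = []
rolling_restart(["search-1", "search-2", "search-3"],
                restart_node=restarted.append,
                is_healthy=lambda node: True)
```

The key property is that only one node is ever down at a time; the timeout guards against looping forever on a node that, like our problem node, needs manual attention.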
The scope of the problem was limited to application (web, API, and reporting) availability. Between 1700 and 1730, these were available only in a limited fashion, depending on the specific page or endpoint requested. After 1730, nearly all pages and API endpoints were available. By 1740, everything but the one problem node was fully operational. There was no loss of data at any point during this incident.
The following outlines the major takeaways from this incident:
We know that you rely on SimpleReach for accurate analytics and apologize for any inconvenience this has caused. We will continue to work hard to reduce the impact of incidents like this in the future.