20241119 – Essential Email (Recent Stability Review)
On 13th November at 04:00 GMT, we encountered a hardware failure that caused IMAP and Webmail services to fail on one of the mailstores.
The failure, combined with the normal inbound flow of requests, caused an unexpected increase in the number of active requests, which degraded performance on the surviving members of the cluster. The net result was a degraded service level that did not recover on its own after the failover, as it was expected to.
As a result, a subset of users were unable to access their mailboxes during the ongoing maintenance.
As peak hours approached, the growing number of active requests compounded the issue and increased the load on other platform components. This disrupted normal access to IMAP and Webmail services.
Root Causes Found
The cause was a hardware failure, compounded by increased workload demands on the surviving network elements.
Solution
To recover from the peak backlog, we had to throttle connections and re-enable service gradually to stabilise the cluster.
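For readers who want a concrete picture of throttling followed by a gradual re-enable, here is a minimal sketch. It is an illustration only and not the tooling used during the incident: the class name ConnectionThrottle, the cap values and the 1.5x step factor are all assumptions chosen for the example.

```python
import math


class ConnectionThrottle:
    """Hypothetical admission gate: admits connections up to a cap raised in steps."""

    def __init__(self, initial_cap, full_cap):
        if not 0 < initial_cap <= full_cap:
            raise ValueError("need 0 < initial_cap <= full_cap")
        self.cap = initial_cap      # current limit while the cluster recovers
        self.full_cap = full_cap    # normal limit once the cluster is stable
        self.active = 0             # connections currently admitted

    def try_acquire(self):
        """Admit one connection if under the current cap, otherwise reject it."""
        if self.active < self.cap:
            self.active += 1
            return True
        return False

    def release(self):
        """Record that an admitted connection has closed."""
        if self.active > 0:
            self.active -= 1

    def raise_cap(self, factor=1.5):
        """Raise the cap by one multiplicative step, never beyond full_cap."""
        self.cap = min(self.full_cap, math.ceil(self.cap * factor))
        return self.cap


# Assumed example values: start at 2 connections, normal limit is 5.
throttle = ConnectionThrottle(initial_cap=2, full_cap=5)
admitted = [throttle.try_acquire() for _ in range(3)]  # third attempt is rejected
caps = [throttle.raise_cap() for _ in range(3)]        # cap grows, then holds at full_cap
```

Raising the cap in small steps lets the backlog drain at each level before more load is admitted, which is the general idea behind enabling service slowly rather than all at once.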
Post Mortem
To reduce the risk of issues like this recurring, we have modified the way in which users are assigned to different platform components. We have also begun work to transfer user mailboxes to other elements to even out resource demands across the board.
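The assignment policy itself is not described here, so the following is only a sketch of one possible approach: placing each user on whichever component currently holds the fewest users. The function name assign_least_loaded, the store names and the user counts are hypothetical and do not reflect the actual platform.

```python
def assign_least_loaded(load_by_store, new_users):
    """Place each user on the store with the fewest users (ties broken by name).

    Hypothetical policy for illustration; the real assignment logic is not
    documented in this review.
    """
    loads = dict(load_by_store)  # copy so the caller's mapping is not modified
    placement = {}
    for user in new_users:
        # Pick the store with the lowest current count; the name is the tie-breaker.
        store = min(loads, key=lambda s: (loads[s], s))
        placement[user] = store
        loads[store] += 1
    return placement, loads


# Assumed example data: three stores with uneven user counts.
placement, loads = assign_least_loaded(
    {"store-a": 3, "store-b": 1, "store-c": 2},
    ["u1", "u2", "u3"],
)
```

A policy of this kind keeps any single component from carrying a disproportionate share of users, so the loss of one component leaves less work for the survivors to absorb.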