CoreDial Root Cause Analysis (RCA) for Service Events the Week of October 16, 2017
CONFIDENTIAL
The information in this document is confidential and proprietary, is intended only for CoreDial’s Service Providers and their select End Users, and should not be disclosed to any other person. It may not be reproduced in whole or in part, nor may any of the information contained therein be disclosed, without the prior consent of CoreDial, LLC. Any form of reproduction, dissemination, copying, disclosure, modification, distribution, and/or publication of this material is strictly prohibited.
General Overview of the Service-Impacting Events During the Week of October 16, 2017
During the week of October 16, 2017, CoreDial Services experienced a series of events, directly caused by our customer-facing SBC (Session Border Controller) architecture, which we estimate impacted 40% of End Users. In the portal, we refer to the use of this architecture as running a customer in “SIP Proxy” mode. There are many advantages to this type of framework, but during this week the SIP Proxy solution failed to perform as designed, as follows:
1. A failure of our commercial Load Balancers fronting the SBCs, which prevented them from taking in the traffic that phones/devices use to peer, set up, and tear down calls.
2. On two separate occasions, the SBCs became overwhelmed by inbound traffic after the Load Balancers failed.
3. Several of our Asterisk servers (roughly 5% of total servers) struggled to keep up with specific customers’ phone traffic, mostly due to the backlog of work created by items 1 and 2 above (see the illustrative sketch following this list).
4. Additionally, and separate from items 1-3 above, on Thursday we experienced a very short-lived Service-impacting event caused by an external DDoS attack, which was quickly detected and remedied by our edge firewalls and engineering team.
5. Delays in restoring BLF (Busy Lamp Field) functionality, because we prioritized restoring and stabilizing calling services before gradually restoring BLF capabilities.
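To make the backlog dynamic in item 3 concrete, the following toy queueing sketch (with hypothetical rates; none of these numbers come from CoreDial’s systems) shows how a server receiving work faster than it can process it falls further behind every second instead of draining, which is how a Load Balancer failure can overwhelm downstream SBCs and Asterisk servers:

    # Toy model: constant arrival and service rates, measured per second.
    def backlog_over_time(arrival_rate: float, service_rate: float, seconds: int):
        """Return the queued-work backlog at the end of each second."""
        backlog = 0.0
        series = []
        for _ in range(seconds):
            backlog = max(0.0, backlog + arrival_rate - service_rate)
            series.append(backlog)
        return series

    # Hypothetical figures: 1,200 requests/sec arriving at a server that can
    # only process 1,000/sec leaves 200 requests/sec of unprocessed backlog.
    print(backlog_over_time(arrival_rate=1200, service_rate=1000, seconds=5))
    # -> [200.0, 400.0, 600.0, 800.0, 1000.0]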
These sorts of extended events are very unusual for CoreDial. Our network uptime this year, until October 2017, was 100%, and our Services uptime was 99.98%. While we experienced a service interruption in July 2017 due to a carrier issue, that issue has since been remedied in a way that prevents it from recurring. During the 16 months leading up to July 2017, we experienced a total of 122.86 minutes of service downtime, none of which was due to core network failure.
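For context, assuming 30-day months (43,200 minutes each), 122.86 minutes of downtime across 16 months works out to

    1 − 122.86 / (16 × 43,200) = 1 − 122.86 / 691,200 ≈ 99.982%

availability over that window, consistent in magnitude with the 99.98% Services uptime figure cited above.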
As of Monday, October 23, 2017, at approximately 3:15 a.m. EDT, we have restored full service to our infrastructure and have hardened it for resiliency in the following ways:
1. Added networking rules on the SBC infrastructure to prevent each machine from taking on more inbound work than it can handle.
2. Added IP-based rate limiting for voice traffic behind the customer-facing SBCs (illustrated in the sketch following this list).
3. Switched from a cluster of F5 2200 load balancers to an upgraded cluster of F5 4800 load balancers.
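As an illustration of the rate limiting in item 2 (a sketch only; the production controls live in network and SBC configuration, and none of the names or limits below are CoreDial’s), a per-source-IP token bucket admits a steady request rate while absorbing short bursts:

    import time
    from collections import defaultdict

    class PerIpRateLimiter:
        """Hypothetical per-source-IP token-bucket limiter."""

        def __init__(self, rate_per_sec: float, burst: float):
            self.rate = rate_per_sec   # tokens refilled per second
            self.burst = burst         # bucket capacity; bounds burst size
            # Each source IP maps to (tokens remaining, time of last update).
            self.buckets = defaultdict(lambda: (burst, time.monotonic()))

        def allow(self, src_ip: str) -> bool:
            tokens, last = self.buckets[src_ip]
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at the burst size.
            tokens = min(self.burst, tokens + (now - last) * self.rate)
            if tokens >= 1.0:
                self.buckets[src_ip] = (tokens - 1.0, now)
                return True            # admit the request
            self.buckets[src_ip] = (tokens, now)
            return False               # over the limit: drop or defer

    # Hypothetical limits: roughly 50 requests/sec per IP, bursts up to 100.
    limiter = PerIpRateLimiter(rate_per_sec=50, burst=100)
    print(limiter.allow("203.0.113.10"))  # True until the bucket runs dry

Requests that find an empty bucket are dropped or deferred before they reach an SBC, which is the same backpressure idea behind the per-machine networking rules in item 1.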
In summary, we clearly understand the root cause of last week’s events, and we are implementing both immediate and long-term solutions to eliminate these issues going forward. Specifically, last Friday we invested more than $1 million to purchase new Oracle Acme Packet SBCs to replace our current customer-facing SBCs. These are industrial-strength, purpose-built SBCs that will perform these functions in a world-class manner and prevent a recurrence of these events.
This is a high-priority network infrastructure improvement project; we expect to install the new SBC infrastructure in the coming weeks and to make it available to partners and customers as soon as possible thereafter.