Below is a summary of the network event that occurred yesterday. Please contact me directly if you need further information.
On Monday, March 31 at 2:17 PM EST, GNAX routers were unable to effectively route traffic to the internet. The issue stemmed from a large peer at the TIE peering fabric flooding the peer routers with unproductive routes. This crippled the route tables on the adjacent routers, and the problem then propagated to our core routers, which are BGP neighbors of those routers. The immediate fix re-converged routes at 3:12 PM EST.
To prevent this type of incident from recurring, our network team has applied more stringent access lists on those peers. In addition, our stricter configuration will terminate a BGP peering session if the peer shows a sudden, unexpected increase in routes, further protecting our customers from this type of occurrence.
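As a rough illustration of the route-limit safeguard described above, the following Python sketch models how a peering session might be dropped once a peer announces more routes than expected. It is not our router configuration: the `PeerSession` class, the `max_prefixes` limit, and the `warn_fraction` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PeerSession:
    """Hypothetical model of one BGP peering session (not a real router API)."""
    name: str
    max_prefixes: int            # assumed hard limit on routes accepted from this peer
    warn_fraction: float = 0.8   # assumed warning threshold, as a fraction of the limit
    prefix_count: int = 0
    established: bool = True

    def receive_routes(self, new_prefixes: int) -> str:
        """Account for newly announced routes and return the resulting state."""
        if not self.established:
            return "down"
        self.prefix_count += new_prefixes
        if self.prefix_count > self.max_prefixes:
            # Limit exceeded: tear down the session and discard the peer's routes.
            self.established = False
            self.prefix_count = 0
            return "terminated"
        if self.prefix_count >= self.warn_fraction * self.max_prefixes:
            return "warning"
        return "ok"


# Illustrative numbers only: normal growth, then a sudden flood of routes.
peer = PeerSession(name="example-peer", max_prefixes=1000)
states = [peer.receive_routes(n) for n in (500, 350, 5000, 10)]
print(states)  # ['ok', 'warning', 'terminated', 'down']
```

In this sketch the flood of 5000 routes trips the limit, so the session is closed before the extra routes can reach neighboring routers, and later announcements from that peer are ignored until the session is re-established.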
Our network team launched into action, identifying the immediate fix and beginning implementation in less than 30 minutes. However, due to the scale and variety of our network infrastructure, it took a few hours to fully diagnose and confirm the issue from the logs, design a more permanent resolution, and carefully test it.
We apologize for the inconvenience and trouble this disruption caused our customers, and we sincerely thank you for your patience and understanding as we worked through the issue. We know how critical our services are to our customers. Over the coming days and weeks, we will do everything we can to learn from this event, further understand the details, and refine our resolution and processes. We are committed to providing our customers with mission-critical IT infrastructure; therefore, we are implementing a status page that will give periodic updates during any future issues. We will post updates only as events are confirmed with factual information.
Many have asked for information before all the facts were understood. We do not believe in speculating on these issues until we understand them fully. As issues unfold, information changes, and speculation can lead to improper actions based on incorrect information.
Also, we typically do not have the manpower to answer the flood of calls asking for the same information during an issue like this. We prefer that our technicians focus on resolving issues rather than giving repetitive partial updates. We believe this is what our customers would prefer as well. The status page should meet the need for up-to-date information and alleviate the frustration around unanswered informational requests.
Additionally, thank you for your suggestions during this time. We are always open to good ideas from our customers.
If you have any further complications or need support as a result of this incident, please log in to our customer portal and create a ticket, and we will respond promptly.
VP Engineering & Online Operations
404.230.9150 ext. 241