HostDime USA Data Center Outage Report
Event Summary
Time: 1:59 PM EST - 3:10 PM EST
Services Affected: Public Network
Summary of outage:
At approximately 1:59 PM EST, our monitoring systems detected an internal network issue that made a significant portion of our network publicly inaccessible. HostDime network engineers immediately began troubleshooting. The troubleshooting revealed that one of our main distribution switches had a corrupted VLAN database file. The corruption originated from another infrastructure switch in the same VTP domain, which was failing. When that switch rebooted, it came up with an incorrect VLAN database file and VTP mode, but with a current VTP revision number. A simple VLAN change on the switch then led it to treat its own database as the latest version, and it announced this newer, incorrect version to all other switches in the domain, including the main distribution switch it was trunked to. Once the engineers determined which switch was responsible, they removed it from the topology as a precaution and then copied over a current VLAN database backup, which restored service immediately.
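The propagation described above can be illustrated with a small simulation. This is a simplified sketch of the "highest revision number wins" rule, not Cisco's actual implementation; the switch names, VLAN IDs, and revision numbers are hypothetical and are not taken from the incident.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Toy model of a VTP participant (hypothetical, heavily simplified)."""
    name: str
    revision: int
    vlans: dict = field(default_factory=dict)  # VLAN ID -> VLAN name

    def local_change(self, vlan_id, vlan_name):
        # Any local VLAN edit bumps the configuration revision number.
        self.vlans[vlan_id] = vlan_name
        self.revision += 1

    def receive(self, sender):
        # A higher revision wins; the database contents are not validated.
        if sender.revision > self.revision:
            self.vlans = dict(sender.vlans)
            self.revision = sender.revision
            return True
        return False


def advertise(sender, peers):
    """Offer the sender's database to every peer; return names that accepted."""
    return [p.name for p in peers if p.receive(sender)]


good = {10: "customers", 20: "management", 30: "storage"}
distribution = Switch("distribution", revision=42, vlans=dict(good))
access = Switch("access", revision=42, vlans=dict(good))

# The failing switch reboots with a damaged database but a current revision.
failing = Switch("failing", revision=42, vlans={1: "default"})

# At an equal revision number, nothing propagates yet.
before = advertise(failing, [distribution, access])

# One small VLAN change pushes its revision past everyone else's.
failing.local_change(99, "test")
after = advertise(failing, [distribution, access])

print(before)
print(after)
print(distribution.vlans)
```

In this model the first advertisement is ignored because the revision numbers match, while the second overwrites every peer's database, including the distribution switch's, with the damaged one.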
Corrective action to prevent recurrence:
VTP (VLAN Trunking Protocol) is a common tool used in Cisco networks to ease the administrative burden of synchronizing VLAN information across every switch in a LAN topology. We have used it in our topology since the inception of this facility, and it has proven useful and robust for our needs. The switch that experienced the initial failure had a number of memory issues as well as problems with its NVRAM storage. It has been removed from the VTP domain and retired from use in our facility.
After thorough analysis and engineering discussion, we decided that the best course of action would be to modify the VTP topology and remove all other switches from the distribution switch domain. This will reduce the risk that a faulty switch in the domain impacts our main distribution switches.
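The effect of this change on the reach of a bad advertisement can be sketched as follows. The sketch relies on the fact that a switch ignores VTP advertisements carrying a different domain name. It assumes the other switches are moved into a separate domain, which is only one way to implement the change; the domain names ("DC", "ACCESS"), switch names, and revision numbers are hypothetical.

```python
def affected(switches, bad_name, bad_revision):
    """List switches that would accept a bad advertisement from bad_name."""
    bad_domain = switches[bad_name]["domain"]
    return sorted(
        name
        for name, sw in switches.items()
        if name != bad_name
        and sw["domain"] == bad_domain      # other domains ignore it
        and bad_revision > sw["revision"]   # higher revision wins
    )


# Before: every switch shares one VTP domain.
shared = {
    "dist-1": {"domain": "DC", "revision": 42},
    "dist-2": {"domain": "DC", "revision": 42},
    "access-1": {"domain": "DC", "revision": 42},
    "access-2": {"domain": "DC", "revision": 42},
}

# After (hypothetical split): distribution switches alone in their domain.
split = {name: dict(sw) for name, sw in shared.items()}
for name in ("access-1", "access-2"):
    split[name]["domain"] = "ACCESS"

shared_hit = affected(shared, "access-2", 43)
split_hit = affected(split, "access-2", 43)

print(shared_hit)
print(split_hit)
```

With a shared domain, a faulty access switch can overwrite both distribution switches; after the split, its advertisement reaches only the other switch in its own domain.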