Greatresponder.com on Oct. 09, 2012: In an apology statement, Mr. Auguste Goldman, chief information officer at GoDaddy, wrote, “For more than a decade, we have provided an uptime of ‘five-9s’ in our DNS infrastructure. We view any disruption as a serious concern and we are confident we have improved our system on the heels of this event. We know we have a responsibility to our customers and the entire Internet ecosystem with the volume of DNS traffic we handle every day.” The apology to customers was made public on Thursday last week in an official blog post by the management of GoDaddy, the web hosting company and domain registrar.
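For readers unfamiliar with the term, “five-9s” conventionally means 99.999% availability. The short Python sketch below works out how much downtime that allows in a year. The function name and the 365-day year are assumptions made for illustration and do not come from the GoDaddy post.

```python
def allowed_downtime_minutes(availability_percent, days=365):
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # "Five nines" is 99.999% availability.
    minutes = allowed_downtime_minutes(99.999)
    print(f"Allowed downtime per year: {minutes:.2f} minutes")
```

By this arithmetic, five-9s leaves a budget of a little over five minutes of downtime per year.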
The blog post disclosed that the outage of September 10, 2012 was caused mainly by a combination of a hardware failure and a volume of queries that the routers could not handle properly. The post described this as a “Perfect Storm.”
The post further stated that the web hosting network had not been hacked and that the outage originated entirely in the hardware failure of a single router. The router began switching to ‘software switching mode’, but the number of DNS queries was so high that they could not be answered in time, and many queries timed out. The problem was resolved by correcting the hardware, removing the errant routes from the routing tables, and throttling the internet links. The throttling caused some queries to be rejected. Without it, it was feared that the network might not have been able to come back online under such a huge volume of traffic.
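The blog post does not say how the links were throttled. As a rough illustration of why throttling leads to rejected queries, the Python sketch below uses a token bucket, a common rate-limiting technique chosen here as an assumption and not confirmed as GoDaddy's method. The class name, function name, and parameters are all hypothetical.

```python
class TokenBucket:
    """Hypothetical rate limiter: admits at most `rate` queries per second
    on average, with bursts of up to `capacity` queries."""

    def __init__(self, rate, capacity):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum tokens the bucket can hold
        self.tokens = float(capacity) # start with a full bucket
        self.last = 0.0               # timestamp of the previous query

    def allow(self, now):
        # Refill according to the time elapsed, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token per admitted query; reject when none are left.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(arrivals, rate, capacity):
    """Return (accepted, rejected) counts for a list of arrival times in seconds."""
    bucket = TokenBucket(rate, capacity)
    accepted = sum(1 for t in arrivals if bucket.allow(t))
    return accepted, len(arrivals) - accepted


if __name__ == "__main__":
    # Ten queries arriving at the same instant against a bucket of three tokens.
    print(simulate([0.0] * 10, rate=1, capacity=3))
```

In this sketch, queries that arrive faster than the bucket refills are rejected outright, which keeps the admitted load bounded while the network recovers.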
Mr. Goldman further wrote, “Within minutes of the beginning of the event, a recovery procedure was executed and the errant routes were removed from the routing protocol of all of our routers. The procedure relied on a standard response from the routers’ software – remove the routes from the FIB and begin forwarding in hardware again. This coupled with normal tiered DNS caching should have minimized any service disruption that could possibly have been caused by the change. This timeout mechanism did not execute.”