Millions of users across the globe experienced the Facebook outage today. As stated by the Facebook engineers, “This is the worst outage we’ve had in over four years”. This morning, an issue in Facebook’s automated systems corrupted services and caused more damage than those systems could fix.
The cause of this issue was an error condition that the automated systems mishandled. One of Facebook’s engineers, Robert Johnson, posted, “The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.”
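To make the quoted behaviour concrete, here is a minimal sketch of such a check. It is not Facebook’s code: the names `is_valid` and `read_config`, the `max_conn` key, and the rule that a valid value is a positive integer are all assumptions made for illustration, and plain dictionaries stand in for the cache and the persistent store.

```python
# Illustrative sketch only; every name here is hypothetical.

def is_valid(value):
    # Assumption: a valid configuration value is a positive integer.
    return isinstance(value, int) and value > 0


def read_config(key, cache, store):
    """Return (value, store_queries) for a single client read."""
    value = cache.get(key)
    if is_valid(value):
        return value, 0          # served from the cache, no store query
    # The cached value is missing or invalid: ask the persistent store.
    value = store[key]
    if is_valid(value):
        cache[key] = value       # repair the cache with the fresh value
    else:
        cache.pop(key, None)     # nothing valid to cache, so the key stays deleted
    return value, 1


# Case 1: transient cache problem, the persistent store is healthy.
store = {"max_conn": 100}
cache = {"max_conn": -1}
transient = [read_config("max_conn", cache, store) for _ in range(3)]
print(transient)   # [(100, 1), (100, 0), (100, 0)]

# Case 2: the persistent store itself holds an invalid value.
bad_store = {"max_conn": -1}
bad_cache = {"max_conn": -1}
persistent = [read_config("max_conn", bad_cache, bad_store) for _ in range(3)]
print(persistent)  # [(-1, 1), (-1, 1), (-1, 1)]
```

In the first case a single store query repairs the cache and later reads never touch the store. In the second case the cache can never be repaired, so every read turns into a store query.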
Facebook later explained that the site outage was due to some work they were doing on the site’s DNS servers. Early this morning, the engineers made a change to the DNS infrastructure, and that led to the temporary unavailability for users.
During this outage, there was significant growth in Twitter traffic, as you can see in the image below!
In fact, the major cause of this outage was the users themselves. Yes, you read that right!
Every attempt by a user to access Facebook resulted in a database query, and a failed query was in turn interpreted as an invalid value, which caused the corresponding cache key to be deleted. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. “We had entered a feedback loop that didn’t allow the databases to recover,” the Facebook engineer stated.
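The feedback loop can be illustrated with a small toy simulation. This is a sketch under stated assumptions, not a model of Facebook’s real infrastructure: the function `simulate`, its parameters, the client count and the database capacity are all made up, a single shared cache key is assumed, and time is divided into discrete ticks in which every client reads the key once.

```python
# Toy model of the feedback loop; all names and numbers are hypothetical.

def simulate(ticks, clients, db_capacity, fixed_at, delete_on_error):
    """Return the number of database queries issued in each tick."""
    cache_ok = False              # the shared cache key starts out missing
    load = []
    for tick in range(ticks):
        if cache_ok:
            load.append(0)        # every client is served from the cache
            continue
        # Cache miss: every client queries the database in this tick.
        queries = clients
        served = min(queries, db_capacity)
        failed = queries - served
        # The persistent store only holds a valid value once it is fixed.
        store_valid = tick >= fixed_at
        wrote_cache = store_valid and served > 0
        # Assumption: an error is treated as an invalid value, so at least
        # one failing client deletes the key after the last successful write.
        deleted = delete_on_error and failed > 0
        cache_ok = wrote_cache and not deleted
        load.append(queries)
    return load


with_loop = simulate(ticks=6, clients=1000, db_capacity=200,
                     fixed_at=2, delete_on_error=True)
without_loop = simulate(ticks=6, clients=1000, db_capacity=200,
                        fixed_at=2, delete_on_error=False)
print(with_loop)     # [1000, 1000, 1000, 1000, 1000, 1000]
print(without_loop)  # [1000, 1000, 1000, 0, 0, 0]
```

In this toy model the root cause is fixed at tick 2 in both runs. When errors delete the cache key, the database keeps receiving more queries than it can serve, so the load never drops. When errors leave the cache alone, the first successful read after the fix repopulates the cache and the load falls to zero.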
Sources tell us that, to sort out this problem, the engineers had to shut down the server for a while and fix the root cause. The Facebook engineers have also turned off the automated system that attempts to correct the configuration values and are exploring the configuration design patterns used by other systems at Facebook.
Facebook is now up and running without any issues, and the Facebook engineers have said that they take the performance and reliability of Facebook very seriously.