Google Apps Incident Report
Google Services
July 10, 2013
Prepared for Google Apps customers
The following is the incident report for the Google services disruption that occurred on July 10, 2013. We understand this service issue has impacted our valued customers and users, and we apologize to everyone who was affected.
Issue Summary
From 6:12 AM to 7:37 AM PDT, a fraction of users in West Virginia, North Carolina, Nebraska, and Georgia experienced errors when trying to access many Google services, including Search, Gmail, Google Drive, Google Calendar, Google Groups, Google Sites, and the Google Admin console. During this time, between 0.2% and 0.5% of Google’s traffic was affected. The root cause of this service disruption was a novel failure of a network router.
Actions and Root Cause Analysis
Background: Routers take incoming traffic, analyze the information, and then direct users’ requests to other Google systems for processing and serving content, such as generating search results and accessing Gmail.
At 6:12 AM PDT, a bug in a third-party software update caused a partial failure of a Google network router in the Atlanta region. Normally when a network router fails, it alerts Google Engineering of the problem, and redundant network devices quickly reroute traffic. This novel bug caused the router to incorrectly identify a significant portion of traffic as invalid and discard it, but not treat this behavior as a failure. As a result, the router silently dropped traffic for some users in specific areas of the United States, and the affected users found their access to Google services limited or unavailable.
Google’s monitoring systems detected the outage at 7:03 AM PDT, and Google Engineering immediately began to investigate. The Engineering team determined the source of the problem, escalated the software issue with the third-party vendor, and removed the router from production at approximately 7:35 AM PDT. Google networking devices then routed traffic around the failure, and the affected users’ access to services was restored within minutes. The third-party vendor continued to research the software bug, and provided a root cause and workaround later in the day.
Corrective and Preventative Measures
The Google Engineering team conducted an internal review and analysis of the July 10 event. They are taking the following preliminary actions to address the underlying causes of the issue and to help prevent recurrence:
● Implement tests to detect this mode of router failure during the testing process for software updates and configuration changes.
● Investigate methods for adding alerts for more quickly detecting a decrease in traffic, and for differentiating unintended traffic drops from intended discards, for example, of denial-of-service traffic.
● Improve network monitoring for early detection of this type of issue, including monitoring router performance during upgrades and updates, and increasing the scope of monitoring across interconnected network devices.
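As an illustration only, the idea of separating unintended traffic drops from intended discards could be sketched as follows. This is not Google's monitoring system. The counter names, the 1% threshold, and the three-interval window are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntervalCounters:
    """Hypothetical per-interval counters exported by a router."""
    packets_in: int          # packets received during the interval
    packets_forwarded: int   # packets sent on toward their destination
    policy_discards: int     # packets dropped on purpose (e.g. DoS filters)


def unexplained_drop_ratio(c: IntervalCounters) -> float:
    """Fraction of received packets that were neither forwarded nor
    discarded by an explicit policy."""
    if c.packets_in == 0:
        return 0.0
    unexplained = c.packets_in - c.packets_forwarded - c.policy_discards
    return max(unexplained, 0) / c.packets_in


def should_alert(history: List[IntervalCounters],
                 max_ratio: float = 0.01,   # assumed threshold
                 consecutive: int = 3) -> bool:  # assumed window
    """Alert when each of the last `consecutive` intervals has an
    unexplained drop ratio above `max_ratio`."""
    if len(history) < consecutive:
        return False
    recent = history[-consecutive:]
    return all(unexplained_drop_ratio(c) > max_ratio for c in recent)


# Intended discards (a filtered attack) do not trigger an alert,
# while drops with no matching policy do.
filtered_attack = [IntervalCounters(1000, 500, 500)] * 3
silent_drops = [IntervalCounters(1000, 600, 0)] * 3
print(should_alert(filtered_attack), should_alert(silent_drops))
```

Requiring several consecutive intervals is one way to avoid alerting on a brief spike. A real deployment would need to tune both values against observed traffic.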
Google is committed to continually and quickly improving our technology and operational processes to prevent service disruptions. We appreciate your patience and again apologize for the impact to your organization. We thank you for your business and continued support.
Sincerely,
The Google Apps Team