Google Apps Incident Report
Gmail Delivery Delays - March 16, 2010
Prepared for Google Apps Premier Customers
The following is the incident report for email delivery delays that some Google Apps Gmail customers experienced on March 16. We understand that this service issue has affected our valued customers and their users, and we apologize for the disruption.

Issue Summary

Beginning at 5:34 AM PDT | 12:34 GMT, Tuesday, March 16, the affected customers experienced delays with incoming messages and received errors when sending outgoing messages. As a workaround, users could try to resend outgoing messages. At 3:23 PM PDT | 22:23 GMT, Tuesday, March 16, both inbound and outbound message delivery was restored to normal for all customers. During this time, some incoming messages may have been temporarily deferred; at no time were messages lost or deleted.

Actions and Root Causes

To help immediately mitigate slow delivery for Gmail, the Google Engineering team increased processing resources for Gmail routing and greatly increased the number of active Gmail routers.

Following an internal investigation and analysis, the Google Engineering team determined that the root cause of the Gmail issue was the incorrect allocation of network resources. This misallocation was the result of a combination of the following:

• A recent update to expand network capacity generated new routing information. The mechanism that allocates bandwidth resources interpreted this routing data incorrectly and subsequently reduced the network resources available to Gmail.
• In this particular case, the network resources were reduced in a way that other internal systems, which normally detect and help manage network demand for Gmail, were not triggered.

To resolve the mail delay, Google Engineering corrected the mechanism that allocates bandwidth, and Gmail delivery functions returned to normal.
Corrective and Preventative Measures

The Google Engineering team conducted an internal review and analysis, and is performing the following actions to help address the underlying causes of the problem and to help prevent recurrence:

• Changing the behavior of the resource allocation system to help prevent this class of issue. This action has been completed.
• Enhancing monitoring tools to help detect and provide alerts for this misallocation condition.

We appreciate your patience and again apologize for the impact to your organization. We thank you for your business and continued support.

Sincerely,
The Google Apps Team