Google Apps Incident Report
Gmail Service Outage - September 1, 2009
Prepared for Google Apps Premier Edition Customers

Incident Summary

Between 12:45 PM and 2:15 PM PDT | 19:45 - 21:15 GMT on Tuesday, September 1, 2009, Google Apps Gmail users were unable to access their accounts through the Gmail web interface. Users could continue to access their accounts via IMAP and POP. No data was lost during this time; messages were received and delivered, but could not be displayed in the web interface. We understand that this service outage has affected our valued customers and their users, and we sincerely apologize for the disruption and any impact.

Actions and Root Cause Analysis

On Tuesday, September 1, a small portion of Gmail's web capacity was taken offline during a routine upgrade and service update. This is normal operating procedure, as the Gmail web interface runs in multiple locations and Gmail's request routing automatically directs users' requests to available servers. However, we underestimated the increased load that some of the new updates placed on request routing. As a result, at approximately 12:30 PM PDT, a few request routers became overloaded and responded by refusing all incoming requests. This response transferred the load to the other request routers, and as the effect rippled through the system, almost all of the request routers became overloaded. Users therefore could not access Gmail through the web interface, since their requests could not be routed to a Gmail server. Gmail processing and access through the IMAP/POP interfaces continued as usual because these processes use different request systems.

Upon receiving the error alerts, the Gmail Engineering team immediately began analyzing the issue and initiated a series of actions to help alleviate the symptoms. After determining the root cause to be insufficient available capacity, the Engineering team deployed a large-scale addition of request routers through Google's flexible capacity server systems.
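As an illustration only, the overload cascade described above, and the effect of adding request routers, can be sketched with a toy simulation. The model is an assumption and not Gmail's actual routing logic: the function name `simulate_cascade`, the even split of load across healthy routers, and all capacity and load figures are hypothetical.

```python
def simulate_cascade(total_load, capacities):
    """Return the indices of routers still serving once overloads settle.

    Hypothetical model: load is split evenly across healthy routers, and
    any router whose share exceeds its capacity refuses all requests and
    drops out, pushing its share onto the remaining routers.
    """
    healthy = list(range(len(capacities)))
    while healthy:
        share = total_load / len(healthy)
        overloaded = [i for i in healthy if share > capacities[i]]
        if not overloaded:
            break  # every remaining router can carry its share
        # Overloaded routers refuse all requests; the rest absorb the load.
        healthy = [i for i in healthy if i not in overloaded]
    return healthy


# Illustrative numbers only: two routers have slightly less headroom.
original_pool = [95, 95] + [110] * 8
print(simulate_cascade(1000, original_pool))       # [] : every router ends up refusing

# Adding ten more routers halves each share, so no router overloads.
expanded_pool = original_pool + [110] * 10
print(len(simulate_cascade(1000, expanded_pool)))  # 20 : all routers keep serving
```

In this sketch the first two routers fail at a share of 100 requests each, the remaining eight then receive 125 each and fail as well, which mirrors how a few overloaded routers can take down almost the entire pool. With the larger pool, each share drops to 50 and the cascade never starts.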

As the team distributed incoming traffic across the expanded pool of request routers, access to the Gmail web interface returned to normal.

During the incident, we published ongoing reports to the Google Apps dashboard, Gmail Help Center, the Enterprise and Gmail blogs, and the GoogleAtWork and Google Twitter feeds to provide customers with the latest status and available workarounds.

Corrective and Preventative Measures

The Gmail Engineering team conducted an internal review and analysis, and determined the following actions to address the underlying causes of the issue and help prevent recurrence:

• Increasing request router capacity well beyond peak demand estimates. This action was completed immediately following the incident, and helps prevent recurrence under similar conditions.
• Isolating failure of request routers so that issues are limited to the specific datacenter and do not affect servers in other datacenters.
• Addressing request router behavior under load: if multiple routers are simultaneously overloaded, they should continue to perform at a reduced rate rather than refusing connections and attempting to defer their load.

Over the next few weeks, we are dedicated to implementing these improvements to Gmail.

We understand that system issues are inconvenient and frustrating for customers. One of Google's core values is to focus on the user, and we are committed to continually and quickly improving our technology and operational processes to help prevent any service disruptions. Once again, we apologize for the impact that this incident has caused. Thank you very much for your continued support.

Sincerely,
The Google Apps Team

September 2, 2009

