Google Apps Incident Report
Gmail Outage - February 27, 2011
Prepared for Google Apps for Business customers

The following is the incident report for the Gmail issues experienced by a very small percentage of Google Apps customers beginning on February 27, 2011. The affected users reported empty mailboxes and login errors with Gmail and other Google Apps services. To resolve the issues, Google Engineering restored account data and user access for the affected users. During this incident, some incoming messages were automatically bounced (senders received a delivery failure notification); no email was lost from users’ mailboxes. We understand that this service outage has affected our valued customers and their users, and we sincerely apologize for the impact and disruption.

Issue Analysis and Actions

Note: All times listed are in Pacific Standard Time.

At approximately 10:00 AM, February 27, Google Support received initial reports of customers 1) finding their Gmail mailboxes empty and personal settings (such as labels and themes) reset to the defaults, or 2) receiving a 500-series error stating that their Gmail account was temporarily unavailable.

After analyzing the issue, Google Engineering determined that the root cause was a bug inadvertently introduced in a Gmail storage software update. The bug caused the affected users’ messages and account settings to become temporarily unavailable from the datacenters. At 1:05 PM, February 27, Google Engineering reverted the storage software update, and halted further deployment.

Restoration Process

While analyzing the issue and its root cause, Google Engineering also worked on the process to restore users’ accounts. At 6:00 PM, February 27, Google Engineering temporarily disabled access to Gmail and other Google Apps services for all potentially affected users. This was a precautionary measure to prevent issues with data integrity during the mailbox restoration process. When users attempted to log in to their Gmail or Google Apps account, they received the message, “Sorry, your account has been disabled.”

At 1:30 PM, February 28, following further analysis, Google Engineering identified those users not affected by the software bug, and restored their account access. For the affected users, Google Engineering restored access to all of their Google Apps services other than Gmail.

Gmail stores multiple copies of users’ messages in multiple datacenters and on tape backups. With this software issue, some messages were unavailable online and required restoration from offline tape backups. Google Engineering retrieved the users’ data from tape backups, moved the data into their mailboxes, validated the data restoration, delivered all queued incoming messages, and re-enabled login access.

The time required to retrieve users’ data from tape backups contributed to the extended time for the restoration. In addition, the restoration time depended on the size of the user’s mailbox: the larger the user’s mailbox, the longer the restoration. During this incident, user accounts that were programmatically updated by Google Apps Directory Sync or the Google Apps Provisioning API (utilities used by Google Apps administrators) required additional time for restoration.

During this incident, no existing messages or Gmail settings were lost from the users’ accounts. However, between 6:00 PM February 27 and 2:00 PM February 28, new incoming messages were not accepted, and the senders received a “Delivery Status Notification (Failure)” bounce notification. Messages sent after this timeframe were delivered as usual, and available once users logged in.

Google Engineering worked diligently through the list of affected user accounts to restore access as quickly as possible while ensuring data integrity. By 3:40 PM, March 2, Gmail data and login access were restored to 98% of Google Apps for Business users. Google Engineering and Google Support worked directly with the remaining users as needed, and by 11:30 AM, March 3, all Google Apps for Business user accounts had been restored.

Incident Communications

During the incident, Google Support posted regular updates to the Apps Status Dashboard. On February 28, Google Engineering released a Gmail blog post that described the cause of the issue, included information on the account restoration process, and listed an email address for users to report any residual issues.

Corrective and Preventative Measures

Google Engineering and Support conducted an internal review and analysis, and have begun the following actions to help address the underlying causes of the issues and prevent recurrence:

● Expand testing tools to better identify this class of bug during the software development cycle.
● Implement alerts and monitoring to detect this type of issue more quickly, and stop propagation.
● Speed the email restoration process by increasing the automation and performance of the tools used for identifying affected users, and for disabling and re-enabling users’ accounts.
● Develop tools that allow users to maintain account access to their Google Apps services during a Gmail service disruption.
● Improve support communications: When customers submit a case about a large service disruption or outage to Google Enterprise Support, they can automatically receive status/resolution updates through email or online in their support case.

We are dedicated to making these improvements, all of which are now in progress. We understand that this issue has impacted and frustrated customers. Google is committed to continually and quickly improving our technology and operational processes to help prevent service disruptions.
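
For readers who want the elapsed times implied by the timeline in this report, the following is a minimal Python sketch that computes them from the stated timestamps. The event labels and helper functions are illustrative names chosen for this sketch, not terms from Google's tooling, and the first timestamp is the approximate time given above. All times are treated as Pacific Standard Time with no daylight-saving change, which holds for these dates in 2011.

```python
from datetime import datetime, timedelta

# Timestamps as stated in this report (Pacific Standard Time, 2011).
# The keys are hypothetical labels used only for this sketch.
EVENTS = {
    "first_reports": datetime(2011, 2, 27, 10, 0),    # approximate
    "update_reverted": datetime(2011, 2, 27, 13, 5),
    "access_disabled": datetime(2011, 2, 27, 18, 0),  # bounce window starts
    "bounces_end": datetime(2011, 2, 28, 14, 0),
    "restored_98_percent": datetime(2011, 3, 2, 15, 40),
    "restored_all": datetime(2011, 3, 3, 11, 30),
}


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two named events."""
    return EVENTS[end] - EVENTS[start]


def describe(delta: timedelta) -> str:
    """Format a timedelta as days, hours, and minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m"


if __name__ == "__main__":
    pairs = [
        ("first_reports", "update_reverted"),
        ("access_disabled", "bounces_end"),
        ("first_reports", "restored_98_percent"),
        ("first_reports", "restored_all"),
    ]
    for start, end in pairs:
        print(f"{start} -> {end}: {describe(elapsed(start, end))}")
```

Under these assumptions, the window in which incoming messages were bounced comes to 20 hours, and the time from the first reports to full restoration comes to a little over four days.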
