Google Apps Incident Report
Google Docs - March 18, 2013
Prepared for Google Apps customers

The following is the incident report for the Google Drive access disruption that occurred on March 18, 2013. We understand this service issue has impacted our valued customers and users, and we apologize to everyone who was affected.

Issue Summary

From 6:15 AM to 9:10 AM PT, some users experienced “Server Error 503” messages, long load times, or timeouts when trying to access Google Drive. Applications using the Google Drive and Docs APIs also returned errors. The issue affected up to 33% of all user requests to Google Drive during this period. Users could continue to access individual Drive files by direct link or URL. The root cause of this service disruption was an issue in the network control software.

Actions and Root Cause Analysis

At 6:09 AM PT, a portion of Google’s network capacity went offline due to a novel bug in its control software that surfaced during a scheduled network event. As designed, the Google network routing and load balancing systems responded within seconds, redirecting user traffic to other network connections and servers. The servers receiving the additional user traffic continued to function normally, and as expected, their latency increased with this load. However, the latency increase triggered a second bug in the software that manages user connections and sessions with Google Drive. This resulted in errors and timeouts for some users who were attempting to access Google Drive through the affected servers.

By 6:15 AM, Google Engineering was taking steps to first mitigate the impact to user traffic and then to restore the network capacity. The Google Drive team also took actions targeted at restoring user access as well as beginning the analysis of the bug affecting the Drive interface servers.

The network was restored to normal operation by 7:50 AM, after Google Engineering identified the control system bug and implemented a workaround. The recovery was prolonged in part because of the nature of the outage and the care required to avoid expanding the scope of the outage in the process of mitigating or fixing it. Access to Google Drive began to return for users at 8:10 AM, and the issue was resolved for all users by 9:10 AM.

Corrective and Preventative Measures

The Google Engineering team conducted an internal review and analysis of the March 18 event. They are taking the following preliminary actions to address the underlying causes of the issue and to help prevent recurrence:

● Fix the issue in the control systems which led to the initial event. Completed. The Google Networking team issued a fix to the affected systems on the evening of March 18, and will release the fix to all systems within days.
● Implement a load balancing policy change to provide greater isolation between different services on the network in the event of a partial failure. This change will reduce the complexity of expected behaviors during partial failure, making software service testing easier and more deterministic.
● Fix the bug within Drive and change internal structures and resources to make Drive far more resilient to latency and errors.
● Improve the Drive alert systems to address the delay between the initial incident (6:15 AM) and the start of work (6:40 AM), and expand monitoring of Drive systems.
● Accelerate the work in progress that ensures user traffic for Drive is properly prioritized during network events.
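
For readers who want to check the intervals cited in this report, the following is a minimal sketch that derives them from the stated times. The dictionary keys and the helper name `to_minutes` are illustrative assumptions, not part of any Google tooling; only the clock times come from the report.

```python
# Clock times (PT) as stated in this report; the key names are hypothetical labels.
TIMELINE = {
    "network_capacity_offline": "6:09 AM",
    "user_impact_start": "6:15 AM",
    "drive_work_start": "6:40 AM",
    "network_restored": "7:50 AM",
    "drive_access_returning": "8:10 AM",
    "resolved_for_all_users": "9:10 AM",
}


def to_minutes(clock: str) -> int:
    """Convert a 12-hour clock string such as '6:09 AM' to minutes after midnight."""
    time_part, meridiem = clock.split()
    hours, minutes = (int(part) for part in time_part.split(":"))
    hours = hours % 12 + (12 if meridiem.upper() == "PM" else 0)
    return hours * 60 + minutes


def minutes_between(start_key: str, end_key: str) -> int:
    """Elapsed minutes between two events in TIMELINE (same day assumed)."""
    return to_minutes(TIMELINE[end_key]) - to_minutes(TIMELINE[start_key])


# Total window in which users saw errors or timeouts.
user_impact = minutes_between("user_impact_start", "resolved_for_all_users")
# Time the network capacity was offline before the workaround was in place.
network_outage = minutes_between("network_capacity_offline", "network_restored")
# Delay the Drive alerting improvement is meant to address.
alert_delay = minutes_between("user_impact_start", "drive_work_start")

print(f"User impact: {user_impact} min")
print(f"Network capacity offline: {network_outage} min")
print(f"Delay before Drive work started: {alert_delay} min")
```

Under these assumptions, the user-facing disruption spans 2 hours 55 minutes, and the 25-minute gap between first impact and the start of work is the delay the alerting improvement targets.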

Google is committed to continually and quickly improving our technology and operational processes to prevent service disruptions. We appreciate your patience and again apologize for the impact to your organization. We thank you for your business and continued support.

Sincerely,
The Google Apps Team
