Google Apps Incident Report
Google Services - March 17, 2014
Prepared for Google Apps customers

The following is the incident report for the Google services disruption that occurred on March 17, 2014. We understand this issue has impacted our valued customers and users, and we apologize to everyone who was affected.

Issue Summary

From 8:35 AM to 12:10 PM PT, Google Talk, Google Hangouts (including Chat and Video), Google Voice, and the Google App Engine XMPP and Channel APIs were unavailable for the majority of users. Certain features of other services were affected, including the multiplayer features of Google Play Store games. From 9:45 AM to 10:50 AM, some Google Sheets users also experienced very slow responses or 502 errors. The root cause of this disruption was a miscalculation in available capacity during a hardware maintenance event.

Actions and Root Cause Analysis

Background: Google Engineering regularly performs scheduled maintenance on data center systems. Some procedures involve upgrading groups of servers and redirecting the traffic to other available servers. Normally, these maintenance procedures occur in the background with no impact to users.

At 8:25 AM, Google Engineering began a maintenance procedure on a group of backend servers that support Google Hangouts, Google Sheets, and other services, and they redirected the processing load to a new set of backend servers. Due to a miscalculation of memory usage, the new set of backend servers lacked sufficient capacity to process the redirected traffic. These backend servers could not process the volume of incoming requests and returned errors. This led some Google services to retry requests, which further compounded the high levels of user traffic, and the backend servers rapidly became overloaded. At 8:35 AM, as incoming requests were dropped, the affected services were no longer accessible to users.

When Google Engineering was alerted at 8:42 AM, they diagnosed the problem and identified the root cause. To mitigate the issue, Google Engineering brought in additional backend server capacity, stopped the maintenance procedure, and began to bring the original backend servers back online.

As the backend servers became overloaded, they began to drop users' connections to their devices; these connections are critical to services such as Chat and Hangouts. The recovery was prolonged because Google Engineering had to reestablish users' connections in stages to avoid overwhelming the backend servers with sudden high volumes of traffic. The initial traffic overload also impacted the capacity of the messaging routers, and this affected the recovery efforts as the backend servers came back online.
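The retry behavior described in the analysis above, where services re-sent failed requests and compounded the load, is commonly addressed with capped exponential backoff plus random jitter. The following Python sketch illustrates that general technique only. It is not taken from Google's systems, and the names `TransientError`, `backoff_delay`, and `call_with_retries`, as well as the default values, are assumptions made for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type standing in for an overloaded-backend response."""


def backoff_delay(attempt, base=0.5, cap=60.0, rng=random):
    """Return a randomized delay in seconds before retry number `attempt` (0-based).

    `base` and `cap` are illustrative defaults, not values from the report.
    """
    # The ceiling doubles with each attempt but never exceeds `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: choose uniformly within [0, ceiling] so that many clients
    # failing at the same moment do not all retry at the same moment.
    return rng.uniform(0.0, ceiling)


def call_with_retries(request, max_attempts=5, sleep=time.sleep, rng=random):
    """Call `request()` and retry on TransientError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts - 1:
                # Give up instead of adding more load to a struggling backend.
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

Spreading retries out in this way limits how much extra traffic a failing backend receives. The same idea of pacing applies to the staged reconnection described above, although the report does not say how that staging was implemented.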

Google Engineering continued to add both router and backend server capacity, and some users began to regain access to the affected services at 9:45 AM. Due to the nature of this issue, the Hangouts service was restored for all users at once rather than ramping up as the affected systems came online.

Corrective and Preventative Measures

The Google Engineering team conducted an internal review and analysis of the March 17 event. They are taking the following actions to address the underlying causes of the issue and to help prevent recurrence:

● Review memory requirements and increase the memory capacity for the affected backend servers to meet peak load needs.
● Implement better monitoring for memory utilization and usage tracking to ensure that servers have sufficient capacity available.
● Lower the alert threshold for errors with the Hangouts service to improve Engineering response time.
● Review internal procedures for bringing up emergency capacity to speed mitigation efforts.
● Continue work in progress to improve the resilience of the Hangouts service during high load conditions.
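As a rough illustration of the first two measures, the Python sketch below checks whether a replacement server pool has enough memory for a redirected peak load while keeping some headroom. It is an example under stated assumptions rather than Google's actual capacity model: the function names, the 80% utilization ceiling, and all figures are hypothetical.

```python
import math


def usable_memory_gb(servers, memory_per_server_gb, max_utilization=0.8):
    """Memory the pool may use while staying under the utilization ceiling."""
    if servers < 0 or memory_per_server_gb <= 0:
        raise ValueError("servers must be >= 0 and memory_per_server_gb > 0")
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in (0, 1]")
    return servers * memory_per_server_gb * max_utilization


def has_capacity(peak_load_gb, servers, memory_per_server_gb, max_utilization=0.8):
    """True if the pool can hold the expected peak load with headroom left."""
    return peak_load_gb <= usable_memory_gb(
        servers, memory_per_server_gb, max_utilization
    )


def servers_needed(peak_load_gb, memory_per_server_gb, max_utilization=0.8):
    """Smallest server count whose usable memory covers the peak load."""
    per_server = usable_memory_gb(1, memory_per_server_gb, max_utilization)
    return math.ceil(peak_load_gb / per_server)
```

With these hypothetical figures, a pool of 10 servers with 128 GB each fails the check for a 1,100 GB peak load, and `servers_needed(1100, 128)` reports that 11 servers would be required. Running such a check against measured peak usage before redirecting traffic is one way the kind of miscalculation described in this report might be caught earlier.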

Google is committed to continually and quickly improving our technology and operations to prevent service disruptions. We appreciate your patience and again apologize for the impact to your organization.

Sincerely,
The Google Apps Team
