Postini Services Incident Report
Mail Delivery - May 7, 2013
Prepared for Postini and Google Apps customers

The following is the incident report for the Postini services outage that occurred on May 7, 2013 (GMT). We understand this service issue has impacted our valued customers and users, and we apologize to everyone who was affected.

Issue Summary

From 10:15 GMT May 7 to 3:52 GMT May 8, users on Postini System 200 (which comprises 12.7% of all Postini users) experienced severe delays in inbound and outbound mail delivery. The delays were most severe from 12:00 until 21:00, after which time delivery rates began to improve.

During this incident, inbound messages (messages sent to users) were deferred. Outbound messages (messages sent from users) were queued on customers’ mail servers. Users who sent messages received a deferral notification with errors such as "421 Server busy, try again later - psmtp". Delivery of the deferred messages was retried based on the sending server’s retry interval (which can range from minutes to hours). A small portion of traffic continued to be processed and delivered throughout the incident. At no time were messages lost or deleted.

The root cause of this service outage was a combination of load balancer failures in the primary data center and insufficient processing capacity in the continuation data center.

Actions and Root Cause Analysis

Background: Postini services run in pairs of data centers, the primary and continuation. Messages are normally processed, filtered, and archived in the primary data center. If there is an issue affecting the primary data center, message traffic may be temporarily switched to the continuation data center.

At 10:15 GMT, mail processing performance began to degrade in the System 200 primary data center, and as designed, the automated monitoring systems directed message traffic to the continuation data center.
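The failover behavior described above can be pictured with a minimal sketch. It is illustrative only and does not represent the actual Postini monitoring system: the `DataCenter` type, the `healthy` flag, and the data center names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str      # hypothetical label, not a real Postini identifier
    healthy: bool  # hypothetical flag that a monitoring probe would set


def choose_data_center(primary: DataCenter, continuation: DataCenter) -> DataCenter:
    """Route to the primary while it is healthy, otherwise fail over."""
    if primary.healthy:
        return primary
    if continuation.healthy:
        return continuation
    raise RuntimeError("no healthy data center available")


# Assumed state at the start of the incident: primary degraded.
primary = DataCenter("system200-primary", healthy=False)
continuation = DataCenter("system200-continuation", healthy=True)
print(choose_data_center(primary, continuation).name)  # system200-continuation
```

A selection rule like this only decides where traffic goes. It does not check whether the continuation site has enough capacity for the redirected load.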
Google Engineering diagnosed the issue, and at 11:30 GMT, they identified severe instability in the load balancer software, which is provided by a third party, as the core issue in the primary data center. The Engineering team escalated the issue to the third-party vendor and continued investigating the cause and restoration options.

As mail flowed through the continuation data center, the message processing systems did not have sufficient capacity for this sustained volume of traffic. As resources were consumed, the low processing rate caused delivery delays, and the queued messages and retry attempts led to further processing latency.
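The compounding effect of deferred messages and retries can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the real system: the function name, the per-tick arrival rate, and the per-tick capacity are hypothetical numbers chosen only to show the shape of the problem.

```python
from typing import List


def simulate_backlog(arrivals_per_tick: int, capacity_per_tick: int, ticks: int) -> List[int]:
    """Return the backlog after each tick when deferred messages are retried later."""
    backlog = 0
    history = []
    for _ in range(ticks):
        # New mail plus previously deferred mail that is retried this tick.
        offered = backlog + arrivals_per_tick
        processed = min(offered, capacity_per_tick)
        # Whatever could not be processed is deferred again.
        backlog = offered - processed
        history.append(backlog)
    return history


# Hypothetical figures: 100 messages arrive per tick, only 60 can be processed.
print(simulate_backlog(arrivals_per_tick=100, capacity_per_tick=60, ticks=5))
# [40, 80, 120, 160, 200]
```

While arrivals exceed capacity, the backlog grows every tick and never drains. Once capacity meets or exceeds the arrival rate, the same function shows the backlog staying at zero.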

At 15:48 GMT, the vendor reported that they had narrowed the source of the problem and were determining the root cause and solution. Throughout the day, Google Engineering continued to provide information to the third-party vendor and conduct their own investigation, and took actions to help reduce user impact.

Engineering detected a suboptimal use of processing resources in the continuation data center, and at 20:40 GMT, they implemented production configuration changes that increased delivery capacity and helped reduce deferrals. Additional performance tuning measures were implemented at 22:20 GMT and 23:20 GMT to provide incremental improvements to mail processing.

At 23:00 GMT, the vendor identified the root cause, a software defect in the load balancer that affected only certain operating system configurations, and began developing a fix. At 2:00 GMT, May 8, Google Engineering implemented the vendor-provided remediation and returned message traffic to the primary data center, and by 3:52 GMT, mail processing returned to normal. Customers’ messages that were initially deferred were delivered according to the sending servers’ retry interval.

Corrective and Preventative Measures

We understand this was a severe service disruption that took a prolonged time to resolve, which was frustrating for our users. The Google Engineering team conducted an internal review and analysis of the May 7 event. They are taking the following actions, a number of which are underway, to address the underlying causes of the issue and to help prevent recurrence:

● Implement fixes and recommendations provided by the vendor to the load balancer systems across all data centers.
● Assign additional storage capacity to the continuation data centers.
● Ensure consistency in performance tuning and configurations between the primary and continuation production systems to optimize performance in the continuation data center.
● Review the escalation response with the vendor to significantly improve the clarity and speed of resolution.
● Improve the Apps Status Dashboard to provide greater visibility and relevant detail about issues in progress.

Google is committed to continually and quickly improving our technology and operational processes to prevent service disruptions. We appreciate your patience and again apologize for the impact to your organization. We thank you for your business and continued support.

Sincerely,
The Google Apps Team

