Business Continuity Planning (BCP) and Disaster Recovery Planning (DRP) Course Material Slides Notes By Karn G. Bulsuk
Slide 1
Karn G. Bulsuk 12 February 2010
Slide 2
© Karn G. Bulsuk. Downloaded from http://blog.bulsuk.com/2011/04/university-level-it-auditing-course.html
Slide 3
Inform students that in continuity planning, there are three key concepts which belong together: Business, Business Continuity Planning (BCP) and Disaster Recovery Planning (DRP).
Slide 4
[Diagram: three alternative arrangements of Business, BCP and DRP]
The purpose of this slide is to communicate the importance of making the business the key driver of any BCP and, subsequently, DRP. The correct answer is the first option on the left: the business must take precedence, and its requirements must be used to determine what is important and the priority of recovery. In other words:
•Business: sets requirements on which business processes need to be recovered, in order of priority.
•BCP: looks at the overall picture of what needs to be available for the process to continue.
•DRP: looks at which IT systems need to be recovered in order for business to continue.
For example, in an accounting department the business may identify that its purchasing and ordering system must continue even after a disaster. As a result:
•BCP: identifies essential resources critical to continued operations, such as key staff, required forms, computers, internet connection, printers, desks and places to work (or facilities to enable working from home), etc.
•DRP: identifies how to ensure that data processing can continue: which backup servers need to be activated, what software is required, how to maintain communication links, how to ensure that key staff have enough computers to work, how to maintain IS security, etc.
Slide 5
Purpose: keep things running
BCP and DRP differences
◦ Who’s responsible?
Why have a DRP?
What could go wrong?
Examples:
◦ ThaiCom
◦ Kobe Earthquake
◦ September 11
The purpose of a BCP is to make sure operations continue. Inform students that companies without an actively developed and rehearsed plan stand a greater risk of never recovering from a disaster and going under. Some studies indicate that up to 60% of companies which experience a disaster will never come back.
Explain that a BCP is more of a top-level overview of all operations, while a DRP is very much IT focused. The DRP is a subset of the BCP which focuses on information technology resources: it is used to ensure that essential IT resources continue running so that operations can continue.
Examples of why a BCP is important:
•Immediately after the Kobe Earthquake, disaster response was slow because the building containing the command room had collapsed. This is an example of planning without taking a wide range of risks into consideration.
•During the Thai unrest of 2010, protestors marched on ThaiCom to demand that it restore satellite links to their TV station. Although ThaiCom had one main site and one backup site, protestors blockaded both of them, disrupting business and communications. This scenario had not been considered, and no further (secret) backup sites were available to take the load.
•During the September 11 attacks, hundreds of companies lost operational capability when the World Trade Center collapsed. Many larger companies had backup facilities, some located in a different country, which they could activate immediately. This allowed business to continue, saved the companies from going under and saved the jobs of their employees.
Slide 6
Explain that this is a model used by ISACA as best-practice for planning and implementing a BCP. Inform the students that we will be covering each area.
Slide 7
Slide 8
Gain management support
Business Continuity Policy
Purpose:
◦ Management led. ◦ Why not IT led?
◦ High level statement
◦ Internal: empowering action ◦ External: informing stakeholders, regulators ◦ Vision and principles of business continuity
Explain that the BCP needs to have the support and sponsorship of top management. There is a misunderstanding that BCP/DRP should be IT led, but that would set the plan up for failure.
Firstly, IT is a support function and does not understand the business as well as those who actually work in the business. For example, IT will not know which key business processes allow a company to continue running. When IT leads BCP/DRP, it usually focuses only on the technical aspects of keeping servers running, without considering whether those servers contain the applications which are a priority for business continuity following a disaster. It also does not consider the other essential aspects of a BCP, such as the key people in the business or even the forms (such as printed receipts and invoices) which are required. As a result, business grinds to a halt and financial losses are incurred.
Secondly, sponsorship from C-level demonstrates that top management takes BCP seriously, which makes it easier to obtain the cooperation of all departments.
The first document required to demonstrate top management support is the Business Continuity Policy, which is a high-level statement of the to-be state. Internally, it empowers action and provides a vision of what should be. Externally, it provides stakeholders with reasonable assurance that the business will continue to run, and it may also satisfy regulatory requirements.
Slide 9
Slide 10
Identify key processes
◦ Staff, infrastructure and resources supporting ◦ Identify threats ◦ Identify probability ◦ Evaluate existing countermeasures
Impact
Risk
Probability
Once a plan is developed, a company will need to identify the key processes that are essential to the business. The diagram to the right shows risk rising with both probability and impact: the more important the process, the bigger the impact of a disruption and therefore the higher the risk.
When key processes are identified, the planner will need to drill down to identify everything that is required in those processes, including people, equipment and necessary infrastructure. Existing threats and their probabilities, as well as any existing countermeasures, should be included in this analysis. This gives a clear picture of what is at risk, and whether any effort has been made to mitigate these risks.
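One common way to make the relationship between risk, probability and impact concrete is to score each threat as probability multiplied by impact. The sketch below is illustrative only: the 1 to 5 scales, the process names and the scores are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    process: str      # key business process affected
    name: str         # short description of the threat
    probability: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int       # assumed scale: 1 (negligible) to 5 (crisis)

    @property
    def risk(self) -> int:
        # Simple multiplicative score; other weightings are possible.
        return self.probability * self.impact


def rank_threats(threats):
    """Return the threats sorted from highest to lowest risk score."""
    return sorted(threats, key=lambda t: t.risk, reverse=True)


# Hypothetical entries for illustration.
threats = [
    Threat("Purchasing", "Server room flood", probability=2, impact=5),
    Threat("Purchasing", "Brief power cut (UPS in place)", probability=4, impact=1),
    Threat("Payroll", "Key staff unavailable", probability=3, impact=3),
]

for t in rank_threats(threats):
    print(f"{t.risk:>2}  {t.process}: {t.name}")
```

A ranking like this is only a starting point for discussion with the business; existing countermeasures would normally lower the probability or impact score of a threat.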
Slide 11
Categorize risks based on damage estimation For example:
Once the key processes and risks are identified, classify them based on the estimated damage a disruption could cause:
•Negligible: no significant damage, e.g. a brief OS crash or power cut (with UPS).
•Minor: no financial impact or any other notable impact.
•Major: negative material impact; may affect other systems, departments or clients.
•Crisis: major incident which could affect the functioning of the business, other systems or third parties.
Slide 12
Slide 13
Used to evaluate critical processes and supporting IT components
Determine:
◦ Time frames
◦ Priorities
◦ Resources
◦ Interdependencies
Once the key processes and risks are understood, the next step is to perform a business impact analysis. This is a drill down of the key processes identified to understand:
•Time frames: how long can a business go without this process?
•Priorities: which processes need to be restored immediately? Which can wait a few minutes, hours, days, or months?
•Resources: what resources (people, equipment and infrastructure) will be needed?
•Interdependencies: does this process rely on any other processes for it to function? Conversely, do other processes rely on this process to work?
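The interdependency question can be sketched as a dependency graph: if each process lists the processes it relies on, a topological sort gives a restore sequence in which nothing is brought back before the things it needs. This is a minimal sketch using Python's standard `graphlib` module (Python 3.9+); the process names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical map: each process -> the set of processes it depends on.
depends_on = {
    "Order entry": {"Network", "Customer database"},
    "Invoicing": {"Order entry"},
    "Customer database": {"Network"},
    "Network": set(),
}


def recovery_order(dependencies):
    """Return a restore sequence in which every process follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = recovery_order(depends_on)
print(order)
```

If two processes depend on each other, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding for the analysis.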
Slide 14
Requires:
◦ High level of support from senior management ◦ Extensive involvement of IT ◦ End-user personnel
Defining level of information resources
◦ Approved by senior management
Emphasize that everyone, including senior management, IT and end-users (who understand the business processes the best), needs to be involved to produce a comprehensive plan.
Slide 15
Fixed costs, i.e. warm site
The faster the recovery, the more expensive it is.
Slide 16
Inform students that there is a clear distinction between RTO and RPO.
RTO: acceptable downtime
◦ Less: Low disaster tolerance ◦ More: Higher tolerance
Inform students that a faster recovery time means a higher recovery cost, while too long a recovery time may cause losses so great that operations cannot continue. There is a center point at which the cost and recovery time are at equilibrium, and that is determined by this analysis.
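The trade-off can be illustrated by adding the cost of each recovery option to the loss incurred while waiting for it, and picking the lowest total. All figures below are hypothetical, and the sketch assumes a single disaster over the period being costed.

```python
# Hypothetical options: (name, recovery time in hours, cost of the option).
strategies = [
    ("Mirrored site", 0.5, 900_000),
    ("Hot site", 4, 400_000),
    ("Warm site", 24, 150_000),
    ("Cold site", 120, 40_000),
]

DOWNTIME_COST_PER_HOUR = 5_000  # assumed loss per hour of interruption


def total_cost(recovery_hours, strategy_cost):
    """Cost of the option plus the loss incurred while recovering."""
    return strategy_cost + recovery_hours * DOWNTIME_COST_PER_HOUR


def cheapest(options):
    """Return the option with the lowest combined cost."""
    return min(options, key=lambda s: total_cost(s[1], s[2]))


best = cheapest(strategies)
print(best[0], total_cost(best[1], best[2]))
```

With these assumed numbers the middle option wins; a higher hourly downtime cost would push the answer toward the faster, more expensive options.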
What technologies could be used?
RPO:
◦ Permissible amount of data loss
RPO: Standing for Recovery Point Objective, this describes the point to which data must be restored, or the amount of acceptable data loss. If the RPO is 1 hour, then it means we can afford to lose 1 hour of data prior to the point of interruption. If the RPO is zero, as it would be in banks, then no data loss is acceptable.
RTO: Standing for Recovery Time Objective, this describes the acceptable amount of downtime after an interruption. If the company has low reliance on IT, such as a factory, then the RTO will be high, meaning there is higher disaster tolerance. Financial clients often have a low RTO, which means they have less tolerance for interruption.
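As a rough worked example of the distinction, the sketch below checks a backup schedule against an RPO and an estimated recovery time against an RTO. It assumes simple periodic backups, where the worst-case data loss equals the interval between backups; the function names and figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, everything since the last backup can be lost."""
    return backup_interval


def meets_objectives(backup_interval, estimated_recovery_time, rpo, rto):
    """Check a backup schedule and a recovery estimate against the RPO and RTO."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": estimated_recovery_time <= rto,
    }


# Nightly backups and a six-hour restore, measured against a 1-hour RPO
# and an 8-hour RTO (all values assumed for illustration).
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_recovery_time=timedelta(hours=6),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=8),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

Here the restore is fast enough for the RTO, but nightly backups cannot satisfy a 1-hour RPO, which shows why the two objectives have to be considered separately.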
ASK students what backup options would be good for a company with low, medium or high RPO. The answers are on the next slide.
Slide 17
Slide 18
Slide 19
Common strategies:
◦ Cold sites
◦ Warm sites
◦ Hot sites
◦ Mirrored sites
Other possibilities:
◦ Mobile sites
◦ Reciprocal agreements
Depends on your RTO
Inform students that there are a number of recovery strategies, which are dependent on the company’s RTO:
•Cold sites: a site is available with space and basic infrastructure, but equipment such as servers, tables, printers and routers will need to be brought in and set up prior to activation. Usually for companies with high RTOs.
•Warm sites: have basic space with infrastructure, some server equipment and communications, but usually only enough to sustain critical operations on a limited basis. Some setup of software and data synchronization may be needed.
•Hot sites: have all the required equipment and infrastructure to allow immediate activation during a disaster, but are only minimally staffed. They usually also contain sufficient desk space so that people transferred from the main company site to the hot site can work immediately.
•Mirrored sites: have all required equipment and infrastructure and are fully staffed. All software is already loaded and data is synchronized on a regular or real-time basis, meaning that operations can continue with no noticeable interruption for users.
•Mobile sites: contain a full set of equipment to allow operations. Usually stored in a storage container and mounted on a van or truck so that it can be moved into place when necessary. However, this plan will also need to consider other infrastructure such as communications, power supply and road access.
•Reciprocal agreements: agreements between similar companies to share their infrastructure in the event the other goes down. These are difficult to implement, and there are also concerns about data confidentiality and the expense of maintaining sufficient capacity to run two companies at the same time.
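Since the choice among the four common strategies depends on the RTO, the selection can be sketched as a simple lookup. The hour thresholds below are hypothetical placeholders, not values from the slides; each organisation would set its own based on its business impact analysis and budget.

```python
# Hypothetical thresholds in hours, checked from the strictest RTO upwards.
SITE_BY_MAX_RTO = [
    (0.25, "Mirrored site"),
    (4, "Hot site"),
    (72, "Warm site"),
]


def suggest_site(rto_hours: float) -> str:
    """Suggest the least demanding site type that still meets the given RTO."""
    for max_rto, site in SITE_BY_MAX_RTO:
        if rto_hours <= max_rto:
            return site
    # Anything more tolerant than the last threshold can use a cold site.
    return "Cold site"


for rto in (0, 2, 48, 200):
    print(rto, "->", suggest_site(rto))
```

Mobile sites and reciprocal agreements are left out of the lookup because their suitability depends on factors other than RTO, such as road access or a partner's spare capacity.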
Slide 20
1. What should we consider in placement of a recovery site?
2. We can either own the site, or outsource it. What are some of the benefits and disadvantages of each option?
3. If we outsource, what are some of the things we would need to consider?
Activity: Ask students to discuss the questions on the slide, and present to the class.
Spend 8-10 minutes to discuss
Slide 21
1. What should we consider in placement of a recovery site?
◦ Site should be located far beyond the immediate geographical area, so that it is not affected by disruptive events considered in the plan.
◦ In the event of a large-scale disaster, consider that other companies will be trying to restore processing as well.
◦ September 11th: New York to Jersey, or London
Notes: After September 11th, some companies had recovery sites as far away as London and Singapore
Slide 22
2. We can either own the site, or outsource it. What are some of the benefits and disadvantages of each option?
Own Site
◦ Advantages:
Prevention of conflicts
Speed and accessibility
◦ Disadvantages:
Higher cost
Needs permanent staff
Slide 23
2. We can either own the site, or outsource it. What are some of the benefits and disadvantages of each option?
Outsource
◦ Advantages:
Lower cost
Responsibility on outsourcer
◦ Disadvantages:
Risk of not receiving full support during a disaster (make sure to have a solid contract)
Slide 24
3. If we outsource, what are some of the things we would need to consider?
• Configurations
• Disaster definition
• Access: share?
• Priority and preference
• Availability
• Speed of availability
• Subscribers per site
• Insurance
• Usage period
• Communications
• Warranties
• Audit
• Testing
• Reliability
• Security
Some notes:
•Configurations: would we need to configure the software and hardware before being able to use it?
•Disaster definition: does the contract state what is considered a disaster? If not, we may not receive full support.
•Access: will we need to share facilities with other companies, especially if the disaster is widespread? Will there be enough space, power and communications capacity?
•Priority and preference: will we receive full and priority support?
•Speed of availability: can we activate the recovery site immediately?
•Subscribers per site: how many people are at that site? If every company in the area activates its recovery plan, will there be enough facilities?
•Audit: do we have the right to go in and audit at any time we want, to see if they are in compliance with our internal controls?
Slide 25
How can we ensure that we have sufficient hardware for warm and cold sites?
What happens if we have specialist hardware, e.g. telecoms?
What to do about network and communications? Bandwidth? Voice calling?
Slide 26
Tight contracts with vendor to meet RTO
Buy and store equipment in advance
Ask the students these questions to see if they come up with any solutions. Answers are on the next slide.
There are two options. Tight contracts mean the vendor is obligated to provide equipment within the time frame stipulated: essential when you need to recover quickly but do not have a hot or mirrored site. Option two means we buy the equipment and keep it ourselves for emergencies, which is potentially expensive.
Slide 27
Slide 28
What should be in a DRP? ◦ You tell me
Spend 15 minutes brainstorming
Slide 29
What else should we have at the disaster recovery site?
Ask the students to form groups and brainstorm what topics should be in a DRP. Please see attached Word file for a list of recommended topics, which can be used as a handout to students.
Ask the students what should be stored at the DR site. Answers are on the next slide.
Slide 30
What else should we have at the disaster recovery site?
◦ Copies of the BCP and DRP
◦ Work desks with enough equipment to work
◦ Telecommunications equipment
◦ Access to money, i.e. check book
◦ Other important documents, such as invoices, company letterhead and order forms
◦ Cryptographic devices (e.g. RSA tokens or USB authentication keys)
Slide 31
Basically ask yourself:
◦ In the event the world comes crumbling down, will you be able to recover and continue operations?
Slide 32
Slide 33
There are three types of tests:
◦ Desk-based evaluation/paper test ◦ Preparedness test ◦ Full operational test
Increment the testing type prior to conducting a full operational test
Test at least once a year
The three types of tests are:
•Desk-based evaluation/paper test: involves the major players in a discussion of how the plan would proceed.
•Preparedness test: a small-scale test in which disruptions are simulated. This test can be used on various parts of the plan to see if they work, and is good preparation for the full operational test.
•Full operational test: a full-scale simulation of a worst-case scenario, in which actual systems are often shut down to see if the plan works. The first two tests must be done many times prior to this test, in order to save time and costs.
Slide 34
Document EVERYTHING about the test
Measure the results by looking at:
◦ What went wrong? What worked?
◦ Time taken
◦ Amount of work completed at backup site
◦ Number of systems, records successfully recovered. Actual equipment received from vendor.
◦ Accuracy of data processing at recovery site
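The measures above can be captured in a small record so that successive tests are comparable. The sketch below is one possible shape, with hypothetical field names and figures; it summarises a test as the share of systems and records recovered and whether the time taken was within the RTO.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    minutes_taken: float    # time from start of the test to resumed processing
    systems_planned: int    # systems the plan says should be recovered
    systems_recovered: int  # systems actually recovered
    records_expected: int   # records that should exist after restore
    records_recovered: int  # records actually restored


def summarise(result: DrillResult, rto_minutes: float) -> dict:
    """Turn the raw measurements into figures that can be compared across tests."""
    return {
        "within_rto": result.minutes_taken <= rto_minutes,
        "systems_recovered_pct": 100 * result.systems_recovered / result.systems_planned,
        "records_recovered_pct": 100 * result.records_recovered / result.records_expected,
    }


# Hypothetical figures for one test run against a 4-hour (240-minute) RTO.
summary = summarise(DrillResult(310, 12, 9, 50_000, 49_500), rto_minutes=240)
print(summary)
```

Numbers like these do not replace the written record of what went wrong and what worked; they only make it easier to see whether the plan is improving from one test to the next.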
Slide 35
Based on test results, improve the plan
Review plan at least once a year
◦ Why? What could change?
Slide 36
Slide 37
Keep recovery site discreet and don’t publicize!
Keep backup tapes offsite securely
Assume the worst could occur
It’s important to keep the recovery site secure. In the ThaiCom example given at the beginning of the slide deck, because protesters knew the location of the backup site, they were able to surround it and disrupt the company’s entire operations.
Backup tapes should be stored in a fireproof and waterproof safe at an offsite facility. Some companies store them in a vault at a local bank, while others outsource this function to professional firms which specialise in it. Emphasize that transport of tapes must also be done securely.
Slide 38
Slide 39
Answer the questions in the hand out
Lecture delivered at the University of the Thai Chamber of Commerce on February 12, 2011
Presented by Karn G. Bulsuk http://www.bulsuk.com