Welcome!

Security Authors: Elizabeth White, Liz McMillan, Jason Thompson, Jackie Kahle, Xenia von Wedel

Related Topics: Security, Java, SOA & WOA, Linux, Cloud Expo

Security: Blog Feed Post

What Are Good RTO and RPO?

The recovery procedures should always be driven by the business needs

Experiencing downtime is not something that companies wish for but as we have seen lately it is something that we hear quite often about. Interestingly enough very few enterprises, especially in the Small and Medium Business area, spent enough time to work out good procedures for recovery of their IT systems and applications. The recovery procedures should always be driven by the business needs, and this is the part where lot of IT departments are failing and as a result the recovery turns out to be reactive procedure that is triggered by the issue, results in a chaotic recovery activities and ends up with post-mortem but no improvements after that. Putting more initial thought into the Business Impact Analysis (BIA) is a prerequisite for a good recovery procedures and defining the two main characteristics - RTO and RPO are crucial part of this process.

No_u_turn_sign

Let's start with the first one - Recovery Time Objective (RTO). RTO is defined as the duration of time within which the system or the service must be restored after disruption in order to avoid unacceptable consequences related to break in business continuity. The first thing that you need to have in mind about RTO is that it is an objective - this means that it is a target that you may not be able to achieve all the time. There are certain activities that you need to do during this time that may have variable duration. At a high level those are grouped in:

  1. Recognizing that there is a disruption - this may depend on your level of monitoring or lack of it and may involve manual checking of each system or service that participates in the business process
  2. Troubleshooting and identifying the failing system and/or service - this will depend on the level of diagnostics you have implemented and may also involve different people or teams
  3. Fixing the issue - depending on the root cause this can be as simple as rebooting the system to as complex as requiting code changes or even ordering new hardware
  4. Testing the fix - last but not least you need to make sure that the fix actually resolves the issue

In all those four activities the human factor is the most variable part. People need to be notified, updated, they need time to understand the issue, troubleshoot, code etc. The more automation you provide the less impactful the human factor is for the recovery time.

Once the system or services is brought back to operation though you need to determine what is the state of the data. This is where the next characteristic becomes important - Recovery Point Objective (RPO). RPO is defined as the period in which data might be lost from the system due to disruption without major impact to the business continuity. Although this is also objective you need to be more careful with this one. There are few things to think about here:

  1. Is data loss acceptable at all? In lot of cases the answer is no but there are situations in which you can tolerate loss of data.
  2. How to recover the data? Does it require copying, shipping backup tapes or manual entry of the data?
  3. How long will it take to recover the data? Two extremes are from few seconds required to repoint the system to a replica of the data on another server to requesting an off-site backup copy of the data
  4. How to test that the data is recovered? This can vary from automated tests to manual tests

Depending on your RPO your time to recover the business operations for your system may vary.

When thinking about Business Continuity (BC) you need to think about both components - recovering the operation of the system or service (RTO) and recovering the data to a point at which it is usable for the business (RTO). Both those actions need to take time that is less than the Maximum Tolerable Downtime (MTD) as we defined it in Determining the Cost of Downtime. In general though you should set your RTO and RPO in a way that you have a buffer of time for unexpected issues that may occur during recovery.

This post was first published on our company's blog.

Read the original blog entry...

More Stories By Toddy Mladenov

Toddy Mladenov has more than 15 years experience in software development and technology consulting at companies like Microsoft, SAP and 3Com. Currently he is a CTO of Agitare Technologies, Inc. - a boutique consulting company that specializes in Cloud Computing and Big Data Solutions. Before Agitare Tech Toddy spent few years with PaaS startup Apprenda and more than six years working on Microsft's cloud computing platform Windows Azure, Windows Client and MSN/Windows Live. During his career at Microsoft he managed different aspects of the software development process for Windows Azure and Windows Services. He also evangelized Microsoft cloud services among open source communities like PHP and Java. In the past he developed enterprise software for German's software giant SAP and several startups in Europe, and managed the technical sales for 3Com in the Balkan region.

With his broad industry experience, international background and end-user point of view Toddy has an unique approach towards technology. He believes that technology should be develop to improve people's lives and is eager to share his knowledge in topics like cloud computing, mobile and web development.