RTO and RPO explained
RTO and RPO overview for Cloud Storage and Backup
This article explains how to use RTO and RPO with your Business Continuity Planning (BCP).
What would you do if you had to restore a failed system and all its data?
RTO and RPO explained
Two of the most important metrics every organisation and IT director must understand are RTO (recovery time objective) and RPO (recovery point objective).
Together, these allow organisations to design and implement a robust disaster recovery strategy with a logical backup process, enabling them to restore every failed IT system within their targeted downtime parameters.
RTO and RPO values are of little use on their own and must be used together. If you are responsible for planning and implementing a disaster recovery plan, these two metrics will help you get there.
A DR plan using RTO and RPO should be as simple as possible.
A basic example could be a laptop from which a sole proprietor runs their business; a more complex example is an ISP with 1,000 servers in a data centre. Both are businesses that are severely impacted when business continuity fails.
Each should examine their own RTO and RPO and check if they are achievable.
How will RTO and RPO help my organisation?
When used together, these two metrics will specify how long a system can be offline and how much data can be lost before your business processes suffer significant harm.
When you agree on your metrics, you can benchmark your systems and processes to see if they comply with your needs.
Both will help you calculate how often your backups should run and what an acceptable restore time looks like. From there, you will know how your data recovery process should work.
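As a rough illustration, the sketch below (Python, with made-up figures) compares a measured restore time and backup interval against agreed RTO and RPO targets; the worst-case data loss is simply the gap between two backups.

```python
# A minimal sketch, not a real tool: benchmark measured figures against
# agreed RTO/RPO targets. All numbers below are hypothetical.
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    rto_minutes: float  # maximum tolerable downtime
    rpo_minutes: float  # maximum tolerable data-loss window


def check_compliance(targets: RecoveryTargets,
                     measured_restore_minutes: float,
                     backup_interval_minutes: float) -> list[str]:
    """Compare a measured restore time and backup frequency against the targets."""
    findings = []
    if measured_restore_minutes > targets.rto_minutes:
        findings.append(f"RTO missed: restore takes {measured_restore_minutes} min, "
                        f"target is {targets.rto_minutes} min")
    # Worst-case data loss equals the gap between two backups.
    if backup_interval_minutes > targets.rpo_minutes:
        findings.append(f"RPO missed: backups run every {backup_interval_minutes} min, "
                        f"target is {targets.rpo_minutes} min")
    return findings or ["Targets met"]


# Example: a 60-minute RTO and 15-minute RPO, but restores take 10 hours
# and backups only run hourly.
print(check_compliance(RecoveryTargets(60, 15), 600, 60))
```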
RTO (Recovery Time Objective)
RTO is the time it takes to restore a previously working system, and this must fall within a previously defined Service Level Agreement (SLA). Returning a previously working system to normality after an outage is often called Business Continuity.
Scenario
A call centre (let’s call it Buddy Care) with 1,000 operators uses a customer management system to handle all customer enquiries. The system runs from a cluster of Linux servers, and the company can’t deal with any customer queries or orders while offline.
Permitted System Downtime
Senior management estimates they lose £10,000 in sales and £5,000 in customer churn during the first hour they can’t service their customers’ requirements.
The customer churn cost doubles every hour as those customers become more frustrated. These values are affected further if the system is offline during a promotion or at peak times.
RTO calculation
From these figures, the company decides its SLA for recovery from any system outage is one hour.
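As a back-of-the-envelope illustration of those figures (£10,000 in lost sales per hour, plus churn that starts at £5,000 and doubles each hour), the cost of downtime grows very quickly:

```python
# Rough, purely illustrative arithmetic based on the Buddy Care figures above.
def downtime_cost(hours_offline: int,
                  sales_per_hour: int = 10_000,
                  initial_churn: int = 5_000) -> int:
    total = 0
    churn = initial_churn
    for _ in range(hours_offline):
        total += sales_per_hour + churn
        churn *= 2  # frustrated customers cost more each hour
    return total


for h in (1, 2, 4, 8):
    print(f"{h} hour(s) offline: about £{downtime_cost(h):,}")
# 1 hour: £15,000 / 2 hours: £35,000 / 4 hours: £115,000 / 8 hours: £1,355,000
```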
With the existing setup, a one-hour SLA is likely unachievable. Therefore, the system must be improved by adding clustering, replication or live failover.
We have covered these services at the bottom of this article.
RTO conclusion
They have started their DR planning by establishing that their existing system cannot be recovered within one hour of a disaster. By looking at RTO, they now have a business case and a budget to improve their systems so they can meet their one-hour RTO.
RTO: one hour
RPO (Recovery Point Objective)
Scenario
At Buddy Care, they decided they could lose a week of customer service calls but couldn’t lose any order details because these contain legal agreements and customer commitments.
The RPO statement says, “We need all data recovered”. That is a perfectly normal response. However, the business continuity SLA (RTO) of 1 hour dictates the RPO. If they need 10 hours to recover all data, that cannot be done with an RTO of 1 hour.
Permitted Data Loss
In an environment like Buddy Care’s, data will always be lost after a restore because the transactions happen every few seconds, but the backup only runs once every hour.
It is commonplace for the live backup sets to contain the most recent data and for historical data to be stored in another data set which is infrequently accessed.
After the live data has been restored, historical data (6 months or more) is expected to take longer to restore.
The durability of that data isn’t in question; what is affected is the persistence of the service which serves that data. This is the difference between the persistence of a service (is it available now?) and the durability of data (can we restore all the data?).
RPO is a significant figure, and it should be impressed on senior management that it is a way of mitigating and managing data loss after the event. They should be aware of what type of data won’t be initially restored and whether it will be restored later.
Let’s not get too technical here, though. Even though we are talking tech and maths, the conundrum for senior management is the same.
RPO calculation (which data can be lost)
- One week of customer service calls.
- No order details can be lost.
RPO conclusion
RPO: one week of customer service calls. No order details can be lost.
Who decides the RTO and RPO?
Because this is a critical business decision, the RTO and RPO metrics should be decided by senior management.
The IT team will be required to confirm what is possible with the existing systems and to recommend what is required to achieve the RTO and RPO.
To recap
Your DR plan must be thorough and regularly tested.
If a backup takes 2 hours, don’t assume the restore will take 2 hours. During a restore, you might need to procure and replace hardware or rebuild a system from scratch before a restore can start.
These processes can take days if you haven’t envisaged what is required or don’t have access to replacement hardware.
What if your server goes offline at 06:00 on Sunday? Can you buy another one before Monday? What about restoring the operating system before you can restore your databases and applications?
Without a thorough restore test, your DR plan will be nothing more than a dream. Ideally, you want to be doing test restores rather than live restores. If you are doing a live restore when a system is down, it is fair to say someone has messed up somewhere.
Every PC and server you install nowadays has extensive diagnostics and error checking built in. More than 80% of the DR situations we get involved with are caused by faulty hard disks that have been reporting impending errors for days, sometimes years.
A disk in a RAID array fails, and the RAID fault-tolerance level manages things nicely until another disk fails and the array dies. We most commonly see RAID 5, 6 and 10 in use. These are expensive systems and can easily be kept healthy by swapping disks when the first failure occurs.
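As an illustration only, a simple health poll like the sketch below (Python calling smartctl from the smartmontools package; the device names are examples, and you will need sufficient privileges) can catch a disk that is reporting impending failure before a second one dies:

```python
# A minimal sketch, assuming smartmontools is installed. Adjust device names
# for your own hardware; this is not a substitute for proper monitoring.
import subprocess

DISKS = ["/dev/sda", "/dev/sdb"]  # example device names

for disk in DISKS:
    try:
        result = subprocess.run(["smartctl", "-H", disk],
                                capture_output=True, text=True, check=False)
    except FileNotFoundError:
        print("smartctl not found: install smartmontools")
        break
    # smartctl -H prints an overall health assessment; ATA disks report PASSED/FAILED.
    healthy = "PASSED" in result.stdout
    print(f"{disk}: {'OK' if healthy else 'CHECK THIS DISK'}")
```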
Motherboard and memory failures are different. With memory, we will normally see some CRC errors appearing before it fails. For the others, consider dual power supplies, keeping spare parts or even a spare server.
Your RTO and RPO calculations will dictate your budget, i.e. do you keep a spare server, cluster your servers, or have live failover replication between your premises and data centres?
What else should be considered?
Now that we have the hardware DR plan written, let’s automate things.
Backups can easily be automated, and the results should be monitored by a competent member of staff. The action of staff monitoring backups should be logged somewhere so that management knows the logs are being checked fully and not just given lip service.
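One possible approach, sketched below with a purely illustrative file path and fields, is a tiny audit log that records who checked which backup job, when, and what they found:

```python
# A minimal sketch of an auditable "backup checked" log. The file name,
# job names and fields are hypothetical.
import csv
import datetime
import getpass

AUDIT_LOG = "backup_checks.csv"  # example location


def record_backup_check(job_name: str, status: str, notes: str = "") -> None:
    """Append a row recording who reviewed which backup job, when, and the result."""
    with open(AUDIT_LOG, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.datetime.now().isoformat(timespec="seconds"),
            getpass.getuser(),  # the member of staff doing the check
            job_name,
            status,
            notes,
        ])


record_backup_check("nightly-db-backup", "success", "spot-restored one file")
```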
Periodic restores are very important and are probably the only method you have to prove you can restore your systems. It is not uncommon for a Sysadmin to move data within a server’s drives or to another server when extra storage is required. This should be communicated to the team who are responsible for data backups so the backup sets can be amended.
Some other issues we have seen when a restore is needed are:
- The encryption password is wrong or has been lost (without it, data cannot be restored).
- The system architecture is unknown when rebuilding a new server (what disk sizes and partitioning did we use on the server before it crashed?).
- The media where our data is stored isn’t available (this used to be physical tapes, but nowadays is more likely to be insufficient download bandwidth for a cloud restore).
The only way to deal with this is through regular restores and benchmarking. Virtualisation and low-cost hardware make it easy to recover an entire server to a restore lab so you can check if your RTO is maintained.
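The sketch below (hypothetical figures throughout) shows the kind of benchmarking we mean: first estimate how long it takes just to pull the backup set back from the cloud, then time a test restore in the lab against your RTO.

```python
# A minimal sketch for a restore drill. The restore procedure itself is
# whatever your environment needs; only the timing and arithmetic are shown.
import time


def cloud_download_hours(data_gb: float, bandwidth_mbps: float) -> float:
    """Estimate download time: gigabytes converted to megabits, divided by link speed."""
    return (data_gb * 8 * 1000) / bandwidth_mbps / 3600


def run_restore_drill(restore_func, rto_hours: float) -> None:
    start = time.monotonic()
    restore_func()  # call your actual restore procedure here
    elapsed_hours = (time.monotonic() - start) / 3600
    verdict = "within" if elapsed_hours <= rto_hours else "OVER"
    print(f"Restore took {elapsed_hours:.2f} h ({verdict} the {rto_hours} h RTO)")


# Example: 2 TB of backups over a 100 Mbps line takes roughly 44 hours
# to download before the restore itself even starts.
print(f"Estimated cloud download: {cloud_download_hours(2000, 100):.1f} hours")
```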
Deliver RTO and RPO in steps
- Calculate your RTO and RPO values and have these approved by senior management. Remember, these are business decisions and not solely IT decisions.
- Decide if the RTO and RPO can be achieved with the existing systems and backups. If yes, then test it in your lab.
- Identify other risks, such as power, substations, networking, etc. Your systems might be 100% bullet-proof; however, you might be let down by single points of failure from your power or connectivity providers.
Don't rely on RTO and RPO to fix your problems
RTO and RPO are metrics to assist you in shaping your DR strategy. Of course, you hope you never need to put them to the test.
With the correct planning, support from management, and a good IT budget, your IT system should be able to self-heal and preferably not fail in the first place.
Data corruption and ransomware can still occur, and when they do, the corruption is simply spread amongst all the storage locations. Every sysadmin knows this; however, it is more commonplace nowadays with increasing cybercrime and ransomware.