RTO and RPO explained
RTO and RPO overview for Cloud Storage and Backup
This article explains how to use RTO and RPO with your Business Continuity Planning (BCP).
What would you do if you had to restore a failed system and all its data?

Two of the most critical metrics every organisation and IT director must understand are RTO (recovery time objective) and RPO (recovery point objective).
These allow organisations to design and implement a robust disaster recovery strategy with a logical backup process, enabling them to restore every failed IT system within their targeted downtime parameters.
RTO and RPO values are of little use on their own and must be used together. If you are responsible for planning and implementing a disaster recovery plan, these two metrics will help you get there.
A DR plan using RTO and RPO should be as simple as possible.
A basic example is a sole proprietor who runs their business from a single laptop; a more complex one is an ISP with 1,000 servers in a data centre. Both are businesses which are severely impacted when business continuity fails.
Each should examine their own RTO and RPO and check if they are achievable.
How will RTO and RPO help my organisation?
When used together, these two metrics will specify how long a system can be offline and how much data can be lost before your business processes suffer significant harm.
When you agree on your metrics, you can benchmark your systems and processes to see if they comply with your needs.
Both will help you calculate how often your backups should run and what an acceptable restore time looks like. From there, you can design your data recovery process.
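As a simplified model of that relationship: if backups must finish before data older than the RPO can be lost, the interval between backup starts cannot exceed the RPO minus the time a backup takes to run. This is an illustrative sketch, not a sizing tool; the figures are assumptions.

```python
from datetime import timedelta

def max_backup_interval(rpo: timedelta, backup_duration: timedelta) -> timedelta:
    """Longest gap between backup starts so that, in the worst case,
    no data older than the RPO is ever lost."""
    return rpo - backup_duration

# With a one-hour RPO and backups that take ten minutes to complete,
# a new backup must start at least every fifty minutes:
print(max_backup_interval(timedelta(hours=1), timedelta(minutes=10)))  # 0:50:00
```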
RTO (Recovery Time Objective)

RTO is the maximum time allowed to restore a failed system to working order, and it must fall within a previously defined Service Level Agreement (SLA). Returning a system to normal operation after an outage is often called Business Continuity.
Scenario
A call centre (let’s call it Buddy Care) with 1,000 operators uses a customer management system to handle all customer enquiries. The system runs from a cluster of Linux servers, and the company can’t deal with any customer queries or orders while offline.
Permitted System Downtime
Senior management has estimated that they lose £10,000 in sales and £5,000 in customer churn when they can’t service their customers’ requirements for the first hour.
The cost of customer churn doubles every hour as customers become more frustrated, and both figures worsen if the system is offline during a promotion or at peak time.
RTO calculation
From these figures, the company sets its SLA for recovery from any system outage at one hour.
With the existing setup, a one-hour recovery is likely unattainable. Therefore, the system must be improved by adding clustering, replication, or a live failover.
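The downtime-cost model above can be sketched in a few lines. This uses the hypothetical Buddy Care figures from this article (£10,000 per hour in lost sales, £5,000 churn in the first hour, churn cost doubling every subsequent hour); it is an illustration of how to cost an outage, not a real pricing model.

```python
def downtime_cost(hours: int, sales_per_hour: int = 10_000,
                  first_hour_churn: int = 5_000) -> int:
    """Cumulative cost in GBP of an outage lasting `hours` whole hours."""
    total = 0
    churn = first_hour_churn
    for _ in range(hours):
        total += sales_per_hour + churn
        churn *= 2  # churn cost doubles every hour as frustration grows
    return total

for h in (1, 2, 4):
    print(f"{h} hour(s) offline: £{downtime_cost(h):,}")
```

Running the numbers like this gives management a concrete cost curve to justify the DR budget against.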
We have covered these services at the bottom of this article.
RTO conclusion
DR planning started because the company knew its existing system could not be recovered within one hour of a disaster. By working through the RTO, they now have a business case and a budget to improve their systems so the one-hour RTO can be met.
RTO: one hour
RPO (Recovery Point Objective)

Scenario
At Buddy Care, they decided they could lose a week of customer service calls but couldn’t lose any order details because these contain legal agreements and customer commitments.
The instinctive RPO statement is, “We need all data recovered”, and that is a perfectly normal response. However, the business continuity SLA (RTO) of one hour constrains the RPO: if recovering all the data takes 10 hours, it cannot be done within a one-hour RTO.
Permitted Data Loss
In an environment like Buddy Care’s, some data will always be lost after a restore, because transactions arrive every few seconds while the backup runs only once an hour.
It is commonplace for the live backup sets to contain the most recent data and for historical data to be stored in an infrequently accessed data set.
After the live data has been restored, historical data (6 months or more) is expected to take longer to restore.
The durability of that data isn’t in question; what is at stake is the availability of the service that serves it. This is the difference between the persistence of a service (is it available now?) and the durability of data (can we restore all of it?).
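The worst-case data-loss window follows directly from the backup schedule: a failure just before the next backup loses everything written since the last successful one. A minimal sketch, with illustrative timestamps:

```python
from datetime import datetime, timedelta

def data_loss_window(last_successful_backup: datetime,
                     failure_time: datetime) -> timedelta:
    """Everything written since the last successful backup is lost."""
    return failure_time - last_successful_backup

# With hourly backups, a failure just before the next run loses ~1 hour:
loss = data_loss_window(datetime(2024, 1, 7, 9, 0), datetime(2024, 1, 7, 9, 59))
print(loss)  # 0:59:00
```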
RPO is a significant figure: it gives senior management a way of mitigating and managing data loss after the event. They should know which types of data won’t be restored initially and whether they will be restored later.
Although this involves technology and maths, the underlying conundrum is the same for senior management.
RPO calculation (which data can be lost)
One week of customer service calls.
No order details can be lost.
RPO conclusion
RPO: one week of customer service calls. No order details can be lost.
Delegating the analysis
Because this is a critical business decision, senior management should decide on the RTO and RPO metrics.
The IT Team will be required to confirm what is possible with the existing systems and recommend what is required to achieve the RTO and RPO.
To recap
Your DR plan must be thorough and regularly tested.
If a backup takes 2 hours, don’t assume the restore will take 2 hours. During a restore, you might need to procure and replace hardware or rebuild a system from scratch before a restore can start.
These processes can take days if you haven’t envisaged what is required or don’t have access to replacement hardware.
What if your server goes offline at 06:00 on Sunday? Can you buy another one before Monday? What about restoring the operating system before restoring your databases and applications?
Without a thorough restore test, your DR plan will be nothing more than a dream. Ideally, you want to be doing test restores rather than live restores. If you are doing a live restore when a system is down, it is fair to say that someone has messed up somewhere.
Every PC and server you install nowadays runs extensive diagnostics and error checking. More than 80% of the DR situations we get involved with are caused by faulty hard disks that had been reporting impending failure for days, or even years.
When a disk in a RAID array fails, the RAID fault tolerance level manages things nicely until another disk fails and the array dies. We most commonly see RAID 5, 6, and 10 in use. These expensive systems can easily be kept healthy by swapping a disk at the first failure.
Motherboard and memory failures are different. With memory, we will normally see some ECC errors before it fails outright. For the others, mitigate with dual power supplies, spare parts, or even a spare server.
Your RTO and RPO calculations will dictate your budget. For example, do you keep a spare server, cluster your servers, or have live failover replication between your premises and data centres?
What else should be considered?
Now that we have the hardware DR plan written, let us automate things.
Backups can easily be automated, and a competent staff member should monitor the results. Each check should be logged somewhere so that management knows the logs are being reviewed thoroughly and not just given lip service.
Periodic restores are essential and probably the only method you have to prove you can restore your systems. It is not uncommon for a Sysadmin to move data within a server’s drives or to another server when extra storage is required. This should be communicated to the team who are responsible for data backups so the backup sets can be amended.
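A minimal sketch of that kind of automated check with an audit trail, so management can see the monitoring actually happened. The file path, log name, and maximum backup age here are assumptions for illustration, not part of any particular backup product.

```python
import logging
from pathlib import Path
from datetime import datetime, timedelta

# Every check is written to an audit log, pass or fail.
logging.basicConfig(filename="backup_checks.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def check_backup(backup_file: Path,
                 max_age: timedelta = timedelta(hours=25)) -> bool:
    """Return True if the backup file exists and is fresh; log the result."""
    if not backup_file.exists():
        logging.error("Backup missing: %s", backup_file)
        return False
    age = datetime.now() - datetime.fromtimestamp(backup_file.stat().st_mtime)
    if age > max_age:
        logging.error("Backup stale (%s old): %s", age, backup_file)
        return False
    logging.info("Backup OK (%s old): %s", age, backup_file)
    return True
```

A cron job could run this daily and page someone when it returns False.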
Some other issues we have seen when a restore is needed are:
- The encryption password is wrong or has been lost (without it, data cannot be restored).
- When rebuilding a new server, the system architecture is unknown (what disk sizes and partitioning did we use on the server before it crashed?).
- The media where our data is stored isn’t available (this used to be physical tapes, but nowadays is more likely to be insufficient download bandwidth for a cloud restore).
The only way to deal with this is through regular restores and benchmarking. Virtualisation and low-cost hardware make it easy to recover an entire server and return it to a restore lab so you can check if your RTO is maintained.
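A restore drill can be benchmarked with a simple timer around the restore procedure and a comparison against the agreed RTO. In this sketch, `run_restore` is a placeholder for your real restore job, and the one-hour RTO is the figure from the Buddy Care example.

```python
import time
from datetime import timedelta

RTO = timedelta(hours=1)  # the agreed recovery time objective

def run_restore() -> None:
    """Placeholder: substitute your real restore procedure here."""
    time.sleep(0.1)  # stand-in for the actual restore work

start = time.monotonic()
run_restore()
elapsed = timedelta(seconds=time.monotonic() - start)

if elapsed <= RTO:
    print(f"Restore completed in {elapsed}; within the RTO of {RTO}.")
else:
    print(f"Restore took {elapsed}; RTO of {RTO} MISSED - revisit the DR plan.")
```

Recording the elapsed time from each drill shows whether the RTO is still being maintained as data volumes grow.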
Deliver RTO and RPO in steps
1. Calculate your RTO and RPO values and have senior management approve them. Remember, these are business decisions, not solely IT decisions.
2. Decide whether the RTO and RPO can be achieved with the existing systems and backups. If yes, test it in your lab.
3. Identify other risks, such as power, substations, and networking. Your systems might be 100% bullet-proof, yet let down by single points of failure at your power or connectivity providers.
Don't rely on RTO and RPO to fix your problems
RTO and RPO are metrics to help you shape your DR strategy. Of course, you hope you never need to put them to the test.
With the correct planning, support from management, and a good IT budget, your IT system should be able to self-heal and preferably not fail in the first place.
Data corruption and ransomware can still occur, and when they do, the corruption spreads to every replicated storage location. Every sysadmin knows this, and with cybercrime and ransomware on the rise, it is increasingly commonplace.