Jodocus Building Effective DR Plan for SaaS Data

A walk through the core building blocks of creating a DR plan for SaaS applications

From Jodocus GmbH

For decades, a sound DR plan has been vital for businesses that rely on software to power their operations.

And the exodus from on-prem to the cloud has not changed that. If you rely heavily on SaaS applications – like Atlassian Jira, Jira Service Management or Confluence – to deliver your products and services, it’s essential to include specific steps for recovering critical data stored within these services in your DR plan. In this article, we’re going to walk you through the core building blocks of creating a DR plan for your SaaS applications.

Why is a DR plan necessary for SaaS data?
We get this question at least once a day from peers: “Why would I need a DR plan for the cloud, specifically SaaS?”

Here’s why: when businesses develop their own software, they have control over data privacy, security, and reliability. Cloud applications take away much of the responsibility of hosting your own solutions, but not all of it.

Figure: Shared Responsibility Model

It’s called the Shared Responsibility Model. It’s often associated with AWS, but the concept of shared responsibility governs all cloud computing – including Atlassian. Essentially, you and the cloud provider share the responsibility of protecting data. SaaS companies will guarantee everything except user access and data. You are on the hook for safeguarding those two things.

While some SaaS products offer backup and restore capabilities, they may lack fine-grained recovery controls, comprehensive data visibility through detailed audit logs, or guarantees of data safety. Therefore, having a DR plan for each SaaS product you depend on is essential.

Step 1: Define your RTO and RPO
The 2 key metrics that should drive your recovery strategy are the recovery point objective (RPO), which defines your tolerance for data loss, and the recovery time objective (RTO), which sets the benchmark for acceptable downtime.

The RPO metric revolves around one fundamental question: ‘How much data can you afford to lose?’ Here is how we often frame things: if you back up data once every 24 hours at midnight, and a disaster strikes at 11:59 pm, you would lose almost an entire day’s worth of data.
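
The midnight-backup scenario can be worked through in a few lines of Python. This is only a sketch: the function names, the example date, and the 4-hour RPO are illustrative assumptions, not figures from any specific product.

```python
from datetime import datetime, timedelta


def last_backup_before(disaster: datetime, backup_hour: int = 0) -> datetime:
    """Most recent daily backup (taken at backup_hour:00) at or before the disaster."""
    candidate = disaster.replace(hour=backup_hour, minute=0, second=0, microsecond=0)
    if candidate > disaster:
        # Today's backup has not run yet, so fall back to yesterday's.
        candidate -= timedelta(days=1)
    return candidate


def data_loss_window(disaster: datetime, backup_hour: int = 0) -> timedelta:
    """Span of changes lost when restoring from the last daily backup."""
    return disaster - last_backup_before(disaster, backup_hour)


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case loss equals the full backup interval, so it must fit within the RPO."""
    return backup_interval <= rpo


# Backup at midnight, disaster at 11:59 pm the same day (the date is arbitrary).
disaster = datetime(2024, 1, 15, 23, 59)
loss = data_loss_window(disaster)
print(loss)  # 23:59:00

# A hypothetical 4-hour RPO cannot be met with one backup every 24 hours.
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=4)))  # False
```

The point of `meets_rpo` is that the backup interval, not the typical case, determines the worst-case loss you have to plan for.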

That level of risk will be tolerable for some businesses but unacceptable for others. Defining your risk tolerance for data loss will guide your technical decisions in meeting these requirements effectively.

To determine your RPO, you’ll want to consider the criticality of your data and the impact of potential data loss on your organization. Different services or applications may have varying RPO thresholds. For instance, mission-critical systems may demand real-time replication with zero or near-zero data loss, while non-critical systems might tolerate a longer interval between backups.

The RTO metric focuses on how quickly you can recover from a disaster and resume normal operations. Imagine a scenario where a meteorite strikes your data center. How long would it take to get your systems back up and running? This involves factors such as procuring alternate infrastructure or restoring from backups. The time required for recovery varies depending on the service in question.

Some services can achieve rapid recovery times, potentially within minutes, while others may take considerably longer, perhaps an entire day or more. Understanding the unique recovery timelines for each service or application is essential for effective DR planning.
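
One way to keep track of per-service recovery timelines is to record how long a restore actually took in a drill and compare it with the agreed RTO. The sketch below assumes hypothetical service names, targets, and measurements.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryDrill:
    service: str
    rto: timedelta       # agreed recovery time objective (hypothetical value)
    measured: timedelta  # how long the restore took in a drill (hypothetical value)

    @property
    def within_rto(self) -> bool:
        return self.measured <= self.rto


drills = [
    RecoveryDrill("jira", rto=timedelta(hours=4), measured=timedelta(hours=2, minutes=30)),
    RecoveryDrill("confluence", rto=timedelta(hours=8), measured=timedelta(hours=11)),
]

# Services whose measured recovery time exceeds the target need a different approach.
missed = [d.service for d in drills if not d.within_rto]
print(missed)  # ['confluence']
```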

A key way to get started is by engaging with stakeholders across your organization to ensure their buy-in and agreement on the defined RPO and RTO targets. This collaboration will foster a shared understanding of the potential impacts of a disaster and the required recovery timelines.

Step 2: Choose your recovery strategy
Choosing the right strategy boils down to a trade-off between robustness and cost. Here’s a graphic from AWS, which also applies to products provided by Atlassian and helps illustrate the spectrum of choices:

Figure: Recovery strategy options, from backup and restore (left) to active/active (right)

Source: Disaster recovery options in the cloud – Disaster Recovery of Workloads on AWS: Recovery in the Cloud  

Let’s start on the left side with the backup and restore solution. It’s the simplest and most affordable option. However, the recovery time here can take hours, or even longer. Essentially, you’re restoring your latest backup(s) to your DR location.

Moving along, we have the Pilot Light option. With this approach, you keep only the essential services running, at reduced capacity. Most other services are deployed but scaled to zero. Code or application updates are pushed to the DR location just as you would update your primary location.

Next up is the standby strategy. Here, everything is up and running, albeit at a smaller capacity compared to your primary environment. It’s similar to the Pilot Light option, but all services are operational with at least some capacity – nothing scaled to zero.

Finally, we have the active/active solution. This is an approach where you run full services in 2 parallel streams, allowing you to switch between them in near real-time. However, it’s worth noting that this option comes with doubled costs, making it less feasible for many companies.

Your choice of recovery strategy depends on your risk tolerance and how much you’re willing to invest to mitigate that risk. Depending on your industry, you’ll have core systems that form the foundation of your operations, as well as peripheral systems. While a full day of downtime for a core system can be painful for most businesses, the impact may be less significant if it’s a tool for running marketing programs, gathering statistics, or another secondary service.

This means you’ll need distinct DR plans for various systems within your organization. While there may be overlapping elements between service DR plans, it’s important to consider each service’s plan due to potential variations in RPOs and RTOs.
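
To make the per-service trade-off concrete, the following sketch picks the cheapest of the four strategies that still fits each service’s RTO. The cut-off values and service names are purely hypothetical assumptions for illustration. Real recovery times depend on your environment and should come from your own tests.

```python
from datetime import timedelta
from enum import Enum


class Strategy(Enum):
    BACKUP_AND_RESTORE = "backup and restore"
    PILOT_LIGHT = "pilot light"
    STANDBY = "standby"
    ACTIVE_ACTIVE = "active/active"


# Hypothetical cut-offs, ordered from the most demanding RTO to the least.
THRESHOLDS = [
    (timedelta(minutes=5), Strategy.ACTIVE_ACTIVE),
    (timedelta(hours=1), Strategy.STANDBY),
    (timedelta(hours=4), Strategy.PILOT_LIGHT),
]


def cheapest_strategy(rto: timedelta) -> Strategy:
    """Return the least costly strategy whose assumed recovery time fits the RTO."""
    for limit, strategy in THRESHOLDS:
        if rto <= limit:
            return strategy
    return Strategy.BACKUP_AND_RESTORE


# One RTO per service: a core system and a secondary one (hypothetical names).
service_rtos = {
    "jira-service-management": timedelta(minutes=30),
    "marketing-analytics": timedelta(days=1),
}
plans = {name: cheapest_strategy(rto) for name, rto in service_rtos.items()}
for name, strategy in plans.items():
    print(f"{name}: {strategy.value}")
# jira-service-management: standby
# marketing-analytics: backup and restore
```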

Step 3: Test your DR plan
Here is a checklist to help you focus your efforts when testing your DR plan:

  1. Organize a tabletop test of your DR plan:
    A tabletop test is a great way to start by putting all the ideas and methods in front of everyone. It allows all stakeholders to have a say in how you should proceed with your DR efforts.
  2. Walk through any internal and external dependencies that could hinder or even prevent your DR plan from being fully effective. It’s vital to address and resolve these dependencies before implementing the plan.
  3. This meeting forms the foundation of your DR plan, so document everything thoroughly. It’s a good idea to get participants to sign off on the documentation to avoid any confusion later on.
  4. Create accountability lists:
    This identifies who is on call if a disaster disrupts your business; the people responsible for executing different phases of the DR plan. The list should be clearly outlined and updated so newer team members are aware of who’s going to be doing what in an emergency.
  5. Plan for different types of disasters:
    While you can’t plan for every possible disaster, it’s important to assess and prioritize the risks you face. Whether it’s malware attacks, data center outages or third-party provider outages, pick which ones you want to prepare for.
  6. Understand the cost of downtime and data loss:
    When SaaS tools are vital to processes and workflows, outages can drain productivity and cash. Atlassian’s 14-day outage in April 2022 is one such example, which ended up affecting over 50,000 users. The downtime resulted in the revocation of access to critical SaaS products like Jira, Confluence, and Opsgenie. It also resulted in the loss of data. By Atlassian’s own calculation, the average cost of downtime to customers is $5,600, but the actual cost does vary from business to business.
    Of course, preparing for such an event will incur costs on multiple levels, so be sure to carefully consider what failsafes you want to invest in and to what extent. A good place to start is backup and recovery software to keep critical operations up, even when part of the app isn’t functional.
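
Since the cost of downtime varies from business to business, a rough estimate for your own organization can help size the investment. The sketch below uses a simple model of lost productivity plus lost revenue. The formula and every input value are assumptions for illustration, not figures from Atlassian.

```python
def downtime_cost(
    hours_down: float,
    affected_users: int,
    hourly_cost_per_user: float,
    productivity_loss: float = 1.0,
    revenue_per_hour: float = 0.0,
) -> float:
    """Rough estimate: lost productivity plus lost revenue over the outage."""
    if not 0.0 <= productivity_loss <= 1.0:
        raise ValueError("productivity_loss must be between 0 and 1")
    labor = hours_down * affected_users * hourly_cost_per_user * productivity_loss
    revenue = hours_down * revenue_per_hour
    return labor + revenue


# Hypothetical inputs: an 8-hour outage, 50 users working at half productivity.
estimate = downtime_cost(
    hours_down=8,
    affected_users=50,
    hourly_cost_per_user=60.0,
    productivity_loss=0.5,
    revenue_per_hour=1_000.0,
)
print(f"{estimate:,.2f}")  # 20,000.00
```

Comparing this estimate with the yearly cost of a backup or standby solution gives a first indication of which failsafes are worth paying for.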

Recap: At a glance, your checklist for testing a DR plan should look something like this:

  • Understand why you need a SaaS DR plan.
  • Set your RPO and RTO.
  • Confirm with relevant stakeholders what business functions you want to protect and secure.
  • Decide on the internal and external tooling required to carry out a DR plan taking into consideration the security and privacy of your SaaS data.
  • Create clear and relevant accountability lists that illustrate who’s on call for what and in what situations.
  • Scope the types of disasters you’re planning for, because you can’t plan for everything.
  • Document the plan, make it easily accessible, and have the proper stakeholders sign off on it.
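
The checklist above can also be kept as a small machine-checkable record per service, so gaps are visible at a glance. This is a minimal sketch that requires Python 3.9 or later. The field names, the example service, and the person named are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional


@dataclass
class DRPlan:
    service: str
    rpo: Optional[timedelta] = None
    rto: Optional[timedelta] = None
    protected_functions: list[str] = field(default_factory=list)
    tooling: list[str] = field(default_factory=list)
    on_call: dict[str, str] = field(default_factory=dict)  # plan phase -> person
    disaster_types: list[str] = field(default_factory=list)
    signed_off_by: list[str] = field(default_factory=list)


def missing_items(plan: DRPlan) -> list[str]:
    """Return the checklist items that are still empty for this plan."""
    checks = {
        "RPO": plan.rpo is not None,
        "RTO": plan.rto is not None,
        "protected business functions": bool(plan.protected_functions),
        "tooling": bool(plan.tooling),
        "accountability list": bool(plan.on_call),
        "disaster types": bool(plan.disaster_types),
        "stakeholder sign-off": bool(plan.signed_off_by),
    }
    return [item for item, done in checks.items() if not done]


plan = DRPlan(
    service="confluence",
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
    on_call={"restore": "alice"},
    disaster_types=["third-party provider outage"],
)
gaps = missing_items(plan)
print(gaps)
# ['protected business functions', 'tooling', 'stakeholder sign-off']
```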

About Jodocus
An Atlassian Platinum Solution Partner, Jodocus is a specialist in optimizing business processes with Atlassian software. Process management, workflow optimization, and digitization are its 3 pillars, which the firm realizes in your company with its technical expertise.
