A critical aspect of building and maintaining enterprise database systems is to incorporate disaster recovery (DR). A properly planned and set up disaster recovery goes a long way in recovering enterprise database systems from a major fault(s) and helps in keeping the business within the desired recovery point objectives.
This blog aims to discuss some things to keep in mind, while designing and testing DR solutions for large- and small-scale systems along with giving a practical example of a DR plan, implemented for a mission critical SQL Server database server.
Let’s get started!
Disaster Recovery Planning Objectives
The key areas that enterprise and systems architects need to consider are:
- The recovery point and time objectives, in the event of a disaster. This is the point in the past and the time duration within which the recovery must happen.
- The infrastructure of IT systems like database servers, application servers, key network components located across datacenters.
- The human resources required to affect a successful recovery after a disaster.
- Testing parameters to evaluate the successful recovery of the systems, post disaster.
Recovery Point Objective and Recovery Time Objective
The most important aspect that decides the DR planning and implementation process is the Recovery Point Objective (RPO). RPOs are basically the measure of the amount of data loss that a system or an organization can withstand. This translates into how frequently data is either backed up or replicated from a production system. RPOs vary between organizations and amongst the systems within an organization.
Critical systems would have tighter RPOs and can tolerate very little data loss. For example, a critical client facing application server and its backend database server, might have an RPO of 15 minutes, meaning if either or both systems fail, they can tolerate data or activity loss for up to 15 minutes only in order to maintain business continuity.
On the other hand, a data warehouse or an analytics system might have an RPO of 24 hours or up to a few days, as it may not be very critical to business continuity. Recovery Time Objective (RTO), on the other hand, dictates how soon the systems must be recovered for business continuity. This is a critical measure, in that, meeting the RTO successfully, needs proper planning, coordination and testing of recovery procedures.
Infrastructure is the next key aspect in planning for and implementing a solution that meets the RPO objectives at the organization and at the individual systems level.
Infrastructure is typically the provisioning of the following:
- Network components like switches, routers
- Rented space in a data center or dedicated data centers for large organization
- Software media and their associated licenses
- Storage media like tapes or cloud storage locations etc.
This is a critical, yet often overlooked aspect of the DR process. While designing and implementing a fault tolerant system is critical to a DR so are the types of support that is needed from the IT staff.
The teams below play a critical role in effecting a DR process:
- Systems Engineers
- Application Admins
- Network Teams
- User community
Their roles, availability and alternative resources should preferably be identified as part of the DR planning process.
A good DR plan is considered complete, if it is tested regularly and adjusted as needed. Testing also brings to light unexpected challenges that may not have been accounted for during the DR planning process, and periodically help learn if the infrastructure and method decided upon, are still valid for systems and organizations that scale up and out constantly.
Testing typically involves the following:
- Identifying a disaster when it happens, such as having an alerting system, which can function independent of the systems and detect it being down.
- Contacting the required teams either directly or via a dedicated resource, such as an on-call person/system.
- Initiate failover, or in the case of systems with automatic failover, confirm the success and stability, after a failover happens.
- Carrying out associated tasks, such as reconfiguring connections, renaming DNS etc.
- A checklist of items that can be quickly performed, usually by any of the IT staff, to verify the systems have failed over and are functional, at a minimum level, at least. Examples would be testing logins, checking the functionality of application servers for their ability to connect to backend database systems.
Case in Point: Disaster Recovery Setup for a Mission-critical SQL Server Database Server
This section describes a DR set up for a mission critical SQL Server database server, identifying the hardware, the DR methodology implemented, the connectivity loss or disaster detection method, notification methods, failover and reconfiguration of systems, and testing.
Recovery Point Objective Requirements
The RPO for this database system is 30 minutes. The goal is to bring up the system at an alternative location, within 30 minutes of the primary site’s failure. Some failover methods that were considered to help achieve this are below:
- Log Shipping – A passive method to capture transactions from the primary server’s databases, using log backups and applying them to the corresponding databases on the secondary server every 15 minutes, so both servers are in near real-time sync.
- SQL Server Failover Clustering with a distributed node in the secondary database server. FCI is primarily a High Availability (HA) solution more suitable for instantaneous failover to mitigate a hardware failure. However, if parameters such as proximity of primary and secondary data centers, network bandwidth etc., are favorable, this can also be a DR solution.
- Taking and storing backups to an off-site location to be restored on an alternative SQL Server in the event of a disaster in the primary data center.
Among the above options, the first two meet the RPO needs while the third option is time consuming albeit consumes less resources. Based on the RPO of 30 mins, Log Shipping was chosen for its simplicity, robustness, and a minimal requirement of resources.
The DR solution spanned across two data centers located in two different cities. The Windows and SQL Servers were identical in version, edition, and hardware configuration. This was critical, since upon a failover, the secondary server is expected to handle the same workload.
The network bandwidth allocation was sufficient to facilitate seamless and timely copy of large transaction log back up files from the primary to the secondary server.
For the database server, the DBA team was primarily responsible for detecting failures in the primary server, coordinating with the other teams, to perform and validate a failover. This is done by the following steps:
- SQL Server DBAs – DBAs build and maintain the Log Shipping solution for the SQL Server. They also ensure that Log Shipping is monitored in real time and failover tasks are clearly defined. They perform the database server’s failover.
- Systems Engineers – The systems’ team set up and configure Windows Servers, provisioned shared file space for copying and restoring log file backups.
- Network Engineers – The network teams role was to set up the network infrastructure between the primary and secondary data centers. They also tested and implemented a DNS remapping and implement them in the event of a failover.
- Application Admins – The applications team designed and implemented a similar solution for the application servers spanning the primary and secondary data center. They failover the application servers and do the necessary reconfiguration after a failover.
- Users – The user groups are members of the regular user community that evaluate the application for their individual functionality, performance etc., They laid out what they consider acceptable performance of the secondary server, after a failover.
Testing and Monitoring Process
- Monitoring – The primary database server is monitored through a monitoring tool (Datavail Delta in this case). The monitoring tool checks for the databases being online, SQL Server services being online and the physical server itself remaining available for connection. If any or all of these fail, priority 1 level alerts are generated immediately to intimate the DBA team.
- Intimation to other teams – The DBA team, while checking these alerts, simultaneously contacts the on-call systems and the application team notifying them of the issue. These teams have their own monitoring as well at some level.
- Once a consensus is reached that the systems cannot be recovered in the primary data center, a decision is made to perform a failover.
- The SQL Server DBA team, failovers over the databases by recovering them on the secondary server. They also perform other documented steps such as fixing any orphaned users, setting up maintenance and backup jobs on the secondary server etc., once initial testing completes.
- The applications team does the same on the application side, if the application systems have failed too that necessitates a failover.
- The network team does a DNS swap between the primary and secondary servers, enabling applications and users to connect using the same configured server names for the SQL Server.
- The users, usually a limited number of them, test and evaluate the applications.
When all the above steps are achieved, the process of testing the DR solution is deemed a success. Any issues or delays that happen, are documented and treated as lessons learned. These are then incorporated into tuning the solution better.
In conclusion, a good DR plan should continue not only with a design that can meet the RPO needs, but also a robust setup, testing methodology and communication/co-ordination amongst the various IT teams and the business users. This will go a long way in mitigating chaos and/or confusion and enable business continuity in the event of a major failure. If you’re looking for support in disaster recovery planning and implementation, please reach out. Our expert DBAs can provide a solution that meets your needs in the event of a disaster.
The “ORA-12154: TNS:could not resolve the connect identifier specified” Oracle error is a commonly seen message for database administrators.
Our database experts explain how to recover and restore a table from an Oracle 12c RMAN Backup with this step-by-step blog. Read more.