Oracle Database Pilot Light Disaster Recovery in AWS Overview
Author: Jesse DeVries | | January 7, 2020
Organizations realize the importance of having a strategy to recover from a disaster but not all can afford the tremendous expense and investment it takes to build a disaster recovery site that mirrors production. What’s worse is that infrastructure created for disaster recovery purposes is often under utilized or maybe not utilized at all until a true disaster occurs.
You may have heard the terms “pilot light” or “warm standby” in recent conversations regarding disaster recovery. These terms describe a concept for maintaining a disaster recovery environment with minimal set up time and infrastructure investment. Cloud computing has made this easier to accomplish with rapid scaling to meet work load demands.
A small cloud instance can be created that is just large enough to apply transactions to a database image. During normal operations, the instance is kept small to minimize the cost. When disaster occurs, resources are provisioned to handle a production workload.
In this blog, I’ll describe the process we used to create a pilot light to maintain a warm standby database.
Build a “pilot light” disaster recovery (DR) Oracle Database environment in AWS. The AWS instance continually applies transactions from the primary database to keep the database in sync within four hours of the production database. Database archive logs from production are replicated to the pilot light instance and periodically applied to a manual standby database. In the event of a disaster, the pilot light instance can be quickly recovered to the recovery point objective (RPO) and then scaled up to a compute capacity capable of supporting a production application workload.
It should be noted that this method does not utilize an actual Oracle Data Guard instance. All processes for transporting and applying transactions to the “warm” standby database are manual so that it can be used with Oracle Database Standard Edition.
This is ideal for:
- Organizations with a single data center
- Limited time to create a disaster recovery database
- Desire to utilize cloud infrastructure to avoid investment in costly disaster recovery hardware that may never be used or sits idle much of the time
- Oracle Standard Edition or Enterprise Edition licensed sites since it uses manual processes to transport and apply logs to the standby
Primary Site Overview:
- Oracle 18c Standard Edition 2 RAC
- Hosted on two physical servers on-premises
- Oracle Linux Server release 6.9
- Supports production application
- ~2.5TB database footprint
- Shared file system mounted on both RAC hosts to facilitate transport of archive logs to pilot light instance. Archive logs staged in this location in preparation for transport to the AWS instance and the warm standby database.
- Recovery Point Objective (RPO): 4 hours
- Recovery Time Objective (RTO): 4 hours
- Note: application components are not in scope
DR Database Site Overview:
- AWS EC2 instance
- Oracle Linux Server release 6.9
- Oracle 18c Standard Edition 2 single instance software installed
- Matching file systems and storage capacity
- Additional storage allocated for transferring a full database backup to standby site to instantiate the standby database. Note that this space will be used for transferring and maintaining 15 days of archive logs generated from primary database once the standby has been built and is keeping in sync with primary database.
Building the Standby and Mechanisms to Keep in Sync:
To get started, the AWS instance and storage was provisioned. File system names from the primary were mimicked on the standby site to keep things simple. Network and security access were configured to allow connectivity between the on-premise site and the AWS instance.
The shared storage between the nodes of the primary database was also configured. In this case, nfs was used as it was already available and just needed additional space allocated.
Next, I wrote a shell script and scheduled it in the oracle crontab of one of the primary-site RAC nodes. Using archive log sequence number and a record of the last log generated, the script would wake up every 5 minutes to determine if there were additional archive logs generated by either thread since the last time the script ran. If so, ASM cp commands were generated and executed to copy these archive logs to the shared storage mounted to both RAC nodes. The last log sequence place holder would be updated ready for the next run.
NOTE: I also stored the script on the shared storage and had the job in the oracle crontab on both nodes so that it could easily be executed from either node if necessary.
On the AWS side, I scheduled a very simple script that ran an rsync command every 5 minutes to monitor for new archive logs placed into the staging zone on the primary site and synced to the pilot light site. I eventually added logic to ensure that only one job would run at a time. Otherwise in times of high redo generation on the primary site, the rsync job would start another copy of the same logs before the previous had finished.
Once I had the log transfer jobs running smoothly, I added a log purging script so that only a given number of days of logs were being maintained on both the primary and standby site.
Ready to build the standby database! A full RMAN backup of the primary database was taken and transferred to the AWS standby instance. In this case, I used Tsunami-UDP to manage this transfer since it was quite large. This worked reasonably well but there are other tools available for transferring files to an AWS instance.
Once the database had been restored, it was simply a matter of pointing the standby instance to the location to where the archive logs from the primary were being copied and recovering through whatever logs existed. Once it had completed the initial recovery after the restore, I scheduled a script to attempt further recovery every 30 minutes.
I also added logic to this recovery script to check lag and apply rate (v$recovery_progress is a nice view for this). An email notification was sent if the lag was more than three hours so that any issues with the standby processes could be resolved before it went over the four-hour RPO requirement. I also sent a daily formatted html report with color-coded GREEN and RED to management to give them peace of mind that the pilot light instance was in sync and happy!
Once the pilot light was in place, the AWS instance was scaled back to the minimum size just so that it could keep up with peak periods of database transactions from the primary site. At this point, it was deemed operational pending a disaster recovery test.
This took approximately two weeks to set up and management is very pleased to receive a report every day that shows a pilot light standby database is within at worst four hours of being in sync with the primary production database!
Successful DR strategy requires a rock-solid failover plan. Undoubtedly, the database is a critical component in disaster recovery plans but there are likely many components needed in order to successfully run production operations in a disaster scenario. A test should be performed on a regular basis to ensure validity of the disaster recovery plan.
In this case, the failover plan included the following activities related to the database:
- Ensure all available logs from primary were transferred to pilot light standby
- Perform final recovery and verify time of most recent transaction that was applied
- Open the standby as primary (if the primary site has been lost, start full back of the new primary database!)
- Start the listener and check status of database services used by the application
- Make DNS changes for client applications that connect to the database
- Scale up resources on the AWS instance to meet production work load
I added a couple of steps after the pilot light was operational for convenience in the event of a disaster recovery scenario:
- Generate a copy of the parameter file of each instance and transfer it to the pilot light site
- Copy the password file to the pilot light site
- Copy a backup of the control file and trace control file to the pilot light site
- Create a backup of crontab from both nodes and copy to pilot light site
Take the time to ensure that if a failover to a disaster recovery occurs, infrastructure is in place that will allow it to run as a production mode. For example: Is there database and server backup capability or how will the database and host be monitored?
If your organization is looking for assistance regarding a disaster recovery strategy or if disaster has already struck your databases, please get in touch with an expert and we can help provide peace of mind to ensure your data isn’t lost for good.
The “ORA-12154: TNS:could not resolve the connect identifier specified” Oracle error is a commonly seen message for database administrators.
Learn how to fix common Log Shipping Failure errors in SQL Server. Includes step-by-step instructions, screenshots, and software script.