Tuesday, November 2, 2010

HA on Oracle SE?


Background
This blog post discusses our experience running the Oracle Standard Edition distribution of the database software in an environment that requires 24/7 uptime. I briefly describe the basic setup of our environment and then show how we added extra security to the solution, in order to minimize the loss of data in case of a disaster.

The architecture
Our database server architecture consists of seven active-passive Veritas clusters. We run two dedicated servers per cluster, per database. All cluster nodes are connected to two SANs (EMC CX3-20). We have a primary site and a secondary site. The sites are about 150 meters apart and are connected through fiber switches and single-mode fiber.
Each of the clusters has the following topology:



Veritas Storage Foundation is installed on every server, and we use stretched mirroring between the SAN boxes, so if a site is lost we can easily switch over to the secondary site. We implemented Oracle Recovery Manager during system setup, and all backup sets are stored on one of the mirrored SAN disks. TSM incremental backups ensure that all backup data is also stored on tape.

Considerations after upgrade
Experience in pre-production after the setup of the system indicated that the system was stable. The SAT showed the behavior of the system we had expected: if the eth3 cable was unplugged, the listener was migrated over to a virtual interface on eth0, and so on.

However, the vulnerability became clear to us one day when, for some reason, the fencing keys became invalid and neither of the nodes was able to take control of the two mirrored disk groups. (This is Symantec Storage Foundation functionality.) At that stage a simple workaround is to turn off one of the nodes, clear the fencing keys and reboot the other node. After the boot, it should be able to take control of the coordinator disks, import the disk group and seed the cluster. However, a bug in 5.0 prevented the node from doing so. We almost lost the entire disk group, which would have been a disaster since it contains both backup sets and archived logs. Symantec support managed to get the disk group online after a lot of troubleshooting. Still, we were disturbed by the fact that if the disk groups had been lost, we would have had to rely on a tape backup and accept maybe as much as 24 hours of data loss.
Adding some security
Our experience from almost losing the entire disk group in test made us rethink. In addition to placing all backups on the /backup file system, we created a RAID 0+1 volume on local SAS disks, allowing us to do two things (sketched below):
  1.  Run rsync to synchronize the RMAN backups to the local disks after each backup.
  2.  Run rsync to synchronize all archived redo to the local disks every 10 minutes.
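As a rough sketch (the local mount point /localdisk, the rman subfolder and the exact rsync options are examples, not our actual scripts):

# 1: copy the RMAN backup sets to the local RAID 0+1 volume after each backup
rsync -av --delete /backup/TVDB/rman/ /localdisk/TVDB/rman/

# 2: copy archived redo to the local RAID 0+1 volume, scheduled every 10 minutes
rsync -av /backup/TVDB/arch/ /localdisk/TVDB/arch/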

With this implemented, we started to feel a bit more confident that we would be able to restore the database while losing no more than 10 minutes of data plus the online redo.
Considering that this is an HA environment with a nearly 24/7 demand for uptime, and the fact that we depend on several components such as the EMC storage and the Veritas Storage Foundation software along with the actual cluster software, we decided that this wasn't sufficient. Our biggest concern was that if we had some sort of problem with the Storage Foundation software itself, or other problems around the cluster infrastructure such as the fiber or LAN switches, we might find ourselves in a situation where we had no hardware to restore to.
So, running the SE One distribution, did we have any more options?

Data Guard?

We decided to look into the "simple" Data Guard that was part of the SE product sheet until 11g Active Data Guard came along. Would it be possible to achieve additional security by adding a physical standby database to the stack?
Several software solutions exist on the market. These solutions build a layer on top of Standard Edition and take care of log transport and a fully automated standby database. We decided that we wanted to look into the basics ourselves, to see if it was possible to implement a solution sufficient for us.

An investigation showed that with some simple scripts and PL/SQL we were able to implement a physical standby database.
The following illustrates how we have implemented it. This is by no means the only way.


Force logging


The first thing to do is to enable “force logging” on the primary site. This needs to be done prior to initiating the backup that should be used to restore the database.
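On the primary this is a single statement (the check query is only there to confirm the setting):

SQL> alter database force logging;
SQL> select force_logging from v$database;   -- should return YES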

Standby control file

Create a standby controlfile from the primary node, copy it to the standby node and mirror it to another file system:
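For example (file names, paths and the standby host name are placeholders, not our actual values):

-- on the primary (TVDB)
SQL> alter database create standby controlfile as '/tmp/TVSB_standby.ctl';

-- copy it to the standby server and mirror it to a second file system
$ scp /tmp/TVSB_standby.ctl standby-host:/u01/oradata/TVSB/control01.ctl
$ ssh standby-host cp /u01/oradata/TVSB/control01.ctl /u02/oradata/TVSB/control02.ctl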

Parameter file
Create a parameter file (copied from the production site) and change the control_files parameter.
The db_file_name_convert and log_file_name_convert parameters should also be changed if the paths to the datafiles and logfiles differ on the two servers:
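A sketch of the relevant lines in the standby pfile (the paths are examples only, not our actual layout):

# hypothetical example paths
*.control_files='/u01/oradata/TVSB/control01.ctl','/u02/oradata/TVSB/control02.ctl'
*.db_file_name_convert='/oradata/TVDB/','/u02/oradata/TVSB/'
*.log_file_name_convert='/oradata/TVDB/','/u02/oradata/TVSB/'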






In our solution, the paths to the datafiles, logfiles and archive files are all different on the standby node from the production servers.
Because of that, we also needed to adjust the log_archive_dest parameter. The log_archive_format must be equal to the value it has in the cluster environment if the archived logfiles are to be copied without renaming:
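Again as a sketch with example values:

*.log_archive_dest='/backup/TVSB/arch'
# the format must match the value used on the cluster nodes
*.log_archive_format='TVDB_%t_%s_%r.arc'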






The parameter standby_file_management should be set to “MANUAL”. This allows the datafiles and logfiles to be renamed during the restore process.
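In the standby pfile this is just:

*.standby_file_management='MANUAL'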






After the restore process is finished and all the datafiles have been renamed, we set the parameter back to AUTO. When files are added to the database in the cluster environment, they are then automatically added to the recovery on the physical standby node as well.
Let's say we keep standby_file_management at MANUAL and for some reason forget to add a file on the standby node after it has been added on the primary site. If RMAN, due to the retention policy on the cluster nodes, clears old archived logs, and at the same time these files are synced to the standby node with a delete option, you might find yourself in a situation where the gap starts before the oldest archived log you have available on disk. Then you either have to restore the database once more or fetch the archived logs from tape. I believe there is a good chance of this occurring if we choose to leave standby_file_management set to MANUAL.

Other parameters that affect performance or memory allocation we keep as close to the production environment as possible. If we at some stage have to fail over to the physical standby, adjusting parameter values is not something we would want to spend time on.

Restore
First, NFS mount the backup folder from the primary node onto a folder on the physical standby node. Create the necessary soft links so that the paths to the backup sets recorded in the standby controlfile actually resolve to the NFS-mounted file system containing the backup sets from the primary site.
At this stage, we should be able to start the instance, mount the standby controlfile and do a restore of the database:
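As a sketch (host names, mount points and the pfile path are examples, not our actual values):

# on the standby node, as root: mount the primary backup area and make the
# backup piece paths recorded in the standby controlfile resolve locally
mount primary-host:/backup/TVDB /mnt/TVDB_backup
ln -s /mnt/TVDB_backup /backup/TVDB

SQL> startup nomount pfile='/u01/app/oracle/admin/TVSB/pfile/initTVSB.ora'
SQL> alter database mount standby database;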

To restore the database, we simply start RMAN and issue “restore database”:
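$ rman target /

RMAN> restore database;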

Now we can verify that all files have been renamed correctly during the restore and, if needed, compose a script to generate rename commands. This is a one-time operation:
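One way to do the check, and to generate rename commands for anything still pointing to the old cluster paths (the paths in the replace() call are examples only):

SQL> select name from v$datafile;
SQL> select member from v$logfile;

-- generate rename statements for files that were not converted (example paths)
select 'alter database rename file ''' || name || ''' to '''
       || replace(name, '/oradata/TVDB/', '/u02/oradata/TVSB/') || ''';'
from   v$datafile
where  name like '/oradata/TVDB/%';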

In our case we can see that the renaming was done correctly.
Next, we set the standby_file_management parameter to AUTO:
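SQL> alter system set standby_file_management='AUTO' scope=memory;
-- update the pfile as well so the setting survives a restart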

Recovery
What remains now is to make some sort of archive log transport module that moves the archived logs from the primary node to the standby node, and to add a recovery operation on the database using these files.
Our solution consists of a shell script that synchronizes the log_archive_dest folder on the primary node with the log_archive_dest folder on the standby node:
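The script itself is not much more than this (a reconstruction; the location of the "mounted" flag file and the standby host name are assumptions):

#!/bin/sh
# sync_arch.sh - runs on both cluster nodes
# Only the node that currently has the backup file system mounted
# (flagged by the "mounted" file) pushes archived logs to the standby.

if [ -f /backup/TVDB/mounted ]; then
    rsync -av -e ssh /backup/TVDB/arch/ oracle@standby-host:/backup/TVSB/arch/
fi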

The script checks whether a file called "mounted" exists. This is done so that the script can be implemented on both servers in the Veritas cluster, and a cluster failover will not require any manual operation.
If the file exists, it simply synchronizes /backup/TVDB/arch from the cluster environment with /backup/TVSB/arch on the standby node over ssh. For this to work, the public key of the oracle user on the cluster nodes has to exist in the $HOME/.ssh/authorized_keys2 file of the oracle user on the standby node.
The schedule of this script is simply added to cron.
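A crontab entry for it could look like this (the ten-minute schedule matches the archive sync interval mentioned earlier; the script path is an example):

# push archived logs to the standby every 10 minutes
0,10,20,30,40,50 * * * * /home/oracle/scripts/sync_arch.sh >/dev/null 2>&1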





If we want to ensure that the standby node tails the primary node quite closely (transactionally), we can add a mechanism that ensures logs are switched at a given interval. We chose to create a procedure that checks the time since the last log switch; if it exceeds 15 minutes, it simply switches logs. This job can run every 5 minutes or every 2 minutes.
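The procedure is not reproduced here exactly, but a minimal version could look like this (the owner needs select granted directly on v_$log and the alter system privilege granted directly, not through a role):

create or replace procedure check_logswitch as
  l_last_switch date;
begin
  -- FIRST_TIME of the CURRENT log is the time of the last log switch
  select max(first_time)
  into   l_last_switch
  from   v$log
  where  status = 'CURRENT';

  if (sysdate - l_last_switch) * 24 * 60 > 15 then
    execute immediate 'alter system archive log current';
  end if;
end;
/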

Automatic execution of this procedure can be scheduled through dbms_job:
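For example, every five minutes (job_queue_processes must be greater than zero):

declare
  l_job number;
begin
  dbms_job.submit(job       => l_job,
                  what      => 'check_logswitch;',
                  next_date => sysdate,
                  interval  => 'sysdate + 5/1440');
  commit;
end;
/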

On the standby node, a shell script is executed to actually perform the recovery of the transported log files.
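A sketch of such a script, assuming the ORACLE_HOME path shown. The recovery stops with an ORA- error once it asks for an archived log that has not arrived yet, which is expected:

#!/bin/sh
# recover_standby.sh - apply transported archived logs on the standby
export ORACLE_SID=TVSB
export ORACLE_HOME=/u01/app/oracle/product/10.2.0/db_1

echo "set autorecovery on
recover standby database;
exit" | $ORACLE_HOME/bin/sqlplus -s "/ as sysdba"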

This script is also added to cron. Depending on your needs, it could be scheduled to run 10 minutes after the sync script on the primary site ran, or it could be approached differently and run only once a day. You might want to be sure that logical corruption, user errors or application errors are not applied by the recovery before you have had time to halt the recovery process. Another consideration is the time to recover: how much time will you spend rolling forward 24 hours of transactions? If this is a database with a small number of daily transactions, it could be a good idea to roll forward once a day. With the archived logs already transported to the standby node, running this script is all it takes to bring the database up to the last archived transactions.
Again, the backup script with the archive maintenance handling on the primary site is critical. If you operate with a retention policy of 1 and perform delete noprompt obsolete, the archived logs are also subject to the retention policy and the delete command.
If we run with a retention policy of 1, we execute the backup command, logs are switched, and we must make sure the sync script runs before the delete command. We must also ensure that logs are applied on the standby node before the sync script runs again; otherwise the archived logs removed on the primary site will also be removed on the standby site.
Another possibility would be to run the sync script without the delete option and simply let the standby site itself control the deletion of the archived logs on that side.
When it comes to the actual recovery and its output, the easiest thing to do is to look in the alert log on the standby node.
The alert log shows the output of the recovery process.

There we can see that archived logs are applied to the database up to the point where no more archived logs exist. In this particular case, the log also contained an error message.
The error indicates that the datafiles need more recovery to become consistent with the controlfile. The reason for this is that we picked up a backup from time N and created the standby controlfile at time N + 10.
We need to apply all archived logs in and beyond this gap before we can switch the database to the primary role. In case of an incomplete recovery operation, we would still have to rely on the auxiliary database functionality.

After doing some log switching and transporting more archived logs from the primary site to the standby site, we notice that the error message we received earlier is gone.





We can open this standby database read only, mount it as a standby database again, and continue to apply logs to it:
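That is, something like:

SQL> alter database open read only;

-- ... run reports ...

SQL> shutdown immediate
SQL> startup nomount
SQL> alter database mount standby database;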

From read only, the database has to be bounced to mount it as a standby database again. 
If we want to run reports on the standby database, we should add a tempfile to the TEMP tablespace if the tablespace is empty.
Now, recovery will continue where it left off.

In case of failover to the standby node, we plan to copy all available archive logs from the primary site and apply them to the standby node.
After this is done, we change the role of the standby database from physical standby database to primary database.
This is done with five commands which, if the environment is set up correctly (which mine was not!), should be pretty straightforward:
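The sequence is most likely along these lines (answer AUTO to the recovery prompt, or use set autorecovery on):

SQL> recover standby database;
SQL> alter database recover cancel;
SQL> alter database activate standby database;
SQL> shutdown immediate
SQL> startup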

First we initiate the final recovery, applying all archived logs synchronized since the last run of the sync script.
Once finished, we issue alter database recover cancel to stop the recovery.
The next step is to activate the standby database.
We can see that I receive an error message, caused by a typing error in the setup:

The path is not valid.
We simply rename the files to fix this error.
However, we set standby_file_management to AUTO after the restore finished. It has to be set back to MANUAL before we are allowed to rename the files, as illustrated below:
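For example (the paths are made up; renaming with standby_file_management set to AUTO fails with ORA-01275):

SQL> alter system set standby_file_management='MANUAL' scope=memory;

SQL> alter database rename file '/oradata/TVDB/redo01.log'
     to '/u02/oradata/TVSB/redo01.log';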

Now we should be able to activate the standby database, shut it down and open it read write:
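That is:

SQL> alter database activate standby database;
SQL> shutdown immediate
SQL> startup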
 
We verify that it is “writable”:
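A quick sanity check (the test table is of course arbitrary):

SQL> select open_mode from v$database;   -- expect READ WRITE

SQL> create table failover_test (id number);
SQL> insert into failover_test values (1);
SQL> commit;
SQL> drop table failover_test purge;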

To be able to switch between these two servers without too much work at the actual failover moment, we have tuned and set up the instance identically on both nodes.
In addition to this, we have the connection point configured and ready.
We have added a service to the init file, which is the same service the listener on the primary site listens for. (This came along when using the parameter file from the primary site.)
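In the pfile this is just the service_names parameter (the service name here is an example):

*.service_names='TVDB_APP'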

On the standby node, the listener is also configured with this service. On the standby site, all that remains is to start the listener after opening the database read write:
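A static listener entry for it could look like this (names and the ORACLE_HOME path are examples); then the listener is started:

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = TVDB_APP)
      (ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1)
      (SID_NAME = TVSB)
    )
  )

$ lsnrctl start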

For the clients connecting, we have added a failover parameter to tnsnames.ora:
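Something like this, with example host names and the same example service name as above:

TVDB_APP =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (FAILOVER = ON)
      (LOAD_BALANCE = OFF)
      (ADDRESS = (PROTOCOL = TCP)(HOST = primary-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = standby-host)(PORT = 1521))
    )
    (CONNECT_DATA = (SERVICE_NAME = TVDB_APP))
  )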


 
It should pick the first address in the list as long as it is up and running. If it fails or is down, it should pick the second address.

Monitoring
There are a few things one must be aware of when setting up a solution like this.
First of all, a scan of the standby alert log is needed, checking for media recovery errors that can indicate a gap we are unable to fill. Such a gap should be discovered as soon as possible, so that we can try to transfer the missing archived logs to the standby node.
Also, within a week of setting this up, the current configuration should be tested and a full activation of the standby database should be made, to verify that all the settings, scripts and so on are correct. Of course, trace rotation and keeping an eye on the available space on the file systems of the standby node are also crucial.
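A minimal sketch of such an alert log check, assuming the 10g bdump location and a mail address for the DBA group (the exact ORA- codes that show up may differ in your environment):

#!/bin/sh
# scan_standby_alert.sh - look for unexpected ORA- errors in the standby alert log
# ORA-00279/ORA-00308 are filtered out because the recovery script normally
# ends with them when it simply runs out of archived logs
ALERT=/u01/app/oracle/admin/TVSB/bdump/alert_TVSB.log

if tail -500 $ALERT | grep 'ORA-' | grep -Ev 'ORA-00279|ORA-00308' >/dev/null; then
    echo "Possible standby recovery problem, check $ALERT" | mailx -s "TVSB standby alert" dba@example.com
fi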

Conclusion
As I see it, this solution might suit companies that are already running SE, do not use any Enterprise Edition functionality and do not have the financial resources to upgrade to EE licensing.
We see it as extra security: we are able to get back into production after just a couple of minutes if we should lose any of our clusters on the primary site.
Our example shows how it can be implemented on a Unix-based platform. It could just as easily be implemented on a Windows platform, using bat scripting and robocopy for file transfer.

Upgrading to 11g removes this possibility, as Oracle states that Data Guard is not allowed on SE.