TSM Remote Backup to BCRS Disaster Recovery
July 2, 2024
- Objectives
- Restore the customer workload to manageability when moving the workload to a different Site
- Reintegrate VMs with possibly different IDs into the management structure all the way up to ISM
- Use Cases
- DR Test/Declaration to DR Site
- Failback
- Assumptions
- The destination Site can be one that is currently running customer workload, or one that has been initialized/prepped from scratch (as in a Site reconstruction)
- The VMs to be “onboarded” already have entries in the Central Level (e.g., ISM) management tools
- Priority ordering of management functionality restoration exists
- Backup, monitoring, patching, billing, …
- DR RTO spec does not apply to Restoration of Manageability
- Customer workload has been started on the destination site (same assumption holds for tape-based and GM-based workload)
- Customer’s VMs have been assigned Management IPs from the set-aside pool
- Managing infrastructure is intact on destination Site
- The RTM (Restoration of Manageability) function will be orchestrated by the Site-Level DR Orchestrator
- Automated and/or Manual means are acceptable
- After customer workload has been restarted, restore the workload to full manageability
- It is TBD whether the workload will be returned to customer control in a less than fully managed state, perhaps as an interim step while full manageability is achieved
- This includes the following functions
- Service catalog/portal (TSRM)
- Service management [incident, problem, change, configuration, & asset management] (ISM, CCMDB, TADDM, TAMIT, TAD4D)
- Usage accounting (TUAM)
- Backup (TSM)
- Monitoring (ITM, RTEMS, HTEMS, Netcool)
- Storage Monitoring (TPC)
- Reporting (TDW, SRM, GSMRT, GACDW)
- Provisioning (TSAM/TPM)
- Patching (PAE, TEM)
- Infrastructure readiness (DNS, LDAP, etc.)
- Install and register agents
- ITM: Update the agent configuration with the new IP address of the ITM server
- Add to management group
- Configure LDAP client (AIX & Linux)
- Configure SSH keys & Active Directory access (VMware only)
- Configure TSCM agent with address of TSCM server
- Create and register TSM node
- Configure the TSM agent with the IP address of the TSM server (see the sketch after this list)
- Install/configure TAD4D agent with IP address of TAD4D server
- Install/configure TEM agent with IP address of TEM server (Note: unlike initial provisioning, do not perform patching)
- Create resource record in the Netcool Database (for ticketing)
- Create/adjust CIs in CMDB
- Create/adjust asset record in ISM
- Trigger TADDM discovery
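As one concrete illustration of the list above, the sketch below re-points a TSM backup-archive client at a new TSM server address by rewriting its dsm.sys option file. The path, address, and port are placeholders; the equivalent re-pointing of the other agents (ITM, TSCM, TAD4D, TEM) would follow the same pattern under the Site-Level DR Orchestrator.

```python
"""Minimal sketch: re-point a TSM client at the DR-site TSM server.

Paths, addresses, and ports are placeholders; real values would come from
the DR orchestrator's inventory for the destination Site.
"""
import re
from pathlib import Path


def repoint_tsm_client(dsm_sys: Path, new_server_ip: str, new_port: str = "1500") -> None:
    """Rewrite TCPSERVERADDRESS/TCPPORT in a dsm.sys-style client option file."""
    text = dsm_sys.read_text()
    text = re.sub(r"(?im)^(\s*TCPSERVERADDRESS\s+)\S+", rf"\g<1>{new_server_ip}", text)
    text = re.sub(r"(?im)^(\s*TCPPORT\s+)\S+", rf"\g<1>{new_port}", text)
    dsm_sys.write_text(text)


if __name__ == "__main__":
    # Placeholder path and address for a Unix-style client install.
    repoint_tsm_client(Path("/usr/tivoli/tsm/client/ba/bin/dsm.sys"), "10.0.0.50")
```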
Customer Requirements:
- Recover to BCRS location during loss of primary site
- RPO / RTO = 48 / 72 hours
- Customer selects the VMs that require DR
- Separate DR contract signed with BCRS
- Limited management capability in recovery site
Solution:
- Network connectivity established between the primary site and the BCRS site
- TSM node replication to TSM server at BCRS site
- TSM servers and storage established at BCRS site
- VM configuration data made available to BCRS for recovery
- Customer to provide network connectivity into BCRS center
- Recover on shared or dedicated HW at BCRS center
- Failback to the primary site after a disaster / test
- RTP to BCRS STF (Hot Site)
- TSM Node Replication
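A minimal sketch of how node replication to the BCRS TSM server might be enabled from the primary (source) TSM server using standard dsmadmc node-replication commands; server names, addresses, node names, and credentials are placeholders.

```python
"""Minimal sketch: enable TSM node replication from the primary server to BCRS."""
import subprocess

# Placeholder credentials; a real deployment would use a dedicated admin ID and
# would not embed the password in a script.
ADMIN = ["dsmadmc", "-id=admin", "-password=xxxxx", "-dataonly=yes"]


def run(cmd: str) -> None:
    """Issue one administrative command against the local (source) TSM server."""
    subprocess.run(ADMIN + [cmd], check=True)


def enable_replication_to_bcrs(nodes: list[str]) -> None:
    # Define the BCRS-site TSM server as a partner (address and port are placeholders).
    run("DEFINE SERVER BCRS_TSM SERVERPASSWORD=xxxxx HLADDRESS=192.0.2.10 LLADDRESS=1500")
    # Make it the default replication target for this server.
    run("SET REPLSERVER BCRS_TSM")
    for node in nodes:
        # Enable each DR-contracted node for replication, then push its data.
        run(f"UPDATE NODE {node} REPLSTATE=ENABLED")
        run(f"REPLICATE NODE {node}")


if __name__ == "__main__":
    enable_replication_to_bcrs(["CUSTVM01", "CUSTVM02"])
```

The REPLICATE NODE step could then be scheduled (e.g., daily) so that the copy held at the BCRS site stays within the 48-hour RPO.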
TSM Remote Backup to BCRS Recovery Site
- Primary site
- Provision the primary VM as usual
- Initiate backups and off-site vaulting at the primary site
- Tapes contain backups of system disk, registry, and data disks
- Enable DR
- Model is to provision each customer to a dedicated environment: DR environment will not be a multi-tenant environment
- Convey VM Identity and Configuration to BCRS
- BCRS to provide example inputs; E+ will automatically harvest and transmit the configuration data (see the sketch after this list)
- Initiate TSM remote backups to BCRS for STS-DR-enabled workload
- Validate that IPsec Red-Red line-level encryption is acceptable to the IES OMT sponsor
- Configure networking at primary and DR site
- Enable customer access at DR time (same customer IP numbers)
- Also prepare for DR of essential management tools
- AD, LDAP, DNS, Backup, Monitoring, others TBD
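A sketch of the "auto harvest and transmit" step referenced above: VM identity and configuration are collected into a manifest for BCRS. The record fields and file name are assumptions only, since the actual input format is to be supplied by BCRS.

```python
"""Minimal sketch: harvest VM identity/configuration into a manifest for BCRS.

Field names and the output format are assumptions; BCRS is to provide the
authoritative example inputs.
"""
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class VmRecord:
    vm_name: str
    tsm_node: str          # TSM node name replicated to the BCRS server
    os_type: str           # e.g. "AIX", "Linux", "Windows"
    cpus: int
    memory_mb: int
    disks_gb: list[int]
    customer_ip: str       # customer-facing IP, retained at DR time
    dr_enabled: bool       # only DR-contracted VMs are conveyed


def write_manifest(vms: list[VmRecord], path: Path) -> None:
    path.write_text(json.dumps([asdict(v) for v in vms], indent=2))


if __name__ == "__main__":
    write_manifest(
        [VmRecord("custvm01", "CUSTVM01", "Linux", 2, 8192, [50, 200], "198.51.100.20", True)],
        Path("bcrs_vm_manifest.json"),
    )
```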
Primary site is declared failed by IBM
- Secondary BCRS site has already been identified
- Notification to BCRS that DR is beginning
- Data Recovery
- None required – backup data is already resident in TSM Servers at BCRS site
- Workload Recovery
- Images provisioned
- BCRS Hot Site recovery BAU for STS, VSR Bronze for R1.3
- Management tools first, then customer workload
- TSM clients updated to point to the BCRS TSM server
- VM data restored from the TSM server (see the sketch after this list)
- VMs reconnected to customer production networking
- Restore management continuity as appropriate
- Management IP numbers may differ
- Will re-provision management and IBR network
- IBM admin team will manage VMs
- Completion
- Validation
- Notification that DR is complete
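The restore step called out above (VM data restored from the TSM server) might look like the sketch below once a re-provisioned VM's client option file points at the BCRS TSM server. Filespecs are placeholders for the VM's data file systems.

```python
"""Minimal sketch: restore a VM's data from the BCRS TSM server via dsmc."""
import subprocess


def restore_filespaces(filespecs: list[str]) -> None:
    """Pull each filespec back through the locally configured dsmc client."""
    for spec in filespecs:
        # -subdir=yes walks the whole tree; -replace=all overwrites the freshly
        # provisioned image's files with the backed-up versions.
        subprocess.run(
            ["dsmc", "restore", spec, "-subdir=yes", "-replace=all"],
            check=True,
        )


if __name__ == "__main__":
    # Placeholder filespecs for the VM's data file systems.
    restore_filespaces(["/data/*", "/home/*"])
```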
- How to handle customers that did not contract for DR?
- Limited management activity
- DNS, LDAP, AD, AAA, Backup, Monitoring…
- E+ / IBM Team will provide admin services to customer workload
- Daily backups continue
- Full backups will already be in residence
- A storage pool backup must be made on DR day one to get a full copy offsite (see the sketch after this list)
- Encrypted daily offsite tapes will be provided
- Performance may be reduced
- Changes to VMs (other than those captured by backups) will be lost after failback
- Only minimal changes allowed
- Failback will include verification of compliance via modification of Rapid Migration
- How long can customers live in the BCRS site?
- Type T’s and C’s: 6 weeks, followed by a daily fee
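The day-one storage pool backup noted above could be issued with the standard BACKUP STGPOOL administrative command, roughly as sketched below; pool names and credentials are placeholders.

```python
"""Minimal sketch: DR day-one storage pool backup at the BCRS TSM server."""
import subprocess


def backup_storage_pool(primary_pool: str, copy_pool: str) -> None:
    # BACKUP STGPOOL copies every file in the primary pool that is not yet in
    # the copy pool, producing the full offsite copy needed on day one.
    subprocess.run(
        ["dsmadmc", "-id=admin", "-password=xxxxx",
         f"BACKUP STGPOOL {primary_pool} {copy_pool} WAIT=YES"],
        check=True,
    )


if __name__ == "__main__":
    # Placeholder pool names.
    backup_storage_pool("BACKUPPOOL", "OFFSITE_COPYPOOL")
```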
- Primary Site is still standing (e.g., power outage)
- Failback to the primary site, where images and the management structure are intact
- Start VMs from crash-consistent images in residence at primary site
- Cases
- VMs that were unchanged at the DR site can simply be restarted on the primary site
- VMs that were changed at the DR site may have their changes lost
- Restore VM data from remote TSM Server at BCRS Site
- DR protected VMs will be restored from backups taken at BCRS site
- Export tapes from the BCRS site, re-import them at the E+ site to seed recovery, then top off with electronic replication back via TSM (see the sketch after this list)
- Data Persistence VMs will be restored from the crash-consistent disk image taken at the Time of Disaster, since they were not started at the DR site
- Re-establish manageability
- Verify compliance
- Verify workload operation
- Re-establishment of customer network connectivity
- Notification
- An outage of TBD duration will be incurred for failback; it should be shorter than the DR outage
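A sketch of the failback seeding flow described above, assuming both servers are reachable through dsmadmc server stanzas and that the BCRS server's replication target has been switched to the primary server; all server, node, device class, and volume names are placeholders.

```python
"""Minimal sketch: seed failback with tape export/import, then top off via replication."""
import subprocess


def admin(server_stanza: str, cmd: str) -> None:
    # -se selects which server stanza (BCRS or primary) the admin client contacts.
    subprocess.run(
        ["dsmadmc", f"-se={server_stanza}", "-id=admin", "-password=xxxxx", cmd],
        check=True,
    )


def seed_failback(nodes: list[str]) -> None:
    node_list = ",".join(nodes)
    # 1. At BCRS: write the DR-protected nodes' data to exportable tape media.
    admin("BCRS_TSM", f"EXPORT NODE {node_list} FILEDATA=ALL DEVCLASS=LTO_CLASS")
    # 2. After the tapes are shipped and checked in at the primary (E+) site, import them.
    admin("PRIMARY_TSM",
          f"IMPORT NODE {node_list} FILEDATA=ALL DEVCLASS=LTO_CLASS VOLUMENAMES=VOL001,VOL002")
    # 3. Top off the remaining delta electronically by replicating the nodes back,
    #    assuming SET REPLSERVER on the BCRS server now points at the primary server.
    admin("BCRS_TSM", f"REPLICATE NODE {node_list}")


if __name__ == "__main__":
    seed_failback(["CUSTVM01", "CUSTVM02"])
```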
- Primary Site is not standing (e.g., smoking hole)
- Reconstruct (or new) primary site
- Onboard customers to reconstructed site
- Provision workload to reconstructed site
- Restore VM data from remote TSM Server at BCRS Site
- DR protected VMs will be restored from backups taken at BCRS site
- Export tapes from the BCRS site, re-import them at the E+ site to seed recovery, then top off with electronic replication back via TSM
- Data Persistence VMs will be restored from the crash-consistent disk image taken at the Time of Disaster, since they were not started at the DR site
- Re-establish manageability
- Very different from non-smoking hole restoration
- Verify compliance
- Verify workload operation
- Re-establishment of customer network connectivity
- Notification
- An outage of TBD duration will be incurred for failback; it should be shorter than the DR outage