TSM Remote Backup to BCRS Disaster Recovery
July 2, 2024
- Objectives
- Restore the customer workload to manageability when moving the workload to a different Site
- Reintegrate VMs with possibly different IDs into the management structure all the way up to ISM
- Use Cases
- DR Test/Declaration to DR Site
- Failback
- Assumptions
- The destination Site can be one that is currently running customer workload, or one that has been initialized/prepped from scratch (as in a Site reconstruction)
- The VMs to be “onboarded” already have entries in the Central Level (e.g., ISM) management tools
- Priority ordering of management functionality restoration exists
- Backup, monitoring, patching, billing, …
- DR RTO spec does not apply to Restoration of Manageability
- Customer workload has been started on the destination site (same assumption holds for tape-based and GM-based workload)
- Customer’s VMs have been assigned Management IPs from the set-aside pool
- Managing infrastructure is intact on destination Site
- The RTM (Restoration of Manageability) function will be orchestrated by the Site-Level DR Orchestrator
- Automated and/or Manual means are acceptable
- After customer workload has been restarted, restore the workload to full manageability
- It is TBD whether the workload will be returned to customer control in a less than fully managed state, perhaps as an interim step while full manageability is achieved
- This includes the following functions
- Service catalog/portal (TSRM)
- Service management [incident, problem, change, configuration, & asset management] (ISM, CCMDB, TADDM, TAMIT, TAD4D)
- Usage accounting (TUAM)
- Backup (TSM)
- Monitoring (ITM, RTEMS, HTEMS, Netcool)
- Storage Monitoring (TPC)
- Reporting (TDW, SRM, GSMRT, GACDW)
- Provisioning (TSAM/TPM)
- Patching (PAE, TEM)
- Infrastructure readiness (DNS, LDAP, etc.)
- Install and register agents
- ITM: Update the agent configuration with the new IP address of the ITM server
- Add to management group
- Configure LDAP client (AIX & Linux)
- Configure SSH keys & Active Directory access (VMware only)
- Configure TSCM agent with address of TSCM server
- Create and register TSM node
- Configure the TSM agent with the IP address of the TSM server (see the sketch after this list)
- Install/configure TAD4D agent with IP address of TAD4D server
- Install/configure TEM agent with IP address of TEM server (Note: unlike initial provisioning, do not perform patching)
- Create resource record in the Netcool Database (for ticketing)
- Create/adjust CIs in CMDB
- Create/adjust asset record in ISM
- Trigger TADDM discovery
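As one concrete illustration of the list above, the sketch below re-points a TSM backup-archive client at a new TSM server address by rewriting its dsm.sys option file. The path, address, and port are placeholders; the equivalent re-pointing of the other agents (ITM, TSCM, TAD4D, TEM) would follow the same pattern under the Site-Level DR Orchestrator.

```python
"""Minimal sketch: re-point a TSM client at the DR-site TSM server.

Paths, addresses, and ports are placeholders; real values would come from
the DR orchestrator's inventory for the destination Site.
"""
import re
from pathlib import Path


def repoint_tsm_client(dsm_sys: Path, new_server_ip: str, new_port: str = "1500") -> None:
    """Rewrite TCPSERVERADDRESS/TCPPORT in a dsm.sys-style client option file."""
    text = dsm_sys.read_text()
    text = re.sub(r"(?im)^(\s*TCPSERVERADDRESS\s+)\S+", rf"\g<1>{new_server_ip}", text)
    text = re.sub(r"(?im)^(\s*TCPPORT\s+)\S+", rf"\g<1>{new_port}", text)
    dsm_sys.write_text(text)


if __name__ == "__main__":
    # Placeholder path and address for a Unix-style client install.
    repoint_tsm_client(Path("/usr/tivoli/tsm/client/ba/bin/dsm.sys"), "10.0.0.50")
```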
Customer Requirements:
- Recover to BCRS location during loss of primary site
- RPO / RTO = 48 / 72 hours
- Customer selects the VMs that require DR
- Separate DR contract signed with BCRS
- Limited management capability in recovery site
Solution:
- Network connectivity established between the primary site and the BCRS site
- TSM node replication to TSM server at BCRS site
- TSM servers and storage established at BCRS site
- VM configuration data made available to BCRS for recovery
- Customer to provide network connectivity into BCRS center
- Recover on shared or dedicated HW at BCRS center
- Failback to the primary site after a disaster / test
- RTP to BCRS STF (Hot Site)
- TSM Node Replication
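A minimal sketch of how node replication to the BCRS TSM server might be enabled from the primary (source) TSM server using standard dsmadmc node-replication commands; server names, addresses, node names, and credentials are placeholders.

```python
"""Minimal sketch: enable TSM node replication from the primary server to BCRS."""
import subprocess

# Placeholder credentials; a real deployment would use a dedicated admin ID and
# would not embed the password in a script.
ADMIN = ["dsmadmc", "-id=admin", "-password=xxxxx", "-dataonly=yes"]


def run(cmd: str) -> None:
    """Issue one administrative command against the local (source) TSM server."""
    subprocess.run(ADMIN + [cmd], check=True)


def enable_replication_to_bcrs(nodes: list[str]) -> None:
    # Define the BCRS-site TSM server as a partner (address and port are placeholders).
    run("DEFINE SERVER BCRS_TSM SERVERPASSWORD=xxxxx HLADDRESS=192.0.2.10 LLADDRESS=1500")
    # Make it the default replication target for this server.
    run("SET REPLSERVER BCRS_TSM")
    for node in nodes:
        # Enable each DR-contracted node for replication, then push its data.
        run(f"UPDATE NODE {node} REPLSTATE=ENABLED")
        run(f"REPLICATE NODE {node}")


if __name__ == "__main__":
    enable_replication_to_bcrs(["CUSTVM01", "CUSTVM02"])
```

The REPLICATE NODE step could then be scheduled (e.g., daily) so that the copy held at the BCRS site stays within the 48-hour RPO.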
TSM Remote Backup to BCRS Recovery Site
- Primary site
- Provision the primary VM as usual
- Initiate backups and off-site vaulting at the primary site
- Tapes contain backups of system disk, registry, and data disks
- Enable DR
- Model is to provision each customer to a dedicated environment: DR environment will not be a multi-tenant environment
- Convey VM Identity and Configuration to BCRS
- BCRS to provide example inputs; E+ will automatically harvest and transmit the configuration data (see the sketch after this list)
- Initiate TSM remote backups to BCRS for STS-DR-enabled workload
- Validate that IPsec Red-Red line-level encryption is acceptable to the IES OMT sponsor
- Configure networking at primary and DR site
- Enable customer access at DR time (same customer IP numbers)
- Also prepare for DR of essential management tools
- AD, LDAP, DNS, Backup, Monitoring, others TBD
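A sketch of the "auto harvest and transmit" step referenced above: VM identity and configuration are collected into a manifest for BCRS. The record fields and file name are assumptions only, since the actual input format is to be supplied by BCRS.

```python
"""Minimal sketch: harvest VM identity/configuration into a manifest for BCRS.

Field names and the output format are assumptions; BCRS is to provide the
authoritative example inputs.
"""
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class VmRecord:
    vm_name: str
    tsm_node: str          # TSM node name replicated to the BCRS server
    os_type: str           # e.g. "AIX", "Linux", "Windows"
    cpus: int
    memory_mb: int
    disks_gb: list[int]
    customer_ip: str       # customer-facing IP, retained at DR time
    dr_enabled: bool       # only DR-contracted VMs are conveyed


def write_manifest(vms: list[VmRecord], path: Path) -> None:
    path.write_text(json.dumps([asdict(v) for v in vms], indent=2))


if __name__ == "__main__":
    write_manifest(
        [VmRecord("custvm01", "CUSTVM01", "Linux", 2, 8192, [50, 200], "198.51.100.20", True)],
        Path("bcrs_vm_manifest.json"),
    )
```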
Primary site is declared failed by IBM
- Secondary BCRS site has already been identified
- Notification to BCRS that DR is beginning
- Data Recovery
- None required – backup data is already resident in TSM Servers at BCRS site
- Workload Recovery
- Images provisioned
- BCRS Hot Site recovery BAU for STS, VSR Bronze for R1.3
- Management tools first, then customer workload
- TSM clients updated to point to the BCRS TSM server
- VM data restored from the TSM server (see the sketch after this list)
- VMs reconnected to customer production networking
- Restore management continuity as appropriate
- Management IP numbers may differ
- Will re-provision management and IBR network
- IBM admin team will manage VMs
- Completion
- Validation
- Notification that DR is complete
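The restore step called out above (VM data restored from the TSM server) might look like the sketch below once a re-provisioned VM's client option file points at the BCRS TSM server. Filespecs are placeholders for the VM's data file systems.

```python
"""Minimal sketch: restore a VM's data from the BCRS TSM server via dsmc."""
import subprocess


def restore_filespaces(filespecs: list[str]) -> None:
    """Pull each filespec back through the locally configured dsmc client."""
    for spec in filespecs:
        # -subdir=yes walks the whole tree; -replace=all overwrites the freshly
        # provisioned image's files with the backed-up versions.
        subprocess.run(
            ["dsmc", "restore", spec, "-subdir=yes", "-replace=all"],
            check=True,
        )


if __name__ == "__main__":
    # Placeholder filespecs for the VM's data file systems.
    restore_filespaces(["/data/*", "/home/*"])
```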
- How to handle customers that did not contract for DR?
- Limited management activity
- DNS, LDAP, AD, AAA, Backup, Monitoring…
- E+ / IBM Team will provide admin services to customer workload
- Daily backups continue
- Full backups will already be in residence
- A storage pool backup must be made on DR day one to get a full copy offsite (see the sketch after this list)
- Encrypted daily offsite tapes will be provided
- Performance may be reduced
- Changes to VMs (other than those captured by backups) will be lost after failback
- Only minimal changes allowed
- Failback will include verification of compliance via modification of Rapid Migration
- How long can customers live in the BCRS site?
- Type T’s and C’s: 6 weeks, followed by a daily fee
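The day-one storage pool backup noted above could be issued with the standard BACKUP STGPOOL administrative command, roughly as sketched below; pool names and credentials are placeholders.

```python
"""Minimal sketch: DR day-one storage pool backup at the BCRS TSM server."""
import subprocess


def backup_storage_pool(primary_pool: str, copy_pool: str) -> None:
    # BACKUP STGPOOL copies every file in the primary pool that is not yet in
    # the copy pool, producing the full offsite copy needed on day one.
    subprocess.run(
        ["dsmadmc", "-id=admin", "-password=xxxxx",
         f"BACKUP STGPOOL {primary_pool} {copy_pool} WAIT=YES"],
        check=True,
    )


if __name__ == "__main__":
    # Placeholder pool names.
    backup_storage_pool("BACKUPPOOL", "OFFSITE_COPYPOOL")
```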
- Primary Site is still standing (e.g., power outage)
- Failback to the primary site, where images and the management structure are intact
- Start VMs from crash-consistent images in residence at primary site
- Cases
- VMs that were unchanged at the DR site can simply be restarted on the primary site
- VMs that were changed at the DR site may have their changes lost
- Restore VM data from remote TSM Server at BCRS Site
- DR protected VMs will be restored from backups taken at BCRS site
- Export tapes from the BCRS site, re-import them at the E+ site to seed recovery, then top off with electronic replication back via TSM (see the sketch after this list)
- Data Persistence VMs will be restored from the crash-consistent disk image taken at the Time of Disaster, since they were not started at the DR site
- Re-establish manageability
- Verify compliance
- Verify workload operation
- Re-establishment of customer network connectivity
- Notification
- An outage of TBD duration will be incurred for failback; it should be shorter than the DR outage
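A sketch of the failback seeding flow described above, assuming both servers are reachable through dsmadmc server stanzas and that the BCRS server's replication target has been switched to the primary server; all server, node, device class, and volume names are placeholders.

```python
"""Minimal sketch: seed failback with tape export/import, then top off via replication."""
import subprocess


def admin(server_stanza: str, cmd: str) -> None:
    # -se selects which server stanza (BCRS or primary) the admin client contacts.
    subprocess.run(
        ["dsmadmc", f"-se={server_stanza}", "-id=admin", "-password=xxxxx", cmd],
        check=True,
    )


def seed_failback(nodes: list[str]) -> None:
    node_list = ",".join(nodes)
    # 1. At BCRS: write the DR-protected nodes' data to exportable tape media.
    admin("BCRS_TSM", f"EXPORT NODE {node_list} FILEDATA=ALL DEVCLASS=LTO_CLASS")
    # 2. After the tapes are shipped and checked in at the primary (E+) site, import them.
    admin("PRIMARY_TSM",
          f"IMPORT NODE {node_list} FILEDATA=ALL DEVCLASS=LTO_CLASS VOLUMENAMES=VOL001,VOL002")
    # 3. Top off the remaining delta electronically by replicating the nodes back,
    #    assuming SET REPLSERVER on the BCRS server now points at the primary server.
    admin("BCRS_TSM", f"REPLICATE NODE {node_list}")


if __name__ == "__main__":
    seed_failback(["CUSTVM01", "CUSTVM02"])
```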
- Primary Site is not standing (e.g., smoking hole)
- Reconstruct (or new) primary site
- Onboard customers to reconstructed site
- Provision workload to reconstructed site
- Restore VM data from remote TSM Server at BCRS Site
- DR protected VMs will be restored from backups taken at BCRS site
- Export tapes from the BCRS site, re-import them at the E+ site to seed recovery, then top off with electronic replication back via TSM
- Data Persistence VMs will be restored from the crash-consistent disk image taken at the Time of Disaster, since they were not started at the DR site
- Re-establish manageability
- Very different from non-smoking hole restoration
- Verify compliance
- Verify workload operation
- Re-establishment of customer network connectivity
- Notification
- An outage of TBD duration will be incurred for failback; it should be shorter than the DR outage